Reference:
UCL Course on RL
Lecture 5: Model-Free Control
Introduction
Goal: optimise the value function of an unknown MDP.
Model-free control applies when either:
- the MDP model is unknown, but experience can be sampled, or
- the MDP model is known, but is too big to use directly, except by sampling
On and Off-Policy Learning
- On-policy learning
- Learn on the job
- Learn about policy $\pi$ from experience sampled from $\pi$
- Off-policy learning
- Look over someone’s shoulder
- Learn about policy $\pi$ from experience sampled from $\mu$
On-Policy Monte-Carlo Control
Generalised Policy Iteration
Generalised Policy Iteration (Refresher)
Policy evaluation: Estimate $v_\pi$, Iterative policy evaluation
Policy improvement: Generate $\pi' \ge \pi$, Greedy policy improvement
Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V=v_\pi$
Policy improvement: Greedy policy improvement
Model-Free Policy Iteration Using Action-Value Function
- Greedy policy improvement over $V(s)$ requires model of MDP
$$ \pi'(s) = \arg\max_{a \in A} \left( R_s^a + \sum_{s' \in S} P_{ss'}^a V(s') \right) $$
- Greedy policy improvement over $Q(s,a)$ is model-free
$$ \pi'(s) = \arg\max_{a \in A} Q(s,a) $$
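A minimal sketch of this model-free greedy step in Python, assuming a tabular $Q$ stored as a dict keyed by (state, action) pairs; the function name and layout are illustrative, not from the lecture:

```python
def greedy_action(Q, state, actions):
    # pi'(state) = argmax_a Q(state, a): no reward or transition model needed
    return max(actions, key=lambda a: Q[(state, a)])
```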
Generalised Policy Iteration with Action-Value Function
Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
Policy improvement: Greedy policy improvement
$\epsilon$-Greedy Exploration
- Simplest idea for ensuring continual exploration
- All $m$ actions are tried with non-zero probability
- With probability $1-\epsilon$ choose the greedy action
- With probability $\epsilon$ choose an action at random
$$ \pi(a|s) = \begin{cases} \epsilon/m + 1 -\epsilon \quad &\text{if } a = a^* = \arg\max_{a'\in A} Q(s,a') \\ \epsilon/m &\text{otherwise}\end{cases}$$
Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, $v_{\pi'} (s) \ge v_\pi (s)$
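A minimal sketch of $\epsilon$-greedy action selection as defined above, under the same hypothetical tabular-Q layout:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon act uniformly at random, otherwise act greedily.
    The greedy action therefore has total probability epsilon/m + 1 - epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: uniform over all m actions
    return max(actions, key=lambda a: Q[(state, a)])      # exploit: greedy w.r.t. Q
```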
Monte-Carlo Policy Iteration
- Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
Monte-Carlo Control
Every episode:
- Policy evaluation: Monte-Carlo policy evaluation, $Q\approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
GLIE
Definition
Greedy in the Limit with Infinite Exploration (GLIE)
- All state-action pairs are explored infinitely many times
$$ \lim_{k\rightarrow \infty} N_k (s,a) =\infty $$
- The policy converges on a greedy policy
$$ \lim_{k \rightarrow \infty} \pi_k (a|s) = 1(a = \arg\max_{a' \in A} Q_k (s,a') )$$
GLIE Monte-Carlo Control
- Sample $k$th episode using $\pi$: $(S_1,A_1,R_2,\dots,S_T) \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode
$$ N(S_t,A_t) \leftarrow N(S_t,A_t) +1 $$
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}(G_t - Q(S_t,A_t)) $$
- Improve policy based on new action-value function
$$ \epsilon \leftarrow 1/k \qquad \pi \leftarrow \epsilon\text{-greedy}(Q) $$
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$
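A sketch of GLIE Monte-Carlo control under stated assumptions: a tabular $Q$, every-visit updates, and a hypothetical Gym-like environment whose `step(action)` returns `(next_state, reward, done)`; the interface details are assumptions, not from the lecture.

```python
from collections import defaultdict
import random

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)   # action-values Q[(s, a)]
    N = defaultdict(int)     # visit counts N[(s, a)]
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                      # GLIE schedule: epsilon_k = 1/k
        # Generate one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Every-visit MC update: incremental mean towards the return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```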
On-policy Temporal-Difference Learning
MC vs. TD Control
- TD learning has several advantages over Monte-Carlo (MC)
- Lower variance
- Online
- Incomplete sequences
- Natural idea: use TD instead of MC in our control loop
- Apply TD to $Q(S,A)$
- Use $\epsilon$-greedy policy improvement
- Update every time-step
Sarsa($\lambda$)
Updating Action-Value Functions with Sarsa
$$ Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') -Q(S,A)) $$
On-Policy Control With Sarsa
Every time-step:
Policy evaluation: Sarsa, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement
Sarsa Algorithm for On-Policy Control
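The original slide gives the full Sarsa pseudocode; below is a minimal Python sketch of it, again assuming a tabular $Q$ and the same hypothetical environment interface as above.

```python
from collections import defaultdict
import random

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], initialised to 0

    def policy(state):
        # epsilon-greedy behaviour derived from the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # Sarsa update: move Q(S, A) towards R + gamma * Q(S', A')
            target = reward + (gamma * Q[(next_state, next_action)] if not done else 0.0)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```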
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$, under the following conditions:
- GLIE sequence of policies $\pi_t (a|s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$
$$ \sum_{t=1}^{\infty} \alpha_t = \infty \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty $$
n-Step Sarsa
- Define the n-step Q-return
$$ q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n},A_{t+n}) $$
- n-step Sarsa updates $Q(s,a)$ towards the n-step Q-return
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^{(n)} - Q(S_t,A_t)) $$
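A sketch of the n-step Q-return computed from a recorded trajectory. The 0-based indexing convention is an assumption: `rewards[t]` holds $R_{t+1}$, and `states[t]`, `actions[t]` hold $S_t$, $A_t$.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma=1.0):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n})."""
    T = len(rewards)                      # episode yields rewards R_1, ..., R_T
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):     # accumulate up to n discounted rewards
        g += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap only if the episode has not ended
        g += discount * Q[(states[t + n], actions[t + n])]
    return g
```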
Forward View Sarsa($\lambda$)
- The $q^\lambda$ return combines all n-step Q-returns $q_t^{(n)}$
- Using weight $(1-\lambda)\lambda^{n-1}$
$$ q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)} $$
- Forward-view Sarsa($\lambda$)
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^\lambda - Q(S_t,A_t)) $$
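A sketch of the forward-view $\lambda$-return for a complete episode, reusing the hypothetical `n_step_q_return` helper above. The infinite sum is truncated at the episode end, with the remaining weight $\lambda^{T-t-1}$ placed on the full Monte-Carlo return, as in the usual $\lambda$-return treatment.

```python
def lambda_q_return(rewards, states, actions, Q, t, lam, gamma=1.0):
    """q_t^lambda = (1 - lam) * sum_n lam^(n-1) q_t^(n), truncated at termination."""
    T = len(rewards)
    total, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):                    # bootstrapped n-step returns
        total += weight * n_step_q_return(rewards, states, actions, Q, t, n, gamma)
        weight *= lam
    # all n >= T - t give the full Monte-Carlo return; their weights sum to lam^(T-t-1)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    return total + lam ** (T - t - 1) * G
```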
Backward View Sarsa($\lambda$)
- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair
$$ E_0 (s,a) = 0 \qquad E_t(s,a) = \gamma \lambda E_{t-1} (s,a) + 1(S_t = s,A_t =a) $$
- $Q(s,a)$ is updated for every state $s$ and action $a$
- In proportion to TD-error $\delta_t$ and eligibility trace $E_t(s,a)$
$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) -Q(S_t,A_t)$$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a) $$
Sarsa($\lambda$) Algorithm
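The slide here shows the backward-view Sarsa($\lambda$) pseudocode; a sketch with accumulating eligibility traces, under the same tabular/Gym-like assumptions as the earlier snippets:

```python
from collections import defaultdict
import random

def sarsa_lambda(env, actions, num_episodes, lam=0.9, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)]

    def policy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)               # eligibility traces, reset each episode
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # TD error delta_t = R + gamma * Q(S', A') - Q(S, A)
            target = reward + (gamma * Q[(next_state, next_action)] if not done else 0.0)
            delta = target - Q[(state, action)]
            E[(state, action)] += 1.0        # accumulating trace for the visited pair
            # update every (s, a) in proportion to delta and its eligibility, then decay
            for sa in list(E):
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            state, action = next_state, next_action
    return Q
```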
Off-Policy Learning
- Evaluate target policy $\pi(a|s)$ to compute $v_\pi (s)$ or $q_\pi (s,a)$
- While following behaviour policy $\mu(a|s)$
$$ (S_1,A_1,R_2,\dots,S_T) \sim \mu $$
- Why is this important?
- Learn from observing humans or other agents
- Re-use experience generated from old policies $\pi_1,\pi_2,\dots,\pi_{t-1}$
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy
Importance Sampling
Estimate the expectation of a different distribution
$$ E_{X\sim P} [f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = E_{X\sim Q} \left[ \frac{P(X)}{Q(X)} f(X)\right] $$
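A tiny sketch of this identity as a Monte-Carlo estimator: draw samples from $Q$ and reweight each by the likelihood ratio $P(X)/Q(X)$. Here `p`, `q`, and `sample_q` are hypothetical callables giving the two densities and a sampler for $Q$.

```python
def importance_sampling_estimate(f, p, q, sample_q, n=10000):
    """Estimate E_{X~P}[f(X)] from samples X ~ Q, weighted by P(X)/Q(X)."""
    total = 0.0
    for _ in range(n):
        x = sample_q()                      # draw X ~ Q
        total += (p(x) / q(x)) * f(x)       # reweight by the likelihood ratio
    return total / n
```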
Importance Sampling for Off-Policy Monte-Carlo
- Use returns generated from $\mu$ to evaluate $\pi$
- Weight return $G_t$ according to similarity between policies
- Multiply importance sampling corrections along whole episode
$$ G_t^{\pi/\mu} = \frac{\pi(A_t|S_t) \pi(A_{t+1}|S_{t+1}) \dots \pi(A_T|S_T)}{\mu(A_t|S_t) \mu(A_{t+1}|S_{t+1}) \dots \mu(A_T|S_T)}G_t $$
- Update value towards corrected return
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi/\mu} - V(S_t)) $$
- Cannot use if $\mu$ is zero when $\pi$ is non-zero
- Importance sampling can dramatically increase variance
On the other hand, we can get lower variance by using TD instead, as follows:
- Use TD target generated from $\mu$ to evaluate $\pi$
- Weight TD target $R+\gamma V(S')$ by importance sampling
- Only need a single importance sampling correction
$$ V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1} + \gamma V(S_{t+1})) -V(S_t) \right)$$
- Much lower variance than Monte-Carlo importance sampling
- Policies only need to be similar over a single step
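A sketch of this one-step importance-sampled TD(0) update, assuming `V` is a tabular value function (e.g. a defaultdict), that `pi(a, s)` and `mu(a, s)` return the two policies' action probabilities, and that the transition was generated by $\mu$; all of these names are illustrative.

```python
def off_policy_td0_update(V, state, action, reward, next_state, done,
                          pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) step: weight the TD target by pi(a|s) / mu(a|s)."""
    rho = pi(action, state) / mu(action, state)        # single-step importance ratio
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (rho * target - V[state])
    return V
```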
Q-learning
- We now consider off-policy learning of action-values $Q(s,a)$
- No importance sampling is required
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu (\cdot |S_{t+1})$
- But we consider an alternative successor action $A' \sim \pi(\cdot| S_{t+1})$
- And update $Q(S_t,A_t)$ towards value of alternative action
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1},A') - Q(S_t,A_t)) $$
Off-policy Control with Q-learning
- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
$$ \pi (S_{t+1}) = \arg\max_{a'} Q(S_{t+1},a') $$
- The behaviour policy $\mu$ is e.g. $\epsilon$-greedy w.r.t. $Q(s,a)$
- The Q-learning target then simplifies
$$ R_{t+1} + \gamma Q(S_{t+1},A') = R_{t+1} + \gamma Q(S_{t+1},\arg\max_{a'} Q(S_{t+1},a') ) = R_{t+1}+ \gamma \max_{a'} Q(S_{t+1},a') $$
Theorem
Q-learning control converges to the optimal action-value function $Q(s,a) \rightarrow q_* (s,a) $
Q-learning Algorithm for Off-Policy Control
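The slide shows the standard Q-learning pseudocode; a minimal sketch under the same tabular/Gym-like assumptions as before:

```python
from collections import defaultdict
import random

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)]
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy w.r.t. Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target policy: greedy w.r.t. Q, so bootstrap with max_a' Q(S', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```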
Summary
Relationship between DP and TD

| Full Backup (DP) | Sample Backup (TD) |
|---|---|
| Iterative Policy Evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD Learning: $V(S) \xleftarrow{\alpha} R + \gamma V(S')$ |
| Q-Policy Iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S',A') \mid s,a]$ | Sarsa: $Q(S,A) \xleftarrow{\alpha} R + \gamma Q(S',A')$ |
| Q-Value Iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a' \in A} Q(S',a') \mid s,a]$ | Q-Learning: $Q(S,A) \xleftarrow{\alpha} R + \gamma \max_{a' \in A} Q(S',a')$ |

where $x \xleftarrow{\alpha} y \iff x \leftarrow x + \alpha (y-x)$