Reference:
UCL Course on RL
Lecture 5: Model-Free Control
Introduction
Goal: optimise the value function of an unknown MDP.
Model-free control applies when either:
- the MDP model is unknown, but experience can be sampled, or
- the MDP model is known, but is too big to use directly, except by sampling
On and Off-Policy Learning
- On-policy learning
- Learn on the job
- Learn about policy $\pi$ from experience sampled from $\pi$
- Off-policy learning
- Look over someone’s shoulder
- Learn about policy $\pi$ from experience sampled from $\mu$
On-Policy Monte-Carlo Control
Generalised Policy Iteration
Generalised Policy Iteration (Refresher)
Policy evaluation: Estimate $v_\pi$, Iterative policy evaluation
Policy improvement: Generate $\pi' \ge \pi$, Greedy policy improvement
Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V=v_\pi$
Policy improvement: Greedy policy improvement
Model-Free Policy Iteration Using Action-Value Function
- Greedy policy improvement over $V(s)$ requires model of MDP
$$ \pi'(s) = \arg\max_{a \in A} \left( R_s^a + \sum_{s' \in S} P_{ss'}^a V(s') \right) $$
- Greedy policy improvement over $Q(s,a)$ is model-free
$$ \pi'(s) = \arg\max_{a \in A} Q(s,a) $$
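A minimal sketch of this model-free greedy step in Python, assuming a tabular $Q$ stored as a dict keyed by (state, action) pairs; the function name and layout are illustrative, not from the lecture:

```python
def greedy_action(Q, state, actions):
    # pi'(state) = argmax_a Q(state, a): no reward or transition model needed
    return max(actions, key=lambda a: Q[(state, a)])
```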
Generalised Policy Iteration with Action-Value Function
Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
Policy improvement: Greedy policy improvement
$\epsilon$-Greedy Exploration
- Simplest idea for ensuring continual exploration
- All $m$ actions are tried with non-zero probability
- With probability $1-\epsilon$ choose the greedy action
- With probability $\epsilon$ choose an action at random
$$ \pi(a|s) = \begin{cases} \epsilon/m + 1 -\epsilon \quad &\text{if } a = a^* = \arg\max_{a'\in A} Q(s,a') \\ \epsilon/m &\text{otherwise}\end{cases}$$
Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, $v_{\pi'} (s) \ge v_\pi (s)$
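A minimal sketch of $\epsilon$-greedy action selection as defined above, under the same hypothetical tabular-Q layout:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon act uniformly at random, otherwise act greedily.
    The greedy action therefore has total probability epsilon/m + 1 - epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: uniform over all m actions
    return max(actions, key=lambda a: Q[(state, a)])      # exploit: greedy w.r.t. Q
```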
Monte-Carlo Policy Iteration
- Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
Monte-Carlo Control
Every episode:
- Policy evaluation: Monte-Carlo policy evaluation, $Q\approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
GLIE
Definition
Greedy in the Limit with Infinite Exploration (GLIE)
- All state-action pairs are explored infinitely many times
$$ \lim_{k\rightarrow \infty} N_k (s,a) =\infty $$
- The policy converges on a greedy policy
$$ \lim_{k \rightarrow \infty} \pi_k (a|s) = 1(a = \arg\max_{a' \in A} Q_k (s,a') )$$
GLIE Monte-Carlo Control
- Sample $k$th episode using $\pi$: $(S_1,A_1,R_2,\dots,S_T) \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode
$$ N(S_t,A_t) \leftarrow N(S_t,A_t) +1 $$
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}(G_t - Q(S_t,A_t)) $$
- Improve policy based on new action-value function
$$ \epsilon \leftarrow 1/k \qquad \pi \leftarrow \epsilon\text{-greedy}(Q) $$
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$
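A sketch of GLIE Monte-Carlo control under stated assumptions: a tabular $Q$, every-visit updates, and a hypothetical Gym-like environment whose `step(action)` returns `(next_state, reward, done)`; the interface details are assumptions, not from the lecture.

```python
from collections import defaultdict
import random

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)   # action-values Q[(s, a)]
    N = defaultdict(int)     # visit counts N[(s, a)]
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                      # GLIE schedule: epsilon_k = 1/k
        # Generate one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Every-visit MC update: incremental mean towards the return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```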
On-policy Temporal-Difference Learning
MC vs. TD Control
- TD learning has several advantages over Monte-Carlo (MC)
- Lower variance
- Online
- Incomplete sequences
- Natural idea: use TD instead of MC in our control loop
- Apply TD to $Q(S,A)$
- Use $\epsilon$-greedy policy improvement
- Update every time-step
Sarsa($\lambda$)
Updating Action-Value Functions with Sarsa
$$ Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S',A') -Q(S,A)) $$
On-Policy Control With Sarsa
Every time-step:
Policy evaluation: Sarsa, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement
Sarsa Algorithm for On-Policy Control
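The original slide gives the full Sarsa pseudocode; below is a minimal Python sketch of it, again assuming a tabular $Q$ and the same hypothetical environment interface as above.

```python
from collections import defaultdict
import random

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], initialised to 0

    def policy(state):
        # epsilon-greedy behaviour derived from the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # Sarsa update: move Q(S, A) towards R + gamma * Q(S', A')
            target = reward + (gamma * Q[(next_state, next_action)] if not done else 0.0)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```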
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$, under the following conditions:
- GLIE sequence of policies $\pi_t (a|s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$
$$ \sum_{t=1}^{\infty} \alpha_t = \infty \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty $$
n-Step Sarsa
- Define the n-step Q-return
$$ q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n},A_{t+n}) $$
- n-step Sarsa updates $Q(s,a)$ towards the n-step Q-return
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^{(n)} - Q(S_t,A_t)) $$
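A sketch of the n-step Q-return computed from a recorded trajectory. The 0-based indexing convention is an assumption: `rewards[t]` holds $R_{t+1}$, and `states[t]`, `actions[t]` hold $S_t$, $A_t$.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma=1.0):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n})."""
    T = len(rewards)                      # episode yields rewards R_1, ..., R_T
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):     # accumulate up to n discounted rewards
        g += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap only if the episode has not ended
        g += discount * Q[(states[t + n], actions[t + n])]
    return g
```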
Forward View Sarsa($\lambda$)
- The $q^\lambda$ return combines all n-step Q-returns $q_t^{(n)}$
- Using weight $(1-\lambda)\lambda^{n-1}$
$$ q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)} $$
- Forward-view Sarsa($\lambda$)
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^\lambda - Q(S_t,A_t)) $$
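A sketch of the forward-view $\lambda$-return for a complete episode, reusing the hypothetical `n_step_q_return` helper above. The infinite sum is truncated at the episode end, with the remaining weight $\lambda^{T-t-1}$ placed on the full Monte-Carlo return, as in the usual $\lambda$-return treatment.

```python
def lambda_q_return(rewards, states, actions, Q, t, lam, gamma=1.0):
    """q_t^lambda = (1 - lam) * sum_n lam^(n-1) q_t^(n), truncated at termination."""
    T = len(rewards)
    total, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):                    # bootstrapped n-step returns
        total += weight * n_step_q_return(rewards, states, actions, Q, t, n, gamma)
        weight *= lam
    # all n >= T - t give the full Monte-Carlo return; their weights sum to lam^(T-t-1)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    return total + lam ** (T - t - 1) * G
```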
Backward View Sarsa($\lambda$)
- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair
$$ E_0 (s,a) = 0 \qquad E_t(s,a) = \gamma \lambda E_{t-1} (s,a) + 1(S_t = s,A_t =a) $$
- $Q(s,a)$ is updated for every state $s$ and action $a$
- In proportion to TD-error $\delta_t$ and eligibility trace $E_t(s,a)$
$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) -Q(S_t,A_t)$$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a) $$
Sarsa($\lambda$) Algorithm
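The slide here shows the backward-view Sarsa($\lambda$) pseudocode; a sketch with accumulating eligibility traces, under the same tabular/Gym-like assumptions as the earlier snippets:

```python
from collections import defaultdict
import random

def sarsa_lambda(env, actions, num_episodes, lam=0.9, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)]

    def policy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)               # eligibility traces, reset each episode
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # TD error delta_t = R + gamma * Q(S', A') - Q(S, A)
            target = reward + (gamma * Q[(next_state, next_action)] if not done else 0.0)
            delta = target - Q[(state, action)]
            E[(state, action)] += 1.0        # accumulating trace for the visited pair
            # update every (s, a) in proportion to delta and its eligibility, then decay
            for sa in list(E):
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam
            state, action = next_state, next_action
    return Q
```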
Off-Policy Learning
- Evaluate target policy $\pi(a|s)$ to compute $v_\pi (s)$ or $q_\pi (s,a)$
- While following behaviour policy $\mu(a|s)$
$$ (S_1,A_1,R_2,\dots,S_T) \sim \mu $$
- Why is this important?
- Learn from observing humans or other agents
- Re-use experience generated from old policies $\pi_1,\pi_2,\dots,\pi_{t-1}$
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy
Importance Sampling
Estimate the expectation of a different distribution
$$ E_{X\sim P} [f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = E_{X\sim Q} \left[ \frac{P(X)}{Q(X)} f(X)\right] $$
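A tiny sketch of this identity as a Monte-Carlo estimator: draw samples from $Q$ and reweight each by the likelihood ratio $P(X)/Q(X)$. Here `p`, `q`, and `sample_q` are hypothetical callables giving the two densities and a sampler for $Q$.

```python
def importance_sampling_estimate(f, p, q, sample_q, n=10000):
    """Estimate E_{X~P}[f(X)] from samples X ~ Q, weighted by P(X)/Q(X)."""
    total = 0.0
    for _ in range(n):
        x = sample_q()                      # draw X ~ Q
        total += (p(x) / q(x)) * f(x)       # reweight by the likelihood ratio
    return total / n
```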
Importance Sampling for Off-Policy Monte-Carlo
- Use returns generated from $\mu$ to evaluate $\pi$
- Weight return $G_t$ according to similarity between policies
- Multiply importance sampling corrections along whole episode
$$ G_t^{\pi/\mu} = \frac{\pi(A_t|S_t) \pi(A_{t+1}|S_{t+1}) \dots \pi(A_T|S_T)}{\mu(A_t|S_t) \mu(A_{t+1}|S_{t+1}) \dots \mu(A_T|S_T)}G_t $$
- Update value towards corrected return
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi/\mu} - V(S_t)) $$
- Cannot use if $\mu$ is zero when $\pi$ is non-zero
- Importance sampling can dramatically increase variance
On the other hand, we can get lower variance by using TD instead, as follows:
- Use TD target generated from $\mu$ to evaluate $\pi$
- Weight TD target $R+\gamma V(S')$ by importance sampling
- Only need a single importance sampling correction
$$ V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1} + \gamma V(S_{t+1})) -V(S_t) \right)$$
- Much lower variance than Monte-Carlo importance sampling
- Policies only need to be similar over a single step
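A sketch of this one-step importance-sampled TD(0) update, assuming `V` is a tabular value function (e.g. a defaultdict), that `pi(a, s)` and `mu(a, s)` return the two policies' action probabilities, and that the transition was generated by $\mu$; all of these names are illustrative.

```python
def off_policy_td0_update(V, state, action, reward, next_state, done,
                          pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) step: weight the TD target by pi(a|s) / mu(a|s)."""
    rho = pi(action, state) / mu(action, state)        # single-step importance ratio
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (rho * target - V[state])
    return V
```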
Q-learning
- We now consider off-policy learning of action-values $Q(s,a)$
- No importance sampling is required
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu (\cdot |S_{t+1})$
- But we consider an alternative successor action $A' \sim \pi(\cdot| S_{t+1})$
- And update $Q(S_t,A_t)$ towards value of alternative action
$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1},A') - Q(S_t,A_t)) $$
Off-policy Control with Q-learning
- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
$$ \pi (S_{t+1}) = \arg\max_{a'} Q(S_{t+1},a') $$
- The behaviour policy $\mu$ is e.g. $\epsilon$-greedy w.r.t. $Q(s,a)$
- The Q-learning target then simplifies
$$ R_{t+1} + \gamma Q(S_{t+1},A') = R_{t+1} + \gamma Q(S_{t+1},\arg\max_{a'} Q(S_{t+1},a') ) = R_{t+1}+ \gamma \max_{a'} Q(S_{t+1},a') $$
Theorem
Q-learning control converges to the optimal action-value function $Q(s,a) \rightarrow q_* (s,a) $
Q-learning Algorithm for Off-Policy Control
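The slide shows the standard Q-learning pseudocode; a minimal sketch under the same tabular/Gym-like assumptions as before:

```python
from collections import defaultdict
import random

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)]
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy w.r.t. Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target policy: greedy w.r.t. Q, so bootstrap with max_a' Q(S', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```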
Summary
Relationship between DP and TD

| Full Backup (DP) | Sample Backup (TD) |
|---|---|
| Iterative Policy Evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD Learning: $V(S) \xleftarrow{\alpha} R + \gamma V(S')$ |
| Q-Policy Iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma Q(S',A') \mid s,a]$ | Sarsa: $Q(S,A) \xleftarrow{\alpha} R + \gamma Q(S',A')$ |
| Q-Value Iteration: $Q(s,a) \leftarrow \mathbb{E}[R + \gamma \max_{a' \in A} Q(S',a') \mid s,a]$ | Q-Learning: $Q(S,A) \xleftarrow{\alpha} R + \gamma \max_{a' \in A} Q(S',a')$ |

where $x \xleftarrow{\alpha} y \iff x \leftarrow x + \alpha (y-x)$