RL5 Model-Free Control

reference:
UCL Course on RL

lecture 5: Model-Free Control

Introduction

Optimise the value function of an unknown MDP

Model-free control can solve these problems, either:

  1. MDP model is unknown, but experience can be sampled
  2. MDP model is known, but is too big to use, except by samples

On and Off-Policy Learning

  • On-policy learning
    • Learn on the job
    • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off-policy learning
    • Look over someone’s shoulder
    • Learn about policy $\pi$ from experience sampled from $\mu$

On-Policy Monte-Carlo Control

Generalised Policy Iteration

Generalised Policy Iteration(Refresher)
Policy evaluation: Estimate $v_\pi$, Iterative policy evaluation
Policy improvement: Generate $\pi’ \ge \pi$, Greedy policy improvement

Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V=v_\pi$
Policy improvement: Greedy policy improvement

Model-Free Policy Iteration Using Action-Value Function

  • Greedy policy improvement over $V(s)$ requires model of MDP
    $$ \pi’(s) = \arg\max_{a \in A} R_s^a +P_{ss’}^a V(s’)$$
  • Greedy policy improvement over $Q(s,a)$ is model-free
    $$ \pi’(s) = \arg\max_{a \in A} Q(s,a) $$

Generalised Policy Iteration with Action-Value Function
Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
Policy improvement: Greedy policy improvement

$\epsilon$-Greedy Exploration

  • Simplest idea for ensuring continual exploration
  • All $m$ actions are tried with non-zero probability
  • With probability $1-\epsilon$ choose the greedy action
  • With probability $\epsilon$ choose an action at random
    $$ \pi(a|s) = \begin{cases} \epsilon/m + 1 -\epsilon \quad &\text{if } a^* = \arg\max_{a\in A} Q(s,a) \\ \epsilon/m &\text{otherwise}\end{cases}$$

Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi’$ with respect to $q_\pi$ is an improvement, $v_{\pi’} (s) \ge v_\pi (s) $

Monte-Carlo Policy Iteration

  • Policy evaluation Monte-Carlo policy evaluation, $Q=q_\pi$
  • Policy improvement $\epsilon$-greedy policy improvement

Monte-Carlo Control
Every episode

  • Policy evaluation Monte-Carlo policy evaluation $Q\approx q_\pi$
  • Policy improvement $\epsilon$-greedy policy improvement

GLIE

Definition
Greedy in the Limit with Infinite Exploration(GLIE)

  • All state-action pairs are explored infinitely many times
    $$ \lim_{k\rightarrow \infty} N_k (s,a) =\infty $$
  • The policy converges on a greedy policy
    $$ \lim_{k \rightarrow \infty} \pi_k (a|s) = 1(a = \arg\max_{a’ \in A} Q_k (s,a’) )$$

GLIE Monte-Carlo Control

  • Sample $k$th episode using $\pi$: $(S_1,A_1,R_2,\dots,S_T) \sim \pi$
  • For each state $S_t$ and action $A_t$ in the episode
    $$ N(S_t,A_t) \leftarrow N(S_t,A_t) +1 $$
    $$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}(G_t - Q(S_t,A_t)) $$
  • Improve policy based on new action-value function
    $$ \epsilon \leftarrow 1/k\qquad \pi \leftarrow \epsilon-\text{greedy}(Q) $$

    Theorem
    GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$

On-policy Temporal-Difference Learning

MC vs. TD Control

  • TD learning has several advantages over Monte-Carlo(MC)
    • Lower variance
    • Online
    • Incomplete sequences
  • Natural idea: use TD instead of MC in our control loop
    • Apply TD to $Q(S,A)$
    • Use $\epsilon$-greedy policy improvement
    • Update every time-step

Sarsa($\lambda$)

Updating Action-Value Functions with Sarsa
$$ Q(S,A) \leftarrow Q(S,A) + \alpha(R+\gamma Q(S’,A’) -Q(S,A)) $$

On-Policy Control With Sarsa
Every time-step:
Policy evaluation Sarsa, $Q\approx q_{\pi} $
Policy improvement $\epsilon$-greedy policy improvement

Sarsa Algorithm for On-Policy Control
Sarsa Algorithm for On-Policy Control

Convergence of Sarsa

Theorem
Sarsa converges to the optimal action-value function, $Q(s,a) \rightarrow q_* (s,a)$, under the following conditions:

  • GLIE sequence of policies $\pi_t (a|s)$
  • Robbins-Monro sequence of step-sizes $\alpha_t$
    $$ \sum_{t=1}^{\infty} \alpha_t = \infty \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty $$

n-Step Sarsa

  • Define the n-step Q-return
    $$ q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}) $$
  • n-step Sarsa updates $Q(s,a)$ towards the n-step Q-return
    $$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^{(n)} - Q(S_t,A_t)) $$

Forward View Sarsa($\lambda$)

  • The $q^\lambda$ return combines all n-step Q-return $q_t^{(n)}$
  • Using weight $(1-\lambda)\lambda^{n-1}$
    $$ q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)} $$
  • Forward-view Sarsa($\lambda$)
    $$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^\lambda - Q(S_t,A_t)) $$

Backward View Sarsa($\lambda$)

  • Just like TD($\lambda$), we use eligibility traces in an online algorithm
  • But Sarsa($\lambda$) has one eligibility trace for each state-action pair
    $$ E_0 (s,a) = 0 \qquad E_t(s,a) = \gamma \lambda E_{t-1} (s,a) + 1(S_t = s,A_t =a) $$
  • $Q(s,a)$ is updated for every state $s$ and action $a$
  • In proportion to TD-error $\delta_t$ and eligibility trace $E_t(s,a)$
    $$ \delta_t = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) -Q(S_t,A_t)$$
    $$ Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a) $$

Sarsa($\lambda$) Algorithm
Sarsa($\lambda$) Algorithm

Off-Policy Learning

  • Evaluate target policy $\pi(a|s)$ to compute $v_\pi (s)$ or $q_\pi (s,a)$
  • While following behaviour policy $\mu(a|s)$
    $$ (S_1,A_1,R_2,\dots,S_T) \sim \mu $$
  • Why is this important?
    • Learn from observing humans or other agents
    • Re-use experience generated from old policies $\pi_1,\pi_2,\dots,\pi_{t-1}$
    • Learn about optimal policy while following exploratory policy
    • Learn about multiple policies while following one policy

Importance Sampling

Estimate the expectation of a different distribution
$$ E_{X\sim P} [f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = E_{X\sim Q} \left[ \frac{P(X)}{Q(X)} f(X)\right] $$

Importance Sampling for Off-Policy Monte-Carlo

  • Use returns generated from $\mu$ to evaluate $\pi$
  • Weight return $G_t$ according to similarity between policies
  • Multiply importance sampling corrections along whole episode
    $$ G_t^{\pi/\mu} = \frac{\pi(A_t|S_t) \pi(A_{t+1}|S_{t+1}) \dots \pi(A_T|S_T)}{\mu(A_t|S_t) \mu(A_{t+1}|S_{t+1}) \dots \mu(A_t|S_t)}G_t $$
  • Update value towards corrected return
    $$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi/\mu} - V(S_t)) $$
  • Cannot use if $\mu$ is zero when $\pi$ is non-zero
  • Importance sampling can dramatically increase variance

On the other hand, we can get lower variance using TD as follow:

  • Use TD target generated from $\mu$ to evaluate $\pi$
  • Weight TD target $R+\gamma V(S’)$ by importance sampling
  • Only need a single importance sampling correction
    $$ V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1} + \gamma V(S_{t+1})) -V(S_t) \right)$$
  • Much lower variance than Monte-Carlo importance sampling
  • Policies only need to be similar over a single step

Q-learning

  • We now consider off-policy learning of action-values $Q(s,a)$
  • No importance sampling is required
  • Next action is chosen using behaviour policy $A_{t+1} \sim \mu (\cdot |S_t) $
  • But we consider alternative successor action $A’ \sim \pi(\cdot| S_t)$
  • And update $Q(S_t,A_t)$ towards value of alternative action
    $$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1},A’) - Q(S_t,A_t)) $$

Off-policy Control with Q-learning

  • We now allow both behaviour and target policies to improve
  • The target policy $\pi$ is greedy w.r.t $Q(s,a)$
    $$ \pi (S_{t+1}) = \arg\max_{a’} Q(S_{t+1},a’) $$
  • The behaviour policy $\mu$ is e.g. $\epsilon$ w.r.t $Q(s,a)$
  • The Q-learning target then simplifies
    $$ R_{t+1} + \gamma Q(S_{t+1},A’) = R_{t+1} + \gamma Q(S_{t+1},\arg\max_{a’} Q(S_{t+1},a’) ) = R_{t+1}+ \max_{a’} \gamma Q(S_{t+1},a’) $$

Theorem
Q-learning control converges to the optimal action-value function $Q(s,a) \rightarrow q_* (s,a) $

Q-learning Algorithm for Off-Policy Control

Q-learning Algorithm for Off-Policy Control

Summary

Relationship between DP and TD

Realtionship between DP and TD

Relationship between DP and TD(2)
where $x \leftarrow^{\alpha} y \iff x \leftarrow x + \alpha (y-x) $

Donate comment here