RL4 Model-Free Prediction

reference:
UCL Course on RL

Lecture 4 Model-Free Prediction

Estimate the value function of an unknown MDP

Monte-Carlo Reinforcement Learning

  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions/rewards
  • MC learns from complete episodes: no bootstrapping
  • MC uses the simplest possible idea: value = mean return
  • Caveat: can only apply MC to episodic MDPs
    • All episodes must terminate

MC Policy Evaluation

  • Goal: learn $v_{\pi}$ from episodes of experience under policy $\pi$, $S_1,A_1,R_2,\dots,S_k \sim \pi $
  • Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T} $
  • Recall that the value function is the expected return: $v_{\pi} (s) = E_{\pi} [G_t|S_t =s]$
  • MC policy evaluation uses empirical mean return instead of expected return

First-Visit MC Policy Evaluation

  • To evaluate state $s$
  • The first time-step $t$ that state $s$ is visited in an episode
  • Increment counter $N(s) \longleftarrow N(s) +1 $
  • Increment total return $S(s) \longleftarrow S(s) +G_t$
  • Value is estimated by mean return $V(s) = S(s)/N(s) $
  • By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$

Every-Visit MC Policy Evaluation

  • To evaluate state $s$
  • Every time-step $t$ that state $s$ is visited in an episode
  • Increment counter $N(s) \longleftarrow N(s) +1 $
  • Increment total return $S(s) \longleftarrow S(s) +G_t$
  • Value is estimated by mean return $V(s) = S(s)/N(s) $
  • By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$
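As a concrete illustration, here is a minimal Python sketch of both procedures. It assumes a hypothetical episodic environment `env` with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, plus a `policy` function mapping states to actions; these names and interfaces are illustrative, not part of the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, gamma=1.0, n_episodes=10_000, first_visit=True):
    """Tabular MC prediction: estimate V(s) as the empirical mean return."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return S(s)
    V = defaultdict(float)    # value estimate V(s) = S(s) / N(s)

    for _ in range(n_episodes):
        # Generate one complete episode under the policy (MC needs termination).
        episode = []                                     # (S_t, R_{t+1}) pairs
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)  # assumed interface
            episode.append((state, reward))
            state = next_state

        # First occurrence time of each state, needed for first-visit MC.
        first_occurrence = {}
        for t, (s, _) in enumerate(episode):
            first_occurrence.setdefault(s, t)

        # Walk backwards through the episode, accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit and first_occurrence[s] != t:
                continue                                 # count first visits only
            N[s] += 1
            S[s] += G
            V[s] = S[s] / N[s]
    return V
```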

Incremental MC

Incremental Mean
The mean $\mu_1,\mu_2,\dots$ of a sequence $x_1,x_2,\dots$ can be computed incrementally,
$$\mu_k = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$
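This identity follows by pulling the latest sample out of the sum:
$$ \mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j = \frac{1}{k} \Big( x_k + \sum_{j=1}^{k-1} x_j \Big) = \frac{1}{k} \big( x_k + (k-1)\mu_{k-1} \big) = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$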

Incremental MC updates

  • Update $V(s)$ incrementally after episode $S_1,A_1,R_2,\dots,S_T$
  • For each state $S_t$ with return $G_t$:
    $$ N(S_t) \leftarrow N(S_t) + 1 $$
    $$ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t)) $$
  • In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes
    $$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$

Here $\alpha$ is a constant step-size (learning rate).
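As a sketch, both forms of the update can be written as a small helper operating on the `V` and `N` tables from the MC sketch above (again an illustrative assumption, not the lecture's code):

```python
def incremental_mc_update(V, N, s, G, alpha=None):
    """One incremental MC update for a visited state s with observed return G."""
    N[s] += 1
    if alpha is None:
        V[s] += (G - V[s]) / N[s]      # exact running mean (stationary case)
    else:
        V[s] += alpha * (G - V[s])     # constant step-size: forgets old episodes
```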

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions/rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess

TD algorithm

  • Goal: learn $v_{\pi}$ online from experience under policy $\pi$
  • Incremental every-visit MC
    • Update value $V(S_t)$ toward actual return $G_t$
      $$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$
  • Simplest temporal-difference learning algorithm: TD(0)
    • Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$
      $$ V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) -V(S_t)) $$
    • $R_{t+1}+\gamma V(S_{t+1})$ is called the TD target
    • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) -V(S_t)$ is called the TD error
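A minimal sketch of tabular TD(0), under the same hypothetical `env`/`policy` interface assumed in the MC sketch:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, gamma=1.0, alpha=0.1, n_episodes=10_000):
    """Tabular TD(0): update V(S_t) towards R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)    # assumed interface
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]                 # delta_t
            V[state] += alpha * td_error                    # bootstrap update
            state = next_state
    return V
```

Unlike the MC sketch, the update happens inside the episode, after every step.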

Bias/Variance Trade-off

  • Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_{\pi} (S_t)$
  • True TD target $R_{t+1} + \gamma v_{\pi} (S_{t+1}) $ is an unbiased estimate of $v_{\pi}(S_t)$
  • TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_{\pi} (S_t)$
  • TD target is much lower variance than the return
    • Return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward

Batch MC and TD

  • MC and TD converge: $V(s) \rightarrow v_{\pi} (s)$ as experience $\rightarrow \infty$
  • But what about batch solution for finite experience?
    • e.g. Repeatedly sample episode $k\in [1,K] $
    • Apply MC or TD(0) to episode $k$

Certainty Equivalence

  • MC converges to solution with minimum mean-squared error

    • Best fit to the observed returns
      $$ \sum_{k=1}^{K} \sum_{t=1}^{T_k} (G_t^k -V(s_t^k))^2 $$
  • TD(0) converges to solution of max likelihood Markov model

    • Solution to the MDP $(S,A,\hat{P},\hat{R},\gamma)$ that best fits the data
      $$ \hat{P}_{s,s'}^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k,s_{t+1}^k = s,a,s') $$
      $$ \hat{R}_s^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k = s,a)\, r_t^k $$
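To make this concrete, here is a minimal sketch of fitting the maximum-likelihood model from a batch of episodes, where each episode is assumed to be a list of `(s, a, r, s_next)` tuples (an illustrative data format, not from the lecture):

```python
from collections import defaultdict

def fit_ml_model(episodes):
    """Count-based estimates of P_hat(s'|s,a) and R_hat(s,a) from batch data."""
    visits = defaultdict(int)                            # N(s, a)
    transitions = defaultdict(lambda: defaultdict(int))  # counts of (s, a) -> s'
    reward_sum = defaultdict(float)                      # total reward at (s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            visits[(s, a)] += 1
            transitions[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r

    P_hat = {sa: {s2: n / visits[sa] for s2, n in nexts.items()}
             for sa, nexts in transitions.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```

Batch TD(0) converges to $v_\pi$ of this estimated MDP, whereas batch MC simply minimises the squared error against the observed returns.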

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments
  • MC has high variance, zero bias
    • Good convergence properties
    • (even with function approximation)
    • Not very sensitive to initial value
    • Very simple to understand and use
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to $v_{\pi} (s)$
    • (but not always with function approximation)
    • More sensitive to initial value
  • TD exploits Markov property
    • Usually more efficient in Markov environments
  • MC does not exploit Markov property
    • Usually more efficient in non-Markov environments

Unified View

(Figures: Monte-Carlo backup, temporal-difference backup, and dynamic-programming backup diagrams, followed by the unified view of reinforcement learning.)

Bootstrapping and Sampling

  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP/TD bootstraps
  • Sampling: update samples an expectation
    • MC/TD samples
    • DP does not sample

TD($\lambda$)

n-Step Return

  • Consider the following n-step returns for $n=1,2,\dots$
    • n=1 (TD) $G_t^{(1)}=R_{t+1} + \gamma V(S_{t+1})$
    • n=2 $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
    • n=$\infty$, $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T $
  • Define the n-step return
    $$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$
  • n-step temporal-difference learning (a sketch follows this list)
    $$ V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right) $$
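A minimal sketch of computing $G_t^{(n)}$ from a recorded episode and a current value table `V`; `rewards[t]` is assumed to hold $R_{t+1}$ and `states[t]` to hold $S_t$ (illustrative conventions):

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """Return G_t^(n): n rewards plus a bootstrap from V(S_{t+n}) if not terminal."""
    T = len(rewards)                       # episode terminates after step T
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]         # accumulate R_{t+1}, ..., R_{t+n}
        discount *= gamma
    if t + n < T:
        G += discount * V[states[t + n]]   # bootstrap from the value estimate
    return G
```

The n-step TD update is then `V[states[t]] += alpha * (n_step_return(...) - V[states[t]])`.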

Averaging n-Step Returns

  • We can average n-step returns over different n
  • e.g. average the 2-step and 4-step returns: $\frac{1}{2} G^{(2)} + \frac{1}{2} G^{(4)}$
  • Combines information from two different time-steps
  • Can we efficiently combine information from all time-steps?

$\lambda$ return

  • The $\lambda$-return $G_t^{\lambda}$ combines all n-step returns $G_t^{(n)}$
  • Using weight $(1-\lambda)\lambda^{n-1}$
    $$ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$
  • Forward-view TD($\lambda$)
    $$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\lambda} -V(S_t)) $$

Forward-view TD($\lambda$)

$$ G_{t}^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$

  • Update value function towards the $\lambda$-return
  • Forward-view looks into the future to compute $G_t^{\lambda}$
  • Like MC, it can only be computed from complete episodes (see the sketch below)
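A minimal sketch of the forward-view computation of $G_t^{\lambda}$ for an episodic task, reusing the `n_step_return` helper sketched earlier (itself an illustrative assumption). For $n \ge T-t$ every n-step return equals the full return $G_t$, so the residual geometric weight $\lambda^{T-t-1}$ is placed on $G_t$:

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda for a finite (terminating) episode."""
    T = len(rewards)
    G_lambda, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                        # n-step returns that bootstrap
        G_lambda += weight * n_step_return(rewards, states, V, t, n, gamma)
        weight *= lam
    # All remaining weight goes on the full (Monte-Carlo) return.
    G_lambda += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lambda
```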

Backward-view TD($\lambda$)

  • Forward view provides theory
  • Backward view provides mechanism
  • Update online, every step, from incomplete sequences
  • Keep an eligibility trace $E_t(s)$ for every state $s$, combining the frequency and recency of visits:
    $$ E_0(s) = 0 \qquad E_t(s) = \gamma\lambda E_{t-1}(s) + 1(S_t = s) $$
  • Update value $V(s)$ for every state $s$, in proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s)$ (sketched after this list):
    $$ \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) $$
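A minimal sketch of backward-view TD($\lambda$) with accumulating traces, under the same hypothetical `env`/`policy` interface as the earlier sketches:

```python
from collections import defaultdict

def td_lambda(env, policy, gamma=1.0, lam=0.9, alpha=0.1, n_episodes=10_000):
    """Tabular backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        E = defaultdict(float)                     # eligibility trace, E_0(s) = 0
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)   # assumed interface
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            # Decay every trace, then bump the trace of the current state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0
            # Update every state in proportion to its trace and the TD error.
            for s, e in E.items():
                V[s] += alpha * delta * e
            state = next_state
    return V
```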

Relationship Between Forward and Backward TD

TD($\lambda$) and TD(0)

  • When $\lambda=0$, only the current state is updated
    $$ E_t (s) = 1(S_t =s) \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
  • This is exactly equivalent to TD(0) update
    $$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$

TD($\lambda$) and MC

  • When $\lambda=1$, credit is deferred until end of episode
  • Consider episodic environments with offline updates
  • Over the course of an episode, total update for TD(1) is the same as total update for MC

    Theorem
    The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
    $$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha (G_t^{\lambda} - V(S_t) ) 1(S_t =s) $$

Forward and Backward Equivalence

MC and TD(1)

  • Consider an episode where $s$ is visited once at time-step $k$.
  • TD(1) eligibility trace discounts time since visit,
    $$ E_t(s) = \gamma E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ \gamma^{t-k} & \text{if} ~ t\ge k \end{cases}$$
  • TD(1) updates accumulate error online
    $$ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) $$
  • By end of episode it accumulates total error
    $$ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} $$
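The last equality can be checked by telescoping the TD errors, assuming $V$ is held fixed during the episode (offline updates) and $V(S_T)=0$ at the terminal state:
$$ \begin{aligned} \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t &= \sum_{t=k}^{T-1} \gamma^{t-k} \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big) \\ &= \big( R_{k+1} + \gamma R_{k+2} + \dots + \gamma^{T-1-k} R_T \big) + \gamma^{T-k} V(S_T) - V(S_k) \\ &= G_k - V(S_k) \end{aligned} $$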

TD($\lambda$) and TD(1)

  • TD(1) is roughly equivalent to every-visit Monte-Carlo
  • Error is accumulated online, step-by-step
  • If value function is only updated offline at end of episode
  • Then total update is exactly the same as MC

Forward and Backward TD($\lambda$)

  • Consider an episode where $s$ is visited once at time-step $k$
  • TD($\lambda$) eligibility trace discounts time since visit
    $$ E_t(s) = \gamma\lambda E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ (\gamma\lambda)^{t-k} & \text{if} ~ t\ge k \end{cases}$$
  • Backward TD($\lambda$) updates accumulate error online
    $$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^T (\gamma\lambda)^{t-k} \delta_t = \alpha(G_k^\lambda - V(S_k)) $$
  • By end of episode it accumulates total error for $\lambda$-return
  • For multiple visits to $s$, $E_t(s)$ accumulates many errors

Offline Equivalence of Forward and Backward TD
Offline updates

  • Updates are accumulated within episode
  • but applied in batch at the end of episode

Online updates

  • TD($\lambda$) updates are applied online at each step within episode
  • Forward and backward-view TD($\lambda$) are slightly different
  • New: Exact online TD($\lambda$) achieves perfect equivalence
  • By using a slightly different form of eligibility trace

Summary of Forward and Backward TD($\lambda$)
