reference:
UCL Course on RL
Lecture 4 Model-Free Prediction
Estimate the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transition/rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
MC Policy Evaluation
- Goal: learn $v_{\pi}$ from episodes of experience under policy $\pi$, $S_1,A_1,R_2,\dots,S_k \sim \pi $
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T} $
- Recall that the value function is the expected return: $v_{\pi} (s) = E_{\pi} [G_t|S_t =s]$
- MC policy evaluation uses empirical mean return instead of expected return
First-Visit MC Policy Evaluation
- To evaluate state $s$
- At the first time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \longleftarrow N(s) +1 $
- Increment total return $S(s) \longleftarrow S(s) +G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s) $
- By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$
Every-Visit MC Policy Evaluation
- To evaluate state $s$
- At every time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \longleftarrow N(s) +1 $
- Increment total return $S(s) \longleftarrow S(s) +G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s) $
- By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$
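A minimal Python sketch of both the first-visit and every-visit variants above, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state; the function name, arguments, and the `first_visit` flag are illustrative, not from the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma, first_visit=True):
    """Estimate V(s) as the mean return following visits to s."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)

    for episode in episodes:
        # Compute the return G_t at every time-step by scanning backwards:
        # G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()      # restore time order

        seen = set()
        for state, G_t in returns:
            if first_visit and state in seen:
                continue       # first-visit MC: only the first visit to s counts
            seen.add(state)
            N[state] += 1      # N(s) <- N(s) + 1
            S[state] += G_t    # S(s) <- S(s) + G_t

    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```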
Incremental MC
Incremental Mean
The mean $\mu_1,\mu_2,\dots$ of a sequence $x_1,x_2,\dots$ can be computed incrementally,
$$\mu_k = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$
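This follows by separating out the latest sample:
$$ \mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j = \frac{1}{k} \left( x_k + (k-1)\mu_{k-1} \right) = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$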
Incremental MC updates
- Update $V(s)$ incrementally after episode $S_1,A_1,R_2,\dots,S_T$
- For each state $S_t$ with return $G_t$.
$$ N(S_t) \leftarrow N(S_t) + 1 $$
$$ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t)) $$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$
where $\alpha$ is a constant step-size parameter; both update rules are sketched in code below.
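A minimal Python sketch of both incremental updates, assuming `V` and `N` are dictionaries keyed by state; the function name and the optional `alpha` argument are illustrative.

```python
def incremental_mc_update(V, N, state, G_t, alpha=None):
    """Apply one incremental MC update for a single (state, return) pair."""
    v = V.get(state, 0.0)
    if alpha is None:
        # Exact running mean: step size 1/N(S_t)
        N[state] = N.get(state, 0) + 1
        V[state] = v + (G_t - v) / N[state]
    else:
        # Constant step size alpha: a running mean that forgets old episodes
        V[state] = v + alpha * (G_t - v)
```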
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions/rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
TD algorithm
- Goal: learn $v_{\pi}$ online from experience under policy $\pi$
- Incremental every-visit MC
- Update value $V(S_t)$ toward actual return $G_t$
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$
- Simplest temporal-difference learning algorithm: TD(0)
- Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$
$$ V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) -V(S_t)) $$
- $R_{t+1}+\gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) -V(S_t)$ is called the TD error
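A minimal Python sketch of the TD(0) update applied after each observed transition; the function name, default step-size, and `terminal` flag are illustrative assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update; V is a dict of value estimates keyed by state."""
    # TD target: R_{t+1} + gamma * V(S_{t+1}); a terminal successor has value 0
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)        # TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * delta  # V(S_t) <- V(S_t) + alpha * delta_t
    return delta
```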
Bias/Variance Trade-off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_{\pi} (S_t)$
- True TD target $R_{t+1} + \gamma v_{\pi} (S_{t+1})$ is an unbiased estimate of $v_{\pi}(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_{\pi} (S_t)$
- TD target is much lower variance than the return
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
Batch MC and TD
- MC and TD converge: $V(s) \rightarrow v_{\pi} (s)$ as experience $\rightarrow \infty$
- But what about batch solution for finite experience?
- e.g. Repeatedly sample episode $k\in [1,K] $
- Apply MC or TD(0) to episode $k$
Certainty Equivalence
MC converges to solution with minimum mean-squared error
- Best fit to the observed returns
$$ \sum_{k=1}^{K} \sum_{t=1}^{T_k} (G_t^k -V(s_t^k))^2 $$
TD(0) converges to solution of max likelihood Markov model
- Solution to the MDP $(S,A,\hat{P},\hat{R},\gamma)$ that best fits the data
$$ \hat{P}_{s,s'}^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k,s_{t+1}^k = s,a,s') $$
$$ \hat{R}_s^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k = s,a) r_t^k $$
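A minimal Python sketch of the counting step behind this certainty-equivalence view, assuming the batch is available as episodes of (s, a, r, s') transitions; all names are illustrative. Solving the resulting MDP (e.g. by dynamic programming) then gives the value function that batch TD(0) converges to.

```python
from collections import defaultdict

def fit_mdp_model(episodes):
    """Maximum-likelihood estimates P_hat and R_hat from observed transitions."""
    N = defaultdict(int)                           # visit counts N(s, a)
    trans = defaultdict(lambda: defaultdict(int))  # counts of (s, a) -> s'
    rew = defaultdict(float)                       # summed rewards per (s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            N[(s, a)] += 1
            trans[(s, a)][s_next] += 1
            rew[(s, a)] += r

    P_hat = {sa: {s2: c / N[sa] for s2, c in succ.items()}
             for sa, succ in trans.items()}        # empirical transition model
    R_hat = {sa: total / N[sa] for sa, total in rew.items()}  # empirical rewards
    return P_hat, R_hat
```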
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
- MC has high variance, zero bias
- Good convergence properties
- (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_{\pi} (s)$
- (but not always with function approximation)
- More sensitive to initial value
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more efficient in non-Markov environments
Unified View
(The original slides show backup diagrams here: the Monte-Carlo backup, the Temporal-Difference backup, the Dynamic Programming backup, and a unified view of reinforcement learning comparing them.)
Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP/TD bootstraps
- Sampling: update samples an expectation
- MC/TD samples
- DP does not sample
TD($\lambda$)
n-Step Return
- Consider the following n-step returns for $n=1,2,\dots$
- n=1 (TD) $G_t^{(1)}=R_{t+1} + \gamma V(S_{t+1})$
- n=2 $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
- n=$\infty$ (MC), $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T $
- Define the n-step return
$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$
- n-step temporal-difference learning (sketched in code below)
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t)) $$
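A minimal Python sketch of computing $G_t^{(n)}$ from a stored trajectory, assuming 0-based indexing in which `rewards[t]` holds $R_{t+1}$ and `states[t]` holds $S_t$; the names are illustrative.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """n-step return G_t^(n); bootstraps with V(S_{t+n}) if the episode has not ended."""
    T = len(rewards)                    # episode length (number of transitions)
    G = 0.0
    for k in range(min(n, T - t)):      # sum of gamma^k * R_{t+k+1}
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                       # otherwise S_{t+n} is terminal (value 0)
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```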
Averaging n-Step Returns
- We can average n-step returns over different n
- e.g. average the 2-step and 4-step returns $\frac{1}{2} G^{(2)} + \frac{1}{2} G^{(4)}$
- Combines information from two different time-steps
- Can we efficiently combine information from all time-steps?
$\lambda$-return
- The $\lambda$-return $G_t^{\lambda}$ combines all n-step returns $G_t^{(n)}$
- Using weight $(1-\lambda)\lambda^{n-1}$
$$ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$
- Forward-view TD($\lambda$)
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\lambda} -V(S_t)) $$
Forward-view TD($\lambda$)
$$ G_{t}^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$
- Update value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G_t^{\lambda}$
- Like MC, can only be computed from complete episodes
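A minimal Python sketch of the forward view, reusing the hypothetical `n_step_return` helper from the n-step section; for an episode of length $T$, every $G_t^{(n)}$ with $n \ge T-t$ equals the full return, so the remaining weight $\lambda^{T-t-1}$ is attached to it.

```python
def lambda_return(rewards, states, V, t, gamma, lam):
    """Lambda-return G_t^lambda for a complete episode (0-based indexing)."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):           # weights (1 - lambda) * lambda^(n-1)
        G_lam += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, states, V, t, n, gamma)
    # Remaining weight lambda^(T-t-1) goes to the full (Monte-Carlo) return
    G_lam += (lam ** (T - t - 1)) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam

def forward_td_lambda(V, states, rewards, alpha, gamma, lam):
    """Forward-view TD(lambda): update each visited state toward its lambda-return."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, gamma, lam)
        v = V.get(states[t], 0.0)
        V[states[t]] = v + alpha * (G - v)
```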
Backward-view TD($\lambda$)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
- Keep an eligibility trace $E_t(s)$ for every state $s$: $E_0(s) = 0$, $E_t(s) = \gamma\lambda E_{t-1}(s) + 1(S_t = s)$
- Update value $V(s)$ of every state in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$
$$ V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) $$
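A minimal Python sketch of one backward-view TD($\lambda$) step with accumulating traces, assuming `V` and `E` are dictionaries keyed by state and that `E` is reset to zero at the start of each episode; names and arguments are illustrative.

```python
def backward_td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam, terminal=False):
    """One online backward-view TD(lambda) update for the transition (s, r, s_next)."""
    # TD error for the current transition (a terminal successor has value 0)
    delta = r + (0.0 if terminal else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)

    # Decay all traces by gamma * lambda and bump the trace of the current state
    for state in list(E):
        E[state] *= gamma * lam
    E[s] = E.get(s, 0.0) + 1.0

    # Update every state in proportion to its eligibility trace
    for state, e in E.items():
        V[state] = V.get(state, 0.0) + alpha * delta * e
```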
Relationship Between Forward and Backward TD
TD($\lambda$) and TD(0)
- when $\lambda=0$, only current state is updated
$$ E_t (s) = 1(S_t =s) \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
- This is exactly equivalent to the TD(0) update
$$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$
TD($\lambda$) and MC
- When $\lambda=1$, credit is deferred until end of episode
- Consider episodic environments with offline updates
- Over the course of an episode, total update for TD(1) is the same as total update for MC
Theorem
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
$$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha (G_t^{\lambda} - V(S_t) ) 1(S_t =s) $$
Forward and Backward Equivalence
MC and TD(1)
- Consider an episode where $s$ is visited once at time-step $k$.
- TD(1) eligibility trace discounts time since visit,
$$ E_t(s) = \gamma E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ \gamma^{t-k} & \text{if} ~ t\ge k \end{cases}$$
- TD(1) updates accumulate error online
$$ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) $$
- By end of episode it accumulates total error
$$ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} $$
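The last expression is the telescoping sum of TD errors: holding $V$ fixed within the episode and using $V(S_T) = 0$,
$$ \begin{aligned} G_k - V(S_k) &= R_{k+1} + \gamma G_{k+1} - V(S_k) \\ &= \delta_k + \gamma \, (G_{k+1} - V(S_{k+1})) \\ &= \delta_k + \gamma \delta_{k+1} + \gamma^2 (G_{k+2} - V(S_{k+2})) \\ &= \dots = \delta_k + \gamma \delta_{k+1} + \dots + \gamma^{T-1-k} \delta_{T-1} \end{aligned} $$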
TD($\lambda$) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Forward and Backward TD($\lambda$)
- Consider an episode where $s$ is visited once at time-step $k$
- TD($\lambda$) eligibility trace discounts time since visit
$$ E_t(s) = \gamma\lambda E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ (\gamma\lambda)^{t-k} & \text{if} ~ t\ge k \end{cases}$$
- Backward TD($\lambda$) updates accumulate error online
$$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma\lambda)^{t-k} \delta_t = \alpha(G_k^\lambda - V(S_k)) $$
- By end of episode it accumulates total error for $\lambda$-return
- For multiple visits to $s$, $E_t(s)$ accumulates many errors
Offline Equivalence of Forward and Backward TD
Offline updates
- Updates are accumulated within episode
- but applied in batch at the end of episode
Online updates
- TD($\lambda$) updates are applied online at each step within episode
- Forward and backward-view TD($\lambda$) are slightly different
- New: Exact online TD($\lambda$) achieves perfect equivalence
- By using a slightly different form of eligibility trace