reference:
UCL Course on RL
Lecture 4 Model-Free Prediction
Estimate the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transition/rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
MC Policy Evaluation
- Goal: learn $v_{\pi}$ from episodes of experience under policy $\pi$, $S_1,A_1,R_2,\dots,S_k \sim \pi $
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T} $
- Recall that the value function is the expected return: $v_{\pi} (s) = E_{\pi} [G_t|S_t =s]$
- MC policy evaluation uses empirical mean return instead of expected return
First-Visit MC Policy Evaluation
- To evaluate state $s$
- At the first time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \longleftarrow N(s) +1 $
- Increment total return $S(s) \longleftarrow S(s) +G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s) $
- By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$
Every-Visit MC Policy Evaluation
- To evaluate state $s$
- At every time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \longleftarrow N(s) +1 $
- Increment total return $S(s) \longleftarrow S(s) +G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s) $
- By law of large numbers, $V(s) \rightarrow v_{\pi} (s) $ as $N(s) \rightarrow \infty$
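A minimal Python sketch of both the first-visit and every-visit variants above, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state; the function name, arguments, and the `first_visit` flag are illustrative, not from the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma, first_visit=True):
    """Estimate V(s) as the mean return following visits to s."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)

    for episode in episodes:
        # Compute the return G_t at every time-step by scanning backwards:
        # G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()      # restore time order

        seen = set()
        for state, G_t in returns:
            if first_visit and state in seen:
                continue       # first-visit MC: only the first visit to s counts
            seen.add(state)
            N[state] += 1      # N(s) <- N(s) + 1
            S[state] += G_t    # S(s) <- S(s) + G_t

    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```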
Incremental MC
Incremental Mean
The mean $\mu_1,\mu_2,\dots$ of a sequence $x_1,x_2,\dots$ can be computed incrementally,
$$\mu_k = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$
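This follows by separating out the latest sample:
$$ \mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j = \frac{1}{k} \left( x_k + (k-1)\mu_{k-1} \right) = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) $$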
Incremental MC updates
- Update $V(s)$ incrementally after episode $S_1,A_1,R_2,\dots,S_T$
- For each state $S_t$ with return $G_t$.
$$ N(S_t) \leftarrow N(S_t) + 1 $$
$$ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t)) $$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$
where $\alpha$ is a constant step-size parameter; both update rules are sketched in code below.
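A minimal Python sketch of both incremental updates, assuming `V` and `N` are dictionaries keyed by state; the function name and the optional `alpha` argument are illustrative.

```python
def incremental_mc_update(V, N, state, G_t, alpha=None):
    """Apply one incremental MC update for a single (state, return) pair."""
    v = V.get(state, 0.0)
    if alpha is None:
        # Exact running mean: step size 1/N(S_t)
        N[state] = N.get(state, 0) + 1
        V[state] = v + (G_t - v) / N[state]
    else:
        # Constant step size alpha: a running mean that forgets old episodes
        V[state] = v + alpha * (G_t - v)
```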
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions/rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
TD algorithm
- Goal: learn $v_{\pi}$ online from experience under policy $\pi$
- Incremental every-visit MC
- Update value $V(S_t)$ toward actual return $G_t$
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) $$
- Simplest temporal-difference learning algorithm: TD(0)
- Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$
$$ V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) -V(S_t)) $$
- $R_{t+1}+\gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) -V(S_t)$ is called the TD error
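A minimal Python sketch of the TD(0) update applied after each observed transition; the function name, default step-size, and `terminal` flag are illustrative assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update; V is a dict of value estimates keyed by state."""
    # TD target: R_{t+1} + gamma * V(S_{t+1}); a terminal successor has value 0
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)        # TD error delta_t
    V[s] = V.get(s, 0.0) + alpha * delta  # V(S_t) <- V(S_t) + alpha * delta_t
    return delta
```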
Bias/Variance Trade-off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_{\pi} (S_t)$
- True TD target $R_{t+1} + \gamma v_{\pi} (S_{t+1})$ is an unbiased estimate of $v_{\pi}(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_{\pi} (S_t)$
- TD target is much lower variance than the return
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
Batch MC and TD
- MC and TD converge: $V(s) \rightarrow v_{\pi} (s)$ as experience $\rightarrow \infty$
- But what about batch solution for finite experience?
- e.g. Repeatedly sample episode $k\in [1,K] $
- Apply MC or TD(0) to episode $k$
Certainty Equivalence
MC converges to solution with minimum mean-squared error
- Best fit to the observed returns
$$ \sum_{k=1}^{K} \sum_{t=1}^{T_k} (G_t^k -V(s_t^k))^2 $$
TD(0) converges to solution of max likelihood Markov model
- Solution to the MDP $(S,A,\hat{P},\hat{R},\gamma)$ that best fits the data
$$ \hat{P}_{s,s'}^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k,s_{t+1}^k = s,a,s') $$
$$ \hat{R}_s^a = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k} 1(s_t^k,a_t^k = s,a) r_t^k $$
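A minimal Python sketch of the counting step behind this certainty-equivalence view, assuming the batch is available as episodes of (s, a, r, s') transitions; all names are illustrative. Solving the resulting MDP (e.g. by dynamic programming) then gives the value function that batch TD(0) converges to.

```python
from collections import defaultdict

def fit_mdp_model(episodes):
    """Maximum-likelihood estimates P_hat and R_hat from observed transitions."""
    N = defaultdict(int)                           # visit counts N(s, a)
    trans = defaultdict(lambda: defaultdict(int))  # counts of (s, a) -> s'
    rew = defaultdict(float)                       # summed rewards per (s, a)

    for episode in episodes:
        for s, a, r, s_next in episode:
            N[(s, a)] += 1
            trans[(s, a)][s_next] += 1
            rew[(s, a)] += r

    P_hat = {sa: {s2: c / N[sa] for s2, c in succ.items()}
             for sa, succ in trans.items()}        # empirical transition model
    R_hat = {sa: total / N[sa] for sa, total in rew.items()}  # empirical rewards
    return P_hat, R_hat
```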
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
- MC has high variance, zero bias
- Good convergence properties
- (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_{\pi} (s)$
- (but not always with function approximation)
- More sensitive to initial value
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more efficient in non-Markov environments
Unified View
(The original slides show backup diagrams here: the Monte-Carlo backup, the Temporal-Difference backup, the Dynamic Programming backup, and a unified view of reinforcement learning comparing them.)
Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP/TD bootstraps
- Sampling: update samples an expectation
- MC/TD samples
- DP does not sample
TD($\lambda$)
n-Step Return
- Consider the following n-step returns for $n=1,2,\dots$
- n=1 (TD) $G_t^{(1)}=R_{t+1} + \gamma V(S_{t+1})$
- n=2 $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
- n=$\infty$ (MC), $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T $
- Define the n-step return
$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$$
- n-step temporal-difference learning (sketched in code below)
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t)) $$
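A minimal Python sketch of computing $G_t^{(n)}$ from a stored trajectory, assuming 0-based indexing in which `rewards[t]` holds $R_{t+1}$ and `states[t]` holds $S_t$; the names are illustrative.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """n-step return G_t^(n); bootstraps with V(S_{t+n}) if the episode has not ended."""
    T = len(rewards)                    # episode length (number of transitions)
    G = 0.0
    for k in range(min(n, T - t)):      # sum of gamma^k * R_{t+k+1}
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                       # otherwise S_{t+n} is terminal (value 0)
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```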
Averaging n-Step Returns
- We can average n-step returns over different n
- e.g. average the 2-step and 4-step returns $\frac{1}{2} G^{(2)} + \frac{1}{2} G^{(4)}$
- Combines information from two different time-steps
- Can we efficiently combine information from all time-steps?
$\lambda$-return
- The $\lambda$-return $G_t^{\lambda}$ combines all n-step returns $G_t^{(n)}$
- Using weight $(1-\lambda)\lambda^{n-1}$
$$ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$
- Forward-view TD($\lambda$)
$$ V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\lambda} -V(S_t)) $$
Forward-view TD($\lambda$)
$$ G_{t}^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} $$
- Update value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G_t^{\lambda}$
- Like MC, can only be computed from complete episodes
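A minimal Python sketch of the forward view, reusing the hypothetical `n_step_return` helper from the n-step section; for an episode of length $T$, every $G_t^{(n)}$ with $n \ge T-t$ equals the full return, so the remaining weight $\lambda^{T-t-1}$ is attached to it.

```python
def lambda_return(rewards, states, V, t, gamma, lam):
    """Lambda-return G_t^lambda for a complete episode (0-based indexing)."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):           # weights (1 - lambda) * lambda^(n-1)
        G_lam += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, states, V, t, n, gamma)
    # Remaining weight lambda^(T-t-1) goes to the full (Monte-Carlo) return
    G_lam += (lam ** (T - t - 1)) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam

def forward_td_lambda(V, states, rewards, alpha, gamma, lam):
    """Forward-view TD(lambda): update each visited state toward its lambda-return."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, gamma, lam)
        v = V.get(states[t], 0.0)
        V[states[t]] = v + alpha * (G - v)
```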
Backward-view TD($\lambda$)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
- Keep an eligibility trace $E_t(s)$ for every state $s$: $E_0(s) = 0$, $E_t(s) = \gamma\lambda E_{t-1}(s) + 1(S_t = s)$
- Update value $V(s)$ of every state in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$
$$ V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) $$
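A minimal Python sketch of one backward-view TD($\lambda$) step with accumulating traces, assuming `V` and `E` are dictionaries keyed by state and that `E` is reset to zero at the start of each episode; names and arguments are illustrative.

```python
def backward_td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam, terminal=False):
    """One online backward-view TD(lambda) update for the transition (s, r, s_next)."""
    # TD error for the current transition (a terminal successor has value 0)
    delta = r + (0.0 if terminal else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)

    # Decay all traces by gamma * lambda and bump the trace of the current state
    for state in list(E):
        E[state] *= gamma * lam
    E[s] = E.get(s, 0.0) + 1.0

    # Update every state in proportion to its eligibility trace
    for state, e in E.items():
        V[state] = V.get(state, 0.0) + alpha * delta * e
```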
Relationship Between Forward and Backward TD
TD($\lambda$) and TD(0)
- when $\lambda=0$, only current state is updated
$$ E_t (s) = 1(S_t =s) \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
- This is exactly equivalent to the TD(0) update
$$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$
TD($\lambda$) and MC
- When $\lambda=1$, credit is deferred until end of episode
- Consider episodic environments with offline updates
- Over the course of an episode, total update for TD(1) is the same as total update for MC
Theorem
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
$$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha (G_t^{\lambda} - V(S_t) ) 1(S_t =s) $$
Forward and Backward Equivalence
MC and TD(1)
- Consider an episode where $s$ is visited once at time-step $k$.
- TD(1) eligibility trace discounts time since visit,
$$ E_t(s) = \gamma E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ \gamma^{t-k} & \text{if} ~ t\ge k \end{cases}$$
- TD(1) updates accumulate error online
$$ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) $$
- By end of episode it accumulates total error
$$ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} $$
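The last expression is the telescoping sum of TD errors: holding $V$ fixed within the episode and using $V(S_T) = 0$,
$$ \begin{aligned} G_k - V(S_k) &= R_{k+1} + \gamma G_{k+1} - V(S_k) \\ &= \delta_k + \gamma \, (G_{k+1} - V(S_{k+1})) \\ &= \delta_k + \gamma \delta_{k+1} + \gamma^2 (G_{k+2} - V(S_{k+2})) \\ &= \dots = \delta_k + \gamma \delta_{k+1} + \dots + \gamma^{T-1-k} \delta_{T-1} \end{aligned} $$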
TD($\lambda$) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Forward and Backward TD($\lambda$)
- Consider an episode where $s$ is visited once at time-step $k$
- TD($\lambda$) eligibility trace discounts time since visit
$$ E_t(s) = \gamma\lambda E_{t-1} (s) + 1(S_t = s) = \begin{cases} 0 & \text{if}~ t<k \\ (\gamma\lambda)^{t-k} & \text{if} ~ t\ge k \end{cases}$$
- Backward TD($\lambda$) updates accumulate error online
$$ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma\lambda)^{t-k} \delta_t = \alpha(G_k^\lambda - V(S_k)) $$
- By end of episode it accumulates total error for $\lambda$-return
- For multiple visits to $s$, $E_t(s)$ accumulates many errors
Offline Equivalence of Forward and Backward TD
Offline updates
- Updates are accumulated within episode
- but applied in batch at the end of episode
Online updates
- TD($\lambda$) updates are applied online at each step within episode
- Forward and backward-view TD($\lambda$) are slightly different
- New: Exact online TD($\lambda$) achieves perfect equivalence
- By using a slightly different form of eligibility trace