RL6 Value Function Approximation

Reference: UCL Course on RL, Lecture 6: Value Function Approximation

Introduction

Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g. Backgammon ($10^{20}$ states), Computer Go ($10^{170}$ states), or continuous state spaces such as helicopter flight

Value Function Approximation
So far we have represented the value function by a lookup table

  • Every state $s$ has an entry $V(s)$
  • Or every state-action pair $s,a$ has an entry $Q(s,a)$
    Problem with large MDPs
  • There are too many states and/or actions to store in memory
  • It is too slow to learn the value of each state individually
    Solution for large MDPs
  • Estimate value functions with function approximation $ \hat{v} (s,w) \approx v_{\pi}(s) $ or $ \hat{q} (s,a,w) \approx q_\pi (s,a) $
  • Generalise from seen states to unseen states
  • Update parameters $w$ using MC or TD learning

Types of value function approximation

We consider differentiable function approximators

  1. Linear combinations of features
  2. Neural network
  3. Decision tree
  4. Nearest neighbour
  5. Fourier/wavelet bases
  6. $\dots$
    Furthermore, we require a training method that is suitable for non-stationary, non-i.i.d. data

Incremental Methods

Stochastic Gradient Descent

  • Goal: find parameter vector $w$ minimising mean-squared error between approximate value fn $\hat{v}(s,w)$ and true value fn $v_{\pi} (s) $
    $$ J(w) = E_\pi [(v_\pi (S) -\hat{v} (S,w) )^2] $$
  • Gradient descent finds a local minimum
    $$ \Delta w = -\frac{1}{2} \alpha \nabla_w J(w)= \alpha E_\pi \left[ (v_\pi(S) -\hat{v}(S,w))\nabla_w \hat{v} (S,w) \right] $$
  • Stochastic gradient descent samples the gradient
    $$ \Delta w = \alpha (v_\pi (S) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) $$
  • Expected update is equal to full gradient update
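
As a rough illustration, a single stochastic gradient step might look like the sketch below; the names `v_hat`, `grad_v`, and the `target` argument are placeholders for whatever differentiable approximator and sampled target are used, not part of the lecture.

```python
import numpy as np

def sgd_step(w, s, target, v_hat, grad_v, alpha=0.01):
    """One stochastic gradient step towards a sampled value target.

    w      : parameter vector (np.ndarray)
    s      : sampled state
    target : sample of v_pi(s), e.g. a return
    v_hat  : callable (s, w) -> approximate value of s
    grad_v : callable (s, w) -> gradient of v_hat w.r.t. w at s
    """
    # Delta w = alpha * (v_pi(S) - v_hat(S, w)) * grad_w v_hat(S, w)
    return w + alpha * (target - v_hat(s, w)) * grad_v(s, w)
```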

Linear Function Approximation

Linear Value Function Approximation

  • Represent value function by a linear combination of features
$$ \hat{v}(S,w) = x(S)^T w = \sum_{j=1}^{n} x_j (S) w_j $$
  • Objective function is quadratic in parameters $w$
    $$ J(w) = E_\pi \left[ (v_\pi(S) - x(S)^T w)^2\right] $$
  • Stochastic gradient descent converges to the global optimum
  • Update rule is particularly simple
    Update = step-size $\times$ prediction error $\times$ feature value
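
A minimal sketch of this update rule, assuming the feature vector `x_s = x(S)` has already been computed:

```python
import numpy as np

def linear_sgd_update(w, x_s, target, alpha=0.01):
    """Linear SGD step: step-size * prediction error * feature value."""
    v_hat = x_s @ w                           # v_hat(S, w) = x(S)^T w
    return w + alpha * (target - v_hat) * x_s
```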

Table Lookup Features

  • Table lookup is a special case of linear value function approximation
  • Using table lookup features
    $$ x^{table} (S) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} $$
  • Parameter vector $w$ gives value of each individual state
    $$ \hat{v} (S,w) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} $$
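
For concreteness, a sketch of table lookup as one-hot features inside a linear approximator; the integer state indices and the value of `n_states` are assumptions for illustration.

```python
import numpy as np

n_states = 5                      # illustrative number of states

def x_table(s):
    """One-hot feature vector: x_j(S) = 1(S = s_j)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def v_hat(s, w):
    """Linear value function: with one-hot features, v_hat(s_j, w) = w_j."""
    return x_table(s) @ w

w = np.zeros(n_states)            # each weight stores one state's value
```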

Incremental Prediction Algorithms

  • Have assumed true value function $v_{\pi} (s)$ given by supervisor
  • But in RL there is no supervisor, only rewards
  • In practice, we substitute a target for $v_{\pi} (s)$
    • For MC, the target is the return $G_t$
      $$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v} (S_{t+1},w) $
      $$ \Delta w = \alpha (R_{t+1} + \gamma \hat{v} (S_{t+1},w) - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$
    • For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
      $$ \Delta w = \alpha (G_t^\lambda - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$

MC with Value Function Approximation

  • Return $G_t$ is an unbiased, noisy sample of true value $v_{\pi} (S_t)$
  • Can therefore apply supervised learning to “training data”:
    $$ (S_1, G_1), (S_2, G_2), \dots, (S_T, G_T) $$
  • For example, using linear MC policy evaluation
$$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_w \hat{v} (S_t,w) = \alpha(G_t - \hat{v} (S_t,w))x(S_t) $$
  • Monte-Carlo evaluation converges to a local optimum
  • Even when using non-linear value function approximation
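
A sketch of linear Monte-Carlo policy evaluation, assuming each episode is stored as a list of `(state, next_reward)` pairs; this data layout is illustrative, not from the lecture.

```python
import numpy as np

def mc_policy_evaluation(episodes, x, n_features, alpha=0.01, gamma=1.0):
    """Linear MC evaluation: Delta w = alpha * (G_t - x(S_t)^T w) * x(S_t)."""
    w = np.zeros(n_features)
    for episode in episodes:              # episode: [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for s, r in reversed(episode):    # accumulate returns backwards
            G = r + gamma * G             # G_t = R_{t+1} + gamma * G_{t+1}
            w += alpha * (G - x(s) @ w) * x(s)
    return w
```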

TD Learning with Value Function Approximation

  • The TD-target $R_{t+1} + \gamma \hat{v} (S_{t+1},w)$ is a biased sample of true value $v_\pi (S_t)$
  • Can still apply supervised learning to “training data”:
$$ (S_1, R_2 + \gamma \hat{v} (S_2,w) ), (S_2, R_3 + \gamma \hat{v} (S_3,w)), \dots, (S_{T-1},R_T) $$
  • For example, using linear TD(0)
$$ \Delta w = \alpha (R+\gamma \hat{v} (S',w) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) = \alpha \delta x(S) $$
  • Linear TD(0) converges (close) to the global optimum
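
A corresponding sketch of linear TD(0), assuming transitions are provided as `(s, r, s_next, done)` tuples; this interface is an illustrative assumption.

```python
import numpy as np

def td0_policy_evaluation(transitions, x, n_features, alpha=0.01, gamma=1.0):
    """Linear TD(0): Delta w = alpha * delta * x(S)."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else x(s_next) @ w
        delta = r + gamma * v_next - x(s) @ w      # TD error
        w += alpha * delta * x(s)
    return w
```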

Incremental Control Algorithms

Control with Value Function Approximation
Policy evaluation: Approximate policy evaluation, $\hat{q}(\cdot,\cdot,w) \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement

Action-Value Function Approximation

  • Approximate the action-value function $\hat{q}(S,A,w) \approx q_{\pi} (S,A)$
  • Minimise mean-squared error between approximate action-value fn $\hat{q}(S,A,w)$ and true action-value fn $q_\pi (S,A)$
    $$ J(w) = E_\pi \left[(q_\pi (S,A) - \hat{q} (S,A,w))^2 \right] $$
  • Use stochastic gradient descent to find a local minimum
    $$ -\frac{1}{2} \nabla_w J(w) = ( q_{\pi} (S,A) -\hat{q} (S,A,w)) \nabla_w \hat{q} (S,A,w) \\ \Delta w = \alpha (q_{\pi} (S,A) -\hat{q} (S,A,w))\nabla_w \hat{q} (S,A,w) $$

Linear Action-Value Function Approximation

  • Represent state and action by a feature vector
$$ x(S,A) = \begin{pmatrix} x_1 (S,A) \\ \vdots \\ x_n (S,A) \end{pmatrix} $$
  • Represent action-value fn by linear combination of features
    $$ \hat{q} (S,A,w) = x(S,A)^T w = \sum_{j=1}^n x_j (S,A)w_j $$
  • Stochastic gradient descent update
    $$ \nabla_w \hat{q} (S,A,w) = x(S,A) \\ \Delta w =\alpha (q_\pi (S,A) - \hat{q} (S,A,w)) x(S,A) $$

Incremental Control Algorithms

  • Like prediction, we must substitute a target for $q_\pi (S,A)$
    • For MC, the target is the return $G_t$
      $$ \Delta w = \alpha (G_t - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w) $
      $$ \Delta w = \alpha (R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For forward-view TD($\lambda$), target is the action-value $\lambda$-return
      $$ \Delta w = \alpha (q_t^\lambda - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For backward-view TD($\lambda$), equivalent update is
$$ \delta_t = R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w) \\ E_t =\gamma \lambda E_{t-1} + \nabla_w \hat{q} (S_t,A_t,w) \\ \Delta w = \alpha \delta_t E_t $$
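
Putting the backward view together, here is a sketch of linear Sarsa($\lambda$) control with an $\epsilon$-greedy policy. The environment interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) is an assumption for illustration.

```python
import numpy as np

def sarsa_lambda_linear(env, x, n_features, n_actions, n_episodes=100,
                        alpha=0.01, gamma=0.99, lam=0.9, epsilon=0.1, seed=0):
    """Backward-view Sarsa(lambda) with linear action-value approximation."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)

    def q(s, a):
        return x(s, a) @ w                         # q_hat(S, A, w) = x(S, A)^T w

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        E = np.zeros(n_features)                   # eligibility trace
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else epsilon_greedy(s_next)
            q_next = 0.0 if done else q(s_next, a_next)
            delta = r + gamma * q_next - q(s, a)   # TD error
            E = gamma * lam * E + x(s, a)          # E_t = gamma*lambda*E_{t-1} + grad
            w += alpha * delta * E                 # Delta w = alpha * delta_t * E_t
            s, a = s_next, a_next
    return w
```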

Convergence

Convergence of Prediction Algorithms
[Table: convergence of MC, TD(0), and TD($\lambda$) prediction, on- and off-policy, for table lookup, linear, and non-linear approximation]
Gradient Temporal-Difference Learning

  • TD does not follow the gradient of any objective function
  • This is why TD can diverge when off-policy or using non-linear function approximation
  • Gradient TD follows the true gradient of the projected Bellman error
    [Table: convergence of Gradient TD prediction]

Convergence of Control Algorithms
[Table: convergence of control algorithms for table lookup, linear, and non-linear approximation]

Batch Methods

Batch Reinforcement Learning

  • Gradient descent is simple and appealing
  • But it is not sample efficient
  • Batch methods seek to find the best fitting value function
  • Given the agent’s experience (“training data”)

Least Squares Prediction

  • Given value function approximation $\hat{v}(s,w) \approx v_\pi (s) $
  • And experience $D$ consisting of (state, value) pairs
    $$ D = ((s_1,v_1^\pi),(s_2,v_2^\pi),\dots,(s_T,v_T^\pi)) $$
  • Which parameters $w$ give the best fitting value fn $\hat{v} (s,w)$?
  • Least squares algorithms find parameter vector $w$ minimising sum-squared error between $\hat{v} (s_t,w)$ and target values $v_t^\pi$
    $$ LS(w) = \sum_{t=1}^{T} (v_t^\pi - \hat{v} (s_t,w))^2 = E_D \left[ (v^\pi - \hat{v} (s,w))^2\right] $$

Stochastic Gradient Descent with Experience Replay
Repeat:

  1. Sample state, value from experience $(s,v^\pi)\sim D$
  2. Apply stochastic gradient descent update $\Delta w = \alpha (v^\pi -\hat{v} (s,w)) \nabla_w \hat{v} (s,w) $
    Converges to least squares solution
    $$ w^\pi = \arg\min_w LS(w) $$
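
A compact sketch of this loop, assuming the experience `D` is a list of `(state, value)` pairs and a linear approximator; the uniform sampling scheme and update count are illustrative.

```python
import numpy as np

def replay_sgd(D, x, n_features, alpha=0.01, n_updates=10_000, seed=0):
    """SGD with experience replay: repeatedly sample (s, v_pi) ~ D and update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(n_updates):
        s, v = D[rng.integers(len(D))]          # 1. sample (s, v_pi) from D
        w += alpha * (v - x(s) @ w) * x(s)      # 2. SGD step towards the target
    return w
```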

Experience Replay in Deep Q-Networks (DQN)
DQN uses experience replay and fixed Q-targets

  • Take action $a_t$ according to $\epsilon$-greedy policy
  • Store transition $(s_t,a_t,r_{t+1},s_{t+1})$ in replay memory $D$
  • Sample random mini-batch of transitions $(s,a,r,s')$ from $D$
  • Compute Q-learning targets w.r.t. old, fixed parameters $w^-$
  • Optimise MSE between Q-network and Q-learning targets
    $$ L_i (w_i) = E_{s,a,r,s' \sim D_i} \left[ \left( r + \gamma \max_{a'} Q(s',a';w_i^-) - Q(s,a;w_i) \right)^2 \right] $$
  • Using a variant of stochastic gradient descent
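
A hedged sketch of just the two key ingredients (replay memory and fixed Q-targets), not of the full DQN architecture; `q_values(s, w)` is a placeholder for whatever network computes the vector of action values.

```python
import random
from collections import deque

import numpy as np

replay_memory = deque(maxlen=100_000)     # stores (s, a, r, s_next, done) tuples

def q_learning_targets(batch, q_values, w_minus, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'; w^-) computed with frozen parameters."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_values(s_next, w_minus)))
        targets.append(r + bootstrap)
    return np.array(targets)

# Usage sketch: after storing transitions in replay_memory,
#   batch = random.sample(replay_memory, 32)
#   y = q_learning_targets(batch, q_values, w_minus)
# then minimise the MSE between Q(s, a; w) and y by (a variant of) SGD.
```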

Linear Least Squares Prediction

  • Experience replay finds least squares solution
  • But it may take many iterations
  • Using linear value function approximation $\hat{v} (s,w) = x(s)^T w $
  • We can solve for the least squares solution directly
    • At minimum of $LS(w)$, the expected update must be zero, $E_D (\Delta w) = 0$
      $$ \alpha \sum_{t=1}^{T} x(s_t) (v_t^\pi -x(s_t)^T w) = 0 \\ w =\left( \sum_{t=1}^T x(s_t)x(s_t)^T \right)^{-1} \sum_{t=1}^T x(s_t) v_t^\pi $$
    • For $n$ features, direct solution time is $O(n^3)$
    • Incremental solution time is $O(n^2)$ using Sherman-Morrison
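
A sketch of the direct solution; the small ridge term keeping the matrix invertible when few samples are available is an assumption, not part of the lecture.

```python
import numpy as np

def linear_least_squares(pairs, x, n_features, reg=1e-6):
    """Solve w = (sum_t x(s_t) x(s_t)^T)^{-1} sum_t x(s_t) v_t directly."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, v in pairs:                     # pairs: (state, target value v_t^pi)
        phi = x(s)
        A += np.outer(phi, phi)
        b += phi * v
    return np.linalg.solve(A, b)
```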

Linear Least Squares Prediction Algorithms

  • We don’t know true values $v_t^\pi$
  • In practice, our “training data” must use noisy or biased samples of $v_t^\pi$
    • LSMC Least Squares MC uses the return $v_t^\pi \approx G_t$
    • LSTD Least Squares TD uses the TD target $v_t^\pi \approx R_{t+1} + \gamma \hat{v}(S_{t+1},w) $
    • LSTD($\lambda$) Least Squares TD($\lambda$) uses the $\lambda$-return $v_t^\pi \approx G_t^\lambda$
  • In each case solve directly for the fixed point of MC/TD/TD($\lambda$)
    [Slide: direct solutions for the LSMC, LSTD, and LSTD($\lambda$) fixed points]
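
For example, a sketch of LSTD, which plugs the TD target into the least squares fixed-point equations; the `(s, r, s_next, done)` transition format and the ridge term are assumptions.

```python
import numpy as np

def lstd(transitions, x, n_features, gamma=0.99, reg=1e-3):
    """LSTD: w = (sum x(s)(x(s) - gamma x(s'))^T)^{-1} sum x(s) r."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        phi = x(s)
        phi_next = np.zeros(n_features) if done else x(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```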

Convergence of Linear Least Squares Prediction Algorithms
[Table: convergence of linear least squares prediction algorithms]

Least Squares Control

Least Squares Policy Iteration
Policy evaluation: policy evaluation by least squares Q-learning
Policy improvement: greedy policy improvement

Least Squares Action-Value Function Approximation

  • Approximate action-value function $q_\pi (s,a)$
  • using linear combination of features $x(s,a)$
    $$ \hat{q} (s,a,w) = x(s,a)^T w \approx q_\pi (s,a) $$
  • Minimise least squares error between $\hat{q} (s,a,w)$ and $q_\pi (s,a)$
  • from experience generated using policy $\pi$
  • consisting of $\langle (state, action), value \rangle$ pairs
    $$ D = \{ \langle (s_1,a_1),v_1^\pi \rangle, \langle (s_2,a_2),v_2^\pi \rangle,\dots,\langle (s_T,a_T),v_T^\pi \rangle \} $$

Least Squares Control

  • For policy evaluation, we want to efficiently use all experience
  • For control, we also want to improve the policy
  • This experience is generated from many policies
  • So to evaluate $q_\pi (S,A)$ we must learn off-policy
  • We use the same idea as Q-learning:
    • Use experience generated by old policy, $ S_t,A_t,R_{t+1},S_{t+1} \sim \pi_{old} $
    • Consider alternative successor action $ A' = \pi_{new} (S_{t+1}) $
    • Update $\hat{q} (S_t,A_t,w) $ towards value of alternative action $R_{t+1} + \gamma \hat{q} (S_{t+1},A',w)$

Least Squares Q-Learning

  • Consider the following linear Q-learning update
    $$ \delta = R_{t+1} + \gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) - \hat{q} (S_t,A_t,w) \\ \Delta w =\alpha \delta x(S_t,A_t) $$
  • LSTDQ algorithm: solve for total update = zero
    $$ 0 = \sum_{t=1}^T \alpha (R_{t+1} +\gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) -\hat{q} (S_t,A_t,w)) x(S_t,A_t) \\ w = \left( \sum_{t=1}^T x(S_t,A_t) (x(S_t,A_t)-\gamma x(S_{t+1},\pi(S_{t+1})))^T \right)^{-1} \sum_{t=1}^T x(S_t,A_t) R_{t+1} $$
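
A sketch of LSTDQ under the assumption that `D` is a list of `(s, a, r, s_next)` transitions and `pi` maps a state to the action chosen by the current policy; the ridge term is also an assumption.

```python
import numpy as np

def lstdq(D, x, pi, n_features, gamma=0.99, reg=1e-3):
    """LSTDQ: solve for the weights whose total Q-learning update is zero."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in D:
        phi = x(s, a)
        phi_next = x(s_next, pi(s_next))       # features of (s', pi(s'))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```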

Least Squares Policy Iteration Algorithm

  • The following pseudocode uses LSTDQ for policy evaluation
  • It repeatedly re-evaluates experience $D$ with a different policy
    [Pseudocode: the LSPI algorithm]
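
A minimal sketch of the outer LSPI loop, reusing the `lstdq` sketch above; the convergence test and iteration cap are assumptions for illustration.

```python
import numpy as np

def lspi(D, x, n_features, n_actions, gamma=0.99, max_iterations=20):
    """LSPI: alternate LSTDQ policy evaluation with greedy policy improvement."""
    w = np.zeros(n_features)
    for _ in range(max_iterations):
        # Greedy policy with respect to the current weights
        def pi(s):
            return int(np.argmax([x(s, a) @ w for a in range(n_actions)]))
        w_new = lstdq(D, x, pi, n_features, gamma)   # re-evaluate all of D
        if np.allclose(w, w_new, atol=1e-6):         # stop when the weights settle
            return w_new
        w = w_new                                    # implicit policy improvement
    return w
```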

Convergence of Control Algorithms
[Table: convergence of control algorithms, including LSPI, for table lookup, linear, and non-linear approximation]
$(\checkmark)$ = chatters around near-optimal value function
