reference:
UCL Course on RL
lecture 6 Value Function Approximation
Introduction
Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g. Backgammon ($10^{20}$ states) or Computer Go ($10^{170}$ states)
Value Function Approximation
So far we have represented the value function by a lookup table
- Every state $s$ has an entry $V(s)$
- Or every state-action pair $s,a$ has an entry $Q(s,a)$
Problem with large MDPs - There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
Solution for large MDPs - Estimate value functions with function approximation $ \hat{v} (s,w) \approx v_{\pi}(s) $ or $ \hat{q} (s,a,w) \approx q_\pi (s,a) $
- Generalise from seen states to unseen states
- Update parameters w using MC or TD learning
Types of value function approximation
There are many function approximators, e.g.
- Linear combinations of features
- Neural network
- Decision tree
- Nearest neighbour
- Fourier/wavelet bases
- $\dots$
We consider differentiable function approximators, in particular linear combinations of features and neural networks
Furthermore, we require a training method that is suitable for non-stationary, non-iid data
Incremental Methods
Stochastic Gradient Descent
- Goal: find parameter vector $w$ minimising mean-squared error between approximate value fn $\hat{v}(s,w)$ and true value fn $v_{\pi} (s) $
$$ J(w) = E_\pi [(v_\pi (S) -\hat{v} (S,w) )^2] $$ - Gradient descent finds a local minimum
$$ \Delta w = -\frac{1}{2} \alpha \nabla_w J(w)= \alpha E_\pi \left[ (v_\pi(S) -\hat{v}(S,w))\nabla_w \hat{v} (S,w) \right] $$ - Stochastic gradient descent samples the gradient
$$ \Delta w = \alpha (v_\pi (S) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) $$ - Expected update is equal to full gradient update
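A minimal sketch of the stochastic gradient update above, assuming we are handed callables `v_hat(s, w)` and `grad_v_hat(s, w)` for some differentiable approximator (the names and interface are illustrative, not from the lecture):

```python
import numpy as np

def sgd_step(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One stochastic gradient step towards a (possibly noisy) value target.

    w          : parameter vector (np.ndarray)
    s          : sampled state
    v_target   : sample of v_pi(s) (e.g. a return or TD target)
    v_hat      : callable v_hat(s, w) -> float, the approximate value
    grad_v_hat : callable grad_v_hat(s, w) -> np.ndarray, gradient w.r.t. w
    """
    error = v_target - v_hat(s, w)            # prediction error
    return w + alpha * error * grad_v_hat(s, w)
```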
Linear Function Approximation
Linear Value Function Approximation
- Represent value function by a linear combination of features
$$ \hat{v}(S,w) = x(S)^T w = \sum_{j=1}^{n} x_j (S) w_j $$ - Objective function is quadratic in parameters $w$
$$ J(w) = E_\pi \left[ (v_\pi(S) - x(S)^T w)^2\right] $$ - Stochastic gradient descent converges on global optimum
- Update rule is particularly simple
Update = step-size $\times$ prediction error $\times$ feature value
Table Lookup Features
- Table lookup is special case of linear value function approximation
- Using table lookup features
$$ x^{table} (S) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} $$ - Parameter vector $w$ gives value of each individual state
$$ \hat{v} (S,w) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} $$
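A quick sketch of the point above, assuming states are indexed $0,\dots,n-1$ (names are illustrative): with one-hot table-lookup features, the linear update touches exactly one table entry.

```python
import numpy as np

n_states = 5
w = np.zeros(n_states)                       # one weight per state = a value table

def x_table(s):
    """One-hot table-lookup feature vector 1(S = s_j)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def v_hat(s, w):
    return x_table(s) @ w                    # equals w[s]

# Update = step-size * prediction error * feature value
alpha, s, target = 0.1, 2, 1.0
w += alpha * (target - v_hat(s, w)) * x_table(s)   # only w[2] changes
```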
Incremental Prediction Algorithms
- Have assumed true value function $v_{\pi} (s)$ given by supervisor
- But in RL there is no supervisor, only rewards
- In practice, we substitute a target for $v_{\pi} (s)$
- For MC, the target is the return $G_t$
$$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$ - For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v} (S_{t+1},w) $
$$ \Delta w = \alpha (R_{t+1} + \gamma \hat{v} (S_{t+1},w) - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$ - For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
$$ \Delta w = \alpha (G_t^\lambda - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$
MC with Value Function Approximation
- Return $G_t$ is an unbiased, noisy sample of true value $v_\pi (S_t)$
- Can therefore apply supervised learning to “training data”:
$$ (S_1, G_1), (S_2, G_2), \dots, (S_T, G_T) $$ - For example, using linear MC policy evaluation
$$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_w \hat{v} (S_t,w) = \alpha(G_t - \hat{v} (S_t,w))x(S_t) $$ - Monte-Carlo evaluation converges to a local optimum
- Even when using non-linear value function approximation
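A minimal sketch of linear Monte-Carlo policy evaluation with the update above, assuming each episode is stored as a list of `(feature_vector, reward)` pairs collected under $\pi$ (the data layout is an assumption for illustration):

```python
import numpy as np

def linear_mc_evaluate(episodes, n_features, alpha=0.01, gamma=1.0):
    """Every-visit Monte-Carlo policy evaluation with linear features.

    episodes : list of episodes; each episode is a list of (x, r) pairs,
               where x = x(S_t) is the feature vector and r = R_{t+1}.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        # Work backwards so the return G_t can be accumulated incrementally.
        for x, r in reversed(episode):
            G = r + gamma * G                     # return G_t
            w += alpha * (G - x @ w) * x          # Delta w = alpha (G_t - v_hat) x(S_t)
    return w
```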
TD Learning with Value Function Approximation
- The TD-target $R_{t+1} + \gamma \hat{v} (S_{t+1},w)$ is a biased sample of true value $v_\pi (S_t)$
- Can still apply supervised learning to “training data”:
$$ (S_1, R_2 + \gamma \hat{v} (S_2,w) ), (S_2, R_3 + \gamma \hat{v} (S_3,w)), \dots, (S_{T-1},R_T) $$ - For example, using linear TD(0)
$$ \Delta w = \alpha (R+\gamma \hat{v} (S',w) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) = \alpha \delta x(S) $$ - Linear TD(0) converges (close) to global optimum
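A minimal sketch of linear TD(0) prediction, assuming `transitions` is an iterable of `(x, r, x_next, done)` tuples with feature vectors already computed (a hypothetical data layout used only for illustration):

```python
import numpy as np

def linear_td0_evaluate(transitions, n_features, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) policy evaluation with linear features.

    transitions : iterable of (x, r, x_next, done), where x = x(S_t),
                  r = R_{t+1}, x_next = x(S_{t+1}), done marks terminal states.
    """
    w = np.zeros(n_features)
    for x, r, x_next, done in transitions:
        v_next = 0.0 if done else x_next @ w
        delta = r + gamma * v_next - x @ w        # TD error
        w += alpha * delta * x                    # Delta w = alpha * delta * x(S)
    return w
```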
Incremental Control Algorithms
Control with Value Function Approximation
Policy evaluation: Approximate policy evaluation, $\hat{q}(\cdot,\cdot,w) \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement
Action-Value Function Approximation
- Approximate the action-value function $\hat{q}(S,A,w) \approx q_{\pi} (S,A)$
- Minimise mean-squared error between approximate action-value fn $\hat{q}(S,A,w)$ and true action-value fn $q_\pi (S,A)$
$$ J(w) = E_\pi \left[(q_\pi (S,A) - \hat{q} (S,A,w))^2 \right] $$ - Use stochastic gradient descent to find a local minimum
$$ -\frac{1}{2} \nabla_w J(w) = ( q_{\pi} (S,A) -\hat{q} (S,A,w)) \nabla_w \hat{q} (S,A,w) \\ \Delta w = \alpha (q_{\pi} (S,A) -\hat{q} (S,A,w))\nabla_w \hat{q} (S,A,w) $$
Linear Action-Value Function Approximation
- Represent state and action by a feature vector
$$ x(S,A) = \begin{pmatrix} x_1 (S,A) \\ \vdots \\ x_n (S,A) \end{pmatrix} $$ - Represent action-value fn by linear combination of features
$$ \hat{q} (S,A,w) = x(S,A)^T w = \sum_{j=1}^n x_j (S,A)w_j $$ - Stochastic gradient descent update
$$ \nabla_w \hat{q} (S,A,w) = x(S,A) \\ \Delta w =\alpha (q_\pi (S,A) - \hat{q} (S,A,w)) x(S,A) $$
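A small sketch tying this to $\epsilon$-greedy policy improvement: a linear $\hat{q}(S,A,w)$ over an assumed feature map `x(s, a)` and a finite `actions` set (both are illustrative assumptions, not part of the lecture's notation):

```python
import numpy as np

def q_hat(x, s, a, w):
    """Linear action-value estimate q_hat(S, A, w) = x(S, A)^T w."""
    return x(s, a) @ w

def epsilon_greedy(x, s, actions, w, epsilon=0.1, rng=None):
    """epsilon-greedy policy improvement w.r.t. the current q_hat."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return rng.choice(actions)                 # explore
    q_values = [q_hat(x, s, a, w) for a in actions]
    return actions[int(np.argmax(q_values))]       # exploit
```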
Incremental Control Algorithms
- Like prediction, we must substitute a target for $q_\pi (S,A)$
- For MC, the target is the return $G_t$
$$ \Delta w = \alpha (G_t - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$ - For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w) $
$$ \Delta w = \alpha (R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$ - For forward-view TD($\lambda$), target is the action-value $\lambda$-return
$$ \Delta w = \alpha (q_t^\lambda - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$ - For backward-view TD($\lambda$), equivalent update is
$$ \delta_t = R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w) \\ E_t =\gamma \lambda E_{t-1} + \nabla_w \hat{q} (S_t,A_t,w) \\ \Delta w = \alpha \delta_t E_t $$
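A minimal sketch of this backward-view update with linear features, i.e. a single SARSA($\lambda$) step with an eligibility trace over the weights; the feature/transition interface is an assumption for illustration:

```python
import numpy as np

def sarsa_lambda_step(w, e, x_sa, r, x_next_sa, done,
                      alpha=0.01, gamma=0.99, lam=0.9):
    """One backward-view SARSA(lambda) update with linear q_hat.

    w         : weight vector
    e         : eligibility trace vector (same shape as w)
    x_sa      : feature vector x(S_t, A_t)
    r         : reward R_{t+1}
    x_next_sa : feature vector x(S_{t+1}, A_{t+1}), ignored if done
    """
    q = x_sa @ w
    q_next = 0.0 if done else x_next_sa @ w
    delta = r + gamma * q_next - q            # TD error delta_t
    e = gamma * lam * e + x_sa                # E_t = gamma*lambda*E_{t-1} + grad q_hat
    w = w + alpha * delta * e                 # Delta w = alpha * delta_t * E_t
    return w, e
```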
Convergence
Convergence of Prediction Algorithms
Gradient Temporal-Difference Learning
- TD does not follow the gradient of any objective function
- This is why TD can diverge when off-policy or using non-linear function approximation
- Gradient TD follows true gradient of projected Bellman error
Convergence of Control Algorithms
Batch Methods
Batch Reinforcement Learning
- Gradient descent is simple and appealing
- But it is not sample efficient
- Batch methods seek to find the best fitting value function
- Given the agent’s experience (“training data”)
Least Squares Prediction
- Given value function approximation $\hat{v}(s,w) \approx v_\pi (s) $
- And experience D consisting of (state,value) pairs
$$ D = ((s_1,v_1^\pi),(s_2,v_2^\pi),\dots,(s_T,v_T^\pi)) $$ - Which parameters $w$ give the best fitting value fn $\hat{v} (s,w)$?
- Least squares algorithms find parameters vector $w$ minimising sum-squared error between $\hat{v} (s_t,w)$ and target values $v_t^\pi$
$$ LS(w) = \sum_{t=1}^{T} (v_t^\pi - \hat{v} (s_t,w))^2 = E_D \left[ (v^\pi - \hat{v} (s,w))^2\right] $$
Stochastic Gradient Descent with Experience Replay
Repeat:
- Sample state, value from experience $(s,v^\pi)\sim D$
- Apply stochastic gradient descent update $\Delta w = \alpha (v^\pi -\hat{v} (s,w)) \nabla_w \hat{v} (s,w) $
Converges to least squares solution
$$ w^\pi = \arg\min_w LS(w) $$
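A minimal sketch of SGD with experience replay for prediction, assuming the dataset `D` is a list of `(feature_vector, value_target)` pairs with precomputed linear features (an illustrative layout, not the lecture's notation):

```python
import numpy as np

def replay_sgd(D, n_features, alpha=0.01, n_iters=10_000,
               rng=np.random.default_rng(0)):
    """Repeatedly sample (state, value) pairs from experience and apply SGD.

    D : list of (x, v_target) pairs, where x = x(s) and v_target ~ v_pi(s).
    """
    w = np.zeros(n_features)
    for _ in range(n_iters):
        x, v_target = D[rng.integers(len(D))]     # sample (s, v^pi) ~ D
        w += alpha * (v_target - x @ w) * x       # SGD step towards the target
    return w
```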
Experience Replay in Deep Q-Networks (DQN)
DQN uses experience replay and fixed Q-targets
- Take action $a_t$ according to $\epsilon$-greedy policy
- Store transition $(s_t,a_t,r_{t+1},s_{t+1})$ in replay memory $D$
- Sample random mini-batch of transitions $(s,a,r,s')$ from $D$
- Compute Q-learning targets w.r.t. old, fixed parameters $w^-$
- Optimise MSE between Q-network and Q-learning targets
$$ L_i (w_i) = E_{s,a,r,s' \sim D_i} \left[ \left( r + \gamma \max_{a'} Q(s',a';w_i^-) - Q(s,a;w_i) \right)^2 \right] $$ - Using variant of stochastic gradient descent
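A schematic sketch of the fixed-target loss above, assuming a generic `q_net(states, params)` callable that returns a `(batch, n_actions)` array of Q-values; the names and replay-buffer layout are assumptions for illustration, not DQN's actual code:

```python
import numpy as np

def dqn_targets_and_loss(batch, q_net, w, w_minus, gamma=0.99):
    """Compute Q-learning targets w.r.t. old parameters w^- and the MSE loss.

    batch : dict of arrays 's', 'a', 'r', 's_next', 'done' (done in {0, 1})
    q_net : callable q_net(states, params) -> array of shape (batch, n_actions)
    """
    q_next = q_net(batch['s_next'], w_minus)               # frozen target network
    targets = batch['r'] + gamma * (1.0 - batch['done']) * q_next.max(axis=1)
    q_pred = q_net(batch['s'], w)[np.arange(len(batch['a'])), batch['a']]
    loss = np.mean((targets - q_pred) ** 2)                # MSE vs. Q-learning targets
    return targets, loss
```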
Linear Least Squares Prediction
- Experience replay finds least squares solution
- But it may take many iterations
- Using linear value function approximation $\hat{v} (s,w) = x(s)^T w $
- We can solve the least squares solution directly
- At minimum of $LS(w)$, the expected update must be zero, $E_D (\Delta w) = 0$
$$ \alpha \sum_{t=1}^{T} x(s_t) (v_t^\pi -x(s_t)^T w) = 0 \\ w =\left( \sum_{t=1}^T x(s_t)x(s_t)^T \right)^{-1} \sum_{t=1}^T x(s_t) v_t^\pi $$ - For $N$ features, direct solution time is $O(N^3)$
- Incremental solution time is $O(N^2)$ using Sherman-Morrison
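A direct solve of the normal equations above with NumPy, assuming the features have been stacked into a matrix `X` (rows $x(s_t)^T$) and the targets into a vector `v` (a hypothetical layout for illustration):

```python
import numpy as np

def linear_ls_solve(X, v):
    """Direct least-squares solution w = (sum x x^T)^{-1} sum x v.

    X : (T, N) matrix whose rows are feature vectors x(s_t)^T
    v : (T,)   vector of targets v_t^pi (in practice G_t, TD targets, ...)
    """
    A = X.T @ X                      # sum_t x(s_t) x(s_t)^T, shape (N, N)
    b = X.T @ v                      # sum_t x(s_t) v_t^pi,   shape (N,)
    return np.linalg.solve(A, b)     # O(N^3); prefer solve() over an explicit inverse
```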
Linear Least Squares Prediction Algorithms
- We don’t know true values $v_t^\pi$
- In practice, our “training data” must use noisy or biased samples of $v_t^\pi$
- LSMC Least Squares MC uses return $v_t^\pi \approx G_t$
- LSTD Least Squares TD uses TD target $v_t^\pi \approx R_{t+1} + \gamma \hat{v}(S_{t+1},w) $
- LSTD($\lambda$) Least Squares TD($\lambda$) uses $\lambda$-return $v_t^\pi \approx G_t^\lambda$
- In each case solve directly for fixed point of MC/TD/TD($\lambda$)
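A sketch of LSTD(0), which solves directly for the TD fixed point; it assumes transitions with precomputed features `(x, r, x_next)` collected under $\pi$ and adds a small ridge term for invertibility (both are illustrative choices):

```python
import numpy as np

def lstd0(transitions, n_features, gamma=0.99, reg=1e-6):
    """LSTD(0): solve A w = b, where A = sum x (x - gamma x')^T and b = sum x r."""
    A = reg * np.eye(n_features)     # small regulariser keeps A invertible
    b = np.zeros(n_features)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += x * r
    return np.linalg.solve(A, b)
```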
Convergence of Linear Least Squares Prediction Algorithms
Least Squares Control
Least Squares Policy Iteration
Policy evaluation: Policy evaluation by least squares Q-learning
Policy improvement: Greedy policy improvement
Least Squares Action-Value Function Approximation
- Approximate action-value function $q_\pi (s,a)$
- using linear combination of features $x(s,a)$
$$ \hat{q} (s,a,w) = x(s,a)^T w \approx q_\pi (s,a) $$ - Minimise least squares error between $\hat{q} (s,a,w)$ and $q_\pi (s,a)$
- from experience generated using policy $\pi$
- consisting of $<(state,action),value>$ pairs
$$ D = \{ < (s_1,a_1),v_1^\pi >, <(s_2,a_2),v_2^\pi>,\dots,<(s_T,a_T),v_T^\pi> \} $$
Least Squares Control
- For policy evaluation, we want to efficiently use all experience
- For control, we also want to improve the policy
- This experience is generated from many policies
- So to evaluate $q_\pi (S,A)$ we must learn off-policy
- We use the same idea as Q-learning:
- Use experience generated by old policy, $ S_t,A_t,R_{t+1},S_{t+1} \sim \pi_{old} $
- Consider alternative successor action $ A' = \pi_{new} (S_{t+1}) $
- Update $\hat{q} (S_t,A_t,w) $ towards value of alternative action $R_{t+1} + \gamma \hat{q} (S_{t+1},A',w)$
Least Squares Q-Learning
- Consider the following linear Q-learning update
$$ \delta = R_{t+1} + \gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) - \hat{q} (S_t,A_t,w) \\ \Delta w =\alpha \delta x(S_t,A_t) $$ - LSTDQ algorithm: solve for total update = zero
$$ 0 = \sum_{t=1}^T \alpha (R_{t+1} +\gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) -\hat{q} (S_t,A_t,w)) x(S_t,A_t) \\ w = \left( \sum_{t=1}^T x(S_t,A_t) (x(S_t,A_t)-\gamma x(S_{t+1},\pi(S_{t+1})))^T \right)^{-1} \sum_{t=1}^T x(S_t,A_t) R_{t+1} $$
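A sketch of the LSTDQ solve above, assuming transitions `(s, a, r, s_next)`, a feature map `x(s, a)`, and a policy `pi(s)` that supplies the successor action (all names are illustrative assumptions):

```python
import numpy as np

def lstdq(transitions, x, pi, n_features, gamma=0.99, reg=1e-6):
    """LSTDQ: solve for w such that the total Q-learning update is zero.

    transitions : iterable of (s, a, r, s_next)
    x           : feature map x(s, a) -> np.ndarray of length n_features
    pi          : policy pi(s) -> action used for the successor state
    """
    A = reg * np.eye(n_features)     # small regulariser keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next in transitions:
        phi = x(s, a)
        phi_next = x(s_next, pi(s_next))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```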
Least Squares Policy Iteration Algorithm
- The LSPI algorithm uses LSTDQ for policy evaluation
- It repeatedly re-evaluates the experience $D$ with a different policy
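A minimal LSPI sketch using the `lstdq` function from the previous block, with greedy improvement over an assumed finite `actions` set (illustrative only, not the lecture's pseudocode):

```python
import numpy as np

def lspi(transitions, x, actions, n_features, n_iters=20, gamma=0.99):
    """Least Squares Policy Iteration: alternate LSTDQ evaluation and greedy improvement."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        # Greedy policy improvement w.r.t. the current weights
        pi = lambda s, w=w: max(actions, key=lambda a: x(s, a) @ w)
        # Re-evaluate the same experience D under the improved policy (requires lstdq above)
        w = lstdq(transitions, x, pi, n_features, gamma=gamma)
    return w
```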
Convergence of Control Algorithms
$(\checkmark)$ = chatters around near-optimal value function