RL6 Value Function Approximation

Reference: UCL Course on RL, Lecture 6: Value Function Approximation

Introduction

Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g. Backgammon ($10^{20}$ states), Computer Go ($10^{170}$ states), or continuous state spaces such as helicopter flight

Value Function Approximation
So far we have represented the value function by a lookup table

  • Every state $s$ has an entry $V(s)$
  • Or every state-action pair $s,a$ has an entry $Q(s,a)$
    Problem with large MDPs
  • There are too many states and/or actions to store in memory
  • It is too slow to learn the value of each state individually
    Solution for large MDPs
  • Estimate value functions with function approximation $ \hat{v} (s,w) \approx v_{\pi}(s) $ or $ \hat{q} (s,a,w) \approx q_\pi (s,a) $
  • Generalise from seen states to unseen states
  • Update parameters $w$ using MC or TD learning

Types of value function approximation

We consider differentiable function approximators

  1. Linear combinations of features
  2. Neural network
  3. Decision tree
  4. Nearest neighbour
  5. Fourier/wavelet bases
  6. $\dots$
    Furthermore, we require a training method that is suitable for non-stationary, non-i.i.d. data

Incremental Methods

Stochastic Gradient Descent

  • Goal: find parameter vector $w$ minimising mean-squared error between approximate value fn $\hat{v}(s,w)$ and true value fn $v_{\pi} (s) $
    $$ J(w) = E_\pi [(v_\pi (S) -\hat{v} (S,w) )^2] $$
  • Gradient descent finds a local minimum
    $$ \Delta w = -\frac{1}{2} \alpha \nabla_w J(w)= \alpha E_\pi \left[ (v_\pi(S) -\hat{v}(S,w))\nabla_w \hat{v} (S,w) \right] $$
  • Stochastic gradient descent samples the gradient
    $$ \Delta w = \alpha (v_\pi (S) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) $$
  • Expected update is equal to full gradient update
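
As a rough illustration, a single stochastic gradient step might look like the sketch below; the names `v_hat`, `grad_v`, and the `target` argument are placeholders for whatever differentiable approximator and sampled target are used, not part of the lecture.

```python
import numpy as np

def sgd_step(w, s, target, v_hat, grad_v, alpha=0.01):
    """One stochastic gradient step towards a sampled value target.

    w      : parameter vector (np.ndarray)
    s      : sampled state
    target : sample of v_pi(s), e.g. a return
    v_hat  : callable (s, w) -> approximate value of s
    grad_v : callable (s, w) -> gradient of v_hat w.r.t. w at s
    """
    # Delta w = alpha * (v_pi(S) - v_hat(S, w)) * grad_w v_hat(S, w)
    return w + alpha * (target - v_hat(s, w)) * grad_v(s, w)
```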

Linear Function Approximation

Linear Value Function Approximation

  • Represent value function by a linear combination of features
$$ \hat{v}(S,w) = x(S)^T w = \sum_{j=1}^{n} x_j (S) w_j $$
  • Objective function is quadratic in parameters $w$
    $$ J(w) = E_\pi \left[ (v_\pi(S) - x(S)^T w)^2\right] $$
  • Stochastic gradient descent converges to the global optimum
  • Update rule is particularly simple
    Update = step-size $\times$ prediction error $\times$ feature value
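
A minimal sketch of this update rule, assuming the feature vector `x_s = x(S)` has already been computed:

```python
import numpy as np

def linear_sgd_update(w, x_s, target, alpha=0.01):
    """Linear SGD step: step-size * prediction error * feature value."""
    v_hat = x_s @ w                           # v_hat(S, w) = x(S)^T w
    return w + alpha * (target - v_hat) * x_s
```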

Table Lookup Features

  • Table lookup is a special case of linear value function approximation
  • Using table lookup features
    $$ x^{table} (S) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} $$
  • Parameter vector $w$ gives value of each individual state
    $$ \hat{v} (S,w) = \begin{bmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} $$
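
For concreteness, a sketch of table lookup as one-hot features inside a linear approximator; the integer state indices and the value of `n_states` are assumptions for illustration.

```python
import numpy as np

n_states = 5                      # illustrative number of states

def x_table(s):
    """One-hot feature vector: x_j(S) = 1(S = s_j)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def v_hat(s, w):
    """Linear value function: with one-hot features, v_hat(s_j, w) = w_j."""
    return x_table(s) @ w

w = np.zeros(n_states)            # each weight stores one state's value
```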

Incremental Prediction Algorithms

  • Have assumed true value function $v_{\pi} (s)$ given by supervisor
  • But in RL there is no supervisor, only rewards
  • In practice, we substitute a target for $v_{\pi} (s)$
    • For MC, the target is the return $G_t$
      $$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v} (S_{t+1},w) $
      $$ \Delta w = \alpha (R_{t+1} + \gamma \hat{v} (S_{t+1},w) - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$
    • For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
      $$ \Delta w = \alpha (G_t^\lambda - \hat{v} (S_t,w)) \nabla_{w} \hat{v} (S_t,w) $$

MC with Value Function Approximation

  • Return $G_t$ is an unbiased, noisy sample of true value $v_{\pi} (S_t)$
  • Can therefore apply supervised learning to “training data”:
    $$ (S_1, G_1), (S_2, G_2), \dots, (S_T, G_T) $$
  • For example, using linear MC policy evaluation
$$ \Delta w = \alpha (G_t - \hat{v} (S_t,w)) \nabla_w \hat{v} (S_t,w) = \alpha(G_t - \hat{v} (S_t,w))x(S_t) $$
  • Monte-Carlo evaluation converges to a local optimum
  • Even when using non-linear value function approximation
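
A sketch of linear Monte-Carlo policy evaluation, assuming each episode is stored as a list of `(state, next_reward)` pairs; this data layout is illustrative, not from the lecture.

```python
import numpy as np

def mc_policy_evaluation(episodes, x, n_features, alpha=0.01, gamma=1.0):
    """Linear MC evaluation: Delta w = alpha * (G_t - x(S_t)^T w) * x(S_t)."""
    w = np.zeros(n_features)
    for episode in episodes:              # episode: [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for s, r in reversed(episode):    # accumulate returns backwards
            G = r + gamma * G             # G_t = R_{t+1} + gamma * G_{t+1}
            w += alpha * (G - x(s) @ w) * x(s)
    return w
```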

TD Learning with Value Function Approximation

  • The TD-target $R_{t+1} + \gamma \hat{v} (S_{t+1},w)$ is a biased sample of true value $v_\pi (S_t)$
  • Can still apply supervised learning to “training data”:
$$ (S_1, R_2 + \gamma \hat{v} (S_2,w) ), (S_2, R_3 + \gamma \hat{v} (S_3,w)), \dots, (S_{T-1},R_T) $$
  • For example, using linear TD(0)
$$ \Delta w = \alpha (R+\gamma \hat{v} (S',w) - \hat{v} (S,w)) \nabla_w \hat{v} (S,w) = \alpha \delta x(S) $$
  • Linear TD(0) converges (close) to the global optimum
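
A corresponding sketch of linear TD(0), assuming transitions are provided as `(s, r, s_next, done)` tuples; this interface is an illustrative assumption.

```python
import numpy as np

def td0_policy_evaluation(transitions, x, n_features, alpha=0.01, gamma=1.0):
    """Linear TD(0): Delta w = alpha * delta * x(S)."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else x(s_next) @ w
        delta = r + gamma * v_next - x(s) @ w      # TD error
        w += alpha * delta * x(s)
    return w
```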

Incremental Control Algorithms

Control with Value Function Approximation
Policy evaluation: Approximate policy evaluation, $\hat{q}(\cdot,\cdot,w) \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement

Action-Value Function Approximation

  • Approximate the action-value function $\hat{q}(S,A,w) \approx q_{\pi} (S,A)$
  • Minimise mean-squared error between approximate action-value fn $\hat{q}(S,A,w)$ and true action-value fn $q_\pi (S,A)$
    $$ J(w) = E_\pi \left[(q_\pi (S,A) - \hat{q} (S,A,w))^2 \right] $$
  • Use stochastic gradient descent to find a local minimum
    $$ -\frac{1}{2} \nabla_w J(w) = ( q_{\pi} (S,A) -\hat{q} (S,A,w)) \nabla_w \hat{q} (S,A,w) \\ \Delta w = \alpha (q_{\pi} (S,A) -\hat{q} (S,A,w))\nabla_w \hat{q} (S,A,w) $$

Linear Action-Value Function Approximation

  • Represent state and action by a feature vector
$$ x(S,A) = \begin{pmatrix} x_1 (S,A) \\ \vdots \\ x_n (S,A) \end{pmatrix} $$
  • Represent action-value fn by linear combination of features
    $$ \hat{q} (S,A,w) = x(S,A)^T w = \sum_{j=1}^n x_j (S,A)w_j $$
  • Stochastic gradient descent update
    $$ \nabla_w \hat{q} (S,A,w) = x(S,A) \\ \Delta w =\alpha (q_\pi (S,A) - \hat{q} (S,A,w)) x(S,A) $$

Incremental Control Algorithms

  • Like prediction, we must substitute a target for $q_\pi (S,A)$
    • For MC, the target is the return $G_t$
      $$ \Delta w = \alpha (G_t - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},w) $
      $$ \Delta w = \alpha (R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For forward-view TD($\lambda$), target is the action-value $\lambda$-return
      $$ \Delta w = \alpha (q_t^\lambda - \hat{q} (S_t,A_t,w)) \nabla_w \hat{q} (S_t,A_t,w) $$
    • For backward-view TD($\lambda$), equivalent update is
$$ \delta_t = R_{t+1}+\gamma\hat{q} (S_{t+1},A_{t+1},w) - \hat{q} (S_t,A_t,w) \\ E_t =\gamma \lambda E_{t-1} + \nabla_w \hat{q} (S_t,A_t,w) \\ \Delta w = \alpha \delta_t E_t $$
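
Putting the backward view together, here is a sketch of linear Sarsa($\lambda$) control with an $\epsilon$-greedy policy. The environment interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) is an assumption for illustration.

```python
import numpy as np

def sarsa_lambda_linear(env, x, n_features, n_actions, n_episodes=100,
                        alpha=0.01, gamma=0.99, lam=0.9, epsilon=0.1, seed=0):
    """Backward-view Sarsa(lambda) with linear action-value approximation."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)

    def q(s, a):
        return x(s, a) @ w                         # q_hat(S, A, w) = x(S, A)^T w

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        E = np.zeros(n_features)                   # eligibility trace
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else epsilon_greedy(s_next)
            q_next = 0.0 if done else q(s_next, a_next)
            delta = r + gamma * q_next - q(s, a)   # TD error
            E = gamma * lam * E + x(s, a)          # E_t = gamma*lambda*E_{t-1} + grad
            w += alpha * delta * E                 # Delta w = alpha * delta_t * E_t
            s, a = s_next, a_next
    return w
```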

Convergence

Convergence of Prediction Algorithms
[Table: convergence of MC, TD(0), and TD($\lambda$) prediction, on- and off-policy, for table lookup, linear, and non-linear approximation]
Gradient Temporal-Difference Learning

  • TD does not follow the gradient of any objective function
  • This is why TD can diverge when off-policy or using non-linear function approximation
  • Gradient TD follows the true gradient of the projected Bellman error
    [Table: convergence of Gradient TD prediction]

Convergence of Control Algorithms
[Table: convergence of control algorithms for table lookup, linear, and non-linear approximation]

Batch Methods

Batch Reinforcement Learning

  • Gradient descent is simple and appealing
  • But it is not sample efficient
  • Batch methods seek to find the best fitting value function
  • Given the agent’s experience (“training data”)

Least Squares Prediction

  • Given value function approximation $\hat{v}(s,w) \approx v_\pi (s) $
  • And experience $D$ consisting of (state, value) pairs
    $$ D = ((s_1,v_1^\pi),(s_2,v_2^\pi),\dots,(s_T,v_T^\pi)) $$
  • Which parameters $w$ give the best fitting value fn $\hat{v} (s,w)$?
  • Least squares algorithms find parameter vector $w$ minimising sum-squared error between $\hat{v} (s_t,w)$ and target values $v_t^\pi$
    $$ LS(w) = \sum_{t=1}^{T} (v_t^\pi - \hat{v} (s_t,w))^2 = E_D \left[ (v^\pi - \hat{v} (s,w))^2\right] $$

Stochastic Gradient Descent with Experience Replay
Repeat:

  1. Sample state, value from experience $(s,v^\pi)\sim D$
  2. Apply stochastic gradient descent update $\Delta w = \alpha (v^\pi -\hat{v} (s,w)) \nabla_w \hat{v} (s,w) $
    Converges to least squares solution
    $$ w^\pi = \arg\min_w LS(w) $$
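
A compact sketch of this loop, assuming the experience `D` is a list of `(state, value)` pairs and a linear approximator; the uniform sampling scheme and update count are illustrative.

```python
import numpy as np

def replay_sgd(D, x, n_features, alpha=0.01, n_updates=10_000, seed=0):
    """SGD with experience replay: repeatedly sample (s, v_pi) ~ D and update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    for _ in range(n_updates):
        s, v = D[rng.integers(len(D))]          # 1. sample (s, v_pi) from D
        w += alpha * (v - x(s) @ w) * x(s)      # 2. SGD step towards the target
    return w
```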

Experience Replay in Deep Q-Networks (DQN)
DQN uses experience replay and fixed Q-targets

  • Take action $a_t$ according to $\epsilon$-greedy policy
  • Store transition $(s_t,a_t,r_{t+1},s_{t+1})$ in replay memory $D$
  • Sample random mini-batch of transitions $(s,a,r,s')$ from $D$
  • Compute Q-learning targets w.r.t. old, fixed parameters $w^-$
  • Optimise MSE between Q-network and Q-learning targets
    $$ L_i (w_i) = E_{s,a,r,s' \sim D_i} \left[ \left( r + \gamma \max_{a'} Q(s',a';w_i^-) - Q(s,a;w_i) \right)^2 \right] $$
  • Using a variant of stochastic gradient descent
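
A hedged sketch of just the two key ingredients (replay memory and fixed Q-targets), not of the full DQN architecture; `q_values(s, w)` is a placeholder for whatever network computes the vector of action values.

```python
import random
from collections import deque

import numpy as np

replay_memory = deque(maxlen=100_000)     # stores (s, a, r, s_next, done) tuples

def q_learning_targets(batch, q_values, w_minus, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'; w^-) computed with frozen parameters."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_values(s_next, w_minus)))
        targets.append(r + bootstrap)
    return np.array(targets)

# Usage sketch: after storing transitions in replay_memory,
#   batch = random.sample(replay_memory, 32)
#   y = q_learning_targets(batch, q_values, w_minus)
# then minimise the MSE between Q(s, a; w) and y by (a variant of) SGD.
```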

Linear Least Squares Prediction

  • Experience replay finds least squares solution
  • But it may take many iterations
  • Using linear value function approximation $\hat{v} (s,w) = x(s)^T w $
  • We can solve for the least squares solution directly
    • At minimum of $LS(w)$, the expected update must be zero, $E_D (\Delta w) = 0$
      $$ \alpha \sum_{t=1}^{T} x(s_t) (v_t^\pi -x(s_t)^T w) = 0 \\ w =\left( \sum_{t=1}^T x(s_t)x(s_t)^T \right)^{-1} \sum_{t=1}^T x(s_t) v_t^\pi $$
    • For $n$ features, direct solution time is $O(n^3)$
    • Incremental solution time is $O(n^2)$ using Sherman-Morrison
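
A sketch of the direct solution; the small ridge term keeping the matrix invertible when few samples are available is an assumption, not part of the lecture.

```python
import numpy as np

def linear_least_squares(pairs, x, n_features, reg=1e-6):
    """Solve w = (sum_t x(s_t) x(s_t)^T)^{-1} sum_t x(s_t) v_t directly."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, v in pairs:                     # pairs: (state, target value v_t^pi)
        phi = x(s)
        A += np.outer(phi, phi)
        b += phi * v
    return np.linalg.solve(A, b)
```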

Linear Least Squares Prediction Algorithms

  • We don’t know true values $v_t^\pi$
  • In practice, our “training data” must use noisy or biased samples of $v_t^\pi$
    • LSMC Least Squares MC uses the return $v_t^\pi \approx G_t$
    • LSTD Least Squares TD uses the TD target $v_t^\pi \approx R_{t+1} + \gamma \hat{v}(S_{t+1},w) $
    • LSTD($\lambda$) Least Squares TD($\lambda$) uses the $\lambda$-return $v_t^\pi \approx G_t^\lambda$
  • In each case solve directly for the fixed point of MC/TD/TD($\lambda$)
    [Slide: direct solutions for the LSMC, LSTD, and LSTD($\lambda$) fixed points]
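
For example, a sketch of LSTD, which plugs the TD target into the least squares fixed-point equations; the `(s, r, s_next, done)` transition format and the ridge term are assumptions.

```python
import numpy as np

def lstd(transitions, x, n_features, gamma=0.99, reg=1e-3):
    """LSTD: w = (sum x(s)(x(s) - gamma x(s'))^T)^{-1} sum x(s) r."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        phi = x(s)
        phi_next = np.zeros(n_features) if done else x(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```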

Convergence of Linear Least Squares Prediction Algorithms
[Table: convergence of linear least squares prediction algorithms]

Least Squares Control

Least Squares Policy Iteration
Policy evaluation: policy evaluation by least squares Q-learning
Policy improvement: greedy policy improvement

Least Squares Action-Value Function Approximation

  • Approximate action-value function $q_\pi (s,a)$
  • using linear combination of features $x(s,a)$
    $$ \hat{q} (s,a,w) = x(s,a)^T w \approx q_\pi (s,a) $$
  • Minimise least squares error between $\hat{q} (s,a,w)$ and $q_\pi (s,a)$
  • from experience generated using policy $\pi$
  • consisting of $\langle (state, action), value \rangle$ pairs
    $$ D = \{ \langle (s_1,a_1),v_1^\pi \rangle, \langle (s_2,a_2),v_2^\pi \rangle,\dots,\langle (s_T,a_T),v_T^\pi \rangle \} $$

Least Squares Control

  • For policy evaluation, we want to efficiently use all experience
  • For control, we also want to improve the policy
  • This experience is generated from many policies
  • So to evaluate $q_\pi (S,A)$ we must learn off-policy
  • We use the same idea as Q-learning:
    • Use experience generated by old policy, $ S_t,A_t,R_{t+1},S_{t+1} \sim \pi_{old} $
    • Consider alternative successor action $ A' = \pi_{new} (S_{t+1}) $
    • Update $\hat{q} (S_t,A_t,w) $ towards value of alternative action $R_{t+1} + \gamma \hat{q} (S_{t+1},A',w)$

Least Squares Q-Learning

  • Consider the following linear Q-learning update
    $$ \delta = R_{t+1} + \gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) - \hat{q} (S_t,A_t,w) \\ \Delta w =\alpha \delta x(S_t,A_t) $$
  • LSTDQ algorithm: solve for total update = zero
    $$ 0 = \sum_{t=1}^T \alpha (R_{t+1} +\gamma \hat{q} (S_{t+1},\pi(S_{t+1}),w) -\hat{q} (S_t,A_t,w)) x(S_t,A_t) \\ w = \left( \sum_{t=1}^T x(S_t,A_t) (x(S_t,A_t)-\gamma x(S_{t+1},\pi(S_{t+1})))^T \right)^{-1} \sum_{t=1}^T x(S_t,A_t) R_{t+1} $$
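
A sketch of LSTDQ under the assumption that `D` is a list of `(s, a, r, s_next)` transitions and `pi` maps a state to the action chosen by the current policy; the ridge term is also an assumption.

```python
import numpy as np

def lstdq(D, x, pi, n_features, gamma=0.99, reg=1e-3):
    """LSTDQ: solve for the weights whose total Q-learning update is zero."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in D:
        phi = x(s, a)
        phi_next = x(s_next, pi(s_next))       # features of (s', pi(s'))
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```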

Least Squares Policy Iteration Algorithm

  • The following pseudocode uses LSTDQ for policy evaluation
  • It repeatedly re-evaluates experience $D$ with a different policy
    [Pseudocode: the LSPI algorithm]
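
A minimal sketch of the outer LSPI loop, reusing the `lstdq` sketch above; the convergence test and iteration cap are assumptions for illustration.

```python
import numpy as np

def lspi(D, x, n_features, n_actions, gamma=0.99, max_iterations=20):
    """LSPI: alternate LSTDQ policy evaluation with greedy policy improvement."""
    w = np.zeros(n_features)
    for _ in range(max_iterations):
        # Greedy policy with respect to the current weights
        def pi(s):
            return int(np.argmax([x(s, a) @ w for a in range(n_actions)]))
        w_new = lstdq(D, x, pi, n_features, gamma)   # re-evaluate all of D
        if np.allclose(w, w_new, atol=1e-6):         # stop when the weights settle
            return w_new
        w = w_new                                    # implicit policy improvement
    return w
```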

Convergence of Control Algorithms
[Table: convergence of control algorithms, including LSPI, for table lookup, linear, and non-linear approximation]
$(\checkmark)$ = chatters around near-optimal value function
