RL7 Policy Gradient

Reference: UCL Course on RL, Lecture 7: Policy Gradient

Introduction

Policy-Based Reinforcement Learning

  • In the last lecture we approximated the value or action-value function using parameters $\theta$
  • A policy was generated directly from the value function
  • In this lecture, we will directly parametrise the policy
    $$ \pi_{\theta} (s,a) = P(a|s,\theta) $$
  • We will focus again on model-free reinforcement learning

Value-Based and Policy-Based RL

  • Value Based
    • Learnt Value Function
    • Implicit policy (e.g. $\epsilon$-greedy)
  • Policy Based
    • No Value Function
    • Learnt Policy
  • Actor-Critic
    • Learnt Value Function
    • Learnt Policy

Advantages of Policy-Based RL

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than a global optimum
  • Evaluating a policy is typically inefficient and high variance

Policy Objective Functions

  • Goal: given policy $\pi_{\theta} (s,a) $ with parameters $\theta$, find best $\theta$.
  • But how do we measure the quality of a policy $\pi_\theta$?
  • In episodic environments we can use the start value.
    $$ J_1 (\theta) = V^{\pi_\theta} (s_1) = E_{\pi_\theta} (v_1) $$
  • In continuing environments we can use the average value
    $$ J_{avV} (\theta) = \sum_{s} d^{\pi_\theta} (s) V^{\pi_\theta} (s) $$
  • Or the average reward per time-step
    $$ J_{avR} (\theta) = \sum_{s} d^{\pi_\theta} (s) \sum_{a} \pi_\theta (s,a) R_s^a $$
  • where $d^{\pi_\theta} (s)$ is the stationary distribution of the Markov chain for $\pi_\theta$

Policy Optimisation

  • Policy based reinforcement learning is an optimisation problem
  • Find $\theta$ that maximises $J(\theta)$
  • Some approaches do not use gradient
    • Hill climbing
    • Simplex/amoeba/Nelder Mead
    • Genetic algorithms
  • Greater efficiency often possible using gradient
    • Gradient descent
    • Conjugate gradient
    • Quasi-Newton
  • We focus on gradient descent, many extensions possible
  • And on methods that exploit sequential structure

Finite Difference Policy Gradient

Policy Gradient

  • Let $J(\theta)$ be any policy objective function
  • Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy, w.r.t parameters $\theta$
    $$ \Delta \theta = \alpha \nabla_\theta J(\theta) $$
  • where $\nabla_\theta J(\theta)$ is the policy gradient, and $\alpha$ is a step-size parameter

Computing Gradients By Finite Differences

  • To evaluate policy gradient of $\pi_\theta (s,a)$
  • For each dimension $k \in [1,n]$
    • Estimate $k$th partial derivative of objective function w.r.t $\theta$
    • By perturbing $\theta$ by small amount $\epsilon$ in $k$th dimension
  • Uses $n$ evaluations to compute policy gradient in $n$ dimensions
  • Simple, noisy, inefficient - but sometimes effective
  • Works for arbitrary policies, even if policy is not differentiable
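
A minimal NumPy sketch of this estimator, under the assumption that a hypothetical routine `evaluate_policy(theta)` returns a (noisy) estimate of $J(\theta)$, e.g. the mean return of a few rollouts:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate grad J(theta) by perturbing each parameter in turn.

    `evaluate_policy(theta)` is a hypothetical routine returning an
    estimate of the objective J(theta), e.g. average return of rollouts.
    """
    j_theta = evaluate_policy(theta)     # baseline evaluation J(theta)
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)       # unit vector in the k-th dimension
        u_k[k] = 1.0
        # k-th partial derivative ~ (J(theta + eps*u_k) - J(theta)) / eps
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_theta) / eps
    return grad
```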

Monte-Carlo Policy Gradient

Likelihood Ratios

Score Function

  • We now compute the policy gradient analytically
  • Assume policy $\pi_\theta$ is differentiable whenever it is non-zero
  • and we know the gradient $\nabla_\theta \pi_\theta (s,a) $
  • Likelihood ratios exploit the following identity
    $$ \nabla_\theta \pi_\theta (s,a) = \pi_\theta (s,a) \frac{\nabla_\theta \pi_\theta (s,a)}{\pi_\theta (s,a)} = \pi_{\theta} (s,a) \nabla_\theta \log \pi_\theta (s,a) $$
  • The score function is $\nabla_\theta \log \pi_{\theta} (s,a)$

Softmax Policy

  • We will use a softmax policy as a running example
  • Weight actions using linear combination of features $\phi(s,a)^T \theta$
  • Probability of action is proportional to exponential weight
    $$ \pi_\theta (s,a) \propto e^{\phi(s,a)^T \theta} $$
  • The score function is
    $$ \nabla_\theta \log \pi_\theta (s,a) = \phi(s,a) - E_{\pi_\theta} (\phi(s,\cdot)) $$
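
A minimal NumPy sketch of the softmax policy and its score function, assuming a per-state feature matrix `phi` with one row per action (the variable names are illustrative, not from the lecture):

```python
import numpy as np

def softmax_policy(theta, phi):
    """Action probabilities for a softmax policy.

    phi: (num_actions, num_features) feature matrix phi(s, .) for one state.
    Probabilities are proportional to exp(phi(s,a)^T theta).
    """
    logits = phi @ theta
    logits -= logits.max()                # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def softmax_score(theta, phi, a):
    """Score function: grad_theta log pi(s,a) = phi(s,a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi)
    return phi[a] - probs @ phi
```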

Gaussian Policy

  • In continuous action spaces, a Gaussian policy is natural
  • Mean is a linear combination of state features $\mu(s) = \phi(s)^T \theta$
  • Variance may be fixed $\sigma^2$, or can also be parametrised
  • Policy is Gaussian, $a \sim N(\mu(s),\sigma^2)$
  • The score function is
    $$ \nabla_\theta \log \pi_{\theta} (s,a) = \frac{(a-\mu(s))\phi(s)}{\sigma^2} $$
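
For comparison with the softmax sketch above, a short sketch of the Gaussian score, assuming a fixed variance and a hypothetical state feature vector `phi_s`:

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """Score function of a Gaussian policy a ~ N(mu(s), sigma^2)
    with mean mu(s) = phi(s)^T theta and fixed variance sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2
```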

Policy Gradient Theorem

One-Step MDPs

  • Consider a simple class of one-step MDPs
    • Starting in state $s\sim d(s)$
    • Terminating after one time-step with reward $r=R_{s,a}$
  • Use likelihood ratios to compute the policy gradient
    $$ J(\theta) = E_{\pi_\theta} (r) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta (s,a) R_{s,a} \\ \nabla_\theta J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a) R_{s,a} = E_{\pi_\theta} (\nabla_\theta \log \pi_\theta (s,a) r) $$

Policy Gradient Theorem

  • The policy gradient theorem generalises the likelihood ratio approach to multi-step MDPs
  • Replaces instantaneous reward $r$ with long-term value $Q^\pi (s,a)$
  • The policy gradient theorem applies to the start state objective, average reward and average value objectives

    Theorem
    For any differentiable policy $\pi_\theta(s,a)$, for any of the policy objective functions $J=J_1,J_{avR}$ or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
    $$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q^{\pi_\theta} (s,a)\right] $$

Monte-Carlo Policy Gradient (REINFORCE)

  • Update parameters by stochastic gradient ascent
  • Using policy gradient theorem
  • Using return $v_t$ as an unbiased sample of $Q^{\pi_\theta} (s_t,a_t)$
    $$ \Delta \theta_t = \alpha \nabla_\theta \log \pi_\theta (s_t,a_t)v_t $$
    (figure: REINFORCE algorithm pseudocode)
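
A minimal Python sketch of the episodic REINFORCE update in that figure, assuming the softmax helpers above, a hypothetical feature map `features(s)` returning the per-action feature matrix, and an environment whose `step(a)` returns `(next_state, reward, done)`:

```python
import numpy as np

def reinforce(env, theta, num_episodes=1000, alpha=0.01, gamma=1.0):
    """Monte-Carlo policy gradient (REINFORCE) sketch with a softmax actor."""
    for _ in range(num_episodes):
        # generate one episode following pi_theta
        trajectory = []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, features(s))
            a = np.random.choice(len(probs), p=probs)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # one update per visited step, using the return v_t from time t
        v_t = 0.0
        for s, a, r in reversed(trajectory):
            v_t = r + gamma * v_t
            theta = theta + alpha * softmax_score(theta, features(s), a) * v_t
    return theta
```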

Actor-Critic Policy Gradient

Introduction to AC

Reducing Variance Using a Critic

  • Monte-Carlo policy gradient still has high variance
  • We use a critic to estimate the action-value function, $ Q_w (s,a) \approx Q^{\pi_\theta} (s,a)$
  • Actor-critic algorithms maintain two sets of parameters
    • Critic: updates action-value function parameters $w$
    • Actor: updates policy parameters $\theta$, in direction suggested by critic
  • Actor-critic algorithms follow an approximate policy gradient
    $$ \nabla_\theta J(\theta) \approx E_{\pi_\theta} (\nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) ) \\ \Delta \theta = \alpha \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) $$

Estimating the Action-Value Function

  • The critic is solving a familiar problem: policy evaluation
  • How good is policy $\pi_\theta$ for the current parameters $\theta$?
  • This problem was explored in the previous two lectures, e.g.
    • Monte-Carlo policy evaluation
    • Temporal-Difference learning
    • TD($\lambda$)
  • Could also use e.g. least-squares policy evaluation

Action-value Actor-Critic

  • Simple actor-critic algorithm based on action-value critic
  • Using linear value function approximation $Q_w (s,a) = \phi(s,a)^T w $
    • Critic: updates $w$ by linear TD(0)
    • Actor: updates $\theta$ by policy gradient
    (figure: QAC algorithm pseudocode)
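
A minimal sketch of one QAC episode, assuming the hypothetical `features(s)` matrix and the softmax helpers above; the critic is a linear $Q_w(s,a) = \phi(s,a)^T w$ updated by TD(0):

```python
import numpy as np

def qac_episode(env, theta, w, num_actions, alpha=0.01, beta=0.1, gamma=1.0):
    """One episode of the action-value actor-critic (QAC) sketch."""
    s, done = env.reset(), False
    a = np.random.choice(num_actions, p=softmax_policy(theta, features(s)))
    while not done:
        s_next, r, done = env.step(a)
        phi_sa = features(s)[a]
        q_sa = phi_sa @ w
        if done:
            td_target = r
        else:
            a_next = np.random.choice(num_actions,
                                      p=softmax_policy(theta, features(s_next)))
            td_target = r + gamma * features(s_next)[a_next] @ w
        delta = td_target - q_sa                                   # TD(0) error
        theta = theta + alpha * softmax_score(theta, features(s), a) * q_sa  # actor
        w = w + beta * delta * phi_sa                                        # critic
        if not done:
            s, a = s_next, a_next
    return theta, w
```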

Compatible Function Approximation

Bias in Actor-Critic Algorithms

  • Approximating the policy gradient introduces bias
  • A biased policy gradient may not find the right solution
    • e.g. if $Q_w(s,a)$ uses aliased features, can we solve the gridworld example?
  • Luckily, if we choose value function approximation carefully
  • Then we can avoid introducing any bias
  • i.e. we can still follow the exact policy gradient

Compatible Function Approximation

Theorem(Compatible Function Approximation Theorem)
If the following two conditions are satisfied

  1. Value function approximator is compatible with the policy
    $$ \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta (s,a) $$
  2. Value function parameters $w$ minimise the mean-squared error
    $$ \epsilon = E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) -Q_w(s,a))^2\right] $$
    Then the policy gradient is exact
    $$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) \right] $$

Proof of Compatible Function Approximation Theorem
If $w$ is chosen to minimise the mean-squared error, the gradient of $\epsilon$ w.r.t $w$ must be zero,
$$ \nabla_w \epsilon = 0 \\ E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) - Q_w(s,a)) \nabla_w Q_w(s,a) \right] = 0 \\ E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) - Q_w(s,a)) \nabla_\theta \log \pi_\theta(s,a) \right] = 0 \\ E_{\pi_\theta} \left[ Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta(s,a) \right] = E_{\pi_\theta} \left[ Q_w(s,a) \nabla_\theta \log \pi_\theta(s,a) \right] $$
So $Q_w (s,a)$ can be substituted directly into the policy gradient
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) \right] $$

Advantage Function Critic

Reducing Variance Using a Baseline

  • We subtract a baseline function $B(s)$ from the policy gradient
  • This can reduce variance, without changing expectation
    $$ E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) B(s) \right] = \sum_{s \in S} d^{\pi_\theta} (s) \sum_{a \in A} \nabla_\theta \pi_\theta (s,a) B(s) \\ = \sum_{s \in S} d^{\pi_\theta} (s) B(s) \nabla_\theta \sum_{a \in A} \pi_\theta (s,a) = 0 $$
  • A good baseline is the state value function $B(s) = V^{\pi_\theta} (s)$
  • So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta} (s,a)$
    $$ A^{\pi_\theta} (s,a) = Q^{\pi_\theta} (s,a) - V^{\pi_\theta} (s) \\ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) A^{\pi_\theta} (s,a) \right] $$

Estimating the Advantage Function

  • The advantage function can significantly reduce variance of policy gradient
  • So the critic should really estimate the advantage function
  • For example, by estimating both $V^{\pi_\theta} (s)$ and $Q^{\pi_\theta} (s,a)$
  • Using two function approximators and two parameter vectors
    $$ V_v (s) \approx V^{\pi_\theta} (s) \\ Q_w (s,a) \approx Q^{\pi_\theta} (s,a) \\ A(s,a) = Q_w(s,a) - V_v (s) $$
  • And updating both value functions by e.g. TD learning
  • For the true value function $V^{\pi_\theta} (s)$, the TD error $\delta^{\pi_\theta}$
    $$ \delta^{\pi_\theta} = r + \gamma V^{\pi_\theta} (s') - V^{\pi_\theta} (s)$$
  • is an unbiased estimate of the advantage function
    $$ E_{\pi_\theta} \left[ \delta^{\pi_\theta} | s,a\right]= E_{\pi_\theta} \left[r+\gamma V^{\pi_\theta} (s') | s,a \right] - V^{\pi_\theta} (s) \\ = Q^{\pi_\theta} (s,a) -V^{\pi_\theta} (s) = A^{\pi_\theta} (s,a) $$
  • So we can use the TD error to compute the policy gradient
    $$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \delta^{\pi_\theta} \right] $$
  • In practice we can use an approximate TD error
    $$ \delta_v = r + \gamma V_v (s') - V_v (s) $$
  • This approach only requires one set of critic parameters $v$
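
A minimal per-transition sketch of this TD-error actor-critic, assuming a hypothetical state feature map `state_features(s)`, a linear critic $V_v(s) = \phi(s)^T v$, and the softmax actor helpers above:

```python
import numpy as np

def tdac_step(theta, v, s, a, r, s_next, done, alpha=0.01, beta=0.1, gamma=1.0):
    """One transition of an advantage actor-critic that uses the approximate
    TD error delta_v = r + gamma V_v(s') - V_v(s) as its advantage estimate."""
    v_s = state_features(s) @ v
    v_next = 0.0 if done else state_features(s_next) @ v
    delta = r + gamma * v_next - v_s                                      # approx TD error
    theta = theta + alpha * softmax_score(theta, features(s), a) * delta  # actor update
    v = v + beta * delta * state_features(s)                              # critic update
    return theta, v
```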

Eligibility Traces

Critics at Different Time-Scales

  • Critic can estimate value function $V_\theta(s)$ from many targets at different time-scales
    • For MC, the target is the return $v_t$
      $$ \Delta \theta = \alpha (v_t - V_\theta(s)) \phi(s) $$
    • For TD(0), the target is the TD target $r + \gamma V(s')$
      $$ \Delta \theta = \alpha (r + \gamma V(s') - V_\theta(s)) \phi(s) $$
    • For forward-view TD($\lambda$), the target is the $\lambda$-return $v_t^\lambda$
      $$ \Delta \theta =\alpha (v_t^\lambda - V_\theta (s)) \phi(s) $$
    • For backward-view TD($\lambda$), we use eligibility traces
      $$ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\ e_t = \gamma \lambda e_{t-1} + \phi(s_t) \\ \Delta \theta = \alpha \delta_t e_t $$
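
A minimal sketch of the backward-view TD($\lambda$) critic update above, assuming the hypothetical `state_features(s)` map and an episode collected as `(s, r, s_next, done)` tuples:

```python
import numpy as np

def td_lambda_critic(transitions, v, lam=0.9, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) critic update over one episode."""
    e = np.zeros_like(v)                                     # eligibility trace
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else state_features(s_next) @ v
        delta = r + gamma * v_next - state_features(s) @ v   # TD error
        e = gamma * lam * e + state_features(s)              # accumulate trace
        v = v + alpha * delta * e                            # update critic weights
    return v
```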

Policy Gradient with Eligibility Traces

  • Just like forward-view TD($\lambda$), we can mix over time-scales
    $$ \Delta \theta = \alpha (v_t^\lambda - V_v (s_t) ) \nabla_\theta \log \pi_\theta (s_t,a_t) $$
  • where $v_t^\lambda - V_v(s_t)$ is a biased estimate of the advantage function
  • Like backward-view TD($\lambda$), we can also use eligibility traces
    • By equivalence with TD($\lambda$), substituting $\phi(s) = \nabla_\theta \log \pi_\theta (s,a)$
      $$ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\ e_{t+1} = \lambda e_{t} + \nabla_\theta \log \pi_\theta (s,a) \\ \Delta \theta = \alpha \delta_t e_t $$
  • This update can be applied online, to incomplete sequences

Natural Policy Gradient

Alternative Policy Gradient Direction

  • Gradient ascent algorithm can follow any ascent direction
  • A good ascent direction can significantly speed convergence
  • Also, a policy can often be reparametrised without changing action probabilities
  • For example, increasing score of all actions in a softmax policy
  • The vanilla gradient is sensitive to these reparametrisations

Natural Policy Gradient

  • The natural policy gradient is parametrisation independent
  • It finds the ascent direction that is closest to the vanilla gradient, when changing the policy by a small, fixed amount
    $$ \nabla_\theta^{nat} \pi_\theta (s,a) = G_\theta^{-1} \nabla_\theta \pi_\theta (s,a) $$
  • where $G_\theta$ is the Fisher information matrix
    $$ G_\theta = E_\theta \left[ \nabla_\theta \log \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a)^T \right] $$
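
As a rough sketch (an illustrative construction, not from the lecture), the Fisher matrix can be estimated from sampled score vectors and used to precondition the vanilla gradient:

```python
import numpy as np

def natural_gradient(scores, vanilla_grad, reg=1e-3):
    """Turn a vanilla policy gradient into a natural gradient estimate.

    `scores` is an (N, d) array of sampled score vectors
    grad_theta log pi_theta(s_i, a_i); the Fisher information matrix is
    estimated as the average of their outer products. A small ridge term
    `reg` keeps the matrix invertible.
    """
    G = scores.T @ scores / len(scores)                  # Fisher estimate
    return np.linalg.solve(G + reg * np.eye(G.shape[0]), vanilla_grad)
```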

Natural Actor-Critic

  • Using compatible function approximation
    $$ \nabla_w A_w (s,a) = \nabla_\theta \log \pi_\theta (s,a) $$
  • So the natural policy gradient simplifies,
    $$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) A^{\pi_\theta} (s,a) \right] \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a)^T w \right] = G_\theta w $$
    so we get $ \nabla_\theta^{nat} J(\theta) = G_\theta^{-1} \nabla_\theta J(\theta) = w $
  • i.e. update actor parameters in the direction of the critic parameters

Summary of Policy Gradient Algorithms

  • The policy gradient has many equivalent forms
    $$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, v_t \right] \quad \text{REINFORCE} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, Q_w (s,a) \right] \quad \text{Q Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, A_w (s,a) \right] \quad \text{Advantage Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, \delta \right] \quad \text{TD Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, \delta e \right] \quad \text{TD($\lambda$) Actor-Critic} \\ G_\theta^{-1} \nabla_\theta J(\theta) = w \quad \text{Natural Actor-Critic} $$
  • Each leads to a stochastic gradient ascent algorithm
  • Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^\pi (s,a)$, $A^\pi (s,a)$ or $V^\pi (s)$