Reference: UCL Course on RL
Lecture 7: Policy Gradient
Introduction
Policy-Based Reinforcement Learning
- In the last lecture we approximated the value or action-value function using parameters $\theta$
- A policy was generated directly from the value function
- In this lecture, we will directly parametrise the policy
$$ \pi_{\theta} (s,a) = P(a|s,\theta) $$ - We will focus again on model-free reinforcement learning
Value-Based and Policy-Based RL
- Value Based
- Learnt Value Function
- Implicit policy (e.g. $\epsilon$-greedy)
- Policy Based
- No Value Function
- Learnt Policy
- Actor-Critic
- Learnt Value Function
- Learnt Policy
Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
Policy Search
Policy Objective Functions
- Goal: given policy $\pi_{\theta} (s,a) $ with parameters $\theta$, find best $\theta$.
- But how do we measure the quality of a policy $\pi_\theta$?
- In episodic environments we can use the start value.
$$ J_1 (\theta) = V^{\pi_\theta} (s_1) = E_{\pi_\theta} (v_1) $$ - In continuing environments we can use the average value
$$ J_{avV} (\theta) = \sum_{s} d^{\pi_\theta} (s) V^{\pi_\theta} (s) $$ - Or the average reward per time-step
$$ J_{avR} (\theta) = \sum_{s} d^{\pi_\theta} (s) \sum_{a} \pi_\theta (s,a) R_s^a $$ - where $d^{\pi_\theta} (s)$ is stationary distribution of Markov chain for $\pi_\theta$
Policy Optimisation
- Policy based reinforcement learning is an optimisation problem
- Find $\theta$ that maximises $J(\theta)$
- Some approaches do not use gradient
- Hill climbing
- Simplex/amoeba/Nelder Mead
- Genetic algorithms
- Greater efficiency often possible using gradient
- Gradient descent
- Conjugate gradient
- Quasi-Newton
- We focus on gradient descent, many extensions possible
- And on methods that exploit sequential structure
Finite Difference Policy Gradient
Policy Gradient
- Let $J(\theta)$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective w.r.t. the policy parameters $\theta$
$$ \Delta \theta = \alpha \nabla_\theta J(\theta) $$ - where $\nabla_\theta J(\theta)$ is the policy gradient and $\alpha$ is a step-size parameter
Computing Gradients By Finite Differences
- To evaluate policy gradient of $\pi_\theta (s,a)$
- For each dimension $k \in [1,n]$
- Estimate $k$th partial derivative of objective function w.r.t. $\theta$
- By perturbing $\theta$ by a small amount $\epsilon$ in the $k$th dimension
$$ \frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon} $$ - where $u_k$ is a unit vector with 1 in the $k$th component, 0 elsewhere
- Uses $n$ evaluations to compute policy gradient in $n$ dimensions
- Simple, noisy, inefficient - but sometimes effective
- Works for arbitrary policies, even if policy is not differentiable
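As a concrete illustration of the procedure above, here is a minimal sketch of finite-difference gradient estimation. The `evaluate_policy` callable is a hypothetical stand-in for running rollouts with the perturbed parameters and averaging the returns; it is an assumption for illustration, not part of the lecture.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient dJ/dtheta by finite differences.

    evaluate_policy: callable theta -> estimate of J(theta), e.g. the mean
                     return over a batch of rollouts (hypothetical stand-in).
    theta:           current policy parameters, shape (n,).
    eps:             size of the perturbation in each dimension.
    """
    n = len(theta)
    grad = np.zeros(n)
    j_theta = evaluate_policy(theta)          # baseline evaluation J(theta)
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                          # unit vector in k-th dimension
        # k-th partial derivative: (J(theta + eps*u_k) - J(theta)) / eps
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_theta) / eps
    return grad

# Toy usage with a known objective: J(theta) = -||theta - 1||^2,
# whose true gradient is -2 (theta - 1).
if __name__ == "__main__":
    J = lambda th: -np.sum((th - 1.0) ** 2)
    theta = np.zeros(3)
    print(finite_difference_gradient(J, theta))   # approx. [2, 2, 2]
```

Note the cost: one extra policy evaluation per parameter dimension, which is why the method becomes inefficient for large $n$.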
Monte-Carlo Policy Gradient
Likelihood Ratios
Score Function
- We now compute the policy gradient analytically
- Assume policy $\pi_\theta$ is differentiable whenever it is non-zero
- and we know the gradient $\nabla_\theta \pi_\theta (s,a) $
- Likelihood ratios exploit the following identity
$$ \nabla_\theta \pi_\theta (s,a) = \pi_\theta (s,a) \frac{\nabla_\theta \pi_\theta (s,a)}{\pi_\theta (s,a)} = \pi_{\theta} (s,a) \nabla_\theta \log \pi_\theta (s,a) $$ - The score function is $\nabla_\theta \log \pi_{\theta} (s,a)$
Softmax Policy
- We will use a softmax policy as a running example
- Weight actions using linear combination of features $\phi(s,a)^T \theta$
- Probability of action is proportional to exponential weight
$$ \pi_\theta (s,a) \propto e^{\phi(s,a)^T \theta} $$ - The score function is
$$ \nabla_\theta \log \pi_\theta (s,a) = \phi(s,a) - E_{\pi_\theta} (\phi(s,\cdot)) $$
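A minimal numeric sketch of the softmax policy and its score function, assuming a hypothetical feature array `phi_s` holding one feature vector per action in state $s$; the score $\phi(s,a) - E_{\pi_\theta}[\phi(s,\cdot)]$ is checked against a numerical derivative of $\log \pi_\theta$.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """phi_s: (num_actions, n) feature matrix for every action in state s."""
    prefs = phi_s @ theta                      # linear action preferences
    prefs -= prefs.max()                       # numerical stability
    e = np.exp(prefs)
    return e / e.sum()                         # pi_theta(s, .)

def softmax_score(phi_s, theta, a):
    """Score function: phi(s,a) - E_pi[phi(s, .)]."""
    pi = softmax_policy(phi_s, theta)
    return phi_s[a] - pi @ phi_s

# Check against a numerical derivative of log pi_theta(s, a).
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(4, 5))                # toy: 4 actions, 5 features
theta = rng.normal(size=5)
a, eps = 2, 1e-6
num_grad = np.array([
    (np.log(softmax_policy(phi_s, theta + eps * np.eye(5)[k])[a])
     - np.log(softmax_policy(phi_s, theta)[a])) / eps
    for k in range(5)
])
print(np.allclose(softmax_score(phi_s, theta, a), num_grad, atol=1e-4))  # True
```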
Gaussian Policy
- In continuous action spaces, a Gaussian policy is natural
- Mean is a linear combination of state features $\mu(s) = \phi(s)^T \theta$
- Variance may be fixed $\sigma^2$, or can also be parametrised
- Policy is Gaussian, $a \sim N(\mu(s),\sigma^2)$
- The score function is
$$ \nabla_\theta \log \pi_{\theta} (s,a) = \frac{(a-\mu(s))\phi(s)}{\sigma^2} $$
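A similar sketch for the Gaussian policy, assuming a hypothetical state-feature vector `phi_s`; the score $(a-\mu(s))\phi(s)/\sigma^2$ is compared with a numerical derivative of the Gaussian log-density.

```python
import numpy as np

def gaussian_score(phi_s, theta, a, sigma):
    """Score of a Gaussian policy a ~ N(mu(s), sigma^2), mu(s) = phi(s)^T theta."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma ** 2

def log_density(phi_s, theta, a, sigma):
    mu = phi_s @ theta
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

rng = np.random.default_rng(1)
phi_s, theta, a, sigma = rng.normal(size=3), rng.normal(size=3), 0.7, 0.5
eps = 1e-6
num = np.array([(log_density(phi_s, theta + eps * np.eye(3)[k], a, sigma)
                 - log_density(phi_s, theta, a, sigma)) / eps for k in range(3)])
print(np.allclose(gaussian_score(phi_s, theta, a, sigma), num, atol=1e-4))  # True
```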
Policy Gradient Theorem
One-Step MDPs
- Consider a simple class of one-step MDPs
- Starting in state $s\sim d(s)$
- Terminating after one time-step with reward $r=R_{s,a}$
- Use likelihood ratios to compute the policy gradient
$$ J(\theta) = E_{\pi_\theta} (r) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta (s,a) R_{s,a} \\ \nabla_\theta J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a) R_{s,a} = E_{\pi_\theta} (\nabla_\theta \log \pi_\theta (s,a) r) $$
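The identity above can be checked numerically on a tiny one-step MDP. The sketch below uses toy numbers: a tabular softmax policy whose preferences play the role of $\phi(s,a)^T\theta$ with one-hot features (an illustrative assumption), comparing the likelihood-ratio gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 3, 4
d = np.ones(n_states) / n_states                # start-state distribution d(s)
R = rng.normal(size=(n_states, n_actions))      # rewards R_{s,a}
theta = rng.normal(size=(n_states, n_actions))  # softmax preferences (one-hot features)

def pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta):
    return np.sum(d[:, None] * pi(theta) * R)   # E_pi[r]

# Analytic gradient via the likelihood ratio form:
# sum_s d(s) sum_a pi(s,a) grad log pi(s,a) R_{s,a}
p = pi(theta)
grad = np.zeros_like(theta)
for s in range(n_states):
    for a in range(n_actions):
        score = -p[s].copy()
        score[a] += 1.0                          # grad_{theta[s,:]} log pi(s,a)
        grad[s] += d[s] * p[s, a] * score * R[s, a]

# Finite-difference check of the same gradient
eps = 1e-6
fd = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    t = theta.copy(); t[idx] += eps
    fd[idx] = (J(t) - J(theta)) / eps
print(np.allclose(grad, fd, atol=1e-4))          # True
```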
Policy Gradient Theorem
- The policy gradient theorem generalises the likelihood ratio approach to multi-step MDPs
- Replaces instantaneous reward $r$ with long-term value $Q^\pi (s,a)$
- The policy gradient theorem applies to the start state objective, the average reward objective, and the average value objective
Theorem
For any differentiable policy $\pi_\theta(s,a)$, for any of the policy objective functions $J=J_1,J_{avR}$ or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q^{\pi_\theta} (s,a)\right] $$
Monte-Carlo Policy Gradient (REINFORCE)
- Update parameters by stochastic gradient ascent
- Using policy gradient theorem
- Using return $v_t$ as an unbiased sample of $Q^{\pi_\theta} (s_t,a_t)$
$$ \Delta \theta_t = \alpha \nabla_\theta \log \pi_\theta (s_t,a_t)v_t $$
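A minimal REINFORCE sketch under stated assumptions: a softmax policy over hypothetical features `phi(s, a)`, and an environment object with a gym-style `reset()`/`step(a)` interface returning `(next_state, reward, done)`. Both interfaces are illustrative stand-ins, not part of the lecture.

```python
import numpy as np

def softmax_pi(theta, phi, state, n_actions):
    prefs = np.array([phi(state, a) @ theta for a in range(n_actions)])
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce(env, phi, n_actions, n_features, alpha=0.01, gamma=1.0, n_episodes=1000):
    """Monte-Carlo policy gradient: Delta theta = alpha * grad log pi * v_t."""
    theta = np.zeros(n_features)
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        # Generate one episode following pi_theta
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            p = softmax_pi(theta, phi, s, n_actions)
            a = rng.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # Compute returns v_t and apply the REINFORCE update at every step
        v = 0.0
        for t in reversed(range(len(states))):
            v = rewards[t] + gamma * v            # return v_t from time t
            p = softmax_pi(theta, phi, states[t], n_actions)
            expected_phi = sum(p[b] * phi(states[t], b) for b in range(n_actions))
            score = phi(states[t], actions[t]) - expected_phi
            theta += alpha * score * v            # stochastic gradient ascent step
    return theta
```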
Actor-Critic Policy Gradient
Introduction to AC
Reducing Variance Using a Critic
- Monte-Carlo policy gradient still has high variance
- We use a critic to estimate the action-value function, $ Q_w (s,a) \approx Q^{\pi_\theta} (s,a)$
- Actor-critic algorithms maintain two sets of parameters
- Critic: updates action-value function parameters $w$
- Actor: updates policy parameters $\theta$, in direction suggested by critic
- Actor-critic algorithms follow an approximate policy gradient
$$ \nabla_\theta J(\theta) \approx E_{\pi_\theta} (\nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) ) \\ \Delta \theta = \alpha \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) $$
Estimating the Action-Value Function
- The critic is solving a familiar problem: policy evaluation
- How good is policy $\pi_\theta$ for current parameters $\theta$?
- This problem was explored in previous two lectures, e.g.
- Monte-Carlo policy evaluation
- Temporal-Difference learning
- TD($\lambda$)
- Could also use e.g. least-squares policy evaluation
Action-value Actor-Critic
- Simple actor-critic algorithm based on action-value critic
- Using linear value function approximation $Q_w (s,a) = \phi(s,a)^T w$
- Critic: updates $w$ by linear TD(0)
- Actor: updates $\theta$ by policy gradient (a minimal sketch follows below)
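A sketch of the action-value actor-critic (QAC) loop, under the same illustrative assumptions as the REINFORCE sketch above: hypothetical features `phi(s, a)`, a gym-style `env`, a softmax actor, and a linear critic $Q_w(s,a) = \phi(s,a)^T w$ updated by TD(0).

```python
import numpy as np

def qac(env, phi, n_actions, n_features, alpha=0.01, beta=0.01, gamma=0.99, n_steps=100_000):
    """Action-value actor-critic: linear TD(0) critic, softmax policy-gradient actor."""
    theta = np.zeros(n_features)   # actor (policy) parameters
    w = np.zeros(n_features)       # critic (action-value) parameters
    rng = np.random.default_rng(0)

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    s, done = env.reset(), False
    a = rng.choice(n_actions, p=policy(s))
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        if done:
            td_target, a_next = r, None                # no bootstrap at terminal states
        else:
            a_next = rng.choice(n_actions, p=policy(s_next))
            td_target = r + gamma * phi(s_next, a_next) @ w
        delta = td_target - phi(s, a) @ w              # TD(0) error
        # Actor: step in the direction suggested by the critic, grad log pi * Q_w
        p = policy(s)
        score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * score * (phi(s, a) @ w)
        # Critic: linear TD(0) update of w
        w += beta * delta * phi(s, a)
        if done:
            s, done = env.reset(), False
            a = rng.choice(n_actions, p=policy(s))
        else:
            s, a = s_next, a_next
    return theta, w
```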
Compatible Function Approximation
Bias in Actor-Critic Algorithms
- Approximating the policy gradient introduces bias
- A biased policy gradient may not find the right solution
- e.g. if $Q_w(s,a)$ uses aliased features, can we solve the gridworld example?
- Luckily, if we choose value function approximation carefully
- Then we can avoid introducing any bias
- i.e. we can still follow the exact policy gradient
Compatible Function Approximation
Theorem(Compatible Function Approximation Theorem)
If the following two conditions are satisfied
- Value function approximator is compatible to the policy
$$ \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta (s,a) $$ - Value function parameters $w$ minimise the mean-squared error
$$ \epsilon = E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) -Q_w(s,a))^2\right] $$
Then the policy gradient is exact
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) \right] $$
Proof of Compatible Function Approximation Theorem
If $w$ is chosen to minimise the mean-squared error, the gradient of $\epsilon$ w.r.t. $w$ must be zero,
$$ \nabla_w \epsilon = 0 \\ E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) - Q_w(s,a)) \nabla_w Q_w(s,a) \right] = 0 \\ E_{\pi_\theta} \left[ (Q^{\pi_\theta}(s,a) - Q_w(s,a)) \nabla_\theta \log \pi_\theta (s,a) \right] = 0 \\ E_{\pi_\theta} \left[ Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta (s,a) \right] = E_{\pi_\theta} \left[ Q_w(s,a) \nabla_\theta \log \pi_\theta (s,a) \right] $$
So $Q_w (s,a)$ can be substituted directly into the policy gradient,
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) Q_w (s,a) \right] $$
Advantage Function Critic
Reducing Variance Using a Baseline
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce variance, without changing expectation
$$ E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) B(s) \right] = \sum_{s \in S} d^{\pi_\theta} (s) \sum_{a \in A} \nabla_\theta \pi_\theta (s,a) B(s) \\ = \sum_{s \in S} d^{\pi_\theta} (s) B(s) \nabla_\theta \sum_{a \in A} \pi_\theta (s,a) = 0 $$ - A good baseline is the state value function $B(s) = V^{\pi_\theta} (s)$
- So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta} (s,a)$
$$ A^{\pi_\theta} (s,a) = Q^{\pi_\theta} (s,a) - V^{\pi_\theta} (s) \\ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) A^{\pi_\theta} (s,a) \right] $$
Estimating the Advantage Function
- The advantage function can significantly reduce variance of policy gradient
- So the critic should really estimate the advantage function
- For example, by estimating both $V^{\pi_\theta} (s)$ and $Q^{\pi_\theta} (s,a)$
- Using two function approximators and two parameter vectors
$$ V_v (s) \approx V^{\pi_\theta} (s) \\ Q_w (s,a) \approx Q^{\pi_\theta} (s,a) \\ A(s,a) = Q_w(s,a) - V_v (s) $$ - And updating both value functions by e.g. TD learning
- For the true value function $V^{\pi_\theta} (s)$, the TD error $\delta^{\pi_\theta}$
$$ \delta^{\pi_\theta} = r + \gamma V^{\pi_\theta} (s') - V^{\pi_\theta} (s)$$ - is an unbiased estimate of the advantage function
$$ E_{\pi_\theta} \left[ \delta^{\pi_\theta} | s,a\right]= E_{\pi_\theta} \left[r+\gamma V^{\pi_\theta} (s') | s,a \right] - V^{\pi_\theta} (s) \\ = Q^{\pi_\theta} (s,a) -V^{\pi_\theta} (s) = A^{\pi_\theta} (s,a) $$ - So we can use the TD error to compute the policy gradient
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \delta^{\pi_\theta} \right] $$ - In practice we can use an approximate TD error
$$ \delta_v = r + \gamma V_v (s') - V_v (s) $$ - This approach only requires one set of critic parameters $v$
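A sketch of the resulting TD-error actor-critic, which keeps only a state-value critic $V_v(s) = \phi(s)^T v$ and uses the approximate TD error $\delta_v$ in both updates. The gym-style `env`, state features `phi_state(s)`, and state-action features `phi(s, a)` are illustrative assumptions, as in the earlier sketches.

```python
import numpy as np

def td_actor_critic(env, phi, phi_state, n_actions, n_theta, n_v,
                    alpha=0.01, beta=0.05, gamma=0.99, n_steps=100_000):
    """Actor-critic using the TD error delta_v in place of the advantage."""
    theta = np.zeros(n_theta)      # actor parameters
    v = np.zeros(n_v)              # critic parameters for V_v(s) = phi_state(s)^T v
    rng = np.random.default_rng(0)

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    s = env.reset()
    for _ in range(n_steps):
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s_next, r, done = env.step(a)
        # Approximate TD error: delta_v = r + gamma V_v(s') - V_v(s)
        v_next = 0.0 if done else phi_state(s_next) @ v
        delta = r + gamma * v_next - phi_state(s) @ v
        # Actor: grad log pi(s,a) * delta_v (delta_v approximates the advantage)
        score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * score * delta
        # Critic: TD(0) update of v
        v += beta * delta * phi_state(s)
        s = env.reset() if done else s_next
    return theta, v
```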
Eligibility Traces
Critics at Different Time-Scales
- The critic can estimate the value function $V_\theta(s)$ from many targets at different time-scales (here $\theta$ denotes the critic's own parameters, with state features $\phi(s)$)
- For MC, the target is the return $v_t$
$$ \Delta \theta = \alpha (v_t - V_\theta(s)) \phi(s) $$ - For TD(0), the target is the TD target $r+ \gamma V(s')$
$$ \Delta \theta = \alpha (r + \gamma V(s') - V_\theta(s)) \phi(s) $$ - For forward-view TD($\lambda$), the target is the $\lambda$-return $v_t^\lambda$
$$ \Delta \theta =\alpha (v_t^\lambda - V_\theta (s)) \phi(s) $$ - For backward-view TD($\lambda$), we use eligibility traces
$$ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\ e_t = \gamma \lambda e_{t-1} + \phi(s_t) \\ \Delta \theta = \alpha \delta_t e_t $$
Policy Gradient with Eligibility Traces
- Just like forward-view TD($\lambda$), we can mix over time-scales
$$ \Delta \theta = \alpha (v_t^\lambda - V_v (s_t) ) \nabla_\theta \log \pi_\theta (s_t,a_t) $$ - where $v_t^\lambda -V_v(s_t)$ is a biased estimate of advantage fn
- Like backward-view TD($\lambda$), we can also use eligibility traces
- By equivalence with TD($\lambda$), substituting $\phi(s) = \nabla_\theta \log \pi_\theta (s,a)$
$$ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\ e_{t+1} = \lambda e_{t} + \nabla_\theta \log \pi_\theta (s,a) \\ \Delta \theta = \alpha \delta_t e_t $$
- This update can be applied online, to incomplete sequences
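An online actor-critic sketch with eligibility traces for both actor and critic, following the trace updates written above (critic trace $e = \gamma\lambda e + \phi(s)$, actor trace $e = \lambda e + \nabla_\theta \log \pi_\theta(s,a)$); it reuses the same hypothetical `env`, `phi`, `phi_state` interface as the earlier sketches.

```python
import numpy as np

def actor_critic_traces(env, phi, phi_state, n_actions, n_theta, n_v,
                        alpha=0.01, beta=0.05, gamma=0.99, lam=0.9, n_steps=100_000):
    """Online actor-critic; both actor and critic accumulate eligibility traces."""
    theta, v = np.zeros(n_theta), np.zeros(n_v)
    e_theta, e_v = np.zeros(n_theta), np.zeros(n_v)   # eligibility traces
    rng = np.random.default_rng(0)

    def policy(s):
        prefs = np.array([phi(s, b) @ theta for b in range(n_actions)])
        prefs -= prefs.max()
        e = np.exp(prefs)
        return e / e.sum()

    s = env.reset()
    for _ in range(n_steps):
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else phi_state(s_next) @ v
        delta = r + gamma * v_next - phi_state(s) @ v      # TD error
        score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        # Trace updates as in the text above
        e_v = gamma * lam * e_v + phi_state(s)             # critic trace
        e_theta = lam * e_theta + score                    # actor trace
        theta += alpha * delta * e_theta                   # actor update
        v += beta * delta * e_v                            # critic update
        if done:
            s = env.reset()
            e_theta[:] = 0.0                               # reset traces at episode end
            e_v[:] = 0.0
        else:
            s = s_next
    return theta, v
```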
Natural Policy Gradient
Alternative Policy Gradient Direction
- Gradient ascent algorithms can follow any ascent direction
- A good ascent direction can significantly speed convergence
- Also, a policy can often be reparametrised without changing action probabilities
- For example, increasing score of all actions in a softmax policy
- The vanilla gradient is sensitive to these reparametrisations
Natural Policy Gradient
- The natural policy gradient is parametrisation independent
- It finds the ascent direction that is closest to the vanilla gradient, when changing the policy by a small, fixed amount
$$ \nabla_\theta^{nat} \pi_\theta (s,a) = G_\theta^{-1} \nabla_\theta \pi_\theta (s,a) $$ - where $G_\theta$ is the Fisher information matrix
$$ G_\theta = E_\theta \left[ \nabla_\theta \log \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a)^T \right] $$
Natural Actor-Critic
- Using compatible function approximation
$$ \nabla_w A_w (s,a) = \nabla_\theta \log \pi_\theta (s,a) $$ - So the natural policy gradient simplifies,
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) A^{\pi_\theta} (s,a) \right] \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s,a)^T w \right] = G_\theta w $$
- So the natural policy gradient is simply $\nabla_\theta^{nat} J(\theta) = w$, i.e. update the actor parameters in the direction of the critic parameters
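A small numeric sketch of the natural gradient computation: given sampled scores and compatible advantage values (toy random numbers standing in for rollout data), it forms the vanilla gradient, estimates the Fisher information matrix $G_\theta$, and solves for the natural gradient direction, which recovers the critic parameters $w$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_params = 5000, 4

# Toy stand-ins for rollout data (illustrative only):
scores = rng.normal(size=(n_samples, n_params))  # sampled grad log pi(s_i, a_i)
w = rng.normal(size=n_params)                    # critic parameters
advantages = scores @ w                          # A_w = score^T w (compatible features)

# Vanilla policy gradient estimate: E[grad log pi * A_w]
vanilla_grad = (scores * advantages[:, None]).mean(axis=0)

# Fisher information matrix estimate: G_theta = E[score score^T]
G = scores.T @ scores / n_samples

# Natural gradient direction: G_theta^{-1} * vanilla gradient
natural_grad = np.linalg.solve(G, vanilla_grad)
print(np.allclose(natural_grad, w))              # True: natural gradient recovers w
```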
Summary of Policy Gradient Algorithms
- The policy gradient has many equivalent forms
$$ \nabla_\theta J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, v_t \right] \quad \text{REINFORCE} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, Q_w(s,a) \right] \quad \text{Q Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, A_w(s,a) \right] \quad \text{Advantage Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, \delta \right] \quad \text{TD Actor-Critic} \\ = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta (s,a) \, \delta e \right] \quad \text{TD}(\lambda) \text{ Actor-Critic} \\ G_\theta^{-1} \nabla_\theta J(\theta) = w \quad \text{Natural Actor-Critic} $$
- Each leads to a stochastic gradient ascent algorithm
- Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^\pi (s,a)$, $A^\pi (s,a)$ or $V^\pi (s)$