UCL Course on RL
Lecture 3: Planning by Dynamic Programming
Dynamic: sequential or temporal component to the problem
Programming: optimising a “program”, i.e. a policy
- A method for solving complex problems
- By breaking them down into sub-problems
- Solve the sub-problems
- Combine solutions to sub-problems
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
- Principle of optimality applies
- Optimal solution can be decomposed into sub-problems
- Overlapping sub-problems
- Sub-problems recur many times
- Solution can be cached and reused
Markov decision processes satisfy both properties
- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions
Planning by DP
- DP assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
- Input: MDP $(S,A,P,R,\gamma)$ and policy $\pi$
- or: MRP $(S,P^{\pi},R^{\pi},\gamma)$
- Output: value function $v_{\pi}$
- Or for control:
- Input: MDP $(S,A,P,R,\gamma)$
- Output: optimal value function $v_*$
- and: optimal policy $\pi_*$
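To make the prediction input concrete, here is a minimal sketch of how an MDP and a policy collapse into the MRP $(S,P^{\pi},R^{\pi},\gamma)$. The array layout (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`), the helper name `induced_mrp`, and all the numbers are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is made up.
# P[a, s, s2] = probability of moving from state s to state s2 under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# pi[s, a] = probability of choosing action a in state s (uniform random here).
pi = np.full((2, 2), 0.5)


def induced_mrp(P, R, pi):
    """Average the MDP dynamics and rewards over the policy's action choices."""
    P_pi = np.einsum("sa,ast->st", pi, P)  # P^pi[s, s2] = sum_a pi[s, a] P[a, s, s2]
    R_pi = np.sum(pi * R, axis=1)          # R^pi[s]     = sum_a pi[s, a] R[s, a]
    return P_pi, R_pi


P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi)
print(R_pi)
```

Each row of `P_pi` is still a probability distribution, which is what lets the prediction problem be treated as a plain MRP.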
Policy Evaluation
Iterative Policy Evaluation
- Problem: evaluate a given policy $\pi$
- Solution: iterative application of Bellman expectation backup
- Using synchronous backups
- At each iteration $k+1$
- For all states $s\in S$
- Update $v_{k+1}(s)$ from $v_k(s')$
- where $s'$ is a successor state of $s$
$$ v^{k+1} = R^{\pi} + \gamma P^{\pi} v^{k} $$
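The matrix form of the backup translates almost directly into code. The following is a minimal sketch, assuming the policy has already been folded into a hypothetical transition matrix `P_pi` and reward vector `R_pi` (illustrative numbers); the function name, tolerance and iteration cap are likewise assumptions.

```python
import numpy as np


def policy_evaluation(P_pi, R_pi, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v <- R_pi + gamma * P_pi @ v until successive sweeps agree."""
    v = np.zeros(len(R_pi))
    for _ in range(max_iters):
        # Synchronous backup: every new value is computed from the old vector v.
        v_new = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical 2-state MRP induced by some policy (made-up numbers).
P_pi = np.array([[0.7, 0.3],
                 [0.1, 0.9]])
R_pi = np.array([0.5, 1.0])
gamma = 0.9

v_iter = policy_evaluation(P_pi, R_pi, gamma)
# For comparison: the Bellman expectation equation solved directly as a linear system.
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(v_iter, v_exact)
```

For small state spaces the direct solve is simpler; the iterative version is the one that generalises to the other algorithms in this lecture.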
Policy Iteration
Policy Improvement
- Given a policy $\pi$
- Evaluate the policy $\pi$
- Improve the policy by acting greedily with respect to $v_{\pi}$
- In general, need more iterations of improvement/evaluation
- But this process of policy iteration always converges to $\pi_*$
- Consider a deterministic policy, $a=\pi (s) $
- We can improve the policy by acting greedily, $\pi'(s) = \arg\max_{a\in A} q_{\pi}(s,a)$
- This improves the value from any state $s$ over one step, $q_{\pi}(s,\pi'(s)) = \max_{a\in A} q_{\pi}(s,a) \ge q_{\pi}(s,\pi(s)) = v_{\pi}(s)$
- It therefore improves the value function, $v_{\pi'}(s) \ge v_{\pi}(s)$
- If improvements stop, $q_{\pi}(s,\pi'(s)) = \max_{a\in A} q_{\pi}(s,a) = v_{\pi}(s)$
- Then the Bellman optimality equation has been satisfied, $v_{\pi}(s) = \max_{a\in A} q_{\pi}(s,a)$
- Therefore $v_{\pi}(s) = v_*(s)$ for all $s\in S$, so $\pi$ is an optimal policy
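The evaluate/improve loop described above can be sketched as follows. This is only an illustration under stated assumptions: the MDP is a made-up 2-state, 2-action example stored as `P[a, s, s2]` and `R[s, a]`, evaluation is done exactly with a linear solve rather than iteratively, and the helper names are hypothetical.

```python
import numpy as np


def evaluate(P, R, policy, gamma):
    """Exact evaluation of a deterministic policy by solving (I - gamma P_pi) v = R_pi."""
    n = R.shape[0]
    idx = np.arange(n)
    P_pi = P[policy, idx]   # row s is P[policy[s], s, :]
    R_pi = R[idx, policy]   # entry s is R[s, policy[s]]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)


def q_from_v(P, R, v, gamma):
    """One-step lookahead: q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * v[s2]."""
    return R + gamma * np.einsum("ast,t->sa", P, v)


def policy_iteration(P, R, gamma):
    policy = np.zeros(R.shape[0], dtype=int)  # arbitrary starting policy
    while True:
        v = evaluate(P, R, policy, gamma)
        greedy = np.argmax(q_from_v(P, R, v, gamma), axis=1)
        if np.array_equal(greedy, policy):    # improvements have stopped
            return policy, v
        policy = greedy


# Hypothetical MDP; all numbers are made up for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

policy, v = policy_iteration(P, R, gamma)
print(policy, v)
```

When the loop exits, `v` satisfies the Bellman optimality equation, which is exactly the stopping argument in the bullets above.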
Extension to Policy Iteration
- Does policy evaluation need to converge to $v_{\pi}$?
- Or should we introduce a stopping condition?
- Or simply stop after $k$ iterations of iterative policy evaluation?
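One way to explore the last question is to truncate evaluation to `k` sweeps between greedy improvements. The sketch below assumes a made-up MDP stored as `P[a, s, s2]` and `R[s, a]`, a hypothetical function name, and a fixed number of outer iterations instead of a stopping test. With `k=1` each outer step reduces to a single Bellman optimality backup.

```python
import numpy as np


def truncated_policy_iteration(P, R, gamma, k, n_outer=500):
    """Greedy improvement followed by only k Bellman expectation backups."""
    n = R.shape[0]
    idx = np.arange(n)
    v = np.zeros(n)
    for _outer in range(n_outer):
        # Improvement: act greedily with respect to the current estimate v.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        policy = q.argmax(axis=1)
        P_pi, R_pi = P[policy, idx], R[idx, policy]
        # Truncated evaluation: k sweeps instead of running to convergence.
        for _sweep in range(k):
            v = R_pi + gamma * P_pi @ v
    return policy, v


# Hypothetical MDP; all numbers are made up for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

for k in (1, 3, 10):
    policy, v = truncated_policy_iteration(P, R, gamma, k)
    print(k, policy, v)
```

On this small example every choice of `k` ends at the same policy and values; what changes is how the work is split between evaluation and improvement.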
Value Iteration
Principle of Optimality
Any optimal policy can be subdivided into two components
- An optimal first action $A_*$
- Followed by an optimal policy from successor state $S'$
A policy $\pi (a|s)$ achieves the optimal value from state $s$, $v_{\pi} (s) = v_* (s)$, if and only if
- For any state $s'$ reachable from $s$
- $\pi$ achieves the optimal value from state $s'$
Deterministic Value Iteration
- If we know the solution to sub-problems $v_*(s')$
- Then solution $v_*(s)$ can be found by one-step lookahead
- The idea of value iteration is to apply these updates iteratively
- Intuition: start with final rewards and work backwards
- Still works with loopy, stochastic MDPs
Value Iteration
- Problem: find optimal policy $\pi$
- Solution: iterative application of Bellman optimality backup
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
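A minimal sketch of value iteration, under the same kind of assumptions as elsewhere in these notes: a made-up MDP stored as `P[a, s, s2]` and `R[s, a]`, a hypothetical function name, and an arbitrary tolerance. Note that a policy is only read off at the very end, from the final value function.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeatedly apply the Bellman optimality backup v <- max_a (R^a + gamma P^a v)."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        q = R + gamma * np.einsum("ast,t->sa", P, v)  # one-step lookahead
        v_new = q.max(axis=1)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    # No explicit policy during the iterations; extract a greedy one at the end.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q.argmax(axis=1)


# Hypothetical MDP; all numbers are made up for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v_star, pi_star = value_iteration(P, R, gamma)
print(v_star, pi_star)
```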
Summary of DP algorithms
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
- Algorithms are based on state-value function $v_{\pi} (s) $ or $v_* (s) $
- Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- Could also apply to action-value function $q_{\pi} (s,a)$ or $q_* (s,a)$
- Complexity $O(m^2 n^2)$ per iteration
Extensions to DP
Asynchronous DP
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Three simple ideas
In-place dynamic programming
- Synchronous value iteration stores two copies of the value function for all $s$ in $S$
- In-place value iteration only stores one copy of the value function for all $s$ in $S$
Prioritised sweeping
- Use the magnitude of the Bellman error to guide state selection
- Back up the state with the largest remaining Bellman error
- Update the Bellman error of affected states after each backup
- Requires knowledge of reverse dynamics (predecessor states)
- Can be implemented efficiently by maintaining a priority queue
Real-time dynamic programming
- Idea: only back up states that are relevant to the agent
- Use the agent’s experience to guide the selection of states
- After each time-step $S_t,A_t,R_{t+1}$
- Back up the state $S_t$
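Of the three ideas above, prioritised sweeping combines most of the moving parts, so here is a rough sketch of it. It is in-place (a single value array) and always backs up the state with the largest Bellman error. For brevity it picks that state with a plain `argmax` rather than the priority queue mentioned above, and the MDP, array layout and function names are all assumptions.

```python
import numpy as np


def backup(P, R, v, gamma, s):
    """Bellman optimality backup for a single state s."""
    return np.max(R[s] + gamma * P[:, s, :] @ v)


def prioritised_sweeping(P, R, gamma, tol=1e-10, max_backups=100_000):
    n = R.shape[0]
    v = np.zeros(n)  # single in-place copy of the value function
    # Reverse dynamics: predecessors[s2] lists states that can reach s2 under some action.
    predecessors = [np.flatnonzero(P[:, :, s2].sum(axis=0) > 0) for s2 in range(n)]
    # Bellman error of every state under the current value function.
    error = np.array([abs(backup(P, R, v, gamma, s) - v[s]) for s in range(n)])
    for _ in range(max_backups):
        s = int(np.argmax(error))  # state with the largest remaining error
        if error[s] < tol:
            break
        v[s] = backup(P, R, v, gamma, s)
        error[s] = 0.0
        # Only states that can transition into s are affected by the change to v[s]
        # (this includes s itself when it has a self-loop).
        for p in predecessors[s]:
            error[p] = abs(backup(P, R, v, gamma, p) - v[p])
    return v


# Hypothetical MDP; all numbers are made up for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v_ps = prioritised_sweeping(P, R, gamma)
print(v_ps)
```

In a large, sparse MDP the `argmax` would be replaced by a heap keyed on the Bellman error, so that selecting the next state does not require a scan over all states.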
Full-width and sample backups
Full-width Backups
- DP uses full-width backups
- For each backup (sync or async)
- Every successor state and action is considered
- Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers Bellman’s curse of dimensionality
- Even one backup can be too expensive
Sample Backups
- Use sample rewards and sample transitions instead of the reward function $R$ and transition dynamics $P$
- Advantages
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of $n=|S|$
Approximate Dynamic Programming
- Approximate the value function
- Using a function approximator $\hat{v} (s,w)$
- Apply dynamic programming to $\hat{v} (\cdot,w)$
Contraction Mapping
The contraction mapping theorem answers the convergence questions for these algorithms: whether they converge, whether the solution is unique, and how fast they converge.
Value function Space
- Consider the vector space $V$ over value functions
- There are $|S|$ dimensions
- Each point in this space fully specifies a value function
- The Bellman backup brings value functions closer
- And therefore the backups must converge on a unique solution
Bellman Expectation Backup is a Contraction
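Distance between value functions is measured with the $\infty$-norm, $||u-v||_{\infty} = \max_{s\in S} |u(s)-v(s)|$.
- Define the Bellman expectation backup operator $T^{\pi}$
$$ T^{\pi} (v) = R^{\pi} + \gamma P^{\pi} v $$
- This operator is a $\gamma$-contraction, i.e. it shrinks the distance between any two value functions by a factor of at least $\gamma$
$$ ||T^{\pi} (u) - T^{\pi} (v) ||_{\infty} \le \gamma ||u-v||_{\infty} $$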
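The contraction property of $T^{\pi}$ follows in a few lines. Writing the discount factor as $\gamma$, for any two value functions $u$ and $v$:

$$
\begin{aligned}
||T^{\pi}(u) - T^{\pi}(v)||_{\infty} &= ||(R^{\pi} + \gamma P^{\pi} u) - (R^{\pi} + \gamma P^{\pi} v)||_{\infty} \\
&= \gamma \, ||P^{\pi}(u - v)||_{\infty} \\
&\le \gamma \, ||u - v||_{\infty}
\end{aligned}
$$

The last step uses the fact that each row of $P^{\pi}$ is a probability distribution, so every component of $P^{\pi}(u-v)$ is a weighted average of the components of $u-v$ and cannot exceed their largest absolute value.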
Theorem (Contraction Mapping Theorem)
For any metric space $V$ that is complete under an operator $T(v)$, where $T$ is a $\gamma$-contraction
- $T$ converges to a unique fixed point
- At a linear convergence rate of $\gamma$
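To see what a linear rate means in practice, write $v^{\ast}$ for the fixed point and $v_k$ for the result of applying $T$ to a starting point $v_0$ a total of $k$ times. Applying the contraction inequality $k$ times, with $\gamma$ the discount factor, gives

$$ ||v_k - v^{\ast}||_{\infty} \le \gamma^{k} \, ||v_0 - v^{\ast}||_{\infty} $$

As a worked example, with $\gamma = 0.9$, shrinking the initial error by a factor of 1000 requires $\gamma^{k} \le 10^{-3}$, i.e. $k \ge \ln(1000) / \ln(1/0.9) \approx 65.6$, so 66 backups are enough.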
Convergence of Iterative Policy Evaluation and Policy Iteration
- The Bellman expectation operator $T^{\pi}$ has a unique fixed point
- $v_{\pi}$ is a fixed point of $T^{\pi}$ (by Bellman expectation equation)
- By contraction mapping theorem
- Iterative policy evaluation converges on $v_{\pi}$
- Policy iteration converges on $v_*$
Bellman Optimality Backup is a Contraction
- Define the Bellman optimality backup operator $T^*$
$$ T^* (v) = \max_{a\in A} \left( R^a + \gamma P^a v \right) $$
- This operator is a $\gamma$-contraction, i.e. it shrinks the distance between any two value functions by a factor of at least $\gamma$
$$ ||T^* (u) - T^* (v) ||_{\infty} \le \gamma ||u-v||_{\infty} $$
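The inequality can also be checked numerically. The sketch below builds a random MDP (an assumption made purely for illustration, as are the array layout and helper names) and measures how much the Bellman optimality backup shrinks the $\infty$-norm distance between random pairs of value functions. It is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random hypothetical MDP: each row P[a, s, :] is normalised to sum to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))


def T_star(v):
    """Bellman optimality backup applied to the whole value vector."""
    return np.max(R + gamma * np.einsum("ast,t->sa", P, v), axis=1)


def sup_norm(x):
    return np.max(np.abs(x))


worst_ratio = 0.0
for _ in range(1000):
    u = 10 * rng.normal(size=n_states)
    v = 10 * rng.normal(size=n_states)
    ratio = sup_norm(T_star(u) - T_star(v)) / sup_norm(u - v)
    worst_ratio = max(worst_ratio, ratio)

# The largest observed shrink factor should never exceed gamma.
print(worst_ratio <= gamma + 1e-12)
```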
Convergence of Value Iteration
- The Bellman optimality operator $T^*$ has a unique fixed point
- $v_*$ is a fixed point of $T^*$ (by Bellman optimality equation)
- By contraction mapping theorem
- Value iteration converges on $v_*$