reference:
UCL Course on RL
Lecture 3: Planning by Dynamic Programming
Introduction
Definition
Dynamic: sequential or temporal component to the problem
Programming: optimising a “program”, i.e. a policy
- A method for solving complex problems
- By breaking them down into sub-problems
- Solve the sub-problems
- Combine solutions to sub-problems
Requirements
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
- Principle of optimality applies
- Optimal solution can be decomposed into sub-problems
- Overlapping sub-problems
- Sub-problems recur many times
- Solution can be cached and reused
Markov decision processes satisfy both properties
- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions
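Concretely, the Bellman expectation equation (with discount factor $\gamma$, expected rewards $R_s^a$ and transition probabilities $P_{ss'}^a$, as in the previous lecture) gives this recursive decomposition of a state's value into an immediate reward plus the discounted value of its successors:
$$ v_{\pi}(s) = \sum_{a\in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a \, v_{\pi}(s') \right) $$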
Planning by DP
- DP assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction
- Input: MDP $(S, A, P, R, \gamma)$ and policy $\pi$
- or: MRP $(S, P^{\pi}, R^{\pi}, \gamma)$
- Output: value function $v_{\pi}$
- Or for control:
- Input: MDP $(S, A, P, R, \gamma)$
- Output: optimal value function $v_*$
- and: optimal policy $\pi_*$
Policy Evaluation
Iterative Policy Evaluation
- Problem: evaluate a given policy $\pi$
- Solution: iterative application of Bellman expectation backup
- Using synchronous backups
- At each iteration $k+1$
- For all states $s\in S$
- Update $v_{k+1}(s)$ from $v_k(s')$
- where $s'$ is a successor state of $s$
$$ v_{k+1} = R^{\pi} + \gamma P^{\pi} v_k $$
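A minimal NumPy sketch of this synchronous backup, assuming a hypothetical two-state MRP induced by the policy (the matrices `P_pi` and `R_pi` are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical MRP induced by a fixed policy pi (illustrative numbers):
# P_pi[s, s'] = transition probability under pi, R_pi[s] = expected immediate reward.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([1.0, -1.0])
gamma = 0.9

def iterative_policy_evaluation(P_pi, R_pi, gamma, tol=1e-8):
    """Synchronous Bellman expectation backups: v_{k+1} = R^pi + gamma * P^pi v_k."""
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v       # every state backed up from the old copy v_k
        if np.max(np.abs(v_new - v)) < tol:   # stop when a full sweep barely changes anything
            return v_new
        v = v_new

print(iterative_policy_evaluation(P_pi, R_pi, gamma))
```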
Policy Iteration
Policy Improvement
- Given a policy $\pi$
- Evaluate the policy $\pi$
- Improve the policy by acting greedily with respect to $v_{\pi}$
- In general, this process of evaluation and improvement needs to be repeated many times
- But policy iteration always converges to $\pi_*$
Why acting greedily improves the policy
- Consider a deterministic policy, $a = \pi(s)$
- We can improve the policy by acting greedily: $\pi'(s) = \arg\max_{a\in A} q_{\pi}(s, a)$
- This improves the value from any state $s$ over one step: $q_{\pi}(s, \pi'(s)) = \max_{a\in A} q_{\pi}(s, a) \ge q_{\pi}(s, \pi(s)) = v_{\pi}(s)$
- It therefore improves the value function, $v_{\pi'}(s) \ge v_{\pi}(s)$ (the full chain of inequalities is expanded below)
- If improvements stop, $q_{\pi}(s, \pi'(s)) = \max_{a\in A} q_{\pi}(s, a) = v_{\pi}(s)$
- Then the Bellman optimality equation is satisfied: $v_{\pi}(s) = \max_{a\in A} q_{\pi}(s, a)$
- Therefore $v_{\pi}(s) = v_*(s)$ for all $s \in S$, so $\pi$ is an optimal policy
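The improvement claim $v_{\pi'}(s) \ge v_{\pi}(s)$ expands into the following chain of inequalities, obtained by repeatedly unrolling one greedy step of $\pi'$:
$$ v_{\pi}(s) \le q_{\pi}(s,\pi'(s)) = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s \right] \le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \le \dots \le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right] = v_{\pi'}(s) $$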
Extension to Policy Iteration
- Does policy evaluation need to converge to $v_{\pi}$?
- Or should we introduce a stopping condition, e.g. $\epsilon$-convergence of the value function?
- Or simply stop after $k$ iterations of iterative policy evaluation? (a sketch of this truncated variant follows below)
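A sketch of policy iteration with truncated evaluation on a hypothetical two-state, two-action MDP (`P[a, s, s']` and `R[a, s]` are illustrative): each round runs `k_eval` Bellman expectation sweeps, then improves the policy greedily, and stops when the greedy policy is stable.

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions.
# P[a, s, s'] = transition probability, R[a, s] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, -1.0],
              [0.5,  2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma, k_eval=5, max_rounds=1000):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)   # deterministic policy: state -> action
    v = np.zeros(n_states)
    states = np.arange(n_states)
    for _ in range(max_rounds):
        # Truncated policy evaluation: k_eval Bellman expectation backups under pi.
        for _ in range(k_eval):
            v = R[pi, states] + gamma * P[pi, states] @ v
        # Greedy policy improvement with respect to the current value estimate.
        q = R + gamma * np.einsum('asx,x->as', P, v)   # q[a, s] = one-step lookahead
        pi_new = q.argmax(axis=0)
        if np.array_equal(pi_new, pi):                 # policy stable -> stop
            break
        pi = pi_new
    return pi, v

pi_star, v_est = policy_iteration(P, R, gamma)
print(pi_star, v_est)
```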
Value Iteration
Principle of Optimality
Any optimal policy can be subdivided into two components
- An optimal first action $A_*$
- Followed by an optimal policy from successor state $S'$
Theorem
A policy $\pi (a|s)$ achieves the optimal value from state $s$, $v_{\pi} (s) = v_* (s)$, if and only if
- For any state $s'$ reachable from $s$
- $\pi$ achieves the optimal value from state $s'$, $v_{\pi}(s') = v_*(s')$
Deterministic Value Iteration
- If we know the solution to the sub-problems $v_*(s')$
- Then the solution $v_*(s)$ can be found by one-step lookahead
$$ v_*(s) \leftarrow \max_{a\in A} \left( R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a \, v_*(s') \right) $$
- The idea of value iteration is to apply these updates iteratively
- Intuition: start with the final rewards and work backwards
- Still works with loopy, stochastic MDPs
Value Iteration
- Problem: find the optimal policy $\pi_*$
- Solution: iterative application of the Bellman optimality backup
$$ v_{k+1}(s) = \max_{a\in A} \left( R_s^a + \gamma \sum_{s'\in S} P_{ss'}^a \, v_k(s') \right) $$
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
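A minimal sketch of value iteration on the same kind of hypothetical MDP (illustrative numbers): no policy is maintained during the iterations, and a greedy policy is only read off once the values have converged.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[a, s] expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, -1.0],
              [0.5,  2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Synchronous Bellman optimality backups: v_{k+1}(s) = max_a [R(a,s) + gamma * sum_s' P(a,s,s') v_k(s')]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum('asx,x->as', P, v)   # q[a, s] = one-step lookahead
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            # Greedy policy extracted only after v has (approximately) converged.
            return v_new, q.argmax(axis=0)
        v = v_new

v_star, pi_star = value_iteration(P, R, gamma)
print(v_star, pi_star)
```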
Summary of DP algorithms
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
- Algorithms are based on state-value function $v_{\pi} (s) $ or $v_* (s) $
- Complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- Could also apply to action-value function $q_{\pi} (s,a)$ or $q_* (s,a)$
- Complexity $O(m^2 n^2)$ per iteration
Extensions to DP
Asynchronous DP
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Three simple ideas for asynchronous dynamic programming:
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
In-place dynamic programming
- Synchronous value iteration stores two copies of the value function for all $s \in S$
- In-place value iteration only stores one copy of the value function for all $s \in S$ (a sketch follows below)
Prioritised sweeping
- Use the magnitude of the Bellman error to guide state selection
- Back up the state with the largest remaining Bellman error
- Update the Bellman error of affected states after each backup
- Requires knowledge of the reverse dynamics (predecessor states)
- Can be implemented efficiently by maintaining a priority queue
Real-time dynamic programming
- Idea: only back up states that are relevant to the agent
- Use the agent's experience to guide the selection of states
- After each time-step $S_t, A_t, R_{t+1}$
- Back up the state $S_t$
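A sketch of the in-place idea on the same toy MDP (illustrative numbers): a single value array is updated state by state, so later backups within a sweep already see the newer values. The sweep order here is a plain loop over states; prioritised sweeping would instead pick the next state by the magnitude of its Bellman error, e.g. via a priority queue.

```python
import numpy as np

# Hypothetical MDP as in the earlier sketches (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, -1.0],
              [0.5,  2.0]])
gamma = 0.9

def in_place_value_iteration(P, R, gamma, sweeps=200):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                     # single copy of the value function
    for _ in range(sweeps):
        for s in range(n_states):              # states may be backed up in any order
            # Bellman optimality backup for state s, written into v immediately,
            # so subsequent backups in this sweep already use the new value.
            v[s] = max(R[a, s] + gamma * P[a, s] @ v for a in range(n_actions))
    return v

print(in_place_value_iteration(P, R, gamma))
```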
Full-width and sample backups
Full-width Backups
- DP uses full-width backups
- For each backup (sync or async)
- Every successor state and action is considered
- Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers Bellman’s curse of dimensionality
- Even one backup can be too expensive
Sample Backups
- Use sample rewards and sample transitions $\langle S, A, R, S' \rangle$ instead of the reward function $R$ and transition dynamics $P$ (a sketch follows this list)
- Advantages
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of $n=|S|$
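A minimal sketch of a sampled backup in the spirit of TD(0), assuming a hypothetical simulator `env_step(s, a) -> (reward, next_state)`: only one sampled transition $\langle S, A, R, S' \rangle$ is used per backup, so its cost does not depend on $|S|$ and no model of the MDP is needed.

```python
import random

def sample_backup(v, s, env_step, policy, gamma=0.9, alpha=0.1):
    """One sampled backup: cost is constant, independent of |S|.

    env_step(s, a) -> (reward, next_state) is a hypothetical simulator;
    no transition matrix or reward function is required (model-free).
    """
    a = policy(s)
    r, s_next = env_step(s, a)
    # Move v[s] towards the sampled one-step target r + gamma * v[s'].
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return s_next

# Example usage with a toy 2-state simulator (illustrative only).
def toy_env_step(s, a):
    s_next = random.choice([0, 1])
    reward = 1.0 if s_next == 1 else 0.0
    return reward, s_next

v = [0.0, 0.0]
s = 0
for _ in range(1000):
    s = sample_backup(v, s, toy_env_step, policy=lambda s: 0)
print(v)
```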
Approximate Dynamic Programming
- Approximate the value function
- Using a function approximator $\hat{v} (s,w)$
- Apply dynamic programming to $\hat{v}(\cdot, w)$, e.g. fitted value iteration (a sketch follows below)
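A sketch of fitted value iteration under strong simplifying assumptions: a linear approximator $\hat{v}(s, w) = w^{\top}\phi(s)$, a hypothetical one-hot feature map (so the fit is exact here), and a least-squares refit of $w$ to the backed-up targets at each iteration. The MDP is the same illustrative toy example as above.

```python
import numpy as np

# Hypothetical MDP (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, -1.0],
              [0.5,  2.0]])
gamma = 0.9

def phi(s):
    """Illustrative feature map: one-hot features, so the approximation is exact here."""
    f = np.zeros(2)
    f[s] = 1.0
    return f

def fitted_value_iteration(P, R, gamma, iterations=100):
    n_actions, n_states, _ = P.shape
    w = np.zeros(2)                                   # weights of v_hat(s, w) = w . phi(s)
    features = np.array([phi(s) for s in range(n_states)])
    for _ in range(iterations):
        v_hat = features @ w
        # Bellman optimality targets computed with the current approximation.
        targets = (R + gamma * np.einsum('asx,x->as', P, v_hat)).max(axis=0)
        # Refit the weights to the targets by least squares.
        w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w

print(fitted_value_iteration(P, R, gamma))
```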
Contraction Mapping
The contraction mapping theorem resolves the convergence questions left open above: whether these iterative algorithms converge at all, whether the fixed point is unique, and how fast they converge.
Value function Space
- Consider the vector space $V$ over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function $v(s)$
- The Bellman backup brings value functions closer together
- And therefore the backups must converge on a unique solution
Bellman Expectation Backup is a Contraction
Measure the distance between value functions $u$ and $v$ by the $\infty$-norm, i.e. the largest difference between state values, $||u - v||_{\infty} = \max_{s\in S} |u(s) - v(s)|$.
- Define the Bellman expectation backup operator $T^{\pi}$,
$$ T^{\pi}(v) = R^{\pi} + \gamma P^{\pi} v $$
- This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least a factor $\gamma$:
$$ ||T^{\pi}(u) - T^{\pi}(v)||_{\infty} \le \gamma ||u - v||_{\infty} $$
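The bound follows by expanding the definition and using the fact that each row of $P^{\pi}$ sums to one:
$$ || T^{\pi}(u) - T^{\pi}(v) ||_{\infty} = || (R^{\pi} + \gamma P^{\pi} u) - (R^{\pi} + \gamma P^{\pi} v) ||_{\infty} = \gamma || P^{\pi} (u - v) ||_{\infty} \le \gamma || u - v ||_{\infty} $$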
Theorem (Contraction Mapping Theorem)
For any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,
- $T$ converges to a unique fixed point
- At a linear convergence rate of $\gamma$
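Writing $v^*$ for the fixed point of $T$, iterating the operator from any starting point $v_0$ therefore shrinks the distance to $v^*$ geometrically:
$$ || T^{k}(v_0) - v^* ||_{\infty} = || T(T^{k-1}(v_0)) - T(v^*) ||_{\infty} \le \gamma || T^{k-1}(v_0) - v^* ||_{\infty} \le \dots \le \gamma^{k} || v_0 - v^* ||_{\infty} $$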
Convergence of Iter. Policy Evaluation and Policy Iteration
- The Bellman expectation operator $T^{\pi}$ has a unique fixed point
- $v_{\pi}$ is a fixed point of $T^{\pi}$ (by Bellman expectation equation)
- By contraction mapping theorem
- Iterative policy evaluation converges on $v_{\pi}$
- Policy iteration converges on $v_*$
Bellman Optimality Backup is a Contraction
- Define the Bellman optimality backup operator $T^*$
$$ T^*(v) = \max_{a\in A} \left( R^a + \gamma P^a v \right) $$
- This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least a factor $\gamma$:
$$ ||T^*(u) - T^*(v)||_{\infty} \le \gamma ||u - v||_{\infty} $$
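The max makes the argument slightly less direct than for $T^{\pi}$; the key step is that a difference of maxima is bounded by the maximum difference, so writing $P_{s\cdot}^a$ for the row of $P^a$ at state $s$, for every state $s$
$$ \left| \max_{a\in A}\left( R_s^a + \gamma P_{s\cdot}^a u \right) - \max_{a\in A}\left( R_s^a + \gamma P_{s\cdot}^a v \right) \right| \le \max_{a\in A} \gamma \left| P_{s\cdot}^a (u - v) \right| \le \gamma || u - v ||_{\infty} $$
and taking the maximum over states gives the stated bound.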
Convergence of Value Iteration
- The Bellman optimality operator $T^*$ has a unique fixed point
- $v_*$ is a fixed point of $T^*$ (by Bellman optimality equation)
- By contraction mapping theorem
- Value iteration converges on $v_*$