UCL Course on RL
Lecture 2: Markov Decision Processes
MDPs formally describe an environment for RL in which the environment is fully observable
Markov Property
The future is independent of the past given the present
$$ P(S_{t+1} | S_{t} ) = P(S_{t+1} | S_1,S_2,\dots,S_t) $$
The state is a sufficient statistic of the future
Markov Process
A Markov Process is a memoryless random process
A Markov Process (Markov Chain) is a tuple (S,P)
- S is a (finite) set of states
- P is a state transition probability matrix.
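As a minimal sketch of what "memoryless" means in practice, the snippet below samples a trajectory from a small chain. The two states and their probabilities are invented for illustration; the point is only that each step reads the current state and nothing earlier.

```python
import random

# Hypothetical two-state chain; the states and probabilities are made up.
STATES = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sample_chain(start, steps, rng):
    """Sample a trajectory of `steps` states; each step uses only the current state."""
    trajectory = [start]
    for _ in range(steps - 1):
        current = trajectory[-1]
        successors = list(P[current])
        weights = [P[current][s] for s in successors]
        trajectory.append(rng.choices(successors, weights=weights)[0])
    return trajectory

trajectory = sample_chain("sunny", 10, random.Random(0))
```

Each row of `P` sums to 1, as a transition probability matrix requires.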
Markov Reward Process
A Markov reward process is a Markov chain with values: each state carries an expected reward
A Markov Reward Process is a tuple (S,P,R,y)
- S is a finite set of states
- P is a state transition probability matrix
- R is a reward function: $ R_s = E[R_{t+1}|S_t=s] $
- y is a discount factor, $y \in [0,1]$
The return $G_t$ is the total discounted reward from time-step t.
$$ G_t = \sum_{k=0}^{\infty} y^k R_{t+k+1}$$
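A small sketch of this sum for a finite reward sequence (an assumption: the infinite sum is truncated, which is exact only if all later rewards are zero). The helper name `discounted_return` is illustrative.

```python
def discounted_return(rewards, y):
    """Return sum_k y**k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Work backwards, using G_t = R_{t+1} + y * G_{t+1}.
    for r in reversed(rewards):
        g = r + y * g
    return g

g = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
```

With y close to 0 the return is dominated by the first reward; with y close to 1 later rewards count almost as much as immediate ones.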
Why discount?
- Mathematically convenient to discount reward
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes
Value Function
The value function gives the long-term value of state s
The state value function $v(s)$ of an MRP is the expected return starting from state s
$$ v(s) = E[G_t | S_t = s]$$
Bellman Equation for MRPs
The value function can be decomposed into two parts
- immediate reward $R_{t+1}$
- discounted value of successor state $y v(S_{t+1})$
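The decomposition can be derived from the definitions above by splitting the first reward off the return; the last step uses the Markov property and the law of total expectation:

$$ v(s) = E[G_t | S_t = s] = E[R_{t+1} + y G_{t+1} | S_t = s] = E[R_{t+1} + y v(S_{t+1}) | S_t = s] $$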
The Bellman equation can be expressed concisely using matrices,
$$ v = R + y P v $$
where $v$ is a column vector with one entry per state
Solving Bellman Equation
- The Bellman equation is a linear equation and can be solved directly: $v = (I - yP)^{-1} R$
- Computational complexity is $O(n^3)$ for $n$ states
- Direct solution only possible for small MRPs
- Many iterative methods for large MRPs
  - Dynamic Programming
  - Monte-Carlo evaluation
  - Temporal-Difference learning
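The direct solution can be sketched for a tiny MRP by rearranging $v = R + yPv$ into $(I - yP)v = R$ and solving the linear system. The Gaussian elimination below is a plain-Python stand-in for a linear algebra library, and the two-state MRP is hypothetical. The triple loop in the elimination is where the $O(n^3)$ cost comes from.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def mrp_values(P, R, y):
    """Solve (I - y P) v = R for the state values v."""
    n = len(R)
    a = [[(1.0 if i == j else 0.0) - y * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(a, R)

# Hypothetical MRP: state 0 pays reward 1, state 1 is absorbing with reward 0.
P = [[0.5, 0.5], [0.0, 1.0]]
R = [1.0, 0.0]
v = mrp_values(P, R, 0.9)
```

Here state 1 has value 0, and state 0 satisfies $v_0 = 1 + 0.9 \cdot 0.5 \cdot v_0$, so $v_0 = 1/0.55$.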
Markov Decision Process
A policy $\pi$ is a distribution over actions given states,
$$ \pi(a|s) = P[A_t=a|S_t=s]$$
MDP policies depend on the current state (not the history)
Given an MDP M =(S,A,P,R,y) and a policy $\pi$
- The state sequence $S_1,S_2,\dots$ is a Markov process $ (S,P^{\pi}) $
- The state and reward sequence is a Markov reward process $ (S,P^{\pi},R^{\pi},y) $
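A sketch of how a policy collapses an MDP into an MRP, assuming the usual averaging definitions $P^{\pi}_{ss'} = \sum_a \pi(a|s) P^a_{ss'}$ and $R^{\pi}_s = \sum_a \pi(a|s) R^a_s$, which these notes do not spell out. The array layout and the two-state, two-action MDP are hypothetical.

```python
def induced_mrp(P, R, pi):
    """Average the MDP dynamics and rewards over the policy.

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    R[a][s] is the expected reward for taking action a in state s,
    pi[s][a] is the probability of choosing action a in state s.
    """
    n_actions, n_states = len(P), len(P[0])
    P_pi = [[sum(pi[s][a] * P[a][s][s2] for a in range(n_actions))
             for s2 in range(n_states)] for s in range(n_states)]
    R_pi = [sum(pi[s][a] * R[a][s] for a in range(n_actions))
            for s in range(n_states)]
    return P_pi, R_pi

# Hypothetical MDP with two states and two actions (0 = "stay", 1 = "switch").
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 keeps the current state
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 moves to the other state
]
R = [[0.0, 1.0], [0.5, 0.0]]
pi = [[0.5, 0.5], [1.0, 0.0]]  # uniform in state 0, always "stay" in state 1

P_pi, R_pi = induced_mrp(P, R, pi)
```

The result `(P_pi, R_pi)` can be treated like any other MRP, so the matrix form of the Bellman equation applies to it unchanged.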
Value function
State Value function
The state value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$ v_{\pi} (s) = E_{\pi} [G_t |S_t=s] $$
Action value function
The action value function $q_{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$ q_{\pi} (s,a) = E_{\pi} [G_t |S_t=s,A_t=a]$$
Bellman Expectation Equation
The state-value function/action-value function can be decomposed into immediate reward plus discounted value of successor state.
For $v_{\pi}$
$$ v_{\pi} (s) = \sum_{a \in A} \pi (a|s) q_{\pi} (s,a) $$
For $q_{\pi}$
$$ q_{\pi} (s,a) = R_{s}^{a} + y \sum_{s' \in S} P_{ss'}^{a} v_{\pi} (s') $$
For $v_{\pi}$ (2)
$$ v_{\pi} (s) = \sum_{a \in A} \pi (a|s) \left( R_{s}^{a} + y \sum_{s' \in S} P_{ss'}^{a} v_{\pi} (s') \right) $$
For $q_{\pi}$ (2)
$$ q_{\pi} (s,a) = R_{s}^{a} + y \sum_{s' \in S} P_{ss'}^{a} \sum_{a' \in A} \pi (a'|s') q_{\pi} (s',a') $$
Matrix form
$$ v_{\pi} = R^{\pi} + yP^{\pi} v_{\pi}$$
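One way to see these equations at work is to apply them repeatedly as an update until the values stop changing (iterative policy evaluation). This is a sketch under assumptions: the array layout, the fixed number of sweeps, and the small MDP are all illustrative choices, not taken from the lecture.

```python
def evaluate_policy(P, R, pi, y, sweeps=1000):
    """Repeat the Bellman expectation backup, starting from v = 0.

    P[a][s][s2], R[a][s] and pi[s][a] are the transition probabilities,
    expected rewards and policy probabilities (hypothetical layout).
    """
    n_actions, n_states = len(P), len(P[0])
    v = [0.0] * n_states
    for _ in range(sweeps):
        # q(s,a) = R_s^a + y * sum_s' P_ss'^a v(s')
        q = [[R[a][s] + y * sum(P[a][s][s2] * v[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        # v(s) = sum_a pi(a|s) q(s,a)
        v = [sum(pi[s][a] * q[s][a] for a in range(n_actions))
             for s in range(n_states)]
    return v, q

# Hypothetical MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [0.5, 0.0]]
pi = [[0.5, 0.5], [1.0, 0.0]]

v, q = evaluate_policy(P, R, pi, 0.9)
```

For this policy, state 1 collects reward 1 forever, so $v_{\pi}(1) = 1/(1-0.9) = 10$, and state 0 satisfies $v_{\pi}(0) = 0.25 + 0.9(0.5\,v_{\pi}(0) + 0.5 \cdot 10)$.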
Optimal value function
The optimal state-value function $v_*(s)$ is the maximum value function over all policies
$$ v_*(s) = \max_{\pi} v_{\pi}(s) $$
The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies
$$ q_*(s,a) = \max_{\pi} q_{\pi}(s,a) $$
- The optimal value function specifies the best possible performance in the MDP.
- An MDP is “solved” when we know the optimal value function
Optimal Policy
Define a partial ordering over policies
$$ \pi \ge \pi' \text{ if } v_{\pi} (s) \ge v_{\pi'} (s), \ \forall s $$
For any Markov Decision Process
- There exists an optimal policy that is better than or equal to all other policies
- All optimal policies achieve the optimal value function
- All optimal policies achieve the optimal action-value function
An optimal policy can be found by maximising over the optimal action-value function $q_*(s,a)$
- There is always a deterministic optimal policy for any MDP
- If we know $q_*(s,a)$, we immediately have the optimal policy
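Reading a deterministic policy off the optimal action values amounts to an argmax per state. A minimal sketch, where the table `q_star` holds made-up numbers that are simply assumed to be optimal:

```python
def greedy_policy(q):
    """In each state pick an action with the largest q-value (ties go to the lowest index)."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]

# Hypothetical values q_star[s][a], assumed to be the optimal action values.
q_star = [[1.0, 2.5], [4.0, 3.0], [0.0, 0.0]]
policy = greedy_policy(q_star)  # [1, 0, 0]
```

No model of the transitions or rewards is needed for this step, which is why knowing the optimal q function is enough to act optimally.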
Bellman Optimality Equation
For $v_*$
$$ v_* (s) = \max_{a} q_* (s,a) $$
For $q_*$
$$ q_* (s,a) = R_s^a + y \sum_{s' \in S} P_{ss'}^{a} v_* (s') $$
For $v_*$ (2)
$$ v_*(s) = \max_{a} \left( R_s^a + y \sum_{s' \in S} P_{ss'}^a v_* (s') \right) $$
For $q_*$ (2)
$$ q_* (s,a) = R_s^a + y \sum_{s' \in S} P_{ss'}^a \max_{a'} q_* (s',a') $$
Solving the Bellman Optimality Equation
- Bellman Optimality Equation is non-linear
- No closed form solution (in general)
- Many iterative solution methods
  - Value iteration
  - Policy iteration
  - Q-learning
  - Sarsa
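As an illustration of the first method in the list, value iteration can be sketched as repeatedly applying the Bellman optimality backup for $v_*$ until the values stop changing. The array layout, the stopping tolerance, and the small MDP are assumptions made for this example.

```python
def value_iteration(P, R, y, tol=1e-10):
    """Iterate v(s) <- max_a (R_s^a + y * sum_s' P_ss'^a v(s')) to convergence.

    P[a][s][s2] and R[a][s] use a hypothetical layout: action first, then state.
    """
    n_actions, n_states = len(P), len(P[0])
    v = [0.0] * n_states
    while True:
        new_v = [max(R[a][s] + y * sum(P[a][s][s2] * v[s2]
                                       for s2 in range(n_states))
                     for a in range(n_actions))
                 for s in range(n_states)]
        # Stop once no state value moved by more than the tolerance.
        if max(abs(x - z) for x, z in zip(new_v, v)) < tol:
            return new_v
        v = new_v

# Hypothetical MDP: action 0 stays, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 1.0], [0.5, 0.0]]

v_star = value_iteration(P, R, 0.9)
```

In this MDP the best plan is to reach state 1 and stay there, giving $v_*(1) = 10$ and $v_*(0) = 0.5 + 0.9 \cdot 10 = 9.5$. The `max` is what makes the equation non-linear, so the linear solve used for MRPs no longer applies.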
Extensions to MDPs
Infinite and continuous MDPs
- Countably infinite state and/or action spaces
- Continuous state and/or action spaces
- Continuous time
  - Requires partial differential equations
  - Hamilton-Jacobi-Bellman (HJB) equation
  - Limiting case of Bellman equation as time-step $\to 0$
Partially observable MDPs
POMDPs (Partially Observable MDPs)
A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.
A POMDP is a tuple (S,A,O,P,R,Z,y)
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix
- R is a reward function
- Z is an observation function
- y is a discount factor
Belief States
- A history $H_t$ is a sequence of actions, observations and rewards
- $$ H_t = A_0, O_1, R_1, \dots, A_{t-1}, O_t, R_t $$
- A belief state $b(h)$ is a probability distribution over states conditioned on the history h
- $$ b(h) = (P[S_t = s^1 |H_t = h], \dots, P[S_t=s^n |H_t =h]) $$
Reductions of POMDPs
- The history $H_t$ satisfies the Markov property
- The belief state $b(H_t)$ satisfies the Markov property
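The belief state is Markov because the next belief can be computed from the current belief, the action taken and the observation received, without revisiting the history. The sketch below uses the standard Bayes filter update, which these notes do not state. It assumes `P[a][s][s2]` is the transition probability and `Z[a][s2][o]` is the probability of observing `o` after action `a` lands in state `s2`; both layouts and all numbers are hypothetical.

```python
def update_belief(b, a, o, P, Z):
    """Bayes update: b'(s2) is proportional to Z[a][s2][o] * sum_s P[a][s][s2] * b[s]."""
    n = len(b)
    unnormalised = [Z[a][s2][o] * sum(P[a][s][s2] * b[s] for s in range(n))
                    for s2 in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [x / total for x in unnormalised]

# Hypothetical POMDP fragment: one action that leaves the hidden state unchanged,
# and a noisy sensor with two possible observations.
P = [[[1.0, 0.0], [0.0, 1.0]]]
Z = [[[0.8, 0.2], [0.3, 0.7]]]

belief = update_belief([0.5, 0.5], 0, 0, P, Z)
```

Starting from a uniform belief, observation 0 is more likely in state 0, so the belief shifts towards state 0.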
Undiscounted, average reward MDPs
Ergodic Markov Process
An ergodic Markov process is
- Recurrent: each state is visited an infinite number of times
- Aperiodic: each state is visited without any systematic period
An ergodic Markov process has a limiting stationary distribution $d^{\pi} (s)$ with the property
$$ d^{\pi} (s) = \sum_{s' \in S} d^{\pi} (s') P_{s's} $$
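For a small ergodic chain, the stationary distribution can be approximated by applying this equation repeatedly to any starting distribution. A minimal sketch, where the two-state matrix and the fixed number of sweeps are arbitrary choices:

```python
def stationary_distribution(P, sweeps=1000):
    """Repeat d(s) <- sum_s2 d(s2) * P[s2][s], starting from the uniform distribution."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(sweeps):
        d = [sum(d[s2] * P[s2][s] for s2 in range(n)) for s in range(n)]
    return d

# Hypothetical ergodic chain with two states.
P = [[0.9, 0.1], [0.5, 0.5]]
d = stationary_distribution(P)
```

For this matrix the fixed point is $d = (5/6, 1/6)$, which can be checked by substituting it into the equation above.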
Ergodic MDP
An MDP is ergodic if the Markov chain induced by any policy is ergodic.
For any policy $\pi$, an ergodic MDP has an average reward per time-step that is independent of start state
Average Reward Value Function
The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward.
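One common way to write this is given below. The symbols $\rho^{\pi}$ (average reward per time-step) and $\tilde{v}_{\pi}$ (the value relative to that average) are notation assumed here and are not defined elsewhere in these notes:

$$ \rho^{\pi} = \lim_{T \to \infty} \frac{1}{T} E\left[ \sum_{t=1}^{T} R_t \right] $$
$$ \tilde{v}_{\pi} (s) = E_{\pi} \left[ \sum_{k=1}^{\infty} (R_{t+k} - \rho^{\pi}) | S_t = s \right] $$

Here $\tilde{v}_{\pi}(s)$ measures the extra reward collected by starting in state $s$ compared with the long-run average.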