RL2 Markov Decision Processes

Reference:
UCL Course on RL

Lecture 2: Markov Decision Processes

MDPs formally describe an environment for RL in which the environment is fully observable

Markov Property
The future is independent of the past given the present
$$ P(S_{t+1} | S_{t} ) = P(S_{t+1} | S_1,S_2,\dots,S_t) $$
The state is a sufficient statistic of the future

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property

Definition
A Markov Process (Markov Chain) is a tuple (S,P)

  • S is a (finite) set of states
  • P is a state transition probability matrix, $ P_{ss'} = P[S_{t+1} = s' | S_t = s] $
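
As a concrete example, here is a minimal sketch of a three-state Markov chain represented by its transition matrix, with a helper that samples a trajectory; the state names and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical 3-state chain; row s of P gives P[S_{t+1} | S_t = s], so each row sums to 1.
states = ["sleep", "study", "pub"]
P = np.array([
    [0.5, 0.4, 0.1],   # from "sleep"
    [0.2, 0.5, 0.3],   # from "study"
    [0.3, 0.6, 0.1],   # from "pub"
])

def sample_chain(P, start=0, steps=10, seed=0):
    """Sample a state trajectory from the Markov chain (S, P)."""
    rng = np.random.default_rng(seed)
    s, traj = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # next state depends only on the current state
        traj.append(s)
    return traj

print([states[s] for s in sample_chain(P)])
```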

Markov Reward Process

A Markov reward process is a Markov chain with reward values

Definition
A Markov Reward Process is a tuple (S,P,R,γ)

  • S is a finite set of states
  • P is a state transition probability matrix
  • R is a reward function: $ R_s = E[R_{t+1}|S_t=s] $
  • γ is a discount factor, $\gamma \in [0,1]$

Return

Return
The return $G_t$ is the total discounted reward from time-step t.
$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
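
As a quick sanity check of the formula, the snippet below computes the discounted return of a short, arbitrary reward sequence (both the rewards and $\gamma$ are made up):

```python
# G_t = sum_k gamma^k * R_{t+k+1} for a finite (truncated) reward sequence.
gamma = 0.9
rewards = [1.0, 0.5, 2.0, 0.0, 1.0]   # R_{t+1}, R_{t+2}, ...

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)   # 1.0 + 0.9*0.5 + 0.81*2.0 + 0.729*0.0 + 0.6561*1.0 = 3.7261
```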

Why discount?

  1. Mathematically convenient to discount reward
  2. Avoids infinite returns in cyclic Markov processes
  3. Uncertainty about the future may not be fully represented
  4. If the reward is financial, immediate rewards may earn more interest than delayed rewards
  5. Animal/human behaviour shows preference for immediate reward
  6. It is sometimes possible to use undiscounted Markov reward processes

Value Function

The value function $v(s)$ gives the long-term value of state $s$

Definition
The state value function $v(s)$ of an MRP is the expected return starting from state s
$$ v(s) = E[G_t | S_t = s]$$

Bellman Equation for MRPs

The value function can be decomposed into two parts

  • immediate reward $R_{t+1}$
  • discounted value of successor state $\gamma v(S_{t+1})$

This decomposition gives the Bellman equation for MRPs,
$$ v(s) = E[R_{t+1} + \gamma v(S_{t+1}) | S_t = s] = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s') $$

The Bellman equation can be expressed concisely using matrices,
$$ v = R + \gamma P v $$
where $v$ is a column vector with one entry per state

Solving Bellman Equation

  • The Bellman equation is a linear equation and can be solved directly (see the sketch after this list)
  • Computational complexity is $O(n^3)$ for $n$ states
  • Direct solution only possible for small MRPs
  • Many iterative methods for large MRPs
    • Dynamic Programming
    • Monte-Carlo evaluation
    • Temporal-Difference learning
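
A minimal sketch of the direct solution, $v = (I - \gamma P)^{-1} R$, for a small made-up MRP; solving the linear system is the $O(n^3)$ step mentioned above:

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P, expected rewards R, discount gamma.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
])
R = np.array([-1.0, -2.0, 10.0])
gamma = 0.9

# v = R + gamma * P v  <=>  (I - gamma * P) v = R; solving this system is O(n^3).
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```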

Markov Decision Process

Policy

Definition
A policy $\pi$ is a distribution over actions given states,
$$ \pi(a|s) = P[A_t=a|S_t=s]$$
MDP policies depend on the current state (not the history)

Given an MDP $M = (S,A,P,R,\gamma)$ and a policy $\pi$

  • The state sequence $S_1,S_2,\dots$ is a Markov process $ (S,P^{\pi}) $
  • The state and reward sequence is a Markov reward process $ (S,P^{\pi},R^{\pi},y) $
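
A sketch of this reduction under an assumed array layout (transition probabilities stored as P[a, s, s'], rewards as R[s, a]): the induced MRP has $P_{ss'}^{\pi} = \sum_{a \in A} \pi(a|s) P_{ss'}^{a}$ and $R_{s}^{\pi} = \sum_{a \in A} \pi(a|s) R_{s}^{a}$.

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # each P[a, s, :] sums to 1
R = rng.normal(size=(n_states, n_actions))

# A uniform random policy pi[s, a].
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Induced MRP: P^pi[s, s'] = sum_a pi[s, a] * P[a, s, s'],  R^pi[s] = sum_a pi[s, a] * R[s, a]
P_pi = np.einsum("sa,asx->sx", pi, P)
R_pi = (pi * R).sum(axis=1)
print(P_pi.sum(axis=1))   # rows of P^pi still sum to 1
```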

Value function

State Value function

The state value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$ v_{\pi} (s) = E_{\pi} [G_t |S_t=s] $$

Action value function

The action value function $q_{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$ q_{\pi} (s,a) = E_{\pi} [G_t |S_t=s,A_t=a]$$

Bellman Expectation Equation

The state-value function/action-value function can be decomposed into immediate reward plus discounted value of successor state.
For $v_{\pi}$
$$ v_{\pi} (s) = \sum_{a \in A} \pi (a|s) q_{\pi} (s,a) $$

For $q_{\pi}$
$$ q_{\pi} (s,a) = R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} v_{\pi} (s') $$

For $v_{\pi}$ (2)
$$ v_{\pi} (s) = \sum_{a \in A} \pi (a|s) \left( R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} v_{\pi} (s') \right) $$

For $q_{\pi}$ (2)
$$ q_{\pi} (s,a) = R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} \sum_{a' \in A} \pi (a'|s') q_{\pi} (s',a') $$

Matrix form
$$ v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}$$
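
A minimal sketch of iterative policy evaluation, which repeatedly applies the matrix-form Bellman expectation equation $v \leftarrow R^{\pi} + \gamma P^{\pi} v$ until convergence; the 2-state, 2-action MDP below is invented for illustration.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate v <- R^pi + gamma * P^pi v until it stops changing.

    P: (A, S, S) transition probabilities, R: (S, A) rewards, pi: (S, A) policy.
    """
    P_pi = np.einsum("sa,asx->sx", pi, P)   # P^pi[s, s']
    R_pi = (pi * R).sum(axis=1)             # R^pi[s]
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
pi = np.full((2, 2), 0.5)                  # uniform random policy
print(policy_evaluation(P, R, pi))
```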

Optimal value function

Definition
The optimal state-value function $v_*(s)$ is the maximum value function over all policies, $v_*(s) = \max_{\pi} v_{\pi}(s)$
The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies, $q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$

  • The optimal value function specifies the best possible performance in the MDP.
  • An MDP is “solved” when we know the optimal value function

Optimal Policy

Define a partial ordering over policies:
$$ \pi \ge \pi' \text{ if } v_{\pi} (s) \ge v_{\pi'} (s), \ \forall s $$

Theorem
For any Markov Decision Process

  • There exists an optimal policy that is better than or equal to all other policies
  • All optimal policies achieve the optimal value function
  • All optimal policies achieve the optimal action-value function

An optimal policy can be found by maximising over the optimal action-value function $q_*(s,a)$ (see the sketch after this list)

  • There is always a deterministic optimal policy for any MDP
  • If we know the optimal action-value function $q_*$, we immediately have the optimal policy
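
A one-line sketch of this greedy policy extraction, using an arbitrary array as a stand-in for the optimal action-value function:

```python
import numpy as np

# A made-up q* table of shape (n_states, n_actions) standing in for the true optimal values.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])

pi_star = q_star.argmax(axis=1)   # greedy: pi*(s) = argmax_a q*(s, a)
print(pi_star)                    # one deterministic action per state
```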

Bellman Optimality Equation

For $v_*$

$$v_* (s) = \max_{a} q_* (s,a) $$

For $q_*$

$$ q_* (s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} v_* (s') $$

For $v_*$ (2)

$$ v_*(s) = \max_{a} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_* (s') \right) $$

For $q_*$ (2)

$$ q_* (s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_* (s',a') $$

Solving the Bellman Optimality Equation

  • Bellman Optimality Equation is non-linear
  • No closed form solution (in general)
  • Many iterative solution methods
    • value iteration (sketched after this list)
    • policy iteration
    • q-learning
    • Sarsa
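
A minimal sketch of value iteration, which repeatedly applies the Bellman optimality backup above to a made-up 2-state, 2-action MDP (same array layout as the policy evaluation sketch):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ].

    P: (A, S, S) transition probabilities, R: (S, A) rewards.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum("asx,x->sa", P, v)   # q[s, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # v* and a greedy (optimal) policy
        v = v_new

# Same hypothetical 2-state, 2-action MDP as in the policy evaluation sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```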

Extensions to MDPs

Infinite and continuous MDPs

  • Countably infinite state and/or action spaces
  • Continuous state and/or action spaces
  • Continuous time
    • Requires partial differential equations
    • Hamilton-Jacobi-Bellman (HJB) equation
    • Limiting case of the Bellman equation as the time-step → 0

Partially observable MDPs

POMDPs (Partially Observable MDPs)

A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.

Definition
A POMDP is a tuple (S,A,O,P,R,Z,γ)

  • S is a finite set of states
  • A is a finite set of actions
  • O is a finite set of observations
  • P is a state transition probability matrix
  • R is a reward function
  • Z is an observation function, $ Z_{s'o}^{a} = P[O_{t+1} = o | S_{t+1} = s', A_t = a] $
  • γ is a discount factor

Belief States

  • A history $H_t$ is a sequence of actions, observations and rewards
  • $$ H_t = A_0, O_1, R_1, \dots, A_{t-1}, O_t, R_t $$
  • A belief state $b(h)$ is a probability distribution over states conditioned on the history h
  • $$ b(h) = (P[S_t = s^1 |H_t = h], \dots, P[S_t=s^n |H_t =h]) $$
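
The slides only define the belief state; as an illustration of how it can be maintained recursively (which is why $b(H_t)$ satisfies the Markov property, as noted below), here is a standard Bayesian belief-update sketch under an assumed array layout (P[a, s, s'] for transitions, Z[a, s', o] for observations):

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """Bayes update: b'(s') is proportional to Z[a, s', o] * sum_s b(s) * P[a, s, s'].

    b: (S,) current belief, a: action taken, o: observation received,
    P: (A, S, S) transition probabilities, Z: (A, S, O) observation probabilities.
    """
    b_new = Z[a, :, o] * (b @ P[a])   # predict with P, then correct with Z
    return b_new / b_new.sum()        # renormalise to a distribution

# Hypothetical 2-state, 1-action, 2-observation POMDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]]])
Z = np.array([[[0.7, 0.3], [0.1, 0.9]]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```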

Reductions of POMDPs

  • The history $H_t$ satisfies the Markov property
  • The belief state $b(H_t)$ satisfies the Markov property

Undiscounted, average reward MDPs

Ergodic Markov Process
An ergodic Markov process is

  • Recurrent: each state is visited an infinite number of times
  • Aperiodic: each state is visited without any systematic period

Theorem
An ergodic Markov process has a limiting stationary distribution $d^{\pi} (s)$ with the property
$$ d^{\pi} (s) = \sum_{s' \in S} d^{\pi} (s') P_{s's} $$
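
A sketch of finding this stationary distribution by power iteration, $d \leftarrow d P$, on an arbitrary small ergodic chain:

```python
import numpy as np

# Hypothetical ergodic 3-state chain.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
])

d = np.ones(len(P)) / len(P)          # start from the uniform distribution
for _ in range(1000):
    d = d @ P                         # d(s') = sum_s d(s) * P[s, s']
print(d, np.allclose(d, d @ P))       # fixed point: d = d P
```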

Ergodic MDP
An MDP is ergodic if the Markov chain induced by any policy is ergodic.

For any policy $\pi$, an ergodic MDP has an average reward per time-step that is independent of start state

Average Reward Value Function
The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward.
