Reference: UCL Course on RL
Lecture 1: Introduction to Reinforcement Learning
Features of reinforcement learning
- no supervisor, only a reward signal
- feedback is delayed, not instantaneous
- time really matters (sequential, non-i.i.d. data)
- agent’s actions affect the subsequent data
Reward Hypothesis
All goals can be described by the maximisation of expected cumulative reward.
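Concretely (in the notation used later in the same course), the return G_t is the total discounted reward from time step t, with discount factor gamma in [0, 1], and the goal is to maximise its expected value:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```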
environment state and agent state
- Environment state: whatever data the environment uses to pick the next observation/reward (the environment's private representation)
- Agent state: whatever data the agent uses to pick the next action (the agent's internal representation)
Information state
- An information state (a.k.a. Markov state) contains all useful information from the history
- An information state is Markov: the future is independent of the past given the present
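As stated on the lecture slide, a state S_t is Markov if and only if:

```latex
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
```

Once the state is known, the history may be thrown away.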
Fully observable environments vs. partially observable environments
- Full observability: the agent directly observes the environment state, so agent state = environment state
- Partial observability: the agent observes the environment only indirectly, so agent state != environment state
Major Components of an RL Agent
- Policy: the agent's behaviour function (a map from state to action)
- Value function: how good each state and/or action is
- Model: the agent's representation of the environment
    - A model predicts what the environment will do next (see the definitions after this list)
    - P predicts the next state
    - R predicts the next (immediate) reward
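In the course's notation, for states s, s' and action a, the policy pi, the state-value function v_pi, and the two model components P (state transitions) and R (rewards) are:

```latex
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]
\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]
\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]
```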
Maze Example
The agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect
Categorizing RL agents
- Value based: value function, no (explicit) policy
- Policy based: policy, no value function
- Actor critic: both policy and value function
- Model free: policy and/or value function, no model
- Model based: policy and/or value function, plus a model
Learning and Planning
Two fundamental problems in sequential decision making (contrasted in the sketch after this list)
- Reinforcement learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
- Planning
- A model of the environment is known
- The agent performs computation with its model
- The agent improves its policy
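A minimal sketch of the contrast on a tiny hypothetical deterministic MDP (all names here, such as `transitions`, `rewards`, `planning_sweep`, and `q_learning_step`, are illustrative, not from the course): planning improves the policy by pure computation on a known model (one value-iteration sweep), while learning improves estimates from sampled interaction (a Q-learning update).

```python
import random

# A tiny hypothetical MDP: 2 states, 2 actions (purely illustrative).
STATES, ACTIONS = [0, 1], [0, 1]
GAMMA = 0.9
# Model: transitions[s][a] -> next state, rewards[s][a] -> reward.
transitions = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
rewards     = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

# Planning: the model is known, so the agent improves its policy by
# computation alone (one sweep of value iteration, no interaction).
def planning_sweep(V):
    return {s: max(rewards[s][a] + GAMMA * V[transitions[s][a]]
                   for a in ACTIONS)
            for s in STATES}

# Learning: the environment is initially unknown; the agent interacts,
# observes (s, a, r, s'), and improves its estimates (Q-learning update).
# The dicts above play the role of the environment responding to actions.
def q_learning_step(Q, s, alpha=0.1):
    a = random.choice(ACTIONS)                    # act (here: explore)
    r, s2 = rewards[s][a], transitions[s][a]      # environment responds
    Q[(s, a)] += alpha * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
                          - Q[(s, a)])
    return s2

V = {s: 0.0 for s in STATES}
for _ in range(50):
    V = planning_sweep(V)

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
s = 0
for _ in range(5000):
    s = q_learning_step(Q, s)
```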
Exploration and Exploitation
- Reinforcement learning is like trial-and-error learning
- The agent should discover a good policy
    - From its experience of the environment
    - Without losing too much reward along the way
- Exploration finds more information about the environment; exploitation exploits known information to maximise reward (see the sketch after this list)
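Epsilon-greedy action selection is one standard way to trade off the two: with probability epsilon take a random action (explore), otherwise take the action with the highest current value estimate (exploit). A minimal sketch, assuming a tabular `Q` keyed by (state, action) pairs (an illustrative setup, not from the lecture):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit the current value estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Larger epsilon means more exploration; epsilon is often decayed over time so the agent exploits more as its estimates improve.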
Prediction and Control
- Prediction: evaluate the future, given a policy
- Control: optimise the future, i.e. find the best policy (formalised below)
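In value-function terms: prediction estimates the value function v_pi of a given policy, while control searches for the optimal value function and policy:

```latex
\text{Prediction:}\quad v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
\qquad
\text{Control:}\quad v_*(s) = \max_\pi v_\pi(s)
```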