Solve MDPs' equations and understand the intuition behind it leading to reinforcement learning.

## Markov Decision Processes¶

**Differences in Learning**

- Supervised learning: y = f(x)
- Given x-y pairs
- Learn function f(x)
- Function approximation

- Unsupervised learning: f(x)
- Given x's
- Learn the function f(x)
- Clustering description

- Reinforcement learning:
- Given x's and z's
- Learn function f(x)

**Markov Decision Processes**

**States: S**- They are kind of like positions on a map if you are navigating to an end point

**Model (Transition Function): T(s, a, s') ~ P(s' | s, a)**- The model is like a set of rules for a game (physics of the world).
- s: state
- a: action
- s': another state
- Probability of s' given s and a
- Simple: if you're in a deterministic world, if you go straight (a) from the toilet (s), you would go straight (s') with a probability of 1.
- Slightly complex: can be that you have a probability of only 0.8 if you choose to go up, you may go right with a probability of 0.2.

- Your potential state (s') is only dependent on your current state (s) if not you will have more s-es.
- This is a Markovian property where only the present matters.
- Also, given the present state, the future and the past are independent.

**Actions: A(s), A**- They are things you can do such as going up, down, left and right if you are navigating through a box grid.
- They can be other kinds of actions too, depending on your scenario.

**Reward: R(s), R(s, a), R(s, a, s')**- Scalar value for being in a state.
- Position 1, you'll have 100 points.
- Position 2, you'll have -10 points.

- R(s): reward for being in a state
- R(s, a): reward for being in a state and taking an action
- R(s, a, s'): reward for being in a state, taking an action and ending up in the other state
- They are mathematically equivalent.

- Scalar value for being in a state.
**Policy: π(s) → a**- This is basically the solution: what action to take in any particular state.
- It takes in a state, s, and it tells you the action, a, to take.
- π* optimizes the amount the reward at any given point in time.
- Here we have (s, a, r) triplets.
- We need to find the optimal action
- Given what we know above
- π*: function f(x)
- r: z
- s: x
- a: y

- Here we have (s, a, r) triplets.

**Visualization**

**Rewards (Domain Knowledge)**

**Delayed rewards**- You won't know what your immediate action would lead to down the road.
- Example in chess
- You did many moves
- Then you have your final reward where you did well or badly
- You could have done well at the start then worst at the end.
- You could have done badly at the start then better at the end.

- Example in chess

- You won't know what your immediate action would lead to down the road.
**Example**- R(s) = -0.01
- Reward for every state except for the 2 absorbing states is -0.01.
- s_good = +1
- s_bad = -0.01
- Each action you take, you get punished.
- You are encouraged to end the game soon.

- Minor changes to your reward function matters a lot.
- If you have a huge negative reward, you will be in a hurry.
- If you have a small negative reward, you may not be in a hurry to reach the end.

- R(s) = -0.01

**Sequence of rewards: assumptions**

**Stationary preferences**- [a1, a2, ...] > [b1, b2, ...]
- [r, a1, a2, ...] > [r, b1, b2, ...]
- You'll do the same action regardless of time.

**Inifinite horizons**- The game does not end until we reach the absorbing state.
- If we do not have an inifinite time, and a short amount of time, we may want to rush through to get a positive reward.
**Policy with finite horizon: π(s, t) → a**- This would terminate after t
- Gives non-stationary polices where states' utilities changes with time.

**Utility of sequences**- We will be adding rewards using discounted sums.
- $$ U(S_0 S_1 ...) = \sum_{i=0}^∞ γ^tR(S_t) = \frac {R_{max}}{1 - γ}$$
- $$ 0 ≤ γ < 1 $$
- If we do not multiply by gamma, all the rewards would sum to infinity.
- When we multiply by gamma, we get a geometric series that allows us to multiply an infinite number of rewards to get a finite number.
- Also, the discount factor is multiplied because sooner rewards probably have higher utility than later rewards.
- Things in the future are discounted so much they become negligible. Hence, we reach a finite number for our utility.

**Absorbing state**- Guarantee that for every policy, a terminal state will eventually be reached.

**Optimal Policy**

- $$ π^* = argmax_π \ E \ [ \sum_{i=0}^∞γ^tR(s_t) \ | \ π ] $$
- Expectation because it is non-deterministic
- π*: optimal policy

- $$ U^π(s) = E \ [ \sum_{i=0}^∞γ^tR(s_t) \ | \ π, \ s_0 = s ] $$
- Utility of a particular state depends on the policy we are taking π
- Utility is what we expect to see from then.
- This is not equal to the reward, R(s) for being in that state that is immediate.
- If you look at going to university, there's an immediate negative reward, say paying for school fees. But in the long-run you may have higher salary which is similar to U(s), a form of delayed reward.

- Utility, U(s) gives us a long-term view.

- $$ π^*(s) = argmax_a \sum_{s'}T(s, a, s') U(s') $$
- This is the optimal action to take at a state.
- For every state, returns an action that maximizes my expected utility.

- $$ U(s) = R(s) + γ \ argmax_a \sum_{s'}T(s, a, s') U(s') $$
- True utility of that state U(s): R(S) + expected utility
- Belman equation.
- Reward of that state plus the discount of all the reward from then on.
- T(s, a, s'): model.
- U(s'): utility of another state.

- Reward of that state plus the discount of all the reward from then on.
- n equations: since we've n number of states.
- n unknowns: U(s') since we've n number of states.

**Finding Policies: Value Iteration**

- We start with arbitrary utilities
- For example, 0.

- Update utilities based on neighbours (all the states it can reach).
- $$ \hat {U}_{t+1}(s) = R(s) + γ \ argmax_a \sum_{s'}T(s, a, s') \ \hat {U}_t(s') $$
- Bellman update equation.

- Repeat until convergence

**Finding Policies: Policy Iteration**

- Start with by guessing
- $$π_0$$

- Evaluate
- Given
- $$π_t$$

- Calculate
- $$u_t = u^{π_t}$$

- Given
- Improve: maximizing expected utility
- $$ π_{t+1} = argmax_a \sum_{s'}T(s, a, s') \ U_t(s') $$
- $$ U_t(s) = R(s) + γ \ \sum_{s'}T(s, a, s') U_t(s') $$