Solve the equations of MDPs and understand the intuition behind them, leading into reinforcement learning.
Markov Decision Processes

Differences in Learning

• Supervised learning: y = f(x)
• Given many x-y pairs
• Learn the function f that maps x to y
• Function approximation
• Unsupervised learning: f(x)
• Given only x's
• Learn a function f that gives a compact description of the x's
• Clustering is the classic example
• Reinforcement learning:
• Given x's and z's (reinforcement signals, i.e., rewards)
• Learn the function f(x) that produces the right y's (actions)

Markov Decision Processes

• States: S
• They are like positions on a map when you are navigating to an end point.
• Model (Transition Function): T(s, a, s') ~ P(s' | s, a)
• The model is like a set of rules for a game (physics of the world).
• s: state
• a: action
• s': another state
• Probability of s' given s and a
• Simple case: in a deterministic world, if you go straight (a) from where you are (s), you end up in the state straight ahead (s') with probability 1.
• Slightly more complex: in a stochastic world, if you choose to go up, you might actually go up with probability only 0.8 and slip right with probability 0.2.
• Your next state (s') depends only on your current state (s); otherwise you would have to condition on more s-es (the whole history).
• This is a Markovian property where only the present matters.
• Also, given the present state, the future and the past are independent.
• Actions: A(s), A
• They are things you can do such as going up, down, left and right if you are navigating through a box grid.
• They can be other kinds of actions too, depending on your scenario.
• Reward: R(s), R(s, a), R(s, a, s')
• Scalar value for being in a state.
• Position 1, you'll have 100 points.
• Position 2, you'll have -10 points.
• R(s): reward for being in a state
• R(s, a): reward for being in a state and taking an action
• R(s, a, s'): reward for being in a state, taking an action and ending up in the next state
• They are mathematically equivalent.
• Policy: π(s) → a
• This is basically the solution: what action to take in any particular state.
• It takes in a state, s, and it tells you the action, a, to take.
• π* is the optimal policy: the one that maximizes the expected long-term reward.
• In RL we observe (s, a, r) triplets.
• We need to find the optimal action for each state.
• Mapping this onto the learning setups above:
• π*: the function f(x) we want to learn
• r: z
• s: x
• a: y
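
To make these pieces concrete, here is a minimal sketch of a grid-world MDP in Python. The grid size, the terminal positions, and all the names here (`GridMDP`, `step_reward`, and so on) are illustrative assumptions, not from the notes; the 0.8/0.2 slip model follows the example above.

```python
# A minimal sketch of an MDP as described above: states S, actions A,
# transition model T(s, a, s'), and reward function R(s).

class GridMDP:
    def __init__(self, width=4, height=3, step_reward=-0.01):
        self.width, self.height = width, height
        self.states = [(x, y) for x in range(width) for y in range(height)]
        self.actions = ["up", "down", "left", "right"]
        # Two absorbing states: a good one (+1) and a bad one (-1).
        self.terminals = {(3, 2): +1.0, (3, 1): -1.0}
        self.step_reward = step_reward

    def R(self, s):
        """R(s): scalar reward for being in state s."""
        return self.terminals.get(s, self.step_reward)

    def T(self, s, a):
        """Transition model as a list of (probability, s') pairs,
        i.e. P(s' | s, a). The chosen action succeeds with probability
        0.8; with probability 0.2 the agent slips 90 degrees clockwise."""
        if s in self.terminals:
            return [(1.0, s)]  # absorbing: stay put
        slip = {"up": "right", "right": "down", "down": "left", "left": "up"}
        return [(0.8, self._move(s, a)), (0.2, self._move(s, slip[a]))]

    def _move(self, s, a):
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[a]
        nx, ny = s[0] + dx, s[1] + dy
        # Bumping into the grid boundary leaves the state unchanged.
        return (nx, ny) if 0 <= nx < self.width and 0 <= ny < self.height else s
```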
• Visualization: (grid-world figure omitted)

Rewards (Domain Knowledge)

• Delayed rewards
• You don't immediately know what your current action will lead to down the road.
• Example in chess
• You did many moves
• Then you have your final reward where you did well or badly
• You could have done well at the start and then worse at the end.
• You could have done badly at the start and then better at the end.
• Example
• R(s) = -0.01
• The reward for every state except the 2 absorbing states is -0.01.
• s_good = +1
• s_bad = -1
• Each action you take, you get punished.
• You are encouraged to end the game soon.
• Minor changes to your reward function matter a lot.
• If you have a huge negative reward, you will be in a hurry.
• If you have a small negative reward, you may not be in a hurry to reach the end.
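
As a rough worked example (the step counts and the larger penalty here are assumed for illustration): suppose a safe route to the +1 goal takes 10 steps and a risky shortcut takes 4. With R(s) = -0.01 per step, the safe route nets 1 - 10(0.01) = 0.90 and the shortcut nets 1 - 4(0.01) = 0.96, a small gap, so safety can win once slip risk is factored in. With R(s) = -0.5 per step, the safe route nets 1 - 5 = -4 versus 1 - 2 = -1 for the shortcut, and the agent is pushed hard toward the risky path.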

Sequence of rewards: assumptions

• Stationary preferences
• If [a1, a2, ...] > [b1, b2, ...]
• Then [r, a1, a2, ...] > [r, b1, b2, ...]
• If you prefer one sequence of rewards over another today, prepending the same reward to both does not change the preference; you would choose the same way regardless of time.
• Infinite horizons
• The game does not end until we reach an absorbing state.
• If time were finite and short instead, we might rush to grab a positive reward rather than take a safer, longer route.
• Policy with finite horizon: π(s, t) → a
• This would terminate after time t.
• It gives non-stationary policies, where a state's utility changes with how much time is left.
• Utility of sequences
• We will be adding rewards using discounted sums.
• $$U(s_0, s_1, \ldots) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \frac{R_{\max}}{1 - \gamma}$$
• $$0 \le \gamma < 1$$
• Without the discount factor gamma, the rewards could sum to infinity.
• Multiplying by gamma turns the sum into a geometric series, which lets us add an infinite number of rewards and still get a finite number (see the quick check after this list).
• Also, the discount factor is multiplied because sooner rewards probably have higher utility than later rewards.
• Things in the future are discounted so much they become negligible. Hence, we reach a finite number for our utility.
• Absorbing state
• Guarantee that for every policy, a terminal state will eventually be reached.
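
A quick numerical sanity check of that geometric bound (a sketch; the value of gamma and the truncation horizon are arbitrary choices):

```python
# Check that a discounted sum of bounded rewards stays finite:
# sum_{t=0}^{inf} gamma^t * R_max == R_max / (1 - gamma).
gamma, R_max = 0.9, 1.0

# Truncate the infinite sum; gamma^t shrinks geometrically, so the
# tail beyond a few hundred terms is negligible.
truncated = sum(gamma**t * R_max for t in range(1000))
closed_form = R_max / (1 - gamma)

print(truncated)    # ~10.0
print(closed_form)  # 10.0
```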

Optimal Policy

• $$\pi^* = \arg\max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \pi \right]$$
• Expectation because it is non-deterministic
• π*: optimal policy
• $$U^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \pi, \; s_0 = s \right]$$
• Utility of a particular state depends on the policy we are taking π
• Utility is the reward we expect to accumulate from that state onward.
• It is not the same as the reward R(s) for being in that state, which is immediate.
• If you look at going to university, there's an immediate negative reward, say paying for school fees. But in the long-run you may have higher salary which is similar to U(s), a form of delayed reward.
• Utility, U(s) gives us a long-term view.
• $$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U(s')$$
• This is the optimal action to take in a state.
• For every state, it returns the action that maximizes the expected utility.
• $$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$$
• This is the Bellman equation.
• True utility of a state U(s): its immediate reward R(s) plus the discounted utility of the best we can do from then on (note the max over actions here, since we want the value of the best action, not the action itself).
• T(s, a, s'): model.
• U(s'): utility of another state.
• n equations: one per state.
• n unknowns: the utilities U(s), one per state.
• But the max makes these equations nonlinear, so we cannot solve them directly as a linear system; instead we solve them iteratively.

Finding Policies: Value Iteration

1. Start with arbitrary utilities (for example, 0).
2. Update utilities based on neighbours (all the states it can reach).
• $$\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, \hat{U}_t(s')$$
• Bellman update equation.
3. Repeat until convergence
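
A minimal sketch of this loop in Python, assuming the hypothetical `GridMDP` from earlier is in scope (the gamma and tolerance values are arbitrary choices):

```python
def value_iteration(mdp, gamma=0.9, tol=1e-6):
    """Repeat the Bellman update until utilities converge:
    U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') U_t(s')."""
    U = {s: 0.0 for s in mdp.states}  # step 1: arbitrary initial utilities
    while True:
        U_next = {}
        for s in mdp.states:
            if s in mdp.terminals:
                U_next[s] = mdp.R(s)  # absorbing states keep their reward
                continue
            best = max(sum(p * U[s2] for p, s2 in mdp.T(s, a))
                       for a in mdp.actions)
            U_next[s] = mdp.R(s) + gamma * best  # step 2: Bellman update
        # Step 3: repeat until the largest change falls below tolerance.
        if max(abs(U_next[s] - U[s]) for s in mdp.states) < tol:
            return U_next
        U = U_next

def extract_policy(mdp, U):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') U(s')."""
    return {s: max(mdp.actions,
                   key=lambda a: sum(p * U[s2] for p, s2 in mdp.T(s, a)))
            for s in mdp.states}
```

Running this with `step_reward=-0.01` versus a much larger penalty makes the earlier point visible: the bigger the per-step punishment, the more the extracted policy favors short, risky routes to the exit.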

Finding Policies: Policy Iteration

• Start with an arbitrary initial policy $$\pi_0$$.
• Evaluate: given the current policy $$\pi_t$$, calculate the utility of following it, $$U_t = U^{\pi_t}$$:
• $$U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, U_t(s')$$
• With the action fixed by the policy there is no max, so these are n linear equations in n unknowns and can be solved directly.
• Improve: make the new policy greedy with respect to $$U_t$$:
• $$\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U_t(s')$$
• Repeat until the policy stops changing.
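
A sketch of that evaluate/improve loop, again assuming the hypothetical `GridMDP` from earlier; the evaluation step solves the linear system directly with NumPy:

```python
import numpy as np

def policy_iteration(mdp, gamma=0.9):
    states = list(mdp.states)
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: mdp.actions[0] for s in states}  # pi_0: arbitrary policy
    while True:
        # Evaluate: with actions fixed to pi_t(s), the Bellman equation
        # is linear, so solve (I - gamma * T_pi) U = R for U directly.
        n = len(states)
        T_pi = np.zeros((n, n))
        R = np.array([mdp.R(s) for s in states])
        for s in states:
            if s in mdp.terminals:
                continue  # absorbing: U(s) = R(s)
            for p, s2 in mdp.T(s, pi[s]):
                T_pi[idx[s], idx[s2]] += p
        U_vec = np.linalg.solve(np.eye(n) - gamma * T_pi, R)
        U = {s: U_vec[idx[s]] for s in states}
        # Improve: greedy one-step lookahead on U_t.
        new_pi = {s: max(mdp.actions,
                         key=lambda a: sum(p * U[s2]
                                           for p, s2 in mdp.T(s, a)))
                  for s in states}
        if new_pi == pi:  # policy stopped changing: it is optimal
            return pi, U
        pi = new_pi
```

Solving the linear system exactly is one design choice; another common variant runs a few sweeps of iterative evaluation instead, which trades exactness for cheaper iterations.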