Lecture 1: Reinforcement learning
- Stimulus $u_i$
- Reward $r_i$
- Expected reward: $v_i ≝ w u_i$
- Prediction error: $δ_i = r_i - v_i$
- Loss: $L_i ≝ δ_i^2$
Rescorla-Wagner Rule (aka “delta rule”): \(w ← w + ε δ_i u_i\)
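A minimal sketch of this update in Python (the learning rate `eps` and the toy stimulus/reward sequence are illustrative choices, not from the lecture):

```python
def rescorla_wagner(stimuli, rewards, eps=0.1):
    """Delta-rule learning of the association weight w."""
    w = 0.0
    for u, r in zip(stimuli, rewards):
        v = w * u             # expected reward v_i = w * u_i
        delta = r - v         # prediction error delta_i = r_i - v_i
        w += eps * delta * u  # delta rule: w <- w + eps * delta_i * u_i
    return w

# Toy run: the stimulus is always present (u = 1) and always rewarded (r = 1),
# so w converges towards 1.
w_hat = rescorla_wagner(stimuli=[1.0] * 100, rewards=[1.0] * 100)
```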
Reinforcement learning policies:
- greedy policy
- $ε$-greedy policy
- softmax (Gibbs) policy:
\[p(b) = \frac{\exp(r_b)}{\exp(r_b)+\exp(r_y)}\\ p(y) = \frac{\exp(r_y)}{\exp(r_b)+\exp(r_y)}\]
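A sketch of the three action-selection rules, acting on current reward estimates $m_b, m_y$ (the formula above is written with $r_b, r_y$ directly); the function names and the value of $ε$ are mine:

```python
import math
import random

def greedy(m):
    """Greedy policy: always pick the option with the largest estimate."""
    return max(m, key=m.get)

def epsilon_greedy(m, eps=0.1):
    """epsilon-greedy: explore uniformly with probability eps, else act greedily."""
    if random.random() < eps:
        return random.choice(list(m))
    return greedy(m)

def softmax_gibbs(m):
    """Softmax/Gibbs policy: P(a) proportional to exp(m_a)."""
    z = sum(math.exp(v) for v in m.values())
    weights = [math.exp(v) / z for v in m.values()]
    return random.choices(list(m), weights=weights, k=1)[0]

# Two options b and y with current estimates m_b, m_y:
m = {"b": 1.5, "y": 0.5}
chosen = softmax_gibbs(m)
```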
Greedy strategy

Greedy update:
\[m_b = r_{b,i}\\ m_y = r_{y,i}\]

Batch update:
\[m_b = \frac 1 N \sum\limits_{ i=1 }^N r_{b,i}\\ m_y = \frac 1 N \sum\limits_{ i=1 }^N r_{y,i}\]

Online update (inspired by supervised learning):
\[m_b ← m_b + ε(r_{b,i}-m_b)\]
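A sketch of the online update for one option, on a hypothetical stream of rewards:

```python
def online_mean(rewards, eps=0.1):
    """Online update m <- m + eps * (r - m) of the reward estimate."""
    m = 0.0
    for r in rewards:
        m += eps * (r - m)   # move m a fraction eps towards the latest reward
    return m

# Hypothetical reward stream for option b:
m_b = online_mean([1.0, 0.0, 1.0, 1.0, 0.0])
```

With a decaying step size $ε = 1/i$ this recovers the batch average exactly; a constant $ε$ weights recent rewards more heavily, which is useful when the reward statistics change over time.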
Key idea of reinforcement learning

You have
- an environment $S ≝ \lbrace s_1, ⋯ \rbrace$
- a set of actions $𝒜 ≝ \lbrace a_1, ⋯ \rbrace$
- a reward $r ∈ ℝ$
Define
- a policy $π(s, a) = P(a \mid s)$ (the probability of executing action $a$ given state $s$)
Optimal policy: $π^\ast$
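A minimal sketch of these ingredients in Python (the concrete states, actions and rewards are placeholders, not from the lecture):

```python
import random

# Placeholder states, actions and rewards, just to fix the types:
S = ["s1", "s2"]
A = ["a1", "a2"]
r = {"s1": 0.0, "s2": 1.0}

# A stochastic policy pi(s, a) = P(a | s), stored as a table, here uniform:
pi = {s: {a: 1.0 / len(A) for a in A} for s in S}

def act(pi, s):
    """Sample an action a ~ pi(s, .)."""
    actions = list(pi[s])
    return random.choices(actions, weights=[pi[s][a] for a in actions], k=1)[0]
```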
Rat Maze problem
A rat is in a maze (a binary tree of height 2): at each intersection, it has to choose between going left or going right. At the end, there is a reward, which the rat wants to maximize.
- States: $S ≝ \lbrace A, B, C, D, E, F, G \rbrace$
- Rewards: $r(s)$
- Actions: $𝒜 ≝ \lbrace \text{left}, \text{right} \rbrace$
- Policy: $P(a \mid s)$
⇒ Markovian process
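A sketch of the maze as a Markov process, assuming $A$ is the entry, $B$ and $C$ the two intersections reached from it, and $D$–$G$ the leaves; the reward values are made up for illustration:

```python
import random

# Transitions of the maze: state -> {action: next state}
transitions = {
    "A": {"left": "B", "right": "C"},   # entry
    "B": {"left": "D", "right": "E"},   # left intersection
    "C": {"left": "F", "right": "G"},   # right intersection
}

# Rewards r(s); only the leaves carry reward (the numbers are illustrative).
rewards = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 5.0, "F": 2.0, "G": 0.0}

def run_episode(policy):
    """Run the rat from the entry A to a leaf, following the given policy."""
    s, total = "A", 0.0
    while s in transitions:
        a = random.choices(["left", "right"],
                           weights=[policy[s]["left"], policy[s]["right"]], k=1)[0]
        s = transitions[s][a]
        total += rewards[s]
    return total

# Uniform random policy at every intersection:
uniform = {s: {"left": 0.5, "right": 0.5} for s in transitions}
total_reward = run_episode(uniform)
```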
- Value of a state $V(s)$: sum of all future rewards
- Generalized Policy Iteration theorem
- Monte Carlo Policy Evaluation ($TD(1)$)
- Temporal-difference ($TD(0)$) learning:
\[V(s) ← V(s) + εδ\\ \text{where } δ = r(s) + \hat{V}(s') - \hat{V}(s)\]
- Generalization: $TD(λ)$, for $λ ∈ [0, 1]$
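A sketch of $TD(0)$ evaluation of the uniform random policy on the maze above (it reuses the hypothetical `transitions`, `rewards` and `uniform` from the earlier sketch; $ε$ and the episode count are arbitrary):

```python
import random

def td0_evaluate(policy, transitions, rewards, eps=0.1, episodes=1000):
    """TD(0): V(s) <- V(s) + eps * delta, with delta = r(s) + V(s') - V(s)."""
    V = {s: 0.0 for s in rewards}                # one value estimate per state
    for _ in range(episodes):
        s = "A"
        while True:
            if s in transitions:                 # intersection: pick left/right
                a = random.choices(["left", "right"],
                                   weights=[policy[s]["left"], policy[s]["right"]],
                                   k=1)[0]
                s_next = transitions[s][a]
                v_next = V[s_next]
            else:                                # leaf: episode ends, V(s') = 0
                s_next, v_next = None, 0.0
            delta = rewards[s] + v_next - V[s]   # TD error
            V[s] += eps * delta                  # delta-rule update of V(s)
            if s_next is None:
                break
            s = s_next
    return V

# Evaluate the uniform random policy on the maze defined above:
V_uniform = td0_evaluate(uniform, transitions, rewards)
```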