Lecture 1: Reinforcement learning

  • Stimulus $u_i$
  • Reward $r_i$
  • Expected reward: $v_i ≝ w u_i$
  • Prediction error: $δ_i = r_i - v_i$
  • Loss: $L_i ≝ δ_i^2$

Rescorla-Wagner Rule (aka “delta rule”): $w ← w + ε δ_i u_i$
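The delta rule above can be sketched as a simple loop over trials. This is an illustrative sketch; the learning rate `eps` and the trial data are my choices, not from the lecture.

```python
def rescorla_wagner(trials, eps=0.1):
    """Learn an association weight w from (stimulus u_i, reward r_i) trials."""
    w = 0.0
    for u, r in trials:
        v = w * u          # expected reward: v_i = w u_i
        delta = r - v      # prediction error: delta_i = r_i - v_i
        w += eps * delta * u
    return w

# With the stimulus always present (u = 1) and reward 1, w converges toward 1:
w = rescorla_wagner([(1.0, 1.0)] * 100)
```

Each update descends the squared-error loss $L_i = δ_i^2$, which is why the same rule appears in supervised learning.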

Reinforcement learning:

  • greedy policy

  • $ε$-greedy policy

  • softmax Gibbs-policy

Softmax-Gibbs Policy:
$$p(b) = \frac{\exp(r_b)}{\exp(r_b)+\exp(r_y)} \qquad p(y) = \frac{\exp(r_y)}{\exp(r_b)+\exp(r_y)}$$
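A minimal sketch of these two probabilities in code. The inverse-temperature parameter `beta` is an addition of mine (the formulas above correspond to `beta = 1`); it is commonly used to interpolate between random (`beta → 0`) and greedy (`beta → ∞`) behaviour.

```python
import math

def softmax_policy(m_b, m_y, beta=1.0):
    """Gibbs/softmax action probabilities from estimated rewards m_b, m_y."""
    zb = math.exp(beta * m_b)
    zy = math.exp(beta * m_y)
    return zb / (zb + zy), zy / (zb + zy)

p_b, p_y = softmax_policy(1.0, 0.0)
# probabilities sum to 1, and the action with the higher estimate is more probable
```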

Greedy strategy

Greedy update

$$m_b = r_{b,i} \qquad m_y = r_{y,i}$$

Batch update

$$m_b = \frac 1 N \sum\limits_{ i=1 }^N r_{b,i} \qquad m_y = \frac 1 N \sum\limits_{ i=1 }^N r_{y,i}$$
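The batch estimate is just the sample mean of all rewards collected so far; a one-line sketch (the reward sequence is an invented example):

```python
def batch_estimate(rewards):
    """Batch estimate m: the sample mean of all N rewards observed so far."""
    return sum(rewards) / len(rewards)

m_b = batch_estimate([1.0, 0.0, 1.0, 1.0])  # mean of 4 rewards = 0.75
```

The drawback is that every reward must be stored and re-summed; the online update below avoids this.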

Online update

Inspired by supervised learning:

$$m_b ← m_b + ε(r_{b,i}-m_b)$$
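A sketch of the online update. A standard observation (not stated in the notes, but easy to verify) is that with the decaying rate $ε = 1/i$ it reproduces the batch mean exactly, while a constant $ε$ gives an exponentially weighted average of recent rewards:

```python
def online_update(m, r, eps):
    """Incremental estimate: m <- m + eps * (r - m), no reward history needed."""
    return m + eps * (r - m)

# With eps = 1/i, the online update tracks the running (batch) mean:
m = 0.0
for i, r in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    m = online_update(m, r, 1.0 / i)
# m is now the mean of the four rewards, 0.75
```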

Key idea of reinforcement learning

You have

  • an environment $S ≝ \lbrace s_1, ⋯ \rbrace$
  • a set of actions $𝒜 ≝ \lbrace a_1, ⋯ \rbrace$
  • a reward $r ∈ ℝ$

Define

  • a policy $π(s, a) = P(a \mid s)$ (the probability of executing action $a$ given state $s$)

Optimal policy: $π^\ast$

Rat Maze problem

A rat is in a maze (a binary tree of height 2): at each intersection, it has to choose between going left and going right. At the end there is a reward ⟶ the rat wants to maximize it.

  • States $S ≝ \lbrace A, B, C, D, E, F, G \rbrace$

  • Rewards: $r(s)$

  • Actions: $𝒜 ≝ \lbrace left, right \rbrace$

  • Policy: $P(a \mid s)$

⇒ Markovian process

Value of a state $V(s)$:

expected sum of all future rewards, starting from $s$
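On the maze, $V(s)$ can be computed by propagating values back from the leaves. A sketch under assumptions: the tree layout (A at the root, D–G as leaves) follows the state list above, but the leaf rewards and the uniform policy are invented for illustration.

```python
# Rat maze as a Markov process: A branches to B/C, which branch to the leaves.
tree = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G")}
reward = {"D": 0.0, "E": 2.0, "F": 4.0, "G": 0.0}   # assumed leaf rewards

def value(state, policy):
    """V(s): expected sum of future rewards from `state` under `policy`.

    policy[s] = (p_left, p_right) at each intersection s."""
    if state in reward:                 # leaf: just collect the reward
        return reward[state]
    p_left, p_right = policy[state]
    left, right = tree[state]
    return p_left * value(left, policy) + p_right * value(right, policy)

# Uniform random policy: each leaf is reached with probability 1/4,
# so V(A) = (0 + 2 + 4 + 0) / 4 = 1.5
uniform = {s: (0.5, 0.5) for s in tree}
v_a = value("A", uniform)
```

Improving the policy toward the higher-valued branch at each intersection (and re-evaluating) is exactly the loop that policy iteration formalizes.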

  • Generalized Policy Iteration theorem

  • Monte Carlo Policy Evaluation ($TD(1)$)
  • Temporal-difference ($TD(0)$) learning:

    $$V(s) ← V(s) + εδ \qquad \text{where } δ = r(s)+ \hat{V}(s') - \hat{V}(s)$$
  • Generalization: $TD(λ)$, for $λ ∈ [0, 1]$
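The TD(0) update above can be sketched on a tiny two-state chain. The chain, rewards, and learning rate are illustrative assumptions; the update rule itself is the one from the lecture, with terminal states given value 0.

```python
def td0_episode(V, trajectory, eps=0.1):
    """One TD(0) pass over a trajectory of (s, r, s') transitions."""
    for s, r, s_next in trajectory:
        v_next = V.get(s_next, 0.0)    # V(terminal) = 0
        delta = r + v_next - V[s]      # TD error: r(s) + V(s') - V(s)
        V[s] += eps * delta
    return V

# Chain s0 -> s1 -> terminal, with reward 1 on leaving s1:
V = {"s0": 0.0, "s1": 0.0}
trajectory = [("s0", 0.0, "s1"), ("s1", 1.0, None)]
for _ in range(500):
    V = td0_episode(V, trajectory)
# Both values converge toward 1: the sum of future rewards from either state.
```

TD(0) bootstraps from the current estimate $\hat V(s')$ after one step, whereas Monte Carlo ($TD(1)$) waits for the full return; $TD(λ)$ interpolates between the two.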
