# Lecture 1: Reinforcement learning

• Stimulus $u_i$
• Reward $r_i$
• Expected reward: $v_i ≝ w u_i$
• Prediction error: $δ_i = r_i - v_i$
• Loss: $L_i ≝ δ_i^2$

Rescorla-Wagner rule (aka “delta rule”): $w ← w + ε δ_i u_i$
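A minimal sketch of the delta rule in Python, using the definitions above ($v_i = w u_i$, $δ_i = r_i - v_i$); the stimulus/reward sequences are illustrative:

```python
def rescorla_wagner(stimuli, rewards, eps=0.1):
    """Learn the association weight w with the delta rule: w <- w + eps * delta * u."""
    w = 0.0
    for u, r in zip(stimuli, rewards):
        v = w * u             # expected reward v_i = w * u_i
        delta = r - v         # prediction error delta_i = r_i - v_i
        w += eps * delta * u  # delta-rule update
    return w

# Constant stimulus u = 1 paired with reward r = 1: w converges toward 1.
w = rescorla_wagner([1.0] * 200, [1.0] * 200, eps=0.1)
```

With a fixed pairing, the prediction error shrinks geometrically, so $w$ approaches the true reward.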

Reinforcement learning:

• greedy policy

• $ε$-greedy policy

• softmax Gibbs-policy

Softmax (Gibbs) policy, for two actions $b$ and $y$:

$$p(b) = \frac{\exp(r_b)}{\exp(r_b)+\exp(r_y)} \qquad p(y) = \frac{\exp(r_y)}{\exp(r_b)+\exp(r_y)}$$
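A small sketch of the two policies over the two actions $b$ and $y$ (the inverse-temperature parameter `beta` is an assumption, not from the notes; `beta → ∞` recovers the greedy policy):

```python
import math
import random

def softmax_policy(m_b, m_y, beta=1.0):
    """Softmax (Gibbs) choice probabilities from the two value estimates."""
    eb, ey = math.exp(beta * m_b), math.exp(beta * m_y)
    z = eb + ey
    return eb / z, ey / z

def epsilon_greedy(m_b, m_y, eps=0.1):
    """With probability eps explore uniformly; otherwise take the greedy action."""
    if random.random() < eps:
        return random.choice(["b", "y"])
    return "b" if m_b >= m_y else "y"

p_b, p_y = softmax_policy(1.0, 0.0)  # higher estimate -> higher choice probability
```

Unlike the greedy policy, both softmax and $ε$-greedy keep exploring the lower-valued action, which is what lets the estimates $m_b, m_y$ stay accurate.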

## Greedy strategy

### Greedy update

$$m_b = r_{b,i} \qquad m_y = r_{y,i}$$

### Batch update

$$m_b = \frac{1}{N} \sum_{i=1}^N r_{b,i} \qquad m_y = \frac{1}{N} \sum_{i=1}^N r_{y,i}$$

### Online update

Inspired by supervised learning:

$$m_b ← m_b + ε(r_{b,i} - m_b)$$
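A quick sketch showing that the online rule with a decaying step size $ε = 1/i$ reproduces the batch average exactly (the reward values are illustrative):

```python
def online_update(m, r, eps):
    """Incremental estimate: m <- m + eps * (r - m)."""
    return m + eps * (r - m)

rewards = [2.0, 4.0, 6.0, 8.0]
m = 0.0
for i, r in enumerate(rewards, start=1):
    m = online_update(m, r, 1.0 / i)  # eps = 1/i -> running mean
# m is now 5.0, the batch average of the rewards
```

With a constant $ε$ instead, the estimate becomes an exponentially weighted average, which tracks non-stationary rewards.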

## Key idea of reinforcement learning

You have

• an environment $S ≝ \lbrace s_1, ⋯ \rbrace$
• a set of actions $𝒜 ≝ \lbrace a_1, ⋯ \rbrace$
• a reward $r ∈ ℝ$

Define

• a policy $π(s, a) = P(a \mid s)$ (the proba of executing action $a$ given $s$)

Optimal policy: $π^\ast$

## Rat Maze problem

A rat is in a maze (binary tree of height 2): at each intersection, it has to choose between going left or going right. At the end, there’s a reward ⟶ the rat wants to maximize it.

• States $S ≝ \lbrace A, B, C, D, E, F, G \rbrace$

• Rewards: $r(s)$

• Actions: $𝒜 ≝ \lbrace left, right \rbrace$

• Policy: $P(a \mid s)$

⇒ Markovian process

Value of a state $V(s)$:

expected sum of all future rewards

• Generalized Policy Iteration theorem

• Monte Carlo Policy Evaluation ($TD(1)$)
• Temporal-difference ($TD(0)$) learning:

$$V(s) ← V(s) + ε δ, \quad \text{where } δ = r(s) + V(s') - V(s)$$
• Generalization: $TD(λ)$, for $λ ∈ [0, 1]$
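A sketch of $TD(0)$ policy evaluation on the rat-maze tree above, under the uniform random policy; the leaf reward values are illustrative, not from the lecture:

```python
import random

# Binary-tree maze from the notes: A at the root, leaves D..G hold the rewards.
children = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G")}
reward = {"A": 0.0, "B": 0.0, "C": 0.0,
          "D": 0.0, "E": 2.0, "F": 1.0, "G": 0.0}  # illustrative rewards

def td0_evaluate(eps=0.05, episodes=5000, seed=0):
    """TD(0): V(s) <- V(s) + eps * delta, delta = r(s) + V(s') - V(s)."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in reward}
    for _ in range(episodes):
        s = "A"
        while s in children:                 # walk down the tree
            left, right = children[s]
            s_next = left if rng.random() < 0.5 else right
            delta = reward[s] + V[s_next] - V[s]
            V[s] += eps * delta
            s = s_next
        V[s] += eps * (reward[s] - V[s])     # terminal leaf: bootstrap on r only
    return V

V = td0_evaluate()
```

Under the random policy, $V$ converges to the expected future reward from each state: here $V(B) ≈ 1$, $V(C) ≈ 0.5$, $V(A) ≈ 0.75$. Note that TD(0) bootstraps on $V(s')$ at every step, whereas Monte Carlo (TD(1)) would wait for the full episode return.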
