Lecture 1: Reinforcement learning

  • Stimulus $u_i$
  • Reward $r_i$
  • Expected reward: $v_i ≝ w u_i$
  • Prediction error: $δ_i = r_i - v_i$
  • Loss: $L_i ≝ δ_i^2$

Rescorla-Wagner Rule (aka “delta rule”): $w ← w + ε δ_i u_i$
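The delta rule above can be sketched as a simple loop over trials. This is an illustrative sketch; the learning rate `eps` and the trial data are my choices, not from the lecture.

```python
def rescorla_wagner(trials, eps=0.1):
    """Learn an association weight w from (stimulus u_i, reward r_i) trials."""
    w = 0.0
    for u, r in trials:
        v = w * u          # expected reward: v_i = w u_i
        delta = r - v      # prediction error: delta_i = r_i - v_i
        w += eps * delta * u
    return w

# With the stimulus always present (u = 1) and reward 1, w converges toward 1:
w = rescorla_wagner([(1.0, 1.0)] * 100)
```

Each update descends the squared-error loss $L_i = δ_i^2$, which is why the same rule appears in supervised learning.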

Reinforcement learning:

  • greedy policy

  • $ε$-greedy policy

  • softmax Gibbs-policy

Softmax-Gibbs Policy:
$$p(b) = \frac{\exp(r_b)}{\exp(r_b)+\exp(r_y)} \qquad p(y) = \frac{\exp(r_y)}{\exp(r_b)+\exp(r_y)}$$
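A minimal sketch of these two probabilities in code. The inverse-temperature parameter `beta` is an addition of mine (the formulas above correspond to `beta = 1`); it is commonly used to interpolate between random (`beta → 0`) and greedy (`beta → ∞`) behaviour.

```python
import math

def softmax_policy(m_b, m_y, beta=1.0):
    """Gibbs/softmax action probabilities from estimated rewards m_b, m_y."""
    zb = math.exp(beta * m_b)
    zy = math.exp(beta * m_y)
    return zb / (zb + zy), zy / (zb + zy)

p_b, p_y = softmax_policy(1.0, 0.0)
# probabilities sum to 1, and the action with the higher estimate is more probable
```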

Greedy strategy

Greedy update

$$m_b = r_{b,i} \qquad m_y = r_{y,i}$$

Batch update

$$m_b = \frac 1 N \sum\limits_{ i=1 }^N r_{b,i} \qquad m_y = \frac 1 N \sum\limits_{ i=1 }^N r_{y,i}$$
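The batch estimate is just the sample mean of all rewards collected so far; a one-line sketch (the reward sequence is an invented example):

```python
def batch_estimate(rewards):
    """Batch estimate m: the sample mean of all N rewards observed so far."""
    return sum(rewards) / len(rewards)

m_b = batch_estimate([1.0, 0.0, 1.0, 1.0])  # mean of 4 rewards = 0.75
```

The drawback is that every reward must be stored and re-summed; the online update below avoids this.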

Online update

Inspired by supervised learning:

$$m_b ← m_b + ε(r_{b,i}-m_b)$$
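A sketch of the online update. A standard observation (not stated in the notes, but easy to verify) is that with the decaying rate $ε = 1/i$ it reproduces the batch mean exactly, while a constant $ε$ gives an exponentially weighted average of recent rewards:

```python
def online_update(m, r, eps):
    """Incremental estimate: m <- m + eps * (r - m), no reward history needed."""
    return m + eps * (r - m)

# With eps = 1/i, the online update tracks the running (batch) mean:
m = 0.0
for i, r in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    m = online_update(m, r, 1.0 / i)
# m is now the mean of the four rewards, 0.75
```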

Key idea of reinforcement learning

You have

  • an environment $S ≝ \lbrace s_1, ⋯ \rbrace$
  • a set of actions $𝒜 ≝ \lbrace a_1, ⋯ \rbrace$
  • a reward $r ∈ ℝ$

Define

  • a policy $π(s, a) = P(a \mid s)$ (the probability of executing action $a$ given state $s$)

Optimal policy: $π^\ast$

Rat Maze problem

A rat is in a maze (a binary tree of height 2): at each intersection, it has to choose between going left and going right. At the end there is a reward ⟶ the rat wants to maximize it.

  • States $S ≝ \lbrace A, B, C, D, E, F, G \rbrace$

  • Rewards: $r(s)$

  • Actions: $𝒜 ≝ \lbrace left, right \rbrace$

  • Policy: $P(a \mid s)$

⇒ Markovian process

Value of a state $V(s)$:

expected sum of all future rewards, starting from $s$
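On the maze, $V(s)$ can be computed by propagating values back from the leaves. A sketch under assumptions: the tree layout (A at the root, D–G as leaves) follows the state list above, but the leaf rewards and the uniform policy are invented for illustration.

```python
# Rat maze as a Markov process: A branches to B/C, which branch to the leaves.
tree = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G")}
reward = {"D": 0.0, "E": 2.0, "F": 4.0, "G": 0.0}   # assumed leaf rewards

def value(state, policy):
    """V(s): expected sum of future rewards from `state` under `policy`.

    policy[s] = (p_left, p_right) at each intersection s."""
    if state in reward:                 # leaf: just collect the reward
        return reward[state]
    p_left, p_right = policy[state]
    left, right = tree[state]
    return p_left * value(left, policy) + p_right * value(right, policy)

# Uniform random policy: each leaf is reached with probability 1/4,
# so V(A) = (0 + 2 + 4 + 0) / 4 = 1.5
uniform = {s: (0.5, 0.5) for s in tree}
v_a = value("A", uniform)
```

Improving the policy toward the higher-valued branch at each intersection (and re-evaluating) is exactly the loop that policy iteration formalizes.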

  • Generalized Policy Iteration theorem

  • Monte Carlo Policy Evaluation ($TD(1)$)
  • Temporal-difference ($TD(0)$) learning:

    $$V(s) ← V(s) + εδ \qquad \text{where } δ = r(s)+ \hat{V}(s') - \hat{V}(s)$$
  • Generalization: $TD(λ)$, for $λ ∈ [0, 1]$
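The TD(0) update above can be sketched on a tiny two-state chain. The chain, rewards, and learning rate are illustrative assumptions; the update rule itself is the one from the lecture, with terminal states given value 0.

```python
def td0_episode(V, trajectory, eps=0.1):
    """One TD(0) pass over a trajectory of (s, r, s') transitions."""
    for s, r, s_next in trajectory:
        v_next = V.get(s_next, 0.0)    # V(terminal) = 0
        delta = r + v_next - V[s]      # TD error: r(s) + V(s') - V(s)
        V[s] += eps * delta
    return V

# Chain s0 -> s1 -> terminal, with reward 1 on leaving s1:
V = {"s0": 0.0, "s1": 0.0}
trajectory = [("s0", 0.0, "s1"), ("s1", 1.0, None)]
for _ in range(500):
    V = td0_episode(V, trajectory)
# Both values converge toward 1: the sum of future rewards from either state.
```

TD(0) bootstraps from the current estimate $\hat V(s')$ after one step, whereas Monte Carlo ($TD(1)$) waits for the full return; $TD(λ)$ interpolates between the two.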
