Lecture 5: Reinforcement Learning
Lecturer: Lyudmila Kushnir
Introduction
Example of unsupervised learning: modeling generalization (in biology): an organism is presented with an input ⟶ it has to respond to it in the same way as to an already seen “analogous” input
Reinforcement learning (RL): between supervised and unsupervised learning.
Partial supervision: you get a reward that tells you whether you’re doing well or badly (ex: in chess, no correct move is given to learn from, but a given position can increase your chance of winning, i.e. of getting the reward)
Examples of RL:
- Navigation: you know if you’re getting closer to the goal
- Precise motor command: moving the muscles of the hand ⇒ influences the direction of the arrow if you’re using a bow
- Social interaction: there’s some structure/predictability to how people respond to your actions
⇒ Reinforcement is often delayed
Classical conditioning paradigm
Pavlov’s dog (unconditioned/conditioned stimulus): cf. the computational neuroscience and neuromodeling courses
⟶ Many interpretations are possible: the dog is trying to learn to predict the reward
Stimulus: $u$, Reward: $r$
\[\text{Estimated reward: } v = wu\\ \text{Loss: } \quad L = \sum\limits_{ i } (r_i - wu_i)^2\]
The learning stops when:
\[w = ⟨r⟩_{u=1}\]
Generalization:
\[\textbf{v} = \textbf{w} \cdot \textbf{u}\]
→ Not different from supervised learning
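A minimal sketch of the corresponding delta-rule update (gradient descent on $L$); the reward probability, learning rate, and number of trials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

eps = 0.1          # learning rate (assumed value)
w = 0.0            # association weight, v = w * u

for trial in range(500):
    u = 1.0                          # stimulus present
    r = rng.binomial(1, 0.6)         # stochastic reward (probability 0.6 is an assumption)
    v = w * u                        # estimated reward
    w += eps * (r - v) * u           # delta rule: gradient step on (r - w*u)^2

print(w)   # converges to <r | u = 1> = 0.6, up to fluctuations of order eps
```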
Static action choice
Bee landing on flowers ⟶ cf. computational neuroscience and neuromodeling courses
Softmax parametrization: policy
- \[p_i = \cfrac{\exp(β m_i)}{\exp(β m_b) + \exp(β m_y)}\]
- $m_b, m_y$: action values (internal estimates)
- $β$: exploration-exploitation tradeoff parameter
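A short sketch of the softmax policy, with placeholder action values and two illustrative values of $β$:

```python
import numpy as np

def softmax_policy(m, beta):
    """Return action probabilities p_i ∝ exp(beta * m_i)."""
    z = np.exp(beta * (m - m.max()))   # subtract the max for numerical stability
    return z / z.sum()

m = np.array([2.0, 1.0])               # action values m_b, m_y (assumed numbers)
print(softmax_policy(m, beta=0.0))     # [0.5, 0.5]  -> pure exploration
print(softmax_policy(m, beta=5.0))     # ~[0.99, 0.01] -> near-greedy exploitation
```

With $β = 0$ both actions are equally likely (pure exploration); large $β$ makes the policy nearly greedy (exploitation).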
How to update the action values?
Indirect actor framework
\[m_i → m_i + ε (r_i-m_i) \qquad \text{when action } i \text{ is chosen}\]
\[⟹ \text{Ultimately: } m_i = ⟨r_i⟩\]
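A sketch of the indirect actor driving the softmax policy on a two-flower bandit; the reward scheme (matching the bee example below), ε, β, and the number of trials are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta = 0.1, 1.0
m = np.array([0.0, 0.0])                    # action values m_b, m_y

def rewards():
    # Illustrative reward scheme: blue gives 2 for sure, yellow gives 6 with prob 1/3
    return np.array([2.0, 6.0 * rng.binomial(1, 1 / 3)])

for trial in range(2000):
    p = np.exp(beta * m); p /= p.sum()       # softmax policy
    a = rng.choice(2, p=p)                   # sample an action
    r = rewards()[a]                         # observe the reward of the chosen action
    m[a] += eps * (r - m[a])                 # indirect actor: track <r_i>

print(m)   # both entries fluctuate around 2, so the policy ends up indifferent
```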
Risk avoidance
Observed with real bees!
- $r_b = 2$ with $p=1$
- $r_y = 6$ with $p = 1/3$
So
\[⟨r_b⟩ = ⟨r_y⟩ = 2\]
But in practice, bees prefer the more reliable/stable option, i.e. the blue flower
⇒ Not explained by the indirect actor framework, since $m_b$ and $m_y$ should be the same
Fix: bees don’t use $r$ in the update rule, but a (strictly) concave utility function of $r$
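For instance, with an assumed concave utility $u(r) = \sqrt{r}$, the two options are no longer equivalent:
\[⟨\sqrt{r_b}⟩ = \sqrt{2} ≈ 1.41 > ⟨\sqrt{r_y}⟩ = \tfrac{1}{3}\sqrt{6} ≈ 0.82\]
so updating the action values with $u(r)$ instead of $r$ makes the reliable blue flower preferred.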
Direct actor
Goal: Optimize
\[⟨r⟩ = p_b ⟨r_b⟩ + p_y ⟨r_y⟩\]
Update rule:
\[m_b → m_b + ε(1 - p_b)\underbrace{(r_b - \bar{r})}_{≝ δ} \qquad \text{when blue is selected}\\ m_b → m_b - ε p_b (r_y - \bar{r}) \qquad \text{when yellow is selected}\]
where $\bar{r}$ is an arbitrary parameter controlling the speed of learning (in both cases, the update uses the reward actually received). Reasonable choice:
\[\bar{r} = ⟨r⟩\]
NB: in practice, $\bar{r}$ is set to the empirical average of the rewards received so far. Setting $\bar{r} = ⟨r⟩$ exactly is not really necessary, but the closer $\bar{r}$ is to $⟨r⟩$, the more likely the learning is to converge.
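A sketch of the direct actor rule on a two-action bandit; the Gaussian reward scheme (with different means, so that there is something to learn), ε, β, and the running-average estimate of $\bar{r}$ are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, beta = 0.05, 1.0
m = np.array([0.0, 0.0])                     # action values m_b, m_y
r_bar = 0.0                                  # running estimate of <r>

def reward(a):
    # Illustrative reward scheme: mean 2 for action 0, mean 4 for action 1
    return rng.normal(2.0, 1.0) if a == 0 else rng.normal(4.0, 1.0)

for trial in range(5000):
    p = np.exp(beta * m); p /= p.sum()        # softmax policy
    a = rng.choice(2, p=p)                    # sample an action
    r = reward(a)
    delta = r - r_bar                         # δ = r_a - r̄
    m[a] += eps * (1 - p[a]) * delta          # chosen action
    m[1 - a] -= eps * p[1 - a] * delta        # unchosen action
    r_bar += 0.01 * (r - r_bar)               # track the average reward

print(m, r_bar)   # the policy ends up favouring the action with the higher mean reward
```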
Sequential action choice
Ex: Rat in a maze.
Notation: $u$: current location, $u’$: next one
- Expected future reward for the current policy: \[v(u) = \textbf{w} \cdot \underbrace{\textbf{u}(u)}_{\text{stimulus at location } u}\]
There are two theoretical “agents”:
- the critic, who evaluates $v$ (and hence $\textbf{w}$)
- the actor, who changes the policy
Critic learning (policy evaluation):
\[\textbf{w} → \textbf{w} + ε δ \textbf{u}(u)\]
TD-learning:
\[δ = r_a(u) + v(u') - v(u)\]
Actor learning (policy improvement):
\[m_a(\textbf{u}) → m_a(\textbf{u}) + ε(1-p_{a, \textbf{u}}) δ \qquad \text{if action } a \text{ is chosen}\\ m_a(\textbf{u}) → m_a(\textbf{u}) - ε p_{a, \textbf{u}} δ \qquad \text{if action } a \text{ is not chosen}\\\]
where $δ = r_a(u) + v(u') - v(u)$ plays the role of $r_a - \bar{r}$: this is the direct actor algorithm with reward $r_a(u) + v(u')$ and $\bar{r} = v(u)$
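A sketch of the actor-critic scheme on a toy chain “maze”; the environment, the one-hot stimulus representation (which makes $v(u) = \textbf{w} \cdot \textbf{u}(u)$ a table lookup), the small per-step cost, ε, and the number of episodes are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, eps = 5, 0.1
w = np.zeros(n_states)                      # critic: v(u) = w·u(u) with one-hot u(u) -> table lookup
m = np.zeros((n_states, 2))                 # actor: m_a(u) for actions 0 = left, 1 = right

def step(u, a):
    """Toy chain: rightmost state is the goal; small cost per step (assumed numbers)."""
    u2 = min(max(u + (1 if a == 1 else -1), 0), n_states - 1)
    done = (u2 == n_states - 1)
    r = 1.0 if done else -0.1
    return u2, r, done

for episode in range(500):
    u, done = 0, False
    while not done:
        p = np.exp(m[u]); p /= p.sum()      # softmax policy at the current location
        a = rng.choice(2, p=p)
        u2, r, done = step(u, a)
        delta = r + (0.0 if done else w[u2]) - w[u]   # TD error δ = r + v(u') - v(u)
        w[u] += eps * delta                           # critic update (policy evaluation)
        m[u, a] += eps * (1 - p[a]) * delta           # actor: chosen action
        m[u, 1 - a] -= eps * p[1 - a] * delta         # actor: unchosen action
        u = u2

print(np.round(w, 2))                       # values increase towards the rewarded end
p_right = np.exp(m[:, 1]) / np.exp(m).sum(axis=1)
print(np.round(p_right, 2))                 # probability of going right grows in every state
```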
What is $v(u)$?
In this case:
\[v(u) = \Big\langle \sum\limits_{ τ≥0 } r(t+τ) \Big\rangle\]
With a discount factor $0 < γ ≤ 1$:
\[v(u) = \Big\langle \sum\limits_{ τ≥0 } γ^τ r(t+τ) \Big\rangle\]
The TD-learning becomes:
\[δ = r_a(u) + γv(u') - v(u)\]
where the value at time $t$ can be computed from the stimulus history (time-dependent stimulus representation):
\[v(t) = \sum\limits_{ \underbrace{i}_{\text{types of stimuli}} } \sum\limits_{ τ=0 }^t w_i(τ) u_i(t-τ)\]
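A minimal sketch of TD learning with this time-dependent representation, for a single stimulus type; the trial layout, γ, ε, and the weight update (which follows the same pattern as the critic rule $\textbf{w} → \textbf{w} + ε δ \textbf{u}$) are assumptions:

```python
import numpy as np

T, gamma, eps = 20, 0.9, 0.1
w = np.zeros(T)                             # weights w(τ) over stimulus delays (one stimulus type)
u = np.zeros(T); u[5] = 1.0                 # stimulus presented at t = 5
r = np.zeros(T); r[10] = 1.0                # reward delivered at t = 10

def value(t):
    """v(t) = Σ_{τ=0..t} w(τ) u(t-τ)."""
    taus = np.arange(t + 1)
    return w[taus] @ u[t - taus]

for trial in range(200):                    # repeat the same trial many times
    for t in range(T - 1):
        delta = r[t] + gamma * value(t + 1) - value(t)   # discounted TD error
        taus = np.arange(t + 1)
        w[taus] += eps * delta * u[t - taus]             # assumed update, same pattern as w → w + ε δ u

# After learning, the prediction error δ shifts from the time of the reward
# towards the time of the stimulus, as in the dopamine recordings mentioned below.
```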
Experiments on monkeys: dopamine neuron activity signals the reward prediction error $δ$