# Lecture 5: Reinforcement Learning

Lecturer: Lyudmila Kushnir

# Introduction

Example of unsupervised learning: modeling generalization (in biology). An organism is presented with an input ⟶ it has to respond to it in the same way as to an “analogous” input it has already seen.

Reinforcement learning (RL): between supervised and unsupervised learning.

Partial supervision: you get a reward that tells you whether you’re doing well or badly (e.g. in chess: there is no target position to learn, but a given position can increase your chance of winning, which plays the role of the reward)

Examples of RL:

• Navigation: you know if you’re getting closer to the goal

• Precise motor command: moving the muscles of the hand ⇒ influences the direction of the arrow if you’re using a bow

• Social interaction: there’s some structure/predictability to how people respond to your actions

⇒ Reinforcement is often delayed

Pavlov’s dog (unconditioned/conditioned stimulus): cf. computational neuroscience and neuromodeling courses

⟶ Many interpretations are possible; one of them: the dog is trying to learn to predict the reward

Stimulus: $u$, Reward: $r$

$$
\begin{aligned}
\text{Estimated reward: } \quad & v = wu\\
\text{Loss: } \quad & L = \sum\limits_{ i } (r_i - wu_i)^2
\end{aligned}
$$

The learning stops when:

$$
w = ⟨r⟩_{u=1}
$$
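The stopping condition above can be checked numerically. Below is a minimal Python sketch, assuming the loss is minimized online with a delta rule $w → w + ε (r - wu) u$ (an assumption: the notes only give the loss). The function name `rescorla_wagner`, the learning rate `epsilon` and the trial data are illustrative, not part of the lecture.

```python
import random

def rescorla_wagner(trials, epsilon=0.05, w0=0.0):
    """Online delta rule on the squared loss: w <- w + epsilon * (r - w*u) * u."""
    w = w0
    for u, r in trials:
        delta = r - w * u          # prediction error on this trial
        w += epsilon * delta * u   # no update when the stimulus is absent (u = 0)
    return w

# Hypothetical data: the stimulus is always present (u = 1)
# and the reward is delivered on about half of the trials.
rng = random.Random(0)
trials = [(1.0, 1.0 if rng.random() < 0.5 else 0.0) for _ in range(5000)]

w_final = rescorla_wagner(trials)
mean_reward = sum(r for _, r in trials) / len(trials)
# w_final fluctuates around mean_reward, i.e. around <r> for u = 1.
```

With a constant `epsilon` the weight keeps fluctuating around the average reward; a smaller `epsilon` reduces the fluctuations at the cost of slower learning.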

Generalization:

$$
\textbf{v} = \textbf{w} \cdot \textbf{u}
$$

→ Not different from supervised learning

## Static action choice

Bee landing on flowers ⟶ cf. computational neuroscience and neuromodeling courses

Softmax parametrization of the policy:

• $p_i = \cfrac{\exp(β m_i)}{\exp(β m_b) + \exp(β m_y)}$ for $i ∈ \lbrace b, y \rbrace$: probability of choosing flower $i$ (blue or yellow)
• $m_b, m_y$: action values (internal estimates)

• $β$: exploration-exploitation tradeoff parameter
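The softmax policy above can be sketched in a few lines of Python. The function name `softmax_policy` is an illustrative choice, and subtracting the largest action value before exponentiating is a standard numerical-stability trick that does not change the probabilities.

```python
import math

def softmax_policy(action_values, beta):
    """Return choice probabilities p_i proportional to exp(beta * m_i)."""
    m_max = max(action_values)  # shift for numerical stability
    weights = [math.exp(beta * (m - m_max)) for m in action_values]
    total = sum(weights)
    return [w / total for w in weights]

# Equal action values give equal probabilities, whatever beta is.
p_equal = softmax_policy([1.0, 1.0], beta=2.0)
# beta = 0 ignores the action values (pure exploration);
# a large beta almost always picks the best action (exploitation).
p_explore = softmax_policy([2.0, 1.0], beta=0.0)
p_exploit = softmax_policy([2.0, 1.0], beta=50.0)
```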

How to update the action values?

### Indirect actor framework

$$
m_i → m_i + ε (r_i-m_i) \quad ⟹ \quad \text{ultimately: } m_i = ⟨r_i⟩
$$
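A minimal simulation of the indirect actor, as a sketch under stated assumptions: the bee picks a flower with the softmax policy and only the action value of the visited flower is updated. The reward values, the number of visits and the names `choose` and `indirect_actor` are hypothetical.

```python
import math
import random

def choose(m, beta, rng):
    """Sample a flower from the softmax policy over the action values m."""
    flowers = list(m)
    weights = [math.exp(beta * m[f]) for f in flowers]
    return rng.choices(flowers, weights=weights)[0]

def indirect_actor(reward_fn, n_visits=2000, epsilon=0.1, beta=1.0, seed=0):
    """Run m_i <- m_i + epsilon * (r_i - m_i) on the visited flower only."""
    rng = random.Random(seed)
    m = {"b": 0.0, "y": 0.0}
    for _ in range(n_visits):
        flower = choose(m, beta, rng)
        r = reward_fn(flower, rng)
        m[flower] += epsilon * (r - m[flower])
    return m

# Hypothetical deterministic rewards: blue always gives 1, yellow always gives 2.
m = indirect_actor(lambda flower, rng: {"b": 1.0, "y": 2.0}[flower])
# Each action value approaches the average reward of its flower.
```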

### Risk avoidance

Observed with real bees!

• $r_b = 2$ with $p=1$
• $r_y = 6$ with $p = 1/3$ (and $r_y = 0$ otherwise)

So

$$
⟨r_b⟩ = ⟨r_y⟩ = 2
$$

But in practice: bees prefer the more reliable/stable one, i.e. the blue flower

⇒ Not explained by the indirect actor framework, since $m_b$ and $m_y$ should be the same

Fix: bees don’t use $r$ in the update rule, but a (strictly) concave utility function of $r$
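As a worked illustration, assume the utility is $U(r) = \sqrt{r}$ (a hypothetical choice; any strictly concave function gives the same qualitative result) and that the yellow flower gives $0$ whenever it does not give $6$. Replacing $r_i$ by $U(r_i)$ in the indirect actor update, the action values converge to the average utilities:

$$
\begin{aligned}
m_b &= ⟨U(r_b)⟩ = \sqrt{2} ≈ 1.41\\
m_y &= ⟨U(r_y)⟩ = \tfrac{1}{3}\sqrt{6} + \tfrac{2}{3}\sqrt{0} ≈ 0.82
\end{aligned}
$$

so $m_b > m_y$ even though $⟨r_b⟩ = ⟨r_y⟩$, and the softmax policy favors the blue flower.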

### Direct actor

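Goal: maximize the expected reward

$$
⟨r⟩ = p_b ⟨r_b⟩ + p_y ⟨r_y⟩
$$

Update rule for $m_b$ (and symmetrically for $m_y$):

$$
\begin{aligned}
m_b &→ m_b + ε(1 - p_b)\underbrace{(r_b - \bar{r})}_{≝ δ} && \text{when blue is selected}\\
m_b &→ m_b - ε p_b (r_y - \bar{r}) && \text{when blue is not selected (yellow is selected, with reward } r_y \text{)}
\end{aligned}
$$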

where $\bar{r}$ is an arbitrary parameter controlling the speed of learning. Reasonable choice:

$$
\bar{r} = ⟨r⟩
$$

NB: in practice, $\bar{r}$ is set to the empirical average of the rewards received so far. Setting $\bar{r} = ⟨r⟩$ exactly is not strictly necessary, but the closer $\bar{r}$ is to $⟨r⟩$, the more likely you are to converge.
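A minimal Python sketch of the direct actor, assuming that $\bar{r}$ is the running average of the rewards received so far and that the error $δ$ is always computed from the reward of the flower actually visited. The reward values and the names `softmax` and `direct_actor` are hypothetical.

```python
import math
import random

def softmax(m, beta):
    """Softmax policy over the action values stored in the dict m."""
    z = {a: math.exp(beta * m[a]) for a in m}
    total = sum(z.values())
    return {a: z[a] / total for a in z}

def direct_actor(rewards, n_visits=5000, epsilon=0.05, beta=1.0, seed=0):
    rng = random.Random(seed)
    m = {"b": 0.0, "y": 0.0}
    r_bar = 0.0                       # running average of the reward
    for t in range(1, n_visits + 1):
        p = softmax(m, beta)
        chosen = "b" if rng.random() < p["b"] else "y"
        r = rewards[chosen]
        delta = r - r_bar             # reward relative to the baseline
        for a in m:
            selected = 1.0 if a == chosen else 0.0
            # +epsilon*(1 - p_a)*delta if a was selected, -epsilon*p_a*delta otherwise
            m[a] += epsilon * (selected - p[a]) * delta
        r_bar += (r - r_bar) / t      # empirical average so far
    return m, softmax(m, beta)

# Hypothetical deterministic rewards: blue gives 1, yellow gives 2.
m, p = direct_actor({"b": 1.0, "y": 2.0})
# The policy ends up strongly favoring the yellow flower.
```

Unlike the indirect actor, the action values here are not estimates of the average rewards: only their difference matters, and it keeps growing in favor of the better flower.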

## Sequential action choice

Ex: Rat in a maze.

Notation: $u$: current location, $u'$: next location

Expected future reward for current policy:
$$
v(u) = \textbf{w} \cdot \textbf{u}(u)
$$

where $\textbf{u}(u)$ is the stimulus at location $u$.

There are two theoretical “agents”:

• the critic, who evaluates $v$ (and hence $\textbf{w}$)
• the actor, who changes the policy

### Critic learning (policy evaluation):

$$
\textbf{w} → \textbf{w} + ε δ \textbf{u}(u)
$$

TD-learning:

$$
δ = r_a(u) + v(u') - v(u)
$$

where $r_a(u)$ is the reward obtained by taking action $a$ at location $u$.
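A minimal sketch of the critic on a toy problem, assuming a linear track walked from left to right with a reward only at the last location, and a one-hot stimulus $\textbf{u}(u)$ so that $v(u)$ is simply the weight of location $u$. The track, the reward and the name `td_critic` are hypothetical.

```python
def td_critic(n_states=4, final_reward=1.0, epsilon=0.1, n_episodes=500):
    """TD policy evaluation with a one-hot stimulus, so that v(u) = w[u]."""
    w = [0.0] * n_states
    for _ in range(n_episodes):
        for u in range(n_states):
            last = (u == n_states - 1)
            r = final_reward if last else 0.0
            v_next = 0.0 if last else w[u + 1]   # no value beyond the end of the track
            delta = r + v_next - w[u]            # TD error
            w[u] += epsilon * delta              # w <- w + epsilon * delta * u(u)
    return w

w = td_critic()
# Every location ends up predicting the total future reward (here 1),
# and the prediction propagates backwards from the rewarded location.
```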

### Actor learning (policy improvement):

$$
\begin{aligned}
m_a(\textbf{u}) &→ m_a(\textbf{u}) + ε(1-p_{a, \textbf{u}}) δ && \text{if action } a \text{ is chosen}\\
m_a(\textbf{u}) &→ m_a(\textbf{u}) - ε p_{a, \textbf{u}} δ && \text{if action } a \text{ is not chosen}
\end{aligned}
$$

where $δ = (r_a(u)+v(u')) - v(u)$ remains the same TD error as for the critic, evaluated for the action actually taken at $u$. This is the direct actor algorithm with reward $r_a(u) + v(u')$ in place of $r_a$, and $\bar{r} = v(u)$.
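The critic and the actor can be combined in a short simulation. This is a sketch under stated assumptions: a hypothetical three-location maze (start at `A`, then `B` or `C`, then exit), a one-hot stimulus so that $v(u)$ is stored per location, and the same learning rate for both agents. The rewards and all names are illustrative.

```python
import math
import random

# Hypothetical maze: location -> action -> (reward, next location or None at the exit)
MAZE = {
    "A": {"L": (0.0, "B"), "R": (0.0, "C")},
    "B": {"L": (0.0, None), "R": (5.0, None)},
    "C": {"L": (2.0, None), "R": (0.0, None)},
}

def policy(m_u, beta):
    """Softmax over the action values available at one location."""
    z = {a: math.exp(beta * m_u[a]) for a in m_u}
    total = sum(z.values())
    return {a: z[a] / total for a in z}

def actor_critic(n_trials=3000, epsilon=0.1, beta=1.0, seed=0):
    rng = random.Random(seed)
    v = {u: 0.0 for u in MAZE}                             # critic
    m = {u: {a: 0.0 for a in MAZE[u]} for u in MAZE}       # actor
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            p = policy(m[u], beta)
            a = "L" if rng.random() < p["L"] else "R"
            r, u_next = MAZE[u][a]
            v_next = v[u_next] if u_next is not None else 0.0
            delta = r + v_next - v[u]                      # TD error
            v[u] += epsilon * delta                        # policy evaluation
            for b in m[u]:                                 # policy improvement
                chosen = 1.0 if b == a else 0.0
                m[u][b] += epsilon * (chosen - p[b]) * delta
            u = u_next
    return v, m

v, m = actor_critic()
# The actor learns to prefer the rewarded exit at B and C,
# and to turn towards B, the more valuable location, at A.
```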

### What is $v(u)$?

In this case:

$$
v(u) = \Big\langle \sum\limits_{ τ≥0 } r(t+τ) \Big\rangle
$$

### With a discount factor $0 < γ ≤ 1$

The TD error becomes:

$$
δ = r_a(u) + γv(u') - v(u)
$$
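A short derivation of where this error comes from, assuming that the discounted value is defined as the expected discounted sum of future rewards (an assumption, chosen to be consistent with the undiscounted definition of $v(u)$):

$$
\begin{aligned}
v(u) &= \Big\langle \sum\limits_{τ≥0} γ^τ r(t+τ) \Big\rangle \\
&= \Big\langle r(t) + γ \sum\limits_{τ≥0} γ^τ r(t+1+τ) \Big\rangle \\
&= ⟨ r_a(u) + γ v(u') ⟩
\end{aligned}
$$

So once the critic has converged, $⟨δ⟩ = ⟨r_a(u) + γ v(u') - v(u)⟩ = 0$, and setting $γ = 1$ recovers the undiscounted case.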

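With several types of stimuli and a dependence on time within a trial, the prediction at time $t$ is:

$$
v(t) = \sum\limits_{ i } \sum\limits_{ τ=0 }^t w_i(τ) u_i(t-τ)
$$

where $i$ indexes the types of stimuli.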
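The double sum can be written directly in Python. In this sketch, `w[i][tau]` and `u[i][t]` are plain lists indexed by stimulus type and discrete time step; the name `predict` and the example values are hypothetical.

```python
def predict(w, u, t):
    """v(t) = sum over stimulus types i and delays tau of w[i][tau] * u[i][t - tau]."""
    return sum(
        w_i[tau] * u_i[t - tau]
        for w_i, u_i in zip(w, u)
        for tau in range(t + 1)
    )

# One stimulus type, presented as a single pulse at time step 1.
u = [[0.0, 1.0, 0.0, 0.0]]
w = [[0.5, 1.0, 2.0, 0.0]]
predictions = [predict(w, u, t) for t in range(4)]
# The prediction at time t is the weight for the delay since the pulse, w[0][t - 1].
```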

Experiments on monkeys: dopamine signals the reward prediction error $δ$
