Lecture 3: Exploration-exploitation dilemma

Computational model of behavior:

  • based on a feedback, the animal is able to learn something

Conditioning

  • classical: Pavlovian

  • instrumental

Classical conditioning

Recall the

Rescorla-Wagner Rule:

w ← w + ε δ u

where $δ = r - w \cdot u$ is the prediction error.

Instrumental conditioning

  • static action choice: reward delivered immediately after the choice

  • sequential choice: reward delivered after a series of actions

Decision strategies

Policy:

the strategy used by the animal to maximize the reward

Ex: Real experiment: bees and flowers

| Flower | Drops of nectar |
|--------|-----------------|
| Blue   | $r_b = 8$       |
| Yellow | $r_y = 2$       |

Expected value of the reward:

⟨R⟩ = r_y \cdot p(a=\text{yellow}) + r_b \cdot p(a=\text{blue})

These probabilities depend only on the policy of the animal.
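As a minimal sketch (not from the lecture), the expected reward $⟨R⟩$ for the bee example can be computed for any policy, parametrized here by the single probability $p(a=\text{blue})$:

```python
def expected_reward(p_blue, r_b=8, r_y=2):
    """<R> = r_b * p(blue) + r_y * p(yellow), with p(yellow) = 1 - p_blue."""
    return r_b * p_blue + r_y * (1 - p_blue)

print(expected_reward(0.5))  # uniform policy: (8 + 2) / 2 = 5.0
```

Any decision strategy below reduces to a choice of `p_blue`.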

Greedy policy

  • $p(a=\text{blue}) = 1$
  • $p(a=\text{yellow}) = 0$

Ex: always go for the blue flower.

⟨R⟩ = 8 × 1 + 2 × 0 = 8

But if $r_b$ and $r_y$ change over time, the animal is tricked: it keeps choosing blue even after yellow becomes the better option.

$ε$-Greedy policy

For $ε ≪ 1$:

  • $p(a=\text{blue}) = 1-ε$
  • $p(a=\text{yellow}) = ε$

Ex: back to our example:

⟨R⟩ = 8 - 6 ε
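A quick numerical check of this formula (values $r_b = 8$, $r_y = 2$ from the example above):

```python
def epsilon_greedy_reward(eps, r_b=8, r_y=2):
    """<R> = r_b * (1 - eps) + r_y * eps, which simplifies to 8 - 6*eps here."""
    return r_b * (1 - eps) + r_y * eps

print(epsilon_greedy_reward(0.1))  # 8 - 6 * 0.1 = 7.4
```

The $6ε$ term is the price paid for exploration: the reward gap $r_b - r_y = 6$ times the fraction of exploratory choices.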

Softmax (Gibbs) policy

Depends on the reward. For a fixed $β ≥ 0$:

  • $p(a=\text{blue}) = \frac{\exp(βr_b)}{\exp(βr_b)+\exp(βr_y)}$
  • $p(a=\text{yellow}) = \frac{\exp(βr_y)}{\exp(βr_b)+\exp(βr_y)}$

NB: Gibbs distribution comes from physics, where $β$ is proportional to the inverse of the temperature.
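A minimal sketch of the softmax policy for the two-flower case:

```python
import math

def softmax_policy(r_b, r_y, beta):
    """Return (p(blue), p(yellow)) under the Gibbs/softmax policy."""
    zb, zy = math.exp(beta * r_b), math.exp(beta * r_y)
    Z = zb + zy  # normalization (partition function, in the physics analogy)
    return zb / Z, zy / Z

print(softmax_policy(8, 2, 0.0))  # beta -> 0: uniform, (0.5, 0.5)
print(softmax_policy(8, 2, 2.0))  # large beta: nearly greedy on blue
```

At $β = 0$ both actions are equally likely (pure exploration); as $β$ grows, the probability mass concentrates on the higher-reward action (exploitation).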

Exploration-Exploitation trade-off:

  • $β ⟶ 0$: Exploration (physical analogy: very high temperature)

  • $β ⟶ +∞$: Exploitation (physical analogy: very low temperature)

$p(b)$ is a sigmoid of the differences of the reward:

p(b) = \frac{1}{1+ \exp(-β(r_b-r_y))} \begin{cases} \xrightarrow[r_b-r_y \to +∞]{} 1 \\ \xrightarrow[r_b-r_y \to -∞]{} 0 \end{cases}

NB: $r_b-r_y$ can be positive or negative
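The sigmoid form follows from dividing numerator and denominator of the softmax by $\exp(βr_b)$; a quick numerical check of the identity (a sketch, not from the lecture):

```python
import math

def p_blue_softmax(r_b, r_y, beta):
    zb, zy = math.exp(beta * r_b), math.exp(beta * r_y)
    return zb / (zb + zy)

def p_blue_sigmoid(r_b, r_y, beta):
    # Same quantity written as a sigmoid of the reward difference
    return 1.0 / (1.0 + math.exp(-beta * (r_b - r_y)))

for diff in (-6, 0, 6):  # r_b - r_y positive, zero, or negative
    print(p_blue_softmax(2 + diff, 2, 0.5), p_blue_sigmoid(2 + diff, 2, 0.5))
```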


But:

  • the animal never knows the reward, it can only estimate it
  • what if the reward changes over time?

Internal estimates

Internal estimates:

| Flower | Drops of nectar | Internal estimate |
|--------|-----------------|-------------------|
| Blue   | $r_b = 8$       | $m_b$             |
| Yellow | $r_y = 2$       | $m_y$             |

Greedy update

  • $m_b = r_{b,i}$
  • $m_y = r_{y,i}$

Batch update

  • $m_b = \frac 1 N \sum\limits_{ i=1 }^N r_{b,i}$
  • $m_y = \frac 1 N \sum\limits_{ i=1 }^N r_{y,i}$
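A sketch of the batch update: the estimate is the empirical mean over $N$ sampled rewards (the sample values below are hypothetical, not from the lecture):

```python
rewards_b = [8, 7, 9, 8]  # hypothetical noisy samples of r_b
m_b = sum(rewards_b) / len(rewards_b)  # batch estimate: empirical mean
print(m_b)  # 8.0
```

The drawback is that all $N$ samples must be stored, which motivates the online update below.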

Online update

Indirect actor: the same update as the Rescorla-Wagner rule (delta rule):

m_b ← m_b + ε \underbrace{(r_{b,i} - m_b)}_{ ≝ \, δ}
  • $ε$: learning rate

  • $δ$: prediction error
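A minimal sketch of the online delta-rule update, showing that the running estimate converges to the true reward (here the constant $r_b = 8$):

```python
def online_update(m, r, eps=0.1):
    """One delta-rule step: m <- m + eps * (r - m)."""
    delta = r - m          # prediction error
    return m + eps * delta # eps is the learning rate

m_b = 0.0
for _ in range(100):
    m_b = online_update(m_b, 8)  # repeatedly observe r_b = 8
print(m_b)  # approaches 8
```

Unlike the batch update, this requires no memory of past samples; with a fixed $ε$, recent rewards are weighted more heavily, which lets the estimate track rewards that change over time.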
