Lecture 3: Exploration-exploitation dilemma
Computational model of behavior:
- based on feedback, the animal is able to learn
Conditioning
- classical: Pavlovian
- instrumental
Classical conditioning
Recall the Rescorla-Wagner rule, with learning rate $ε$, stimulus $u_i$ and prediction error $δ_i$ on trial $i$:
\[w ← w + ε δ_i u_i\]
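A minimal Python sketch of one Rescorla-Wagner step (the function name, learning rate and stimulus/reward values are illustrative, not from the lecture):

```python
# One Rescorla-Wagner step: w <- w + epsilon * delta * u,
# with prediction v = w * u and prediction error delta = r - v.
def rw_update(w, u, r, epsilon=0.1):
    v = w * u                   # predicted reward
    delta = r - v               # prediction error
    return w + epsilon * delta * u

w = 0.0
for trial in range(100):        # stimulus u = 1 repeatedly paired with reward r = 1
    w = rw_update(w, u=1.0, r=1.0)
print(w)                        # close to 1: the association has been learned
```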
Instrumental conditioning
- static action choice: reward delivered immediately after the choice
- sequential choice: reward delivered after a series of actions
Decision strategies
- Policy: the strategy used by the animal to maximize the reward
Ex: real experiment with bees and flowers:
Flower | Drops of nectar |
---|---|
Blue | $r_b = 8$ |
Yellow | $r_y = 2$ |
Expected value of the reward:
\[⟨R⟩ = r_y \cdot p(a=\text{yellow}) + r_b \cdot p(a=\text{blue})\]
These probabilities depend only on the policy of the animal.
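A minimal Python sketch of this expectation (the dictionary and function name are illustrative):

```python
# Expected reward <R> = sum over actions of r_a * p(a), for the two flowers.
rewards = {"blue": 8, "yellow": 2}

def expected_reward(policy, rewards=rewards):
    """policy: dict mapping each action to its probability (summing to 1)."""
    return sum(rewards[a] * policy[a] for a in rewards)

print(expected_reward({"blue": 0.5, "yellow": 0.5}))   # 5.0 for a uniform policy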
Greedy policy
- $p(a=\text{blue}) = 1$
- $p(a=\text{yellow}) = 0$
Ex: always go for the blue flower.
\[⟨R⟩ = 8 × 1 + 2 × 0 = 8\]
But if $r_b$ and $r_y$ change over time, the animal is tricked.
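A small sketch of the greedy choice (action names and estimates are illustrative); it also shows how the policy gets tricked when the rewards change but the estimates do not:

```python
# Greedy policy: always pick the action with the largest (estimated) reward.
def greedy(estimates):
    return max(estimates, key=estimates.get)

m = {"blue": 8, "yellow": 2}
print(greedy(m))    # 'blue'
# If the flowers swap nectar contents (blue -> 2, yellow -> 8) and m is never
# updated, greedy(m) still returns 'blue': the animal is tricked.
```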
$ε$-Greedy policy
For $ε ≪ 1$:
- $p(a=\text{blue}) = 1-ε$
- $p(a=\text{yellow}) = ε$
Ex: back to our example:
\[⟨R⟩ = (1-ε) × 8 + ε × 2 = 8 - 6 ε\]
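A minimal Python sketch of this $ε$-greedy choice as defined above (exploring means picking one of the non-greedy actions; the function name is illustrative):

```python
import random

def epsilon_greedy(estimates, eps=0.1):
    best = max(estimates, key=estimates.get)          # exploit with prob. 1 - eps
    if random.random() < eps:                         # explore with prob. eps
        return random.choice([a for a in estimates if a != best])
    return best

m = {"blue": 8, "yellow": 2}
# Over many trials the average reward approaches 8(1 - eps) + 2*eps = 8 - 6*eps.
print(epsilon_greedy(m, eps=0.1))
```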
Softmax (Gibbs) policy
Depends on the rewards. For a fixed $β ≥ 0$:
- $p(a=\text{blue}) = \frac{\exp(βr_b)}{\exp(βr_b)+\exp(βr_y)}$
- $p(a=\text{yellow}) = \frac{\exp(βr_y)}{\exp(βr_b)+\exp(βr_y)}$
NB: Gibbs distribution comes from physics, where $β$ is proportional to the inverse of the temperature.
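A minimal Python sketch of the softmax (Gibbs) policy (the function name is illustrative):

```python
import math

def softmax_policy(estimates, beta=1.0):
    """p(a) proportional to exp(beta * m_a)."""
    weights = {a: math.exp(beta * m) for a, m in estimates.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

print(softmax_policy({"blue": 8, "yellow": 2}, beta=1.0))
# {'blue': ~0.9975, 'yellow': ~0.0025}
```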
Exploration-Exploitation trade-off:
- $β ⟶ 0$: Exploration (physical analogy: very high temperature)
- $β ⟶ +∞$: Exploitation (physical analogy: very low temperature)
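Reusing the `softmax_policy` sketch above, the two limits can be checked numerically:

```python
print(softmax_policy({"blue": 8, "yellow": 2}, beta=0.0))
# {'blue': 0.5, 'yellow': 0.5}        -> uniform choice: pure exploration
print(softmax_policy({"blue": 8, "yellow": 2}, beta=5.0))
# {'blue': ~1.0, 'yellow': ~1e-13}    -> essentially greedy: pure exploitation
```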
$p(b)$ is a sigmoid of the difference of the rewards:
\[p(b) = \frac{1}{1+ \exp(-β(r_b-r_y))} \begin{cases} \xrightarrow[r_b-r_y \to +∞]{} 1 \\ \xrightarrow[r_b-r_y \to -∞]{} 0 \end{cases}\]
NB: $r_b-r_y$ can be positive or negative.
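The sigmoid form follows by dividing the numerator and the denominator of the softmax by $\exp(βr_b)$:
\[p(b) = \frac{\exp(βr_b)}{\exp(βr_b)+\exp(βr_y)} = \frac{1}{1+\exp\big(-β(r_b-r_y)\big)}\]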
But:
- the animal never knows the true reward; it can only estimate it
- what if the reward changes over time?
Internal estimates
Flower | Drops of nectar | Internal estimate |
---|---|---|
Blue | $r_b = 8$ | $m_b$ |
Yellow | $r_y = 2$ | $m_y$ |
Greedy update (keep only the most recent sample):
- $m_b = r_{b,i}$
- $m_y = r_{y,i}$
Batch update
- $m_b = \frac 1 N \sum\limits_{ i=1 }^N r_{b,i}$
- $m_y = \frac 1 N \sum\limits_{ i=1 }^N r_{y,i}$
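A minimal Python sketch of the batch estimate (the sample values are hypothetical):

```python
def batch_estimate(samples):
    """Average of all rewards observed so far for one flower."""
    return sum(samples) / len(samples)

print(batch_estimate([8, 7, 9, 8]))   # 8.0 -- hypothetical samples of the blue flower
```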
Online update
Indirect actor, same as Rescorla-Wagner rule (delta-rule):
\[m_b ← m_b + ε \underbrace{(r_{b,i} - m_b)}_{ ≝ \, δ}\]
- $ε$: learning rate
- $δ$: prediction error
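A minimal Python sketch of the online (delta-rule) estimate for the blue flower, with hypothetical noisy rewards around $r_b = 8$:

```python
import random

epsilon = 0.1          # learning rate
m_b = 0.0              # initial internal estimate

for i in range(200):
    r_bi = 8 + random.gauss(0, 1)   # sampled reward on visit i (illustrative noise)
    delta = r_bi - m_b              # prediction error
    m_b += epsilon * delta          # delta-rule update

print(m_b)   # close to 8: an exponentially weighted running average of the samples
```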