Lecture 3: Exploration-exploitation dilemma
Computational model of behavior:
- based on feedback, the animal is able to learn something
Conditioning
- classical (Pavlovian)
- instrumental
Classical conditioning
Recall the Rescorla-Wagner rule:
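For reference, the standard form of the rule, with stimulus $u$, association weight $w$, and predicted reward $v = w u$:

$$ w \to w + \epsilon\,\delta\,u, \qquad \delta = r - v $$

Here $\epsilon$ is the learning rate and $\delta$ the prediction error, the same quantities that reappear in the delta rule at the end of this lecture.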
Instrumental conditioning
- static action choice: the reward is delivered immediately after the choice
- sequential choice: the reward comes only after a series of actions
Decision strategies
- Policy: the strategy used by the animal to maximize the reward
Ex: Real experiment: bees and flowers
Flower | Drops of nectar |
---|---|
Blue | $r_b$ |
Yellow | $r_y$ |
Expected value of the reward: $\langle r \rangle = p_b r_b + p_y r_y$, where $p_b$ and $p_y$ are the probabilities of choosing the blue or the yellow flower. These probabilities depend only on the policy of the animal.
Greedy policy
Ex: always go for the blue flower.
But if the rewards change (say, the yellow flowers start yielding more nectar), a purely greedy animal never notices: it never samples the alternative.
ε-Greedy policy
With probability $1-\epsilon$ choose the action with the highest estimated reward; with probability $\epsilon$ explore another action at random. For a small $\epsilon$ the policy is mostly exploitative but keeps sampling the alternatives.
Ex: back to our example: pick the blue flower with probability $1-\epsilon$ and the yellow one with probability $\epsilon$ (see the sketch below).
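A minimal sketch of an ε-greedy choice between the two flowers (the function name `epsilon_greedy` and the estimate values are illustrative, not from the lecture):

```python
import random

def epsilon_greedy(m, epsilon=0.1):
    """Pick an action index given a list of reward estimates m.

    With probability 1 - epsilon choose the action with the highest
    estimate; with probability epsilon explore uniformly among the
    remaining actions.
    """
    best = max(range(len(m)), key=lambda a: m[a])
    if random.random() < epsilon:
        others = [a for a in range(len(m)) if a != best]
        return random.choice(others)
    return best

# Two flowers: index 0 = blue, 1 = yellow; estimates are made up.
m = [5.0, 2.0]
print(epsilon_greedy(m))  # 0 most of the time, occasionally 1
```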
Softmax (Gibbs) policy
$$ p_b = \frac{\exp(\beta r_b)}{\exp(\beta r_b) + \exp(\beta r_y)}, \qquad p_y = 1 - p_b $$
The choice probabilities depend on the reward: for a fixed $\beta$, actions with larger rewards are chosen more often, but every action keeps a nonzero probability.
NB: the Gibbs distribution comes from physics, where $\beta = 1/(k_B T)$ plays the role of an inverse temperature.
Exploration-Exploitation trade-off:
- $\beta \to 0$: exploration (physical analogy: very high temperature)
- $\beta \to \infty$: exploitation (physical analogy: very low temperature)
NB: in the limit $\beta \to \infty$ the softmax policy reduces to the greedy policy.
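A small sketch of how $\beta$ controls the trade-off (the function name and the reward values are illustrative):

```python
import math

def softmax_probs(values, beta):
    """Gibbs/softmax choice probabilities for a list of values."""
    weights = [math.exp(beta * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

r = [5.0, 2.0]                  # made-up rewards for blue and yellow
print(softmax_probs(r, 0.0))    # beta -> 0: [0.5, 0.5], pure exploration
print(softmax_probs(r, 10.0))   # large beta: almost [1, 0], exploitation
```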
But:
- the animal never knows the true reward; it can only estimate it
- what if the reward changes over time?
Internal estimates
The animal maintains internal estimates $m_b$ and $m_y$ of the rewards:

Flower | Drops of nectar | Internal estimate |
---|---|---|
Blue | $r_b$ | $m_b$ |
Yellow | $r_y$ | $m_y$ |
Greedy update
$m \to r$: keep only the most recent reward as the estimate.
Batch update
$m \to \frac{1}{N}\sum_{i=1}^{N} r_i$: average all $N$ rewards observed so far (requires storing every sample).
Online update
$m \to m + \epsilon\,(r - m)$: update incrementally after each reward. With $\epsilon = 1/N$ this reproduces the batch average; with a fixed $\epsilon$ it weights recent rewards more, so it can track rewards that change over time.
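A quick sanity check of the equivalence between the batch average and the online rule with $\epsilon = 1/N$ (the helper names are illustrative):

```python
def batch_estimate(samples):
    """Plain average over all stored samples."""
    return sum(samples) / len(samples)

def online_estimate(samples):
    """Incremental delta rule with a decaying learning rate eps = 1/N."""
    m = 0.0
    for n, r in enumerate(samples, start=1):
        m += (r - m) / n
    return m

samples = [4.0, 6.0, 5.0, 7.0]
print(batch_estimate(samples), online_estimate(samples))  # both 5.5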
Indirect actor, same as the Rescorla-Wagner rule (delta rule):
$$ m \to m + \epsilon\,\delta, \qquad \delta = r - m $$
- $\epsilon$: learning rate
- $\delta$: prediction error
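Putting the pieces together, a minimal simulation sketch of the indirect actor: a softmax policy over the internal estimates, updated by the delta rule after each visit (the nectar values and parameters $\beta$, $\epsilon$ are made up for illustration):

```python
import math
import random

def simulate_bee(rewards, beta=1.0, epsilon=0.1, n_visits=200):
    """Indirect actor: softmax policy over internal estimates,
    updated by the delta rule after each flower visit."""
    m = [0.0] * len(rewards)           # internal estimates, start at 0
    for _ in range(n_visits):
        weights = [math.exp(beta * mi) for mi in m]
        a = random.choices(range(len(m)), weights=weights)[0]
        r = rewards[a]                 # nectar obtained on this visit
        m[a] += epsilon * (r - m[a])   # delta rule: m <- m + eps * (r - m)
    return m

# Made-up nectar values for blue and yellow flowers.
print(simulate_bee(rewards=[5.0, 2.0]))  # estimates move toward [5, 2]
```

Because the estimates are updated online, the same bee can re-adapt if the nectar contents of the two flower types are swapped mid-experiment.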