Lecture 3: Exploration-exploitation dilemma

Computational model of behavior:

  • based on feedback, the animal is able to learn something

Conditioning

  • classical: Pavlovian

  • instrumental

Classical conditioning

Recall the Rescorla-Wagner rule:

$w \leftarrow w + \varepsilon\, u_i\, \delta_i$
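
As a rough illustration, here is a minimal NumPy sketch of this update for a single stimulus paired with a constant reward; the variable names and the particular values of the learning rate, stimulus, and reward are assumptions made for the example, not from the lecture.

```python
import numpy as np

# Rescorla-Wagner (delta) rule: w <- w + eps * u * delta, with delta = r - w.u
eps = 0.1                    # learning rate (assumed value)
w = np.zeros(1)              # association weight for one stimulus
u = np.array([1.0])          # stimulus vector (stimulus present)
r = 1.0                      # reward paired with the stimulus (assumed value)

for trial in range(50):
    v = w @ u                # predicted reward
    delta = r - v            # prediction error
    w = w + eps * u * delta  # delta-rule update

print(w)                     # approaches the delivered reward, ~1.0
```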

Instrumental conditioning

  • static action choice: reward delivered immediately after the choice

  • sequential choice: reward delivered after a series of actions

Decision strategies

Policy:

the strategy used by the animal to maximize the reward

Ex: Real experiment: bees and flowers

Flower    Drops of nectar
Blue      $r_b = 8$
Yellow    $r_y = 2$

Expected value of the reward:

$R = r_y \, p(a=\text{yellow}) + r_b \, p(a=\text{blue})$

These probabilities depend only on the policy of the animal.
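
For concreteness, a tiny sketch of this expectation in Python; the function name `expected_reward` is mine, and the reward values are taken from the table above.

```python
r_b, r_y = 8, 2                          # drops of nectar from the table above

def expected_reward(p_blue):
    # R = r_b * p(a=blue) + r_y * p(a=yellow), with p(a=yellow) = 1 - p(a=blue)
    return r_b * p_blue + r_y * (1 - p_blue)

print(expected_reward(1.0))              # always blue: 8
print(expected_reward(0.5))              # random choice: 5
```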

Greedy policy

  • $p(a=\text{blue}) = 1$
  • $p(a=\text{yellow}) = 0$

Ex: always go for the blue flower.

$R = 8 \times 1 + 2 \times 0 = 8$

But if $r_b$ and $r_y$ change over time, the animal is tricked.
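
A small hypothetical simulation of this failure mode, assuming the nectar contents swap halfway through while the greedy bee keeps choosing blue:

```python
n_trials = 100
total = 0
for t in range(n_trials):
    # assumed scenario: the nectar contents of the two flowers swap at trial 50
    r_b, r_y = (8, 2) if t < 50 else (2, 8)
    total += r_b                  # greedy policy: always pick blue
print(total / n_trials)           # 5.0 per trial, instead of the 8 a flexible bee could get
```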

ε-Greedy policy

For $\varepsilon \ll 1$:

  • $p(a=\text{blue}) = 1 - \varepsilon$
  • $p(a=\text{yellow}) = \varepsilon$

Ex: back to our example:

$R = 8(1-\varepsilon) + 2\varepsilon = 8 - 6\varepsilon$
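
A quick sketch of ε-greedy action selection (variable names and ε = 0.1 are assumed for illustration); the simulated average reward lands near 8 − 6ε, and setting ε = 0 recovers the greedy policy.

```python
import random

r_b, r_y = 8, 2
eps = 0.1                             # exploration rate (assumed value), eps << 1

def epsilon_greedy_choice():
    # explore (yellow) with probability eps, otherwise exploit (blue)
    return "yellow" if random.random() < eps else "blue"

rewards = [r_b if epsilon_greedy_choice() == "blue" else r_y for _ in range(100_000)]
print(sum(rewards) / len(rewards))    # close to 8 - 6 * eps = 7.4
```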

Softmax (Gibbs) policy

The choice probabilities depend on the rewards. For a fixed $\beta \geq 0$:

  • $p(a=\text{blue}) = \dfrac{\exp(\beta r_b)}{\exp(\beta r_b) + \exp(\beta r_y)}$
  • $p(a=\text{yellow}) = \dfrac{\exp(\beta r_y)}{\exp(\beta r_b) + \exp(\beta r_y)}$

NB: the Gibbs distribution comes from physics, where $\beta$ is proportional to the inverse of the temperature.

Exploration-Exploitation trade-off:

  • $\beta \to 0$: Exploration (physical analogy: very high temperature)

  • $\beta \to +\infty$: Exploitation (physical analogy: very low temperature)

$p(b)$ is a sigmoid of the difference of the rewards:

$p(b) = \dfrac{1}{1+\exp(-\beta(r_b - r_y))} \to \begin{cases} 1 & \text{if } r_b - r_y \to +\infty \\ 0 & \text{if } r_b - r_y \to -\infty \end{cases}$

NB: $r_b - r_y$ can be positive or negative.
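
A short sketch of the softmax choice probability, assuming the true rewards are known to the agent; sweeping β illustrates the two limits described above.

```python
import numpy as np

r_b, r_y = 8, 2

def p_blue(beta):
    # softmax (Gibbs) probability of choosing blue;
    # algebraically a sigmoid of beta * (r_b - r_y)
    return np.exp(beta * r_b) / (np.exp(beta * r_b) + np.exp(beta * r_y))

for beta in [0.0, 0.1, 0.5, 2.0]:
    print(beta, p_blue(beta))
# beta -> 0   : p(blue) -> 0.5  (pure exploration, very high temperature)
# beta -> +inf: p(blue) -> 1.0  (pure exploitation, very low temperature)
```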


But:

  • the animal never knows the true reward; it can only estimate it
  • what if the reward changes over time?

Internal estimates

Flower    Drops of nectar    Internal estimate
Blue      $r_b = 8$          $m_b$
Yellow    $r_y = 2$          $m_y$

Greedy update

The estimate is simply the most recent reward sample:

  • $m_b = r_{b,i}$
  • $m_y = r_{y,i}$

Batch update

  • $m_b = \frac{1}{N}\sum_{i=1}^{N} r_{b,i}$
  • $m_y = \frac{1}{N}\sum_{i=1}^{N} r_{y,i}$
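
A minimal sketch of the batch estimate, assuming (purely for illustration) that the observed drops of nectar are noisy samples around the true values:

```python
import numpy as np

rng = np.random.default_rng(0)
# assumed: N = 100 noisy observations around the true values r_b = 8 and r_y = 2
r_b_samples = 8 + rng.normal(0, 1, size=100)
r_y_samples = 2 + rng.normal(0, 1, size=100)

m_b = r_b_samples.mean()          # m_b = (1/N) * sum_i r_{b,i}
m_y = r_y_samples.mean()
print(m_b, m_y)                   # close to 8 and 2
```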

Online update

Indirect actor, same as Rescorla-Wagner rule (delta-rule):

$m_b \leftarrow m_b + \varepsilon \underbrace{(r_{b,i} - m_b)}_{\delta}$

  • $\varepsilon$: learning rate

  • $\delta$: prediction error
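
Finally, a sketch of the online (delta-rule) estimate; the learning rate, noise level, and the assumption that the blue flower's nectar changes at trial 100 are illustrative choices, but they show how this update copes with a reward that changes over time.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1                              # learning rate (assumed value)
m_b = 0.0                              # internal estimate of the blue reward

estimates = []
for i in range(200):
    true_r_b = 8 if i < 100 else 2     # assumed: nectar content drops at trial 100
    r_bi = true_r_b + rng.normal(0, 1) # noisy sample r_{b,i}
    delta = r_bi - m_b                 # prediction error
    m_b = m_b + eps * delta            # online (delta-rule) update
    estimates.append(m_b)

print(estimates[99], estimates[199])   # ~8 before the change, ~2 after it
```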
