[Problem Set 2 Quantitative models of behavior] Problem 4: Reinforcement learning in a maze

Link to the iPython notebook for the code

AT2 – Neuromodeling: Problem set #2 QUANTITATIVE MODELS OF BEHAVIOR

Imagine a rat going through the maze shown below. The rat enters the maze at state $A$, then moves on to either state $B$ or $C$, then to $D, E, F$, or $G$, where it can potentially collect a reward (as given by the numbers in the figure), and finally, the rat is taken out of the maze by the experimenter, and thereby moves into the “terminal” state $H$ (not shown).

(a) Assume the rat follows a random decision-making strategy (“policy”), i.e. at each junction it moves left or right with $50\%$ probability. How often does the rat visit each state $s$ (where $s ∈ \lbrace A,B,C,D,E,F,G,H \rbrace$)? Give the theoretical numbers (what you expect given the policy), then perform a numerical simulation: generate $N = 100$ trials of the rat’s behavior and count how often it visits each state.

Note: each trial consists of a succession of four states: $s_1, s_2, s_3, s_4$, the first state is always $s_1 = A$ and the last state is always the terminal state $s_4 = H$. We assume that the rat never turns back and always moves forward.
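A minimal sketch of the simulation for part (a), assuming the maze is the binary tree described above ($A \to B/C$, $B \to D/E$, $C \to F/G$, then $H$); the function names are illustrative:

```python
import random
from collections import Counter

# Maze topology: from each junction the rat moves left or right
# with equal probability; H is the terminal state.
TRANSITIONS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}

def run_trial(rng=random):
    """Generate one trial s1..s4 under the random policy."""
    trial = ["A"]
    while trial[-1] in TRANSITIONS:
        trial.append(rng.choice(TRANSITIONS[trial[-1]]))
    trial.append("H")  # the experimenter removes the rat
    return trial

def count_visits(n_trials=100, seed=0):
    """Count state visits over n_trials trials of the random policy."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        counts.update(run_trial(rng))
    return counts

counts = count_visits(100)
```

Theoretically, $A$ and $H$ are visited on every trial ($N = 100$ times), $B$ and $C$ on about $N/2 = 50$ trials each, and $D, E, F, G$ on about $N/4 = 25$ trials each; the simulated counts will scatter around these values.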

(b) In reinforcement learning theory, each state has a value, which is the expected sum of all possible future rewards. These values can be learned through experience using a method called “temporal difference learning”. Initially, the rat assumes that no state carries any value ($V (s) = 0$ for all $s$). After each trial, the state values are updated according to the temporal difference learning rule:

$$V(s_t) \leftarrow V(s_t) + \varepsilon \left[ r(s_t) + V(s_{t+1}) - V(s_t) \right] \qquad (9)$$

where $s_t$ (with $t ∈ \lbrace 1, 2, 3, 4 \rbrace$) denotes the sequence of states in a trial, and $r(s)$ is the reward obtained in state $s$.

Use the trial sequences generated in (a) to update the values $V(s)$. Plot the values $V(s)$ as a function of the trial number.
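A sketch of the temporal-difference update for part (b). Since the maze figure (and hence the actual reward numbers) is not shown here, the rewards below are placeholders that should be replaced by the values in the figure; the learning rate $\varepsilon = 0.5$ is likewise just an example:

```python
import random

TRANSITIONS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
# Placeholder rewards at the leaf states -- substitute the numbers
# from the maze figure. States not listed give reward 0.
REWARDS = {"E": 5.0, "F": 2.0}

def td_learning(n_trials=100, eps=0.5, seed=0):
    """Run n_trials under the random policy, applying the TD rule (Eq. 9)
    after each trial; return the value table after every trial."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in "ABCDEFGH"}
    history = []
    for _ in range(n_trials):
        # Generate one trial s1..s4 under the random policy.
        trial = ["A"]
        while trial[-1] in TRANSITIONS:
            trial.append(rng.choice(TRANSITIONS[trial[-1]]))
        trial.append("H")
        # TD update, Eq. (9), applied to each transition of the trial.
        for t in range(len(trial) - 1):
            s, s_next = trial[t], trial[t + 1]
            delta = REWARDS.get(s, 0.0) + V[s_next] - V[s]
            V[s] += eps * delta
        history.append(dict(V))  # snapshot after this trial
    return history

history = td_learning()
```

`history[k][s]` gives $V(s)$ after trial $k$, which can be plotted against the trial number (e.g. with matplotlib). Under the random policy, each $V(s)$ should converge toward the expected future reward from $s$: the leaf values approach their own rewards, and $V(B)$, $V(C)$, $V(A)$ approach the averages over their successors.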

(c) A smart rat may want to use the information it is collecting about the values of the different states (i.e. the expected future reward!) to adapt its decision-making strategy. Use the estimated values $V(s)$ from above to change the rat’s policy: assume that at each junction, the rat compares the values of the two choices using the “softmax” decision rule from the bee-learning problem. What happens to the learning process if the rat is “greedy”, i.e. usually goes for the side with the larger value? What if the rat is very explorative?
