[Problem Set 2 Quantitative models of behavior] Problem 4: Reinforcement learning in a maze

Link to the iPython notebook for the code

AT2 – Neuromodeling: Problem set #2 QUANTITATIVE MODELS OF BEHAVIOR

Imagine a rat going through the maze shown below. The rat enters the maze at state $A$, then moves on to either state $B$ or $C$, then to $D, E, F$, or $G$, where it can potentially collect a reward (as given by the numbers in the figure), and finally, the rat is taken out of the maze by the experimenter, and thereby moves into the “terminal” state $H$ (not shown).

(a) Assume the rat follows a random decision-making strategy (“policy”), i.e. at each junction it moves left or right with $50\%$ probability. How often does the rat visit each state $s$ (where $s ∈ \lbrace A,B,C,D,E,F,G,H \rbrace$)? Give the theoretical numbers (what you expect given the policy), then perform a numerical simulation: generate $N = 100$ trials of the rat’s behavior and count how often it visits each state.

Note: each trial consists of a succession of four states: $s_1, s_2, s_3, s_4$, the first state is always $s_1 = A$ and the last state is always the terminal state $s_4 = H$. We assume that the rat never turns back and always moves forward.
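A minimal sketch of the simulation for part (a), assuming the maze is the binary tree described above ($A \to B/C$, $B \to D/E$, $C \to F/G$, then $H$); the function names are illustrative:

```python
import random
from collections import Counter

# Maze topology: from each junction the rat moves left or right
# with equal probability; H is the terminal state.
TRANSITIONS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}

def run_trial(rng=random):
    """Generate one trial s1..s4 under the random policy."""
    trial = ["A"]
    while trial[-1] in TRANSITIONS:
        trial.append(rng.choice(TRANSITIONS[trial[-1]]))
    trial.append("H")  # the experimenter removes the rat
    return trial

def count_visits(n_trials=100, seed=0):
    """Count state visits over n_trials trials of the random policy."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        counts.update(run_trial(rng))
    return counts

counts = count_visits(100)
```

Theoretically, $A$ and $H$ are visited on every trial ($N = 100$ times), $B$ and $C$ on about $N/2 = 50$ trials each, and $D, E, F, G$ on about $N/4 = 25$ trials each; the simulated counts will scatter around these values.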

(b) In reinforcement learning theory, each state has a value, which is the expected sum of all possible future rewards. These values can be learned through experience using a method called “temporal difference learning”. Initially, the rat assumes that no state carries any value ($V (s) = 0$ for all $s$). After each trial, the state values are updated according to the temporal difference learning rule:

$$V(s_t) \leftarrow V(s_t) + \varepsilon \left[ r(s_t) + V(s_{t+1}) - V(s_t) \right] \qquad (9)$$

where $s_t$ (with $t ∈ \lbrace 1, 2, 3, 4 \rbrace$) denotes the sequence of states in a trial, and $r(s)$ is the reward obtained in state $s$.

Use the trial sequences generated in (a) to update the values $V(s)$. Plot the values $V(s)$ as a function of the trial number.
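A sketch of the temporal-difference update for part (b). Since the maze figure (and hence the actual reward numbers) is not shown here, the rewards below are placeholders that should be replaced by the values in the figure; the learning rate $\varepsilon = 0.5$ is likewise just an example:

```python
import random

TRANSITIONS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
# Placeholder rewards at the leaf states -- substitute the numbers
# from the maze figure. States not listed give reward 0.
REWARDS = {"E": 5.0, "F": 2.0}

def td_learning(n_trials=100, eps=0.5, seed=0):
    """Run n_trials under the random policy, applying the TD rule (Eq. 9)
    after each trial; return the value table after every trial."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in "ABCDEFGH"}
    history = []
    for _ in range(n_trials):
        # Generate one trial s1..s4 under the random policy.
        trial = ["A"]
        while trial[-1] in TRANSITIONS:
            trial.append(rng.choice(TRANSITIONS[trial[-1]]))
        trial.append("H")
        # TD update, Eq. (9), applied to each transition of the trial.
        for t in range(len(trial) - 1):
            s, s_next = trial[t], trial[t + 1]
            delta = REWARDS.get(s, 0.0) + V[s_next] - V[s]
            V[s] += eps * delta
        history.append(dict(V))  # snapshot after this trial
    return history

history = td_learning()
```

`history[k][s]` gives $V(s)$ after trial $k$, which can be plotted against the trial number (e.g. with matplotlib). Under the random policy, each $V(s)$ should converge toward the expected future reward from $s$: the leaf values approach their own rewards, and $V(B)$, $V(C)$, $V(A)$ approach the averages over their successors.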

(c) A smart rat may want to use the information it is collecting about the values of the different states (i.e. the expected future reward!) to adapt its decision-making strategy. Use the estimated values $V(s)$ from above to change the rat’s policy: assume that at each junction, the rat compares the values of the two choices using the “softmax” decision rule from the bee-learning problem. What happens to the learning process if the rat is “greedy”, i.e. usually goes for the side with the larger value? What if the rat is very explorative?
