AT2 – Neuromodeling: Problem set #2 QUANTITATIVE MODELS OF BEHAVIOR
PROBLEM 4: Reinforcement learning in a maze.
Imagine a rat going through the maze shown below. The rat enters the maze at state $A$, then moves on to either state $B$ or $C$, then to $D, E, F$, or $G$, where it can potentially collect a reward (as given by the numbers in the figure), and finally, the rat is taken out of the maze by the experimenter, and thereby moves into the “terminal” state $H$ (not shown).
(a) Assume the rat follows a random decision-making strategy (“policy”), i.e., at each junction, it moves left or right with $50\%$ probability. How often does the rat visit each state $s$ (where $s \in \lbrace A,B,C,D,E,F,G,H \rbrace$)? Give the theoretical number (what you expect given the policy), then perform a numerical simulation. Generate $N = 100$ trials of the rat’s behavior and count how often it visits each state.
Note: each trial consists of a succession of four states, $s_1, s_2, s_3, s_4$. The first state is always $s_1 = A$, and the last state is always the terminal state $s_4 = H$. We assume that the rat never turns back and always moves forward.
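The simulation asked for in (a) can be sketched as follows. This is a minimal sketch, not a reference solution. It assumes a maze layout in which $B$ leads to $D$ or $E$ and $C$ leads to $F$ or $G$ (the figure is not reproduced here, so this layout is an assumption). The names `TRANSITIONS`, `run_trial`, `count_visits`, and `expected_visits` are hypothetical helpers introduced for illustration.

```python
import random
from collections import Counter

# ASSUMED maze layout (the figure is not available here):
# A -> B or C, B -> D or E, C -> F or G, and every arm leads to the terminal state H.
TRANSITIONS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["H"],
    "E": ["H"],
    "F": ["H"],
    "G": ["H"],
}


def run_trial(rng):
    """Return one trial as a list of four states, starting at A and ending at H."""
    state = "A"
    trial = [state]
    while state != "H":
        # Random policy: each available successor is chosen with equal probability.
        state = rng.choice(TRANSITIONS[state])
        trial.append(state)
    return trial


def count_visits(n_trials, seed=0):
    """Simulate n_trials trials and count how often each state is visited."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        counts.update(run_trial(rng))
    return counts


def expected_visits(n_trials):
    """Expected visit counts under the 50/50 policy and the assumed layout."""
    expected = {"A": n_trials, "H": n_trials}
    expected.update({s: n_trials / 2 for s in "BC"})
    expected.update({s: n_trials / 4 for s in "DEFG"})
    return expected


if __name__ == "__main__":
    N = 100
    counts = count_visits(N)
    expected = expected_visits(N)
    for s in "ABCDEFGH":
        print(f"{s}: simulated {counts[s]:3d}, expected {expected[s]:.0f}")
```

Under these assumptions, $A$ and $H$ are visited on every trial, $B$ and $C$ on about half of the trials, and $D$ through $G$ on about a quarter each. With only $N = 100$ trials, the simulated counts for the intermediate states will fluctuate around these expectations.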
(b) In reinforcement learning theory, each state has a value, which is the expected sum of all possible future rewards. These values can be learned through experience using a method called “temporal difference learning”. Initially, the rat assumes that no state carries any value ($V(s) = 0$ for all $s$). After each trial, the state values are updated according to the temporal difference learning rule:
where $s_t$ (with $t \in \lbrace 1, 2, 3, 4 \rbrace$) denotes the sequence of states in a trial, and $r(s)$ is the reward obtained in state $s$.
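A minimal sketch of temporal difference learning on this maze, assuming the standard undiscounted TD(0) form of the update, $V(s_t) \leftarrow V(s_t) + \alpha \, [\, r(s_t) + V(s_{t+1}) - V(s_t) \,]$, applied for $t = 1, 2, 3$ with $V(H)$ held at $0$. The learning rate $\alpha$ (both the symbol and the value $0.1$), the maze layout, and the reward values in `REWARDS` are assumptions for illustration; the actual rewards are the numbers in the figure. All function names are hypothetical.

```python
import random

# ASSUMED maze layout (the figure is not available here).
TRANSITIONS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["H"],
    "E": ["H"],
    "F": ["H"],
    "G": ["H"],
}

# HYPOTHETICAL placeholder rewards: replace with the numbers from the figure.
REWARDS = {"D": 1.0, "E": 0.0, "F": 0.0, "G": 2.0}


def run_trial(rng):
    """Return one trial as a list of four states, starting at A and ending at H."""
    state = "A"
    trial = [state]
    while state != "H":
        state = rng.choice(TRANSITIONS[state])
        trial.append(state)
    return trial


def td_update(values, trial, rewards, alpha):
    """Apply the assumed TD(0) update once for each non-terminal state in the trial."""
    for s, s_next in zip(trial[:-1], trial[1:]):
        # Prediction error: reward in s plus value of the next state, minus value of s.
        delta = rewards.get(s, 0.0) + values[s_next] - values[s]
        values[s] += alpha * delta


def learn_values(n_trials, alpha=0.1, seed=0):
    """Run n_trials random-policy trials, updating the values after each one."""
    rng = random.Random(seed)
    values = {s: 0.0 for s in "ABCDEFGH"}  # initially no state carries any value
    for _ in range(n_trials):
        td_update(values, run_trial(rng), REWARDS, alpha)
    return values


if __name__ == "__main__":
    for state, value in learn_values(100).items():
        print(f"V({state}) = {value:.3f}")
```

Because the states within a trial are all distinct and the loop runs forward in time, each update uses the value of $s_{t+1}$ from before the current trial, so updating step by step gives the same result as updating all visited states at once after the trial. Under the random policy and these assumptions, the learned values should drift toward the average reward reachable from each state, with fluctuations whose size depends on $\alpha$.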