Problem Set 3: Reinforcement Learning

Problem 1

Inhibitory conditioning is a conditioning paradigm in which one stimulus is presented in conjunction with the reward on some trials, and in conjunction with another, neutral stimulus in the absence of the reward on other trials. The animal learns that the second stimulus predicts the absence of the reward. This is demonstrated either by slow learning when the second (inhibitory) stimulus is later paired with the reward, or by a reduced response when the inhibitory stimulus is paired with a third stimulus previously associated with the reward.

Show how this result follows from the Rescorla-Wagner rule.

The Rescorla-Wagner rule is as follows:

\textbf{w} → \textbf{w} + ε (r - \textbf{w} \cdot \textbf{u}) \textbf{u}
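As a concrete illustration of the update, the sketch below applies the rule to simple acquisition: a single stimulus repeatedly paired with a reward. The trial count and learning rate are assumptions chosen for illustration, not part of the problem.

```python
import numpy as np

def rescorla_wagner_update(w, u, r, eps=0.1):
    """One Rescorla-Wagner step: w <- w + eps * (r - w.u) * u."""
    delta = r - np.dot(w, u)  # prediction error
    return w + eps * delta * u

# Simple acquisition: one stimulus, always paired with reward r = 1.
w = np.zeros(1)
for _ in range(200):
    w = rescorla_wagner_update(w, np.array([1.0]), r=1.0)

print(w)  # the weight converges toward the reward magnitude
```

Since the prediction error shrinks geometrically, the weight approaches the reward value, reproducing the standard acquisition curve.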

In overshadowing, two stimuli are presented together with the reward during training. Both stimuli become associated with the reward, but the association is often shared unequally between them, as if one stimulus were more salient than the other (after training, one of the stimuli elicits a stronger response).
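One way to simulate the overshadowing protocol is to give each stimulus its own learning rate; the per-stimulus saliences below are assumptions for illustration, and the question of whether this extension is what the Rescorla-Wagner rule requires is left to the problem.

```python
import numpy as np

# Overshadowing protocol: both stimuli always presented together with reward.
eps = np.array([0.2, 0.05])  # assumed per-stimulus saliences (learning rates)
u = np.array([1.0, 1.0])     # both stimuli present on every trial
r = 1.0

w = np.zeros(2)
for _ in range(500):
    delta = r - np.dot(w, u)  # a single, shared prediction error
    w = w + eps * delta * u   # each weight updated with its own salience

print(w)  # the total prediction w[0] + w[1] approaches r, split unequally
```

Because both weights are driven by the same prediction error, their ratio is fixed by the ratio of saliences while their sum converges to the reward.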

How can the Rescorla-Wagner rule account for this?


In secondary conditioning, two neutral stimuli are paired after one of them has been associated with a reward. This causes the second stimulus to evoke an expectation of the reward with which it has never been paired (although only if the two stimuli are not paired too many times in the absence of the reward).
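To explore what the rule predicts in this two-stage protocol, one can simulate it directly; the trial counts and learning rate below are assumptions. The sign of the second weight after stage 2 is the quantity to compare against the observed behavior.

```python
import numpy as np

eps, r = 0.1, 1.0
w = np.zeros(2)

# Stage 1: stimulus 1 alone, paired with reward.
u1 = np.array([1.0, 0.0])
for _ in range(200):
    w = w + eps * (r - np.dot(w, u1)) * u1

# Stage 2: stimuli 1 and 2 together, with no reward.
u12 = np.array([1.0, 1.0])
for _ in range(50):
    w = w + eps * (0.0 - np.dot(w, u12)) * u12

print(w)  # inspect the sign of w[1] to compare with the observed behavior
```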

Can the Rescorla-Wagner rule explain this?

Problem 2

For the maze task presented in class, assume that there are no additional stimuli accessible to the rat. Write the stimulus vectors $\textbf{u}$ for the locations in unary (one-hot) representation: $\textbf{u}(A) = (1, 0, 0)$, $\textbf{u}(B) = (0, 1, 0)$, $\textbf{u}(C) = (0, 0, 1)$. Assume an initial policy $\textbf{m}(\textbf{u}) = \textbf{0}$.
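The unary representation and the critic's linear value function $v(\textbf{u}) = \textbf{w} \cdot \textbf{u}$ can be written down directly. The sketch below is only the scaffolding for the computation; the all-zero weight vector is a placeholder, not the answer to the problem.

```python
import numpy as np

# Unary (one-hot) stimulus vectors for the three maze locations.
u = {"A": np.array([1.0, 0.0, 0.0]),
     "B": np.array([0.0, 1.0, 0.0]),
     "C": np.array([0.0, 0.0, 1.0])}

w = np.zeros(3)  # critic weights; zeros here as a placeholder

def v(location):
    """Linear value function v(u) = w . u; with one-hot u this reads out w."""
    return np.dot(w, u[location])

print(v("A"), v("B"), v("C"))
```

With a one-hot representation, $v$ at each location is simply the corresponding component of $\textbf{w}$, which is what makes the critic's weights easy to read off once learned.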

Is it optimal?

Compute the vector $\textbf{w}$ at the end of critic learning under this policy (correct policy evaluation), and the resulting value function $v(\textbf{u})$.

What is the optimal policy $\textbf{m}(\textbf{u})$?

What are $\textbf{w}$ and $v(\textbf{u})$ that the critic will learn after the actor has learned the optimal policy?
