Exercise Sheet 2: Temporal-Difference learning with discounting

Exercise Sheet 2: Temporal-Difference learning

Younesse Kaddar

1. Temporal-Difference learning with discounting

In many instances, immediate rewards are worth more than those in the future. To take this observation into account, the value $V(s_t)$ of a particular state $s_t$ is not the sum all of future rewards, but rather the sum of all future discounted rewards.

V(s_t) ≝ \sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{t+τ})

where $0 < γ < 1$. Here, $s_t$ is the state at time $t$, i.e. the state in which the agent is right now, $s_{t+1}$ the state that the agent will move to next, and so on.

Following the derivation in the lecture, show that the temporal-difference-learning rule in this case is given by:

V(s_t) → V(s_t) + ε\Big(r(s_t) + γ V(s_{t+1}) - V(s_t)\Big)

Reminder: with $γ = 1$

With the notations used in the lecture: in compliance with the $δ$-rule, we came up with the following TD-learning rule with $γ = 1$ (no discounting):

\hat{V}(s_t) → \hat{V}(s_t) + ε δ

where

  • $\hat{V}(s_t)$ is the internal estimate of value (denoted by $V(s_t)$ in the problem statement, by abuse of notation)
  • $0 < ε « 1$ is a learning rate
  • $δ$ is the prediction error

The most “natural” prediction error is the difference between the actual value the state $V(s_t)$ and the internal estimate thereof:

δ_{natural} ≝ V(s_t) - \hat{V}(s_t)

As explained in class: the problem is that the real value of the state $V(s_t)$ is not known, so one is unable to compute such a prediction error! The trick we used to cope with that was to notice that:

\begin{align*} δ_{natural} & ≝ \overbrace{V(s_t)}^{= \; r(s_t) + V(s_{t+1})} - \hat{V}(s_t) \\ & = r(s_t) + V(s_{t+1}) - \hat{V}(s_t)\\ \end{align*}

so that we could use the current internal estimate $\hat{V}(s_{t+1})$ of $V(s_{t+1})$, as in dynamic programming, to approximate:

δ_{natural} = r(s_t) + V(s_{t+1}) - \hat{V}(s_t) ≃ r(s_t) + \hat{V}(s_{t+1}) - \hat{V}(s_t)

And we set $δ$ to be this approximate (and now computable!) value:

δ ≝ r(s_t) + \hat{V}(s_{t+1}) - \hat{V}(s_t)

For $0 < γ < 1$

In the discounting case, likewise:

\begin{align*} δ_{natural} & ≝ V(s_t) - \hat{V}(s_t) \\ & = \sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{t+τ}) - \hat{V}(s_t) \\ & = r(s_t) + \sum\limits_{ τ=1 }^{+∞} γ^τ r(s_{t+τ}) - \hat{V}(s_t) \\ & = r(s_t) + \sum\limits_{ τ=0 }^{+∞} γ^{τ+1} r(s_{t+τ+1}) - \hat{V}(s_t) \\ & = r(s_t) + γ \, \underbrace{\sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{(t+1)τ})}_{≝ \; V(s_{t+1})} - \hat{V}(s_t) \\ & = r(s_t) + γ \, \underbrace{V(s_{t+1})}_{≃ \, \hat{V}(s_{t+1})} - \hat{V}(s_t) \end{align*}

And again, by approximating $V(s_{t+1})$ by $\hat{V}(s_{t+1})$, one sets $δ$ to be:

δ ≝ r(s_t) + γ \hat{V}(s_{t+1}) - \hat{V}(s_t)

Therefore, the overall TD-learning rule is, in the discounting case:

\hat{V}(s_t) → \hat{V}(s_t) + ε δ = \hat{V}(s_t) + ε\Big(r(s_t) + γ \hat{V}(s_{t+1}) - \hat{V}(s_t)\Big)

or, with the abuse of notation ($\hat{V}$ is denoted by $V$ as well) made in the problem statement:

V(s_t) → V(s_t) + ε\Big(r(s_t) + γ V(s_{t+1}) - V(s_t)\Big)

2. Models for the value function

In the lecture, we talked about the necessity to introduce models for the value of a state, so that one could properly generalize to new, unseen situations. One very simple model is given by the value function $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$ where $\textbf{u}$ is a vector of stimuli that could either be present (1) or absent (0).

a)

Take the example of two stimuli, $\textbf{u} = (u_1, u_2)$. Let’s assume that the subject (agent) has already learned the value of a state in which the first (resp. second) stimulus is present: $V(\textbf{u} = (1, 0)) ≝ α$ (resp. $V(\textbf{u} = (0, 1)) ≝ β$).

What are the values of the parameters $\textbf{w} = (w_1, w_2)$ that the agent has learnt?

\begin{align*} w_1 & = w_1 × 1 + w_2 × 0 \\ & = \textbf{w} \cdot (1, 0) \\ & = V((1, 0)) \\ & = α \end{align*}

and analogously

\begin{align*} w_2 & = w_1 × 0 + w_2 × 1 \\ & = \textbf{w} \cdot (0, 1) \\ & = V((0, 1)) \\ & = β \end{align*}

Therefore:

\textbf{w} = (α, β)

Assuming that the agent, for the very first time, runs into a state in which both stimuli are present: what is the value of this state?

If both stimuli are present, then $u_1 = u_2 = 1$, and:

V(\textbf{u} = (1, 1)) = \textbf{w} \cdot (1, 1) = w_1 × 1 + w_2 × 1 = α + β

What if we now add some uncertainty: what would be the value of a state where the first stimulus has 50% chance of being present and the second stimulus has 10%?

In this case: $u_1 = 0.5$ and $u_2 = 0.1$, so that:

V(\textbf{u} = (0.5, 0.1)) = \textbf{w} \cdot (0.5, 0.1) = w_1 × 0.5 + w_2 × 0.1 = \frac{α}{2} + \frac{β}{10}

In what situation do you think this sort of a more generalized model that you just came up with would not make much sense?

The model is more generalized in that the domain of $V$ is no longer $\lbrace 0, 1 \rbrace^2$ but now $[0, 1]^2$.

But it remains a linear model, which is a rather significant constraint: the stimuli $u_1$ and $u_2$ constribute independently to the inner estimate of value, and they do so in a linear fashion.

A linear model is not suited for all situations: for example, the mere idea that there is a slight chance of having food might cause a dog to relish the prospect of getting food and make it salivate as much as if there were actually food.

b) Advanced: derive the temporal-difference learning rule for the parameter $\textbf{w}$ that needs to be learned if the value function is $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$. Hint: Start from a loss function - what would be a suitable choice?

Let’s denote by $\textbf{u}_t ≝ (u_{t, 1}, u_{t, 2})$ the stimuli vector at time $t$, and again by $\hat{V}$ the inner estimate of value (as in the first answer).

As for the Rescola-Wagner rule, the squared loss seems to be a suitable choice in this case:

L_t ≝ (V(\textbf{u}_t) - \underbrace{\hat{V}(\textbf{u}_t)}_{≝ \; \textbf{w} \cdot \textbf{u}_t})^2

Then, steps to update $\textbf{w}$ are taken along the opposite of the gradient, i.e. toward a local minimum of the loss (gradient descent algorithm), which leads to the following learning rule:

\textbf{w} → \textbf{w} - ε' \nabla_{\textbf{w}} L_t

where the gradient

\nabla_{\textbf{w}} L_t ≝ - 2 (V(\textbf{u}_t) - \textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t

The problem is that the agent doesn’t know $V$, but, as shown in question 1, we can approximate it as follows:

V(\textbf{u}_t) ≃ r(\textbf{u}_t) + γ \, \underbrace{\hat{V}(\textbf{u}_{t+1})}_{≝ \; \textbf{w} \cdot \textbf{u}_{t+1}}

So that $\nabla_{\textbf{w}} L_t$ becomes:

\nabla_{\textbf{w}} L_t ≃ - 2 \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) \, \textbf{u}_t

By setting $ε ≝ \frac{ε’}{2}$, it follows that:

the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = \textbf{w} \cdot \textbf{u} \;$ is:

\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) \, \textbf{u}_t

Can you derive a learning rule if the value function were given by $V(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is a known (non-linear) function?

Assuming that $f$ is differentiable: the only difference with the previous answer is that now:

L_t ≝ (V(\textbf{u}_t) - f(\textbf{w} \cdot \textbf{u}_t))^2

So that:

\nabla_{\textbf{w}} L_t ≝ - 2 (V(\textbf{u}_t) - \textbf{w} \cdot \textbf{u}_t) f'(\textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t

as result of which:

the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is differentiable, is:

\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) f'(\textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t

Leave a Comment