# Exercise Sheet 2: Temporal-Difference learning with discounting

# Exercise Sheet 2: Temporal-Difference learning

### Younesse Kaddar

# 1. Temporal-Difference learning with discounting

In many instances, immediate rewards are worth more than those in the future. To take this observation into account, the value $V(s_t)$ of a particular state $s_t$ is not the sum all of future rewards, but rather the sum of all future *discounted* rewards.

where $0 < γ < 1$. Here, $s_t$ is the state at time $t$, *i.e.* the state in which the agent is right now, $s_{t+1}$ the state that the agent will move to next, and so on.

## Following the derivation in the lecture, show that the temporal-difference-learning rule in this case is given by:

### Reminder: with $γ = 1$

With the notations used in the lecture: in compliance with the $δ$-rule, we came up with the following TD-learning rule with $γ = 1$ (no discounting):

where

- $\hat{V}(s_t)$ is the internal estimate of value (denoted by $V(s_t)$ in the problem statement, by abuse of notation)
- $0 < ε « 1$ is a learning rate
- $δ$ is the prediction error

The most “natural” prediction error is the difference between the actual value the state $V(s_t)$ and the internal estimate thereof:

As explained in class: the problem is that the real value of the state $V(s_t)$ is *not* known, so one is unable to compute such a prediction error! The trick we used to cope with that was to notice that:

so that we could use the current internal estimate $\hat{V}(s_{t+1})$ of $V(s_{t+1})$, as in *dynamic programming*, to approximate:

And we set $δ$ to be this approximate (and now computable!) value:

### For $0 < γ < 1$

In the discounting case, likewise:

And again, by approximating $V(s_{t+1})$ by $\hat{V}(s_{t+1})$, one sets $δ$ to be:

Therefore, the overall TD-learning rule is, in the discounting case:

or, with the abuse of notation ($\hat{V}$ is denoted by $V$ as well) made in the problem statement:

V(s_t) → V(s_t) + ε\Big(r(s_t) + γ V(s_{t+1}) - V(s_t)\Big)

# 2. Models for the value function

In the lecture, we talked about the necessity to introduce models for the value of a state, so that one could properly generalize to new, unseen situations. One very simple model is given by the value function $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$ where $\textbf{u}$ is a vector of stimuli that could either be present (1) or absent (0).

## a)

Take the example of two stimuli, $\textbf{u} = (u_1, u_2)$. Let’s assume that the subject (agent) has already learned the value of a state in which the first (resp. second) stimulus is present: $V(\textbf{u} = (1, 0)) ≝ α$ (resp. $V(\textbf{u} = (0, 1)) ≝ β$).

## What are the values of the parameters $\textbf{w} = (w_1, w_2)$ that the agent has learnt?

and analogously

Therefore:

\textbf{w} = (α, β)

## Assuming that the agent, for the very first time, runs into a state in which both stimuli are present: what is the value of this state?

If both stimuli are present, then $u_1 = u_2 = 1$, and:

## What if we now add some uncertainty: what would be the value of a state where the first stimulus has 50% chance of being present and the second stimulus has 10%?

In this case: $u_1 = 0.5$ and $u_2 = 0.1$, so that:

## In what situation do you think this sort of a more generalized model that you just came up with would not make much sense?

The model is more generalized in that the domain of $V$ is no longer $\lbrace 0, 1 \rbrace^2$ but now $[0, 1]^2$.

But it remains a **linear model**, which is a rather significant constraint: the stimuli $u_1$ and $u_2$ constribute independently to the inner estimate of value, and they do so in a linear fashion.

A linear model is not suited for all situations: for example, the mere idea that there is a slight chance of having food might cause a dog to relish the prospect of getting food and make it salivate as much as if there were actually food.

## b) *Advanced*: derive the temporal-difference learning rule for the parameter $\textbf{w}$ that needs to be learned if the value function is $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$. *Hint*: Start from a loss function - what would be a suitable choice?

Let’s denote by $\textbf{u}_t ≝ (u_{t, 1}, u_{t, 2})$ the stimuli vector at time $t$, and again by $\hat{V}$ the inner estimate of value (as in the first answer).

As for the Rescola-Wagner rule, the squared loss seems to be a suitable choice in this case:

Then, steps to update $\textbf{w}$ are taken along the opposite of the gradient, i.e. toward a local minimum of the loss (gradient descent algorithm), which leads to the following learning rule:

where the gradient

The problem is that the agent doesn’t know $V$, but, as shown in question 1, we can approximate it as follows:

So that $\nabla_{\textbf{w}} L_t$ becomes:

By setting $ε ≝ \frac{ε’}{2}$, it follows that:

the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = \textbf{w} \cdot \textbf{u} \;$ is:

\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) \, \textbf{u}_t

### Can you derive a learning rule if the value function were given by $V(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is a known (non-linear) function?

**Assuming that $f$ is differentiable**: the only difference with the previous answer is that now:

So that:

as result of which:

the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is differentiable, is:

\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) f'(\textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t

## Leave a Comment