Exercise Sheet 2: Temporal-Difference learning
Younesse Kaddar
1. Temporal-Difference learning with discounting
In many instances, immediate rewards are worth more than those in the future. To take this observation into account, the value $V(s_t)$ of a particular state $s_t$ is not the sum of all future rewards, but rather the sum of all future discounted rewards.
\[V(s_t) ≝ \sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{t+τ})\]where $0 < γ < 1$. Here, $s_t$ is the state at time $t$, i.e. the state in which the agent is right now, $s_{t+1}$ the state that the agent will move to next, and so on.
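As a quick, hedged illustration of this definition (not part of the exercise), here is a minimal Python sketch that truncates the infinite sum to a finite reward sequence; the function name and the value of $γ$ are arbitrary choices:

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated discounted return: sum over tau of gamma^tau * r(s_{t+tau})."""
    return sum(gamma ** tau * r for tau, r in enumerate(rewards))

# A reward of 1 received three steps from now is worth gamma^3 = 0.729 today.
print(discounted_return([0, 0, 0, 1]))  # 0.729
```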
Following the derivation in the lecture, show that the temporal-difference-learning rule in this case is given by:
\[V(s_t) → V(s_t) + ε\Big(r(s_t) + γ V(s_{t+1}) - V(s_t)\Big)\]
Reminder: with $γ = 1$
With the notation used in the lecture, and in line with the $δ$-rule, we came up with the following TD-learning rule for $γ = 1$ (no discounting):
\[\hat{V}(s_t) → \hat{V}(s_t) + ε δ\]where
- $\hat{V}(s_t)$ is the internal estimate of value (denoted by $V(s_t)$ in the problem statement, by abuse of notation)
- $0 < ε ≪ 1$ is a learning rate
- $δ$ is the prediction error
The most “natural” prediction error is the difference between the actual value of the state, $V(s_t)$, and the internal estimate thereof:
\[δ_{natural} ≝ V(s_t) - \hat{V}(s_t)\]As explained in class: the problem is that the real value of the state $V(s_t)$ is not known, so one is unable to compute such a prediction error! The trick we used to cope with that was to notice that:
\[\begin{align*} δ_{natural} & ≝ \overbrace{V(s_t)}^{= \; r(s_t) + V(s_{t+1})} - \hat{V}(s_t) \\ & = r(s_t) + V(s_{t+1}) - \hat{V}(s_t)\\ \end{align*}\]so that we could use the current internal estimate $\hat{V}(s_{t+1})$ of $V(s_{t+1})$, as in dynamic programming, to approximate:
\[δ_{natural} = r(s_t) + V(s_{t+1}) - \hat{V}(s_t) ≃ r(s_t) + \hat{V}(s_{t+1}) - \hat{V}(s_t)\]And we set $δ$ to be this approximate (and now computable!) value:
\[δ ≝ r(s_t) + \hat{V}(s_{t+1}) - \hat{V}(s_t)\]
For $0 < γ < 1$
In the discounting case, likewise:
\[\begin{align*} δ_{natural} & ≝ V(s_t) - \hat{V}(s_t) \\ & = \sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{t+τ}) - \hat{V}(s_t) \\ & = r(s_t) + \sum\limits_{ τ=1 }^{+∞} γ^τ r(s_{t+τ}) - \hat{V}(s_t) \\ & = r(s_t) + \sum\limits_{ τ=0 }^{+∞} γ^{τ+1} r(s_{t+τ+1}) - \hat{V}(s_t) \\ & = r(s_t) + γ \, \underbrace{\sum\limits_{ τ=0 }^{+∞} γ^τ r(s_{t+1+τ})}_{≝ \; V(s_{t+1})} - \hat{V}(s_t) \\ & = r(s_t) + γ \, \underbrace{V(s_{t+1})}_{≃ \, \hat{V}(s_{t+1})} - \hat{V}(s_t) \end{align*}\]And again, by approximating $V(s_{t+1})$ by $\hat{V}(s_{t+1})$, one sets $δ$ to be:
\[δ ≝ r(s_t) + γ \hat{V}(s_{t+1}) - \hat{V}(s_t)\]Therefore, the overall TD-learning rule is, in the discounting case:
\[\hat{V}(s_t) → \hat{V}(s_t) + ε δ = \hat{V}(s_t) + ε\Big(r(s_t) + γ \hat{V}(s_{t+1}) - \hat{V}(s_t)\Big)\]or, with the abuse of notation ($\hat{V}$ is denoted by $V$ as well) made in the problem statement:
\[V(s_t) → V(s_t) + ε\Big(r(s_t) + γ V(s_{t+1}) - V(s_t)\Big)\]
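As a hedged illustration of the rule just derived (not required by the exercise), here is a minimal tabular TD sketch in Python; the state names, rewards, and parameter values are invented for the example, and setting `gamma = 1` recovers the undiscounted rule from the lecture:

```python
from collections import defaultdict

def td_update(V, s_t, r_t, s_next, epsilon=0.1, gamma=0.9):
    """One temporal-difference update of the value estimate V (a dict: state -> value):
    V(s_t) <- V(s_t) + epsilon * (r(s_t) + gamma * V(s_{t+1}) - V(s_t))."""
    delta = r_t + gamma * V[s_next] - V[s_t]   # prediction error
    V[s_t] += epsilon * delta
    return V

# Toy episode: states visited and the reward received in each of them (illustrative values).
V = defaultdict(float)
episode = [("A", 0.0), ("B", 0.0), ("C", 1.0), ("end", 0.0)]
for (s, r), (s_next, _) in zip(episode, episode[1:]):
    td_update(V, s, r, s_next)
print(dict(V))  # only V("C"), the state just before the reward, has moved after one pass
```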
2. Models for the value function
In the lecture, we talked about the need to introduce models for the value of a state, so that one can properly generalize to new, unseen situations. One very simple model is given by the value function $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$, where $\textbf{u}$ is a vector of stimuli that can each be either present (1) or absent (0).
a)
Take the example of two stimuli, $\textbf{u} = (u_1, u_2)$. Let’s assume that the subject (agent) has already learned the value of a state in which the first (resp. second) stimulus is present: $V(\textbf{u} = (1, 0)) ≝ α$ (resp. $V(\textbf{u} = (0, 1)) ≝ β$).
What are the values of the parameters $\textbf{w} = (w_1, w_2)$ that the agent has learnt?
\[\begin{align*} w_1 & = w_1 × 1 + w_2 × 0 \\ & = \textbf{w} \cdot (1, 0) \\ & = V((1, 0)) \\ & = α \end{align*}\]and analogously
\[\begin{align*} w_2 & = w_1 × 0 + w_2 × 1 \\ & = \textbf{w} \cdot (0, 1) \\ & = V((0, 1)) \\ & = β \end{align*}\]Therefore:
\[\textbf{w} = (α, β)\]
Assuming that the agent, for the very first time, runs into a state in which both stimuli are present: what is the value of this state?
If both stimuli are present, then $u_1 = u_2 = 1$, and:
\[V(\textbf{u} = (1, 1)) = \textbf{w} \cdot (1, 1) = w_1 × 1 + w_2 × 1 = α + β\]What if we now add some uncertainty: what would be the value of a state where the first stimulus has a 50% chance of being present and the second a 10% chance?
In this case: $u_1 = 0.5$ and $u_2 = 0.1$, so that:
\[V(\textbf{u} = (0.5, 0.1)) = \textbf{w} \cdot (0.5, 0.1) = w_1 × 0.5 + w_2 × 0.1 = \frac{α}{2} + \frac{β}{10}\]In what situations do you think this more general model that you just came up with would not make much sense?
The model is more general in that the domain of $V$ is no longer $\lbrace 0, 1 \rbrace^2$ but $[0, 1]^2$.
But it remains a linear model, which is a rather significant constraint: the stimuli $u_1$ and $u_2$ contribute independently to the inner estimate of value, and they do so in a linear fashion.
A linear model is not suited to all situations: for example, the mere hint of a slight chance of getting food may make a dog salivate as much as if food were certain, a much stronger response than the proportionally scaled value predicted by the linear model.
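For the linear case itself, here is a minimal numeric sketch of the computations above (the values $α = 2$ and $β = 3$ are arbitrary, chosen only for illustration):

```python
import numpy as np

alpha, beta = 2.0, 3.0          # learned values of the two single-stimulus states (illustrative)
w = np.array([alpha, beta])     # learned parameters w = (alpha, beta)

def V(u):
    """Linear value model V(u) = w . u."""
    return w @ np.asarray(u, dtype=float)

print(V([1, 0]))      # alpha             -> 2.0
print(V([0, 1]))      # beta              -> 3.0
print(V([1, 1]))      # alpha + beta      -> 5.0
print(V([0.5, 0.1]))  # alpha/2 + beta/10 -> 1.3
```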
b) Advanced: derive the temporal-difference learning rule for the parameter $\textbf{w}$ that needs to be learned if the value function is $V(\textbf{u}) = \textbf{w} \cdot \textbf{u}$. Hint: Start from a loss function - what would be a suitable choice?
Let’s denote by $\textbf{u}_t ≝ (u_{t, 1}, u_{t, 2})$ the stimuli vector at time $t$, and again by $\hat{V}$ the inner estimate of value (as in the first answer).
As with the Rescorla-Wagner rule, the squared loss seems to be a suitable choice in this case:
\[L_t ≝ (V(\textbf{u}_t) - \underbrace{\hat{V}(\textbf{u}_t)}_{≝ \; \textbf{w} \cdot \textbf{u}_t})^2\]Then $\textbf{w}$ is updated by taking steps in the direction opposite to the gradient, i.e. toward a local minimum of the loss (gradient descent), which leads to the following learning rule:
\[\textbf{w} → \textbf{w} - ε' \nabla_{\textbf{w}} L_t\]where the gradient
\[\nabla_{\textbf{w}} L_t ≝ - 2 (V(\textbf{u}_t) - \textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t\]The problem is that the agent doesn’t know $V$, but, as shown in question 1, we can approximate it as follows:
\[V(\textbf{u}_t) ≃ r(\textbf{u}_t) + γ \, \underbrace{\hat{V}(\textbf{u}_{t+1})}_{≝ \; \textbf{w} \cdot \textbf{u}_{t+1}}\]So that $\nabla_{\textbf{w}} L_t$ becomes:
\[\nabla_{\textbf{w}} L_t ≃ - 2 \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) \, \textbf{u}_t\]By setting $ε ≝ \frac{ε'}{2}$, it follows that:
the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = \textbf{w} \cdot \textbf{u} \;$ is:
\[\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + \textbf{w} \cdot (γ \textbf{u}_{t+1} - \textbf{u}_t)\big) \, \textbf{u}_t\]
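A minimal, hedged sketch of this update in Python (the stimulus vectors, reward, and parameter values are all invented for illustration):

```python
import numpy as np

def td_update_linear(w, u_t, r_t, u_next, epsilon=0.1, gamma=0.9):
    """TD update for the linear model V(u) = w . u:
    w <- w + epsilon * (r(u_t) + gamma * w.u_{t+1} - w.u_t) * u_t."""
    delta = r_t + gamma * (w @ u_next) - (w @ u_t)   # approximate prediction error
    return w + epsilon * delta * u_t

w = np.zeros(2)
u_t, u_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = td_update_linear(w, u_t, r_t=1.0, u_next=u_next)
print(w)   # only the weight of the stimulus present at time t moves: [0.1, 0.0]
```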
Can you derive a learning rule if the value function were given by $V(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is a known (non-linear) function?
Assuming that $f$ is differentiable, the only difference from the previous answer is that now:
\[L_t ≝ (V(\textbf{u}_t) - f(\textbf{w} \cdot \textbf{u}_t))^2\]So that:
\[\nabla_{\textbf{w}} L_t ≝ - 2 \big(V(\textbf{u}_t) - f(\textbf{w} \cdot \textbf{u}_t)\big) f'(\textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t\]Approximating $V(\textbf{u}_t)$ by $r(\textbf{u}_t) + γ f(\textbf{w} \cdot \textbf{u}_{t+1})$ as before, and again setting $ε ≝ \frac{ε'}{2}$, it follows that:
the temporal-difference learning rule for the parameter $\textbf{w}$ if $\hat{V}(\textbf{u}) = f(\textbf{w} \cdot \textbf{u})$, where $f$ is differentiable, is:
\[\textbf{w} → \textbf{w} + ε \big(r(\textbf{u}_t) + γ f(\textbf{w} \cdot \textbf{u}_{t+1}) - f(\textbf{w} \cdot \textbf{u}_t)\big) f'(\textbf{w} \cdot \textbf{u}_t) \, \textbf{u}_t\]
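A minimal sketch of this nonlinear case, using a sigmoid as a stand-in for the known function $f$ (the choice of $f$, the stimuli, and all numerical values are assumptions made purely for illustration):

```python
import numpy as np

def f(x):        # an example choice of nonlinearity (sigmoid); any known differentiable f works
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):  # derivative of the sigmoid
    s = f(x)
    return s * (1.0 - s)

def td_update_nonlinear(w, u_t, r_t, u_next, epsilon=0.1, gamma=0.9):
    """TD update for V(u) = f(w . u):
    w <- w + epsilon * (r(u_t) + gamma * f(w.u_{t+1}) - f(w.u_t)) * f'(w.u_t) * u_t."""
    delta = r_t + gamma * f(w @ u_next) - f(w @ u_t)   # approximate prediction error
    return w + epsilon * delta * f_prime(w @ u_t) * u_t

w = np.zeros(2)
u_t, u_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(td_update_nonlinear(w, u_t, r_t=1.0, u_next=u_next))  # only the first weight moves
```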