Lecture 8: Reinforcement learning, Online Learning
Online learning
In many applications, the environment is too complex to be modelled with a comprehensive theoretical/statistical model (iid data).
- need for a robust approach that learns as one goes along, learning from experience as more and more aspects of the problem are observed
- ⟶ goal of online learning
Reference: Prediction, Learning, and Games, Cesa-Bianchi and Lugosi (2006)
Setting
A player iteratively makes decisions based on past observations.
An environment assigns a loss/a gain to each decision and the player suffers the corresponding loss.
Losses are unknown before the player’s decisions and may be adversarially chosen.
At each time step $t = 1, \ldots, T$:
- a player chooses an action/decision/prediction $x_t ∈ \underbrace{𝒦}_{\rlap{\text{decision set}}}$
- the adversary chooses a loss $l_t(x)$ for each action $x ∈ 𝒦$
- the player suffers $l_t(x_t)$ and observes
    - either all $l_t(x)$ for all $x ∈ 𝒦$: full information
        - ex: trying to predict electricity consumption (EDF in France, Pierre did his PhD there): one wants to compare different methods of prediction:
            - each day, one picks one method, then at the end of the day, one compares the performance of this method to what all the others would have predicted
    - or only $l_t(x_t)$: bandit feedback
        - ex: Ads:
            - bonus point for the advertiser if the user clicks on the ad
            - no point otherwise
The goal of the player is to minimize its cumulative loss:
\[\sum\limits_{ t=1 }^T l_t(x_t)\]Of course, if the environment chooses large $l_t(x)$ for all $x$, the player will also suffer a huge loss.
Thus we need to choose a relative criterion: the regret.
Regret
- Regret:
- \[R_T(x_1, \ldots, x_T) = \sum\limits_{ t=1 }^T l_t(x_t) - \min_{x ∈ 𝒦} \sum\limits_{ t=1 }^T l_t(x)\]
The goal is to ensure \(R_T = o(T)\) for any sequence of losses $l_1, \ldots, l_T$
That is, the average loss of the player should asymptotically reach the average loss of the best fixed action.
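To make the definition concrete, here is a minimal sketch (not from the lecture) that computes the regret of an arbitrary sequence of plays against the best fixed action in hindsight, on made-up uniform losses:

```python
# Minimal sketch: regret of a sequence of plays against the best fixed
# action in hindsight. The loss matrix and the played actions are made up.
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 5
losses = rng.uniform(0.0, 1.0, size=(T, K))   # l_t(k) in [0, 1]
actions = rng.integers(0, K, size=T)          # x_t, here chosen at random

cumulative_loss = losses[np.arange(T), actions].sum()
best_fixed_loss = losses.sum(axis=0).min()    # min_k sum_t l_t(k)
regret = cumulative_loss - best_fixed_loss
print(f"R_T = {regret:.1f}   (we want R_T / T -> 0)")
```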
Examples of applications
- prediction from expert advice
- online shortest path
    - there are many paths to go from home to the ENS: we want to compare them
    - day in, day out: take a new path
    - ⟶ goal: reaching the average time we would have taken if we had always taken the best path
- $K$-armed bandits: online advertisement, medication
- portfolio selection:
    - you have some money, you want to invest it in stock options (different portfolios): each day, choose one
    - ⟶ goal: reaching the average gain we would have earned if we had always taken the best stock options
Prediction from expert advice
At each step $t$:
- a player chooses $x_t ∈ \lbrace 1, \ldots, K \rbrace$
- the environment chooses $l_t ≝ (l_t(1), ⋯, l_t(K)) ∈ [0,1]^K$
- the player observes $l_t$
There exists no deterministic algorithm that ensures
\[R_T(x_1, \ldots, x_T) < T \cdot \frac{K-1}{K} \quad\text{ for all } (l_t)\]Indeed, it suffices for the adversary to choose
\[\begin{cases} l_t(x_t) = 1 \\ l_t(x) = 0 \text{ for all } x≠x_t \end{cases}\]Then the player's cumulative loss is $T$ while the best fixed action suffers at most $T/K$, hence $R_T ≥ T \cdot \frac{K-1}{K}$, which is not $o(T)$. ⟹ We need a random $x_t$:
At time $t$,
- the player chooses \(p_t ∈ Δ_K ≝ \left\lbrace q ∈ [0, 1]^K \mid \sum\limits_{ k=1 }^K q_k = 1 \right\rbrace\) and samples $x_t \sim p_t$ ($x_t$ drawn according to $p_t$)
    - $p_t = (p_{1t}, ⋯, p_{Kt}) ∈ Δ_K$
    - $\sum\limits_{ k } p_{kt} = 1$
- the player suffers $l_t(x_t)$, i.e. in expectation \(𝔼(l_t(x_t)) = \sum\limits_{ k=1 }^K p_{kt} l_t(k)\)
Question: how to choose $p_t$?
Exponentially weighted average algorithm (EWA or Hedge)
\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]where $η$ is a learning rate
- $η = ∞$ ⟶ follow the leader
- $η = 0$ ⟶ $p_t = (\frac 1 K, ⋯, \frac 1 K)$: you don’t learn
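A minimal sketch of EWA, assuming full information, losses in $[0, 1]$ and the tuning $η = \sqrt{\frac{\log K}{T}}$ discussed below (the simulated losses and variable names are mine):

```python
# Minimal sketch of EWA/Hedge, assuming full information and losses in [0, 1].
import numpy as np

def ewa_expected_loss(losses, eta):
    """losses: (T, K) array of l_t(k); returns the expected cumulative loss."""
    T, K = losses.shape
    cum = np.zeros(K)                          # sum_{s < t} l_s(k)
    expected_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        p = w / w.sum()                        # p_t in the simplex
        expected_loss += p @ losses[t]         # E[l_t(x_t)] = sum_k p_kt l_t(k)
        cum += losses[t]
    return expected_loss

rng = np.random.default_rng(0)
T, K = 2000, 10
losses = rng.uniform(size=(T, K))
eta = np.sqrt(np.log(K) / T)
regret = ewa_expected_loss(losses, eta) - losses.sum(axis=0).min()
print(regret, 2 * np.sqrt(T * np.log(K)))      # regret vs. the 2*sqrt(T log K) bound
```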
Theorem 1: Let $η>0$, then if $l_t(j)≥0$ for all $j$, the EWA satisfies:
\[\sum\limits_{ t=1 }^T \sum\limits_{ k=1 }^K \underbrace{p_{kt} l_t(k)}_{𝔼_{x_t \sim p_t}(l_t(x_t))} ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + η \underbrace{\sum\limits_{ t }\sum\limits_{ j=1 }^K p_{jt} l_t^2(j)}_{≤ T \text{ if } l_t ∈ [0,1]^K}\]
Corollary: if $l_t ∈ [0, 1]^K$ and $η = \sqrt{\frac{\log K}{T}}$, then
\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{T \log K}\]
Proof: we denote
- \[w_{kt} ≝ \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)\]
- \[W_t = \sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right)\]
so that \(p_{kt} = \frac{w_{kt}}{W_t}\)
Then
\[\begin{align*} W_t & = \sum\limits_{ k=1 }^K w_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } w_{kt} = w_{k(t-1)} \exp(-η l_{t-1}(k)) \\ & = W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } p_{k(t-1)} = \frac{w_{k(t-1)}}{W_{t-1}} \\ & ≤ W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} (1 - ηl_{t-1}(k) + η^2l^2_{t-1}(k) ) &&\text{because } \exp(-x) ≤ 1-x+x^2 \text{ for } x ≥ 0 \\ & = W_{t-1} \left(1 - η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) \\ & ≤ W_{t-1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) &&\text{because } 1+x ≤ \exp(x) \end{align*}\]Thus:
\[\begin{align*} W_{T+1} & ≤ W_1 \prod\limits_{t=2}^{T+1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) \\ & ≤ K \exp\left(\sum\limits_{t=2}^{T+1} -η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) \\ \end{align*}\]$W_1 = K$ and \(W_{T+1} = \sum\limits_{ k=1 }^K \exp\left(-η \sum\limits_{ t=1 }^T l_t(k)\right) ≥ \exp\left(-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k)\right)\)
Therefore, taking the $\log$ and substituting into the previous inequality:
\[-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k) ≤ \log(K) + \sum\limits_{t=2}^{T+1} \left(- η\sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right)\]Reordering and dividing by $η > 0$ concludes the proof.
NB:
- up to constant factors, the bound $O(\sqrt{T \log K})$ is optimal
- since the algorithm is translation-invariant, the proof is valid for any bounded loss \(l_t ∈ [a, b]^K \quad \text{ for any } a ≤ b∈ ℝ\)
- the calibration $η = \sqrt{\frac{\log K}{T}}$ requires knowing the horizon $T$ in advance
Sometimes, we don’t know $T$ in advance and want an any-time algorithm.
⟹ doubling trick: at time $t = 2^i$, restart the algorithm with $η = \sqrt{\frac{\log K}{2^i}}$. Then
\[\begin{align*} 𝔼(R_T(x_1, ⋯, x_T)) & ≤ \sum\limits_{ i=0 }^{\log_2(T)} 𝔼\big(\text{regret on the block } \lbrace 2^i, ⋯, 2^{i+1}-1 \rbrace\big) \\ & ≤ \sum\limits_{ i=0 }^{\log_2(T)} 2 \sqrt{2^i \log K} \\ & ≤ \text{const} × \sqrt{T \log K} \end{align*}\]
⟶ A better solution is to use a time-varying parameter $η_t = \sqrt{\frac{\log K}{t}}$
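A minimal sketch of this anytime variant, with the same update as EWA above but $η_t = \sqrt{\frac{\log K}{t}}$, so no horizon $T$ needs to be fixed (simulated losses, my own variable names):

```python
# Minimal sketch of anytime EWA with a time-varying learning rate
# eta_t = sqrt(log K / t); no horizon T is fixed in advance.
import numpy as np

rng = np.random.default_rng(0)
K = 10
cum = np.zeros(K)
expected_loss = 0.0

for t in range(1, 5001):                       # stop whenever we like
    eta_t = np.sqrt(np.log(K) / t)
    w = np.exp(-eta_t * (cum - cum.min()))
    p = w / w.sum()
    l_t = rng.uniform(size=K)                  # losses revealed at the end of step t
    expected_loss += p @ l_t
    cum += l_t

print(expected_loss - cum.min())               # regret vs. the best fixed action
```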
Bandit feedback
The player observes his/her loss $l_t(x_t)$ and not $l_t(x)$ for all $x≠x_t$.
There is a trade-off to make between exploration and exploitation.
To address this trade-off, there are 3 main methods for iid losses:
- Upper Confidence Bound (UCB):
  assign confidence intervals to all expected losses $𝔼(l_t(k))$ and choose the action with the lowest lower confidence bound (⟺ highest upper confidence bound for gains)
- $ε$-greedy strategy:
  explore with probability $ε$, otherwise exploit the best action observed so far (see the sketch below)
- Thompson sampling
For iid losses:
\[𝔼(R_T(x_1, ⋯, x_T)) ≤ \text{const} × \frac{K \log T}{Δ}\]where $Δ$ is the gap between the best expected loss and the second best.
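For instance, a minimal sketch of the $ε$-greedy strategy on iid Bernoulli losses (the arm means and $ε$ are made up; only $l_t(x_t)$ is observed):

```python
# Minimal sketch of epsilon-greedy for iid bandit losses
# (Bernoulli losses with made-up means; only l_t(x_t) is observed).
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.4, 0.45, 0.6])        # unknown expected losses
K, T, eps = len(means), 10_000, 0.1
counts, loss_sums = np.zeros(K), np.zeros(K)

for t in range(T):
    if counts.min() == 0:
        k = int(np.argmin(counts))             # pull every arm once first
    elif rng.random() < eps:
        k = int(rng.integers(K))               # explore
    else:
        k = int(np.argmin(loss_sums / counts)) # exploit the best arm so far
    loss = float(rng.random() < means[k])      # observed bandit feedback l_t(k)
    counts[k] += 1
    loss_sums[k] += loss

print(counts)   # most pulls should go to the arm with expected loss 0.4
```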
Here, we want to deal with adversarial losses.
Exp3
\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} \overbrace{ l_s(k)}^{\rlap{\text{not possible because not observed}}}\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]We will try to estimate the $l_s(k), \; s = 1, ⋯, t-1$
Idea: replace $l_s(k)$ by an unbiased estimator:
\[\tilde{l}_s(k) ≝ \frac{l_s(k)}{p_{ks}} 1_{k = x_s}\]
Indeed:
\[𝔼_{x_s \sim p_s}(\tilde{l}_s(k)) = \sum\limits_{ j=1 }^K p_{js} \frac{l_s(k)}{p_{ks}} 1_{k = j} = l_s(k)\]Theorem: EWA applied with loss vectors \(\tilde{l}_s ≝ \left(\frac{l_s(k)}{p_{ks}} 1_{k = x_s}\right)_{1 ≤ k ≤ K}\)
gives
\[𝔼_{x_t \sim p_t}\left(\sum\limits_{ t=1 }^T l_t(x_t)\right) ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + K T η\]if $η>0$ and the losses $l_t(k) ∈ [0, 1] \quad ∀k, t$
The choice $η ≝ \sqrt{\frac{\log K}{KT}}$ yields:
\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{TK \log K}\]Proof sketch:
- We apply Theorem 1 with $\tilde{l}_t(k)$ instead of $l_t(k)$:
  \(\sum\limits_{ t, k } p_{kt} \tilde{l}_t(k) ≤ \sum\limits_{ t } \tilde{l}_t(i^\ast) + \frac{\log K}{η} + η \sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k) \quad ⊛\)
  where $i^\ast$ is the best fixed action, and then we use $𝔼(\tilde{l}_t(k)) = l_t(k)$ and
\[𝔼\left(\sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k)\right) = 𝔼\left(\sum\limits_{ t, k } p_{kt} \frac{l^2_t(k)}{p_{kt}}\right) = 𝔼\left(\sum\limits_{ t, k } l_t^2(k)\right) ≤ KT\]because, by the tower rule (conditioning on $p_t$),
\[𝔼_{x_t \sim p_t}\left(\tilde{l}_t^2(k)\right) = \sum\limits_{ j=1 }^K p_{jt} \frac{l^2_t(k)}{p_{kt}^2} 1_{k=j} = \frac{l^2_t(k)}{p_{kt}}\]Substituting into $⊛$ and taking expectations concludes the proof.
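A minimal sketch of Exp3, i.e. EWA run on the importance-weighted estimates $\tilde{l}_t$, with the tuning $η = \sqrt{\frac{\log K}{KT}}$ (the full loss matrix is only used to simulate the bandit feedback):

```python
# Minimal sketch of Exp3: EWA on the importance-weighted loss estimates
# l~_t(k) = l_t(x_t) / p_{x_t t} * 1{k = x_t}; only l_t(x_t) is observed.
import numpy as np

rng = np.random.default_rng(0)
T, K = 20_000, 10
losses = rng.uniform(size=(T, K))      # full matrix only used to simulate feedback
eta = np.sqrt(np.log(K) / (K * T))

cum_est = np.zeros(K)                  # sum_s l~_s(k)
total_loss = 0.0
for t in range(T):
    w = np.exp(-eta * (cum_est - cum_est.min()))
    p = w / w.sum()
    x = rng.choice(K, p=p)             # sample x_t ~ p_t
    loss = losses[t, x]                # bandit feedback: only l_t(x_t)
    total_loss += loss
    cum_est[x] += loss / p[x]          # unbiased estimate of l_t(x_t)

print(total_loss - losses.sum(axis=0).min())   # compare to 2*sqrt(T K log K)
```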
NB: In practice:
- $η_t = \sqrt{\frac{\log K}{\widehat{V}_t}}$
- $η_t ∈ \mathrm{argmin}_{η ∈ \text{grid}} \sum\limits_{ s=1 }^{t-1} l_s(x_s^η)$: calibrate $η$ online over a grid of values, using the loss the algorithm run with parameter $η$ would have suffered so far