Lecture 8: Reinforcement learning, Online Learning

Online learning

In many applications, the environment is too complex to be captured by a comprehensive theoretical/statistical model (e.g. iid data).

We therefore need a robust approach that learns as it goes, drawing on experience as more and more aspects of the problem are observed
⟶ this is the goal of online learning

Reference: Cesa-Bianchi and Lugosi, Prediction, Learning, and Games (2006)

Setting

A player iteratively makes decisions based on past observations.

An environment assigns a loss/a gain to each decision and the player suffers the corresponding loss.

Losses are unknown before the player decides and may be adversarially chosen.

At each time step $t = 1, \ldots, T$:

a player chooses an action/decision/prediction $x_t ∈ \underbrace{𝒦}_{\rlap{\text{decision set}}}$
the adversary chooses a loss $l_t(x)$ for each action $x ∈ 𝒦$
the player suffers $l_t(x_t)$ and observes
- either $l_t(x)$ for all $x ∈ 𝒦$: full information
  - ex: predicting electricity consumption (EDF in France, where Pierre did his PhD): one wants to compare different prediction methods:
    - each day, one picks a method; at the end of the day, one compares its performance with what all the other methods would have predicted
- or only $l_t(x_t)$: bandit feedback
  - ex: Ads:
    - bonus point for the advertiser if the user clicks on the ad
    - no point otherwise

The goal of the player is to minimize its cumulative loss:

\[\sum\limits_{ t=1 }^T l_t(x_t)\]

Of course, if the environment chooses large $l_t(x)$ for all $x$, the player will also suffer a huge loss.

Thus we need to choose a relative criterion: the regret.

Regret

Regret: \[R_T(x_1, \ldots, x_T) = \sum\limits_{ t=1 }^T l_t(x_t) - \min_{x ∈ 𝒦} \sum\limits_{ t=1 }^T l_t(x)\]

The goal is to ensure $R_T = o(T)$ for any sequence of losses $l_1, \ldots, l_T$

That is, the player's average loss asymptotically matches the average loss of the best fixed action in hindsight.
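As a small illustration of the definition, the sketch below computes the regret of a played sequence from a table of losses. The function name `regret`, the list-of-lists layout of the losses, and the finite decision set $\lbrace 0, \ldots, K-1 \rbrace$ are assumptions made for this example only.

```python
def regret(losses, actions):
    """Regret of `actions` against the best fixed action in hindsight.

    losses[t][k] is the loss of action k at round t (hypothetical layout);
    actions[t] is the action the player chose at round t.
    """
    T = len(losses)
    K = len(losses[0])
    player_loss = sum(losses[t][actions[t]] for t in range(T))
    best_fixed_loss = min(sum(losses[t][k] for t in range(T)) for k in range(K))
    return player_loss - best_fixed_loss


# Three rounds, two actions: always playing action 0 costs 2.0,
# while the best fixed action (action 1) costs 1.0.
losses = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(regret(losses, [0, 0, 0]))  # 1.0
```

Note that the regret can be negative when the player switches actions at the right moments, since the comparator is restricted to a single fixed action.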

Examples of applications

prediction from expert advice
online shortest path
- there are many ways to go from home to the ENS: we want to compare them
  - day in, day out: take a new path
  - ⟶ goal: matching the average time we would have taken if we had always taken the best path
$K$-armed bandits: online advertisement, medication
portfolio selection:
- you have some money, you want to invest it in stock options (different portfolios): each day, choose one
- ⟶ goal: matching the average gain we would have earned if we had always taken the best stock options

Prediction from expert advice

At each step $t$:

a player chooses $x_t ∈ \lbrace 1, \ldots, K \rbrace$
the environment chooses $l_t ≝ (l_t(1) ⋯ l_t(K)) ∈ [0,1]^K$
the player observes $l_t$

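No deterministic algorithm can ensure

\[R_T(x_1, \ldots, x_T) < T \cdot \frac{K-1}{K} \quad\text{ for all } (l_t)\]

Indeed, since the player is deterministic, the adversary can predict $x_t$, and it suffices to choose

\[\begin{cases} l_t(x_t) = 1 \\ l_t(x) = 0 \text{ for all } x≠x_t \end{cases}\]

The player then suffers a cumulative loss of $T$, while some action was played at most $T/K$ times and thus has a cumulative loss of at most $T/K$, so that $R_T ≥ T \cdot \frac{K-1}{K}$.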
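To make this adversary concrete, here is a minimal simulation. It uses follow-the-leader as one example of a deterministic player; that choice, and the names `follow_the_leader` and `adversarial_regret`, are assumptions for illustration. Any other deterministic rule could be plugged in.

```python
def follow_the_leader(cumulative):
    """Deterministic player: pick the action with the smallest cumulative loss
    (ties broken by the lowest index)."""
    return min(range(len(cumulative)), key=lambda k: cumulative[k])


def adversarial_regret(player, K, T):
    """Play T rounds against the adversary that puts loss 1 on the chosen action."""
    cumulative = [0.0] * K
    player_loss = 0.0
    for _ in range(T):
        x = player(cumulative)   # the adversary can predict this choice
        player_loss += 1.0       # l_t(x_t) = 1
        cumulative[x] += 1.0     # l_t(x) = 0 for every other action
    return player_loss - min(cumulative)


K, T = 4, 100
print(adversarial_regret(follow_the_leader, K, T), T * (K - 1) / K)  # 75.0 75.0
```

Here follow-the-leader cycles through the four actions, so every action ends with cumulative loss $T/K$ and the regret is exactly $T \cdot \frac{K-1}{K}$.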

⟹ We need a random $x_t$:

At time $t$,

the player chooses $p_t ∈ Δ_K ≝ \left\lbrace q ∈ [0, 1]^K \mid \sum\limits_{ k=1 }^K q_k = 1 \right\rbrace$

and samples $x_t \sim p_t$ ($x_t$ according to $p_t$)

$p_t = (p_{1t}, ⋯, p_{Kt}) ∈ Δ_K$
$\sum\limits_{ k } p_{kt} = 1$
the player suffers the expected loss $𝔼(l_t(x_t)) = \sum\limits_{ k=1 }^K p_{kt} l_t(k)$

Question: how to choose $p_t$?

Exponentially weighted average algorithm (EWA or Hedge)

\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]

where $η$ is a learning rate

$η = ∞$ ⟶ follow the leader
$η = 0$ ⟶ $p_t = (\frac 1 K, ⋯, \frac 1 K)$: you don’t learn
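The update rule can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not a reference one: the function names, the list-based loss table, and the subtraction of the minimum cumulative loss (a standard numerical-stability shift that leaves $p_t$ unchanged) are choices made here.

```python
import math


def hedge_weights(cumulative_losses, eta):
    """EWA distribution p_t computed from the cumulative losses of rounds 1..t-1."""
    m = min(cumulative_losses)  # shift for numerical stability; p_t is unchanged
    w = [math.exp(-eta * (c - m)) for c in cumulative_losses]
    total = sum(w)
    return [wk / total for wk in w]


def run_hedge(losses, eta):
    """Return (expected cumulative loss of EWA, cumulative loss of each action)."""
    K = len(losses[0])
    cumulative = [0.0] * K
    expected_loss = 0.0
    for l_t in losses:
        p_t = hedge_weights(cumulative, eta)
        expected_loss += sum(p * l for p, l in zip(p_t, l_t))
        cumulative = [c + l for c, l in zip(cumulative, l_t)]  # full information
    return expected_loss, cumulative


# Toy sequence: actions 0 and 1 alternate between loss 0 and 1, action 2 always loses 1.
T, K = 1000, 3
losses = [[0.0 if k == t % 2 else 1.0 for k in range(K)] for t in range(T)]
eta = math.sqrt(math.log(K) / T)
expected_loss, cumulative = run_hedge(losses, eta)
print(expected_loss - min(cumulative))  # expected regret of EWA on this sequence
```

With `eta = 0` the weights are uniform whatever the past losses, and as `eta` grows the distribution concentrates on the current leader.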

Theorem 1: Let $η>0$, then if $l_t(j)≥0$ for all $j$, the EWA satisfies:
\[\sum\limits_{ t=1 }^T \underbrace{\sum\limits_{ k=1 }^K p_{kt} l_t(k)}_{𝔼_{x_t \sim p_t}(l_t(x_t))} ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + η \underbrace{\sum\limits_{ t=1 }^T\sum\limits_{ j=1 }^K p_{jt} l_t^2(j)}_{≤ T \text{ if } l_t ∈ [0,1]^K}\]

Corollary: if $l_t ∈ [0, 1]^K$ and $η = \sqrt{\frac{\log K}{T}}$, then
\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{T \log K}\]
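The value of $η$ in the corollary can be recovered by minimizing the right-hand side of Theorem 1. The short derivation below is a sketch that only uses the bound $\sum_t \sum_j p_{jt} l_t^2(j) ≤ T$ for losses in $[0,1]$; the notation $B(η)$ is introduced here for convenience.

\[B(η) ≝ \frac{\log K}{η} + ηT, \qquad B'(η) = -\frac{\log K}{η^2} + T = 0 \iff η = \sqrt{\frac{\log K}{T}}\]

\[B\left(\sqrt{\frac{\log K}{T}}\right) = \sqrt{T \log K} + \sqrt{T \log K} = 2 \sqrt{T \log K}\]

Since $B$ is convex on $η > 0$, this stationary point is its minimum.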

Proof: we denote

\[w_{kt} ≝ \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)\]
\[W_t = \sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right)\]

so that $p_{kt} = \frac{w_{kt}}{W_t}$

Then

\[\begin{align*} W_t & = \sum\limits_{ k=1 }^K w_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } w_{kt} = w_{k(t-1)} \exp(-η l_{t-1}(k)) \\ & = W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } p_{k(t-1)} = \frac{w_{k(t-1)}}{W_{t-1}} \\ & ≤ W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} \left(1 - ηl_{t-1}(k) + η^2l^2_{t-1}(k) \right) &&\text{because } \exp(-x) ≤ 1-x+x^2 \text{ for } x ≥ 0 \\ & = W_{t-1} \left(1 - η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) &&\text{because } \sum\limits_{ k=1 }^K p_{k(t-1)} = 1 \\ & ≤ W_{t-1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) &&\text{because } 1+x ≤ \exp(x) \end{align*}\]

Thus:

\[\begin{align*} W_{T+1} & ≤ W_1 \prod\limits_{t=2}^{T+1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) \\ & = K \exp\left(\sum\limits_{t=2}^{T+1} \left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right)\right) \end{align*}\]

$W_1 = K$ and $W_{T+1} = \sum\limits_{ k=1 }^K \exp\left(-η \sum\limits_{ t=1 }^T l_t(k)\right) ≥ \exp\left(-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k)\right)$

Therefore, taking the $\log$ and substituting into the previous inequality:

\[-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k) ≤ \log(K) - η\sum\limits_{t=2}^{T+1} \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{t=2}^{T+1} \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\]

Reordering, reindexing the sums over $t = 1, \ldots, T$ and dividing by $η > 0$ concludes the proof.

NB:

up to constant factors, the bound $O(\sqrt{T \log K})$ is optimal
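since the algorithm is invariant to translations of the losses, the result extends to any bounded losses $l_t ∈ [a, b]^K$ with $a ≤ b ∈ ℝ$ (after rescaling, the bound is multiplied by $b - a$)
calibration of $η$: the choice $η = \sqrt{\frac{\log K}{T}}$ depends on the horizon $T$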

Sometimes, we don’t know $T$ in advance and want an any-time algorithm.

⟹ doubling trick: at time $t=2^i$, restart the algorithm with $η = \sqrt{\frac{\log K}{2^i}}$

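Since the best fixed action over the whole horizon does no better than the best action of each period, the regret is at most the sum of the regrets over the periods (writing $R(x_{2^i}, ⋯, x_{2^{i+1}-1})$ for the regret over period $i$):

\[\begin{align*} 𝔼(R_T(x_1, ⋯, x_T)) & ≤ \sum\limits_{ i=0 }^{\lfloor \log_2 T \rfloor} 𝔼(R(x_{2^i}, ⋯, x_{2^{i+1}-1})) \\ & ≤ \sum\limits_{ i=0 }^{\lfloor \log_2 T \rfloor} 2 \sqrt{2^i \log K} \\ & ≤ \frac{2 \sqrt 2}{\sqrt 2 - 1} \sqrt{T \log K} \end{align*}\]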

⟶ A better solution is to use a time varying parameter $η_t = \sqrt{\frac{\log K}{t}}$
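The doubling trick can be sketched as follows. This is a minimal illustration under assumptions made here: the function name `hedge_doubling`, the list-based loss table, and the power-of-two test `t & (t - 1) == 0` are choices for this example, and the function returns the expected cumulative loss rather than sampling actions.

```python
import math


def hedge_doubling(losses):
    """EWA with the doubling trick: restart at t = 2^i with eta = sqrt(log K / 2^i)."""
    K = len(losses[0])
    expected_loss = 0.0
    cumulative = [0.0] * K
    eta = math.sqrt(math.log(K))
    for t, l_t in enumerate(losses, start=1):
        if t & (t - 1) == 0:  # t is a power of two: forget the past, retune eta
            cumulative = [0.0] * K
            eta = math.sqrt(math.log(K) / t)
        m = min(cumulative)  # shift for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cumulative]
        total = sum(w)
        expected_loss += sum(wk / total * l for wk, l in zip(w, l_t))
        cumulative = [c + l for c, l in zip(cumulative, l_t)]
    return expected_loss


# The horizon T is never passed to the algorithm.
T, K = 1000, 3
losses = [[0.0 if k == t % 2 else 1.0 for k in range(K)] for t in range(T)]
print(hedge_doubling(losses))
```

Restarting discards everything learned in earlier periods, which is one reason a time-varying $η_t$ is usually preferred.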

Bandit feedback

The player observes his/her loss $l_t(x_t)$ and not $l_t(x)$ for all $x≠x_t$.

Trade-off to make between exploration and exploitation.

To address this trade-off, there are 3 main methods for iid losses:

Upper Confidence Bound (UCB):

Assign confidence intervals to all expected losses $𝔼(l_t(k))$ and choose the action with the lowest lower confidence bound (⟺ the highest upper confidence bound for gains)
$ε$-greedy strategy:

Explore with probability $ε$, otherwise exploit the best action so far
Thompson sampling

For iid losses:

\[𝔼(R_T(x_1, ⋯, x_T)) ≤ C × \frac{K \log T}{Δ}\]

where $C$ is a constant and $Δ$ is the gap between the best expected loss and the second best.
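As an illustration of the exploration/exploitation trade-off, here is a minimal sketch of the $ε$-greedy strategy on iid Bernoulli losses. The function name, the Bernoulli loss model, the initial pass that tries every action once, and the fixed seed are all assumptions made for this example.

```python
import random


def epsilon_greedy(means, T, epsilon, seed=0):
    """Play T rounds on Bernoulli losses with the given means; return pull counts."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    loss_sums = [0.0] * K
    for _ in range(T):
        untried = [k for k in range(K) if counts[k] == 0]
        if untried:
            x = untried[0]            # try every action once first
        elif rng.random() < epsilon:
            x = rng.randrange(K)      # explore: uniform random action
        else:                         # exploit: lowest empirical mean loss
            x = min(range(K), key=lambda k: loss_sums[k] / counts[k])
        loss = 1.0 if rng.random() < means[x] else 0.0  # only this loss is observed
        counts[x] += 1
        loss_sums[x] += loss
    return counts


print(epsilon_greedy([0.2, 0.5, 0.8], T=2000, epsilon=0.1))
```

With a constant $ε$ the player keeps exploring at a fixed rate forever, so this sketch is only meant to show the mechanism, not to achieve the logarithmic bound stated above.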

Here, we want to deal with adversarial losses.

Exp3

\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} \overbrace{ l_s(k)}^{\rlap{\text{not possible because not observed}}}\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]

We will try to estimate the $l_s(k), \; s = 1, ⋯, t-1$

Idea: replace $l_s(k)$ by an unbiased estimator:
\[\tilde{l}_s(k) ≝ \frac{l_s(k)}{p_{ks}} 1_{k = x_s}\]

Indeed:

\[𝔼_{x_s \sim p_s}(\tilde{l}_s(k)) = \sum\limits_{ j=1 }^K p_{js} \frac{l_s(k)}{p_{ks}} 1_{k = j} = l_s(k)\]
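The unbiasedness of the estimator can be checked exactly on a small example by summing over all possible draws. The function names and the numerical values below are hypothetical and only serve the illustration.

```python
def importance_weighted_loss(l_s, p_s, x_s):
    """Estimated loss vector: l_s(k) / p_s(k) on the played action, 0 elsewhere."""
    return [l_s[k] / p_s[k] if k == x_s else 0.0 for k in range(len(l_s))]


def expected_estimate(l_s, p_s):
    """Exact expectation of the estimator over x_s ~ p_s (sum over all outcomes)."""
    K = len(l_s)
    mean = [0.0] * K
    for j in range(K):  # the outcome x_s = j has probability p_s[j]
        estimate = importance_weighted_loss(l_s, p_s, j)
        mean = [m + p_s[j] * e for m, e in zip(mean, estimate)]
    return mean


l_s = [0.3, 0.9, 0.5]
p_s = [0.5, 0.25, 0.25]
print(expected_estimate(l_s, p_s))  # matches l_s up to floating-point rounding
```

The estimate of an action with small probability is large when that action is played, which is why the variance term matters in the analysis.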

Theorem: EWA applied with loss vectors $\tilde{l}_s ≝ \left(\frac{l_s(k)}{p_{ks}} 1_{k = x_s}\right)_{1 ≤ k ≤ K}$

gives
\[𝔼\left(\sum\limits_{ t=1 }^T l_t(x_t)\right) ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + K T η\]
if $η>0$ and the losses $l_t(k) ∈ [0, 1] \quad ∀k, t$

The choice $η ≝ \sqrt{\frac{\log K}{KT}}$ yields:

\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{TK \log K}\]
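Putting the pieces together, Exp3 can be sketched as EWA run on the estimated losses. The code below is an illustrative implementation under assumptions made here (function name, list-based loss table, fixed seed, and the stability shift by the minimum); the player only ever reads the loss of the action it played.

```python
import math
import random


def exp3(losses, eta, seed=0):
    """Run Exp3 on a loss table; only losses[t][x_t] is ever read by the player."""
    rng = random.Random(seed)
    K = len(losses[0])
    estimated_cumulative = [0.0] * K  # running sums of the estimated losses
    total_loss = 0.0
    for l_t in losses:
        m = min(estimated_cumulative)  # shift for numerical stability
        w = [math.exp(-eta * (c - m)) for c in estimated_cumulative]
        total = sum(w)
        p_t = [wk / total for wk in w]
        x_t = rng.choices(range(K), weights=p_t)[0]  # sample x_t ~ p_t
        observed = l_t[x_t]                          # bandit feedback
        total_loss += observed
        estimated_cumulative[x_t] += observed / p_t[x_t]  # importance weighting
    return total_loss


T, K = 2000, 3
losses = [[0.2, 0.8, 0.8] for _ in range(T)]
eta = math.sqrt(math.log(K) / (K * T))
print(exp3(losses, eta))  # cumulative loss actually suffered by the player
```

The result depends on the seed, since the actions are sampled; the theorem only controls the expectation.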

Proof sketch:

We apply Theorem 1 with $\tilde{l}_t(k)$ instead of $l_t(k)$:

$\sum\limits_{ t, k } p_{kt} \tilde{l}_t(k) ≤ \sum\limits_{ t } \tilde{l}_t(i^\ast) + \frac{\log K}{η} + η \sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k) \quad ⊛$, where $i^\ast$ is the best action. We then take expectations, using $\sum\limits_{ k } p_{kt} \tilde{l}_t(k) = l_t(x_t)$, $𝔼(\tilde{l}_t(k)) = l_t(k)$ and
\[𝔼\left(\sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k)\right) = 𝔼\left(𝔼_{x_t \sim p_t}\left(\sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k)\right)\right) = 𝔼\left(\sum\limits_{ t, k } l_t^2(k)\right) ≤ KT\]
because
\[𝔼_{x_t \sim p_t}\left(\tilde{l}_t^2(k)\right) = \sum\limits_{ j=1 }^K p_{jt} \frac{l^2_t(k)}{p_{kt}^2} 1_{k=j} = \frac{l^2_t(k)}{p_{kt}}\]
Substituting into $⊛$ concludes the proof.

NB: In practice:

$η_t = \sqrt{\frac{\log K}{\widehat{V}_t}}$
$η_t ∈ \operatorname{argmin}_{η ∈ \text{grid}} \sum\limits_{ t=1 }^T l_t(x_t^n)$
