Lecture 8: Reinforcement learning, Online Learning

Online learning

In many applications, the environment is too complex to be captured by a comprehensive theoretical/statistical model (e.g. iid data).

  • hence the need for a robust approach that learns as it goes, from experience, as more and more aspects of the problem are observed
  • this is the goal of online learning

Reference: Prediction, Learning, and Games (Cesa-Bianchi and Lugosi, 2006)


A player iteratively makes decisions based on past observations.

An environment assigns a loss (or a gain) to each decision, and the player suffers the corresponding loss.

Losses are unknown before the player decides and may be adversarially chosen.

At each time step $t = 1, \ldots, T$:

  • a player chooses an action/decision/prediction $x_t ∈ \underbrace{𝒦}_{\rlap{\text{decision set}}}$
  • the adversary chooses a loss $l_t(x)$ for each action $x ∈ 𝒦$
  • the player suffers $l_t(x_t)$ and observes

    • either all $l_t(x)$ for all $x ∈ 𝒦$: full information

      • ex: predicting electricity consumption (EDF in France, where Pierre did his PhD): one wants to compare different prediction methods:
        • each day, one picks one method; at the end of the day, one compares its performance to what all the other methods would have predicted
    • or only $l_t(x_t)$: bandit feedback

      • ex: Ads:
        • bonus point for the advertiser if the user clicks on the ad
        • no point otherwise

The goal of the player is to minimize its cumulative loss:

\[\sum\limits_{ t=1 }^T l_t(x_t)\]

Of course, if the environment chooses large $l_t(x)$ for all $x$, the player will also suffer a huge loss.

Thus we need to choose a relative criterion: the regret.


\[R_T(x_1, \ldots, x_T) = \sum\limits_{ t=1 }^T l_t(x_t) - \min_{x ∈ 𝒦} \sum\limits_{ t=1 }^T l_t(x)\]

The goal is to ensure \(R_T = o(T)\) for any sequence of losses $l_1, \ldots, l_T$

That is, the player's average loss asymptotically matches the average loss of the best fixed action.
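
As an illustration of the definition, here is a minimal Python sketch that computes the regret of a played sequence on a finite decision set. The function name `regret` and the list-of-lists encoding of the losses are choices made for this example, not notation from the lecture.

```python
def regret(losses, actions):
    """Regret of a played sequence against the best fixed action.

    losses[t][k] stands for l_t(k); actions[t] stands for x_t (0-based).
    """
    T = len(losses)
    K = len(losses[0])
    # Cumulative loss actually suffered by the player.
    player_loss = sum(losses[t][actions[t]] for t in range(T))
    # Cumulative loss of the best fixed action in hindsight.
    best_fixed_loss = min(sum(losses[t][k] for t in range(T)) for k in range(K))
    return player_loss - best_fixed_loss


# Toy instance with K = 2 actions and T = 3 rounds.
toy_losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(regret(toy_losses, [1, 0, 1]))  # the player picks the worse action every round
print(regret(toy_losses, [0, 0, 0]))  # the player plays the best fixed action
```

Note that the regret can be negative: a player who switches actions at the right times can beat every fixed action.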

Examples of applications

  • prediction from expert advice
  • online shortest path

    • many ways to go from home to the ENS: we want to compare them
      • each day: take a (possibly new) path
      • ⟶ goal: reaching the average time we would have taken if we had always taken the best path
  • $K$-armed bandits: online advertisement, medication
  • portfolio selection:
    • you have some money, you want to invest it in stock options (different portfolios): each day, choose one
    • ⟶ goal: reaching the average gain we would have earned if we had always taken the best stock options

Prediction from expert advice

At each step $t$:

  • a player chooses $x_t ∈ \lbrace 1, \ldots, K \rbrace$
  • the environment chooses $l_t ≝ (l_t(1) ⋯ l_t(K)) ∈ [0,1]^K$
  • the player observes $l_t$

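No deterministic algorithm can ensure sublinear regret: against any deterministic player, the environment can force

\[R_T(x_1, \ldots, x_T) ≥ T \cdot \frac{K-1}{K}\]

Indeed, since $x_t$ is then a deterministic function of the past, the environment can predict it, and it suffices to choose

\[\begin{cases} l_t(x_t) = 1 \\ l_t(x) = 0 \text{ for all } x≠x_t \end{cases}\]

The player then suffers a cumulative loss of $T$, while the least-played action is chosen at most $T/K$ times, so the best fixed action has cumulative loss at most $T/K$.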
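
The adversarial construction can be simulated. The sketch below assumes a particular deterministic player, follow-the-leader with ties broken by lowest index, and a hypothetical interface where the player only sees the vector of cumulative losses; any other deterministic rule could be plugged in.

```python
def follow_the_leader(cumulative):
    """A deterministic player: smallest cumulative loss, ties -> lowest index."""
    return min(range(len(cumulative)), key=lambda k: cumulative[k])


def worst_case_regret(player, K, T):
    """Regret of a deterministic player against the adversary that puts
    loss 1 on the action it will play and loss 0 on all other actions."""
    cumulative = [0.0] * K
    player_loss = 0.0
    for _ in range(T):
        x = player(cumulative)  # predictable, because the player is deterministic
        cumulative[x] += 1.0    # l_t(x_t) = 1, l_t(x) = 0 otherwise
        player_loss += 1.0
    return player_loss - min(cumulative)


K, T = 4, 100
print(worst_case_regret(follow_the_leader, K, T), T * (K - 1) / K)
```

The regret is at least $T(K-1)/K$ whatever the deterministic rule, since the adversary only needs to know which action is about to be played.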

⟹ We need a random $x_t$:

At time $t$,

  • the player chooses \(p_t ∈ Δ_K ≝ \left\lbrace q ∈ [0, 1]^K \mid \sum\limits_{ k=1 }^K q_k = 1 \right\rbrace\)

    and samples $x_t \sim p_t$ ($x_t$ according to $p_t$)

  • $p_t = (p_{1t}, ⋯, p_{Kt}) ∈ Δ_K$
  • $\sum\limits_{ k } p_{kt} = 1$
  • the player suffers the expected loss \(𝔼(l_t(x_t)) = \sum\limits_{ k=1 }^K p_{kt} l_t(k)\)

Question: how to choose $p_t$?

Exponentially weighted average algorithm (EWA or Hedge)

\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]

where $η$ is a learning rate

  • $η = ∞$ ⟶ follow the leader
  • $η = 0$ ⟶ $p_t = (\frac 1 K, ⋯, \frac 1 K)$: you don’t learn
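
The weights and the two extreme regimes of $η$ can be sketched in a few lines of Python. The function name `ewa_probabilities` is illustrative; subtracting the minimum cumulative loss before exponentiating is a numerical-stability choice of this sketch that leaves the ratio unchanged.

```python
import math


def ewa_probabilities(cumulative_losses, eta):
    """EWA distribution p_t from the cumulative losses sum_{s<t} l_s(k)."""
    # Shifting all cumulative losses by the same constant does not change p_t.
    shift = min(cumulative_losses)
    weights = [math.exp(-eta * (L - shift)) for L in cumulative_losses]
    total = sum(weights)
    return [w / total for w in weights]


cumulative = [3.0, 1.0, 2.0]
print(ewa_probabilities(cumulative, 0.0))     # eta = 0: uniform, no learning
print(ewa_probabilities(cumulative, 1.0))     # intermediate eta
print(ewa_probabilities(cumulative, 1000.0))  # large eta: mass on the leader
```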

Theorem 1: Let $η>0$, then if $l_t(j)≥0$ for all $t$ and $j$, the EWA satisfies:

\[\sum\limits_{ t=1 }^T \sum\limits_{ k=1 }^K \underbrace{p_{kt} l_t(k)}_{𝔼_{x_t \sim p_t}(l_t(x_t))} ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + η \underbrace{\sum\limits_{ t=1 }^T \sum\limits_{ j=1 }^K p_{jt} l_t^2(j)}_{≤ T \text{ if } l_t ∈ [0,1]^K}\]

Corollary: if $l_t ∈ [0, 1]^K$ and $η = \sqrt{\frac{\log K}{T}}$, then

\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{T \log K}\]
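
The constant can be checked directly. As a worked step under the assumptions of the corollary, bounding the last sum of Theorem 1 by $T$ and plugging in $η = \sqrt{\frac{\log K}{T}}$ gives

\[\frac{\log K}{η} + η T = \sqrt{T \log K} + \sqrt{T \log K} = 2 \sqrt{T \log K}\]

This value of $η$ is the minimizer of $η \mapsto \frac{\log K}{η} + η T$: it is the point where the two terms are equal.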

Proof: we denote

  • \[w_{kt} ≝ \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(k)\right)\]
  • \[W_t = \sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right)\]

so that \(p_{kt} = \frac{w_{kt}}{W_t}\)


\[\begin{align*} W_t & = \sum\limits_{ k=1 }^K w_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } w_{kt} = w_{k(t-1)} \exp(-η l_{t-1}(k)) \\ & = W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} \exp(-η l_{t-1}(k)) &&\text{because } p_{k(t-1)} = \frac{w_{k(t-1)}}{W_{t-1}} \\ & ≤ W_{t-1} \sum\limits_{ k=1 }^K p_{k(t-1)} (1 - ηl_{t-1}(k) + η^2l^2_{t-1}(k) ) &&\text{because } \exp(-x) ≤ 1-x+x^2 \text{ for } x ≥ 0 \\ & = W_{t-1} \left(1 - η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) &&\text{because } \sum\limits_{ k=1 }^K p_{k(t-1)} = 1 \\ & ≤ W_{t-1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) &&\text{because } 1+x ≤ \exp(x) \end{align*}\]


\[\begin{align*} W_{T+1} & ≤ W_1 \prod\limits_{t=2}^{T+1} \exp\left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right) \\ & = K \exp\left(\sum\limits_{t=2}^{T+1} \left(-η \sum\limits_{ k=1 }^K p_{k(t-1)} l_{t-1}(k) + η^2 \sum\limits_{ k=1 }^K p_{k(t-1)} l^2_{t-1}(k)\right)\right) &&\text{because } W_1 = K \end{align*}\]

$W_1 = K$ and \(W_{T+1} = \sum\limits_{ k=1 }^K \exp\left(-η \sum\limits_{ t=1 }^T l_t(k)\right) ≥ \exp\left(-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k)\right)\)

Therefore, taking the $\log$ and substituting into the previous inequality (after reindexing the sums over $t$):

\[-η \min_{1 ≤ k≤K}\sum\limits_{ t=1 }^T l_t(k) ≤ \log K - η\sum\limits_{t=1}^{T} \sum\limits_{ k=1 }^K p_{kt} l_t(k) + η^2 \sum\limits_{t=1}^{T} \sum\limits_{ k=1 }^K p_{kt} l^2_t(k)\]

Reordering and dividing by $η > 0$ concludes the proof.
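
Theorem 1 can be checked numerically. The sketch below runs EWA on losses drawn uniformly in $[0, 1]$ (an arbitrary choice for the experiment; the theorem holds for any nonnegative sequence) and compares both sides of the inequality. The helper name `run_ewa` is an assumption of this example.

```python
import math
import random


def run_ewa(losses, eta):
    """Run EWA with full information.

    Returns (expected cumulative loss of the player,
             sum over t and k of p_kt * l_t(k)^2,
             cumulative loss of the best fixed action).
    """
    K = len(losses[0])
    cumulative = [0.0] * K
    expected_loss = 0.0
    second_moment = 0.0
    for l in losses:
        shift = min(cumulative)  # numerical stability only
        w = [math.exp(-eta * (L - shift)) for L in cumulative]
        W = sum(w)
        p = [wk / W for wk in w]
        expected_loss += sum(pk * lk for pk, lk in zip(p, l))
        second_moment += sum(pk * lk * lk for pk, lk in zip(p, l))
        cumulative = [L + lk for L, lk in zip(cumulative, l)]
    return expected_loss, second_moment, min(cumulative)


rng = random.Random(0)
T, K = 500, 5
losses = [[rng.random() for _ in range(K)] for _ in range(T)]
eta = math.sqrt(math.log(K) / T)
expected_loss, second_moment, best = run_ewa(losses, eta)
theorem_bound = best + math.log(K) / eta + eta * second_moment
print(expected_loss, theorem_bound, 2 * math.sqrt(T * math.log(K)))
```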


  1. up to constant factors, the bound $O(\sqrt{T \log K})$ is optimal
  2. since the algorithm is translation-invariant, the result extends to any bounded loss \(l_t ∈ [a, b]^K \quad \text{ for any } a ≤ b∈ ℝ\) (the bound then scales with $b - a$)
  3. the calibration $η = \sqrt{\frac{\log K}{T}}$ requires knowing the horizon $T$ in advance

Sometimes, we don’t know $T$ in advance and want an any-time algorithm.

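Doubling trick: at each time $t=2^i$, restart the algorithm with $η = \sqrt{\frac{\log K}{2^i}}$, tuned for the next $2^i$ rounds. Since the regret over $[1, T]$ is at most the sum of the regrets over the periods:

\[\begin{align*} 𝔼(R_T(x_1, ⋯, x_T)) & ≤ \sum\limits_{ i=0 }^{\lfloor \log_2 T \rfloor} 𝔼(R_{2^i}(x_{2^i}, ⋯, x_{2^{i+1}-1})) \\ & ≤ \sum\limits_{ i=0 }^{\lfloor \log_2 T \rfloor} 2 \sqrt{2^i \log K} \\ & ≤ C × \sqrt{T \log K} \end{align*}\]

where $R_{2^i}$ denotes the regret over the period starting at time $2^i$ and $C$ is a constant.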
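
The constant in the last line can be made explicit with a geometric sum. A sketch, writing $m = \lfloor \log_2 T \rfloor$ (notation introduced here), so that $2^m ≤ T$:

\[\sum\limits_{ i=0 }^{m} 2 \sqrt{2^i \log K} = 2 \sqrt{\log K} \cdot \frac{2^{(m+1)/2} - 1}{\sqrt{2} - 1} ≤ \frac{2\sqrt{2}}{\sqrt{2} - 1} \sqrt{T \log K} ≈ 6.83 \sqrt{T \log K}\]

So not knowing $T$ costs a constant factor of about $3.4$ compared with the bound $2\sqrt{T \log K}$ of the corollary.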

⟶ A better solution is to use a time-varying parameter $η_t = \sqrt{\frac{\log K}{t}}$
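
A minimal sketch of the time-varying variant, under the assumption that $p_t$ is recomputed at every round from the cumulative losses with the current $η_t$ (one common way to implement it; the lecture does not spell this out). The function name `ewa_anytime` is illustrative.

```python
import math
import random


def ewa_anytime(losses):
    """EWA with eta_t = sqrt(log K / t); no knowledge of the horizon T.

    Returns the expected regret against the best fixed action.
    """
    K = len(losses[0])
    cumulative = [0.0] * K
    expected_loss = 0.0
    for t, l in enumerate(losses, start=1):
        eta_t = math.sqrt(math.log(K) / t)
        shift = min(cumulative)  # numerical stability only
        w = [math.exp(-eta_t * (L - shift)) for L in cumulative]
        W = sum(w)
        p = [wk / W for wk in w]
        expected_loss += sum(pk * lk for pk, lk in zip(p, l))
        cumulative = [L + lk for L, lk in zip(cumulative, l)]
    return expected_loss - min(cumulative)


rng = random.Random(1)
T, K = 1000, 5
losses = [[rng.random() for _ in range(K)] for _ in range(T)]
print(ewa_anytime(losses), math.sqrt(T * math.log(K)))
```

Unlike the doubling trick, nothing is ever restarted, so the information gathered in earlier rounds is never thrown away.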

Bandit feedback

The player observes his/her loss $l_t(x_t)$ and not $l_t(x)$ for all $x≠x_t$.

Trade-off to make between exploration and exploitation.

To address this trade-off, there are 3 main methods for iid losses:

  1. Upper Confidence Bound (UCB):

    Assign confidence intervals to all expected losses $𝔼(l_t(k))$ and choose the action with the lowest lower confidence bound (⟺ highest upper confidence bound for gains)

  2. $ε$-greedy strategy:

    Explore with probability $ε$ or otherwise exploit the best action so far

  3. Thompson sampling
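
Of the three methods, the $ε$-greedy strategy is the simplest to write down. The sketch below assumes Bernoulli losses with means `mean_losses` and tries each action once before starting; both are choices of this example, as are the function and variable names.

```python
import random


def epsilon_greedy(mean_losses, T, epsilon, rng):
    """Epsilon-greedy on iid Bernoulli losses with bandit feedback.

    Returns (number of pulls per action, cumulative loss suffered).
    """
    K = len(mean_losses)
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for _ in range(T):
        untried = [k for k in range(K) if counts[k] == 0]
        if untried:
            x = untried[0]                 # try every action once
        elif rng.random() < epsilon:
            x = rng.randrange(K)           # explore uniformly at random
        else:
            # exploit: lowest empirical mean loss so far
            x = min(range(K), key=lambda k: sums[k] / counts[k])
        loss = 1.0 if rng.random() < mean_losses[x] else 0.0
        counts[x] += 1                     # only l_t(x_t) is observed
        sums[x] += loss
        total += loss
    return counts, total


counts, total = epsilon_greedy([0.6, 0.2, 0.5], 2000, 0.1, random.Random(0))
print(counts, total)
```

With a constant $ε$, a fraction $ε$ of the rounds is always spent exploring, so the regret of this version grows linearly in $T$; it is meant as an illustration of the trade-off, not as a tuned algorithm.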

For iid losses:

\[𝔼(R_T(x_1, ⋯, x_T)) ≤ C × \frac{K \log T}{Δ}\]

where $C$ is a constant and $Δ$ is the gap between the best expected loss and the second best.

Here, we want to deal with adversarial losses.

Exp3

\[p_{kt} = \frac{\exp\left(-η \sum\limits_{ s=1 }^{t-1} \overbrace{ l_s(k)}^{\rlap{\text{not possible because not observed}}}\right)}{\sum\limits_{ j=1 }^K \exp\left(-η \sum\limits_{ s=1 }^{t-1} l_s(j)\right) }\]

We will try to estimate the $l_s(k), \; s = 1, ⋯, t-1$

Idea: replace $l_s(k)$ by an unbiased estimator:

\[\tilde{l}_s(k) ≝ \frac{l_s(k)}{p_{ks}} 1_{k = x_s}\]


\[𝔼_{x_t \sim p_t}(\tilde{l}_t(k)) = \sum\limits_{ j=1 }^K p_{jt} \frac{l_t(k)}{p_{kt}} 1_{k = j} = l_t(k)\]
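
The unbiasedness can be verified exactly on a small example by enumerating the possible values of $x_t$. A minimal sketch, with an illustrative function name and arbitrary numbers for $p_t$ and $l_t$:

```python
def expected_estimate(p, loss, k):
    """Exact E_{x ~ p}[ loss[k] / p[k] * 1{k == x} ], summing over all x."""
    return sum(
        p[x] * (loss[k] / p[k] if x == k else 0.0)
        for x in range(len(p))
    )


p = [0.5, 0.3, 0.2]      # sampling distribution p_t (all entries positive)
loss = [0.9, 0.4, 0.1]   # true losses l_t(k), mostly unobserved by the player
for k in range(len(p)):
    print(k, expected_estimate(p, loss, k), loss[k])
```

The estimator needs $p_{kt} > 0$ for every $k$, which always holds for EWA weights. The price is the variance: the estimate is large when a rarely played action is drawn.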

Theorem: EWA applied with the estimated loss vectors \(\tilde{l}_s ≝ \left(\frac{l_s(k)}{p_{ks}} 1_{k = x_s}\right)_{1 ≤ k ≤ K}\) satisfies


\[𝔼\left(\sum\limits_{ t=1 }^T l_t(x_t)\right) ≤ \min_{1≤k≤K} \sum\limits_{ t=1 }^T l_t(k) + \frac{\log K}{η} + K T η\]

if $η>0$ and the losses $l_t(k) ∈ [0, 1] \quad ∀k, t$

The choice $η ≝ \sqrt{\frac{\log K}{KT}}$ yields:

\[𝔼(R_T(x_1, ⋯, x_T)) ≤ 2 \sqrt{TK \log K}\]
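
Putting the pieces together, the algorithm can be sketched as follows. This is a minimal version under the assumptions of the theorem (losses in $[0, 1]$, fixed $η$); the function name `exp3` and the toy loss sequence are choices of this example.

```python
import math
import random


def exp3(losses, eta, rng):
    """EWA run on importance-weighted loss estimates (bandit feedback).

    losses[t][k] stands for l_t(k), but only losses[t][x_t] is ever read.
    Returns the cumulative loss suffered by the player.
    """
    K = len(losses[0])
    estimated_cumulative = [0.0] * K   # sums of the estimates l~_s(k)
    total = 0.0
    for l in losses:
        shift = min(estimated_cumulative)  # numerical stability only
        w = [math.exp(-eta * (L - shift)) for L in estimated_cumulative]
        W = sum(w)
        p = [wk / W for wk in w]
        x = rng.choices(range(K), weights=p)[0]   # sample x_t ~ p_t
        observed = l[x]                           # the only feedback
        total += observed
        # l~_t(k) = l_t(k) / p_kt if k == x_t, and 0 otherwise
        estimated_cumulative[x] += observed / p[x]
    return total


T, K = 2000, 3
toy_losses = [[1.0, 0.0, 1.0]] * T   # action 1 is always the best
eta = math.sqrt(math.log(K) / (K * T))
total = exp3(toy_losses, eta, random.Random(0))
print(total, 2 * math.sqrt(T * K * math.log(K)))
```

On this toy sequence the best fixed action has zero cumulative loss, so the printed total is the realized regret. The theorem bounds its expectation, not each individual run.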

Proof sketch:

  1. We apply Theorem 1 with $\tilde{l}_t(k)$ instead of $l_t(k)$:

    \(\sum\limits_{ t, k } p_{kt} \tilde{l}_t(k) ≤ \sum\limits_{ t } \tilde{l}_t(i^\ast) + \frac{\log K}{η} + η \sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k) \quad ⊛\) for any fixed action $i^\ast$. We then take expectations, using $𝔼(\tilde{l}_t(k)) = l_t(k)$ and, for the last term, the tower rule:

    \[𝔼\left(\sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k)\right) = \sum\limits_{ t, k } 𝔼\left(p_{kt} \, 𝔼_{x_t \sim p_t}\left(\tilde{l}_t^2(k)\right)\right)\]

    Conditionally on $p_t$,

    \[𝔼_{x_t \sim p_t}\left(\tilde{l}_t^2(k)\right) = \sum\limits_{ j=1 }^K p_{jt} \frac{l^2_t(k)}{p_{kt}^2} 1_{k=j} = \frac{l^2_t(k)}{p_{kt}}\]

    so that

    \[𝔼\left(\sum\limits_{ t, k } p_{kt} \tilde{l}_t^2(k)\right) = \sum\limits_{ t, k } l_t^2(k) ≤ KT\]

    Substituting into $⊛$ concludes the proof.

NB: In practice:

  • $η_t = \sqrt{\frac{\log K}{\widehat{V}_t}}$
  • $η_t ∈ \operatorname{argmin}_{η∈ \text{grid}} \sum\limits_{ t=1 }^T l_t(x_t^η)$
