Exercises 5: Learning theory and PAC bounds

EX 1: Hoeffding Inequality

Laplace transform of random var $X$, function $f$ s.t.

f(s) ≝ 𝔼(\exp(sX))

Goal of exercise: Prove the

Hoeffding Inequality:
ℙ(S_n - 𝔼(S_n) ≥ t) ≤ \exp\left(-2 \frac{t^2}{\sum\limits_{ i } (b_i - a_i)^2}\right)


  • $S_n ≝ \sum\limits_{ i } X_i$
  • $X_i$ independent
  • $a_i ≤ X_i ≤ b_i$

NB: For the proof, we need a lemma:

Lemma: if $X$ s.t. $𝔼(X)=0$ and $a ≤ X ≤ b$, then its Laplace transform $f$ is such that f(s) ≤ \exp\left(\frac{s^2 (b-a)^2}{8}\right) \quad ∀s>0


Use Chernoff’s method (Markov inequality + exp) to bound $ℙ(S_n - 𝔼(S_n)≥ t)$ by a quantity that depends on the Laplace transform $f(s)$ of $Z_n ≝ S_n - 𝔼(S_n)$.

Then, optimize the bound over $S$ to get Hoeffding’s result.


Markov inequality for $X ≥ 0$:
ℙ(X ≥ a) ≤ \frac{𝔼(X)}{a}

\begin{align*} ℙ(Z_n ≥ t) &= ℙ(\exp(s Z_n) ≥ \exp(st)) \\ & ≤ \frac{\overbrace{𝔼(\exp(s Z_n))}^{\text{Laplace transform}}}{\exp(st)} \text{ (Markov)}\\ & ≤ \frac{\exp\left(\frac{s^2 (\sum\limits_{ i } b_i - a_i)^2}{8}\right)}{\exp(st)} \\ \end{align*}


Z_n = \sum\limits_{i} X_i - 𝔼(\sum\limits_{ i } X_i \\ = \sum\limits_{i} \underbrace{(X_i - 𝔼(X_i))}_{\text{ noted } Y_i}

and the $Y_i$ are independent due to the $X_i$ being independent.


\begin{align*} f_{Z_n}(s) &= 𝔼(\exp{s Z_n}) \\ &= 𝔼(\exp{s \sum\limits_{ i } Y_i }) \\ &= 𝔼(\prod_i \exp{s Y_i }) \\ &= \prod_i \underbrace{𝔼(\exp{s Y_i })}_{≝ \; f_{Y_i}(s)} &&\text{(independence)} \end{align*}


\begin{align*} ℙ(Z_n ≥ t) &≤ \prod_i \frac{𝔼(\exp{s Y_i })}{\exp(st)} \\ & ≤ \prod_i \frac{\exp\left(\frac{s^2 (b_i - a_i)^2}{8}\right)}{\exp(st)} \\ &=\exp\left(\frac{s^2 \sum\limits_{ i } (b_i - a_i)^2}{8} -st\right) \\ \end{align*}

This bound is true for all $s>0$: to get the tightest bound possible, we minimize over $s$.

Since $\exp$ is non-decreasing, we just need to minimize

φ(s) ≝ \frac{s^2 \sum\limits_{ i } (b_i - a_i)^2}{8} -st

s^\ast ≝ \frac{4t}{\sum\limits_{ i } (b_i - a_i)^2}


ℙ(S_n - 𝔼(S_n) ≥ t) ≤ \exp\left(-2 \frac{t^2}{\sum\limits_{ i } (b_i - a_i)^2}\right)


Chernoff: If $X_i ⇝ Ber(p)$,

ℙ\left(\sum\limits_{ i } X_i - p ≥ t\right) ≤ \exp(-2t^2)

because $0 ≤ X_i ≤ 1$

EX 2: Regularized VS Constrained problems

Consider the regression problem:

f(x) = w^T x

over the sample $(x_1, y_1), ⋯, (x_n, y_n)$

To avoid overfitting, we can

  • either regularize with the norm of $w$ (Ridge regression): \min_w \underbrace{l}_{\text{loss}}(w^T x - y) + λ \Vert w \Vert_2^2 \qquad (ℛ_λ)

  • or contrain the norm $\Vert w \Vert_2^2$: \min_w l(w^T x - y) \quad \text{ s.t. } \Vert w \Vert_2^2 ≤ R^2 \qquad (𝒞_R)

Actually, if the loss $l$ is convex, both problems are equivalent, i.e.:

  • $∀R, ∃λ_R$
  • $∀λ, ∃R_λ$

such that both problems have the same solutions.

So: \begin{cases} ℛ_λ &= 𝒞_{R_λ}\\ 𝒞_R &= ℛ_{λ_R} \end{cases}


Show that: $∀λ ≥ 0$, the problem $ℛ_λ$ with solution $w_λ$ corresponds to the problem $𝒞_R$ with a certain $R$ (meaning that $w_λ$ is also a solution to $𝒞_R$)

Indication: $R^2_λ ≝ \Vert w_λ \Vert_2^2$

Show that $w_λ$ is a solution of

\min_w l(w^T x - y) \quad \text{ s.t. } \Vert w \Vert_2^2 ≤ R_λ^2 ≝ \Vert w_λ \Vert_2^2 \qquad (𝒞_R)

Assume $w’$ is a solution of $(𝒞_R)$ s.t. $w’≠ w_λ$:

  • either: $\Vert w’ \Vert = \Vert w_λ \Vert$: then $w’$ is also a solution of $ℛ_λ$ (because it minimizes the loss function and $λ \Vert w’ \Vert = λ \Vert w_λ \Vert$)

  • or: $\Vert w’ \Vert < \Vert w_λ \Vert$: then as $λ \Vert w’ \Vert < λ \Vert w_λ \Vert$ and $l(w’^T x - y) = l(w_λ^T x - y)$, it contradicts $w_λ$ minimizing $ℛ_λ$ (since $w’$ is a better minimizer thereof)


Consider $g(R) ≝ \min_{\Vert w \Vert_2^2 = R^2} \sum\limits_{i} l(w^T x_i - y_i)$$

f(R) ≝ \min_{\Vert w \Vert_2^2 ≤ R^2} \sum\limits_{ i } l(w^T x_i - y_i)

Show that $f$ is a decreasing function.

Let $R’ > R$.

  • $f(R’)$ solution of $\min_{\Vert w \Vert_2^2 ≤ R’^2} \sum\limits_{ i } l(w^T x_i - y_i)$
  • $f(R)$ solution of $\min_{\Vert w \Vert_2^2 ≤ R^2} \sum\limits_{ i } l(w^T x_i - y_i)$

Since \lbrace w \mid \Vert w \Vert_2^2 ≤ R'^2 \rbrace \supset \lbrace w \mid \Vert w \Vert_2^2 ≤ R^2 \rbrace

the minimization set for $R’$ is larger than the one for $R$, so that $f(R’) ≤ f(R)$

So f(R) = \inf_{R'≤ R} g(R')

Leave a comment