Lecture 6: PAC-Learning

Probably Approximately Correct

PAC:

A theoretical framework for analyzing machine learning algorithms (Leslie Valiant, 1984).

Introduction

D_n ≝ \lbrace (X_i, Y_i) \rbrace_{1 ≤ i ≤ n} ⇝ ℙ \text{ iid }
  • $X_i ∈ 𝒳$ (input space)
  • $Y_i ∈ 𝒴$ (output space)
Learning algo:
𝒜: \bigcup\limits_{n ≥ 1} \underbrace{(𝒳 × 𝒴)^n}_{\text{training data}} ⟶ \underbrace{𝒴^𝒳}_{\text{estimator of } Y \mid X}
Loss function:
l: \begin{cases} 𝒴 × 𝒴 ⟶ ℝ\\ \hat{y}, y ⟼ l(\hat{y}, y) = 1_{\hat{y}≠ y} \end{cases}
Statistical risk:
ℛ(f) ≝ 𝔼_{(X, Y) \sim ℙ}\big( l(f(X), Y) \big)
Estimator:
\hat{f}_n = 𝒜(D_n) : \text{ random}

NB: $ℛ(\hat{f}_n)$ is also random

⟹ Solution: we analyze $\underbrace{𝔼}_{\text{randomness of } \hat{f}_n \text{, i.e. of } D_n}(ℛ(\hat{f}_n))$

Classification:
ℛ(\hat{f}_n) = 𝔼_{X, Y}(1_{\hat{f}_n(X) ≠ Y}) = ℙ_{X, Y}(\hat{f}_n(X) ≠ Y)
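Since the classification risk is a probability of error, it can be approximated for a fixed classifier by averaging the 0-1 loss over fresh samples. The sketch below assumes a toy distribution that is not part of the lecture (`sample`, the 10% label noise and the threshold classifier `f` are hypothetical choices for illustration only).

```python
import random

def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return 1 if y_hat != y else 0

def average_loss(f, data):
    """Average 0-1 loss of classifier f over a list of (x, y) pairs."""
    return sum(zero_one_loss(f(x), y) for x, y in data) / len(data)

def sample(n, rng):
    """Hypothetical distribution: X uniform on [0, 1], Y = 1_{X > 0.5} flipped with prob. 0.1."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def f(x):
    """Fixed classifier whose true risk under this toy distribution is 0.1."""
    return int(x > 0.5)

rng = random.Random(0)
risk_estimate = average_loss(f, sample(20000, rng))  # Monte Carlo estimate of R(f)
```

With this toy distribution the estimate should land close to the label-noise level, since `f` only errs on flipped labels.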

Goal: minimize $ℛ(\hat{f}_n)$; since it is random, we minimize $𝔼(ℛ(\hat{f}_n))$ instead

ℛ^\ast ≝ \min_{f \text{ measurable } ∈ 𝒴^𝒳} ℛ(f)
𝔼(ℛ(\hat{f}_n)) - ℛ^\ast = \Big(𝔼(ℛ(\hat{f}_n)) - \min_{f∈F} ℛ(f)\Big) + \Big(\min_{f∈F} ℛ(f) - ℛ^\ast \Big)
  • the first term is the estimation error of $F$

    • here we analyze this term; the results hold independently of $ℙ$ (they depend only on $\hat{f}_n$ and $F$)
  • the second one is the approximation error of $F$

PAC Bounds

PAC-Bound:

$\hat{f}_n$ is said to be $ε$-accurate with probability $1-δ$ (aka $(ε, δ)$-PAC) if
ℙ\Big(ℛ(\hat{f}_n) - \min_{f∈F} ℛ(f) > ε\Big) < δ

NB: this is stronger than a guarantee in expectation. Writing $A(\hat{f}_n) ≝ ℛ(\hat{f}_n) - \min_{f∈F} ℛ(f)$, which lies in $[0, 1]$ for the 0-1 loss when $\hat{f}_n ∈ F$:

𝔼(A(\hat{f}_n)) = \underbrace{𝔼(A(\hat{f}_n) \mid A(\hat{f}_n) > ε)}_{≤ 1} \underbrace{ℙ(A(\hat{f}_n) > ε)}_{≤ δ} + \underbrace{𝔼(A(\hat{f}_n) \mid A(\hat{f}_n) ≤ ε)}_{≤ ε} \underbrace{ℙ(A(\hat{f}_n) ≤ ε)}_{≤ 1} \\ ≤ δ + ε

A simple example of PAC-bound for binary classification:

  • $𝒴 ≝ \lbrace 0, 1 \rbrace$
  • $l(\hat{y}, y) = 1_{\hat{y}≠y}$
  • $F$: finite set of functions from $𝒳 ⟶ 𝒴$

Assumption: $\min_{f∈ F} ℛ(f) = 0$

Th:

Let $\hat{f}_n ∈ argmin_{f∈F} \Big\lbrace \hat{ℛ}_n(f) = \frac 1 n \sum\limits_{ i=1 }^n 1_{f(X_i) ≠ Y_i} \Big\rbrace$ then for any $ε>0$:

ℙ(ℛ(\hat{f}_n) > ε) ≤ \underbrace{\vert F \vert \exp(-n ε)}_{δ}

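Proof: first we note that $\hat{ℛ}_n(\hat{f}_n) = 0$ almost surely: by assumption some $f^\ast ∈ F$ has $ℛ(f^\ast) = 0$, hence $\hat{ℛ}_n(f^\ast) = 0$, and $\hat{f}_n$ minimizes $\hat{ℛ}_n$.

Then, for a fixed $f$ with $ℛ(f) > ε$, we have $ℙ(\hat{ℛ}_n(f) = 0) = (1 - ℛ(f))^n ≤ (1 - ε)^n$. A union bound over the (at most $\vert F \vert$) such functions shows that

ℙ(ℛ(\hat{f}_n) > ε) ≤ \vert F \vert (1 - ε)^n ≤ \vert F \vert \exp(-nε) \overset{n ⟶ ∞}{⟶} 0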
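The bound can be checked empirically on a small realizable problem. The sketch below is only an illustration under assumptions that are not in the lecture: $F$ is a hypothetical class of 11 threshold classifiers $x ⟼ 1_{x > t}$ on $[0, 1]$, $X$ is uniform, and the true labels come from the threshold $0.5$, so that $\min_{f∈F} ℛ(f) = 0$.

```python
import math
import random

def erm_threshold(data, thresholds):
    """Return the first threshold t whose classifier x -> 1_{x > t} has zero empirical error."""
    for t in thresholds:
        if all(int(x > t) == y for x, y in data):
            return t
    raise ValueError("no consistent classifier: the realizable assumption is violated")

def failure_frequency(n, eps, num_trials, rng):
    """Fraction of trials in which the ERM has true risk > eps."""
    thresholds = [k / 10 for k in range(11)]  # finite class F, |F| = 11
    t_star = 0.5                              # labels are 1_{x > 0.5}, so the minimal risk is 0
    failures = 0
    for _ in range(num_trials):
        data = [(x, int(x > t_star)) for x in (rng.random() for _ in range(n))]
        t_hat = erm_threshold(data, thresholds)
        true_risk = abs(t_hat - t_star)       # mass of X between t_hat and t_star (X uniform)
        failures += true_risk > eps
    return failures / num_trials

rng = random.Random(0)
n, eps = 50, 0.1
freq = failure_frequency(n, eps, 2000, rng)   # empirical P(R(f_hat) > eps)
bound = 11 * math.exp(-n * eps)               # |F| exp(-n eps)
```

The observed failure frequency should sit well below the bound, which is loose here because most functions in this toy class are ruled out after very few samples.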

NB: δ = \vert F \vert \exp(-nε) \overset{n ⟶ ∞}{⟶} 0

For a fixed $δ$, it suffices to take:

n ≥ \frac{\log \vert F \vert + \log(1/δ)}{ε}
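As a quick numerical illustration, the sample size needed for a given accuracy and confidence can be computed by solving $\vert F \vert \exp(-nε) ≤ δ$ for $n$. The function name and the example values below are hypothetical.

```python
import math

def sample_size_realizable(num_hypotheses, eps, delta):
    """Smallest integer n such that |F| * exp(-n * eps) <= delta."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / eps)

# Example values (hypothetical): |F| = 1000, eps = 0.05, delta = 0.01
n_needed = sample_size_realizable(1000, 0.05, 0.01)
```

Note that the dependence on $\vert F \vert$ and $1/δ$ is only logarithmic, while the dependence on $1/ε$ is linear.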

Corollary:

𝔼(ℛ(\hat{f}_n)) ≤ \frac{1+\log \vert F \vert + \log n}{n}

Proof:

𝔼(ℛ(\hat{f}_n)) ≤ δ + ε ≤ \vert F \vert \exp(-n ε) + ε

Then, take $ε = \frac{\log \vert F \vert + \log n}{n}$
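The choice of $ε$ in the proof can be verified numerically: with $ε = \frac{\log \vert F \vert + \log n}{n}$, the term $\vert F \vert \exp(-nε)$ equals $1/n$, which gives the corollary. A minimal sketch, with hypothetical function name and example values:

```python
import math

def expected_risk_bound(num_hypotheses, n):
    """Evaluate |F| exp(-n eps) + eps at eps = (log|F| + log n) / n."""
    eps = (math.log(num_hypotheses) + math.log(n)) / n
    delta = num_hypotheses * math.exp(-n * eps)  # simplifies to 1 / n
    return delta + eps

value = expected_risk_bound(50, 200)
closed_form = (1 + math.log(50) + math.log(200)) / 200
```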

General PAC-Bounds

Goal: obtain bounds that hold without the assumption $\min_{f∈ F} ℛ(f) = 0$

We still focus on $\hat{f}_n ∈ argmin_{f∈F}(\hat{ℛ}_n(f))$

How good is $\hat{f}_n$?

Law of large numbers:
\hat{ℛ}_n(f) = \frac 1 n \sum\limits_{ i=1 }^n l(f(X_i), Y_i) \overset{n⟶∞}{⟶} ℛ(f)

Assume that we have the following inequality with proba $1 - δ$:

∀f ∈ F, \; \vert \hat{ℛ}_n(f) - ℛ(f) \vert ≤ ε \qquad ⊛

Roughly true because of law of large numbers.


With $⊛$:

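ℛ(\hat{f}_n) \overset{⊛}{≤} \hat{ℛ}_n(\hat{f}_n) + ε \\ ≤ \hat{ℛ}_n(f^\ast) + ε \\ \overset{⊛}{≤} ℛ(f^\ast) + 2ε

where $f^\ast ∈ argmin_{f∈F} ℛ(f)$; the middle inequality holds because $\hat{f}_n$ minimizes $\hat{ℛ}_n$ over $F$.

⟹ Bounds like $⊛$ are enough to get PAC-bounds.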

We want the rate of convergence of $\hat{ℛ}_n(f) ⟶ ℛ(f)$

NB: the central limit theorem is not enough, because it is only asymptotic.

Solution: we use a concentration bound (the analogue of the central limit theorem, but one that holds for each $n$)

Chernoff bound

∀ε> 0, ℙ(\frac 1 n \sum\limits_{ i=1 }^n Z_i ≥ p+ε ) ≤ \exp(-2n ε^2)

and

∀ε> 0, ℙ(\frac 1 n \sum\limits_{ i=1 }^n Z_i ≤ p-ε ) ≤ \exp(-2n ε^2)

where $Z_i \overset{\text{iid}}{⇝} 𝔹(p)$
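A small simulation makes the non-asymptotic nature of the bound concrete: for a fixed $n$, the frequency of large upward deviations of the empirical mean stays below $\exp(-2nε^2)$. The values of $p$, $n$, $ε$ and the number of trials below are arbitrary choices for illustration.

```python
import math
import random

def deviation_frequency(p, n, eps, num_trials, rng):
    """Fraction of trials where the mean of n Bernoulli(p) draws is >= p + eps."""
    hits = 0
    for _ in range(num_trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        hits += mean >= p + eps
    return hits / num_trials

rng = random.Random(0)
p, n, eps = 0.5, 100, 0.1
freq = deviation_frequency(p, n, eps, 5000, rng)  # empirical upper-tail probability
bound = math.exp(-2 * n * eps ** 2)               # Chernoff bound exp(-2 n eps^2)
```

The bound is not tight for these values, but it holds for every $n$ without any asymptotic argument.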

Other concentration inequalities exist, such as Azuma-Hoeffding (for any bounded $Z_i$).


But… we want the result for all $f∈F$, not just for one fixed $f$.

Solution: use a union bound:

ℙ(∃f∈F, \text{Property}(f)) ≤ \sum\limits_{ f ∈ F } ℙ(\text{Property}(f))

With a finite class $F$ for binary classification

Th:

with probability $1 - δ$:

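∀f∈F, \; ℛ(\hat{f}_n) ≤ ℛ(f) + \sqrt{\frac{2 (\log \vert F \vert + \log(2/δ))}{n}}

Indeed, the two Chernoff bounds and the union bound give $⊛$ with probability $1 - δ$ for $ε = \sqrt{\frac{\log \vert F \vert + \log(2/δ)}{2n}}$ (so that $2 \vert F \vert \exp(-2nε^2) = δ$), and the chain of inequalities under $⊛$ costs $2ε$.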
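To get a feel for the orders of magnitude, the sketch below computes the uniform deviation $ε$ obtained by combining the two-sided Chernoff bound with the union bound over $F$, that is the $ε$ solving $2 \vert F \vert \exp(-2nε^2) = δ$, and the resulting excess-risk term $2ε$ from the chain of inequalities under $⊛$. Function names and example values are hypothetical.

```python
import math

def uniform_deviation(num_hypotheses, n, delta):
    """eps such that 2 * |F| * exp(-2 * n * eps**2) == delta."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * n))

def excess_risk_term(num_hypotheses, n, delta):
    """Two applications of the uniform deviation bound cost 2 * eps."""
    return 2.0 * uniform_deviation(num_hypotheses, n, delta)

eps = uniform_deviation(100, 1000, 0.05)
term = excess_risk_term(100, 1000, 0.05)
```

The term decays like $1/\sqrt{n}$, compared with $1/n$ in the case $\min_{f∈F} ℛ(f) = 0$.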

Application to histograms

Histogram predicts by majority voting in each cell ($\hat{f}_n^m$)

  • $𝒴 = \lbrace 0, 1 \rbrace$
  • $𝒳 ≝ [0, 1]^d$
  • $(Q_j)_{1 ≤ j ≤ m}$ partition of $𝒳$

$ℱ_m$: class of classification rules predicting $0$ or $1$ in each cell

⟹ $\vert ℱ_m \vert = 2^m$

NB: the histogram is the empirical risk minimizer over $ℱ_m$

\hat{f}_n^m ∈ argmin_{f∈ℱ_m} \hat{ℛ}_n(f)
ℛ(\hat{f}_n^m) ≤ \min_{f∈ℱ_m} ℛ(f) + \sqrt{\frac{2 (m \log 2 + \log(2/δ))}{n}} \text{ with proba } 1-δ
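A histogram classifier is short to write down. The sketch below assumes $d = 1$ and $m$ equal-width cells on $[0, 1]$ (the lecture allows any partition), and breaks ties and empty cells in favour of label $0$; all function names are hypothetical.

```python
def fit_histogram(data, m):
    """Majority vote in each of m equal-width cells of [0, 1]; returns one label per cell."""
    ones = [0] * m
    counts = [0] * m
    for x, y in data:
        j = min(int(x * m), m - 1)  # cell index; x = 1.0 falls in the last cell
        counts[j] += 1
        ones[j] += y
    # Predict 1 only on a strict majority of ones; ties and empty cells predict 0.
    return [1 if 2 * ones[j] > counts[j] else 0 for j in range(m)]

def predict(labels, x):
    """Label of the cell containing x."""
    m = len(labels)
    return labels[min(int(x * m), m - 1)]

def empirical_risk(labels, data):
    """Average 0-1 loss of the histogram rule on the data."""
    return sum(predict(labels, x) != y for x, y in data) / len(data)

data = [(0.1, 0), (0.2, 0), (0.3, 1), (0.6, 1), (0.7, 1), (0.9, 0)]
labels = fit_histogram(data, 2)
risk = empirical_risk(labels, data)
```

Because the empirical risk decomposes over cells, choosing the majority label in each cell minimizes it over all $2^m$ labelings, which is why no explicit search over the class is needed.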

Compromise:

  • for the $\min$: $m$ needs to be big
  • to minimize the square root: $m$ needs to be small

Th:

ℛ(\hat{f}_n^m) ⟶ ℛ^\ast \text{ if } m ⟶ ∞ \text{ and } \frac{m}{n} ⟶ 0
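One schedule satisfying both conditions is $m = \lceil \sqrt{n} \rceil$, which is an arbitrary choice made here for illustration. The sketch evaluates the square-root term, taken in this sketch as $\sqrt{2 (m \log 2 + \log(2/δ)) / n}$, along that schedule to show that it vanishes even though $m$ grows.

```python
import math

def complexity_term(m, n, delta):
    """Square-root term of the histogram bound, in the form assumed in the lead-in."""
    return math.sqrt(2.0 * (m * math.log(2.0) + math.log(2.0 / delta)) / n)

def cells_for(n):
    """Hypothetical schedule m = ceil(sqrt(n)): m grows while m / n shrinks."""
    return math.ceil(math.sqrt(n))

sample_sizes = (100, 10_000, 1_000_000)
terms = [complexity_term(cells_for(n), n, 0.05) for n in sample_sizes]
```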

Countably infinite set $F$

We need to assign a positive number $c(f)$ to each $f∈F$ s.t.

\sum\limits_{ f∈F }\exp(-c(f)) ≤ 1

$c(f)$ can be interpreted as:

  • a measure of complexity of $f$
  • obtained from a prior: suppose we know in advance which functions are likely to be good, in the form of a probability distribution $π$ over $F$. Then, we have
    \sum\limits_{ f∈F } π(f) = 1
    Thus
    \sum\limits_{ f∈F } \exp(-\underbrace{\log(1/π(f))}_{ ≝ \; c(f)}) = 1
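The prior-based construction of $c(f)$ can be written in a couple of lines. The prior below is a hypothetical example over three named functions.

```python
import math

def complexities_from_prior(prior):
    """Map a prior {f: pi(f)} to complexities c(f) = log(1 / pi(f))."""
    return {f: math.log(1.0 / p) for f, p in prior.items()}

prior = {"f1": 0.5, "f2": 0.25, "f3": 0.25}  # hypothetical prior, sums to 1
c = complexities_from_prior(prior)
constraint_sum = sum(math.exp(-value) for value in c.values())  # sum of exp(-c(f))
```

Functions that the prior considers likely get a small $c(f)$, and therefore a small penalty in the bounds that use $c(f)$.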

Information theory view of $c(f)$

Encode each function $f∈F$ with a binary prefix code: $c(f)$ is the length of the code word for $f$.

Then, \sum\limits_{ f∈F }\exp(-c(f)) ≤ \sum\limits_{ f∈F } 2^{-c(f)} ≤ 1 by the Kraft inequality
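The coding view can be checked on a toy example. Assuming the code is prefix-free, the Kraft inequality gives $\sum_f 2^{-c(f)} ≤ 1$, and since $e^{-c} ≤ 2^{-c}$ for $c ≥ 0$ the required condition follows. The code words below are a hypothetical prefix code for four functions.

```python
import math

def is_prefix_free(words):
    """True if no code word is a proper prefix of another one."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

code = {"f1": "0", "f2": "10", "f3": "110", "f4": "111"}  # hypothetical prefix code
lengths = {f: len(w) for f, w in code.items()}            # c(f) = code word length in bits
kraft_sum = sum(2.0 ** -c for c in lengths.values())      # Kraft sum in base 2
exp_sum = sum(math.exp(-c) for c in lengths.values())     # sum of exp(-c(f))
```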


Th:

with probability $1 - δ$:

∀f∈F, \; ℛ(f) ≤ \hat{ℛ}_n(f) + \sqrt{\frac{c(f) + \log(1/δ)}{2n}}

(Chernoff bound for each $f$ with failure probability $δ \exp(-c(f))$, then union bound.)

Algorithm:

Penalized empirical risk minimizer:
\hat{f}_n ∈ argmin_{f∈ F} \left\lbrace \hat{ℛ}_n(f) + \sqrt{\frac{c(f) + \log(1/δ)}{2n}}\right\rbrace
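Over a finite list of candidates, the penalized minimizer is a one-line search. The sketch below takes the penalty as $\sqrt{(c(f) + \log(1/δ)) / (2n)}$, the form given by the Chernoff bound with per-function failure probability $δ \exp(-c(f))$; the candidate names, empirical risks and complexities are made-up values for illustration.

```python
import math

def penalty(c, n, delta):
    """Complexity penalty sqrt((c + log(1/delta)) / (2n)), as assumed in the lead-in."""
    return math.sqrt((c + math.log(1.0 / delta)) / (2.0 * n))

def penalized_erm(candidates, n, delta):
    """candidates: list of (name, empirical_risk, c). Returns the name with the smallest
    empirical risk plus penalty."""
    return min(candidates, key=lambda t: t[1] + penalty(t[2], n, delta))[0]

# Hypothetical candidates: a simple rule and a more complex one with lower empirical risk.
candidates = [("simple", 0.20, 1.0), ("complex", 0.15, 50.0)]
choice_small_n = penalized_erm(candidates, 100, 0.05)
choice_large_n = penalized_erm(candidates, 100_000, 0.05)
```

With few samples the penalty dominates and the simple candidate wins; with many samples the penalty shrinks and the lower empirical risk of the complex candidate takes over.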

For this penalized ERM:

∀f∈F, \; ℛ(\hat{f}_n) ≤ ℛ(f) + \sqrt{\frac{2 (c(f) + \log(2/δ))}{n}} \text{ with proba } 1-δ

To target a good bound, we need $f∈F$ with small $ℛ(f)$ and small $c(f)$ ⟶ no free lunch

Warning: the algorithm can be NP-hard to compute ⟹ we need convex loss functions to compute it efficiently

Algorithm:

\vert F_m \vert = 2^m \text{ histograms}

if we have $f ∈ \bigcup_{m≥1} F_m$: we can use

  • $m$ bits to encode which $F_m$ it belongs to
  • $m$ bits to encode which $f∈ F_m$ it is

⟹ We can encode all histograms in $\bigcup_{m≥1} F_m$ in a language with $2m$ bits for any $f∈F_m$.

⟹ $c(f) = 2m$

⟹ we just lose a $\sqrt{2}$ factor in the complexity term when we compute for $\bigcup_{m≥1} F_m$ instead of a single $F_m$
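One concrete way to realize this $2m$-bit encoding, given here as an assumption since the lecture does not fix the code, is to write $m$ in unary ($m - 1$ zeros followed by a one) and then append the $m$ cell labels. The sketch checks that the resulting code is prefix-free and satisfies the Kraft inequality for $m ≤ 4$.

```python
from itertools import product

def encode(m, labels):
    """Code word for a histogram with m cells: m in unary (m bits), then the m cell labels."""
    assert len(labels) == m
    unary = "0" * (m - 1) + "1"
    return unary + "".join(str(b) for b in labels)

def all_codewords(max_m):
    """Code words of every histogram with at most max_m cells."""
    return [encode(m, labels)
            for m in range(1, max_m + 1)
            for labels in product((0, 1), repeat=m)]

def is_prefix_free(words):
    """True if no code word is a proper prefix of another one."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

words = all_codewords(4)
kraft_sum = sum(2.0 ** -len(w) for w in words)  # sum over m of 2^m * 2^(-2m)
```

Each $F_m$ contributes $2^m \cdot 2^{-2m} = 2^{-m}$ to the Kraft sum, so the total stays below $1$ however many values of $m$ are included.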
