Lecture 6: PAC-Learning
- PAC (Probably Approximately Correct):
- A theoretical framework to analyze machine learning algorithms (Leslie Valiant, 1984).
Introduction
\[D_n ≝ \lbrace (X_i, Y_i) \rbrace_{1 ≤ i ≤ n} ⇝ ℙ \text{ iid }\]
- $X_i ∈ 𝒳$ (input space)
- $Y_i ∈ 𝒴$ (output)
- Learning algorithm:
- \[𝒜: \bigcup\limits_{n ≥ 1} \underbrace{(𝒳 × 𝒴)^n}_{\text{training data}} ⟶ \underbrace{𝒴^𝒳}_{\text{estimator of } Y \mid X}\]
- Loss function:
- \[l: \begin{cases} 𝒴 × 𝒴 ⟶ ℝ\\ (\hat{y}, y) ⟼ l(\hat{y}, y) = 1_{\hat{y} ≠ y} \end{cases}\]
- Statistical risk:
- \[ℛ(f) ≝ 𝔼_{(X, Y) \sim ℙ}\big( l(f(X), Y)\big)\]
- Estimator:
- \[\hat{f}_n = 𝒜(D_n) : \text{ random}\]
NB: $ℛ(\hat{f}_n)$ is also random
⟹ So the quantity we analyze is $\underbrace{𝔼}_{\text{randomness of } \hat{f}_n, \text{ i.e. of } D_n}(ℛ(\hat{f}_n))$
- Classification:
- \[ℛ(\hat{f}_n) ≝ 𝔼_{X, Y}(1_{\hat{f}_n(X) ≠ Y}) = ℙ_{X, Y}(\hat{f}_n(X) ≠ Y)\]
Goal: minimize $ℛ(\hat{f}_n)$, or rather (taking the randomness of $D_n$ into account) $𝔼(ℛ(\hat{f}_n))$.
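To make these definitions concrete, here is a minimal Python sketch (the distribution, the noise level and the threshold classifier are made up for illustration) that estimates $ℛ(f) = ℙ(f(X) ≠ Y)$ for a fixed classifier $f$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy distribution: X ~ Uniform[0, 1], Y = 1{X > 0.5} with
# labels flipped with probability 0.1 (so the best possible risk is 0.1).
def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (X > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return X, np.where(flip, 1 - Y, Y)

# A fixed classifier f, evaluated with the 0-1 loss l(f(x), y) = 1{f(x) != y}.
def f(x):
    return (x > 0.5).astype(int)

# Monte-Carlo estimate of R(f) = P(f(X) != Y); should be close to 0.1 here.
X, Y = sample(100_000)
print(np.mean(f(X) != Y))
```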
\[ℛ^\ast ≝ \min_{f \text{ measurable} \, ∈ \, 𝒴^𝒳} ℛ(f)\]
\[𝔼(ℛ(\hat{f}_n)) - ℛ^\ast = \Big(𝔼(ℛ(\hat{f}_n)) - \min_{f∈F} ℛ(f)\Big) + \Big(\min_{f∈F} ℛ(f) - ℛ^\ast \Big)\]
- the first term is the estimation error of $F$: this is the term we analyze here, and the results we prove are independent of $ℙ$ (they only depend on $\hat{f}_n$ and $F$)
- the second term is the approximation error of $F$
PAC Bounds
- PAC-Bound:
- $\hat{f}_n$ is said to be $ε$-accurate with probability $1-δ$ (aka $(ε, δ)$-PAC) if \(ℙ\Big(ℛ(\hat{f}_n) - \min_{f∈F} ℛ(f) > ε\Big) < δ\)
NB: this is stronger than a guarantee in expectation, since it implies one:
\[𝔼\Big(\underbrace{ℛ(\hat{f}_n) - \min_{f∈F} ℛ(f)}_{≝ \; A(\hat{f}_n)}\Big) ≤ 𝔼\big(A(\hat{f}_n) \mid A(\hat{f}_n) > ε\big) \underbrace{ℙ\big(A(\hat{f}_n) > ε\big)}_{≤ \, δ} + \underbrace{𝔼\big(A(\hat{f}_n) \mid A(\hat{f}_n) ≤ ε\big)}_{≤ \, ε} ℙ\big(A(\hat{f}_n) ≤ ε\big) ≤ δ + ε\]
(using that $A(\hat{f}_n) ≤ 1$ for the 0-1 loss, so the first conditional expectation is at most $1$)

A simple example of PAC-bound for binary classification:
- $𝒴 ≝ \lbrace 0, 1 \rbrace$
- $l(\hat{y}, y) = 1_{\hat{y}≠y}$
- $F$: finite set of functions from $𝒳 ⟶ 𝒴$
Assumption: $\min_{f∈ F} ℛ(f) = 0$
Th:
Let $\hat{f}_n ∈ argmin_{f∈F} \hat{ℛ}_n(f)$, where $\hat{ℛ}_n(f) ≝ \frac 1 n \sum\limits_{i=1}^n 1_{f(X_i) ≠ Y_i}$; then for any $ε>0$:
\[ℙ(ℛ(\hat{f}_n) > ε) ≤ \underbrace{\vert F \vert \exp(-n ε)}_{δ}\]
Proof: first we note that $\hat{ℛ}_n(\hat{f}_n) = 0$: indeed, some $f^\ast ∈ F$ has $ℛ(f^\ast) = 0$, so $\hat{ℛ}_n(f^\ast) = 0$ almost surely, and the empirical risk minimizer can only do as well.
Then, a union bound over the functions $f ∈ F$ with $ℛ(f) > ε$ (each of which satisfies $\hat{ℛ}_n(f) = 0$ with probability at most $(1-ε)^n$) gives
\[ℙ(ℛ(\hat{f}_n) > ε) ≤ \vert F \vert (1 - ε)^n ≤ \vert F \vert \exp(-nε)\]
NB: \(δ = \vert F \vert \exp(-nε) \overset{n ⟶ ∞}{⟶} 0\)
For a fixed $δ$, it suffices to take
\[n = \frac{\log \vert F \vert + \log(1/δ)}{ε}\]
Corollary:
\[𝔼(ℛ(\hat{f}_n)) ≤ \frac{1+\log \vert F \vert + \log n}{n}\]
Proof:
\[𝔼(ℛ(\hat{f}_n)) ≤ δ + ε ≤ \vert F \vert \exp(-n ε) + ε\]
Then take $ε = \frac{\log \vert F \vert + \log n}{n}$, so that $\vert F \vert \exp(-nε) = \frac 1 n$.
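As a quick numerical sanity check of the sample-size formula $n = \frac{\log \vert F \vert + \log(1/δ)}{ε}$ above (the values of $\vert F \vert$, $ε$ and $δ$ are arbitrary):

```python
import math

# Sample size needed so that the ERM over a finite class F (with min risk 0)
# has risk at most eps with probability 1 - delta; values are arbitrary.
card_F = 2**20          # |F|: about a million candidate functions
eps, delta = 0.01, 0.01
n = (math.log(card_F) + math.log(1 / delta)) / eps
print(math.ceil(n))     # ≈ 1847: n grows only logarithmically in |F| and 1/delta
```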
General PAC-Bounds
Goal: the same kind of bound, but without the assumption $\min_{f∈ F} ℛ(f) = 0$.
We still focus on $\hat{f}_n ∈ argmin_{f∈F} \hat{ℛ}_n(f)$.
How good is $\hat{f}_n$?
- Law of large numbers:
- \[\hat{ℛ}_n(f) = \frac 1 n \sum\limits_{ i=1 }^n l(f(X_i), Y_i) \overset{n⟶∞}{⟶} ℛ(f)\]
Assume that the following inequality holds with probability $1 - δ$:
\[∀f ∈ F, \; \vert \hat{ℛ}_n(f) - ℛ(f) \vert ≤ ε \qquad ⊛\]
(roughly true because of the law of large numbers). With $⊛$, writing $f^\ast ∈ argmin_{f∈F} ℛ(f)$:
\[ℛ(\hat{f}_n) \overset{⊛}{≤} \hat{ℛ}_n(\hat{f}_n) + ε ≤ \hat{ℛ}_n(f^\ast) + ε \overset{⊛}{≤} ℛ(f^\ast) + 2ε\]
where the middle inequality holds because $\hat{f}_n$ minimizes the empirical risk.
⟹ A bound like $⊛$ is enough to get PAC-bounds. How do we obtain $⊛$?
We need a non-asymptotic rate of convergence of $\hat{ℛ}_n(f)$ to $ℛ(f)$.
NB: the central limit theorem is not enough, because it is only asymptotic.
Solution: we use a concentration bound (the analogue of the central limit theorem, but which holds for each $n$).
Chernoff bound
\[∀ε> 0, \quad ℙ\Big(\frac 1 n \sum\limits_{ i=1 }^n Z_i ≥ p+ε \Big) ≤ \exp(-2n ε^2)\]
and
\[∀ε> 0, \quad ℙ\Big(\frac 1 n \sum\limits_{ i=1 }^n Z_i ≤ p-ε \Big) ≤ \exp(-2n ε^2)\]
where $Z_i \overset{\text{iid}}{⇝} 𝔹(p)$ (Bernoulli with parameter $p$).
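A quick empirical check of the first inequality (the values of $p$, $n$ and $ε$ are arbitrary); the observed tail frequency should lie below the Chernoff bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Z_i ~ Bernoulli(p), iid; compare the observed frequency of a large
# deviation of the empirical mean with the Chernoff bound exp(-2 n eps^2).
p, n, eps, trials = 0.3, 200, 0.05, 20_000
means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)

observed = np.mean(means >= p + eps)   # Monte-Carlo estimate of the tail probability
bound = np.exp(-2 * n * eps**2)        # ≈ 0.37 here
print(observed, bound)                 # observed frequency is well below the bound
```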
Other concentration inequalities exist, such as Azuma-Hoeffding (which holds for any bounded $Z_i$).
But… we want the result to hold for all $f∈F$ simultaneously, not just for one fixed $f$.
⟶ Solution: use a union bound:
\[ℙ(∃f∈F, \; \text{Property}(f)) ≤ \sum\limits_{ f ∈ F } ℙ(\text{Property}(f))\]
With a finite class $F$, for binary classification:
Th:
with probability $1 - δ$:
\[∀f∈F, \; ℛ(\hat{f}_n) ≤ ℛ(f) + \sqrt{\frac{2 \log \vert F \vert + \log(2/δ)}{n}}\]
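For intuition, here is how such a bound can be obtained by combining the Chernoff bound with the union bound (the lecture's own proof may arrange the constants slightly differently):

\[ℙ\Big(∃f∈F, \; \vert \hat{ℛ}_n(f) - ℛ(f) \vert > ε\Big) ≤ \sum\limits_{ f∈F } ℙ\Big(\vert \hat{ℛ}_n(f) - ℛ(f) \vert > ε\Big) ≤ 2 \vert F \vert \exp(-2nε^2)\]

Setting the right-hand side equal to $δ$ gives $ε = \sqrt{\frac{\log \vert F \vert + \log(2/δ)}{2n}}$, i.e. the uniform bound $⊛$ holds with probability $1-δ$; plugging it into the chain of inequalities above yields a bound of this form.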
Application to histograms
The histogram classifier ($\hat{f}_n^m$) predicts by majority vote in each cell.
- $𝒴 = \lbrace 0, 1 \rbrace$
- $𝒳 ≝ [0, 1]^d$
- $(Q_j)_{1 ≤ j ≤ m}$: partition of $𝒳$ into $m$ cells
$ℱ_n$: class of classification rules predicting $0$ or $1$ in each cell
⟹ $\vert ℱ_n \vert = 2^m$
NB: the histogram is the empirical risk minimizer over $ℱ_n$
\[\hat{f}_n^m ∈ argmin_{f∈ℱ_n} \hat{ℛ}_n(f)\]
Applying the previous theorem (with $\vert ℱ_n \vert = 2^m$), with probability $1-δ$:
\[ℛ(\hat{f}_n^m) ≤ \min_{f∈ℱ_n} ℛ(f) + \sqrt{\frac{2 m \log 2 + \log(2/δ)}{n}}\]
Compromise:
- for the $\min$ (approximation error): $m$ needs to be big
- to make the square-root term (estimation error) small: $m$ needs to be small
Th:
\[ℛ(\hat{f}_n^m) ⟶ ℛ^\ast \text{ if } m ⟶ ∞ \text{ and } \frac{m}{n} ⟶ 0\]
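To make the estimator concrete, here is a minimal 1-d Python sketch of the histogram classifier (majority vote in each of $m$ equal-width cells of $[0,1]$); the data-generating process and all parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_histogram(X, Y, m):
    """ERM over the histogram rules: majority vote in each of the m
    equal-width cells of [0, 1] (ties and empty cells predict 0)."""
    cells = np.minimum((X * m).astype(int), m - 1)
    ones = np.zeros(m)
    counts = np.zeros(m)
    np.add.at(ones, cells, Y)
    np.add.at(counts, cells, 1)
    labels = (ones > counts / 2).astype(int)
    def predict(x):
        return labels[np.minimum((x * m).astype(int), m - 1)]
    return predict

# Hypothetical toy data: Bayes rule is 1{X > 0.3}, labels flipped w.p. 0.1.
def sample(n):
    X = rng.uniform(0, 1, n)
    Y = (X > 0.3).astype(int)
    return X, np.where(rng.uniform(0, 1, n) < 0.1, 1 - Y, Y)

X_train, Y_train = sample(2_000)
f_hat = fit_histogram(X_train, Y_train, m=16)
X_test, Y_test = sample(100_000)
print(np.mean(f_hat(X_test) != Y_test))   # close to the Bayes risk R* = 0.1
```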
Countably infinite set $F$
We need to assign a positive number $c(f)$ to each $f∈F$ s.t.
\[\sum\limits_{ f∈F }\exp(-c(f)) ≤ 1\]
$c(f)$ can be interpreted as:
- a measure of complexity of $f$
- obtained from a prior: suppose we know in advance which functions are likely to be good, in the form of a probability distribution $π$ over $F$. Then \(\sum\limits_{ f∈F } π(f) = 1\), and thus \(\sum\limits_{ f∈F } \exp\Big(-\underbrace{\log(1/π(f))}_{ ≝ \; c(f)}\Big) ≤ 1\)
Information theory view of $c(f)$
Encode each function $f∈F$ in a binary language: $c(f)$ is the length of the code word for $f$.
Then \(\sum\limits_{ f∈F }\exp(-c(f)) ≤ 1\) is Kraft's inequality (up to measuring code lengths in nats rather than bits).
Th:
with probability $1 - δ$:
\[∀f∈F, \; ℛ(f) ≤ \hat{ℛ}_n(f) + \sqrt{\frac{2 (c(f) + \log(1/δ))}{2n}}\]
Algorithm:
- Penalized empirical risk minimizer:
- \[\hat{f}_n ∈ argmin_{f∈ F} \left\lbrace \hat{ℛ}_n(f) + \sqrt{\frac{2 (c(f) + \log(1/δ))}{2n}}\right\rbrace\]
For this penalized ERM:
\[∀f∈F, \; ℛ(\hat{f}_n) ≤ ℛ(f) + \sqrt{\frac{2 (c(f) + \log(2/δ))}{n}} \text{ with probability } 1-δ\]
To get a good bound, we need some $f∈F$ with both small $ℛ(f)$ and small $c(f)$ ⟶ no free lunch
Warning: the algorithm can be NP-hard to compute (with the 0-1 loss) ⟹ we need convex loss functions to compute it efficiently.
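To illustrate penalized ERM itself (side-stepping the computational issue by brute force over a small grid), here is a Python sketch over a toy family of threshold classifiers $f_t(x) = 1_{x > t}$; the data, the grid of thresholds and the complexities $c(f_t)$ below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: Bayes rule is 1{X > 0.3}, labels flipped w.p. 0.1.
n, delta = 500, 0.05
X = rng.uniform(0, 1, n)
Y = (X > 0.3).astype(int)
Y = np.where(rng.uniform(0, 1, n) < 0.1, 1 - Y, Y)

# Candidate classifiers f_t(x) = 1{x > t} with assumed complexities c(f_t).
thresholds = np.linspace(0.0, 1.0, 21)
c = np.arange(1.0, len(thresholds) + 1)   # made-up complexity values

# Penalized ERM: minimize emp. risk + sqrt(2 (c(f) + log(1/delta)) / (2n)),
# matching the penalty in the definition above.
emp_risk = np.array([np.mean((X > t).astype(int) != Y) for t in thresholds])
penalty = np.sqrt(2 * (c + np.log(1 / delta)) / (2 * n))
best = int(np.argmin(emp_risk + penalty))
print(thresholds[best], emp_risk[best], penalty[best])
```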
Back to histograms:
\[F_m ≝ \text{class of histograms with } m \text{ cells}, \qquad \vert F_m \vert = 2^m\]
If we have $f ∈ \bigcup_{m≥1} F_m$, we can use:
- $m$ bits to encode to which $F_m$ it belongs (e.g. writing $m$ in unary)
- $m$ bits to encode which $f∈ F_m$ (one bit per cell)
⟹ We can encode all histograms in $\bigcup_{m≥1} F_m$ in a language with $2m$ bits for any $f∈F_m$.
⟹ $c(f) = 2m$
⟹ we just lose a $\sqrt{2}$ factor in the bound when we work with $\bigcup_{m≥1} F_m$ instead of a single $F_m$
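Sanity check: plugging $c(f) = 2m$ directly into the constraint $\sum_{f∈F} \exp(-c(f)) ≤ 1$, this choice is indeed admissible, since

\[\sum\limits_{ m ≥ 1 } \; \sum\limits_{ f∈F_m } \exp(-2m) = \sum\limits_{ m ≥ 1 } 2^m e^{-2m} = \sum\limits_{ m ≥ 1 } \Big(\frac{2}{e^2}\Big)^m = \frac{2/e^2}{1 - 2/e^2} ≈ 0.37 ≤ 1\]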