Lecture 10: Summary
Summary of the class
Supervised learning
 inputs: $x ∈ 𝒳$ (pictures, text, etc…)
 outputs: $y ∈ 𝒴 ∈ \lbrace ℝ, \lbrace -1, 1 \rbrace, \lbrace 1, ⋯, K \rbrace \rbrace$
Goal: find a predictor, which is a function $f: 𝒳 ⟶ 𝒴$ that predicts the labels of new data points.
 Data: $(x_1, y_1), ⋯, (x_n, y_n) ∈ 𝒳 × 𝒴$
 Loss function:
 \[l(y, \hat{y}) \text{ where } \begin{cases} y =\text{ true output} \\ \hat{y}= \text{ prediction} \end{cases}\]
Ex:
 0-1: $1_{y ≠ \hat{y}}$
 Square: $(y - \hat{y})^2$
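The two example losses can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy labels are assumptions, not part of the lecture.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """0-1 loss: 1 where the prediction differs from the true output, else 0."""
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def square_loss(y, y_hat):
    """Square loss: squared difference between true output and prediction."""
    return (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2

# Toy labels in {-1, 1} (made up for the example).
y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
avg_01 = zero_one_loss(y_true, y_pred).mean()          # fraction of mistakes
avg_sq = square_loss([0.5, 2.0], [0.0, 3.0]).mean()    # mean squared error
print(avg_01, avg_sq)
```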
Train/test data ⟶ there is an underlying relationship between train and test data
Ideal situation: $(x_i, y_i)$ are iid from $p(x, y)$ (train) and test data is from the same distribution.
NB: but this is never true in practice
Ex:
 images from customers are not independent, for Facebook
 for advertisement: train data varies across time
When the test distribution differs from the train distribution ⟹ domain adaptation (ex: recommender systems at Amazon: recommendations for books and for DVDs are different, so the train data are not of the same type as the test data)
Goal: find $f$ s.t. $𝔼_{p(x, y)}(l(y, f(x)))$ is minimal
Optimal solution
Ex: $l = \text{ square}$
\[f^\ast(x) = 𝔼(y \mid x)\] \[\begin{align*} 𝔼_{x, y}(\vert y - f(x) \vert^2) & \text{ minimal} \\ ⟺ & \quad 𝔼_x 𝔼_{y \mid x}(y - f(x))^2 \text{ minimal} \\ \end{align*}\]But just because I have a formula doesn’t mean I can compute it in practice.
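The reason the conditional mean is the minimizer can be checked pointwise. In the sketch below, $m(x) ≝ 𝔼(y \mid x)$ and the candidate value $a = f(x)$ are notation introduced only for this derivation. For a fixed $x$:
\[\begin{align*} 𝔼_{y \mid x}(y - a)^2 &= 𝔼_{y \mid x}(y - m(x))^2 + 2 (m(x) - a) \underbrace{𝔼_{y \mid x}(y - m(x))}_{= 0} + (m(x) - a)^2 \\ &= 𝔼_{y \mid x}(y - m(x))^2 + (m(x) - a)^2 \end{align*}\]
The first term does not depend on $a$, and the second is minimal (equal to zero) exactly when $a = m(x)$.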
0-1 loss:
\[f^\ast(x) =\begin{cases} 1 &&\text{ if } p(y=1 \mid x)>p(y=-1 \mid x) \\ -1 &&\text{ otherwise} \end{cases}\]
Population $p(x, y)$
 VS empirical observations: data (iid) $(x_i, y_i)$
In practice, for a given $x$, we hardly ever have it already in our train data points.
Two types of estimators
$\hat{f}: 𝒳 ⟶ 𝒴$ random (because data is random)

Local averaging: compute $𝔼(y \mid x)$

Optimization: replace the expectation by the empirical average and optimize it
Local averaging
Compute $𝔼(y \mid x)$
Look around your $x$ (which is typically not among the train data points), and give it the label obtained from its neighbors (majority vote in case of classification, average in case of regression), the neighbors being either
 its $k$ nearest neighbors
 all points within distance $ε$
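A minimal sketch of the $k$-nearest-neighbor variant of local averaging follows. It assumes Euclidean distance and NumPy arrays; the function name `knn_predict` and the toy data are illustrative only.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, classification=True):
    """Predict at x by local averaging over its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    neighbors = y_train[nearest]
    if classification:
        labels, counts = np.unique(neighbors, return_counts=True)
        return labels[np.argmax(counts)]          # majority vote (ties go to the smallest label)
    return neighbors.mean()                       # plain average for regression

# Toy training set with labels in {-1, 1} (made up for the example).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([-1, -1, 1, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3)
print(label)
```

Using an odd $k$ avoids ties in binary classification; the $ε$-ball variant would replace the `argsort` step with a mask `dists <= eps`.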
Control of complexity
To avoid overfitting (THE bane of machine learning)
Ex: $k$-NN

if you take $k=1$, you tend to overfit much more (you depend entirely on your nearest neighbor)

the bigger $k$, the better the estimation of the expectation
 but if $k$ is too big ⟹ underfitting
The bigger $k$:
 the higher the train error (with $k=1$, each train point is its own nearest neighbor, so the train error is typically zero)
 the lower the test error, until underfitting is reached, from which point onwards the test error increases again
Overfitting is worse in a way, because you’re less flexible (you’re kind of lying).
Cross validation
Never ever look at the test set (otherwise you’re going to start to overfit).
Validation set included in the training set ⟹ we work on it when optimizing our predictors.
The test is only used at the very end, to evaluate once and for all our classifier (until then, we work on the validation set).
 Error bars:

try several splits of train-validation sets, and take:
 the average error overall
 the standard deviation as well, to have a sense of the stability
 NB: pay attention to significant digits! (like in physics)
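The error-bar procedure can be sketched as follows. Everything here is an assumption made for illustration: the synthetic data, the ridge regression model, the helper names and the choice of 10 random splits. The test set is deliberately absent, since it would only be used once at the very end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (purely illustrative).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def fit_ridge(X, y, lam):
    """Closed-form minimizer of 1/(2n) ||y - Xw||^2 + lam/2 ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def validation_errors(X, y, lam, n_splits=10, val_fraction=0.2):
    """Mean squared validation error for several random train/validation splits."""
    errors = []
    n_val = int(val_fraction * len(y))
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        val, train = perm[:n_val], perm[n_val:]
        w = fit_ridge(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ w) ** 2))
    return np.array(errors)

errs = validation_errors(X, y, lam=0.1)
# Report the average and the standard deviation, with only the digits that are significant.
print(f"validation MSE: {errs.mean():.2f} +/- {errs.std():.2f}")
```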
Model
Goal: find $f: 𝒳 ⟶ 𝒴$

$k$-NN: because of local averaging ⟹ nonlinear
 nonlinear ⟹ tends to overfit a bit
Empirical risk minimization
Goal: minimize, for $f ∈ ℱ$: \(\frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i))\)
$ℱ$ is a set of
 measurable functions

restricted class of functions:
 linear
 polynomials
 smooth functions (kernel methods)
For complexity:
$x ∈ ℝ^d$
\[f(x) = w^T x + b\]Control of complexity = degree of polynomial (if it is too big ⟹ overfitting)
Regularization
Same model class, but there’s a penalty by a norm $Ω(f)$ (term $λ Ω(f)$)
Ex:
 $f$ is $w^T x + b$
 $Ω(f) = \Vert w \Vert^2$
NB: if $n ≤ d$: there exists a perfect fit ⟹ it doesn’t generalize well
indeed:
 $y ∈ ℝ^n$
 $X ∈ ℝ^{n × d}$
 $y = X w, \quad w∈ℝ^d$ ⟹ one can find such a $w$ if $n ≤ d$
Ex: Genetics: for genome $d ≃ 10 000$, but number of patients usually $≃ 1000$, so we can make a perfect fit
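The perfect-fit phenomenon is easy to reproduce numerically. The sketch below uses random Gaussian data (an assumption for illustration, for which $X$ has full row rank almost surely) and pure-noise outputs, so there is nothing to learn, yet the train error is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                      # fewer observations than dimensions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)             # pure noise: no relationship between X and y

# Minimum-norm solution of y = X w, which exists because n <= d and X has rank n here.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((y - X @ w) ** 2)

# Fresh data from the same distribution: the perfect fit does not carry over.
X_new = rng.normal(size=(1000, d))
y_new = rng.normal(size=1000)
test_mse = np.mean((y_new - X_new @ w) ** 2)
print(train_mse, test_mse)
```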
Simple models:
$n \gg d$
Not sufficiently many observations ⟶ regularization
Complex models (where $f$ is very complicated)
ex: $k$-NN ⟹ $n \gg 2^d$
⟶ curse of dimensionality:
\[ε \gg \frac 1 {n^{1/d}}\]So $n$ has to grow exponentially with $d$, not just be much larger than $d$
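As a rough worked example (the numerical values of $ε$ and $d$ are illustrative assumptions), the condition can be rearranged to show how many points are needed:
\[ε \gg \frac 1 {n^{1/d}} ⟺ n \gg \left(\frac 1 ε\right)^d\]
With $ε = 1/2$ this gives $n \gg 2^d$, and with $ε = 0.1$ in dimension $d = 10$ it already requires $n \gg 10^{10}$ points.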
Kernel methods
\[f(x) = ⟨ϕ(x), w⟩ = w^T ϕ(x)\] Feature map:
 \[ϕ: 𝒳 ⟶ ℝ^d\]
If you penalize by $\Vert w \Vert^2_2$
\[\hat{w} = \sum\limits_{ i=1 }^n α_i ϕ(x_i)\]Representer theorem:
\[f(x) = \sum\limits_{ i } α_i \underbrace{ϕ(x_i)^T ϕ(x) }_{h(x_i, x) \text{ = kernel}}\]In some cases $d$ is very large, but we can still compute the kernels
Ex: Order $r$ polynomials in dimension $d$ ⟹ feature space of size $≃ d^r$
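For order $r = 2$, this can be checked directly. The sketch below assumes the homogeneous polynomial kernel $(x^T y)^2$ and the explicit feature map made of all $d^2$ products $x_i x_j$; the function names are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map: all d*d products x_i * x_j (order-2 monomials)."""
    return np.outer(x, x).ravel()

def poly_kernel(x, y):
    """Same inner product computed in O(d), without ever building phi."""
    return (x @ y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
explicit = phi(x) @ phi(y)       # inner product in the feature space of size d^2 = 16
implicit = poly_kernel(x, y)     # kernel evaluation in the input space of size d = 4
print(explicit, implicit)        # the two values agree up to rounding
```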
Kernel trick
$d$ is infinite ($ℝ^d$ becomes a Hilbert space)
If we can compute the kernel, we don’t even need to know $ϕ(x)$
Ex: Gaussian kernel:
\[h(x, y) = \exp(-α \Vert x - y \Vert^2) = ⟨ϕ(x), ϕ(y)⟩\]Minimize empirical risk:
\[\min_{w ∈ ℝ^d}\frac 1 n \sum\limits_{ i } l(y_i, w^T ϕ(x_i)) + \underbrace{\frac λ 2}_{\text{control of complexity}} \Vert w \Vert^2_2\]Other controls of complexity:
 $k$ in $k$-NN
 degree of polynomials
Convex optimization
 make things convex
 how to get a solution
Least squares
\[\frac 1 {2n} \sum\limits_{ i } \vert y_i - w^T ϕ(x_i) \vert^2 + \frac λ 2 \Vert w \Vert^2_2\]Binary classification
 $𝒴 = \lbrace -1, 1 \rbrace$
 $w^T ϕ(x) = f(x) ∈ ℝ ⟹ \text{prediction} = sign(f(x))$
Prediction error if $y f(x) < 0$
\[\frac 1 n \sum\limits_{ i } 1_{y_i w^T ϕ(x_i) <0} + \frac λ 2 \Vert w \Vert^2_2\]Problem: not convex, not even continuous
⟹ we convexify it (ex: SVM, logistic loss, etc…)
So that now we have:
\[\frac 1 n \sum\limits_{ i } \log(1 + \exp(-y_i w^T ϕ(x_i))) + \frac λ 2 \Vert w \Vert^2_2 ≝ H(w)\]Then
Gradient descent
Iterative algorithm:
\[w_t = w_{t-1} - γ H'(w_{t-1})\]
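A minimal sketch of this iteration applied to the regularized logistic objective $H(w)$ is given below. The step size $γ$, the number of iterations, the value of $λ$ and the synthetic data are all assumptions chosen for illustration; `Phi` is a hypothetical name for the matrix whose rows are the $ϕ(x_i)$.

```python
import numpy as np

def H(w, Phi, y, lam):
    """Regularized logistic objective; rows of Phi are phi(x_i), y is in {-1, 1}."""
    margins = y * (Phi @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w)

def grad_H(w, Phi, y, lam):
    """Gradient of H, using d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))."""
    margins = y * (Phi @ w)
    coeffs = -y / (1.0 + np.exp(margins))
    return Phi.T @ coeffs / len(y) + lam * w

def gradient_descent(Phi, y, lam=0.1, gamma=0.5, n_steps=200):
    """Iterate w_t = w_{t-1} - gamma * H'(w_{t-1}), starting from w_0 = 0."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        w = w - gamma * grad_H(w, Phi, y, lam)
    return w

# Synthetic binary classification data (purely illustrative).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
y = np.sign(Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100))
w_hat = gradient_descent(Phi, y)
print(H(np.zeros(3), Phi, y, 0.1), H(w_hat, Phi, y, 0.1))
```

Because $H$ is convex (and strongly convex thanks to the penalty), a small enough constant step size is sufficient for the iterates to converge to the global minimizer.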
How to get a solution?
⟶ Formula VS iterative algorithms
Take a look at Lagrange classifier too
Probabilistic interpretation
Maximum likelihood
Logistic regression
\[p(y = 1\mid x) = σ(w^Tx + b)\]where \(σ(z) ≝ \frac 1 {1 + \exp(-z)}\), i.e. \(p(y = 1 \mid x) = \frac 1 {1 + \exp(-w^Tx - b)}\)
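The link with the logistic loss used earlier can be spelled out. This is a short derivation under the assumption that labels are coded as $y ∈ \lbrace -1, 1 \rbrace$ and that the data are iid. Since $1 - σ(z) = σ(-z)$, both cases are covered by $p(y \mid x) = σ(y (w^T x + b))$, so the negative log-likelihood is
\[-\frac 1 n \sum\limits_{ i=1 }^n \log p(y_i \mid x_i) = \frac 1 n \sum\limits_{ i=1 }^n \log(1 + \exp(-y_i (w^T x_i + b)))\]
Maximum likelihood is therefore the same as minimizing the empirical logistic loss, without the regularization term.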
Beyond the class (toward the MVA master at ENS Cachan)
Machine learning
\[\underbrace{\text{Theory}}_{\text{statistics}} \overbrace{⟶}^{\text{structured output:} 𝒴 \text{ is complex}} \underbrace{\text{Algorithm}}_{\text{ large-scale}} ⟶ \text{Applications}\]
Large-scale: when $n, d ≃ 10^9$
 ⟹ stochastic gradient descent: the full average in \(w_t = w_{t-1} - γ \left(\frac 1 n \sum\limits_{ i=1 }^n l'(y_i, w_{t-1}^T x_i) \right)\) is replaced by picking one $i$ at random and computing only $l'(y_i, w_{t-1}^T x_i)$
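The sketch below illustrates this replacement for the logistic loss. The constant step size, the number of steps and the synthetic data are assumptions for illustration; here the per-example term is taken to be the gradient with respect to $w$ of the loss at example $i$.

```python
import numpy as np

def sgd_logistic(X, y, gamma=0.1, n_steps=5000, seed=0):
    """SGD on the average logistic loss: one randomly chosen example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)                              # pick one index i at random
        margin = y[i] * (X[i] @ w)
        grad_i = -y[i] * X[i] / (1.0 + np.exp(margin))   # gradient in w of the loss at example i
        w = w - gamma * grad_i                           # cost per step is O(d), independent of n
    return w

# Synthetic noisy binary data with labels in {-1, 1} (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
scores = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=500)
y = np.where(scores > 0, 1.0, -1.0)
w_sgd = sgd_logistic(X, y)
accuracy = np.mean(np.sign(X @ w_sgd) == y)
print(f"training accuracy: {accuracy:.2f}")
```

With a constant step size the iterates keep fluctuating around the minimizer; a decreasing step size is a common alternative.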
Neural networks
\[f(x_k) = \sum\limits_{ i=1 }^d w_i x_k[i] = w^T x_k\]⟹ Single neuron
digraph {
rankdir=LR;
b[label="",shape=none];
"x_1" -> "∑ w_i x_i"[label="w_1"];
"x_2" -> "∑ w_i x_i"[label="w_2"];
"⋮" -> "∑ w_i x_i";
"x_n" -> "∑ w_i x_i"[label="w_n"];
"∑ w_i x_i" -> b;
}
For 3 layers
\[f(x) = σ(w_3^T σ(w_2^T σ(w_1^T x)))\]Related domains
 Computer vision
 Natural language
 Speech
 Bioinformatics
also Reinforcement learning
Unsupervised learning
 PCA
 $k$-means