Lecture 10: Summary

Summary of the class

Supervised learning

  • inputs: $x ∈ 𝒳$ (pictures, text, etc…)
  • outputs: $y ∈ 𝒴 ∈ \lbrace ℝ, \lbrace -1, 1 \rbrace, \lbrace 1, ⋯, K \rbrace \rbrace$

Goal: find a predictor, which is a function $f: 𝒳 ⟶ 𝒴$ that predicts the labels of new data points.

  • Data: $(x_1, y_1), ⋯, (x_n, y_n) ∈ 𝒳 × 𝒴$
Loss function:
l(y, \hat{y}) \text{ where } \begin{cases} y =\text{ true output} \\ \hat{y}= \text{ prediction} \end{cases}

Ex:

  • 0-1: $1_{y ≠ \hat{y}}$
  • Square: $(y-\hat{y})^2$
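As a minimal sketch, the two example losses and their average over a set of predictions can be written in plain Python. The function names (`zero_one_loss`, `square_loss`, `average_loss`) are illustrative and not from the lecture.

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 1 if the prediction differs from the true output, else 0."""
    return 1.0 if y != y_hat else 0.0


def square_loss(y, y_hat):
    """Square loss: squared difference between true output and prediction."""
    return (y - y_hat) ** 2


def average_loss(loss, ys, y_hats):
    """Average of a loss over paired true outputs and predictions."""
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)


# Two mistakes out of four predictions
error_rate = average_loss(zero_one_loss, [1, -1, 1, 1], [1, 1, 1, -1])
```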

Train/test data ⟶ there is an underlying relationship between train and test data

Ideal situation: $(x_i, y_i)$ are iid from $p(x, y)$ (train) and test data is from the same distribution.

NB: but this is never true in practice

Ex:

  1. for Facebook: images from customers are not independent
  2. for advertisement: train data varies across time

When the test distribution ≠ the train distribution ⟹ domain adaptation (ex: recommender systems at Amazon: recommendations on books and on DVDs are different, so train and test data are not of the same type)

Goal: find $f$ s.t. $𝔼_{p(x, y)}(l(y, f(x)))$ is minimal

Optimal solution

Ex: $l = \text{ square}$

f^\ast(x) = 𝔼(y \mid x)
\begin{align*} 𝔼_{x, y}(\vert y - f(x) \vert^2) & \text{ minimal} \\ ⟺ & \quad 𝔼_x 𝔼_{y \mid x}(y - f(x))^2 \text{ minimal} \\ \end{align*}
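
The remaining step is a pointwise argument, spelled out here for completeness. For a fixed $x$ and any candidate value $a = f(x)$:

\begin{align*} 𝔼_{y \mid x}(y - a)^2 & = 𝔼_{y \mid x}(y - 𝔼(y \mid x))^2 + (𝔼(y \mid x) - a)^2 \\ \end{align*}

The first term does not depend on $a$, and the second is zero exactly when $a = 𝔼(y \mid x)$, so minimizing for every $x$ separately gives $f^\ast(x) = 𝔼(y \mid x)$.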

But just because I have a formula doesn’t mean I can compute it in practice.

0-1 loss:

f^\ast(x) =\begin{cases} 1 &&\text{ if } p(y=1 \mid x)>p(y=-1 \mid x) \\ -1 &&\text{ otherwise} \end{cases}

Population:

Population $p(x, y)$

  • VS empirical observations: data (iid) $(x_i, y_i)$

In practice, for a given $x$, we hardly ever have it already in our train data points.

Two types of estimators

$\hat{f}: 𝒳 ⟶ 𝒴$ random (because data is random)

  1. Local averaging: estimate $𝔼(y \mid x)$

  2. Optimization: replace the expectation by the empirical average and optimize it

Local averaging

Estimate $𝔼(y \mid x)$

Look around your $x$ (which is typically not among the train data points), and predict the majority label (classification) or the average output (regression) of

  • its $k$-nearest neighbors
  • all points within distance $ε$
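
A minimal sketch of the $k$-nearest-neighbors variant, assuming points are tuples of floats compared with the Euclidean distance (the lecture does not fix a distance). The names `k_nearest`, `knn_classify` and `knn_regress` are illustrative.

```python
import math
from collections import Counter


def k_nearest(train_x, x, k):
    """Indices of the k train points closest to x (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return order[:k]


def knn_classify(train_x, train_y, x, k):
    """Majority vote among the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_x, x, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_y, x, k):
    """Average of the outputs of the k nearest neighbors."""
    neighbors = k_nearest(train_x, x, k)
    return sum(train_y[i] for i in neighbors) / k


# Toy data: two well-separated clusters with labels -1 and 1
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = [-1, -1, -1, 1, 1, 1]
label = knn_classify(train_x, train_y, (0.5, 0.5), k=3)
```

An odd $k$ avoids ties in binary classification; with an even $k$, this sketch breaks ties in favor of the label met first among the neighbors.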

Control of complexity

To avoid overfitting (THE bane of machine learning)

Ex: $k$-NN

  • if you take $k=1$, you tend to overfit much more (you highly depend on your nearest neighbor)

  • the bigger $k$, the better the expectation estimation

    • but if $k$ is too big ⟹ underfitting

The bigger $k$:

  1. the higher the train error ($k=1$ gives zero train error, since each train point is its own nearest neighbor)
  2. the lower the test error, until underfitting is reached, from which point onwards the test error increases again

Overfitting is worse in a way, because you’re less flexible (you’re kind of lying).

Cross validation

Never ever look at the test set (otherwise you’re going to start to overfit).

The validation set is held out from the training set ⟹ we work on it when optimizing our predictors.

The test set is only used at the very end, to evaluate once and for all our classifier (until then, we work on the validation set).

Error bars:

try several splits of train-validation sets, and take:

  • the average error overall
  • the standard deviation as well, to have a sense of the stability
  • NB: pay attention to significant digits! (like in physics)
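
The procedure above can be sketched as follows, assuming random splits with 20% of the points held out for validation (the fraction, the number of splits and all names here are illustrative choices, not prescribed by the lecture). The baseline `fit_constant` is a hypothetical stand-in for any learning method.

```python
import random
import statistics


def validation_errors(xs, ys, fit, loss, n_splits=10, val_fraction=0.2, seed=0):
    """Mean and standard deviation of the validation error over random splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_val = max(1, int(val_fraction * n))
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        val_idx, train_idx = idx[:n_val], idx[n_val:]
        # Fit on the train part only, then evaluate on the held-out part
        predictor = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(statistics.mean([loss(ys[i], predictor(xs[i])) for i in val_idx]))
    return statistics.mean(errors), statistics.stdev(errors)


def fit_constant(train_x, train_y):
    """Hypothetical baseline: always predict the mean train output."""
    mean_y = statistics.mean(train_y)
    return lambda x: mean_y


def square_loss(y, y_hat):
    return (y - y_hat) ** 2


data_rng = random.Random(1)
xs = [i / 50 for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]
mean_err, std_err = validation_errors(xs, ys, fit_constant, square_loss)
# Report only as many digits as the standard deviation justifies
print(f"validation error: {mean_err:.2f} ± {std_err:.2f}")
```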

Model

Goal: find $f: 𝒳 ⟶ 𝒴$

  • $k$-NN: because of local averaging ⟹ non-linear

    • non-linear ⟹ tends to overfit a bit

Empirical risk minimization

Goal: minimize, for $f ∈ ℱ$: \frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i))

$ℱ$ is a set of

  • measurable functions
  • restricted class of functions:

    • linear
    • polynomials
    • smooth functions (kernel methods)

For complexity:

$x ∈ ℝ^d$

f(x) = w^T x + b

Control of complexity = degree of polynomial (if it is too big ⟹ overfitting)

Regularization

Same model class, but there’s a penalty by a norm $Ω(f)$ (term $λ Ω(f)$)

Ex:

  • $f$ is $w^T x + b$
  • $Ω(f) = \Vert w \Vert^2$

NB: if $n ≤ d$: there exists a perfect fit ⟹ it doesn’t generalize well

indeed:

  • $y ∈ ℝ^n$
  • $X ∈ ℝ^{n × d}$
  • $y = X w, \quad w∈ℝ^d$ ⟹ one can find such a $w$ if $n ≤ d$
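
One explicit choice, assuming the rows of $X$ are linearly independent (an extra assumption not stated above, which is only possible when $n ≤ d$), is the minimum-norm solution:

w = X^T (X X^T)^{-1} y \quad ⟹ \quad X w = X X^T (X X^T)^{-1} y = y

so the train error is exactly zero whatever the labels $y$ are, including pure noise.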

Ex: Genetics: for genome $d ≃ 10 000$, but number of patients usually $≃ 1000$, so we can make a perfect fit

Simple models:

$n ≫ d$

Not sufficiently many observations ⟶ regularization

Complex models (where $f$ is very complicated)

ex: $k$-NN ⟹ $n ≫ 2^d$

curse of dimensionality:

ε ≫ \frac 1 {n^{1/d}}

So $n$ has to be exponentially large in $d$ ($n ≫ ε^{-d}$)

Kernel methods

f(x) = ⟨ϕ(x), w⟩ = w^T ϕ(x)
Feature map:
ϕ: 𝒳 ⟶ ℝ^d

If you penalize by $\Vert w \Vert^2_2$, the minimizer is a linear combination of the mapped train points:

\hat{w} = \sum\limits_{ i=1 }^n α_i ϕ(x_i)

Representer theorem

f(x) = \sum\limits_{ i } α_i \underbrace{ϕ(x_i)^T ϕ(x) }_{h(x_i, x) \text{ = kernel}}

In some cases $d$ is very large, but we can still compute the kernels

Ex: Order $r$ polynomials in dimension $d$ ⟹ feature space of size $≃ d^r$

Kernel trick

$d$ can even be infinite ($ℝ^d$ becomes a Hilbert space)

If we can compute the kernel, we don’t even need to know $ϕ(x)$

Ex: Gaussian kernel:

h(x, y) = \exp(-α \Vert x-y \Vert^2) = ⟨ϕ(x), ϕ(y)⟩
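
A minimal sketch of the Gaussian kernel and of a kernel predictor $f(x) = \sum_i α_i h(x_i, x)$, which never touches $ϕ$. The names `gaussian_kernel`, `gram_matrix` and `kernel_predict` are illustrative, and the coefficients `alphas` are assumed to be given (how to fit them is not shown here).

```python
import math


def gaussian_kernel(x, y, alpha=1.0):
    """h(x, y) = exp(-alpha * ||x - y||^2) for points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-alpha * sq_dist)


def gram_matrix(points, kernel):
    """n x n matrix of all pairwise kernel values h(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def kernel_predict(alphas, train_x, x, kernel):
    """f(x) = sum_i alpha_i * h(x_i, x)."""
    return sum(a * kernel(x_i, x) for a, x_i in zip(alphas, train_x))


points = [(0.0,), (1.0,), (3.0,)]
K = gram_matrix(points, gaussian_kernel)  # symmetric, with ones on the diagonal
```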

Minimize empirical risk:

\min_{w ∈ ℝ^d}\frac 1 n \sum\limits_{ i } l(y_i, w^T ϕ(x_i)) + \underbrace{\frac λ 2}_{\text{control of complexity}} \Vert w \Vert^2_2

Other controls of complexity:

  • $k$ in $k$-NN
  • degree of polynomials

Convex optimization

  1. make things convex
  2. how to get a solution

Least squares

\frac 1 {2n} \sum\limits_{ i } \vert y_i - w^T ϕ(x_i) \vert^2 + \frac λ 2 \Vert w \Vert^2_2
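
For least squares the minimizer has a formula. As a minimal sketch, assume a single scalar feature $ϕ(x_i) ∈ ℝ$ (a simplification made here for brevity): setting the derivative of the objective to zero gives $w = \sum_i ϕ(x_i) y_i / (\sum_i ϕ(x_i)^2 + n λ)$. The function name `ridge_1d` is illustrative.

```python
def ridge_1d(features, ys, lam):
    """Minimize (1/2n) * sum (y_i - w * phi_i)^2 + (lam/2) * w^2 over a scalar w."""
    n = len(features)
    numerator = sum(phi * y for phi, y in zip(features, ys))
    # Denominator is zero only if lam == 0 and all features are zero
    denominator = sum(phi * phi for phi in features) + n * lam
    return numerator / denominator


features = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2 * phi
w_unregularized = ridge_1d(features, ys, lam=0.0)
w_regularized = ridge_1d(features, ys, lam=1.0)  # shrunk towards 0
```

In dimension $d$ the same computation gives the linear system $(Φ^T Φ + n λ I) w = Φ^T y$, where $Φ$ (notation introduced here) is the matrix whose rows are the $ϕ(x_i)^T$.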

Binary classification

  • $𝒴 = \lbrace -1, 1 \rbrace$
  • $w^T ϕ(x) = f(x) ∈ ℝ ⟹ \text{prediction} = sign(f(x))$

Prediction error if $y f(x) < 0$

\frac 1 n \sum\limits_{ i } 1_{y_i w^T ϕ(x_i) <0} + \frac λ 2 \Vert w \Vert^2_2

Problem: not convex, not even continuous

⟹ we convexify it (ex: SVM, logistic loss, etc…)

So that now we have:

\frac 1 n \sum\limits_{ i } \log(1 + \exp(- y_i w^T ϕ(x_i))) + \frac λ 2 \Vert w \Vert^2_2 ≝ H(w)
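
To compare the 0-1 loss with two convex replacements, each can be written as a function of the margin $y f(x)$. This is a sketch: the hinge loss $\max(0, 1 - y f(x))$ is the usual SVM loss, stated here as an assumption since the lecture only names the SVM.

```python
import math


def zero_one(margin):
    """0-1 loss: error when the margin y * f(x) is negative."""
    return 1.0 if margin < 0 else 0.0


def hinge(margin):
    """Hinge loss (SVM): convex, zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def logistic(margin):
    """Logistic loss: convex and smooth, log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"margin={m:+.1f}  0-1={zero_one(m):.0f}  hinge={hinge(m):.2f}  logistic={logistic(m):.2f}")
```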

Then

Gradient descent

Iterative algorithm:

w_t = w_{t-1} - γ H'(w_{t-1})
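
A minimal sketch of this iteration on the regularized logistic objective $H(w)$, assuming $ϕ(x) = x$ with a constant first coordinate playing the role of the bias. The step size, number of iterations, toy data and function names are all illustrative choices.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, xs, ys, lam):
    """H(w) = (1/n) sum log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2."""
    n = len(xs)
    loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(xs, ys)) / n
    return loss + 0.5 * lam * dot(w, w)


def gradient(w, xs, ys, lam):
    """Gradient of H at w."""
    n = len(xs)
    grad = [lam * w_j for w_j in w]
    for x, y in zip(xs, ys):
        # d/dw log(1 + exp(-y w^T x)) = -y x / (1 + exp(y w^T x))
        coeff = -y / (1.0 + math.exp(y * dot(w, x)))
        for j, x_j in enumerate(x):
            grad[j] += coeff * x_j / n
    return grad


def gradient_descent(xs, ys, lam=0.1, step=0.5, n_iters=200):
    """Iterate w_t = w_{t-1} - step * H'(w_{t-1}), starting from w = 0."""
    w = [0.0] * len(xs[0])
    for _ in range(n_iters):
        g = gradient(w, xs, ys, lam)
        w = [w_j - step * g_j for w_j, g_j in zip(w, g)]
    return w


# Toy data: first coordinate is a constant 1 (bias), label is the sign of the second
xs = [(1.0, 2.0), (1.0, 1.5), (1.0, -1.0), (1.0, -2.5)]
ys = [1, 1, -1, -1]
w_hat = gradient_descent(xs, ys)
```

The step size must be small enough for the objective to decrease; the value used here is tuned to this toy data set only.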

How to get a solution?

⟶ Formula VS iterative algorithms

Take a look at Lagrange classifier too

Probabilistic interpretation

Maximum likelihood

Logistic regression

p(y = 1\mid x) = σ(w^Tx + b)

where σ(z) ≝ \frac 1 {1 + \exp(-z)}
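
With labels in $\lbrace -1, 1 \rbrace$, maximum likelihood for this model amounts to minimizing the logistic loss seen earlier (without the penalty). A short derivation, using $1 - σ(z) = σ(-z)$:

\begin{align*} p(y \mid x) & = σ(y (w^T x + b)) \\ - \frac 1 n \sum\limits_{ i=1 }^n \log p(y_i \mid x_i) & = \frac 1 n \sum\limits_{ i=1 }^n \log(1 + \exp(- y_i (w^T x_i + b))) \\ \end{align*}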

Beyond the class (toward the MVA master at ENS Cachan)

Machine learning

\underbrace{\text{Theory}}_{\text{statistics}} \overbrace{⟶}^{\text{structured output:} 𝒴 \text{ is complex}} \underbrace{\text{Algorithm}}_{\text{ large-scale}} ⟶ \text{Applications}
  • Large-scale: when $n, d ≃ 10^9$

    • ⟹ stochastic gradient descent: the full step w_t = w_{t-1} - γ \left(\frac 1 n \sum\limits_{ i=1 }^n l'(y_i, w_{t-1}^T x_i) \right) is replaced by picking one $i$ at random and computing only $l'(y_i, w_{t-1}^T x_i)$
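
A minimal sketch of stochastic gradient descent, assuming the logistic loss, no penalty and a constant step size (all three are choices made for this example; the function name `sgd_logistic` is illustrative). Each iteration touches a single data point, so its cost does not depend on $n$.

```python
import math
import random


def sgd_logistic(xs, ys, step=0.1, n_iters=2000, seed=0):
    """SGD on the logistic loss: one randomly picked data point per iteration."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for _ in range(n_iters):
        i = rng.randrange(len(xs))  # pick one i at random
        x, y = xs[i], ys[i]
        margin = y * sum(w_j * x_j for w_j, x_j in zip(w, x))
        # Gradient of log(1 + exp(-y w^T x)) at the current w is coeff * x
        coeff = -y / (1.0 + math.exp(margin))
        w = [w_j - step * coeff * x_j for w_j, x_j in zip(w, x)]
    return w


# Toy data: first coordinate is a constant 1 (bias), label is the sign of the second
sgd_xs = [(1.0, 2.0), (1.0, 1.5), (1.0, -1.0), (1.0, -2.5)]
sgd_ys = [1, 1, -1, -1]
w_sgd = sgd_logistic(sgd_xs, sgd_ys)
```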

Neural networks

f(x_k) = \sum\limits_{ i=1 }^d w_i x_k[i] = w^T x_k

⟹ Single neuron

  digraph {
    rankdir=LR;
    b[label="",shape=none];
    "x_1" -> "∑ w_i x_i"[label="w_1"];
    "x_2" -> "∑ w_i x_i"[label="w_2"];
    "⋮" -> "∑ w_i x_i";
    "x_n" -> "∑ w_i x_i"[label="w_n"];
    "∑ w_i x_i" -> b;
  }

For 3 layers

f(x) = σ(w_3^T σ(w_2^T σ(w_1^T x)))
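
A minimal sketch of the forward pass for this formula, assuming $σ$ is the logistic sigmoid and that $w_1$, $w_2$ are matrices stored as lists of rows (one row per neuron) while $w_3$ is a single vector. The shapes and names are assumptions made for the example, and no training is shown.

```python
import math


def sigmoid(z):
    """Logistic sigmoid (no protection against overflow for very negative z)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, x):
    """One layer: each row of `weights` is one neuron computing sigmoid(w^T x)."""
    return [sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x))) for row in weights]


def three_layer_network(w1, w2, w3, x):
    """f(x) = sigmoid(w3^T sigmoid(w2^T sigmoid(w1^T x)))."""
    h1 = layer(w1, x)          # first hidden layer
    h2 = layer(w2, h1)         # second hidden layer
    return layer([w3], h2)[0]  # single output neuron


# 2 inputs -> 3 hidden units -> 2 hidden units -> 1 output
w1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
w2 = [[1.0, -1.0, 0.5], [0.2, 0.2, 0.2]]
w3 = [0.7, -0.3]
output = three_layer_network(w1, w2, w3, [1.0, 2.0])
```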
  • Computer vision
  • Natural language
  • Speech
  • Bio-informatics

also Reinforcement learning

Unsupervised learning

  • PCA
  • $k$-means
