Lecture 1: Supervised Machine Learning


ML examples:

  • image classification: given labeled sample images, the learning algorithm outputs a program that predicts which category a new image falls into
  • linear regression (e.g. fitting a polynomial of order $n$)

    • order $n+1$ fits better than order $n$, up to some point.
    • But beware of over-fitting: if $n$ is greater than the number of data points, we get a polynomial which goes through every point, but it cannot “wisely” predict new points (its derivative becomes too large in places)

⟶ Tradeoff between training error and model complexity
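The tradeoff above can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the target function `sin(2πx)`, the noise level, the sample sizes and the random seed are arbitrary choices, not taken from the lecture. It fits polynomials of increasing order to 10 noisy points and reports the training and test mean squared errors.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger test set from the same distribution.
n_train = 10
x_train = rng.uniform(0, 1, n_train)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
x_test = rng.uniform(0, 1, 1000)
y_test = target(x_test) + 0.1 * rng.standard_normal(1000)

def mse(poly, x, y):
    # Mean squared error of the fitted polynomial on the pairs (x, y).
    return float(np.mean((poly(x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # Least-squares fit of a polynomial of the given order.
    poly = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = (mse(poly, x_train, y_train), mse(poly, x_test, y_test))
    print(f"order {degree}: train={errors[degree][0]:.2e}, test={errors[degree][1]:.2e}")
```

The training error can only decrease as the order grows (order 9 interpolates the 10 points), whereas the test error typically starts increasing again once the order is too high for the amount of data.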

Warning: No Free Lunch Theorem:

for any algorithm, there always exists a data set on which it does no better than a random algorithm.

Curse of dimensionality:

to recognize chairs for instance, the number of possible configurations is exponential in the number of parameters (leg length, shape of back, color, etc…)!

In high dimension, closeness of points means nothing when it comes to relevance.
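A rough way to see this numerically is to compare the nearest and farthest neighbors of a query point as the dimension grows. The sketch below assumes points drawn uniformly in the unit cube $[0,1]^d$ and the Euclidean distance; the function name `relative_contrast`, the sample size and the seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n_points=500):
    # Sample points uniformly in the unit cube [0, 1]^d, plus one query point.
    points = rng.uniform(0, 1, size=(n_points, d))
    query = rng.uniform(0, 1, size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # (farthest - nearest) / nearest: a value close to 0 means that
    # all points look almost equally far from the query.
    return float((dists.max() - dists.min()) / dists.min())

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d}: relative contrast = {c:.3f}")
```

Under these assumptions the contrast shrinks as $d$ grows: in high dimension the nearest point is barely closer than the farthest one, so "being close" carries little information.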

Supervised Machine Learning

  • Input: \underbrace{x}_{\text{image, web page, etc...}} ∈ \underbrace{𝒳}_{ℝ^d \text{ usually, but not necessarily}}

    • $x$ is called the feature/covariate vector
  • Output: y ∈ \underbrace{𝒴}_{\lbrace 0, 1 \rbrace, \lbrace -1, 1 \rbrace, ℝ, ⟦1, k⟧ \text{(multiclass)} }

    • + all the rest: the output could be of the same type as the input (ex: in Natural Language Processing, etc…)
  • Goal: find a function $f: 𝒳 ⟶ 𝒴$

Warning: In most situations, we don’t have:

∃!f; ∀x, y ∈ \text{data}; y = f(x)
  1. In most cases, there is some randomness in determining $y$ (given an $x$, several values of $y$ are possible, each with a certain probability)
  2. $x$ can be random itself (ex: we only observe a tiny portion of the input data)


Assumptions:

  1. test distribution: $∃ \text{ a distribution on } 𝒳 × 𝒴$, from which the test data are drawn
  2. training data $(x_i, y_i)_{1≤ i ≤n}$ are Independent and Identically Distributed (iid), drawn from this same distribution

⟶ there has to be a link between unseen/test data and training data: they must come from the same distribution

You learn on the training data, and you test on the test data.

But: in reality, the iid assumption does not quite hold:

  1. no independence: a picture of me in front of Notre-Dame and a picture taken 2 minutes later are not independent
  2. not identically distributed: the distribution of real data may change from one day to the next

Learning algorithm

\underbrace{(𝒳 × 𝒴)^n}_{\text{data}} \leadsto \widehat{f}: 𝒳 ⟶ 𝒴

NB: $\widehat{f}$ depends on the data: $\widehat{f}_{(x_1, y_1, \ldots, x_n, y_n)}$

Loss function $l(y, \widehat{y})$:
l: 𝒴 × 𝒴 ⟶ ℝ \text{ or } ℝ^+
  • $y$: observation
  • $\widehat{y}$: prediction

Example: If $y ∈ \lbrace 0,1 \rbrace$:

0-1 loss:

$l(y, \widehat{y}) = 1_{y ≠ \widehat{y}}$

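But usually, instead of $1_{y ≠ \widehat{y}}$, the two kinds of errors are given different costs $α$ and $β$, with zero cost for correct predictions:

\begin{pmatrix} 0 & α \\ β & 0 \end{pmatrix}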

Ex: for spam filters, we really don’t want to lose an email, but it’s not a big deal if a spam mail slips through

Regression: $𝒴 = ℝ$

Square loss:
l(y, \widehat{y}) = (y - \widehat{y})^2

Risk of a function $f: 𝒳 ⟶ 𝒴$:
R(f) = 𝔼_{(x,y)}(l(y, f(x)))

It is not random: for a fixed $f$, it is a deterministic real number.

Notations: lowercase for random variables

The test data is fixed, the training data is random:

Risk of the predictor:
R(\widehat{f}_{(x_1, y_1, \ldots, x_n, y_n)}) = 𝔼_{(x,y)}(l(y, \widehat{f}_{(x_1, y_1, \ldots, x_n, y_n)}(x)))

This one is random, since $\widehat{f}$ depends on the (random) training data.

Empirical risk/ training error:
\widehat{R}(f) = \frac 1 n \sum\limits_{ i =1}^n l(y_i, f(x_i))
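As a small sketch of this definition, the empirical risk can be computed for both losses introduced so far. The toy data and the two predictors (`threshold`, `identity`) are hypothetical, chosen only so that the averages are easy to check by hand.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 1 where the prediction differs from the observation, 0 elsewhere.
    return (y != y_hat).astype(float)

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(loss, f, xs, ys):
    # Average loss of the predictor f over the n observed pairs (x_i, y_i).
    return float(np.mean(loss(ys, f(xs))))

# Toy data (hypothetical, for illustration only).
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys_class = np.array([0, 0, 1, 1])
ys_reg = np.array([0.0, 1.0, 2.0, 4.0])

def threshold(x):
    # Classifier predicting 1 when x > 0.5: outputs 0, 1, 1, 1 on xs.
    return (x > 0.5).astype(int)

def identity(x):
    # Regression predictor f(x) = x.
    return x

risk_01 = empirical_risk(zero_one_loss, threshold, xs, ys_class)  # 1 error out of 4
risk_sq = empirical_risk(square_loss, identity, xs, ys_reg)       # (0 + 0 + 0 + 1) / 4
print(risk_01, risk_sq)
```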

Goal: try to minimize $R(f)$ over all measurable functions from $𝒳 ⟶ 𝒴$

Bayes predictor:

any minimizer of $R(f)$, for $f: 𝒳 ⟶ 𝒴$

Square loss:
𝔼(l(y, f(x))) = 𝔼_{(x,y)}((y - f(x))^2) = 𝔼_{x} \underbrace{𝔼_{y | x}((y - f(x))^2)}_{\text{ function of } x}

NB: $x$ and $y$ are not independent within a single pair; it is the pairs $(x_i, y_i)$ that are independent of one another.

(Baby) Lemma: for $Z$ a random variable in $ℝ$:

𝔼 (Z) = {\rm argmin}_{a ∈ ℝ} 𝔼 \vert Z-a \vert^2

Proof: just expand $𝔼 \vert Z-a \vert^2 = 𝔼 \vert Z - 𝔼(Z) \vert^2 + (𝔼(Z) - a)^2$: the first term does not depend on $a$, so the sum is minimized by $a = 𝔼 (Z)$

Same result in 2D.

Given $x$, the minimizer of $𝔼_{y \mid x} \vert y - a \vert^2$ is $a ≝ 𝔼(y \mid x)$

Rigorously: f^\ast (x) = 𝔼(Y \mid X = x)

Why is it not so easy?

data $(x_i, y_i)$ will have different values of $x$ ⟶ for a given $x$, it is unlikely that we have that exact $x$ in our data set, so $𝔼(y \mid x)$ cannot be estimated by simply averaging the observed $y$ at that $x$.


0-1 loss:

𝔼 l(y, f(x)) = 𝔼 1_{y ≠ f(x)} = 𝔼_x \left( 𝔼 (1_{y ≠ f(x)} \mid x) \right)

Small-Lemma: If $Z$ is a rv in $\lbrace 0, 1 \rbrace$, with $η ≝ P(Z=1)$, then for $a ∈ \lbrace 0, 1 \rbrace$:

𝔼_Z 1_{Z ≠ a} = P(Z=1) × (1 - a) + P(Z=0) × a = η + a(1 -2η)

so that

{\rm argmin}_{a ∈ \lbrace 0, 1 \rbrace} 𝔼_Z 1_{Z ≠ a} = \begin{cases} 1 \text{ if } η > 1/2 \\ 0 \text{ if } η < 1/2 \\ 0 \text{ or } 1 \text{ else} \end{cases}

Applying this to $y$ given $x$:

f^\ast(x) = \begin{cases} 1 \text{ if } P(Y=1 \mid x) > 1/2 \\ 0 \text{ else } \end{cases}

Bayes risk (the risk of the Bayes predictor):

R^\ast = \inf_f R(f) = R(f^\ast) ≠ 0 \text{ in general}

You cannot go below $R(f^\ast)$: we know it exists, but we cannot know its value, since the distribution is unknown.
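To make the Bayes predictor and the Bayes risk concrete, here is a simulation on a toy model where the distribution is known, which is exactly what we do not have in practice. The model is an assumption made for illustration: $x$ uniform on $[0,1]$ and $P(Y=1 \mid x) = x$. For the 0-1 loss, the risk of $f^\ast$ is then $𝔼 \min(η(x), 1 - η(x)) = 1/4$ with $η(x) = P(Y=1 \mid x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Assumed conditional probability P(Y = 1 | X = x); here simply eta(x) = x.
    return x

def bayes_predictor(x):
    # Predict 1 exactly when P(Y = 1 | x) > 1/2.
    return (eta(x) > 0.5).astype(int)

# Draw (x, y) pairs from the toy distribution.
n = 200_000
x = rng.uniform(0, 1, n)
y = (rng.uniform(0, 1, n) < eta(x)).astype(int)

# Monte Carlo estimate of the 0-1 risk of the Bayes predictor (exact value: 1/4).
bayes_risk_estimate = float(np.mean(y != bayes_predictor(x)))

# For comparison: the constant predictor "always 1" (exact risk: 1/2).
constant_risk_estimate = float(np.mean(y != 1))

print(bayes_risk_estimate, constant_risk_estimate)
```

Even the best possible predictor makes an error about a quarter of the time here, because $y$ is genuinely random given $x$.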

Excess risk:
R(f) - R^\ast ≥ 0

You want it to tend to zero with $f = \widehat{f}_{(x_1, y_1, \ldots, x_n, y_n)}$ when $n$ tends to $∞$.

Local averaging (kNN : $k$ nearest neighbors / kernel regression)
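A minimal sketch of local averaging with $k$ nearest neighbors, for regression with one-dimensional inputs: the prediction at $x$ is the average of the $y_i$ whose $x_i$ are the $k$ closest to $x$, as a stand-in for $𝔼(y \mid x)$. The function name `knn_predict`, the toy model $y = x^2 + \text{noise}$ and the value $k = 50$ are assumptions for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    # Distances from each query point to each training point (1-D inputs).
    dists = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points, for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Local average of the corresponding outputs.
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 2000)
# Toy model (assumed): E(y | x) = x^2, plus Gaussian noise.
y_train = x_train ** 2 + 0.1 * rng.standard_normal(2000)

x_query = np.array([0.25, 0.5, 0.75])
predictions = knn_predict(x_train, y_train, x_query, k=50)
print(predictions, x_query ** 2)
```

With $k = 1$ the predictor reproduces the training outputs exactly (zero training error), while a larger $k$ averages out the noise at the cost of some bias.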

Empirical risk / training error:

\widehat{R}(f) = \frac 1 n \sum\limits_{ i =1}^n l(y_i, f(x_i))
Empirical Risk Minimization (ERM):
\widehat{f} ∈ {\rm argmin}_{f ∈ F} \widehat{R}(f)

where $F$ is a subset of all functions from $𝒳 ⟶ 𝒴$
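As an example of ERM, take $F$ to be the linear functions $x ↦ ⟨w, x⟩$ and $l$ the square loss: minimizing the empirical risk is then ordinary least squares. The sketch below uses synthetic data; the vector `w_true`, the noise level and the sizes `n`, `d` are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)  # noisy linear observations

def empirical_risk(w):
    # Training error of x -> <w, x> under the square loss.
    return float(np.mean((y - X @ w) ** 2))

# ERM over F = {x -> <w, x>} with the square loss is ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

risk_hat = empirical_risk(w_hat)
risk_true = empirical_risk(w_true)
print(w_hat, risk_hat, risk_true)
```

By construction `w_hat` has the smallest training error in $F$, so `risk_hat` is at most `risk_true`, even though `w_true` generated the data.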

Risk decomposition: for any $g ∈ F$,

R(\widehat{f}) - R^\ast = \underbrace{R(\widehat{f}) - R(g)}_{\text{estimation error}} + \underbrace{R(g) - R^\ast}_{\text{approximation error: deterministic and } ≥ 0}

then, take $g ≝ {\rm argmin}_{g ∈ F} R(g)$, so that the estimation error is $≥ 0$ as well (since $\widehat{f} ∈ F$).

The approximation error is independent of $n$, and is the usual source of underfitting: as you restrict yourself to $F$, you’re bound to make an error.

  • Approximation error decreases with the size of $F$
  • Estimation error increases with the size of $F$
  • The test error is a tradeoff: the excess risk is the sum of the two

NB: estimation error gets smaller when $n$ gets bigger. But when $n$ gets bigger, we usually take a bigger class of functions $F$.

Three questions to address:

  1. Which class $F$ of functions from $𝒳$ to $𝒴$? (exs: linear functions, neural networks, kernel methods, etc…)

  2. Algorithm to minimize the empirical risk over $F$ (optimisation)

  3. Analysis

  • $R(f) = 𝔼 l(y, f(x))$: the loss function $l$ is given by the user
  • the distribution of $y$ and $x$ is not our choice
  • but we choose $F$, so that we have a good approximation error, and can efficiently run algorithms
