Lecture 2: Regression

Linear regression and logistic regression: particular cases of empirical risk minimization

I. Linear regression

Ex:

  • EDF: understand the relation between electricity consumption and the weather (the colder it is, the more electricity is consumed)
\[Y ≝ (Y_1, \ldots, Y_n)^T ∈ ℝ^n\] \[X ≝ (X_1, \ldots, X_n)^T ∈ ℝ^{n × p}\]

Goal: find

\[{\rm argmin}_{f ∈ F} \widehat{R}_n(f) = \frac 1 n \sum\limits_{ i =1 }^n (y_i - f(x_i))^2\]

Beware over-fitting!

⟹ $f$ affine function: $f: x ⟼ ax +b$

Linear model: $Y = X β + ε$ where the noise $ε ≝ (ε_1, \ldots, ε_n)^T ∈ ℝ^n$ has i.i.d. components $(ε_i)$ with:

  • $𝔼(ε_j) = 0$
  • $Var(ε) ≝ (Cov(ε_i, ε_j))_{i,j} = σ^2 I_n$

Assumption: $X ∈ ℝ^{n×p}$ is injective: $rank(X) = p$

($Im(X)$ is a vector subspace of $ℝ^n$ of dimension $p ≤ n$)

Ordinary least square estimator:
\[\widehat{β}∈ {\rm argmin}_β \widehat{R}_n(β) = {\rm argmin}_{β∈ℝ^p} \underbrace{\frac 1 n \sum\limits_{i=1 }^n (y_i - x_i^T β)^2}_{ = \frac 1 n \Vert Y - X β\Vert_2^2}\]

Proposition:

\[\widehat{β} = (X^T X)^{-1} X^T Y\]

Proof: $\widehat{β}$ is a minimizer of $\widehat{R}_n$ ⟹ it must cancel its gradient:

\[\nabla \widehat{R}_n(β) = 0\] \[\begin{align*} \nabla \widehat{R}_n(β) & = \frac 2 n \big((X^T X)β - X^T Y\big) \\ \underbrace{(X^T X)}_{\text{if it is invertible}} \widehat{β} & = X^T Y \\ \widehat{β} & = (X^T X)^{-1} X^T Y \end{align*}\]

So $\widehat{β}$ is a critical point.

But as $\frac n 2 \nabla^2 \widehat{R}_n(β) = X^T X$ is a positive-definite matrix (since $rank(X) = p$), $\widehat{β}$ is a minimum, and it is unique.


We have found the best approximation of $Y ∈ ℝ^n$ by a vector of the form $Xβ$:

$X \widehat{β} = \underbrace{P_{Im(X)}(Y)}_{\text{orthogonal projection of } Y \text{ on } Im(X)}$

Moreover:

\[P_{Im(X)}(Y) = X \widehat{β} \\ = \underbrace{X(X^T X)^{-1} X^T}_{\text{orthogonal projection matrix on } Im(X)} Y\]
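As a quick illustration, here is a minimal NumPy sketch on simulated data (all names and numbers are hypothetical) that computes $\widehat{β}$ by solving the normal equations and checks that $X\widehat{β}$ is the orthogonal projection of $Y$ onto $Im(X)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))               # design matrix, assumed to have rank p
beta = np.array([1.0, -2.0, 0.5])         # "true" coefficients (simulated)
Y = X @ beta + 0.3 * rng.normal(size=n)   # linear model Y = X beta + eps

# OLS: solve the normal equations (X^T X) beta_hat = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# X beta_hat coincides with the orthogonal projection of Y onto Im(X)
P = X @ np.linalg.solve(X.T @ X, X.T)     # projection matrix X (X^T X)^{-1} X^T
assert np.allclose(P @ Y, X @ beta_hat)
```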

Is $\widehat{β}$ good?

To simplify calculations, we assume that $X$ is deterministic.

\[𝔼(\widehat{β}) = 𝔼((X^T X)^{-1} X^T Y) = (X^T X)^{-1} X^T 𝔼(Y) \\ = (X^T X)^{-1} X^T(X β + \underbrace{𝔼(ε)}_{ =0}) = β\]

So $\widehat{β}$ is an unbiased estimator of $β$.

\[Var(\widehat{β}) = Var((X^T X)^{-1} X^T Y) = (X^T X)^{-1} X^T \underbrace{Var(Y)}_{ ≝ 𝔼((Y-𝔼(Y)) (Y-𝔼(Y))^T) = Var(ε) = σ^2 Id} X (X^T X)^{-1} \\ = σ^2 (X^T X)^{-1}\]
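A small Monte-Carlo sketch (hypothetical simulation, with the design $X$ drawn once and then kept fixed) to check these two identities numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 2, 0.5
X = rng.normal(size=(n, p))        # deterministic design: drawn once, then fixed
beta = np.array([2.0, -1.0])

# redraw the noise many times and recompute beta_hat each time
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=n)))
    for _ in range(20000)
])

print(estimates.mean(axis=0))            # ≈ beta: unbiasedness
print(np.cov(estimates.T))               # ≈ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```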

What is the prediction risk of $\widehat{β}$ ?

  • $Y_{n+1} ∈ ℝ$ (a new observation, $Y_{n+1} = X_{n+1}^T β + ε_{n+1}$)
  • $X_{n+1} ∈ ℝ^p$
\[\begin{align*} R(\widehat{β}) & = 𝔼((Y_{n+1} - X_{n+1}^T \widehat{β})^2) \\ & = 𝔼((Y_{n+1} - X_{n+1}^T β + X_{n+1}^T β - X_{n+1}^T \widehat{β})^2) \\ & = 𝔼\Big((Y_{n+1} - X_{n+1}^T β)^2 + (X_{n+1}^T (β - \widehat{β}))^2 + 2(Y_{n+1} - X_{n+1}^T β)\, X_{n+1}^T (β - \widehat{β})\Big) \end{align*}\]

But

\[𝔼\Big(\underbrace{(Y_{n+1} - X_{n+1}^T β)}_{ = ε_{n+1}} \, X_{n+1}^T (β - \widehat{β})\Big) \overset{\text{independence}}{=} \underbrace{𝔼(ε_{n+1})}_{= 0} \; 𝔼(X_{n+1}^T (β - \widehat{β})) = 0\]

Independence:

  • $ε_1, \ldots, ε_n$ and $ε_{n+1}$ are independent

  • $Xβ$ and $X_{n+1}$ are deterministic

  • $\widehat{β} ≝ (X^T X)^{-1} X^T Y$ is independent of $ε_{n+1}$ (only depends on $ε_1, \ldots, ε_n$)

So

\[\begin{align*} R(\widehat{β}) & = 𝔼(ε_{n+1}^2) + 𝔼((X_{n+1}^T (β - \widehat{β}))^2) \\ &= Var(ε_{n+1}) + X_{n+1}^T \underbrace{𝔼((β - \widehat{β})(β - \widehat{β})^T)}_{= 𝔼((𝔼(\widehat{β}) - \widehat{β})(𝔼(\widehat{β}) - \widehat{β})^T) = Var(\widehat{β})} X_{n+1} \\ &= σ^2(1 + X_{n+1}^T(X^T X)^{-1} X_{n+1}) \end{align*}\]

As

  • $𝔼(\widehat{β}) = β$
  • $Var(\widehat{β}) = σ^2 (X^T X)^{-1}$

Th (Gauss-Markov): $\widehat{β}$ is optimal in the sense that its variance is minimal among all linear unbiased estimators.

Can we estimate $σ^2$ ?

\[σ^2 = Var(ε_{n+1}) = Var(Y_{n+1}) = 𝔼((Y_{n+1} - 𝔼(Y_{n+1}))^2) \\ = 𝔼((Y_{n+1} - X_{n+1}^T β)^2)\] \[\widehat{σ}^2 ≝ \frac 1 n \sum\limits_{ i=1 }^n (y_i - x_i^T \widehat{β})^2 = \frac{\Vert Y - X\widehat{β} \Vert_2^2}{n}\] \[n 𝔼(\widehat{σ}^2) = 𝔼(\Vert Y - X \widehat{β}\Vert_2^2) = 𝔼(Tr((Y - X \widehat{β})^T (Y - X \widehat{β}))) \\ = 𝔼(Tr((Y - X \widehat{β}) (Y - X \widehat{β})^T))\]

But $X \widehat{β} = P_{Im(X)}(Y)$, so:

\[n 𝔼(\widehat{σ}^2) = 𝔼(Tr((Y - P_{Im(X)}(Y)) (Y - P_{Im(X)}(Y))^T)) \\ = 𝔼(Tr(\underbrace{P_{Im(X)^\bot}Y}_{= P_{Im(X)^\bot}(Y - Xβ)} (P_{Im(X)^\bot}(Y - Xβ))^T)) \\ = 𝔼(Tr(P_{Im(X)^\bot}(Y - Xβ) (Y - Xβ)^T P_{Im(X)^\bot}^T)) \\ = Tr(𝔼(P_{Im(X)^\bot}(Y - Xβ) (Y - Xβ)^T P_{Im(X)^\bot}^T)) \\ = Tr(P_{Im(X)^\bot} \underbrace{𝔼((Y - Xβ) (Y - Xβ)^T)}_{ = Var(Y) = Var(ε) = σ^2 Id} \underbrace{P_{Im(X)^\bot}^T}_{ = P_{Im(X)^\bot}}) \\ = σ^2 Tr(P_{Im(X)^\bot}) \\ = σ^2 (n-p)\]

So this estimator is biased:

\[𝔼(\widehat{σ}^2) = \frac{n-p}{n} σ^2\]

So we define the following unbiased estimator :

\[\widehat{σ}_{new}^2 = \frac{\Vert Y - X \widehat{β} \Vert_2^2}{n - p}\]
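In code this is a one-liner; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def unbiased_noise_variance(X, Y, beta_hat):
    """sigma^2_hat = ||Y - X beta_hat||_2^2 / (n - p)."""
    n, p = X.shape
    return np.sum((Y - X @ beta_hat) ** 2) / (n - p)
```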

The case of Gaussian noise

$ε_i \sim 𝒩(0, σ^2)$

NB: this assumption is often legitimate: by the central limit theorem, a noise obtained as the sum of many small independent contributions is approximately Gaussian

The maximum likelihood estimators of $β$ and $σ^2$ are $\widehat{β} = (X^T X)^{-1} X^T Y$ (the ordinary least squares estimator) and $\frac{\Vert Y - X \widehat{β} \Vert_2^2}{n}$ for $σ^2$ (which is biased: it divides by $n$ and not by $n-p$)

What if the relationship is not linear?

$X = [1, Temperature, Temperature^2, Temperature^3]$

We can fit any polynomial by adding transformations of the coordinates to $X$:

\[Y = a 1 + b X + c X^2 + \ldots\]

⟶ Spline regression
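A possible sketch of this coordinate transformation, with a hypothetical temperature/consumption dataset:

```python
import numpy as np

rng = np.random.default_rng(2)
temp = np.linspace(-5, 30, 200)          # hypothetical temperatures
consumption = 50 - 2.0 * temp + 0.05 * temp**2 + rng.normal(scale=1.0, size=temp.size)

# enrich the design with powers of the temperature: X = [1, T, T^2, T^3]
X = np.column_stack([np.ones_like(temp), temp, temp**2, temp**3])

# the model is still linear in beta, so OLS applies unchanged
beta_hat = np.linalg.solve(X.T @ X, X.T @ consumption)
```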

What if $X^T X$ is not invertible?

\[\widehat{β} ∈ {\rm argmin}_β \Vert Y - X β\Vert_2^2\]

But in practice, we use instead:

Regularisation:
\[\widehat{β} ∈ {\rm argmin}_β \underbrace{\Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_2^2}_{\text{Ridge}}\]

For each $λ$, there exists a $δ$ s.t.

\[{\rm argmin}_{\Vert β \Vert_2^2 ≤ δ} \Vert Y - X β\Vert_2^2 = {\rm argmin}_{β ∈ ℝ^p} \Big\lbrace \Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_2^2 \Big\rbrace\]
The Ridge estimator has an explicit solution:
\[\widehat{β}_{Ridge} = (X^T X + λ I_p)^{-1} X^T Y\]
Lasso:
\[\widehat{β} ∈ {\rm argmin}_β \Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_1\]

The choice of $λ$ is important
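A minimal sketch of the Ridge closed form above (the value of $λ$ passed here is arbitrary; in practice it is typically chosen by cross-validation):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator (X^T X + lam * I_p)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

Note that $X^T X + λ I_p$ is invertible for every $λ > 0$, which is precisely what makes Ridge usable when $X^T X$ is not.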

QR decomposition

$\widehat{β} = (X^T X)^{-1} X^T Y$

⟶ QR decomposition: $X = Q R$ ($Q$ orthogonal, $R$ upper triangular)

so that:

\[R \widehat{β} = Q^T Y\]
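A sketch of the QR-based solve (avoiding the explicit formation of $X^T X$, which is better conditioned numerically), using SciPy's triangular solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, Y):
    """Least squares via X = QR, then R beta = Q^T Y by back-substitution."""
    Q, R = np.linalg.qr(X)    # reduced QR: Q is n x p, R is p x p upper triangular
    return solve_triangular(R, Q.T @ Y)
```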

Gradient Descent

  • $β_0 = 0$

  • \[β_{i+1} = β_i - η \nabla \widehat{R}_n (β_i)\]

If $η$ is chosen wisely ⟹ the iterates converge to $\widehat{β}$

Stochastic gradient descent: used when you don’t want to compute the full gradient at each step.
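A sketch of (full) gradient descent on the least-squares empirical risk; the step size $η$ and the number of iterations are arbitrary choices here. For SGD, the full gradient would be replaced by the gradient of one (or a few) randomly chosen term(s) of the sum.

```python
import numpy as np

def gd_least_squares(X, Y, eta=0.01, n_iter=1000):
    """Gradient descent on R_n(beta) = ||Y - X beta||_2^2 / n."""
    n, p = X.shape
    beta = np.zeros(p)                             # beta_0 = 0
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ beta - Y)    # gradient of the empirical risk
        beta = beta - eta * grad
    return beta
```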

II. Logistic regression

  • Binary Classification: where $Y ∈ \lbrace 0, 1 \rbrace^n$

⟶ the square loss is designed for real-valued outputs ⟹ not well suited here

\[\widehat{R}_n(f) = \frac 1 n \sum\limits_{ i =1 }^n 1_{2 Y_i - 1 ≠ sign(f(X_i))}\]

Issue: this risk is not convex, so the minimization problem is far too hard (NP-hard in general)!

The 0-1 loss (a Heaviside-type step function, not convex) is replaced by the logistic loss (which is convex)

Logistic loss:
\[l(f(X), Y) = Y \log(1 + {\rm e}^{-f(X)}) + (1-Y) \log(1 +{\rm e}^{f(X)})\]
\[\widehat{β}_{logistic} = {\rm argmin}_β \Big\lbrace \frac 1 n \sum\limits_{ i = 1 }^n l(X_i^T β, Y_i) \Big\rbrace\]

Then we predict $Y_{n+1} = 1$ if $X_{n+1}^T \widehat{β} ≥ 0$ and $Y_{n+1} = 0$ if $X_{n+1}^T \widehat{β} < 0$
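A sketch of logistic regression fitted by gradient descent on this loss (step size and iteration count are hypothetical choices); the gradient of the average logistic loss is $\frac 1 n X^T(σ(Xβ) - Y)$, where $σ$ is the sigmoid function:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_gd(X, Y, eta=0.1, n_iter=5000):
    """Gradient descent on the average logistic loss (labels Y in {0, 1})."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - Y) / n   # gradient of the logistic risk
        beta = beta - eta * grad
    return beta

# prediction rule: 1 if x^T beta_hat >= 0, else 0
# y_pred = (X_new @ beta_hat >= 0).astype(int)
```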

Same trick as before if the data cannot be separated by a linear function ⟶ transformation of the coordinates in $X$

Nice probabilistic interpretation of logistic regression:

\[\frac{ℙ(Y_{n+1} = 1 \mid X_{n+1})}{ℙ(Y_{n+1} = 0 \mid X_{n+1})} \overset{\text{Bayes}}{\leadsto} \frac{ℙ(X_{n+1} \mid Y_{n+1} = 1)}{ℙ(X_{n+1} \mid Y_{n+1}=0)}\]

this ratio (the odds) tells how many times more likely one category is than the other

So we want the log to be linear:

\[\log \frac{ℙ(X_{n+1} \mid Y_{n+1} = 1)}{ℙ(X_{n+1} \mid Y_{n+1}=0)} = X_{n+1}^T β\]

With logistic regression, we cannot compute $\widehat{β}$ in closed form ⟶ the only way to solve the problem is gradient descent (or the Newton-Raphson method).
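For completeness, a minimal sketch of the Newton-Raphson update for logistic regression (the iteratively reweighted least squares form); an illustration only, not a robust implementation:

```python
import numpy as np

def logistic_regression_newton(X, Y, n_iter=25):
    """Newton-Raphson (IRLS) for logistic regression, labels Y in {0, 1}."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))   # sigmoid(X beta)
        W = prob * (1.0 - prob)                  # diagonal weights
        hessian = X.T @ (W[:, None] * X)         # X^T diag(W) X
        grad = X.T @ (prob - Y)                  # gradient of the (summed) logistic loss
        beta = beta - np.linalg.solve(hessian, grad)
    return beta
```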
