Lecture 2: Regression
Linear regression and logistic regression: particular cases of empirical risk minimization
I. Linear regression
Ex:
- EDF: understand the relation between electricity consumption and the weather (the colder it is, the more electricity is consumed)
Goal: find
\[{\rm argmin}_{f ∈ F} \widehat{R}_n(f) = \frac 1 n \sum\limits_{ i =1 }^n (y_i - f(x_i))^2\]Beware over-fitting!
⟹ $f$ affine function: $f: x ⟼ ax +b$
Linear model: $Y = X β + ε$ where the noise $ε ≝ (ε_1, \ldots, ε_n) ∈ ℝ^n$, $(ε_i)$ i.i.d.
- $𝔼(ε_i) = 0$
- $Var(ε) ≝ (Cov(ε_i, ε_j))_{i,j} = σ^2 I_n$
Assumption: $X ∈ ℝ^{n×p}$ is injective: $rank(X) = p$
($Im(X)$ is a vector subspace of $ℝ^n$ of dimension $p ≤ n$)
- Ordinary least squares estimator:
- \[\widehat{β}∈ {\rm argmin}_β \widehat{R}_n(β) = {\rm argmin}_{β∈ℝ^p} \underbrace{\frac 1 n \sum\limits_{i=1 }^n (y_i - x_i^T β)^2}_{ = \frac 1 n \Vert Y - X β\Vert_2^2}\]
Proposition:
\[\widehat{β} = (X^T X)^{-1} X^T Y\]
Proof: if $\widehat{β}$ is a minimum of $\widehat{R}_n$, then the gradient must vanish there:
\[\nabla \widehat{R}_n(\widehat{β}) = 0\] \[\begin{align*} \nabla \widehat{R}_n(β) & = \frac 2 n \left((X^T X)β - X^T Y\right) \\ \underbrace{(X^T X)}_{\text{if it is invertible}} \widehat{β} & = X^T Y \\ \widehat{β} & = (X^T X)^{-1} X^T Y \end{align*}\]So $\widehat{β}$ is a critical point.
And since $\frac n 2 \nabla^2 \widehat{R}_n(β) = X^T X$ is a positive-definite matrix (because $rank(X) = p$), $\widehat{β}$ is a minimum, and it is unique.
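As a quick sanity check, here is a minimal NumPy sketch (simulated data; all names and sizes are made up for the illustration) that computes $\widehat{β}$ with this closed form and compares it with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.5            # hypothetical sample size, dimension, noise level
X = rng.normal(size=(n, p))          # design matrix, full column rank with probability 1
beta = np.array([1.0, -2.0, 0.5])    # "true" parameter, chosen arbitrarily
Y = X @ beta + sigma * rng.normal(size=n)

# Closed-form OLS estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Same solution via NumPy's built-in least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```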
We have found the best approximation of $Y ∈ ℝ^n$ by an element $X\widehat{β}$ of $Im(X)$:
$X \widehat{β} = \underbrace{P_{Im(X)}(Y)}_{\text{orthogonal projection of } Y \text{ on } Im(X)}$
Moreover:
\[P_{Im(X)}(Y) = X \widehat{β} \\ = \underbrace{X(X^T X)^{-1} X^T}_{\text{orthogonal projection matrix on } Im(X)} Y\]Is $\widehat{β}$ good?:
To simplify calculations, we assume that $X$ is deterministic.
\[𝔼(\widehat{β}) = 𝔼((X^T X)^{-1} X^T Y) = (X^T X)^{-1} X^T 𝔼(Y) \\ = (X^T X)^{-1} X^T(X β + \underbrace{𝔼(ε)}_{ =0}) = β\]So $\widehat{β}$ is an unbiased estimator of $β$.
\[Var(\widehat{β}) = Var((X^T X)^{-1} X^T Y) = (X^T X)^{-1} X^T \underbrace{Var(Y)}_{ ≝ 𝔼((Y-𝔼(Y)) (Y-𝔼(Y))^T) = Var(ε) = σ^2 Id} X (X^T X)^{-1} \\ = σ^2 (X^T X)^{-1}\]What is the prediction risk of $\widehat{β}$ ?
- $X_{n+1} ∈ ℝ^p$
- $Y_{n+1} = X_{n+1}^T β + ε_{n+1} ∈ ℝ$
The prediction risk is $R(\widehat{β}) = 𝔼((Y_{n+1} - X_{n+1}^T \widehat{β})^2)$. Writing $Y_{n+1} - X_{n+1}^T \widehat{β} = ε_{n+1} + X_{n+1}^T (β - \widehat{β})$ and expanding the square, the cross term vanishes:
\[𝔼\Big(\underbrace{(Y_{n+1} - X_{n+1}^T β)}_{ = ε_{n+1}} \, X_{n+1}^T (β - \widehat{β})\Big) \overset{\text{independence}}{=} \underbrace{𝔼(ε_{n+1})}_{= 0} \, 𝔼(X_{n+1}^T (β - \widehat{β})) = 0\]Independence:
- $ε_1, \ldots, ε_n$ and $ε_{n+1}$ are independent
- $Xβ$ and $X_{n+1}$ are deterministic
- $\widehat{β} ≝ (X^T X)^{-1} X^T Y$ is independent of $ε_{n+1}$ (it only depends on $ε_1, \ldots, ε_n$)
So
\[\begin{align*} R(\widehat{β}) & = 𝔼(ε_{n+1}^2) + 𝔼((X_{n+1}^T (β - \widehat{β}))^2) \\ &= Var(ε_{n+1}) + X_{n+1}^T \underbrace{𝔼((β - \widehat{β})(β - \widehat{β})^T)}_{= 𝔼((𝔼(\widehat{β}) - \widehat{β})(𝔼(\widehat{β}) - \widehat{β})^T) = Var(\widehat{β})} X_{n+1} \\ &= σ^2(1 + X_{n+1}^T(X^T X)^{-1} X_{n+1}) \end{align*}\]As
- $𝔼(\widehat{β}) = β$
- $Var(\widehat{β}) = σ^2 (X^T X)^{-1}$
Th (Gauss-Markov): $\widehat{β}$ is optimal in the sense that its variance is minimal among all linear unbiased estimators.
Can we estimate $σ^2$ ?
\[σ^2 = Var(ε_{n+1}) = Var(Y_{n+1}) = 𝔼((Y_{n+1} - 𝔼(Y_{n+1}))^2) \\ = 𝔼((Y_{n+1} - X_{n+1}^T β)^2)\]A natural (plug-in) estimator is: \[\widehat{σ}^2 = \frac 1 n \sum\limits_{ i=1 }^n (y_i - x_i^T \widehat{β})^2 = \frac{\Vert Y - X\widehat{β} \Vert_2^2}{n}\]Its expectation: \[n 𝔼(\widehat{σ}^2) = 𝔼(\Vert Y - X \widehat{β}\Vert_2^2) = 𝔼((Y - X \widehat{β})^T (Y - X \widehat{β})) = 𝔼(Tr((Y - X \widehat{β}) (Y - X \widehat{β})^T))\]But $X \widehat{β} = P_{Im(X)}(Y)$, so:
\[n 𝔼(\widehat{σ}^2) = 𝔼(Tr((Y - P_{Im(X)}(Y)) (Y - P_{Im(X)}(Y))^T)) \\ = 𝔼(Tr(\underbrace{P_{Im(X)^\bot}Y}_{= P_{Im(X)^\bot}(Y - Xβ)} (P_{Im(X)^\bot}(Y - Xβ))^T)) \\ = 𝔼(Tr(P_{Im(X)^\bot}(Y - Xβ) (Y - Xβ)^T P_{Im(X)^\bot}^T)) \\ = Tr(𝔼(P_{Im(X)^\bot}(Y - Xβ) (Y - Xβ)^T P_{Im(X)^\bot}^T)) \\ = Tr(P_{Im(X)^\bot} \underbrace{𝔼((Y - Xβ) (Y - Xβ)^T)}_{ = Var(Y) = Var(ε) = σ^2 Id} \underbrace{P_{Im(X)^\bot}^T}_{ = P_{Im(X)^\bot}}) \\ = σ^2 Tr(P_{Im(X)^\bot}) \\ = σ^2 (n-p)\]So this estimator is biased:
\[𝔼(\widehat{σ}^2) = \frac{n-p}{n} σ^2\]So we define the following unbiased estimator:
\[\widehat{σ}_{new}^2 = \frac{\Vert Y - X \widehat{β} \Vert_2^2}{n - p}\]
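A one-line implementation of this unbiased estimator (a sketch; the function name is ours, and `beta_hat` is assumed to be the OLS estimator):

```python
import numpy as np

def sigma2_unbiased(X, Y, beta_hat):
    """Unbiased noise-variance estimator: ||Y - X beta_hat||^2 / (n - p)."""
    n, p = X.shape
    residuals = Y - X @ beta_hat
    return residuals @ residuals / (n - p)
```

The case of Gaussian noise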
$ε_i \sim 𝒩(0, σ^2)$
NB: this assumption is often legitimate, by the central limit theorem: the noise can be seen as the sum of many small i.i.d. perturbations
The maximum likelihood estimators of $β$ and $σ^2$ are $\widehat{β} = (X^T X)^{-1} X^T Y$ (the ordinary least squares estimator) and $\frac{\Vert Y - X \widehat{β} \Vert_2^2}{n}$ for $σ^2$ (which is biased: it divides by $n$ and not $n-p$)
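For the record, a short sketch of why this holds (standard computation, not detailed in the notes): the Gaussian log-likelihood is
\[\log L(β, σ^2) = -\frac n 2 \log(2 π σ^2) - \frac{1}{2 σ^2} \Vert Y - Xβ \Vert_2^2\]so maximizing it in $β$ amounts to minimizing $\Vert Y - Xβ \Vert_2^2$ (hence the OLS), and solving $\frac{∂ \log L}{∂ σ^2} = 0$ gives $\widehat{σ}^2_{MLE} = \frac{\Vert Y - X \widehat{β} \Vert_2^2}{n}$.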
What if the relationship is not linear?
$X = [1, Temperature, Temperature^2, Temperature^3]$
We can fit any polynomial by adding transformations of the coordinates into $X$:
\[Y = a 1 + b X + c X^2 + \ldots\]⟶ Spline regression
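For instance, a minimal NumPy sketch of fitting a cubic in the temperature by augmenting the design matrix (made-up temperature/consumption data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
temp = rng.uniform(-5, 25, size=200)                       # hypothetical temperatures
y = 10 - 0.8 * temp + 0.02 * temp**2 + rng.normal(scale=0.5, size=200)

# Design matrix [1, T, T^2, T^3]: the model stays *linear* in beta
X = np.column_stack([np.ones_like(temp), temp, temp**2, temp**3])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)           # ordinary least squares
```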
What if $X^T X$ is not invertible?
\[\widehat{β} ∈ {\rm argmin}_β \Vert Y - X β\Vert_2^2\]But in practice, we instead use:
- Regularisation:
- \[\widehat{β} ∈ {\rm argmin}_β \underbrace{\Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_2^2}_{\text{Ridge}}\]
For each $λ$, there exists a $δ$ s.t.
\[{\rm argmin}_{\Vert β \Vert_2^2 ≤ δ} \Vert Y - X β\Vert_2^2 = {\rm argmin}_{β} \Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_2^2\]- Lasso:
- \[\widehat{β} ∈ {\rm argmin}_β \Vert Y - X β\Vert_2^2 + λ\Vert β\Vert_1\]
The choice of $λ$ is important
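A minimal sketch of the ridge estimator: differentiating the penalized criterion gives the closed form $\widehat{β}_λ = (X^T X + λ I_p)^{-1} X^T Y$ (the function name below is ours):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator: argmin_b ||Y - X b||^2 + lam * ||b||^2 (closed form)."""
    p = X.shape[1]
    # X^T X + lam * I is invertible for any lam > 0, even if X^T X is singular
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

In practice $λ$ is typically chosen by cross-validation.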
QR decomposition
$\widehat{β} = (X^T X)^{-1} X^T Y$
⟶ QR decomposition: $X = Q R$ ($Q ∈ ℝ^{n×p}$ with orthonormal columns, $R ∈ ℝ^{p×p}$ upper triangular and invertible)
so that:
\[R \widehat{β} = Q^T Y\]
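A minimal NumPy/SciPy sketch (the function name is ours; it uses the thin QR above):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, Y):
    """Least-squares estimator via a thin QR decomposition of X."""
    Q, R = np.linalg.qr(X)               # Q: orthonormal columns, R: upper triangular
    return solve_triangular(R, Q.T @ Y)  # back-substitution solves R beta = Q^T Y
```

This avoids forming and inverting $X^T X$ explicitly, which is numerically more stable.

Gradient Descent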
- $β_0 = 0$
- \[β_{i+1} = β_i - η \nabla \widehat{R}_n (β_i)\]
If you choose the step size $η$ wisely ⟹ the iterates converge to $\widehat{β}$
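A minimal sketch of this gradient descent for the linear model (the step size and iteration count below are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent(X, Y, eta=0.01, n_iter=1000):
    """Minimize R_n(beta) = (1/n) ||Y - X beta||^2 by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)                          # beta_0 = 0
    for _ in range(n_iter):
        grad = (2 / n) * X.T @ (X @ beta - Y)   # gradient of the empirical risk
        beta = beta - eta * grad
    return beta
```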
Stochastic gradient descent: if you don't want to compute the full gradient at each step, use the gradient of a single randomly chosen term of the sum instead.
II. Logistic regression
- Binary classification: $Y ∈ \lbrace 0, 1 \rbrace^n$
⟶ the square loss is designed for real-valued outputs ⟹ not well suited here
\[\widehat{R}_n(f) = \frac 1 n \sum\limits_{ i =1 }^n 1_{2 Y_i - 1 ≠ sign(f(X_i))}\]Issue: it’s not convex, so the minimization problem is way too hard! (NP-hard)
The Heaviside function (not convex) is replaced by the logistic loss function (which is convex).
- Logistic loss:
- \[l(f(X), Y) = Y \log(1 + {\rm e}^{-f(X)}) + (1-Y) \log(1 +{\rm e}^{f(X)})\]
Then we predict $Y_i = 1$ if $X_i^T \widehat{β} ≥ 0$ and $Y_i = 0$ if $X_i^T \widehat{β} < 0$.
Same trick as before if the data can’t be separated by a linear function ⟶ add transformations of the coordinates into $X$
Nice probabilistic interpretation of logistic regression:
\[\frac{ℙ(Y_{n+1} = 1 \mid X_{n+1})}{ℙ(Y_{n+1} = 0 \mid X_{n+1})} \overset{\text{Bayes}}{\leadsto} \frac{ℙ(X_{n+1} \mid Y_{n+1} = 1)}{ℙ(X_{n+1} \mid Y_{n+1}=0)}\]This odds ratio is the number of times you’re more likely to be in one category than in the other.
So we want this log-ratio to be linear in $X$:
\[\log \frac{ℙ(X_{n+1} \mid Y_{n+1} = 1)}{ℙ(X_{n+1} \mid Y_{n+1}=0)} = X_{n+1}^T β\]With logistic regression, there is no closed-form expression for $\widehat{β}$ ⟶ the only way to compute it is gradient descent (or the Newton-Raphson method).
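A minimal sketch of such a gradient descent for the logistic loss (the function names are ours; the gradient used below is the standard one, $\nabla \widehat{R}_n(β) = \frac 1 n X^T (σ(Xβ) - Y)$ with $σ$ the sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, Y, eta=0.1, n_iter=2000):
    """Minimize the average logistic loss by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - Y) / n   # gradient of the logistic risk
        beta = beta - eta * grad
    return beta

def predict(X, beta):
    """Prediction rule from the notes: predict 1 iff x^T beta >= 0."""
    return (X @ beta >= 0).astype(int)
```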