Lecture 2: Regression

Linear Regression and Logistic regression: particular cases of empirical risk minimization

I. Linear regression

Ex:

  • EDF: understand the relation between electricity consumption and the weather (the colder it is, the more electricity is consumed)
$Y = (Y_1, \dots, Y_n)^T \in \mathbb{R}^n$, $\quad X = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times p}$

Goal: find

$$\underset{f \in \mathcal{F}}{\operatorname{argmin}}\ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - f(X_i)\big)^2$$

Beware over-fitting!

$f$ affine function: $f: x \mapsto ax + b$

Linear model: $Y = X\beta + \varepsilon$, where the noise $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T \in \mathbb{R}^n$ has i.i.d. coordinates $(\varepsilon_i)$ with

  • $\mathbb{E}(\varepsilon_i) = 0$
  • $\operatorname{Var}(\varepsilon) = (\operatorname{Cov}(\varepsilon_i, \varepsilon_j))_{i,j} = \sigma^2 I_n$

Assumption: $X \in \mathbb{R}^{n \times p}$ is injective: $\operatorname{rank}(X) = p$

($\operatorname{Im}(X)$ is a vector subspace of $\mathbb{R}^n$ of dimension $p \le n$)

Ordinary least squares estimator:
$$\hat{\beta} \in \underset{\beta}{\operatorname{argmin}}\ \hat{R}_n(\beta) = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}}\ \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}}\ \frac{1}{n} \|Y - X\beta\|_2^2$$

Proposition:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

Proof: $\hat{\beta}$ is a minimum of $\hat{R}_n$ ⟹ it must cancel its gradient:

$$\nabla \hat{R}_n(\beta) = \frac{2}{n}\big((X^T X)\beta - X^T Y\big) = 0 \iff (X^T X)\hat{\beta} = X^T Y \underset{\text{if } X^T X \text{ is invertible}}{\iff} \hat{\beta} = (X^T X)^{-1} X^T Y$$

So $\hat{\beta}$ is a critical point of $\hat{R}_n$.

But since the Hessian $\nabla^2 \hat{R}_n(\beta) = \frac{2}{n} X^T X$ is a positive-definite matrix (because $\operatorname{rank}(X) = p$), $\hat{\beta}$ is a minimum, and it is unique.


We have found the best approximation of $Y \in \mathbb{R}^n$ by a vector of the form $X\beta$.

$$X\hat{\beta} = P_{\operatorname{Im}(X)}(Y), \quad \text{the orthogonal projection of } Y \text{ onto } \operatorname{Im}(X)$$

Moreover:

$$P_{\operatorname{Im}(X)}(Y) = X\hat{\beta} = \underbrace{X (X^T X)^{-1} X^T}_{\text{orthogonal projection matrix onto } \operatorname{Im}(X)}\, Y$$
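As a sanity check, here is a minimal numpy sketch (simulated data, illustrative dimensions and coefficients, not from the lecture) that computes $\hat{\beta}$ by solving the normal equations and verifies that $X\hat{\beta}$ coincides with the orthogonal projection of $Y$ onto $\operatorname{Im}(X)$:

```python
import numpy as np

# Illustrative simulated data following the model Y = X beta + epsilon.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3
Y = X @ beta_true + sigma * rng.normal(size=n)

# OLS estimator: beta_hat = (X^T X)^{-1} X^T Y.
# Solving the normal equations is numerically safer than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# X beta_hat is the orthogonal projection of Y onto Im(X):
P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix X (X^T X)^{-1} X^T
assert np.allclose(P @ Y, X @ beta_hat)

print("beta_hat =", beta_hat)
```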

Is $\hat{\beta}$ a good estimator?

To simplify calculations, we assume that X is deterministic.

$$\mathbb{E}(\hat{\beta}) = \mathbb{E}\big((X^T X)^{-1} X^T Y\big) = (X^T X)^{-1} X^T \mathbb{E}(Y) = (X^T X)^{-1} X^T \big(X\beta + \underbrace{\mathbb{E}(\varepsilon)}_{=0}\big) = \beta$$

So $\hat{\beta}$ is an unbiased estimator of $\beta$.

$$\operatorname{Var}(\hat{\beta}) = \operatorname{Var}\big((X^T X)^{-1} X^T Y\big) = (X^T X)^{-1} X^T \underbrace{\operatorname{Var}(Y)}_{= \mathbb{E}\left((Y - \mathbb{E}(Y))(Y - \mathbb{E}(Y))^T\right) = \operatorname{Var}(\varepsilon) = \sigma^2 I_n} X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$

What is the prediction risk of $\hat{\beta}$?

  • $Y_{n+1} \in \mathbb{R}$
  • $X_{n+1} \in \mathbb{R}^p$
$$R(\hat{\beta}) = \mathbb{E}\big((Y_{n+1} - X_{n+1}^T \hat{\beta})^2\big) = \mathbb{E}\big((Y_{n+1} - X_{n+1}^T \beta + X_{n+1}^T \beta - X_{n+1}^T \hat{\beta})^2\big) = \mathbb{E}\big((Y_{n+1} - X_{n+1}^T \beta)^2\big) + \mathbb{E}\big((X_{n+1}^T (\beta - \hat{\beta}))^2\big) + 2\,\mathbb{E}\big((Y_{n+1} - X_{n+1}^T \beta)(X_{n+1}^T (\beta - \hat{\beta}))\big)$$

But

$$\mathbb{E}\Big(\underbrace{(Y_{n+1} - X_{n+1}^T \beta)}_{= \varepsilon_{n+1}}\, \big(X_{n+1}^T (\beta - \hat{\beta})\big)\Big) \overset{\text{independence}}{=} \underbrace{\mathbb{E}(\varepsilon_{n+1})}_{=0}\ \mathbb{E}\big(X_{n+1}^T (\beta - \hat{\beta})\big) = 0$$

Independence:

  • $\varepsilon_1, \dots, \varepsilon_n$ and $\varepsilon_{n+1}$ are independent

  • $X$, $\beta$ and $X_{n+1}$ are deterministic

  • $\hat{\beta} = (X^T X)^{-1} X^T Y$ is independent of $\varepsilon_{n+1}$ (it only depends on $\varepsilon_1, \dots, \varepsilon_n$)

So

$$R(\hat{\beta}) = \mathbb{E}(\varepsilon_{n+1}^2) + \mathbb{E}\big((X_{n+1}^T (\beta - \hat{\beta}))^2\big) = \operatorname{Var}(\varepsilon_{n+1}) + X_{n+1}^T \underbrace{\mathbb{E}\big((\beta - \hat{\beta})(\beta - \hat{\beta})^T\big)}_{= \mathbb{E}\left((\mathbb{E}(\hat{\beta}) - \hat{\beta})(\mathbb{E}(\hat{\beta}) - \hat{\beta})^T\right) = \operatorname{Var}(\hat{\beta})} X_{n+1} = \sigma^2 \big(1 + X_{n+1}^T (X^T X)^{-1} X_{n+1}\big)$$

using the fact that

  • $\mathbb{E}(\hat{\beta}) = \beta$
  • $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$

(both identities are checked numerically in the sketch below).
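A small Monte Carlo sketch, with $X$ held fixed (it is deterministic here) and purely illustrative values, to check these two identities numerically:

```python
import numpy as np

# Repeatedly draw the noise, re-estimate beta_hat, and compare the empirical
# mean and covariance of beta_hat with beta and sigma^2 (X^T X)^{-1}.
rng = np.random.default_rng(1)
n, p, sigma = 100, 2, 0.5
X = rng.normal(size=(n, p))            # drawn once and then fixed: X is deterministic
beta = np.array([1.0, -2.0])
n_rep = 20_000

estimates = np.empty((n_rep, p))
for k in range(n_rep):
    Y = X @ beta + sigma * rng.normal(size=n)
    estimates[k] = np.linalg.solve(X.T @ X, X.T @ Y)

print("empirical mean  :", estimates.mean(axis=0))       # ~ beta  (unbiased)
print("empirical cov   :\n", np.cov(estimates.T))        # ~ sigma^2 (X^T X)^{-1}
print("theoretical cov :\n", sigma**2 * np.linalg.inv(X.T @ X))
```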

Th (Gauss-Markov): $\hat{\beta}$ is optimal in the sense that its variance is minimal among all linear unbiased estimators.

Can we estimate $\sigma^2$?

$$\sigma^2 = \operatorname{Var}(\varepsilon_{n+1}) = \operatorname{Var}(Y_{n+1}) = \mathbb{E}\big((Y_{n+1} - \mathbb{E}(Y_{n+1}))^2\big) = \mathbb{E}\big((Y_{n+1} - X_{n+1}^T \beta)^2\big)$$

A natural estimator is

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^T \hat{\beta})^2 = \frac{\|Y - X\hat{\beta}\|_2^2}{n}$$

Then

$$n\,\mathbb{E}(\hat{\sigma}^2) = \mathbb{E}\big(\|Y - X\hat{\beta}\|_2^2\big) = \mathbb{E}\Big(\operatorname{Tr}\big((Y - X\hat{\beta})^T (Y - X\hat{\beta})\big)\Big) = \mathbb{E}\Big(\operatorname{Tr}\big((Y - X\hat{\beta})(Y - X\hat{\beta})^T\big)\Big)$$

But $X\hat{\beta} = P_{\operatorname{Im}(X)}(Y)$, so:

$$\begin{aligned}
n\,\mathbb{E}(\hat{\sigma}^2) &= \mathbb{E}\Big(\operatorname{Tr}\big((Y - P_{\operatorname{Im}(X)}(Y))(Y - P_{\operatorname{Im}(X)}(Y))^T\big)\Big) \\
&= \mathbb{E}\Big(\operatorname{Tr}\big(\underbrace{P_{\operatorname{Im}(X)^\perp}(Y)}_{= P_{\operatorname{Im}(X)^\perp}(Y - X\beta)}\,\big(P_{\operatorname{Im}(X)^\perp}(Y - X\beta)\big)^T\big)\Big) \\
&= \mathbb{E}\Big(\operatorname{Tr}\big(P_{\operatorname{Im}(X)^\perp}(Y - X\beta)(Y - X\beta)^T P_{\operatorname{Im}(X)^\perp}^T\big)\Big) \\
&= \operatorname{Tr}\Big(P_{\operatorname{Im}(X)^\perp}\,\underbrace{\mathbb{E}\big((Y - X\beta)(Y - X\beta)^T\big)}_{= \operatorname{Var}(Y) = \operatorname{Var}(\varepsilon) = \sigma^2 I_n}\,\underbrace{P_{\operatorname{Im}(X)^\perp}^T}_{= P_{\operatorname{Im}(X)^\perp}}\Big) \\
&= \sigma^2 \operatorname{Tr}\big(P_{\operatorname{Im}(X)^\perp}\big) = \sigma^2 (n - p)
\end{aligned}$$

So this estimator is biased:

$$\mathbb{E}(\hat{\sigma}^2) = \frac{n - p}{n}\,\sigma^2$$

So we define the following unbiased estimator:

$$\hat{\sigma}^2_{\text{new}} = \frac{\|Y - X\hat{\beta}\|_2^2}{n - p}$$
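A short sketch (illustrative values, not from the lecture) comparing the biased estimator $\|Y - X\hat{\beta}\|_2^2 / n$ with the unbiased one $\|Y - X\hat{\beta}\|_2^2 / (n - p)$:

```python
import numpy as np

# With sigma^2 = 1, n = 30, p = 5: E[rss/n] should be (n-p)/n = 0.833...,
# while E[rss/(n-p)] should be 1.
rng = np.random.default_rng(2)
n, p, sigma = 30, 5, 1.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

biased, unbiased = [], []
for _ in range(20_000):
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    rss = np.sum((Y - X @ beta_hat) ** 2)
    biased.append(rss / n)
    unbiased.append(rss / (n - p))

print("E[rss/n]     ~", np.mean(biased))     # ~ sigma^2 (n - p) / n
print("E[rss/(n-p)] ~", np.mean(unbiased))   # ~ sigma^2
```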

The case of Gaussian noise

$$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$

NB: this assumption is often legitimate because of the central limit theorem: each noise term typically aggregates many small independent effects.

The maximum likelihood estimators of $\beta$ and $\sigma^2$ are $\hat{\beta} = (X^T X)^{-1} X^T Y$ (the ordinary least squares estimator) and $\frac{\|Y - X\hat{\beta}\|_2^2}{n}$ for $\sigma^2$ (which is biased: it divides by $n$ and not $n - p$).

What if the relationship is not linear?

$$X = [1, \text{Temperature}, \text{Temperature}^2, \text{Temperature}^3]$$

We can fit any polynomial by adding transformations of the coordinates as new columns of $X$:

$$Y = a \cdot 1 + b X + c X^2 + \dots$$

⟶ Spline regression
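A minimal sketch of the coordinate-transformation trick on the electricity/temperature example (the coefficients and noise level below are made up for illustration, not taken from the lecture); the model stays linear in $\beta$ even though the fitted curve is a cubic polynomial of the temperature:

```python
import numpy as np

# Simulated consumption as a cubic function of temperature plus noise.
rng = np.random.default_rng(3)
temperature = rng.uniform(-5, 25, size=300)
consumption = (50 - 2.0 * temperature + 0.05 * temperature**2
               - 0.01 * temperature**3 + rng.normal(scale=1.0, size=300))

# Design matrix X = [1, T, T^2, T^3]: the model is still linear in beta.
X = np.column_stack([np.ones_like(temperature),
                     temperature, temperature**2, temperature**3])
beta_hat = np.linalg.solve(X.T @ X, X.T @ consumption)
print("beta_hat =", beta_hat)   # roughly recovers (50, -2, 0.05, -0.01)
```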

What if $X^T X$ is not invertible?

$$\hat{\beta} \in \underset{\beta}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2$$

But in practice, we rather use:

Regularisation:

Ridge:
$$\hat{\beta}_{\text{Ridge}} \in \underset{\beta}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

For each $\lambda$, there exists a $\delta$ such that
$$\underset{\|\beta\|_2^2 \le \delta}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2 = \underset{\beta}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

Lasso:
$$\hat{\beta}_{\text{Lasso}} \in \underset{\beta}{\operatorname{argmin}}\ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

The ridge estimator has a closed form:
$$\hat{\beta}_{\text{Ridge}} = (X^T X + \lambda I_p)^{-1} X^T Y$$

The choice of λ is important
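A small sketch of the ridge closed form on a deliberately rank-deficient design (illustrative data): $X^T X$ is singular here, but $X^T X + \lambda I_p$ is not.

```python
import numpy as np

# Two identical columns make X^T X singular; ridge still has a unique solution.
rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])   # rank-deficient design
Y = 3 * x + 0.1 * rng.normal(size=n)

lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print("beta_ridge =", beta_ridge)   # the weight on x is split between columns 0 and 1
```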

QR decomposition

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

⟶ QR decomposition: $X = QR$ ($Q$ orthogonal, $R$ upper triangular)

so that:

$$R\hat{\beta} = Q^T Y$$
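A minimal numpy sketch of this approach (illustrative data): compute $X = QR$ and solve the triangular system $R\hat{\beta} = Q^T Y$ instead of forming $X^T X$.

```python
import numpy as np

# Solving via QR avoids forming X^T X, which is better conditioned numerically.
rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

Q, R = np.linalg.qr(X)                   # X = QR (Q with orthonormal columns, R upper triangular)
beta_hat = np.linalg.solve(R, Q.T @ Y)   # solve R beta_hat = Q^T Y

assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
print("beta_hat =", beta_hat)
```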

Gradient Descent

  • $\beta_0 = 0$

  • $\beta_{i+1} = \beta_i - \eta\, \nabla \hat{R}_n(\beta_i)$

If you choose $\eta$ wisely ⟹ the iterates converge.

Stochastic gradient descent: if you don't want to compute the full gradient at each step.
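A minimal sketch of gradient descent on the least-squares empirical risk, with a hand-picked step size $\eta$ (all values illustrative); it converges to the closed-form solution:

```python
import numpy as np

# Gradient descent on R_n(beta) = (1/n) ||Y - X beta||^2.
rng = np.random.default_rng(6)
n, p = 500, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=n)

eta = 0.1
beta = np.zeros(p)                          # beta_0 = 0
for _ in range(1000):
    grad = (2 / n) * X.T @ (X @ beta - Y)   # gradient of the empirical risk
    beta = beta - eta * grad

print("gradient descent :", beta)
print("closed form      :", np.linalg.solve(X.T @ X, X.T @ Y))
```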

Logistic regression

  • Binary classification: $Y \in \{0, 1\}^n$

⟶ the square loss is designed for real-valued outputs ⟹ not suitable here

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{2Y_i - 1 \ne \operatorname{sign}(f(X_i))}$$

Issue: it's not convex, so the minimization problem is way too hard (NP-hard)!

The Heaviside function (not convex) is replaced by the logistic loss (which is convex).

Logistic loss:
$$\ell(f(X), Y) = Y \log\big(1 + e^{-f(X)}\big) + (1 - Y) \log\big(1 + e^{f(X)}\big)$$
$$\hat{\beta}_{\text{logistic}} = \underset{\beta}{\operatorname{argmin}}\ \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(X_i^T \beta, Y_i) \Big\}$$

Then we predict $Y_i = 1$ if $X_i^T \hat{\beta} \ge 0$ and $Y_i = 0$ if $X_i^T \hat{\beta} < 0$.

Same trick as before if the data cannot be separated by a linear function ⟶ transformations of the coordinates of $X$.

Nice probabilistic interpretation of logistic regression:

$$\frac{\mathbb{P}(Y = 1 \mid X)}{\mathbb{P}(Y = 0 \mid X)} \ \overset{\text{Bayes}}{\longleftrightarrow} \ \frac{\mathbb{P}(X_{n+1} \mid Y_{n+1} = 1)}{\mathbb{P}(X_{n+1} \mid Y_{n+1} = 0)}$$

This ratio is the number of times you are more likely to be in one category than in the other.

So we want the log of this ratio to be linear:

$$\log \frac{\mathbb{P}(X_{n+1} \mid Y_{n+1} = 1)}{\mathbb{P}(X_{n+1} \mid Y_{n+1} = 0)} = X_{n+1}^T \beta$$

With logistic regression, we cannot compute $\hat{\beta}$ in closed form ⟶ the only way to solve it is gradient descent (or the Newton-Raphson method).
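A minimal sketch of this (simulated data, illustrative values): minimise the logistic empirical risk by gradient descent, using the standard gradient $\frac{1}{n} X^T(\sigma(X\beta) - Y)$ where $\sigma$ is the sigmoid.

```python
import numpy as np

# Simulated binary labels from a logistic model, then plain gradient descent.
rng = np.random.default_rng(7)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -3.0])
proba = 1 / (1 + np.exp(-X @ beta_true))
Y = rng.binomial(1, proba)              # labels in {0, 1}

def gradient(beta):
    # gradient of (1/n) sum_i l(X_i^T beta, Y_i) for the logistic loss
    return X.T @ (1 / (1 + np.exp(-X @ beta)) - Y) / n

eta = 1.0
beta = np.zeros(p)
for _ in range(5000):
    beta -= eta * gradient(beta)

pred = (X @ beta >= 0).astype(int)      # predict 1 if X_i^T beta >= 0
print("beta_hat =", beta, " accuracy =", (pred == Y).mean())
```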
