# TP : Coordinate descent for LASSO Regression¶

The goal of this TP is to study coordinate descent for the LASSO Regression, first theoretically and then empirically by implementing the algorithm on a dataset.

Given a dataset $(x_i,y_i)_{i = 1 \dots n} \in \mathbb{R}^d \times \mathbb{R}$, the LASSO regression problem is given by $$\min_{(\beta_0,\beta) \in \mathbb{R}^{d+1}} \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda {\lVert}\beta{\rVert}_1$$ where ${\lVert}\beta{\rVert}_1 = \sum_{j=1}^d {\lvert}\beta_j{\rvert}$. Note that the intercept $\beta_0$ is not penalized by the $L^1$ norm.

# Derivation of the optimization algorithm¶

## 1)¶

Let's assume first that the design matrix $X$ satisfies $X^TX = I$. Prove that the lasso problem has a closed form solution given by $$\hat \beta_j = sign(\bar \beta_j) ({\lvert}\bar \beta_j{\rvert} - \lambda)_+$$ where $\bar \beta$ is the solution of the ordinary least squares problem (without regression).

$$\min_{β∈ ℝ^d} \frac 1 2 \sum\limits_{ i } (y_i - β^T x_i)^2 + λ \Vert β \Vert_1$$

$$\min_{β∈ ℝ^d} \frac 1 2 \Vert Y - Xβ \Vert^2 + λ \Vert β \Vert_1$$

Assume that $X^T X = I$

NB: $\overline{β}$ result of the OLS regression:

$$\overline{β} = (X^T X)^{-1} X^T Y \\ = X^T Y$$

\begin{align*} \Vert Y - X β \Vert^2 & = \Vert Y \Vert^2 + \Vert Xβ \Vert^2 - 2 (Xβ)^T Y \\ & = \Vert Y \Vert^2 + β^T β - 2 β^T \overline{β} \end{align*}

So

\begin{align*} minimize_{β∈ ℝ^d} \frac 1 2 \Vert Y - Xβ \Vert^2 + λ \Vert β \Vert_1 & ⟺ minimize_{β∈ ℝ^d} \frac 1 2 β^T β - β^T \overline{β} + λ \Vert β \Vert_1 \\ & ⟺ minimize_{β∈ ℝ^d} \frac 1 2 \sum\limits_{ j=1 }^d β_j^2 - \sum\limits_{ j=1 }^d β_j \overline{β}_j + λ \sum\limits_{ j=1 }^d \vert β_j \vert \\ & ⟺ minimize_{β∈ ℝ^d} \sum\limits_{ j=1 }^d \underbrace{β_j\left(\frac 1 2 β_j - \overline{β}_j\right) + λ \vert β_j \vert}_{≝ f_j(β_j)} \\ \end{align*}

And it suffices to minimize $f_j$ for all $j$

Problem: $x ⟼ \vert x \vert$ is not differentiable at $x = 0$

Trick: note that $β_j^\ast$ has the same sign as $\overline{β}_j$

Indeed: one wants to minimize $\frac 1 2 β_j^2 - \overline{β}_j β_j + λ \vert β_j \vert$, so $\overline{β}_j β_j$ has to be non-negative

So the $f_j$ are defined on $ℝ^\ast_+$ and $ℝ^\ast_-$, on which they are differentiable.

### Case 1: $\overline{β}_j, β^\ast_j > 0$¶

$$f_j'(β_j) = β_j - \overline{β}_j + λ$$

so that

$$β^\ast_j = \max\left(\overline{β}_j - λ, 0\right) = \left(\overline{β}_j - λ\right)_+$$

### Case 2: $\overline{β}_j, β^\ast_j < 0$¶

$$f_j'(β_j) = β_j - \overline{β}_j - λ$$

$$β^\ast_j = \min\left(\overline{β}_j + λ, 0\right) = -\max\left(-\overline{β}_j - λ, 0\right)\\ = -\left(\vert \overline{β}_j \vert - λ \right)_+ \\$$

In either case,

$$β^\ast_j = sign(\overline{β}_j) \left(\vert \overline{β}_j \vert - λ \right)_+$$

## 2)¶

Now in the general case, compute the gradient of the objective function of the LASSO (we will denote by $X_j$ be the $j-th$ column of $X$).

$$\min_{β∈ ℝ^d} \frac 1 2 \sum\limits_{ i } \underbrace{(y_i - β^T x_i)^2}_{\text{Residual Sum of Squared}} + \underbrace{λ \Vert β \Vert_1}_{\text{reg}}$$

$$\nabla \text{RSS} = - X^T (Y - X β)$$

\begin{align*} \frac{\partial \text{RSS}}{\partial β_j} & = j\text{-th coord of } \nabla\text{RSS}\\ & = - \sum\limits_{ i=1 }^n x_{i, j} (y_i - β^T x_i) \\ & = \underbrace{-\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)}_{≝ \; ρ_j} + \underbrace{\sum\limits_{ i=1 }^n x_{ij}^2}_{≝ \; r_j} β_j \end{align*}

Let us denote by $F$ the function to minimize (RSS+reg).

One cannot analytically resolve $\nabla F = 0$.

NB: If

$$f(x) = \vert x \vert$$

Then one sets:

$$\partial f(x) ≝ \begin{cases} -1 &&\text{ if } x<0 \\ [-1, 1] &&\text{ if } x=0\\ 1 &&\text{ else if } x >0 \end{cases}$$

$$\frac{\partial F}{\partial β_j}(β_j) = ρ_j + r_j β_j + \begin{cases} -λ &&\text{ if } β_j<0 \\ [-λ, λ] &&\text{ if } β_j=0\\ λ &&\text{ else if } β_j >0 \end{cases} \\ = \begin{cases} ρ_j + r_j β_j -λ &&\text{ if } β_j<0 \\ [ρ_j-λ, ρ_j+λ] &&\text{ if } β_j=0\\ ρ_j + r_j β_j +λ &&\text{ else if } β_j >0 \end{cases}$$

## 3)¶

Given a fixed $\beta \in \mathbb{R}^d$, solve the j-th coordinate optimisation problem given by $$\min_{\beta_j \in \mathbb{R}} \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda {\lVert}\beta{\rVert}_1$$

### Case 1: $β_j < 0$¶

$$FOC \text{ (First Order Condition)} ⟹ β_j = \frac{-ρ_j + λ}{r_j}$$

But $β_j < 0$, so $$ρ_j > λ$$

### Case 2: $β_j > 0$¶

$$FOC ⟹ β_j = -\frac{ρ_j + λ}{r_j}$$

But $β_j > 0$, so $$ρ_j < -λ$$

### Case 3: $β_j = 0$¶

$$FOC ⟹ 0 ∈ [ρ_j-λ, ρ_j+λ]$$

So $$ρ_j ∈ [-λ, λ]$$

$$β_j = \underbrace{\begin{cases} -\frac{ρ_j + λ}{r_j} &&\text{ if } ρ_j<-λ \\ 0 &&\text{ if } -λ ≤ ρ_j ≤ λ\\ \frac{-ρ_j + λ}{r_j} &&\text{ else if } ρ_j > λ \end{cases}}_{≝ \underbrace{st}_{\rlap{\text{soft thresholding}}}\left(\frac{ρ_j}{r_j}, \; λ\right)}$$

## 4)¶

Derive an iterative algorithm to solve the LASSO problem

One iteratively minimizes on each coordinate for each $j$, one sets $(β_k)_{k≠j}$ and minimizes $F(β)$ on the j-th coordinate.

Stopping criterion: $β - β_{new} < ε$

compute r_j for all j
while not converged:
for j=1, ..., d:
compute ρ_j
β_j = st(ρ_j/r_j, λ)

• As $$\frac{\partial \text{RSS}}{\partial β_j} = - \sum\limits_{ i } y_i x_{ij} + \sum\limits_{ i } x_{ij} \Big(\sum\limits_{ k } β_k x_{ik}\Big)$$ for $β^{OLS}$: $$β^{OLS}_j = -\frac{ρ_j}{r_j}$$
• $$r_j ≝ \sum\limits_{ i } x_{ij}^2$$
• $$ρ_j ≝ -\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)$$

# Implementation¶

We'll consider the polynomial regression problem from TP1, but this time we'll add L1 penalty to avoid overfitting.

## 5)¶

Generate data $(x_i,y_i) \in \mathbb{R}^2$ where $x$ is a random vector and $y = f(x) + \varepsilon$ with $f$ and $\varepsilon$ are the function and white noise of your choice. Display the data along with $f$.

In [1]:
# /!\: I'm using PYTHON3

import plotly
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

import matplotlib.pyplot as plt
import scipy.io as sio
from scipy import stats
%matplotlib inline

plotly.offline.init_notebook_mode(connected=True)
from IPython.core.display import display, HTML
# The polling here is to ensure that plotly.js has already been loaded before
# setting display alignment in order to avoid a race condition.
display(HTML(
'<script>'
'var waitForPlotly = setInterval( function() {'
'if( typeof(window.Plotly) !== "undefined" ){'
'MathJax.Hub.Config({ SVG: { font: "STIX-Web" }, displayAlign: "center" });'
'MathJax.Hub.Queue(["setRenderer", MathJax.Hub, "SVG"]);'
'clearInterval(waitForPlotly);'
'}}, 250 );'
'</script>'
))

In [2]:
# Number of points
nb_points = 50

# Data generation

x_inf, x_sup = 0, 5 # x_inf <= x_i <= x_sup

x = (x_sup - x_inf) * np.random.rand(nb_points) + x_inf

f = np.vectorize(lambda x: np.exp(1.1*x)*np.sinc(1.5*x))

y = f(x) + np.random.randn(nb_points)

# Splitting the data set into a training set and a testing set
ind = {}
ind['train'] = range(0, int(0.5 * nb_points))
ind['test'] = range(int(0.5 * nb_points), nb_points)

# Plotting with Plotly
x_linspace = np.linspace(x_inf, x_sup, nb_points)

trace_points = []

for test_train in ['test', 'train']:
trace_points.append(
go.Scatter(
x = x[ind[test_train]],
y = y[ind[test_train]],
mode = 'markers',
marker = dict(
size = 10,
line = dict(
width = 2,
color = 'rgb(0, 0, 0)'
)
),
name = '{}ing set'.format(test_train.capitalize())
)
)

trace_target = [go.Scatter(
x = x_linspace,
y = f(x_linspace),
mode = 'lines',
name = 'Targeted function f',
line = dict(
color = 'red',
shape = 'spline',
dash = 'longdash'
)
)]

def layout(title='Data points'):
return dict(
title = title,
xaxis = dict(
title = '$x$',
ticklen = 5,
zeroline = False,
gridwidth = 2,
),
yaxis = dict(
title = '$y$',
range = [int(min(np.min(f(x_linspace)), np.min(y)))-1.5, int(max(np.max(f(x_linspace)), np.max(y)))+1.5],
ticklen = 5,
gridwidth = 2,
),
legend = dict(
orientation = 'h',
y = -0.2
)
)

fig = dict(data=trace_points+trace_target, layout=layout())
plotly.offline.iplot(fig)


## 6)¶

Solve the linear regression problem for polynomials of various degrees $d$ and plot the associated solutions.

In [3]:
# Vandermonde design matrix
max_deg = 7
X = np.vander(x, max_deg+1)

# Standardization
mean     = np.mean(X[ind['train'], :], axis=0)
mean[-1] = 0 # to conserve the biais (constant term)
std_dev = np.std(X[ind['train'], :], axis=0)
std_dev[-1] = 1 # to avoid dividing by zero

X = (X-mean)/std_dev

# Matrix to draw the polynomial functions
P  = np.vander(x_linspace, max_deg+1)
P  = (P-mean)/std_dev

trace_poly = []

# Colorscale for polynomials
def colorscale_list(cmap, number_colors=max_deg, return_rgb_only=False):
cm = plt.get_cmap(cmap)
colors = [np.array(cm(i/number_colors)) for i in range(1, number_colors+1)]
rgb_colors_plotly = []
rgb_colors_only = []
for i, c in enumerate(colors):
col = 'rgb{}'.format(tuple(255*c[:-1]))
rgb_colors_only.append(col)
rgb_colors_plotly.append([i/number_colors, col])
rgb_colors_plotly.append([(i+1)/number_colors, col])
return rgb_colors_only if return_rgb_only else rgb_colors_plotly

colors = colorscale_list('Greens', number_colors=max_deg+3, return_rgb_only=True)

W_poly = []
for i in range(1, max_deg+1):
# Linear regression
W_i = np.linalg.solve(X[ind['train'],-(i+1):].T.dot(X[ind['train'],-(i+1):]),\
X[ind['train'],-(i+1):].T.dot(y[ind['train']]))

W_poly.append(W_i)

# Plot
trace_poly.append(
go.Scatter(
x = x_linspace,
y = P[:, -(i+1):].dot(W_i),
mode = 'lines',
name = 'Poly. of deg. {}'.format(i),
line = dict(
width = 3,
color = colors[i+1],
shape = 'spline',
dash = 'solid'
)
)
)

fig = dict(data=trace_points+trace_target+trace_poly, layout=layout(title='Linear Regression'))
plotly.offline.iplot(fig)


## 7)¶

Implement the coordinate descent algorithm from question 4 to solve the lasso problem.

Advice: first, implement auxiliary functions soft_thresholding and lasso_step which solve the partial problem for a given index $j$.

compute r_j for all j
while not converged:
for j=1, ..., d:
compute ρ_j
β_j = st(ρ_j/r_j, λ)

• $r_j ≝ \sum\limits_{ i } x_{ij}^2$
• $ρ_j ≝ -\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)$
In [4]:
def coordinate_descent(x=X, y=y, lam=0.00001, eps=1e-3):
n, d = x.shape
beta = np.zeros(d)

r = np.linalg.norm(x, axis=0)**2

def soft_thresholding(rho_j, j):
if rho_j < -lam:
return -(rho_j+lam)/r[j]
elif -lam <= rho_j <= lam:
return 0
else:
return (-rho_j+lam)/r[j]

# get all k indices different from j, for all j
masks = [np.arange(d) != j for j in range(d)]

def lasso_step(j):
rho_j = x[:, j].dot(-y + x[:, masks[j]].dot(beta[masks[j]]))
return soft_thresholding(rho_j, j)

while True:
# stopping criterion: maximum value of how beta[j] changed
max_change = 0

for j in range(d):
new_beta_j = lasso_step(j)
max_change = max(max_change, abs(beta[j]-new_beta_j))
beta[j] = new_beta_j

if max_change < eps:
break

return beta


## 8)¶

Plot the solution of the lasso problem for different values of $\lambda$ and compare it to linear regression

In [141]:
lam_list = [1e-4, 0.001, 0.1, 1, 10]
colors_lasso = colorscale_list('Purples', number_colors=len(lam_list)+3, return_rgb_only=True)
W_lam = []
trace_lasso = []

for i, lam in enumerate(lam_list):
# Lasso regression
W_i = coordinate_descent(lam=lam)
W_poly.append(W_i)

# Plot
trace_lasso.append(
go.Scatter(
x = x_linspace,
y = P.dot(W_i),
mode = 'lines',
name = 'LASSO λ={}'.format(lam),
line = dict(
width = 3,
color = colors_lasso[i+1],
shape = 'spline',
dash = 'solid'
)
)
)

fig = dict(data=trace_points+trace_target+trace_lasso, layout=layout(title='LASSO Regression'))
plotly.offline.iplot(fig)

In [142]:
# Comparison with linear regression

fig = dict(data=trace_points+trace_target+trace_lasso\
+[trace_poly[l] for l in range(0, len(trace_poly), 2)], layout=layout(title="LASSO vs Linear Regression"))
plotly.offline.iplot(fig)


The larger the $λ$ parameter, the smaller the overfitting, as expected.