TP : Coordinate descent for LASSO Regression

The goal of this TP is to study coordinate descent for the LASSO Regression, first theoretically and then empirically by implementing the algorithm on a dataset.

Given a dataset $(x_i,y_i)_{i = 1 \dots n} \in \mathbb{R}^d \times \mathbb{R}$, the LASSO regression problem is given by $$ \min_{(\beta_0,\beta) \in \mathbb{R}^{d+1}} \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda {\lVert}\beta{\rVert}_1$$ where ${\lVert}\beta{\rVert}_1 = \sum_{j=1}^d {\lvert}\beta_j{\rvert}$. Note that the intercept $\beta_0$ is not penalized by the $L^1$ norm.
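
As a point of reference for the implementation part, here is a minimal sketch of this objective in numpy (the function name lasso_objective is illustrative, not part of the TP):

import numpy as np

def lasso_objective(X, y, beta0, beta, lam):
    """(1/(2n)) * sum_i (y_i - beta0 - beta^T x_i)^2 + lam * ||beta||_1"""
    n = len(y)
    residuals = y - beta0 - X @ beta          # y_i - beta_0 - beta^T x_i
    return (0.5 / n) * np.sum(residuals ** 2) + lam * np.sum(np.abs(beta))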

Derivation of the optimization algorithm

1)

Let's assume first that the design matrix $X$ satisfies $X^TX = I$. Prove that the lasso problem has a closed form solution given by $$ \hat \beta_j = sign(\bar \beta_j) ({\lvert}\bar \beta_j{\rvert} - \lambda)_+$$ where $\bar \beta$ is the solution of the ordinary least squares problem (without regularization).

$$\min_{β∈ ℝ^d} \frac 1 2 \sum\limits_{ i } (y_i - β^T x_i)^2 + λ \Vert β \Vert_1$$

$$\min_{β∈ ℝ^d} \frac 1 2 \Vert Y - Xβ \Vert^2 + λ \Vert β \Vert_1$$

Assume that $X^T X = I$

NB: $\overline{β}$ is the solution of the OLS regression:

$$\overline{β} = (X^T X)^{-1} X^T Y \\ = X^T Y$$

$$ \begin{align*} \Vert Y - X β \Vert^2 & = \Vert Y \Vert^2 + \Vert Xβ \Vert^2 - 2 (Xβ)^T Y \\ & = \Vert Y \Vert^2 + β^T β - 2 β^T \overline{β} \end{align*} $$

So

$$ \begin{align*} \operatorname*{minimize}_{β∈ ℝ^d} \frac 1 2 \Vert Y - Xβ \Vert^2 + λ \Vert β \Vert_1 & ⟺ \operatorname*{minimize}_{β∈ ℝ^d} \frac 1 2 β^T β - β^T \overline{β} + λ \Vert β \Vert_1 \\ & ⟺ \operatorname*{minimize}_{β∈ ℝ^d} \frac 1 2 \sum\limits_{ j=1 }^d β_j^2 - \sum\limits_{ j=1 }^d β_j \overline{β}_j + λ \sum\limits_{ j=1 }^d \vert β_j \vert \\ & ⟺ \operatorname*{minimize}_{β∈ ℝ^d} \sum\limits_{ j=1 }^d \underbrace{β_j\left(\frac 1 2 β_j - \overline{β}_j\right) + λ \vert β_j \vert}_{≝ f_j(β_j)} \\ \end{align*} $$

And it suffices to minimize $f_j$ for all $j$

Problem: $x ⟼ \vert x \vert$ is not differentiable at $x = 0$

Trick: note that $β_j^\ast$ has the same sign as $\overline{β}_j$

Indeed: one wants to minimize $\frac 1 2 β_j^2 - \overline{β}_j β_j + λ \vert β_j \vert$; if $\overline{β}_j β_j < 0$, replacing $β_j$ by $-β_j$ leaves $β_j^2$ and $\vert β_j \vert$ unchanged but strictly decreases the objective, so at the optimum $\overline{β}_j β_j ≥ 0$.

So it suffices to study each $f_j$ on $ℝ^\ast_+$ or $ℝ^\ast_-$ (the half-line matching the sign of $\overline{β}_j$), on which it is differentiable, and to compare with the value at $0$.

Case 1: $\overline{β}_j, β^\ast_j > 0$

$$f_j'(β_j) = β_j - \overline{β}_j + λ$$

Setting $f_j'(β_j) = 0$ gives $β_j = \overline{β}_j - λ$, which is admissible only if it is non-negative; otherwise the minimum over $β_j ≥ 0$ is attained at $0$. Hence

$$β^\ast_j = \max\left(\overline{β}_j - λ, 0\right) = \left(\overline{β}_j - λ\right)_+$$

Case 2: $\overline{β}_j, β^\ast_j < 0$

$$f_j'(β_j) = β_j - \overline{β}_j - λ$$

$$β^\ast_j = \min\left(\overline{β}_j + λ, 0\right) = -\max\left(-\overline{β}_j - λ, 0\right)\\ = -\left(\vert \overline{β}_j \vert - λ \right)_+ \\ $$

In either case,

$$β^\ast_j = sign(\overline{β}_j) \left(\vert \overline{β}_j \vert - λ \right)_+$$
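
A quick numerical sanity check of this closed form, as a sketch under the $X^TX = I$ assumption: an orthonormal design is built with np.linalg.qr, and the soft-thresholded OLS solution is compared with a generic derivative-free minimizer of $\frac 1 2 \Vert Y - Xβ \Vert^2 + λ \Vert β \Vert_1$ (all names below are illustrative).

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
n, d, lam = 30, 4, 0.3

Q, _ = np.linalg.qr(rng.randn(n, d))        # Q has orthonormal columns: Q^T Q = I
beta_true = np.array([2.0, -1.0, 0.2, 0.0])
Y = Q @ beta_true + 0.1 * rng.randn(n)

beta_ols = Q.T @ Y                                                       # OLS solution when X^T X = I
beta_closed = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0)  # soft thresholding

# generic numerical minimizer of (1/2)||Y - X beta||^2 + lam * ||beta||_1, for comparison
objective = lambda b: 0.5 * np.sum((Y - Q @ b) ** 2) + lam * np.sum(np.abs(b))
beta_num = minimize(objective, np.zeros(d), method='Nelder-Mead',
                    options={'xatol': 1e-9, 'fatol': 1e-9, 'maxiter': 20000}).x

print(np.round(beta_closed, 4))
print(np.round(beta_num, 4))    # should agree up to solver tolerance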

2)

Now, in the general case, compute the gradient of the objective function of the LASSO (we will denote by $X_j$ the $j$-th column of $X$).

$$\min_{β∈ ℝ^d} \underbrace{\frac 1 2 \sum\limits_{ i } (y_i - β^T x_i)^2}_{\text{RSS (Residual Sum of Squares)}} + \underbrace{λ \Vert β \Vert_1}_{\text{reg}}$$

$$\nabla \text{RSS} = - X^T (Y - X β)$$

$$ \begin{align*} \frac{\partial \text{RSS}}{\partial β_j} & = j\text{-th coord of } \nabla\text{RSS}\\ & = - \sum\limits_{ i=1 }^n x_{i, j} (y_i - β^T x_i) \\ & = \underbrace{-\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)}_{≝ \; ρ_j} + \underbrace{\sum\limits_{ i=1 }^n x_{ij}^2}_{≝ \; r_j} β_j \end{align*} $$
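
A small finite-difference check of this gradient (purely illustrative; the `_chk` names are local to this snippet):

import numpy as np

# check that grad RSS = -X^T (Y - X beta) for RSS(beta) = (1/2) sum_i (y_i - beta^T x_i)^2
rng = np.random.RandomState(42)
n, d = 20, 5
X_chk, y_chk, beta_chk = rng.randn(n, d), rng.randn(n), rng.randn(d)

rss = lambda b: 0.5 * np.sum((y_chk - X_chk @ b) ** 2)
grad_analytic = -X_chk.T @ (y_chk - X_chk @ beta_chk)

h = 1e-6
grad_fd = np.array([(rss(beta_chk + h * np.eye(d)[j]) - rss(beta_chk - h * np.eye(d)[j])) / (2 * h)
                    for j in range(d)])
print(np.max(np.abs(grad_analytic - grad_fd)))   # should be tiny (limited only by floating-point error)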

Let us denote by $F$ the function to minimize (RSS+reg).

One cannot directly solve $\nabla F = 0$: $F$ is not differentiable at points where some $β_j = 0$, so one works with the subdifferential of the absolute value.


NB: If

$$f(x) = \vert x \vert$$

then its subdifferential is:

$$ \partial f(x) ≝ \begin{cases} -1 &&\text{ if } x<0 \\ [-1, 1] &&\text{ if } x=0\\ 1 &&\text{ else if } x >0 \end{cases}$$


$$\frac{\partial F}{\partial β_j}(β_j) = ρ_j + r_j β_j + \begin{cases} -λ &&\text{ if } β_j<0 \\ [-λ, λ] &&\text{ if } β_j=0\\ λ &&\text{ else if } β_j >0 \end{cases} \\ = \begin{cases} ρ_j + r_j β_j -λ &&\text{ if } β_j<0 \\ [ρ_j-λ, ρ_j+λ] &&\text{ if } β_j=0\\ ρ_j + r_j β_j +λ &&\text{ else if } β_j >0 \end{cases}$$

3)

Given a fixed $\beta \in \mathbb{R}^d$, solve the j-th coordinate optimisation problem given by $$ \min_{\beta_j \in \mathbb{R}} \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda {\lVert}\beta{\rVert}_1$$

Case 1: $β_j < 0$

$$FOC \text{ (First Order Condition)} ⟹ β_j = \frac{-ρ_j + λ}{r_j}$$

But $β_j < 0$, so $$ρ_j > λ$$

Case 2: $β_j > 0$

$$FOC ⟹ β_j = -\frac{ρ_j + λ}{r_j}$$

But $β_j > 0$, so $$ρ_j < -λ$$

Case 3: $β_j = 0$

$$FOC ⟹ 0 ∈ [ρ_j-λ, ρ_j+λ]$$

So $$ρ_j ∈ [-λ, λ]$$


$$β_j = \underbrace{\begin{cases} -\frac{ρ_j + λ}{r_j} &&\text{ if } ρ_j<-λ \\ 0 &&\text{ if } -λ ≤ ρ_j ≤ λ\\ \frac{-ρ_j + λ}{r_j} &&\text{ else if } ρ_j > λ \end{cases}}_{= \; \underbrace{st}_{\rlap{\text{soft thresholding}}}\left(-\frac{ρ_j}{r_j}, \; \frac{λ}{r_j}\right)} $$

where $st(z, γ) ≝ sign(z) \left(\vert z \vert - γ\right)_+$ is the soft-thresholding operator from question 1.
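
As a cross-check of the equivalence claimed under the brace, the piecewise formula and the soft-thresholding expression can be compared numerically (a small sketch; the helper names are illustrative):

import numpy as np

def st(z, gamma):
    # soft-thresholding operator: sign(z) * (|z| - gamma)_+
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_minimizer(rho_j, r_j, lam):
    # piecewise formula derived above
    if rho_j < -lam:
        return -(rho_j + lam) / r_j
    elif rho_j > lam:
        return (lam - rho_j) / r_j
    return 0.0

# the two expressions coincide on random inputs
rng = np.random.RandomState(1)
for _ in range(10):
    rho, r, lam = rng.randn(), rng.rand() + 0.5, rng.rand()
    assert np.isclose(coordinate_minimizer(rho, r, lam), st(-rho / r, lam / r))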

4)

Derive an iterative algorithm to solve the LASSO problem

One iteratively minimizes over each coordinate: for each $j$, one fixes $(β_k)_{k≠j}$ and minimizes $F(β)$ with respect to the $j$-th coordinate only, using the update from question 3.

Stopping criterion: $\max\limits_j \vert β_j^{new} - β_j \vert < ε$

compute r_j for all j
while not converged:
    for j=1, ..., d:
        compute ρ_j
        β_j = st(-ρ_j/r_j, λ/r_j)

where

  • $$r_j ≝ \sum\limits_{ i } x_{ij}^2$$
  • $$ρ_j ≝ -\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)$$
  • NB: since $$\frac{\partial \text{RSS}}{\partial β_j} = ρ_j + r_j β_j,$$ the unpenalized (coordinate-wise OLS) update would simply be $$β^{OLS}_j = -\frac{ρ_j}{r_j}$$

Implementation

We'll consider the polynomial regression problem from TP1, but this time we'll add L1 penalty to avoid overfitting.

5)

Generate data $(x_i,y_i) \in \mathbb{R}^2$ where $x$ is a random vector and $y = f(x) + \varepsilon$ with $f$ and $\varepsilon$ are the function and white noise of your choice. Display the data along with $f$.

In [1]:
# /!\: I'm using PYTHON3

import plotly
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

import matplotlib.pyplot as plt
import scipy.io as sio
from scipy import stats
%matplotlib inline

plotly.offline.init_notebook_mode(connected=True)
from IPython.core.display import display, HTML
# The polling here is to ensure that plotly.js has already been loaded before
# setting display alignment in order to avoid a race condition.
display(HTML(
    '<script>'
        'var waitForPlotly = setInterval( function() {'
            'if( typeof(window.Plotly) !== "undefined" ){'
                'MathJax.Hub.Config({ SVG: { font: "STIX-Web" }, displayAlign: "center" });'
                'MathJax.Hub.Queue(["setRenderer", MathJax.Hub, "SVG"]);'
                'clearInterval(waitForPlotly);'
            '}}, 250 );'
    '</script>'
))
In [2]:
# Number of points
nb_points = 50

# Data generation

x_inf, x_sup = 0, 5 # x_inf <= x_i <= x_sup

x = (x_sup - x_inf) * np.random.rand(nb_points) + x_inf

f = np.vectorize(lambda x: np.exp(1.1*x)*np.sinc(1.5*x))

y = f(x) + np.random.randn(nb_points)


# Splitting the data set into a training set and a testing set
ind = {}
ind['train'] = range(0, int(0.5 * nb_points))
ind['test'] = range(int(0.5 * nb_points), nb_points)


# Plotting with Plotly
x_linspace = np.linspace(x_inf, x_sup, nb_points)
    
trace_points = []

for test_train in ['test', 'train']:
    trace_points.append(
        go.Scatter(
            x = x[ind[test_train]],
            y = y[ind[test_train]],
            mode = 'markers',
            marker = dict(
                size = 10,
                line = dict(
                    width = 2,
                    color = 'rgb(0, 0, 0)'
                )
            ),
            name = '{}ing set'.format(test_train.capitalize())
        )
    )


trace_target = [go.Scatter(
    x = x_linspace,
    y = f(x_linspace),
    mode = 'lines',
    name = 'Targeted function f',
    line = dict(
        color = 'red',
        shape = 'spline',
        dash = 'longdash'
        )
    )]

def layout(title='Data points'):
    return dict(
        title = title,
        xaxis = dict(
            title = '$x$',
            ticklen = 5,
            zeroline = False,
            gridwidth = 2,
            ),
        yaxis = dict(
            title = '$y$',
            range = [int(min(np.min(f(x_linspace)), np.min(y)))-1.5, int(max(np.max(f(x_linspace)), np.max(y)))+1.5],
            ticklen = 5,
            gridwidth = 2,
            ),
        legend = dict(
            orientation = 'h',
            y = -0.2
        )
    )

fig = dict(data=trace_points+trace_target, layout=layout())
plotly.offline.iplot(fig)

6)

Solve the linear regression problem for polynomials of various degrees $d$ and plot the associated solutions.

In [3]:
# Vandermonde design matrix
max_deg = 7
X = np.vander(x, max_deg+1)

# Standardization
mean     = np.mean(X[ind['train'], :], axis=0)
mean[-1] = 0 # to preserve the bias (constant) term
std_dev = np.std(X[ind['train'], :], axis=0)
std_dev[-1] = 1 # to avoid dividing by zero

X = (X-mean)/std_dev


# Matrix to draw the polynomial functions
P  = np.vander(x_linspace, max_deg+1)
P  = (P-mean)/std_dev

trace_poly = []

# Colorscale for polynomials
def colorscale_list(cmap, number_colors=max_deg, return_rgb_only=False):
    cm = plt.get_cmap(cmap)
    colors = [np.array(cm(i/number_colors)) for i in range(1, number_colors+1)]
    rgb_colors_plotly = []
    rgb_colors_only = []
    for i, c in enumerate(colors):
        col = 'rgb{}'.format(tuple(255*c[:-1]))
        rgb_colors_only.append(col)
        rgb_colors_plotly.append([i/number_colors, col])
        rgb_colors_plotly.append([(i+1)/number_colors, col])
    return rgb_colors_only if return_rgb_only else rgb_colors_plotly

colors = colorscale_list('Greens', number_colors=max_deg+3, return_rgb_only=True)

W_poly = []
for i in range(1, max_deg+1):
    # Linear regression
    W_i = np.linalg.solve(X[ind['train'],-(i+1):].T.dot(X[ind['train'],-(i+1):]),\
                          X[ind['train'],-(i+1):].T.dot(y[ind['train']]))
    
    W_poly.append(W_i)
    
    # Plot
    trace_poly.append(
        go.Scatter(
            x = x_linspace,
            y = P[:, -(i+1):].dot(W_i), 
            mode = 'lines',
            name = 'Poly. of deg. {}'.format(i),
            line = dict(
                width = 3,
                color = colors[i+1],
                shape = 'spline',
                dash = 'solid'
            )
        )
    )

fig = dict(data=trace_points+trace_target+trace_poly, layout=layout(title='Linear Regression'))
plotly.offline.iplot(fig)

7)

Implement the coordinate descent algorithm from question 4 to solve the lasso problem.

Advice: first, implement auxiliary functions soft_thresholding and lasso_step which solve the partial problem for a given index $j$.

compute r_j for all j
while not converged:
    for j=1, ..., d:
        compute ρ_j
        β_j = st(-ρ_j/r_j, λ/r_j)

  • $r_j ≝ \sum\limits_{ i } x_{ij}^2$
  • $ρ_j ≝ -\sum\limits_{ i=1 }^n x_{i,j} y_i + \sum\limits_{ i=1 }^n x_{i,j} \left(\sum\limits_{ k≠j } β_k x_{ik}\right)$
In [4]:
def coordinate_descent(x=X, y=y, lam=0.00001, eps=1e-3):
    n, d = x.shape
    beta = np.zeros(d)
    
    r = np.linalg.norm(x, axis=0)**2
    
    def soft_thresholding(rho_j, j):
        if rho_j < -lam:
            return -(rho_j+lam)/r[j]
        elif -lam <= rho_j <= lam:
            return 0
        else:
            return (-rho_j+lam)/r[j]
    
    # get all k indices different from j, for all j
    masks = [np.arange(d) != j for j in range(d)]

    def lasso_step(j):
        rho_j = x[:, j].dot(-y + x[:, masks[j]].dot(beta[masks[j]]))
        return soft_thresholding(rho_j, j)
    
    while True:
        # stopping criterion: maximum value of how beta[j] changed
        max_change = 0
        
        for j in range(d):
            new_beta_j = lasso_step(j)
            max_change = max(max_change, abs(beta[j]-new_beta_j))
            beta[j] = new_beta_j
        
        if max_change < eps:
            break
    
    return beta
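
Optionally, the implementation can be sanity-checked against scikit-learn (not required by the TP; this sketch assumes X, y and ind from the earlier cells are in scope). sklearn's Lasso minimizes $\frac{1}{2n}\Vert y - Xw \Vert^2 + α\Vert w \Vert_1$ while coordinate_descent minimizes $\frac 1 2 \Vert y - Xβ \Vert^2 + λ\Vert β \Vert_1$, so one takes $α = λ/n$, with fit_intercept=False because X already contains a constant column:

from sklearn.linear_model import Lasso

lam_check = 0.1
n_train = len(ind['train'])

beta_cd = coordinate_descent(x=X[ind['train'], :], y=y[ind['train']], lam=lam_check, eps=1e-8)

skl = Lasso(alpha=lam_check / n_train, fit_intercept=False, max_iter=100000, tol=1e-10)
skl.fit(X[ind['train'], :], y[ind['train']])

print(np.round(beta_cd, 4))
print(np.round(skl.coef_, 4))   # should be close (up to the two solvers' tolerances)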

8)

Plot the solution of the lasso problem for different values of $\lambda$ and compare it to linear regression

In [141]:
lam_list = [1e-4, 0.001, 0.1, 1, 10]
colors_lasso = colorscale_list('Purples', number_colors=len(lam_list)+3, return_rgb_only=True)
W_lam = []
trace_lasso = []

for i, lam in enumerate(lam_list):
    # Lasso regression
    W_i = coordinate_descent(lam=lam)
    W_lam.append(W_i)
    
    # Plot
    trace_lasso.append(
        go.Scatter(
            x = x_linspace,
            y = P.dot(W_i), 
            mode = 'lines',
            name = 'LASSO λ={}'.format(lam),
            line = dict(
                width = 3,
                color = colors_lasso[i+1],
                shape = 'spline',
                dash = 'solid'
            )
        )
    )

fig = dict(data=trace_points+trace_target+trace_lasso, layout=layout(title='LASSO Regression'))
plotly.offline.iplot(fig)
In [142]:
# Comparison with linear regression

fig = dict(data=trace_points+trace_target+trace_lasso\
           +[trace_poly[l] for l in range(0, len(trace_poly), 2)], layout=layout(title="LASSO vs Linear Regression"))
plotly.offline.iplot(fig)

As expected, the larger the regularization parameter $λ$, the stronger the shrinkage of the coefficients and the less the fitted polynomial overfits the training points; for very large $λ$ the coefficients are driven towards zero and the model eventually underfits.
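
To make this more quantitative, one possible check (a sketch, reusing X, y, ind, lam_list and coordinate_descent from the cells above) is to refit on the training rows only and compare training and test mean squared errors across λ:

# refit the LASSO on the training rows only and report errors on the held-out test rows
for lam in lam_list:
    W_i = coordinate_descent(x=X[ind['train'], :], y=y[ind['train']], lam=lam)
    mse_train = np.mean((y[ind['train']] - X[ind['train'], :].dot(W_i)) ** 2)
    mse_test  = np.mean((y[ind['test']]  - X[ind['test'],  :].dot(W_i)) ** 2)
    print('lambda = {:g}: train MSE = {:.3f}, test MSE = {:.3f}'.format(lam, mse_train, mse_test))

Too small a λ should give a low training error but a larger test error (overfitting), while too large a λ should degrade both.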