Lecture 5: Convex analysis/optimization

Convex optimization

Unconstrained

  1. Toolbox ⟹ CVX

    • CVX: a package to numerically solve problems of this form (e.g. a linear program): $\begin{cases} \sup c^T x \\
      Ax = b \\
      x ≥ 0 \end{cases}$
  2. Ellipsoid

    • in 1D: dichotomy (bisection)
    • in higher dimension: $E_k ≝ \lbrace x ∣ (x - x_k)^T P_k^{-1} (x - x_k) ≤ 1 \rbrace$ where $P_k$ is positive definite
    • ex: if $P_k = σ^2 I_n$ ⟶ ball of radius $σ$

     Algorithm:

      • maintain the invariant $x^\ast ∈ E_k$
      • reduce the “size” (volume) of $E_k$ at each step
      • by convexity, $f(x) ≥ f(x_k) + f'(x_k)^T (x - x_k)$, so $x^\ast$ lies in the half-ellipsoid $E_k ∩ \lbrace x ∣ (x - x_k)^T f'(x_k) ≤ 0 \rbrace$
      • take $E_{k+1}$ to be the minimum-volume ellipsoid containing this half-ellipsoid
    

    $d(x_k, x^\ast) = O(\exp(- \frac{k}{12 n^2}))$

    NB: contrary to gradient descent, it cannot “get lucky”: the $O(⋯)$ bound is essentially an equality.

  3. Gradient descent
  4. Newton
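
The central-cut ellipsoid update sketched above (minimum-volume ellipsoid containing the half-ellipsoid) has a standard closed form. Here is a minimal NumPy sketch, assuming $n ≥ 2$; the names `f`, `grad`, `x0`, `P0` are illustrative, not from the lecture:

```python
import numpy as np

def ellipsoid_method(f, grad, x0, P0, iters=200):
    """Minimize a convex f with the (central-cut) ellipsoid method.
    Invariant: E_k = {x : (x - x_k)^T P_k^{-1} (x - x_k) <= 1} contains x*."""
    n = len(x0)
    x, P = np.array(x0, dtype=float), np.array(P0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for _ in range(iters):
        g = grad(x)
        gt = g / np.sqrt(g @ P @ g)                 # normalized cut direction
        x = x - (P @ gt) / (n + 1)                  # new center
        # minimum-volume ellipsoid containing the half-ellipsoid:
        P = n**2 / (n**2 - 1) * (P - 2 / (n + 1) * np.outer(P @ gt, P @ gt))
        if f(x) < best_f:                           # keep the best iterate
            best_x, best_f = x.copy(), f(x)
    return best_x
```

Note the volume of $E_k$ shrinks by a constant factor per step, which gives the geometric (but dimension-dependent) rate stated below.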

Gradient descent

Algorithm:

  • $x_{k+1} = x_k - γ f’(x_k)$
    • Choice of $γ$:
      • constant
      • line-search
        • exact: $\inf_{γ≥0} f(x_k - γ f’(x_k))$
        • inexact

Proposition: Assume

  • $f$ cvx and $C^2$
  • all eigenvalues of $f''(x)$ lie in $[μ, L]$ for all $x$ (where $μ$ lower-bounds the smallest eigenvalue and $L$ upper-bounds the largest)

NB:

  • always $μ ≥ 0$
  • $μ > 0 ⟺ f \text{ strongly convex}$

Algorithm:

  • $x_{k+1} = x_k - γ f’(x_k)$
  • if $γ = \frac 1 L$, then $\begin{cases} \Vert x_k - x_\ast \Vert^2 ≤ (1 - \frac μ L)^k \Vert x_0 - x_\ast \Vert^2 \\ f(x_k) - f(x_\ast) ≤ \frac L k \Vert x_0 - x_\ast \Vert^2 \end{cases}$

Summary:

  • If $f$ is strongly convex, gradient descent is linearly/geometrically convergent (warning: “linearly” here means the error decays like $\exp(- c k)$, i.e. the number of correct digits grows linearly)

    • $O(n)$ per iteration
    • if $f$ is not convex, then there’s convergence to a stationary point only
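
To illustrate the geometric rate, a minimal sketch of constant-step gradient descent on a quadratic with Hessian eigenvalues in $[μ, L] = [1, 10]$ (the names `A`, `c` are illustrative):

```python
import numpy as np

def gradient_descent(grad, x0, L, iters=100):
    """Gradient descent with the constant step gamma = 1/L."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# f(x) = 1/2 (x - c)^T A (x - c): mu = 1, L = 10,
# so the error shrinks by a factor (1 - mu/L) = 0.9 per step
A = np.diag([1.0, 10.0])
c = np.array([1.0, -1.0])
x = gradient_descent(lambda x: A @ (x - c), np.zeros(2), L=10.0, iters=100)
```

After 100 steps the error is at most $0.9^{100} \Vert x_0 - x_\ast \Vert ≈ 2.7 \cdot 10^{-5} \Vert x_0 - x_\ast \Vert$, matching the proposition.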

Newton’s method

Idea: optimize local quadratic Taylor expansion

Advantages:

  • no parameter
  • quadratically convergent: $c \Vert x_{k+1} - x_\ast \Vert ≤ (c \Vert x_k - x_\ast \Vert)^2$

Disadvantages:

  • Unstable far from $x^\ast$
  • $O(n^3)$ per iteration: each step is very expensive
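
A minimal sketch of the Newton step (the $O(n^3)$ cost is the linear solve; names are illustrative):

```python
import numpy as np

def newton(grad, hess, x0, iters=10):
    """Newton's method: x_{k+1} = x_k - f''(x_k)^{-1} f'(x_k).
    Each step solves an n x n linear system: O(n^3)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# On an exactly quadratic f(x) = 1/2 x^T A x - b^T x, the local quadratic
# Taylor expansion is f itself, so a single Newton step lands on x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2), iters=1)
```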

Ridge regression estimator $\hat{w}$:
$\hat{w} ≝ \operatorname{argmin}_w \frac{1}{2n} \Vert y - Xw \Vert^2_2 + \frac λ 2 \Vert w \Vert^2_2$

In the lab (TP), we showed that if $λ = 0$ (assuming $X^T X$ invertible):

$\hat{w}_0 = (X^T X)^{-1} X^T y$

Solve the first-order condition ($\nabla F = 0$):

$F$ is convex, so $\hat{w}$ satisfies $\nabla F(\hat{w}) = 0$.

\begin{align*} 0 = \nabla F (w) & = \frac{-1}{n} X^T (y - Xw) + λ w \\ ⟹ & \hat{w} = (λn I + X^T X)^{-1} X^T y\\ \end{align*}
$F$ is $λ$-strongly convex:

$F - \frac λ 2 \Vert \bullet \Vert^2_2 \text{ is convex}$
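
The closed form derived above can be checked numerically: at $\hat{w}$ the gradient must vanish. A sketch with synthetic data (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

# ridge estimator from the first-order condition:
# (lam * n * I + X^T X) w = X^T y
w_hat = np.linalg.solve(lam * n * np.eye(d) + X.T @ X, X.T @ y)

# gradient of F at w_hat -- should be (numerically) zero
g = -X.T @ (y - X @ w_hat) / n + lam * w_hat
```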

Halting conditions for gradient descent:

  1. $\Vert w_n - w_{n+1} \Vert ≤ ε$
  2. $\Vert \nabla F\Vert ≤ ε$
  3. $\Vert w_n - w^\ast \Vert ≤ ε$ (kind of cheating: it requires already knowing the solution $w^\ast$)

where

$\nabla F (w) = \frac{-1}{n} X^T (y - Xw) + λ w$
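
A sketch of gradient descent on $F$ using halting condition 2 (function and parameter names are illustrative):

```python
import numpy as np

def ridge_gd(X, y, lam, gamma, eps=1e-8, max_iter=100_000):
    """Gradient descent on F(w) = ||y - Xw||^2 / (2n) + lam/2 ||w||^2,
    stopping when ||grad F(w)|| <= eps (halting condition 2)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        g = -X.T @ (y - X @ w) / n + lam * w
        if np.linalg.norm(g) <= eps:
            break
        w = w - gamma * g
    return w
```

A safe constant stepsize is $γ = 1/L$ with $L = λ_{\max}(\frac 1 n X^T X) + λ$, the largest eigenvalue of the Hessian of $F$.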

Gradient descent:

$w_{n+1} = w_n - γ \nabla F (w_n)$

Optimal stepsize

\begin{align*} γ^\ast_n & = \operatorname{argmin}_γ F(w_{n+1}) \\ & = \operatorname{argmin}_γ F(w_n - γ \nabla F(w_n)) \\ & = \operatorname{argmin}_γ \frac{1}{2n} \Vert y - X(w_n - γ \nabla F(w_n)) \Vert^2_2 + \frac λ 2 \Vert w_n - γ \nabla F(w_n) \Vert^2_2 \\ & = \operatorname{argmin}_γ \underbrace{\frac{1}{2n} \Vert y - X(w_n + \frac{γ}{n} X^T (y - Xw_n) - γ λ w_n) \Vert^2_2 + \frac λ 2 \Vert w_n + \frac{γ}{n} X^T (y - Xw_n) - γ λ w_n \Vert^2_2}_{≝ g(γ)} \\ \end{align*}
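
Since $F$ is quadratic, $g(γ)$ is itself a quadratic in $γ$, so the exact line search has a closed form: $γ^\ast_n = \frac{\Vert g \Vert^2}{g^T H g}$ with $g = \nabla F(w_n)$ and $H = \frac 1 n X^T X + λ I$ the Hessian of $F$. A sketch (names illustrative):

```python
import numpy as np

def exact_stepsize(X, y, lam, w):
    """Exact line-search stepsize for ridge regression:
    gamma* = ||g||^2 / (g^T H g), where g = grad F(w)
    and H = X^T X / n + lam * I is the (constant) Hessian of F."""
    n = X.shape[0]
    g = -X.T @ (y - X @ w) / n + lam * w
    Hg = X.T @ (X @ g) / n + lam * g      # H @ g without forming H
    return (g @ g) / (g @ Hg)
```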
