Lecture 5: Convex analysis/optimization
Convex optimization
Unconstrained
Toolbox ⟹ CVX
- CVX: a package to use if you want to get the solution to a problem such as: $\begin{cases}
\sup c^T x \\
Ax = b \\
x ≥ 0 \end{cases}$
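For instance, a minimal sketch with cvxpy (the Python analogue of CVX); the data `c`, `A`, `b` below are made up purely for illustration, and chosen so that the feasible set is bounded and non-empty:

```python
import numpy as np
import cvxpy as cp

# Made-up LP data (illustration only): the first row of A forces sum(x) = 1,
# so the feasible set {Ax = b, x >= 0} is bounded and the sup is finite.
np.random.seed(0)
n, m = 5, 3
c = np.random.randn(n)
A = np.vstack([np.ones(n), np.random.randn(m - 1, n)])
x_feas = np.abs(np.random.randn(n)); x_feas /= x_feas.sum()   # a feasible point
b = A @ x_feas

x = cp.Variable(n)
problem = cp.Problem(cp.Maximize(c @ x), [A @ x == b, x >= 0])
problem.solve()
print(problem.status, problem.value, x.value)
```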
Ellipsoid
- in 1D: dichotomy
- in higher dimension: $E_k ≝ \lbrace (x - x_k)^T P_k^{-1} (x - x_k) ≤ 1 \rbrace$ where $P_k$ is positive def.
- ex: if $P_k = σ^2 I_n$ ⟶ ball of radius $σ$
Algorithm: make sure that $x^\ast ∈ E_k$ while reducing the "size" of $E_k$:
- by convexity, $f(x) ≥ f(x_k) + f'(x_k)^T (x - x_k)$, so the minimizer lies in the half-space $\lbrace (x-x_k)^T f'(x_k) ≤ 0 \rbrace$
- find $E_{k+1}$ as the minimum-volume ellipsoid containing $E_k ∩ \lbrace (x-x_k)^T f'(x_k) ≤ 0 \rbrace$
$d(x_k, x^\ast) = O(\exp(- \frac{k}{12 n^2}))$
NB: contrary to gradient descent, it cannot “get lucky”, in that the $O(⋯)$ is almost equality.
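A minimal sketch of one such update, using the standard central-cut minimum-volume-ellipsoid formulas (valid for $n ≥ 2$); the quadratic test function and the initial ball are made up for illustration:

```python
import numpy as np

def ellipsoid_step(x, P, g):
    """One central-cut update: returns the center and matrix of the minimum-volume
    ellipsoid containing E_k ∩ {z : g^T (z - x) <= 0}."""
    n = len(x)
    g_tilde = g / np.sqrt(g @ P @ g)                    # normalize the (sub)gradient
    x_new = x - (P @ g_tilde) / (n + 1)
    P_new = n**2 / (n**2 - 1) * (P - 2 / (n + 1) * np.outer(P @ g_tilde, P @ g_tilde))
    return x_new, P_new

# Toy problem: minimize f(x) = ||x - x*||^2, starting from the ball of radius 10
np.random.seed(0)
n = 5
x_star = np.random.randn(n)
x, P = np.zeros(n), 100.0 * np.eye(n)
best = np.inf
for k in range(1000):
    best = min(best, np.sum((x - x_star) ** 2))
    x, P = ellipsoid_step(x, P, 2 * (x - x_star))       # gradient of f at x
print(best)   # the best value found decays slowly, roughly like exp(-k/n^2)
```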
- Gradient descent
- Newton
Gradient descent
Algorithm:
- $x_{k+1} = x_k - γ f’(x_k)$
- Choice of $γ$:
- constant
- line-search
- exact: $γ_k = argmin_{γ≥0} f(x_k - γ f’(x_k))$
- inexact (e.g. backtracking; see the sketch below)
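As a sketch of the inexact option, here is gradient descent with a backtracking (Armijo) line search; the parameters `alpha`, `beta` and the quadratic test function are assumptions for illustration, not values from the lecture:

```python
import numpy as np

def gd_backtracking(f, grad, x0, alpha=0.3, beta=0.5, n_iter=100):
    """Gradient descent where γ is shrunk until the Armijo sufficient-decrease
    condition f(x - γ g) <= f(x) - alpha * γ * ||g||^2 holds."""
    x = x0
    for _ in range(n_iter):
        g = grad(x)
        gamma = 1.0
        while f(x - gamma * g) > f(x) - alpha * gamma * (g @ g):
            gamma *= beta                                # backtrack: γ ← βγ
        x = x - gamma * g
    return x

# Toy strongly convex quadratic, minimized at [0, 0]
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_min = gd_backtracking(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.array([5.0, -3.0]))
print(x_min)   # ≈ [0, 0]
```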
Proposition: Assume
- $f$ cvx and $C^2$
- all eigenvalues of $f''(x)$ lie in $[μ, L]$ for all $x$ (i.e. $μ$ lower-bounds the smallest eigenvalue and $L$ upper-bounds the largest one, uniformly in $x$)
NB:
- always $μ ≥ 0$
- $μ > 0 ⟺ f \text{ strongly convex}$
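For instance (a standard special case, added for concreteness): for a quadratic, $μ$ and $L$ are exactly the extreme eigenvalues of the (constant) Hessian:
\[f(x) = \frac 1 2 x^T A x - b^T x \quad ⟹ \quad f''(x) = A, \qquad μ = λ_{\min}(A), \quad L = λ_{\max}(A)\]
and $f$ is strongly convex iff $λ_{\min}(A) > 0$, i.e. $A$ is positive definite.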
Algorithm:
- $x_{k+1} = x_k - γ f’(x_k)$
- if $γ = \frac 1 L$, then \(\begin{cases} \Vert x_k - x_\ast \Vert^2 ≤ (1 - \frac μ L)^k \Vert x_0 - x_\ast \Vert^2 \\ f(x_k) - f(x_\ast) ≤ \frac L k \Vert x_0 - x_\ast \Vert^2 \end{cases}\)
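A quick numerical sanity check of the $γ = \frac 1 L$ bound on a made-up strongly convex quadratic (here $μ$ and $L$ are simply the extreme eigenvalues of the fixed Hessian $A$):

```python
import numpy as np

# f(x) = 1/2 x^T A x, minimized at x* = 0, with prescribed eigenvalues μ = 1, L = 50
np.random.seed(0)
d = 20
Q, _ = np.linalg.qr(np.random.randn(d, d))
eigs = np.linspace(1.0, 50.0, d)
A = Q @ np.diag(eigs) @ Q.T
mu, L = eigs.min(), eigs.max()

x0 = np.random.randn(d)
x, k = x0.copy(), 200
for _ in range(k):
    x = x - (1.0 / L) * (A @ x)        # gradient step with γ = 1/L
print(np.linalg.norm(x) ** 2, "<=", (1 - mu / L) ** k * np.linalg.norm(x0) ** 2)
```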
Summary:
If $f$ is strongly convex, gradient descent is linearly/geometrically convergent (warning: “linearly” here means the error is multiplied by a constant factor $< 1$ at each step, i.e. it decays like $\exp(-ck)$, so the number of correct digits grows linearly)
- $O(n)$ per iteration
- if $f$ is not convex, then there’s convergence to a stationary point only
Newton’s method
Idea: optimize local quadratic Taylor expansion
Advantages:
- no parameter
- quadratically convergent: \(c \Vert x_{k+1} - x_\ast \Vert ≤ (c \Vert x_k - x_\ast \Vert)^2\)
Disadvantages:
- Unstable far from $x^\ast$
- $O(n^3)$ per iteration: each step is very expensive
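A minimal sketch of the Newton iteration $x_{k+1} = x_k - f''(x_k)^{-1} f'(x_k)$; the smooth strictly convex test function is made up for illustration, and the linear solve at each step is what makes an iteration cost $O(n^3)$:

```python
import numpy as np

def newton(grad, hess, x0, n_iter=20):
    """Newton's method: repeatedly minimize the local quadratic Taylor expansion."""
    x = x0
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))   # O(n^3) linear solve per step
    return x

# Toy function f(x) = sum_i (exp(x_i) - x_i), minimized at x = 0
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(newton(grad, hess, np.array([1.0, -0.5, 2.0])))   # ≈ [0, 0, 0]
```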
- Ridge regression estimator $\hat{w}$:
- \[\hat{w} ≝ argmin_w \, F(w), \qquad F(w) ≝ \frac{1}{2n} \Vert y - Xw \Vert^2_2 + \frac λ 2 \Vert w \Vert^2_2\]
In the practical session (TP), we showed that if $λ = 0$ (ordinary least squares):
\[\hat{w}_0 = (X^T X)^{-1} X^T y\]
Solve the first-order condition ($\nabla F = 0$)
$F$ is convex, so $w^\ast$ satisfies $\nabla F(w^\ast) = 0$.
\[\begin{align*} 0 = \nabla F (w) & = \frac{-1}{n} X^T (y - Xw) + λ w \\ ⟹ & \hat{w} = (λn I + X^T X)^{-1} X^T y\\ \end{align*}\]
- $F$ is $λ$-strongly convex:
- \[F - \frac λ 2 \Vert \bullet \Vert^2 \text{ is convex }\]
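A quick numerical check of this closed form on made-up data (the data and $λ$ below are purely illustrative):

```python
import numpy as np

# Made-up regression data
np.random.seed(0)
n, d, lam = 200, 10, 0.5
X = np.random.randn(n, d)
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)

# Closed-form ridge estimator ŵ = (λ n I + X^T X)^{-1} X^T y
w_hat = np.linalg.solve(lam * n * np.eye(d) + X.T @ X, X.T @ y)

# It satisfies the first-order condition ∇F(ŵ) = -1/n X^T (y - X ŵ) + λ ŵ = 0
print(np.linalg.norm(-X.T @ (y - X @ w_hat) / n + lam * w_hat))   # ≈ 0
```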
Halting conditions for gradient descent:
- $\Vert w_n - w_{n+1} \Vert ≤ ε$
- $\Vert \nabla F\Vert ≤ ε$
- $\Vert w_n - w^\ast \Vert ≤ ε$ (kind of cheating: this requires already knowing the minimizer $w^\ast$)
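A sketch of the second criterion in action: gradient descent on the ridge objective with constant stepsize $γ = 1/L$, stopped when $\Vert \nabla F(w_n) \Vert ≤ ε$ (the data, $λ$ and $ε$ are assumptions for illustration):

```python
import numpy as np

# Made-up ridge problem: F(w) = 1/(2n) ||y - Xw||^2 + λ/2 ||w||^2
np.random.seed(0)
n, d, lam, eps = 200, 10, 0.5, 1e-8
X = np.random.randn(n, d)
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)

L = np.linalg.eigvalsh(X.T @ X / n).max() + lam   # largest eigenvalue of the Hessian
w = np.zeros(d)
while True:
    g = -X.T @ (y - X @ w) / n + lam * w          # ∇F(w)
    if np.linalg.norm(g) <= eps:                  # halting condition ||∇F|| ≤ ε
        break
    w = w - g / L                                 # step with γ = 1/L

w_closed = np.linalg.solve(lam * n * np.eye(d) + X.T @ X, X.T @ y)
print(np.linalg.norm(w - w_closed))               # ≈ 0: agrees with the closed form
```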
Gradient descent:
\[w_{n+1} = w_n - γ \nabla F (w_n)\]
Optimal stepsize:
\[\begin{align*} γ^\ast_n & = argmin_γ F(w_{n+1}) \\ & = argmin_γ F(w_n - γ \nabla F(w_n)) \\ & = argmin_γ \frac{1}{2n} \Vert y - X(w_n - γ \nabla F(w_n)) \Vert^2_2 + \frac λ 2 \Vert w_n - γ \nabla F(w_n) \Vert^2_2 \\ & = argmin_γ \underbrace{\frac{1}{2n} \Vert y - X(w_n + \frac{γ}{n} X^T (y - Xw_n) - γ λ w_n) \Vert^2_2 + \frac λ 2 \Vert w_n + \frac{γ}{n} X^T (y - Xw_n) - γ λ w_n \Vert^2_2}_{= g(γ)} \\ \end{align*}\]
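Since $F$ is quadratic, $g(γ)$ is a convex quadratic function of $γ$, so the argmin is available in closed form: writing $g_n ≝ \nabla F(w_n)$ and $H ≝ \frac 1 n X^T X + λ I$ (the Hessian of $F$), setting $g'(γ) = 0$ yields
\[γ^\ast_n = \frac{\Vert g_n \Vert_2^2}{g_n^T H g_n} = \frac{\Vert g_n \Vert_2^2}{\frac 1 n \Vert X g_n \Vert_2^2 + λ \Vert g_n \Vert_2^2}\]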