# Convex optimization

## Unconstrained

1. Toolbox ⟹ CVX

• CVX: a package to get the solution to problems such as the linear program $\begin{cases} \sup c^T x \\ Ax = b \\ x ≥ 0 \end{cases}$ (see the sketch below)
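
A minimal sketch using CVXPY (the Python incarnation of CVX); the data `A`, `b`, `c` below are made up for illustration and chosen so the LP is feasible and bounded:

```python
import cvxpy as cp
import numpy as np

# Hypothetical data, chosen so that the LP is feasible and bounded above.
rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = A @ np.abs(rng.standard_normal(n))   # a nonnegative feasible point exists
c = -rng.random(n)                        # c ≤ 0, so sup c^T x over x ≥ 0 is finite

x = cp.Variable(n)
prob = cp.Problem(cp.Maximize(c @ x), [A @ x == b, x >= 0])
prob.solve()
print(prob.status, prob.value, x.value)
```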
2. Ellipsoid

• in 1D: dichotomy
• in higher dimension: $E_k ≝ \lbrace (x - x_k)^T P_k^{-1} (x - x_k) ≤ 1 \rbrace$ where $P_k$ is positive def.
• ex: if $P_k = σ^2 I_n$ ⟶ ball of radius $σ$
Algorithm:

• maintain the invariant $x^\ast ∈ E_k$
• reduce the “size” of $E_k$ at each step
• by convexity, $f(x) ≥ f(x_k) + f'(x_k)^T (x - x_k)$, so the minimizer lies in the half-space $\lbrace x \mid f'(x_k)^T (x - x_k) ≤ 0 \rbrace$
• take $E_{k+1}$ to be the minimum-volume ellipsoid containing $E_k ∩ \lbrace x \mid f'(x_k)^T (x - x_k) ≤ 0 \rbrace$ (see the sketch below)


$d(x_k, x^\ast) = O(\exp(- \frac{k}{12 n^2}))$

NB: contrary to gradient descent, the ellipsoid method cannot “get lucky”: the $O(⋯)$ bound above is essentially an equality.
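
A minimal NumPy sketch of one central-cut update (using the standard closed-form formula for the minimum-volume ellipsoid containing the half-ellipsoid; assumed here, valid for $n ≥ 2$, 1D falling back to dichotomy):

```python
import numpy as np

def ellipsoid_step(x_k, P_k, g):
    """One central-cut update: returns (x_{k+1}, P_{k+1}) describing the
    minimum-volume ellipsoid containing E_k ∩ {x : g^T (x - x_k) ≤ 0},
    where g = f'(x_k). Requires n ≥ 2."""
    n = x_k.size
    g_tilde = g / np.sqrt(g @ P_k @ g)               # normalized cut direction
    Pg = P_k @ g_tilde
    x_next = x_k - Pg / (n + 1)                      # new center
    P_next = n**2 / (n**2 - 1) * (P_k - 2 / (n + 1) * np.outer(Pg, Pg))
    return x_next, P_next
```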

3. Gradient descent

Algorithm:

• $x_{k+1} = x_k - γ f'(x_k)$
• Choice of $γ$:
• constant
• line-search
• exact: $\inf_{γ≥0} f(x_k - γ f'(x_k))$
• inexact (e.g. backtracking; see the sketch after this list)
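
One common inexact rule is Armijo backtracking; a minimal sketch (the defaults `gamma0`, `alpha`, `beta` are hypothetical values for illustration):

```python
def backtracking_step(f, grad_f, x, gamma0=1.0, alpha=0.5, beta=0.5):
    """One gradient step with Armijo backtracking: shrink γ until the
    sufficient-decrease test f(x - γ g) ≤ f(x) - α γ ‖g‖² holds, then move.
    x is assumed to be a NumPy array."""
    g = grad_f(x)
    gamma = gamma0
    while f(x - gamma * g) > f(x) - alpha * gamma * (g @ g):
        gamma *= beta
    return x - gamma * g
```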

Proposition: Assume

• $f$ cvx and $C^2$
• all eigenvalues of $f''(x)$ are in $[μ, L]$ for all $x$ (so $μ$ lower-bounds and $L$ upper-bounds the eigenvalues of the Hessian)

NB:

• always $μ ≥ 0$
• $μ > 0 ⟺ f \text{ strongly convex}$

Algorithm:

• $x_{k+1} = x_k - γ f'(x_k)$
• if $γ = \frac 1 L$, then $\begin{cases} \Vert x_k - x_\ast \Vert^2 ≤ (1 - \frac μ L)^k \Vert x_0 - x_\ast \Vert^2 \\ f(x_k) - f(x_\ast) ≤ \frac L k \Vert x_0 - x_\ast \Vert^2 \end{cases}$ (see the sketch below)
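
A minimal sketch on a strongly convex quadratic (the data below are made up; $μ$ and $L$ are read off the eigenvalues of the Hessian):

```python
import numpy as np

# Hypothetical strongly convex quadratic f(x) = 1/2 x^T Q x - b^T x, with Q ≻ 0.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M @ M.T + np.eye(5)                   # positive definite Hessian
b = rng.standard_normal(5)

mu, L = np.linalg.eigvalsh(Q)[[0, -1]]    # smallest and largest eigenvalue
x_star = np.linalg.solve(Q, b)            # exact minimizer, for reference

x = np.zeros(5)
for k in range(200):
    x = x - (1 / L) * (Q @ x - b)         # gradient step with γ = 1/L

# ‖x_k - x*‖² shrinks by at least a factor (1 - μ/L) per iteration
print(np.linalg.norm(x - x_star) ** 2)
```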

Summary:

• If $f$ is strongly convex, gradient descent is linearly/geometrically convergent (warning: “linearly” here means the error shrinks like $\exp(-c k)$ for some constant $c > 0$, i.e. the number of correct digits grows linearly with $k$)

• $O(n)$ per iteration
• if $f$ is not convex, then there’s convergence to a stationary point only

## Newton’s method

Idea: optimize local quadratic Taylor expansion

• no parameter
• quadratically convergent: $c \Vert x_{k+1} - x_\ast \Vert ≤ (c \Vert x_k - x_\ast \Vert)^2$

Disadvantages:

• Unstable far from $x^\ast$
• $O(n^3)$ per iteration: each step is very expensive
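
A minimal sketch of one Newton step (assuming `grad_f` and `hess_f` are callables returning the gradient vector and Hessian matrix):

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton step: minimize the local quadratic Taylor model, i.e.
    x⁺ = x - f''(x)^{-1} f'(x). Solving the linear system costs O(n^3)."""
    return x - np.linalg.solve(hess_f(x), grad_f(x))
```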

Ridge regression estimator $\hat{w}$:
$$\hat{w} ≝ \text{argmin}_w \; \frac{1}{2n} \Vert y - Xw \Vert^2_2 + \frac λ 2 \Vert w \Vert^2_2$$

Write $F(w)$ for this objective.

In the practical session (TP), we showed that if $λ = 0$ (and $X^T X$ is invertible):

$$\hat{w}_0 = (X^T X)^{-1} X^T y$$

Solve the first-order condition ($\nabla F = 0$):

Since $F$ is convex, $w^\ast$ is a minimizer if and only if $\nabla F(w^\ast) = 0$.

$$\begin{align*} 0 = \nabla F(\hat w) & = \frac{-1}{n} X^T (y - X\hat w) + λ \hat w \\ ⟹ \hat{w} & = (λ n I + X^T X)^{-1} X^T y \end{align*}$$
$F$ is $λ$-strongly convex:

$$F - \frac λ 2 \Vert \bullet \Vert^2_2 \text{ is convex}$$
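
A minimal sketch of the closed-form solution derived above (the names `X`, `y`, `lam` are placeholders):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator ŵ = (λ n I + XᵀX)^{-1} Xᵀ y from the first-order condition."""
    n, d = X.shape
    return np.linalg.solve(lam * n * np.eye(d) + X.T @ X, X.T @ y)
```

With `lam = 0` (and $X^T X$ invertible) this reduces to the OLS estimator $\hat{w}_0$.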

Halting conditions for gradient descent:

1. $\Vert w_n - w_{n+1} \Vert ≤ ε$
2. $\Vert \nabla F\Vert ≤ ε$
3. $\Vert w_n - w^\ast \Vert ≤ ε$ (kind of cheating: we need to already know $w^\ast$)
In our case, $\nabla F (w) = \frac{-1}{n} X^T (y - Xw) + λ w$ (see the sketch below).
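
A minimal sketch of gradient descent on $F$ using halting condition 2, $\Vert \nabla F(w) \Vert ≤ ε$ (the defaults `eps` and `max_iter` are hypothetical; the step size is $1/L$ with $L = λ_{\max}(X^T X)/n + λ$):

```python
import numpy as np

def ridge_gd(X, y, lam, eps=1e-8, max_iter=10_000):
    """Gradient descent on F(w) = 1/(2n)‖y - Xw‖² + λ/2 ‖w‖²,
    stopped when ‖∇F(w)‖ ≤ ε."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n)[-1] + lam   # largest Hessian eigenvalue
    w = np.zeros(d)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ w) / n + lam * w
        if np.linalg.norm(grad) <= eps:             # halting condition 2
            break
        w = w - grad / L                            # step with γ = 1/L
    return w
```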