Lecture 4: Convex analysis/optimization

Convex functions

Neural networks: non-convex

We want the loss function to be convex:

\[\min_f \frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i)) + λ Ω(f)\]

Convex set $K ⊆ ℝ^n$:: iff $∀(x, y) ∈ K^2, ∀α∈[0, 1], \; αx + (1-α)y ∈ K$

Examples:

$ℝ^n$
$\lbrace x \mid \Vert x \Vert ≤ 1 \rbrace$
Affine subspace: $A^T x = b$
Half-space: $a^Tx ≤ b$
Polytopes in $ℝ^n$

Property: convex sets are stable under intersection

NB: We’ll always try to see our sets as intersections of convex sets

Convex hull of a set $A ⊆ ℝ^n$:: the smallest convex set containing $A$

Property: $hull(A) = \text{set of barycenters of points in } A \\ ≝ \left\lbrace \sum\limits_{ i∈ I } α_i x_i \mid x_i ∈ A, α_i ≥ 0, \sum\limits_{ i∈I } α_i = 1 \right\rbrace$

Hahn-Banach Theorem: if $C$ and $D$ are two disjoint convex sets, they can be separated by a hyperplane

i.e. $∃a, b; \; \begin{cases} C ⊆ \lbrace a^Tx ≤ b \rbrace \\ D ⊆ \lbrace a^Tx ≥ b \rbrace \end{cases}$

A function $f: D ⊆ ℝ^n ⟶ ℝ$ is convex

iff

$D$ is convex
\[∀(x, y) ∈ D^2, ∀α∈[0, 1], \; f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y)\]

Strict convexity:

$x ≠ y$, $α ∈ ]0, 1[$
$f(αx + (1-α)y) < αf(x) + (1-α)f(y)$

NB: beware when translating from French to English:

positif (French) ⟺ nonnegative (English)
strictement positif (French) ⟺ positive (English)
$f$ croissante (French) ⟺ $f$ non-decreasing (English)

$μ$-Strong convexity:

$α ∈ [0, 1]$
$f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y) - μ \frac{α(1-α)}{2} \Vert x - y \Vert^2$
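
As a quick sanity check of this definition (a worked example, not from the lecture), take $f(x) = x^2$ on $ℝ$. Expanding the square gives

\[αx^2 + (1-α)y^2 - (αx + (1-α)y)^2 = α(1-α)(x - y)^2\]

so the inequality holds with equality for $μ = 2$: $x ⟼ x^2$ is $2$-strongly convex, and $μ = 2$ is the largest admissible constant.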

Classic examples

in 1D: $x, x^2, − \log(x), \exp(x), \log(1 + \exp(−x)), \vert x \vert^p \text{ where } p ≥ 1$
in $n$-D:
- affine $a^T x + b$
- quadratic $x^T Q x$ if $Q$ is positive semi-definite
- log-sum-exp: $\log\left(\sum\limits_{ i=1 }^n \exp(x_i)\right)$
- max
- norms
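
The defining inequality can be probed numerically on some of the examples above. The following is only a sketch: random testing can expose a non-convex function but never proves convexity, and the helper names (`convexity_gap`, `min_gap`) as well as the sampling range are illustrative choices, not part of the lecture.

```python
import math
import random

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x_i)))."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def convexity_gap(f, x, y, alpha):
    """alpha*f(x) + (1-alpha)*f(y) - f(alpha*x + (1-alpha)*y); it is >= 0 when f is convex."""
    z = [alpha * xi + (1 - alpha) * yi for xi, yi in zip(x, y)]
    return alpha * f(x) + (1 - alpha) * f(y) - f(z)

def min_gap(f, dim=3, trials=1000, seed=0):
    """Smallest gap observed over random pairs of points and random alpha in [0, 1)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-5, 5) for _ in range(dim)]
        y = [rng.uniform(-5, 5) for _ in range(dim)]
        worst = min(worst, convexity_gap(f, x, y, rng.random()))
    return worst

lse_gap = min_gap(log_sum_exp)               # convex: gap stays >= 0 up to rounding
max_gap = min_gap(max)                       # convex: gap stays >= 0 up to rounding
sin_gap = min_gap(lambda x: math.sin(x[0]))  # not convex: some gap is negative
```

A negative gap is a certificate of non-convexity; a nonnegative one is merely consistent with convexity.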

How to recognize a convex function?

rarely used: in $ℝ$, if $f$ is differentiable: $f \text{ cvx } ⟺ f' \text{ is non-decreasing}$

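Warning: the pointwise infimum of two convex functions is not convex in general

Writing $\inf(f_0(x), f_1(x)) = \inf_{y ∈ \lbrace 0, 1 \rbrace} y f_1 (x) + (1-y) f_0(x)$ does not help: $\lbrace 0, 1 \rbrace$ is not convex, and $(x, y) ⟼ y f_1 (x) + (1-y) f_0(x)$ is not jointly convex in $(x, y)$, so the rule "partial minimization of a jointly convex function over a convex set is convex" does not apply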
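
A concrete counterexample makes the warning tangible. The two parabolas below are an arbitrary illustrative choice (they do not come from the lecture): each is convex, but their pointwise minimum violates the convexity inequality at the midpoint of their two minimizers.

```python
def f0(x):
    return (x - 1.0) ** 2  # convex, minimized at x = 1

def f1(x):
    return (x + 1.0) ** 2  # convex, minimized at x = -1

def f_min(x):
    return min(f0(x), f1(x))

x, y, alpha = -1.0, 1.0, 0.5
midpoint_value = f_min(alpha * x + (1 - alpha) * y)      # value at the midpoint 0
chord_value = alpha * f_min(x) + (1 - alpha) * f_min(y)  # value on the chord
violates_convexity = midpoint_value > chord_value
```

Here the midpoint value is 1 while the chord value is 0, so `f_min` cannot be convex.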

Jensen’s inequality for a convex function $f$:: if $Z$ is a random vector contained in the domain of $f$: $f(𝔼 Z) ≤ 𝔼 f(Z)$
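
Jensen's inequality can be checked on an empirical distribution, where the expectation is a finite average and the inequality is exactly the convexity inequality with weights $1/N$. This is a minimal sketch; the Gaussian samples, the seed, and the helper name `jensen_gap` are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

def mean(values):
    return sum(values) / len(values)

def jensen_gap(f, zs):
    """E f(Z) - f(E Z) under the empirical distribution of zs; >= 0 when f is convex."""
    return mean([f(z) for z in zs]) - f(mean(zs))

gap_exp = jensen_gap(math.exp, samples)
gap_square = jensen_gap(lambda z: z * z, samples)  # equals the empirical variance
gap_logistic = jensen_gap(lambda z: math.log(1.0 + math.exp(-z)), samples)
```

For $f(z) = z^2$ the gap is the empirical variance, which is one way to remember the direction of the inequality.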

Optimization problems

Let $f: ℝ^n ⟶ ℝ$ be any function.

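$minimize_{x ∈ ℝ^n} f(x)$: the optimization problem itself (a question, not a number); the minimum may or may not exist
$\min_{x ∈ ℝ^n} f(x)$: the minimum value; it does not always exist (ex: $x ⟼ \exp(-\Vert x \Vert^2)$, whose infimum $0$ is never attained)
$\inf_{x ∈ ℝ^n} f(x)$: always exists (in $ℝ ∪ \lbrace -∞ \rbrace$)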

$x_0$ local minimum:: iff there exists an open neighborhood $V \ni x_0$ s.t. $f(x_0) = \min_{x ∈ V} f(x)$
$x_0$ global minimum:: iff $f(x_0) = \min_{x ∈ ℝ^n} f(x)$

If $f$ is differentiable:

$x_0$ is a stationary point:: iff $f'(x_0) = 0$ (a saddle point is a stationary point that is not a local extremum)

Property:

local minimum ⟹ stationary point (but the converse is not true)

if $f$ is convex, local minimum ⟺ global minimum ⟺ stationary point
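
The practical consequence is that, for a differentiable convex function, any method that finds a stationary point finds a global minimum. The sketch below runs plain gradient descent on $f(x) = \exp(x) - 2x$, an example chosen here for illustration; the step size and iteration count are hand-picked assumptions, not tuned values from the lecture.

```python
import math

def grad(x):
    """Derivative of the convex function f(x) = exp(x) - 2x."""
    return math.exp(x) - 2.0

def gradient_descent(x0, step=0.1, iters=500):
    """Fixed-step gradient descent started at x0."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x

x_from_left = gradient_descent(-3.0)
x_from_right = gradient_descent(2.0)
x_star = math.log(2.0)  # the unique stationary point, where exp(x) = 2
```

Both runs end at the same point $\log 2$, whatever the starting point; for a non-convex function, different starts could end at different stationary points.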

Constrained optimization problems

Problem $minimize_{x ∈ D ⊆ ℝ^n} f(x)$

Assignment problem

$p$ tasks
$p$ machines
there is a cost matrix $C ∈ ℝ^{p×p}$ s.t. $C_{i, j}$ is the cost of running task $i$ on machine $j$

Goal: find a bijection $σ: \lbrace 1, ⋯, p \rbrace ⟶ \lbrace 1, ⋯, p \rbrace$ s.t.

task $i$ is assigned to machine $σ(i)$
\[minimize \sum\limits_{ i=1 }^p C_{i, σ(i)}\]

Permutation matrices $M ∈ 𝔐_p(ℝ)$ where

$M_{i, j} = 0$ except when $j = σ(i)$
$M_{i, σ(i)} = 1$

Facts (doubly stochastic matrices):

\[\sum\limits_{ i= 1}^p C_{i, σ(i)} = \sum\limits_{i, j = 1}^p C_{i, j}(M_{σ})_{i, j}\]

\[∀i, \sum\limits_{ j } M_{i, j} = \sum\limits_{ j } M_{j, i} = 1\]

\[M ≥ 0\]

Goal:

\[minimize_{M \text{ bistochastic}} \underbrace{\sum\limits_{i, j = 1}^p C_{i, j} M_{i, j}}_{\text{linear function of } M}\]

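As the set of bistochastic matrices is convex, we are minimizing a linear function over a convex set

Birkhoff theorem: the set of bistochastic matrices is the convex hull of the permutation matrices

A linear function on this polytope attains its minimum at a vertex, i.e. at a permutation matrix, so the relaxed problem has the same optimal value as the assignment problem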
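
The link between bijections, permutation matrices and the linear cost can be checked by brute force on a tiny instance. This is a sketch under stated assumptions: the $3×3$ cost matrix is made up, indices are 0-based, and enumerating all $p!$ bijections is only viable for very small $p$.

```python
from itertools import permutations

# Made-up cost matrix: C[i][j] is the cost of task i on machine j.
C = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
p = len(C)

def permutation_matrix(sigma):
    """M[i][j] = 1 if j == sigma[i], else 0."""
    return [[1 if j == sigma[i] else 0 for j in range(p)] for i in range(p)]

def linear_cost(M):
    """sum over i, j of C[i][j] * M[i][j]."""
    return sum(C[i][j] * M[i][j] for i in range(p) for j in range(p))

def is_bistochastic(M, tol=1e-12):
    rows = all(abs(sum(M[i]) - 1) <= tol for i in range(p))
    cols = all(abs(sum(M[i][j] for i in range(p)) - 1) <= tol for j in range(p))
    nonneg = all(M[i][j] >= 0 for i in range(p) for j in range(p))
    return rows and cols and nonneg

# Brute force over all bijections sigma.
best_sigma = min(permutations(range(p)),
                 key=lambda s: sum(C[i][s[i]] for i in range(p)))
best_cost = linear_cost(permutation_matrix(best_sigma))

# A convex combination of permutation matrices is bistochastic and,
# by linearity of the cost, cannot do better than the best permutation.
all_perms = [permutation_matrix(s) for s in permutations(range(p))]
weight = 1.0 / len(all_perms)
uniform_mix = [[sum(weight * M[i][j] for M in all_perms) for j in range(p)]
               for i in range(p)]
mix_cost = linear_cost(uniform_mix)
```

On this instance the best bijection is `(1, 0, 2)` with cost 5, while the uniform mixture of all permutation matrices costs 22/3.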

Lagrangian duality

$D$ may not be a vector space (ex: $\lbrace 0, 1 \rbrace^n$)

\[minimize_{x∈D} f(x)\]

such that (constraints):

$h_i(x) = 0$ for all $1 ≤ i ≤m$: equality constraints
$g_j(x) ≤ 0$ for all $1 ≤ j ≤r$: inequality constraints

notation: $D^\ast$ is the subset of $D$ such that the constraints are satisfied (feasible set)

Ex: in the example above, $D$ was the set of all matrices, $D^\ast$ the set of bistochastic matrices

For

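$x ∈ D$: primal variable
$λ ∈ ℝ^m$: dual variables (Lagrange multipliers of the equality constraints)
$μ ∈ ℝ^r_+$: dual variables (Lagrange multipliers of the inequality constraints)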

Lagrangian:: \[ℒ(x, λ, μ) ≝ f(x) + \sum\limits_{ i =1 }^m λ_i h_i(x) + \sum\limits_{ j =1 }^r μ_j g_j(x) \\ = f(x) + λ^T h(x) + μ^T g(x)\]

With respect to $λ$: as soon as $h(x) ≠ 0$, the supremum of $λ^T h(x)$ is $+∞$ (let $λ_i → ±∞$ according to the sign of $h_i(x)$). Else, it is $0$

With respect to $μ ≥ 0$: as soon as $g(x) \not ≤ 0$, the supremum of $μ^T g(x)$ is $+∞$. Else, it is $0$.

Property: for all $x∈ D$: $\sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ) = \begin{cases} f(x) \text{ if } x ∈ D^\ast \\ +∞ \text{ else} \end{cases}$

\[\text{The original problem} ⟺ \min_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)\]

⟶ a min-max problem

We’ll turn it into a max-min problem.

\[p^\ast ≝ \underbrace{\inf_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)}_{\text{primal problem}} ≥ \underbrace{\max_{λ, μ ∈ ℝ^m × ℝ^r_+} \inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{dual problem}} ≝ d^\ast\]

Weak duality:: the primal value ≥ the dual one
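
Weak duality always holds, without any convexity assumption. A short justification (the standard argument, spelled out here for convenience): for any $x' ∈ D$ and any $(λ, μ) ∈ ℝ^m × ℝ^r_+$,

\[\inf_{x ∈ D} ℒ(x, λ, μ) ≤ ℒ(x', λ, μ) ≤ \sup_{(λ', μ') ∈ ℝ^m × ℝ^r_+} ℒ(x', λ', μ')\]

Taking the supremum over $(λ, μ)$ on the left, then the infimum over $x'$ on the right, gives $d^\ast ≤ p^\ast$.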

Dual problem: $d^\ast ≝ \max_{λ, μ ∈ ℝ^m × ℝ^r_+} \underbrace{\inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{concave function of } (λ, μ)}$

NB: the dual problem is easier when there are few constraints (it has one variable per constraint)

Strong duality:: the primal value = the dual one

NB: in general, strong duality requires convexity

Slater’s conditions: if

$D$ convex, $f$ convex
$∀i, h_i$ affine (equalities)
$∀j, g_j$ convex (inequalities)
strict feasibility: there exists a strictly feasible point: $\overline{x} ∈ D^\ast$ s.t. $∀j, \; g_j(\overline{x}) < 0$

then there is strong duality
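
A one-dimensional problem shows strong duality at work: minimize $f(x) = x^2$ subject to $g(x) = 1 - x ≤ 0$. The problem is convex and $x = 2$ is strictly feasible. The sketch below compares the primal and dual values on coarse grids; the example problem, the grids, and the closed form of the inner minimizer ($x = μ/2$, obtained by setting the derivative of the Lagrangian to zero) are illustrative assumptions rather than material from the lecture.

```python
def f(x):
    return x * x

def g(x):
    return 1.0 - x  # constraint g(x) <= 0, i.e. x >= 1

def dual_function(mu):
    """q(mu) = inf_x f(x) + mu * g(x); the infimum is attained at x = mu / 2."""
    x = mu / 2.0
    return f(x) + mu * g(x)

# Coarse grids, for illustration only.
xs = [i / 1000.0 for i in range(-3000, 3001)]
mus = [i / 1000.0 for i in range(0, 5001)]

p_star = min(f(x) for x in xs if g(x) <= 0)     # primal value over feasible grid points
d_star = max(dual_function(mu) for mu in mus)   # dual value over mu >= 0
mu_star = max(mus, key=dual_function)           # dual maximizer on the grid
```

Both values equal 1, reached at $x = 1$ and $μ = 2$, so the duality gap is zero on this example.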

NB: Strict feasibility rules out cases such as the constraint $g(x) ≝ x^2 ≤ 0$, whose only feasible point $x = 0$ is not strictly feasible
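
To see what can go wrong (a standard textbook-style illustration, not taken from the lecture), consider minimizing $f(x) = x$ subject to $g(x) = x^2 ≤ 0$. The only feasible point is $x = 0$, so $p^\ast = 0$, and no point satisfies $g(x) < 0$. The dual function is

\[q(μ) = \inf_{x ∈ ℝ} \; x + μ x^2 = \begin{cases} -\frac{1}{4μ} \text{ if } μ > 0 \\ -∞ \text{ if } μ = 0 \end{cases}\]

so $\sup_{μ ≥ 0} q(μ) = 0 = p^\ast$, but the supremum is not attained: there is no optimal dual variable $μ^\ast$.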

There are multiple duals (multiple choices to represent your constraints):

Ex:

Difficult:

\[\inf_ω \Vert y - X ω \Vert^2 + λ(\Vert ω \Vert - 1)\]

s.t. $\Vert ω \Vert ≤ 1$

Easier (and equivalent):

\[\inf_ω \Vert y - X ω \Vert^2 + λ(\Vert ω \Vert^2 - 1)\]

s.t. $\Vert ω \Vert^2 ≤ 1$

Dual of dual = primal

Ex:

\[\inf_x f(x) \text{ s.t. } g(x)≤0\]

\[𝒜 ≝ \lbrace (u, v) ∈ ℝ^2 \mid ∃x; \; f(x) ≤ v, g(x) ≤ u\rbrace\]

\[ℒ(x, λ) = f(x) + λ g(x)\]

\[q(λ) ≝ \inf_x ℒ(x, λ) = \inf_{(u,v) ∈ 𝒜} v + λu\]

Karush-Kuhn-Tucker (KKT) conditions

\[x^\ast ∈ D, (λ^\ast, μ^\ast) ∈ ℝ^m × ℝ^r\]

are primal/dual optimal (when strong duality holds) iff

$x^\ast$ feasible and $(λ^\ast, μ^\ast)$ feasible
$x^\ast$ minimizes $ℒ(x, λ^\ast, μ^\ast)$
complementary slackness: $∀j, μ^\ast_j g_j(x^\ast) = 0$
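
These conditions can be verified numerically on a problem with a known closed-form solution: the projection of a point $z$ onto the half-space $\lbrace x \mid a^T x ≤ b \rbrace$, i.e. minimizing $\frac 1 2 \Vert x - z \Vert^2$ subject to $a^T x - b ≤ 0$. The sketch below is an illustration under assumptions: the problem, the data, and the helper names are chosen here, and the condition "$x^\ast$ minimizes $ℒ$" is checked through the gradient of $ℒ$ in $x$, which is sufficient because $ℒ$ is convex in $x$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_on_half_space(z, a, b):
    """Solve min_x 0.5 * ||x - z||^2  s.t.  a^T x <= b; return (x_star, mu_star)."""
    mu = max(0.0, (dot(a, z) - b) / dot(a, a))
    x = [zi - mu * ai for zi, ai in zip(z, a)]
    return x, mu

def kkt_residuals(x, mu, z, a, b):
    """Each entry is 0 (up to rounding) when the corresponding KKT condition holds."""
    g = dot(a, x) - b
    # Gradient in x of L(x, mu) = 0.5 * ||x - z||^2 + mu * (a^T x - b)
    stationarity = max(abs(xi - zi + mu * ai) for xi, zi, ai in zip(x, z, a))
    return {"primal_violation": max(0.0, g),
            "dual_violation": max(0.0, -mu),
            "stationarity": stationarity,
            "complementary_slackness": abs(mu * g)}

a, b = [1.0, 1.0], 1.0
z_outside, z_inside = [2.0, 2.0], [-1.0, 0.0]

x_active, mu_active = project_on_half_space(z_outside, a, b)     # constraint active
x_inactive, mu_inactive = project_on_half_space(z_inside, a, b)  # constraint inactive
res_active = kkt_residuals(x_active, mu_active, z_outside, a, b)
res_inactive = kkt_residuals(x_inactive, mu_inactive, z_inside, a, b)
```

The two cases illustrate complementary slackness: when $z$ lies outside the half-space the constraint is active and $μ^\ast > 0$; when $z$ is already feasible the constraint is inactive and $μ^\ast = 0$.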
