Lecture 4: Convex analysis/optimization

Convex functions

Neural networks: non-convex

We want the loss function to be convex:

\min_f \frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i)) + λ Ω(f)
Convex set $K ⊆ ℝ^n$:

iff ∀(x, y) ∈ K^2, ∀α∈[0, 1], \; αx + (1-α)y ∈ K

Examples:

  • $ℝ^n$
  • Unit ball: $\lbrace x \mid \Vert x \Vert ≤ 1 \rbrace$
  • Affine subspace: $A^T x = b$
  • Half-space: $a^Tx ≤ b$
  • Polytopes in $ℝ^n$

Property: convex sets are stable under intersection

NB: We’ll always try to see our sets as intersections of convex sets

Convex hull of a set $A ⊆ ℝ^n$:

the smallest convex set containing $A$

Property: hull(A) = \text{set of barycenters of points in } A \\ ≝ \left\lbrace \sum\limits_{ i∈ I } α_i x_i \mid x_i ∈ A, α_i ≥ 0, \sum\limits_{ i∈I } α_i = 1 \right\rbrace
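
To make the barycenter description concrete, here is a minimal Python sketch. The helper name `barycenter` and the choice of $A$ (the four corners of the unit square) are assumptions made for illustration only.

```python
import random

def barycenter(points, weights):
    """Return sum_i weights[i] * points[i], for points given as tuples."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(points[0])
    return tuple(sum(w * pt[k] for w, pt in zip(weights, points)) for k in range(dim))

# Hypothetical example: A = corners of the unit square, so hull(A) is the full square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center = barycenter(corners, [0.25, 0.25, 0.25, 0.25])

# Random non-negative weights, normalized to sum to 1.
random.seed(0)
raw = [random.random() for _ in corners]
weights = [r / sum(raw) for r in raw]
point = barycenter(corners, weights)  # lies inside the square
```

Equal weights give the center of the square; any other admissible weights give another point of the hull.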


Hahn-Banach Theorem: if $C$ and $D$ are two disjoint convex sets, they can be separated by a hyperplane

i.e. ∃a, b; \; \begin{cases} C ⊆ \lbrace a^Tx ≤ b \rbrace \\ D ⊆ \lbrace a^Tx ≥ b \rbrace \end{cases}

Convex function: $f: D ⊆ ℝ^n ⟶ ℝ$ is convex

iff

  • $D$ is convex
  • ∀(x, y) ∈ D^2, ∀α∈[0, 1], \; f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y)
Strict convexity:
  • $x ≠ y$ and $α ∈ ]0, 1[$
  • $f(αx + (1-α)y) < αf(x) + (1-α)f(y)$

NB: Beware of false friends when translating from French to English:

  • positif (French) ⟺ non-negative (English)
  • strictement positif (French) ⟺ positive (English)

  • $f$ croissante (French) ⟺ $f$ non-decreasing (English)
$μ$-Strong convexity:
  • $α ∈ [0, 1]$
  • $f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y) - μ \frac{α(1-α)}{2} \Vert x - y \Vert^2$
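
As a sanity check of this definition, the sketch below evaluates both sides of the inequality for $f(x) = x^2$ in 1D. The function `strong_convexity_gap` and the sample triples are illustrative assumptions. For this $f$, expanding the squares gives $αx^2 + (1-α)y^2 - (αx + (1-α)y)^2 = α(1-α)(x-y)^2$, so the inequality is tight for $μ = 2$ and fails for any larger $μ$.

```python
def strong_convexity_gap(f, mu, x, y, alpha):
    """Right-hand side minus left-hand side of the mu-strong convexity inequality (1-D)."""
    lhs = f(alpha * x + (1 - alpha) * y)
    rhs = (alpha * f(x) + (1 - alpha) * f(y)
           - mu * alpha * (1 - alpha) / 2 * (x - y) ** 2)
    return rhs - lhs

def square(t):
    return t * t

# Hypothetical sample triples (x, y, alpha).
samples = [(-3.0, 2.0, 0.3), (0.5, 4.0, 0.5), (-1.0, -7.0, 0.9)]

# mu = 2: the gap vanishes up to rounding, so the inequality is tight.
gaps_mu2 = [strong_convexity_gap(square, 2.0, x, y, a) for x, y, a in samples]
# mu = 3: the gap is negative, so x^2 is not 3-strongly convex.
gaps_mu3 = [strong_convexity_gap(square, 3.0, x, y, a) for x, y, a in samples]
```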

Classic examples

  • in 1D: $x, x^2, − \log(x), \exp(x), \log(1 + \exp(−x)), \vert x \vert^p \text{ where } p ≥ 1$

  • in $n$-D:

    • affine $a^T x + b$
    • quadratic $x^T Q x$ if $Q$ is symmetric positive semi-definite
    • log-sum-exp: $\log\left(\sum\limits_{ i=1 }^n \exp(x_i)\right)$
    • max
    • norms
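
The 1D examples above can be spot-checked numerically. The sketch below is only a sampled test: it can refute convexity, never prove it. The helper `violates_convexity`, the grid on $(0, 6)$ and the tolerance are assumptions made for this illustration.

```python
import math

def violates_convexity(f, xs, alphas, tol=1e-9):
    """Return True if some sampled triple (x, y, alpha) breaks the convexity inequality."""
    for x in xs:
        for y in xs:
            for a in alphas:
                if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + tol:
                    return True
    return False

xs = [0.1 * k for k in range(1, 60)]      # sample points in (0, 6), where -log is defined
alphas = [0.1 * k for k in range(0, 11)]  # alpha in [0, 1]

examples = {
    "x^2": lambda t: t * t,
    "-log(x)": lambda t: -math.log(t),
    "exp(x)": math.exp,
    "log(1 + exp(-x))": lambda t: math.log(1 + math.exp(-t)),
    "|x|^1.5": lambda t: abs(t) ** 1.5,
}
results = {name: violates_convexity(f, xs, alphas) for name, f in examples.items()}

# Counter-example: sin is concave on (0, pi), so the test finds a violation.
sine_violates = violates_convexity(math.sin, xs, alphas)
```

None of the listed examples produces a violation on this grid, while the sine does.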

How to recognize a convex function?

  • rarely used: in $ℝ$, if $f$ is differentiable: f \text{ cvx } ⟺ f' \text{ is non-decreasing}

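Warning: the infimum of two convex functions is not convex in general. Partial minimization (if $(x, y) ⟼ F(x, y)$ is jointly convex and $y$ ranges over a convex set, then $x ⟼ \inf_y F(x, y)$ is convex) doesn’t apply to

\inf(f_0(x), f_1(x)) = \inf_{y ∈ \lbrace 0, 1 \rbrace} y f_1 (x) + (1-y) f_0(x)

since $\lbrace 0, 1 \rbrace$ is not convex, and $(x, y) ⟼ y f_1 (x) + (1-y) f_0(x)$ is not jointly convex in $(x, y)$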
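
A small counter-example makes the failure concrete. The two parabolas below are an illustrative choice, not taken from the lecture: each is convex, but their pointwise minimum has two wells and violates the convexity inequality at the midpoint.

```python
def f0(x):
    return (x - 1.0) ** 2   # convex, minimized at x = 1

def f1(x):
    return (x + 1.0) ** 2   # convex, minimized at x = -1

def lower(x):
    """Pointwise infimum of f0 and f1."""
    return min(f0(x), f1(x))

x, y, alpha = -1.0, 1.0, 0.5
midpoint_value = lower(alpha * x + (1 - alpha) * y)      # lower(0) = 1
chord_value = alpha * lower(x) + (1 - alpha) * lower(y)  # 0.5 * 0 + 0.5 * 0 = 0

# Convexity would require midpoint_value <= chord_value.
convexity_fails = midpoint_value > chord_value
```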

Jensen’s inequality for a convex function $f$:

if $Z$ is a random vector contained in the domain of $f$: f(𝔼 Z) ≤ 𝔼 f(Z)
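
Jensen’s inequality can be checked directly on a finite distribution. The values and probabilities below are made up for the illustration. For $f(z) = z^2$, the gap $𝔼 f(Z) - f(𝔼 Z)$ is exactly the variance of $Z$.

```python
import math

# Hypothetical discrete random variable Z: values and their probabilities.
values = [-1.0, 0.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expectation(g):
    """E[g(Z)] for the discrete distribution above."""
    return sum(p * g(z) for p, z in zip(probs, values))

mean = expectation(lambda z: z)  # E[Z]

# Jensen gap E[f(Z)] - f(E[Z]) for two convex functions; both must be >= 0.
jensen_gaps = {
    "exp": expectation(math.exp) - math.exp(mean),
    "square": expectation(lambda z: z * z) - mean ** 2,  # equals Var(Z)
}
```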

Optimization problems

Let $f: ℝ^n ⟶ ℝ$ be any function.

  • $minimize_{x ∈ ℝ^n} f(x)$: denotes the optimization problem itself (not a number); it leaves open whether a minimum exists
  • $\min_{x ∈ ℝ^n} f(x)$: the minimum value, which does not always exist (ex: $\exp(-\bullet^2)$, whose infimum $0$ is never attained)
  • $\inf_{x ∈ ℝ^n} f(x)$: always exists (possibly $-∞$)
$x_0$ local minimum:

iff there exists an open neighborhood $V \ni x_0$ s.t. f(x_0) = \min_{x ∈ V} f(x)

$x_0$ global minimum:

iff f(x_0) = \min_{x ∈ ℝ^n} f(x)

If $f$ is differentiable:

$x_0$ is a stationary point (also called critical point; saddle points are a special case):

iff $f'(x_0) = 0$

Property:

  1. local minimum ⟹ stationary point (but the converse is not true)
  2. if $f$ is convex, local minimum ⟺ global minimum ⟺ stationary point
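
The two properties can be illustrated in 1D. The sketch below uses a finite-difference derivative and a grid, both chosen arbitrarily for the illustration: $x^3$ has a stationary point at $0$ that is not a local minimum, while for the convex function $x^2$ the stationary point $0$ is the global minimum.

```python
def derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def cube(x):
    return x ** 3    # not convex on R

def square(x):
    return x ** 2    # convex

grid = [k / 100 for k in range(-200, 201)]  # sample points in [-2, 2]

# x^3: stationary at 0, yet smaller values exist arbitrarily close to 0.
cube_is_stationary = abs(derivative(cube, 0.0)) < 1e-9
cube_is_local_min = all(cube(0.0) <= cube(x) for x in grid if abs(x) < 0.1)

# x^2: stationary at 0, and 0 minimizes the function over the whole grid.
square_is_stationary = abs(derivative(square, 0.0)) < 1e-9
square_is_global_min_on_grid = all(square(0.0) <= square(x) for x in grid)
```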

Constrained optimization problems

Problem minimize_{x ∈ D ⊆ ℝ^n} f(x)

Assignment problem

  • $p$ tasks
  • $p$ machines
  • there is a cost matrix $C ∈ ℝ^{p×p}$ s.t. $C(i, j)$ is the cost of the task $i$ for the machine $j$

Goal: find a bijection $σ: \lbrace 1, ⋯, p \rbrace ⟶ \lbrace 1, ⋯, p \rbrace$ s.t.

  • task $i$ is assigned to machine $σ(i)$
  • minimize \sum\limits_{ i=1 }^p C_{i, σ(i)}

Permutation matrix $M_σ ∈ 𝔐_p(ℝ)$ associated with $σ$:

  • $(M_σ)_{i, j} = 0$ except when $j = σ(i)$
  • $(M_σ)_{i, σ(i)} = 1$

Facts (a matrix $M$ satisfying 2 and 3 is called doubly stochastic, or bistochastic):

  1. \sum\limits_{ i= 1}^p C_{i, σ(i)} = \sum\limits_{i, j = 1}^p C_{i, j}(M_{σ})_{i, j}
  2. ∀i, \sum\limits_{ j } M_{i, j} = \sum\limits_{ j } M_{j, i} = 1
  3. M ≥ 0 (entrywise)

Goal:

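minimize_{M \text{ bistochastic}} \underbrace{\sum\limits_{i, j = 1}^p C_{i, j} M_{i, j}}_{\text{linear function of } M}

The set of bistochastic matrices is convex, so this is the minimization of a linear function over a convex set.

Birkhoff theorem: the set of bistochastic matrices is the convex hull of the permutation matrices. A linear function on the convex hull of finitely many points attains its minimum at one of these points, so the relaxed problem has an optimal solution that is a permutation matrix.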
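
For a very small $p$, the assignment problem can be solved by enumerating all bijections, which also lets us check that the cost of $σ$ equals the linear function evaluated at its permutation matrix. This is a brute-force sketch with a made-up cost matrix and 0-based indices. It is not an efficient method, since there are $p!$ permutations.

```python
from itertools import permutations

# Hypothetical 3x3 cost matrix: C[i][j] is the cost of task i on machine j.
C = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
p = len(C)

def permutation_matrix(sigma):
    """M[i][j] = 1 if j == sigma[i], else 0."""
    return [[1 if j == sigma[i] else 0 for j in range(p)] for i in range(p)]

def linear_cost(M):
    """The linear objective sum_{i,j} C[i][j] * M[i][j]."""
    return sum(C[i][j] * M[i][j] for i in range(p) for j in range(p))

# Enumerate every bijection sigma and keep the cheapest one.
best_sigma = min(permutations(range(p)),
                 key=lambda s: linear_cost(permutation_matrix(s)))
best_cost = linear_cost(permutation_matrix(best_sigma))

# A bistochastic matrix that is not a permutation matrix: all entries equal to 1/p.
uniform = [[1 / p] * p for _ in range(p)]
uniform_cost = linear_cost(uniform)  # cannot beat the best permutation
```

On this instance the best assignment sends task 0 to machine 1, task 1 to machine 0 and task 2 to machine 2, for a total cost of 5, while the uniform bistochastic matrix costs 22/3.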

Lagrangian duality

$D$ may not be a vector space (ex: $\lbrace 0, 1 \rbrace^n$)

minimize_{x∈D} f(x)

such that (constraints):

  1. $h_i(x) = 0$ for all $1 ≤ i ≤m$: equality constraints
  2. $g_j(x) ≤ 0$ for all $1 ≤ j ≤r$: inequality constraints

notation: $D^\ast$ is the subset of $D$ such that the constraints are satisfied (feasible set)

Ex: in the example above, $D$ was the set of all matrices, $D^\ast$ the set of bistochastic matrices

For

  • $x ∈ D$: primal variable
  • $λ ∈ ℝ^m$, $μ ∈ ℝ^r_+$: dual variables (Lagrange multipliers)
Lagrangian:
ℒ(x, λ, μ) ≝ f(x) + \sum\limits_{ i =1 }^m λ_i h_i(x) + \sum\limits_{ j =1 }^r μ_j g_j(x) \\ = f(x) + λ^T h(x) + μ^T g(x)

With respect to $λ$: as soon as $h(x) ≠ 0$, the supremum of $λ^T h(x)$ is $+∞$ (let $λ_i ⟶ ±∞$, with the sign of $h_i(x)$). Else, it is $0$.

With respect to $μ ≥ 0$: as soon as $g(x) \not ≤ 0$, the supremum of $μ^T g(x)$ is $+∞$. Else, it is $0$ (attained at $μ = 0$).

Property: for all $x∈ D$: \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ) = \begin{cases} f(x) \text{ if } x ∈ D^\ast \\ +∞ \text{ else} \end{cases}

\text{The original problem} ⟺ \min_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)

⟶ a min-max problem

We’ll turn it into a max-min problem.

p^\ast ≝ \underbrace{\inf_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)}_{\text{primal problem}} ≥ \underbrace{\max_{λ, μ ∈ ℝ^m × ℝ^r_+} \inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{dual problem}} ≝ d^\ast
Weak duality:

the primal value ≥ the dual one

Dual problem: d^\ast ≝ \max_{λ, μ ∈ ℝ^m × ℝ^r_+} \underbrace{\inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{concave function of } (λ, μ)}
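
A one-dimensional toy problem (chosen here for illustration, not taken from the lecture) shows the dual function at work: minimize $x^2$ subject to $g(x) = 1 - x ≤ 0$. The Lagrangian is $x^2 + μ(1 - x)$, minimized over $x$ at $x = μ/2$, which gives the concave dual function $q(μ) = μ - μ^2/4$. The grids below are arbitrary.

```python
def f(x):
    return x * x

def g(x):
    return 1.0 - x   # constraint g(x) <= 0, i.e. x >= 1

def lagrangian(x, mu):
    return f(x) + mu * g(x)

def dual_function(mu):
    """q(mu) = inf_x L(x, mu); the minimizer x = mu / 2 solves dL/dx = 2x - mu = 0."""
    return lagrangian(mu / 2.0, mu)

mus = [k / 100 for k in range(0, 501)]               # mu in [0, 5]
xs_feasible = [1.0 + k / 100 for k in range(0, 501)]  # feasible x in [1, 6]

# Weak duality: every dual value lower-bounds every feasible primal value.
weak_duality_holds = all(dual_function(mu) <= f(x) + 1e-12
                         for mu in mus for x in xs_feasible)

primal_value = min(f(x) for x in xs_feasible)      # grid estimate of p*, reached at x = 1
dual_value = max(dual_function(mu) for mu in mus)  # grid estimate of d*, reached at mu = 2
```

Both estimates equal 1 on this example: the problem is convex and has a strictly feasible point (for instance $x = 2$), so there is no duality gap.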

NB: the dual problem is easier when there are few constraints (it has one dual variable per constraint)

Strong duality:

the primal value = the dual one

NB: in general, convexity is needed for strong duality

Slater’s conditions:

if

  • $D$ convex, $f$ convex
  • $∀i, h_i$ affine (equalities)
  • $∀j, g_j$ convex (inequalities)
  • strict feasibility: there exists a strictly feasible point: $\overline{x} ∈ D^\ast$ s.t. ∀j, \; g_j(\overline{x}) < 0

then there is strong duality

NB: Strict feasibility avoids cases such as $g(x) ≝ x^2$: the only feasible point is $x = 0$, where $g(0) = 0$, so no point satisfies $g(x) < 0$

There are multiple duals (multiple choices to represent your constraints):

Ex:

Difficult:

\inf_ω \Vert y - X ω \Vert^2 + λ(\Vert ω \Vert - 1)

s.t. \Vert ω \Vert ≤ 1

Easier (and equivalent):

\inf_ω \Vert y - X ω \Vert^2 + λ(\Vert ω \Vert^2 - 1)

s.t. \Vert ω \Vert^2 ≤ 1


Dual of dual = primal

Ex:

\inf_x f(x) \text{ s.t. } g(x)≤0
𝒜 ≝ \lbrace (u, v) ∈ ℝ^2 \mid ∃x; \; f(x) ≤ v, g(x) ≤ u\rbrace
ℒ(x, λ) = f(x) + λ g(x)
q(λ) ≝ \inf_x ℒ(x, λ) = \inf_{(u,v) ∈ 𝒜} v + λu \text{ for } λ ≥ 0

Karush-Kuhn-Tucker (KKT) conditions

x^\ast ∈ D, (λ^\ast, μ^\ast) ∈ ℝ^m × ℝ^r

are primal/dual optimal, with no duality gap, iff

  1. $x^\ast$ feasible ($x^\ast ∈ D^\ast$) and $(λ^\ast, μ^\ast)$ feasible ($μ^\ast ≥ 0$)
  2. $x^\ast$ minimizes $ℒ(x, λ^\ast, μ^\ast)$
  3. complementary slackness: ∀j, μ^\ast_j g_j(x^\ast) = 0
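
The three conditions can be verified by hand on a toy problem (an illustrative choice, not from the lecture): minimize $x^2$ subject to $g(x) = 1 - x ≤ 0$, with no equality constraint. Since $ℒ(x, μ) = x^2 + μ(1 - x)$ is convex in $x$, condition 2 amounts to stationarity $2x - μ = 0$. The helper `check_kkt` is a hypothetical name.

```python
def g(x):
    return 1.0 - x   # inequality constraint g(x) <= 0

def lagrangian_gradient(x, mu):
    """d/dx of L(x, mu) = x^2 + mu * (1 - x)."""
    return 2.0 * x - mu

def check_kkt(x, mu):
    """Evaluate each KKT condition for the candidate pair (x, mu)."""
    return {
        "primal feasible": g(x) <= 0.0,
        "dual feasible": mu >= 0.0,
        # L(., mu) is convex, so a zero gradient means x minimizes L(., mu).
        "stationarity": lagrangian_gradient(x, mu) == 0.0,
        "complementary slackness": mu * g(x) == 0.0,
    }

optimal_pair = check_kkt(1.0, 2.0)   # x* = 1, mu* = 2: all conditions hold
other_pair = check_kkt(2.0, 4.0)     # feasible and stationary, but mu * g(x) = -4
```

The second pair shows why complementary slackness matters: $x = 2$ is feasible and minimizes $ℒ(\cdot, 4)$, yet it is not optimal for the constrained problem.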
