Lecture 4: Convex analysis/optimization
Convex functions
Neural networks: non-convex
We want the loss function to be convex:
\[\min_f \frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i)) + λ Ω(f)\]

- Convex set $K ⊆ ℝ^n$:
-
iff \(∀(x, y) ∈ K, ∀α∈[0, 1], \; αx + (1-α)y ∈ K\)
Examples:
- $ℝ^n$
- $\lbrace x \mid \Vert x \Vert ≤ 1 \rbrace$ (unit ball of a norm)
- Affine subspace: $A^T x = b$
- Half-space: $a^Tx ≤ b$
- Polytopes in $ℝ^n$
Property: convex sets are stable under intersection
NB: We’ll always try to see our sets as intersections of convex sets
- Convex hull of a set $A ⊆ ℝ^n$:
-
the smallest convex set containing $A$
Property: \(hull(A) = \text{set of barycenters of points in } A \\ ≝ \left\lbrace \sum\limits_{ i∈ I } α_i x_i \mid x_i ∈ A, α_i ≥ 0, \sum\limits_{ i∈I } α_i = 1 \right\rbrace\)
Hahn-Banach Theorem: if $C$ and $D$ are two disjoint convex sets, they can be separated by a hyperplane
i.e. \(∃a, b; \; \begin{cases} C ⊆ \lbrace a^Tx ≤ b \rbrace \\ D ⊆ \lbrace a^Tx ≥ b \rbrace \end{cases}\)
- Convex function: $f: D ⊆ ℝ^n ⟶ ℝ$ is convex:
-
iff
- $D$ is convex
- \[∀(x, y) ∈ D, ∀α∈[0, 1], \; f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y)\]
- Strict convexity:
-
- $α ∈ ]0, 1[$
- $f(αx + (1-α)y) < αf(x) + (1-α)f(y)$
NB: Warning: translation from French to English:

- positif (French) ⟺ non-negative (English)
- strictement positif (French) ⟺ positive (English)
- $f$ croissante (French) ⟺ $f$ non-decreasing (English)
- $μ$-Strong convexity:
-
- $α ∈ [0, 1]$
- $f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y) - μ \frac{α(1-α)}{2} \Vert x - y \Vert^2$
Classic examples
-
in 1D: $x, x^2, − \log(x), \exp(x), \log(1 + \exp(−x)), \vert x \vert^p \text{ where } p ≥ 1$
-
in $n$-D:
- affine $a^T x + b$
- quadratic $x^T Q x$ if $Q$ is positive semi-definite
- log-sum-exp: $\log\left(\sum\limits_{ i=1 }^n \exp(x_i)\right)$ (checked numerically after this list)
- max
- norms
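
As a sanity check (not a proof), one can test the defining inequality numerically on random points. A minimal sketch, assuming NumPy, for the log-sum-exp example:

```python
# Sanity check (not a proof), assuming NumPy: sample random pairs (x, y) and
# α ∈ [0, 1] and verify f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y) for log-sum-exp.
import numpy as np

def log_sum_exp(x):
    # numerically stable log(sum(exp(x_i)))
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform()
    lhs = log_sum_exp(a * x + (1 - a) * y)
    rhs = a * log_sum_exp(x) + (1 - a) * log_sum_exp(y)
    assert lhs <= rhs + 1e-12  # convexity inequality holds (up to rounding)
print("convexity inequality checked on random samples")
```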
How to recognize a convex function?
- rarely used: in $ℝ$, if $f$ is differentiable: \(f \text{ cvx } ⟺ f' \text{ is non-decreasing}\)
Warning: this doesn’t work for the pointwise infimum: $\inf(f_0(x), f_1(x))$ is not convex in general. Writing \(\inf(f_0(x), f_1(x)) = \inf_{y ∈ \lbrace 0, 1 \rbrace} y f_1 (x) + (1-y) f_0(x)\) does not help: $\lbrace 0, 1 \rbrace$ is not a convex set, and $(x, y) ⟼ y f_1 (x) + (1-y) f_0(x)$ is not jointly convex in $(x, y)$
- Jensen’s inequality for a convex function $f$:
-
if $Z$ is a random vector taking values in the domain of $f$: \(f(𝔼 Z) ≤ 𝔼 f(Z)\)
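
A small Monte Carlo illustration, assuming NumPy, with the (arbitrary) choices $f = \exp$ and $Z$ Gaussian:

```python
# Monte Carlo illustration of Jensen's inequality (assumes NumPy):
# for convex f(z) = exp(z) and Gaussian Z, compare f(E[Z]) with E[f(Z)].
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=0.5, scale=1.0, size=100_000)

lhs = np.exp(Z.mean())        # f(E[Z])  (≈ exp(0.5))
rhs = np.exp(Z).mean()        # E[f(Z)]  (≈ exp(1) for this Gaussian)
print(lhs, rhs)               # lhs ≤ rhs, as predicted by Jensen
```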
Optimization problems
Let $f: ℝ^n ⟶ ℝ$ be any function.
- $\text{minimize}_{x ∈ ℝ^n} f(x)$: an optimization problem (a task to solve, not a number); whether the min exists is a separate question
- $\min_{x ∈ ℝ^n} f(x)$: does not always exist (ex: $x ⟼ \exp(-x^2)$, whose infimum $0$ is not attained)
- $\inf_{x ∈ ℝ^n} f(x)$: always exists (possibly $-∞$)
- $x_0$ local minimum:
-
iff there exists an open neighborhood $V \ni x_0$ s.t. \(f(x_0) = \min_{x ∈ V} f(x)\)
- $x_0$ global minimum:
-
iff \(f(x_0) = \min_{x ∈ ℝ^n} f(x)\)
If $f$ is differentiable:
- $x_0$ is a stationary point (critical point):
-
iff $∇f(x_0) = 0$
Property:
- local minimum ⟹ stationary point (but the converse is not true)
- if $f$ is convex, local minimum ⟺ global minimum ⟺ stationary point (see the sketch below)
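
To illustrate the last point: for a strictly convex quadratic, the unique stationary point is the global minimum, so plain gradient descent reaches it. A minimal sketch, assuming NumPy and an arbitrary positive-definite $Q$:

```python
# Illustration of "stationary point ⟺ global minimum" for a convex function
# (sketch, assumes NumPy): gradient descent on f(x) = 1/2 x^T Q x - b^T x,
# whose unique stationary point solves Qx = b.
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite ⇒ f strictly convex
b = np.array([1.0, -1.0])

def grad(x):
    return Q @ x - b                      # ∇f(x)

x = np.zeros(2)
step = 0.2                                # small enough for this Q
for _ in range(500):
    x = x - step * grad(x)

print(x, np.linalg.solve(Q, b))           # gradient descent ≈ the global minimizer
```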
Constrained optimization problems
Problem \(\text{minimize}_{x ∈ D ⊆ ℝ^n} f(x)\)
Assignment problem
- $p$ tasks
- $p$ machines
- there is a cost matrix $C ∈ ℝ^{p×p}$ s.t. $C(i, j)$ is the cost of the task $i$ for the machine $j$
Goal: find a bijection $σ: \lbrace 1, ⋯, p \rbrace ⟶ \lbrace 1, ⋯, p \rbrace$ s.t.
- task $i$ is assigned to machine $σ(i)$
- \[\text{minimize}_σ \sum\limits_{ i=1 }^p C_{i, σ(i)}\]
Permutation matrices $M ∈ 𝔐_p(ℝ)$ where
- $M_{i, j} = 0$ except when $j = σ(i)$
- $M_{i, σ(i)} = 1$
Facts (doubly stochastic matrices):
- \[\sum\limits_{ i= 1}^p C_{i, σ(i)} = \sum\limits_{i, j = 1}^p C_{i, j}(M_{σ})_{i, j}\]
- \[∀i, \sum\limits_{ j } M_{i, j} = \sum\limits_{ j } M_{j, i} = 1\]
- \[M ≥ 0\]
Goal:
\[\text{minimize}_{M \text{ bistochastic}} \underbrace{\langle C, M \rangle}_{\text{linear function of } M}\]

As the set of bistochastic matrices is convex, we are minimizing a linear function over a convex set.
Birkhoff theorem: the set of bistochastic matrices is the convex hull of the permutation matrices. Since a linear function on a compact convex set attains its minimum at an extreme point, the relaxed problem always admits an optimal solution that is a permutation matrix: the relaxation is tight.
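
In practice the assignment problem can be solved directly by a Hungarian-type solver. A minimal sketch, assuming SciPy and a random cost matrix, that computes an optimal assignment and checks it against brute force for small $p$:

```python
# Sketch of the assignment problem (assumes SciPy):
# scipy.optimize.linear_sum_assignment returns an optimal permutation σ.
# For small p we check it against brute force over all permutations.
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
p = 5
C = rng.uniform(size=(p, p))                      # cost matrix C[i, j]

rows, cols = linear_sum_assignment(C)             # optimal assignment i -> cols[i]
opt = C[rows, cols].sum()

brute = min(sum(C[i, s[i]] for i in range(p))
            for s in itertools.permutations(range(p)))

print(opt, brute)                                 # identical up to rounding
```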
Lagrangian duality
$D$ may not be a vector space (ex: $\lbrace 0, 1 \rbrace^n$)
\[\text{minimize}_{x∈D} f(x)\]

such that (constraints):
- $h_i(x) = 0$ for all $1 ≤ i ≤m$: equality constraints
- $g_j(x) ≤ 0$ for all $1 ≤ j ≤r$: inequality constraints
notation: $D^\ast$ is the subset of $D$ such that the constraints are satisfied (feasible set)
Ex: in the example above, $D$ was the set of all matrices, $D^\ast$ the set of bistochastic matrices
For
- $x ∈ D$: primal
- $λ ∈ ℝ^m$: dual
- $μ ∈ ℝ^r_+$: Lagrange multipliers
- Lagrangian:
- \[ℒ(x, λ, μ) ≝ f(x) + \sum\limits_{ i =1 }^m λ_i h_i(x) + \sum\limits_{ j =1 }^r μ_j g_j(x) \\ = f(x) + λ^T h(x) + μ^T g(x)\]
Maximizing with respect to $λ$: as soon as $h(x) ≠ 0$, the supremum is $+∞$ (let the corresponding $λ_i$ go to $±∞$); otherwise this term is $0$.

Maximizing with respect to $μ ≥ 0$: as soon as $g(x) \not≤ 0$, the supremum is $+∞$; otherwise it is $0$ (attained at $μ = 0$).

Property: for all $x∈ D$: \(\sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ) = \begin{cases} f(x) \text{ if } x ∈ D^\ast \\ +∞ \text{ else} \end{cases}\)

Hence:

\[\text{The original problem} ⟺ \min_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)\]
⟶ a min-max problem
We’ll turn it into a max-min problem.
\[p^\ast ≝ \underbrace{\inf_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)}_{\text{primal problem}} ≥ \underbrace{\max_{λ, μ ∈ ℝ^m × ℝ^r_+} \inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{dual problem}} ≝ d^\ast\]

- Weak duality:
-
the primal value ≥ the dual one
Dual problem: \(d^\ast ≝ \max_{λ, μ ∈ ℝ^m × ℝ^r_+} \underbrace{\min_{x ∈ D} ℒ(x, λ, μ)}_{\text{concave function of } (λ, μ)}\)

(the inner minimum is an infimum of affine functions of $(λ, μ)$, hence concave, even if $f$ is not convex)
NB: the dual problem is easier if there are only a few constraints
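
As a toy illustration: minimize $x^2$ subject to $1 - x ≤ 0$. Here $p^\ast = 1$ (at $x^\ast = 1$), the dual function is $q(μ) = \min_x \left(x^2 + μ(1 - x)\right) = μ - μ^2/4$, and maximizing it over $μ ≥ 0$ gives $d^\ast = 1$, so the primal and dual values coincide. A minimal numerical sketch, assuming NumPy:

```python
# Toy illustration of duality (sketch, assumes NumPy):
#   minimize x^2  subject to  g(x) = 1 - x ≤ 0       (primal value p* = 1 at x* = 1)
# Lagrangian: L(x, μ) = x^2 + μ(1 - x); the dual function q(μ) = min_x L(x, μ)
# is concave, and maximizing it over μ ≥ 0 gives d* = 1 = p*.
import numpy as np

def q(mu):
    # min over x of x^2 + mu*(1 - x) is attained at x = mu/2
    x = mu / 2
    return x**2 + mu * (1 - x)            # = mu - mu^2 / 4

mus = np.linspace(0, 4, 401)
d_star = max(q(m) for m in mus)           # ≈ 1, attained at μ* = 2
p_star = 1.0                              # x* = 1
print(p_star, d_star)                     # weak duality: d* ≤ p*; here they coincide
```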
- Strong duality:
-
the primal value = the dual one
NB: convexity is needed, for strong duality
- Slater’s conditions:
-
if
- $D$ convex, $f$ convex
- $∀i, h_i$ affine (equalities)
- $∀j, g_j$ convex (inequalities)
- strict feasibility: there exists a strictly feasible point: $\overline{x} ∈ D^\ast$ s.t. \(∀j, \; g_j(\overline{x}) < 0\)
then there is strong duality
NB: Strict feasibility rules out degenerate cases such as the constraint $g(x) ≝ x^2 ≤ 0$, whose feasible set $\lbrace 0 \rbrace$ has empty interior
There are multiple duals (one per choice of how the constraints are represented):

Ex: for the constrained least-squares problem $\inf_w \Vert y - X w \Vert^2$ s.t. $\Vert w \Vert ≤ 1$:

Difficult: writing the constraint as $\Vert w \Vert ≤ 1$ gives the Lagrangian

\[\Vert y - X w \Vert^2 + λ(\Vert w \Vert - 1)\]

Easier (and equivalent): writing the constraint as $\Vert w \Vert^2 ≤ 1$ gives the Lagrangian

\[\Vert y - X w \Vert^2 + λ(\Vert w \Vert^2 - 1)\]

which is easy to minimize in $w$.
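
Indeed, for $λ > 0$ the inner minimization over $w$ is a ridge-regression-type problem with the closed-form solution $w(λ) = (X^T X + λ I)^{-1} X^T y$. A minimal sketch, assuming NumPy and random data:

```python
# Sketch (assumes NumPy): with the constraint ||w||^2 ≤ 1, minimizing the
# Lagrangian  ||y - Xw||^2 + λ(||w||^2 - 1)  over w is a ridge problem whose
# minimizer is w(λ) = (X^T X + λ I)^{-1} X^T y  (for λ > 0).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)

def w_of_lambda(lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w = w_of_lambda(5.0)                       # λ = 5.0 is an arbitrary value
print(np.linalg.norm(w))                   # larger λ shrinks ||w|| toward the constraint
```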
Dual of the dual = primal (for convex problems)
Ex:
\[\inf_x f(x) \text{ s.t. } g(x)≤0\]

\[𝒜 ≝ \lbrace (u, v) ∈ ℝ^2 \mid ∃x; \; f(x) ≤ v, g(x) ≤ u\rbrace\]

\[ℒ(x, λ) = f(x) + λ g(x)\]

\[q(λ) ≝ \inf_x ℒ(x, λ) = \inf_{(u,v) ∈ 𝒜} v + λu\]

Karush-Kuhn-Tucker (KKT) conditions
When strong duality holds,

\[x^\ast ∈ D, \quad (λ^\ast, μ^\ast) ∈ ℝ^m × ℝ^r_+\]

are primal/dual optimal iff
- $x^\ast$ feasible and $(λ^\ast, μ^\ast)$ feasible
- $x^\ast$ minimizes $ℒ(x, λ^\ast, μ^\ast)$
- complementary slackness: \(∀j, μ^\ast_j g_j(x^\ast) = 0\)
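
Worked example (continuing the toy problem $\min_x x^2$ s.t. $1 - x ≤ 0$ from above): take $x^\ast = 1$ and $μ^\ast = 2$. Then $x^\ast$ is feasible and $μ^\ast ≥ 0$; $x^\ast$ minimizes $ℒ(x, μ^\ast) = x^2 + 2(1 - x)$ since $\frac{∂ℒ}{∂x}(x^\ast, μ^\ast) = 2x^\ast - μ^\ast = 0$ and $ℒ$ is convex in $x$; and $μ^\ast g(x^\ast) = 2 × 0 = 0$, so complementary slackness holds. Hence $(x^\ast, μ^\ast)$ is primal/dual optimal.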