# Convex functions

Neural networks: non-convex (their training objective is not convex in the parameters)

We want the loss function to be convex:

$\min_f \frac 1 n \sum\limits_{ i=1 }^n l(y_i, f(x_i)) + λ Ω(f)$
Convex set $K ⊆ ℝ^n$:

iff $∀(x, y) ∈ K, ∀α∈[0, 1], \; αx + (1-α)y ∈ K$

Examples:

• $ℝ^n$
• Unit ball: $\lbrace x \mid \Vert x \Vert ≤ 1 \rbrace$
• Affine subspace: $A^T x = b$
• Half-space: $a^Tx ≤ b$
• Polytopes in $ℝ^n$ (finite intersections of half-spaces)

Property: convex sets are stable under intersection

NB: We’ll always try to see our sets as intersections of convex sets

Convex hull of a set $A ⊆ ℝ^n$:

the smallest convex set containing $A$

Property: $\mathrm{hull}(A) = \text{set of barycenters of points in } A \\ ≝ \left\lbrace \sum\limits_{ i∈ I } α_i x_i \mid I \text{ finite}, x_i ∈ A, α_i ≥ 0, \sum\limits_{ i∈I } α_i = 1 \right\rbrace$
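
As a quick numerical illustration (not part of the definition), one can compute the extreme points of the hull of a finite set; a minimal sketch assuming `numpy` and `scipy` are available:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 2))  # a finite set of 30 points in R^2

hull = ConvexHull(A)
# hull.vertices: indices of the extreme points of hull(A); every point of
# hull(A) is a barycenter (convex combination) of these vertices
print(A[hull.vertices])
```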

Hahn-Banach Theorem: if $C$ and $D$ are two disjoint convex sets, they can be separated by a hyperplane

i.e. $∃a ≠ 0, ∃b; \; \begin{cases} C ⊆ \lbrace a^Tx ≤ b \rbrace \\ D ⊆ \lbrace a^Tx ≥ b \rbrace \end{cases}$

Convex functions: $f: D ⊆ ℝ^n ⟶ ℝ$ is convex:

iff

• $D$ is convex
• $∀(x, y) ∈ D, ∀α∈[0, 1], \; f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y)$

Strict convexity:

• $α ∈ ]0, 1[$ and $x ≠ y$
• $f(αx + (1-α)y) < αf(x) + (1-α)f(y)$

NB: beware of French-to-English translation:

• positif (French) ⟺ non-negative (English)
• strictement positif (French) ⟺ positive (English)
• $f$ croissante (French) ⟺ $f$ non-decreasing (English)

$μ$-Strong convexity:
• $α ∈ [0, 1]$
• $f(αx + (1-α)y) ≤ αf(x) + (1-α)f(y) - μ \frac{α(1-α)}{2} \Vert x - y \Vert^2$
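
For instance (a standard computation, added here as a sanity check): $f(x) ≝ \frac{1}{2}\Vert x \Vert^2$ is $1$-strongly convex, with equality in the inequality above, because of the identity

$\Vert αx + (1-α)y \Vert^2 = α\Vert x \Vert^2 + (1-α)\Vert y \Vert^2 - α(1-α)\Vert x - y \Vert^2$

(expand both sides in terms of $\langle x, y \rangle$ to check it).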

Classic examples

• in 1D: $x, x^2, − \log(x), \exp(x), \log(1 + \exp(−x)), \vert x \vert^p \text{ where } p ≥ 1$

• in $n$-D:

• affine $a^T x + b$
• quadratic $x^T Q x$ if $Q$ is positive semi-definite
• log-sum-exp: $\log\left(\sum\limits_{ i=1 }^n \exp(x_i)\right)$
• max
• norms
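
A numerical sanity check of the convexity inequality for log-sum-exp (sampling can only exhibit violations, never prove convexity); a sketch assuming `numpy` and `scipy`:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
f = lambda x: logsumexp(x)  # log-sum-exp, convex on R^n

for _ in range(10_000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    a = rng.uniform()
    # convexity inequality, with a small tolerance for floating point
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("no violation found")
```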

### How to recognize a convex function?

• rarely used: in $ℝ$, if $f$ is differentiable: $f \text{ convex } ⟺ f' \text{ is non-decreasing}$

Warning: this doesn’t work for infima: the pointwise minimum of two convex functions is not convex in general. Rewriting it as

$\inf(f_0(x), f_1(x)) = \inf_{y ∈ \lbrace 0, 1 \rbrace} y f_1 (x) + (1-y) f_0(x)$

doesn’t help: $\lbrace 0, 1 \rbrace$ is not a convex set, and $(x, y) ⟼ y f_1 (x) + (1-y) f_0(x)$ is not jointly convex in $(x, y)$.
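
A minimal concrete counterexample (functions chosen for illustration): $f_0(x) ≝ (x+1)^2$ and $f_1(x) ≝ (x-1)^2$ are both convex, but their pointwise minimum fails the convexity inequality at the midpoint of $-1$ and $1$:

```python
f0 = lambda x: (x + 1) ** 2      # convex, minimized at -1
f1 = lambda x: (x - 1) ** 2      # convex, minimized at +1
g = lambda x: min(f0(x), f1(x))  # pointwise min: NOT convex

x, y, a = -1.0, 1.0, 0.5
lhs = g(a * x + (1 - a) * y)     # g(0) = 1
rhs = a * g(x) + (1 - a) * g(y)  # 0.5 * 0 + 0.5 * 0 = 0
print(lhs <= rhs)                # False: the inequality is violated
```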

Jensen’s inequality for a convex function $f$:

if $Z$ is a random vector taking values in the domain of $f$: $f(𝔼 Z) ≤ 𝔼 f(Z)$
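
A Monte Carlo illustration with the convex function $\exp$ and $Z \sim 𝒩(0, 1)$ (for which the exact values are $\exp(𝔼 Z) = 1$ and $𝔼 \exp(Z) = e^{1/2} ≈ 1.65$); a sketch assuming `numpy`:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)  # samples of Z ~ N(0, 1)

print(np.exp(z.mean()))  # f(E[Z]) ~ 1.0
print(np.exp(z).mean())  # E[f(Z)] ~ 1.65 >= f(E[Z]), as Jensen predicts
```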

# Optimization problems

Let $f: ℝ^n ⟶ ℝ$ be any function.

• $\text{minimize}_{x ∈ ℝ^n} f(x)$: an optimization problem (a question, not a number): whether the min exists, and where
• $\min_{x ∈ ℝ^n} f(x)$: does not always exist (ex: $x ⟼ \exp(-x^2)$, whose infimum $0$ is not attained)
• $\inf_{x ∈ ℝ^n} f(x)$: always exists (possibly $-∞$)

$x_0$ local minimum:

iff there exists an open neighborhood $V \ni x_0$ s.t. $f(x_0) = \min_{x ∈ V} f(x)$

$x_0$ global minimum:

iff $f(x_0) = \min_{x ∈ ℝ^n} f(x)$

If $f$ is differentiable:

$x_0$ is a stationary point:

iff $∇f(x_0) = 0$ (a stationary point that is not a local extremum is called a saddle point)

Property:

1. local minimum ⟹ stationary point (but the converse is not true: $x ⟼ x^3$ has a stationary point at $0$ which is not a local minimum)
2. if $f$ is convex, local minimum ⟺ global minimum ⟺ stationary point

## Constrained optimization problems

Problem: $\text{minimize}_{x ∈ D ⊆ ℝ^n} f(x)$

### Assignment problem

• $p$ tasks
• $p$ machines
• there is a cost matrix $C ∈ ℝ^{p×p}$ s.t. $C(i, j)$ is the cost of running task $i$ on machine $j$

Goal: find a bijection $σ: \lbrace 1, ⋯, p \rbrace ⟶ \lbrace 1, ⋯, p \rbrace$ s.t.

• task $i$ is assigned to machine $σ(i)$
• $\text{minimize } \sum\limits_{ i=1 }^p C_{i, σ(i)}$

Permutation matrix $M_σ ∈ 𝔐_p(ℝ)$ associated with $σ$, where

• $M_{i, j} = 0$ except when $j = σ(i)$
• $M_{i, σ(i)} = 1$

Facts ($M_σ$ satisfies the following; matrices satisfying 2. and 3. are called doubly stochastic, or bistochastic):

1. $\sum\limits_{ i= 1}^p C_{i, σ(i)} = \sum\limits_{i, j = 1}^p C_{i, j}(M_{σ})_{i, j}$
2. $∀i, \sum\limits_{ j } M_{i, j} = \sum\limits_{ j } M_{j, i} = 1$
3. $M ≥ 0$ (entrywise)

Goal:

$\text{minimize}_{M \text{ bistochastic}} \underbrace{\sum\limits_{i, j} C_{i, j} M_{i, j}}_{\text{linear function of } M}$

As the set of bistochastic matrices is convex, we are minimizing a linear function over a convex set; a linear function attains its minimum at an extreme point (a vertex) of the set.

Birkhoff theorem: the set of bistochastic matrices is the convex hull of the permutation matrices, so some permutation matrix attains the minimum (see the sketch below)
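
In practice one does not optimize over bistochastic matrices directly; a minimal sketch with `scipy.optimize.linear_sum_assignment` (a Hungarian-algorithm-style solver; the random cost matrix is purely illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(3)
p = 5
C = rng.uniform(size=(p, p))  # C[i, j]: cost of running task i on machine j

rows, cols = linear_sum_assignment(C)  # cols[i] = σ(i), an optimal permutation
print(cols, C[rows, cols].sum())       # optimal assignment and its total cost
```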

# Lagrangian duality

$D$ may not be a vector space (ex: $\lbrace 0, 1 \rbrace^n$)

$\text{minimize}_{x∈D} f(x)$

such that (constraints):

1. $h_i(x) = 0$ for all $1 ≤ i ≤ m$: equality constraints
2. $g_j(x) ≤ 0$ for all $1 ≤ j ≤ r$: inequality constraints

notation: $D^\ast$ is the subset of $D$ such that the constraints are satisfied (feasible set)

Ex: in the example above, $D$ was the set of all matrices, $D^\ast$ the set of bistochastic matrices

For

• $x ∈ D$: the primal variable
• $λ ∈ ℝ^m$ and $μ ∈ ℝ^r_+$: the dual variables (Lagrange multipliers)

Lagrangian:
$ℒ(x, λ, μ) ≝ f(x) + \sum\limits_{ i =1 }^m λ_i h_i(x) + \sum\limits_{ j =1 }^r μ_j g_j(x) \\ = f(x) + λ^T h(x) + μ^T g(x)$

With respect to $λ$: as soon as $h(x) ≠ 0$, the supremum of $λ^T h(x)$ is $+∞$ (send the relevant $λ_i$ to $±∞$). Otherwise it is $0$.

With respect to $μ ≥ 0$: as soon as $g(x) \not ≤ 0$, the supremum of $μ^T g(x)$ is $+∞$. Otherwise it is $0$ (attained at $μ = 0$).

Property: for all $x∈ D$: $\sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ) = \begin{cases} f(x) \text{ if } x ∈ D^\ast \\ +∞ \text{ else} \end{cases}$

$\text{The original problem} ⟺ \min_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)$

⟶ a min-max problem

We’ll turn it into a max-min problem.

$p^\ast ≝ \underbrace{\inf_{x ∈ D} \sup_{λ, μ ∈ ℝ^m × ℝ^r_+} ℒ(x, λ, μ)}_{\text{primal problem}} ≥ \underbrace{\max_{λ, μ ∈ ℝ^m × ℝ^r_+} \inf_{x ∈ D} ℒ(x, λ, μ)}_{\text{dual problem}} ≝ d^\ast$
Weak duality:

the primal value ≥ the dual one

Dual problem: $d^\ast ≝ \max_{λ, μ ∈ ℝ^m × ℝ^r_+} \underbrace{\inf_{x ∈ D} ℒ(x, λ, μ)}_{≝ \, q(λ, μ)}$ where the dual function $q$ is concave, as an infimum of affine functions of $(λ, μ)$, so the dual problem is a concave maximization even when the primal is not convex.
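
A toy example (chosen for illustration, not from the lecture): minimize $x^2$ subject to $g(x) ≝ 1 - x ≤ 0$, so $p^\ast = 1$ at $x^\ast = 1$. The dual function is $q(μ) = \inf_x x^2 + μ(1 - x) = μ - \frac{μ^2}{4}$ (the inner infimum is attained at $x = μ/2$); maximizing it over $μ ≥ 0$ gives $d^\ast = 1$ at $μ^\ast = 2$, so strong duality holds here (Slater: $\overline{x} = 2$ is strictly feasible). A numeric check, assuming `numpy`:

```python
import numpy as np

mu = np.linspace(0, 5, 10_001)
q = mu - mu ** 2 / 4            # dual function q(μ) = inf_x x² + μ(1 - x)
print(q.max(), mu[q.argmax()])  # ≈ 1.0 at μ ≈ 2.0, equal to p* = 1
```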

NB: the dual problem is often easier when there are few constraints, since the dual variables then live in a low-dimensional space

Strong duality:

the primal value = the dual one

NB: convexity is in general needed for strong duality

Slater’s conditions:

if

• $D$ convex, $f$ convex
• $∀i, h_i$ affine (equalities)
• $∀j, g_j$ convex (inequalities)
• strict feasibility: there exists a strictly feasible point: $\overline{x} ∈ D^\ast$ s.t. $∀j, \; g_j(\overline{x}) < 0$

then there is strong duality

NB: strict feasibility rules out degenerate cases such as the constraint $g(x) ≝ x^2 ≤ 0$, which is only satisfied at $x = 0$ and has no strictly feasible point

There are multiple duals (multiple choices to represent your constraints):

Ex: least squares with the constraint $\Vert w \Vert ≤ 1$.

Difficult: dualizing the constraint as written, $\Vert w \Vert ≤ 1$, gives

$\inf_w \Vert y - X w \Vert^2 + λ(\Vert w \Vert - 1)$

Easier (and equivalent, since $\Vert w \Vert ≤ 1 ⟺ \Vert w \Vert^2 ≤ 1$): dualizing $\Vert w \Vert^2 ≤ 1$ instead gives

$\inf_w \Vert y - X w \Vert^2 + λ(\Vert w \Vert^2 - 1)$

whose inner problem is ridge regression, smooth in $w$ with a closed-form solution (see the sketch below).
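
A sketch of why the second form is easier (all names and data here are illustrative): for fixed $λ ≥ 0$ the inner problem has the ridge closed form $w(λ) = (X^T X + λ I)^{-1} X^T y$, and $\Vert w(λ) \Vert$ decreases as $λ$ grows, so when the unconstrained solution violates the constraint one can bisect on $λ$ until $\Vert w(λ) \Vert = 1$:

```python
import numpy as np

rng = np.random.default_rng(4)
w_true = np.ones(10)  # ||w_true|| ≈ 3.16 > 1, so the constraint binds
X = rng.standard_normal((50, 10))
y = X @ w_true + 0.1 * rng.standard_normal(50)

def w_ridge(lam):
    """Minimizer of ||y - Xw||² + λ||w||² (ridge closed form)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lo, hi = 0.0, 1e6  # assumes ||w_ridge(lo)|| > 1 > ||w_ridge(hi)||
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if np.linalg.norm(w_ridge(mid)) > 1 else (lo, mid)
print(np.linalg.norm(w_ridge(hi)))  # ≈ 1.0: the constraint is active
```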

Dual of the dual = primal (for convex problems)

Ex: for $\inf_x f(x)$ s.t. $g(x) ≤ 0$, set

$𝒜 ≝ \lbrace (u, v) ∈ ℝ^2 \mid ∃x; \; f(x) ≤ v, \; g(x) ≤ u \rbrace$

$ℒ(x, λ) = f(x) + λ g(x)$

$q(λ) ≝ \inf_x ℒ(x, λ) = \inf_{(u,v) ∈ 𝒜} v + λu$

## Karush-Kuhn-Tucker (KKT) conditions

$x^\ast ∈ D, (λ^\ast, μ^\ast) ∈ ℝ^m × ℝ^r_+$

are primal/dual optimal (under strong duality) iff

1. $x^\ast$ is feasible and $(λ^\ast, μ^\ast)$ is feasible (i.e. $μ^\ast ≥ 0$)
2. $x^\ast$ minimizes $x ⟼ ℒ(x, λ^\ast, μ^\ast)$ over $D$
3. complementary slackness: $∀j, \; μ^\ast_j g_j(x^\ast) = 0$
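
Checking these conditions on the toy problem from the duality section above ($\min x^2$ s.t. $1 - x ≤ 0$, with candidate $x^\ast = 1$, $μ^\ast = 2$); a minimal sketch:

```python
x_star, mu_star = 1.0, 2.0

print(1 - x_star <= 0 and mu_star >= 0)  # 1. primal and dual feasibility
print(2 * x_star - mu_star == 0)         # 2. stationarity: ∂x L = 2x - μ = 0
print(mu_star * (1 - x_star) == 0)       # 3. complementary slackness
```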
