Exercises 6: Learning theory and PAC bounds

Maximum Likelihood and Linear discriminant analysis

1. Gaussian in one variable:

p(y) = \frac{1}{\sqrt{2π} σ} \exp\left(- \frac 1 2 \frac{(y - μ)^2}{σ^2}\right)

$(x_1, ⋯, x_n)$ iid sample from the model $\sim 𝒩(μ, σ^2)$

\begin{align*} - \log L(x_1, ⋯, x_n, μ, σ^2) &= - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(x_i - μ)^2}{σ^2}\right) \\ &= \underbrace{n \, \log (\sqrt{2π}) + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(x_i - μ)^2}{σ^2}}_{ ≝ \, l(μ, σ)} \end{align*}

Minimize $l$ with respect to $μ$:

\begin{align*} &\frac{\partial l}{\partial μ}(\hat{μ}) = 0 \\ ⟹ & \sum\limits_{ i } (x_i - \hat{μ}) = 0 \\ ⟹ & \hat{μ} = \frac 1 n \sum\limits_{ i } x_i \end{align*}

Minimize $l$ wrt $λ ≝ \frac{1}{σ^2}$:

In terms of $λ$, $l(\hat{μ}, λ) = n \log(\sqrt{2π}) - \frac{n}{2} \log λ + \frac{λ}{2} \sum_i (x_i - \hat{μ})^2$, which is convex wrt $λ$:

\begin{align*} &\frac{\partial l}{\partial λ}(\hat{λ}) = 0 \\ ⟹ & -\frac{n}{2\hat{λ}} + \frac 1 2 \sum\limits_{ i } (x_i - \hat{μ})^2 = 0 \\ ⟹ & \hat{λ}^{-1} = \hat{σ}^2 = \frac 1 n \sum\limits_{ i } (x_i - \hat{μ})^2 \end{align*}
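As a numerical sanity check, the two closed-form estimators $\hat{μ}$ and $\hat{σ}^2$ can be verified on a simulated sample (a minimal NumPy sketch; the true parameters $μ = 2$, $σ = 3$ and the seed are assumed for illustration):

```python
import numpy as np

# Hypothetical iid sample from N(mu=2, sigma^2=9); fixed seed for reproducibility.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

# Closed-form ML estimators derived above.
mu_hat = x.mean()                        # (1/n) sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/n) sum_i (x_i - mu_hat)^2

print(mu_hat, sigma2_hat)
```

Both estimates should land close to the true $(μ, σ^2) = (2, 9)$ for a sample this large. Note that $\hat{σ}^2$ divides by $n$, not $n-1$: the MLE is biased.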

BONUS: Regression:

y = β^T x + ε

where

  • $ε \sim 𝒩(0, σ^2)$
  • $y \mid x \sim 𝒩(β^T x, σ^2)$

Same as before, but replace $μ$ by $β^T x$ (note: this time $l$ denotes the log-likelihood itself, to be maximized rather than minimized):

l(β, σ^2) ≝ \log L(β, σ^2) = -n \log (\sqrt{2π σ^2}) - \frac{1}{2σ^2} \sum\limits_{ i=1 }^n (y_i - β^T x_i)^2

Optimize over $β$:

\max - \frac{1}{2 σ^2} \sum\limits_{ i }(y_i - β^T x_i)^2 \\ ⟺ \min \sum\limits_{ i } (y_i - β^T x_i)^2

⟶ Least Squares problem

$\hat{β} = (X^T X)^{-1} X^T y$

where $X$ is the matrix with rows $x_i^T$ (assuming $X^T X$ is invertible).

Optimize over $σ^2$:

\hat{σ}^2 = \frac 1 n \sum\limits_{ i=1 }^n (y_i - \hat{β}^T x_i)^2

(same computation as before)
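The normal-equations solution and the residual-variance estimate can be sketched numerically (hypothetical data; the true $β = (1, -2)$ and noise scale $0.5$ are assumed for illustration):

```python
import numpy as np

# Hypothetical data: y = beta^T x + noise, beta = (1, -2) assumed.
rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
# (solve the linear system instead of inverting X^T X).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance: mean squared residual.
sigma2_hat = ((y - X @ beta_hat) ** 2).mean()

print(beta_hat, sigma2_hat)
```

With this sample size, $\hat{β}$ should be close to $(1, -2)$ and $\hat{σ}^2$ close to $0.5^2 = 0.25$.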

BONUS: Discrete model

Two random variables $z$ and $x$

  • $z ∈ \lbrace 1, ⋯, M \rbrace$
  • $x ∈ \lbrace 1, ⋯, K \rbrace$
p(z = m) = π_m \\ p(x = k \mid z=m) = θ_{k,m}

Samples $(z_i, x_i)_{1 ≤ i ≤ n}$

Instead consider:

(z_i, x_i) = (z_i^m, x_i^k)_{1≤ m ≤ M, 1 ≤ k ≤ K}

where

  • $z_i = (0, ⋯, 1, 0, ⋯, 0) ∈ \lbrace 0, 1 \rbrace^M$ (in position $m$)

  • $x_i = (0, ⋯, 1, 0, ⋯, 0) ∈ \lbrace 0, 1 \rbrace^K$ (in position $k$)

Basically:

  • z_i^m = \begin{cases} 1 \text{ if } z_i = m \\ 0 \text{ else} \end{cases}
  • x_i^k = \begin{cases} 1 \text{ if } x_i = k \\ 0 \text{ else} \end{cases}
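The one-hot encoding above can be sketched in NumPy (hypothetical labels; $M = 3$ assumed for illustration):

```python
import numpy as np

# Hypothetical labels z_i in {1, ..., M}, 1-indexed as in the notes.
z = np.array([1, 3, 2, 3, 1])
M = 3

# One-hot encoding: row i of Z has a single 1, in column z_i
# (shifted to 0-indexing by z - 1).
Z = np.eye(M, dtype=int)[z - 1]
print(Z)
```

Each row of `Z` sums to 1, matching the definition of $z_i ∈ \lbrace 0, 1 \rbrace^M$.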

Likelihood:

\begin{align*} l(π, θ) &= \log\left(\prod_{i=1}^n p(x_i, z_i \mid π, θ)\right) \end{align*}
p(x_i, z_i \mid π, θ) = \prod_{k=1}^K \prod_{m=1}^M π_m^{z_i^m} θ_{k,m}^{z_i^m x_i^k}

where

π_m^{z_i^m} = \begin{cases} π_m &&\text{ if } z_i^m = 1 ⟺ z_i = m \\ 1 &&\text{ else (i.e. } z_i^m = 0 \text{ )} \end{cases}

and similarly for $θ_{k,m}^{z_i^m x_i^k}$ (the exponent $z_i^m x_i^k$ is $1$ exactly when $z_i = m$ and $x_i = k$)

since $p(x_i = k, z_i = m \mid π, θ) = π_m θ_{k, m}$

So:

\begin{align*} l(π, θ) &= \log\left(\prod_{i=1}^n p(x_i, z_i \mid π, θ)\right) \\ &= \log\left(\prod_{i=1}^n \prod_{k=1}^K \prod_{m=1}^M π_m^{z_i^m} θ_{k,m}^{z_i^m x_i^k}\right) \\ &= \sum\limits_{ i=1 }^n \sum\limits_{ m=1 }^M z_i^m \log π_m + \sum\limits_{ i=1 }^n \sum\limits_{ m=1 }^M \sum\limits_{ k=1 }^K z_i^m x_i^k \log θ_{k,m} \\ &= \sum\limits_{ m =1 }^M N_m \log π_m + \sum\limits_{ m =1 }^M \sum\limits_{ k =1 }^K N_{m, k} \log θ_{k,m} \end{align*}

where

  • $N_m$ is the number of samples s.t. $z = m$

  • $N_{m, k}$ is the number of samples s.t. $z = m, x= k$

Therefore:

l(π, θ) = \underbrace{\sum\limits_{ m =1 }^M N_m \log π_m}_{\text{ only depends on } π} + \underbrace{\sum\limits_{ m =1 }^M \sum\limits_{ k =1 }^K N_{m, k} \log θ_{k,m}}_{\text{ only depends on } θ}

$\max l(π, θ)$ with the constraints

  • $\sum\limits_{ m=1 }^M π_m = 1$
  • $\sum\limits_{ k=1 }^K θ_{k, m} = 1$ for all $m$

Two problems:

  1. \max_π \sum\limits_{ m =1 }^M N_m \log π_m

    s.t. $\sum\limits_{ m=1 }^M π_m = 1$


  2. \max_θ \sum\limits_{ k =1 }^K \sum\limits_{ m =1 }^M N_{m, k} \log θ_{k,m}

    s.t. $\sum\limits_{ k=1 }^K θ_{k, m} = 1$ for all $1 ≤ m ≤ M$

Solving problem 1:

Lagrangian:

ℒ(π, λ) = \sum\limits_{ m =1 }^M N_m \log π_m + λ \left(\sum\limits_{ m=1 }^M π_m - 1\right)

First order conditions:

  • $\frac{\partial ℒ}{\partial π_m} = 0$
  • $\frac{\partial ℒ}{\partial λ} = 0$

  • $\frac{N_m}{π_m} + λ = 0$
  • $\sum\limits_{ m=1 }^M π_m = 1$

  • $π_m = -\frac{N_m}{λ}$
  • $1 = \sum\limits_{ m } π_m = -\sum\limits_{ m } \frac{N_m}{λ} ⟹ λ = -n$

π_m = \frac{N_m}{n}
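So $\hat{π}_m$ is the empirical frequency of class $m$; the same Lagrangian argument applied to problem 2 gives $\hat{θ}_{k,m} = \frac{N_{m,k}}{N_m}$. A minimal numerical sketch (hypothetical sample with $M = 2$, $K = 3$ assumed for illustration):

```python
import numpy as np

# Hypothetical paired sample (z_i, x_i), 1-indexed, M = 2, K = 3.
z = np.array([1, 1, 2, 2, 2, 1])
x = np.array([1, 3, 2, 2, 1, 3])
M, K = 2, 3
n = len(z)

# N_m: number of samples with z = m; N_{m,k}: with z = m and x = k.
N_m = np.array([(z == m).sum() for m in range(1, M + 1)])
N_mk = np.array([[((z == m) & (x == k)).sum() for k in range(1, K + 1)]
                 for m in range(1, M + 1)])

pi_hat = N_m / n                  # pi_m = N_m / n
theta_hat = N_mk / N_m[:, None]   # theta_{k,m} = N_{m,k} / N_m (row m, column k)

print(pi_hat)
print(theta_hat)
```

Each row of `theta_hat` sums to 1, as required by the constraint $\sum_k θ_{k,m} = 1$.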
