Exercises 6: Learning theory and PAC bounds
Maximum Likelihood and Linear Discriminant Analysis
1. Gaussian in one variable:
Let $(x_1, ⋯, x_n)$ be an iid sample from the model $𝒩(μ, σ^2)$, with density
\[p(x) = \frac{1}{\sqrt{2π}\, σ} \exp\left(- \frac 1 2 \frac{(x - μ)^2}{σ^2}\right)\]
\[\begin{align*} - \log L(x_1, ⋯, x_n, μ, σ^2) &= - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(x_i - μ)^2}{σ^2}\right) \\ &= \underbrace{n \, \log (\sqrt{2π}) + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(x_i - μ)^2}{σ^2}}_{ ≝ \, l(μ, σ)} \end{align*}\]Minimize $l$ with respect to $μ$:
\[\begin{align*} &\frac{\partial l}{\partial μ}(\hat{μ}) = 0 \\ ⟹ & \sum\limits_{ i } (x_i - \hat{μ}) = 0 \\ ⟹ & \hat{μ} = \frac 1 n \sum\limits_{ i } x_i \end{align*}\]Minimize $l$ wrt $λ ≝ \frac{1}{σ^2}$:
The function is convex wrt $λ$: as a function of $λ$, $l = n \log (\sqrt{2π}) - \frac n 2 \log λ + \frac λ 2 \sum_i (x_i - μ)^2$, so $\frac{\partial^2 l}{\partial λ^2} = \frac{n}{2 λ^2} > 0$ and the first-order condition characterizes the minimum:
\[\begin{align*} &\frac{\partial l}{\partial λ}(\hat{λ}) = 0 \\ ⟹ & -\frac{n}{2\hat{λ}} + \frac 1 2 \sum\limits_{ i } (x_i - \hat{μ})^2 = 0 \\ ⟹ & \hat{λ}^{-1} = \hat{σ}^2 = \frac 1 n \sum\limits_{ i } (x_i - \hat{μ})^2 \end{align*}\]
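As a sanity check (not part of the exercise), here is a minimal NumPy sketch on simulated data, verifying that $\hat{μ}$ and $\hat{σ}^2$ do at least as well as nearby parameter values on the negative log-likelihood $l$:

```python
# Sketch: check the closed-form Gaussian MLE on simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # iid sample from N(2, 9)

mu_hat = x.mean()                                 # (1/n) sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()           # (1/n) sum_i (x_i - mu_hat)^2

def neg_log_likelihood(mu, sigma2):
    """l(mu, sigma) = n log sqrt(2 pi) + (n/2) log sigma^2 + (1/(2 sigma^2)) sum_i (x_i - mu)^2"""
    n = x.size
    return (n * np.log(np.sqrt(2 * np.pi))
            + 0.5 * n * np.log(sigma2)
            + 0.5 * ((x - mu) ** 2).sum() / sigma2)

# The closed form should do at least as well as nearby parameters.
assert neg_log_likelihood(mu_hat, sigma2_hat) <= neg_log_likelihood(mu_hat + 0.1, sigma2_hat)
assert neg_log_likelihood(mu_hat, sigma2_hat) <= neg_log_likelihood(mu_hat, 1.1 * sigma2_hat)
print(mu_hat, sigma2_hat)   # roughly 2 and 9
```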
BONUS: Regression
\[y = β^T x + ε\]where
- $ε \sim 𝒩(0, σ^2)$
- $y \mid x \sim 𝒩(β^T x, σ^2)$
Same as before, but replace $μ$ by $β^Tx$:
\[\log L(β, σ^2) = -n \log (\sqrt{2π σ^2}) - \frac{1}{2σ^2} \sum\limits_{ i=1 }^n (y_i - β^T x_i)^2\]Optimize over $β$:
\[\max - \frac{1}{2 σ^2} \sum\limits_{ i }(y_i - β^T x_i)^2 \\ ⟺ \min \sum\limits_{ i } (y_i - β^T x_i)^2\]⟶ Least Squares problem
$\hat{β} = (X^T X)^{-1} X^T y$, where $X$ is the matrix with rows $x_i^T$ and $y = (y_1, ⋯, y_n)^T$ (assuming $X^T X$ is invertible)
Optimize over $σ^2$:
\[\hat{σ}^2 = \frac 1 n \sum\limits_{ i=1 }^n (y_i - \hat{β}^T x_i)^2\](same computation as before)
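A minimal sketch of these two estimators on simulated data (names and numbers are illustrative; assumes NumPy):

```python
# Sketch with simulated data: the ML estimate of beta is the least-squares solution,
# and the ML estimate of sigma^2 is the mean squared residual (1/n factor).
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))                        # design matrix, rows x_i^T
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)  # eps ~ N(0, 0.3^2)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T y, assuming X^T X invertible
sigma2_hat = ((y - X @ beta_hat) ** 2).mean()      # (1/n) sum_i (y_i - beta_hat^T x_i)^2

print(beta_hat)     # close to [1, -2, 0.5]
print(sigma2_hat)   # close to 0.09
```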
BONUS: Discrete model
Two random variables $z$ and $x$
- $z ∈ \lbrace 1, ⋯, M \rbrace$
- $x ∈ \lbrace 1, ⋯, K \rbrace$
Samples $(z_i, x_i)_{1 ≤ i ≤ n}$
Instead of the integer values, consider the one-hot encodings
\[z_i = (z_i^m)_{1 ≤ m ≤ M}, \qquad x_i = (x_i^k)_{1 ≤ k ≤ K}\]where
- $z_i = (0, ⋯, 0, 1, 0, ⋯, 0) ∈ \lbrace 0, 1 \rbrace^M$ (the $1$ is in position $m$ when $z_i = m$)
- $x_i = (0, ⋯, 0, 1, 0, ⋯, 0) ∈ \lbrace 0, 1 \rbrace^K$ (the $1$ is in position $k$ when $x_i = k$)
Basically:
- \[z_i^m = \begin{cases} 1 \text{ if } z_i = m \\ 0 \text{ else} \end{cases}\]
- \[x_i^k = \begin{cases} 1 \text{ if } x_i = k \\ 0 \text{ else} \end{cases}\]
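For instance, the one-hot vectors can be built as follows (a sketch with made-up values; the code uses 0-based indices while the notes use 1-based):

```python
# Illustrative one-hot encoding of discrete samples (values are made up).
import numpy as np

M, K = 3, 4
z = np.array([0, 2, 1, 2])      # z_i, 0-indexed in {0, ..., M-1}
x = np.array([3, 0, 1, 3])      # x_i, 0-indexed in {0, ..., K-1}

Z = np.eye(M, dtype=int)[z]     # Z[i, m] = z_i^m (1 iff z_i = m)
X = np.eye(K, dtype=int)[x]     # X[i, k] = x_i^k (1 iff x_i = k)
print(Z)
print(X)
```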
Log-likelihood:
\[\begin{align*} l(π, θ) &= \log\left(\prod_{i=1}^n p(x_i, z_i \mid π, θ)\right) \\ \end{align*}\] \[p(x_i, z_i \mid π, θ) = \prod_{m=1}^M \left(π_m^{z_i^m} \prod_{k=1}^K θ_{k,m}^{z_i^m x_i^k}\right)\]where
\[π_m^{z_i^m} = \begin{cases} π_m &&\text{ if } z_i^m = 1 ⟺ z_i = m \\ 1 &&\text{ else (i.e. } z_i^m = 0 \text{ )} \end{cases}\]and similarly for $θ_{k,m}^{z_i^m x_i^k}$,
since $p(x_i = k, z_i = m \mid π, θ) = π_m θ_{k, m}$ (with $π_m = p(z_i = m)$ and $θ_{k, m} = p(x_i = k \mid z_i = m)$)
So:
\[\begin{align*} l(π, θ) &= \log\left(\prod_{i=1}^n p(x_i, z_i \mid π, θ)\right) \\ &= \log\left(\prod_{i=1}^n \prod_{m=1}^M \left(π_m^{z_i^m} \prod_{k=1}^K θ_{k,m}^{z_i^m x_i^k}\right)\right) \\ &= \sum\limits_{ i=1 }^n \sum\limits_{ m=1 }^M z_i^m \log π_m + \sum\limits_{ i=1 }^n \sum\limits_{ m=1 }^M \sum\limits_{ k=1 }^K z_i^m x_i^k \log θ_{k,m} \\ &= \sum\limits_{ m =1 }^M N_m \log π_m + \sum\limits_{ m =1 }^M \sum\limits_{ k =1 }^K N_{m, k} \log θ_{k,m} \end{align*}\]where
- $N_m$ is the number of samples s.t. $z = m$
- $N_{m, k}$ is the number of samples s.t. $z = m, x = k$
Therefore:
\[l(π, θ) = \underbrace{\sum\limits_{ m =1 }^M N_m \log π_m}_{\text{ only depends on } π} + \underbrace{\sum\limits_{ m =1 }^M \sum\limits_{ k =1 }^K N_{m, k} \log θ_{k,m}}_{\text{ only depends on } θ}\]$\max l(π, θ)$ with the constraints
- $\sum\limits_{ m=1 }^M π_m = 1$
- $\sum\limits_{ k=1 }^K θ_{k, m} = 1$ for all $m$
Two problems:
- \[\max_π \sum\limits_{ m =1 }^M N_m \log π_m\] s.t. $\sum\limits_{ m=1 }^M π_m = 1$
- \[\max_θ \sum\limits_{ k =1 }^K \sum\limits_{ m =1 }^M N_{m, k} \log θ_{k,m}\] s.t. $\sum\limits_{ k=1 }^K θ_{k, m} = 1$ for all $1 ≤ m ≤ M$
1. Lagrangian of the first problem:
\[ℒ(π, λ) = \sum\limits_{ m =1 }^M N_m \log π_m + λ \left(\sum\limits_{ m=1 }^M π_m - 1\right)\]First order conditions:
- $\frac{\partial ℒ}{\partial π_m} = 0$
- $\frac{\partial ℒ}{\partial λ} = 0$
⟺
- $\frac{N_m}{π_m} + λ = 0$
- $\sum\limits_{ m=1 }^M π_m = 1$
⟺
- $π_m = -\frac{N_m}{λ}$
- $1 = \sum\limits_{ m } π_m = -\sum\limits_{ m } \frac{N_m}{λ} ⟹ λ = -n$
⟹ \(π_m = \frac{N_m}{n}\)
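The same computation on the second problem (one Lagrange multiplier per constraint) gives $θ_{k, m} = \frac{N_{m, k}}{N_m}$, i.e. the empirical conditional frequencies. A quick numerical sanity check of $π_m = \frac{N_m}{n}$ (a sketch with simulated data, assuming NumPy):

```python
# Sanity check of pi_hat_m = N_m / n on simulated data.
import numpy as np

rng = np.random.default_rng(2)
M, n = 4, 10_000
pi_true = np.array([0.1, 0.2, 0.3, 0.4])
z = rng.choice(M, size=n, p=pi_true)

N = np.bincount(z, minlength=M)          # N_m = #{i : z_i = m}
pi_hat = N / n                           # closed-form MLE

def objective(pi):
    return (N * np.log(pi)).sum()        # sum_m N_m log pi_m

# pi_hat should beat random points of the probability simplex.
for _ in range(1000):
    p = rng.dirichlet(np.ones(M))
    assert objective(pi_hat) >= objective(p)
print(pi_hat)    # close to pi_true
```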