Lecture 7: Maximum likelihood
Maximum likelihood
Inference model:
Bernoulli model:
- Coin flip $Y ∈ \lbrace 0, 1 \rbrace$ with $ℙ(Y=1) = p ∈ [0,1]$
- $p > 1/2 ⟹$ predict $1$
- $p ≤ 1/2 ⟹$ predict $0$
Other approach (goal of this course):
Estimation:
\[y_1, ⋯, y_n ∈ \lbrace 0, 1 \rbrace ⟹ \text{ compute } p ≝ \frac 1 n \sum\limits_{ i } y_i\]
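For instance (a minimal sketch with made-up data, not from the lecture), the estimation step followed by the inference rule above:

```python
import numpy as np

# Hypothetical sample of coin flips
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Plug-in estimate: empirical frequency of 1's
p_hat = y.mean()

# Inference rule above: predict 1 iff the estimated p is > 1/2
prediction = int(p_hat > 0.5)
print(p_hat, prediction)  # 0.625 1
```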
Model
$𝒴$ measurable space with reference measure: $μ$
- Lebesgue measure if $𝒴 ⊆ ℝ^d$
- Counting measure if $𝒴$ is finite
Collection:
- $p_θ(\bullet): 𝒴 ⟶ ℝ$ densities wrt $μ$
- $p_θ ≥ 0$
- $\int p_θ dμ = 1$
Examples:
Bernoulli:
- $𝒴 = \lbrace 0, 1 \rbrace$
- $θ ≝ p ∈ [0,1]$
- $p_θ(y) = ℙ(Y=y)$
Multinomial:
- $𝒴 = \lbrace 1, ⋯, k \rbrace$
- $θ ∈ \text{ simplex } \lbrace θ ∈ ℝ^k \mid θ ≥ 0, \sum\limits_{ i } θ_i = 1\rbrace$
- $p_θ(y)= ℙ(Y=y) = θ_y$
Gaussian:
- $𝒴 = ℝ$
- $p(y) = \frac{1}{\sqrt{2π}σ} \exp\left(- \frac 1 2 \frac{(y-μ)^2}{σ^2}\right)$
- $μ$: mean
- $σ$: standard deviation
- $σ^2$: variance
Multivariate:
- $𝒴 = ℝ^d$
- $p(y) = \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp\left(- \frac 1 2 (y-μ)^T Σ^{-1} (y-μ)\right)$
- $μ$: mean vector in $ℝ^d$
- $Σ$: covariance matrix (positive definite)
- $d=2$: if $Σ = σ^2 \mathrm{Id}$, the level sets of the density are circles
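As a small numerical illustration of these formulas (a sketch, not from the lecture; the densities are implemented by hand with NumPy):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Univariate Gaussian density N(mu, sigma2) evaluated at y."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def multivariate_gaussian_pdf(y, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at a point y of shape (d,)."""
    d = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (y - mu)^T Sigma^{-1} (y - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# With Sigma = sigma^2 * Id in d = 2, the density depends only on ||y - mu||:
# the two points below are at the same distance from mu, hence have the same density.
mu = np.zeros(2)
Sigma = 0.5 * np.eye(2)
print(multivariate_gaussian_pdf(np.array([1.0, 0.0]), mu, Sigma))
print(multivariate_gaussian_pdf(np.array([0.0, 1.0]), mu, Sigma))
print(gaussian_pdf(0.0, 0.0, 1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```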
Likelihood
- Likelihood (French: vraisemblance):
    - of an observation $y$: $L(θ) = p_θ(y)$
    - of an iid sample $y_1, ⋯, y_n$: \(L(θ) = \prod_i p_θ(y_i)\)
- Log likelihood:
- \[\log L(θ) = \sum\limits_{ i } \log p_θ(y_i)\]
NB: e.g. for a Bernoulli distribution with $p = 1/2$, the likelihood of a sample of size $n$ is $\frac{1}{2^n}$, which falls below machine precision for moderate $n$. That’s why one works with the $\log$ (and also to turn products into sums).
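A quick demonstration of this remark (a sketch with simulated data):

```python
import numpy as np

n = 2000
y = np.random.default_rng(0).integers(0, 2, size=n)  # fake Bernoulli(1/2) sample
p = 0.5

# Likelihood of the sample: a product of n terms, each 1/2 -> underflows to 0.0
likelihood = np.prod(p ** y * (1 - p) ** (1 - y))
print(likelihood)          # 0.0 (below machine precision)

# Log-likelihood: a sum, perfectly representable
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)      # -n * log(2) ≈ -1386.29
```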
Maximum likelihood principle
Estimate $θ$ by maximizing $L(θ)$
- Risk interpretation:
- \[-\frac{\log L(θ)}{n} = \frac 1 n \sum\limits_{ i =1 }^n -\log p_θ (y_i)\]
NB: it is the empirical average of the loss $R_θ(y) ≝ - \log p_θ(y)$ over the sample.
In the “population case” (with “infinite” data): $y_1, ⋯, y_n$ are sampled from $p_{θ_\ast}$ ($θ_\ast$ is fixed but unknown), and the empirical risk above tends to $𝔼_y R_θ(y)$. So one tries to:
Minimize
\[\underbrace{𝔼_y R_θ(y)}_{ ≝ \; g(θ)} = - 𝔼_y \log p_θ(y)\]
Question: is $θ_\ast$ minimizing $g(θ)$?
\[g(θ) - g(θ_\ast) = 𝔼_{y \sim p_{θ_\ast}}(- \log p_θ(y)) - 𝔼_{y \sim p_{θ_\ast}}(- \log p_{θ_\ast}(y))\\ = 𝔼_{y \sim p_{θ_\ast}}\left(\log \frac{p_{θ_\ast}(y)}{p_θ(y)}\right) = \int p_{θ_\ast}(y) \log \frac{p_{θ_\ast}(y)}{p_θ(y)} \, dμ(y)\]
- Kullback-Leibler divergence between $p_{θ_\ast} ≝ p$ and $p_θ ≝ q$:
- \[KL(p, q) = \int p(y) \log \frac{p(y)}{q(y)} \, dμ(y)\]
Property: \(KL(p, q) ≥ 0\) with equality iff $p=q$
Proof: Jensen’s inequality (sketched below)
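A short version of the Jensen step (a sketch, assuming $p, q > 0$ so that the ratios are well defined): by concavity of $\log$,

\[-KL(p, q) = \int p(y) \log \frac{q(y)}{p(y)} \, dμ(y) ≤ \log \int p(y) \frac{q(y)}{p(y)} \, dμ(y) = \log \int q(y) \, dμ(y) = \log 1 = 0\]

Equality in Jensen requires $\frac{q}{p}$ to be constant ($p$-almost everywhere), hence $p = q$.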
Examples:
1. Bernoulli
- $𝒴 = \lbrace 0, 1 \rbrace$
- $ℙ(Y=1) = p ∈ [0, 1]$
- $p(y) = p^y (1-p)^{1-y}$
ML ⟺ minimizing the KL divergence between the empirical distribution $\hat{p}$ and $p_θ$:
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i) = - n \; 𝔼_{\text{empirical distrib.} \hat{p}} \log p_θ(y) \\ = - n \; \underbrace{𝔼_{\hat{p}(y)} \log \frac{p_θ(y)}{\hat{p}(y)}}_{ = -KL(\hat{p}, p_θ)} - n \; \underbrace{𝔼_{\hat{p}(y)} \log \hat{p}(y)}_{\text{independent of } θ} \\\]where
\[\hat{p}(y) = \frac 1 n \sum\limits_{ i } δ(y = y_i)\]
Bernoulli:
- $\hat{p}(y = 1) = \frac 1 n \sum\limits_{ i } \underbrace{y_i}_{= δ(y_i = 1)} = p$
- $\hat{p}(y = 0) = \frac 1 n \sum\limits_{ i } (1 - y_i)$
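A quick numerical sanity check (a sketch with made-up data; the grid search is only for illustration): the Bernoulli negative log-likelihood is minimized at the empirical frequency.

```python
import numpy as np

# Hypothetical 0/1 sample
y = np.array([1, 1, 0, 1, 0, 1, 1, 0])

def neg_log_likelihood(p, y):
    """Bernoulli negative log-likelihood  -sum_i log p^{y_i} (1-p)^{1-y_i}."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The minimizer over a grid matches the empirical frequency (the ML estimate)
grid = np.linspace(0.01, 0.99, 99)
p_ml = grid[np.argmin([neg_log_likelihood(p, y) for p in grid])]
print(p_ml, y.mean())  # ≈ 0.62 vs 0.625
```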
Multinomial:
- $𝒴 = \lbrace 1, ⋯, k \rbrace$
- $\hat{p}(y) = \text{frequency of observations equal to } y$
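In code (a minimal sketch with made-up observations), the multinomial ML estimate is just the vector of empirical frequencies:

```python
from collections import Counter

# Hypothetical observations in {1, ..., k}
y = [1, 3, 2, 3, 3, 1, 2, 3]
n = len(y)

# ML estimate of the multinomial parameters: empirical frequencies
theta_ml = {c: count / n for c, count in Counter(y).items()}
print(theta_ml)  # {1: 0.25, 3: 0.5, 2: 0.25}
```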
Gaussian in one variable:
- \[p_θ(y) = \frac{1}{\sqrt{2π σ^2}} \exp\left(- \frac 1 2 \frac{(y - μ)^2}{σ^2}\right)\]
- \[- \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - μ)^2}{σ^2}\right) \\ = α + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(y_i - μ)^2}{σ^2} \qquad \text{where } α ≝ \frac n 2 \log(2π)\]
Minimize wrt $μ$:
Derivative wrt $μ$ = 0 ⟹ \(\sum\limits_{ i } (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i\)
Minimize wrt $λ = \frac{1}{σ^2}$:
In terms of $λ$, the objective is $α - \frac n 2 \log λ + \frac λ 2 \sum\limits_{ i } (y_i - μ)^2$, which is convex wrt $λ$.
So that: derivative wrt $λ$ = 0 ⟹
\[-\frac{n}{2λ} + \frac 1 2 \sum\limits_{ i } (y_i - μ)^2 = 0 ⟹ λ^{-1} = σ^2 = \frac 1 n \sum\limits_{ i } (y_i - μ)^2\]
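A quick check of these closed-form estimates (a sketch with simulated data; note the ML variance uses $\frac 1 n$, not $\frac{1}{n-1}$):

```python
import numpy as np

# Hypothetical sample
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=10_000)

# Closed-form ML estimates derived above
mu_ml = y.mean()                       # (1/n) sum_i y_i
sigma2_ml = ((y - mu_ml) ** 2).mean()  # (1/n) sum_i (y_i - mu)^2
print(mu_ml, sigma2_ml)                # ≈ 2.0 and ≈ 9.0
```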
Multivariate Gaussian
- $μ ∈ ℝ^p$
- $Σ ∈ ℝ^{p×p}$ positive definite
Minimize wrt $μ$:
Derivative wrt $μ$ = 0 ⟹ \(\sum\limits_{ i } Σ^{-1} (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i\)
Minimize wrt $Λ = Σ^{-1}$:
The function is convex wrt $Λ$:
\[-\frac{n}{2} \log \det Λ + \frac 1 2 \sum\limits_{ i } \underbrace{\underbrace{(y_i - μ)^T}_{1 × p} \underbrace{Λ}_{p × p} \underbrace{(y_i - μ)}_{p × 1}}_{= \mathrm{Tr}(Λ \, (y_i - μ)(y_i - μ)^T)}\]
So that: derivative wrt $Λ$ = 0 (using $∇_Λ \log \det Λ = Λ^{-1}$ and $∇_Λ \mathrm{Tr}(Λ A) = A$ for symmetric $A$) ⟹
\[- \frac n 2 Λ^{-1} + \frac 1 2 \sum\limits_{ i } (y_i - μ)(y_i - μ)^T = 0\\ ⟹ Λ^{-1} = \frac 1 n \sum\limits_{ i } (y_i - μ)(y_i - μ)^T\]
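The multivariate estimates have the same flavour: the ML covariance is the empirical covariance (with $\frac 1 n$). A sketch with simulated data:

```python
import numpy as np

# Hypothetical 2D sample
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
Y = rng.multivariate_normal(true_mu, true_Sigma, size=20_000)   # shape (n, 2)

# Closed-form ML estimates derived above
mu_ml = Y.mean(axis=0)
centered = Y - mu_ml
Sigma_ml = centered.T @ centered / len(Y)   # (1/n) sum_i (y_i - mu)(y_i - mu)^T
print(mu_ml)
print(Sigma_ml)   # ≈ true_Sigma
```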
Conditional Maximum Likelihood (ML)
- $Y ∈ 𝒴$ output
- $X ∈ 𝒳$ input
- full ML model $p_θ(x, y)$
- conditional ML $p_θ(y \mid x)$
- $p(x)$ “unspecified”
Conditional ML for iid data:
\[\text{minimize} \; - \sum\limits_{ i } \log p_θ(y_i \mid x_i)\]
We don’t model the distribution of the $x_i$’s, as they are given (the downside is that, given $y$, we can’t say anything about $x$).
Linear regression
- $𝒴 = ℝ$
- $𝒳 = ℝ^d$
$p(y \mid x)$: Gaussian with mean $μ(x) = w^T x + b$ and constant variance $σ^2$
Conditional ML
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i)\]By writing
\[\begin{pmatrix} w \\ b \\ \end{pmatrix}^T \begin{pmatrix} x \\ 1 \\ \end{pmatrix}\]we make $μ$ linear. So in the following, we will assume $μ$ is linear.
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i) \\ = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - w^T x_i)^2}{σ^2}\right) \\ = α + \frac n 2 \, \log(σ^2) + \frac 1 2 \sum\limits_{ i }\frac{(y_i - w^T x_i)^2}{σ^2}\]
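Since, up to constants, the conditional NLL above is the sum of squared residuals, maximizing the likelihood over $w$ is a least-squares problem. A minimal sketch with simulated data (variable names are my own):

```python
import numpy as np

# Hypothetical data y ≈ w^T x + b + Gaussian noise
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + rng.normal(scale=0.3, size=n)

# Append a constant 1 to x so the bias b is part of the linear map (trick above)
X_aug = np.hstack([X, np.ones((n, 1))])

# Minimizing the negative log-likelihood over w is exactly least squares
w_ml, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
sigma2_ml = np.mean((y - X_aug @ w_ml) ** 2)   # ML estimate of the noise variance
print(w_ml)        # ≈ [1.0, -2.0, 0.5, 0.7]
print(sigma2_ml)   # ≈ 0.09
```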
Logistic regression
- $𝒴 = \lbrace 0, 1 \rbrace$
- $𝒳 = ℝ^d$
- $p_θ(y = 1 \mid x) = σ(w^T x + b)$, where $σ$ is the sigmoid function $σ(z) = \frac{1}{1 + e^{-z}}$
The benefit of logistic regression is that it gives an uncertainty about our outputs (a probability rather than a hard label).
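Unlike the Gaussian examples, there is no closed form here; the conditional NLL (the cross-entropy) is convex in $w$ and can be minimized by gradient descent. A minimal sketch with made-up data (the learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    """Conditional NLL  -sum_i [ y_i log σ(w·x_i) + (1-y_i) log(1-σ(w·x_i)) ]."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data; the bias is folded into w via a constant 1-column
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=200) > 0).astype(float)

# Plain gradient descent on the NLL
w = np.zeros(3)
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the NLL wrt w
    w -= 0.01 * grad
print(w, neg_log_likelihood(w, X, y))
```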
Generative VS Discriminative
- pair $(x, y)$
- joint ML: $p_θ(x, y)$
- conditional ML
- discriminative method: $p_θ(y \mid x) ⟶ \max$
- generative method (LDA): $p(x, y) = p_θ(y) \underbrace{p_θ(x \mid y)}_{\text{ if } y \text{ is discrete, called the “class conditional density”}}$
Classification: one makes the assumption:
\[p(x \mid y) \sim \text{ Gaussian: mean } μ_y \text{, cov matrix } Σ_y\]
NB: if $𝒴 = \lbrace 0, 1 \rbrace$ and $𝒳 = ℝ^2$: $μ_0$ is the mean of the points labelled “0”, and $Σ_0$ describes the spread of their distribution.
But there’s a caveat: if the points labelled $0$ form two distinct clusters, the Gaussian assumption breaks down ($μ_0$ then lies between the two clusters, and the boundary separating the $0$-labelled from the $1$-labelled points may cut through one of the $0$-clusters).
Questions:
- How do we get $p(y \mid x)$?
- estimating parameters
- Bayes rule:
- \[p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}\]
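As a worked application of the Bayes rule above (assuming, as in LDA, a shared covariance matrix $Σ_0 = Σ_1 = Σ$ and class priors $π_y ≝ p(y)$), the posterior takes the logistic form seen earlier:

\[p(y = 1 \mid x) = \frac{π_1 \, 𝒩(x; μ_1, Σ)}{π_1 \, 𝒩(x; μ_1, Σ) + π_0 \, 𝒩(x; μ_0, Σ)} = σ(w^T x + b)\]

with $w = Σ^{-1}(μ_1 - μ_0)$ and $b = -\frac 1 2 μ_1^T Σ^{-1} μ_1 + \frac 1 2 μ_0^T Σ^{-1} μ_0 + \log \frac{π_1}{π_0}$: the generative model yields the same functional form as logistic regression, but its parameters are estimated differently (via the class means and covariance rather than by conditional ML).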