Lecture 7: Maximum likelihood
Maximum likelihood
Inference model:
Bernoulli model:
- Coin flip $Y ∈ \lbrace 0, 1 \rbrace$ with $ℙ(Y=1) = p ∈ [0,1]$
- $p > 1/2 ⟹$ predict $1$
- $p ≤ 1/2 ⟹$ predict $0$
Other approach (goal of this course):
Estimation:
\[y_1, ⋯, y_n ∈ \lbrace 0, 1 \rbrace ⟹ \text{ compute } p ≝ \frac 1 n \sum\limits_{ i } y_i\]
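For instance (a minimal sketch with made-up data, not from the lecture), the estimation step followed by the inference rule above:

```python
import numpy as np

# Hypothetical sample of coin flips
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Plug-in estimate: empirical frequency of 1's
p_hat = y.mean()

# Inference rule above: predict 1 iff the estimated p is > 1/2
prediction = int(p_hat > 0.5)
print(p_hat, prediction)  # 0.625 1
```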
Model
$𝒴$ measurable space with reference measure: $μ$
- Lebesgue measure if $𝒴 ⊆ ℝ^d$
- Counting measure if $𝒴$ is finite
Collection:
- $p_θ(\bullet): 𝒴 ⟶ ℝ$ densities wrt $μ$
- $p_θ ≥ 0$
- $\int p_θ dμ = 1$
Examples:
Bernoulli:
- $𝒴 = \lbrace 0, 1 \rbrace$
- $θ ≝ p ∈ [0,1]$
- $p_θ(y) = ℙ(Y=y)$
Multinomial:
- $𝒴 = \lbrace 1, ⋯, k \rbrace$
- $θ ∈ \text{ simplex } \lbrace θ ∈ ℝ^k \mid θ ≥ 0, \sum\limits_{ i } θ_i = 1\rbrace$
- $p_θ(y)= ℙ(Y=y) = θ_y$
Gaussian:
- $𝒴 = ℝ$
- $p(y) = \frac{1}{\sqrt{2π}σ} \exp\left(- \frac 1 2 \frac{(y-μ)^2}{σ^2}\right)$
- $μ$: mean
- $σ$: standard deviation
- $σ^2$: variance
Multivariate:
- $𝒴 = ℝ^d$
- $p(y) = \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp\left(- \frac 1 2 (y-μ)^T Σ^{-1} (y-μ)\right)$
- $μ$: mean vector in $ℝ^d$
- $Σ$: covariance matrix (positive definite)
- $d=2$: if $Σ = σ^2 \mathrm{Id}$, the level sets of the density are circles
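As a small numerical illustration of these formulas (a sketch, not from the lecture; the densities are implemented by hand with NumPy):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Univariate Gaussian density N(mu, sigma2) evaluated at y."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def multivariate_gaussian_pdf(y, mu, Sigma):
    """Multivariate Gaussian density N(mu, Sigma) evaluated at a point y of shape (d,)."""
    d = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (y - mu)^T Sigma^{-1} (y - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# With Sigma = sigma^2 * Id in d = 2, the density depends only on ||y - mu||:
# the two points below are at the same distance from mu, hence have the same density.
mu = np.zeros(2)
Sigma = 0.5 * np.eye(2)
print(multivariate_gaussian_pdf(np.array([1.0, 0.0]), mu, Sigma))
print(multivariate_gaussian_pdf(np.array([0.0, 1.0]), mu, Sigma))
print(gaussian_pdf(0.0, 0.0, 1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```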
Likelihood
- Likelihood (French: vraisemblance):
    - of an observation $y$: $L(θ) = p_θ(y)$
    - of an iid sample $y_1, ⋯, y_n$: \(L(θ) = \prod_i p_θ(y_i)\)
- Log likelihood:
- \[\log L(θ) = \sum\limits_{ i } \log p_θ(y_i)\]
NB: e.g. for a Bernoulli distribution with $p = 1/2$, the likelihood of a sample of size $n$ is $\frac{1}{2^n}$, which falls below machine precision for moderate $n$. That’s why one works with the $\log$ (and also to turn products into sums).
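A quick demonstration of this remark (a sketch with simulated data):

```python
import numpy as np

n = 2000
y = np.random.default_rng(0).integers(0, 2, size=n)  # fake Bernoulli(1/2) sample
p = 0.5

# Likelihood of the sample: a product of n terms, each 1/2 -> underflows to 0.0
likelihood = np.prod(p ** y * (1 - p) ** (1 - y))
print(likelihood)          # 0.0 (below machine precision)

# Log-likelihood: a sum, perfectly representable
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)      # -n * log(2) ≈ -1386.29
```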
Maximum likelihood principle
Estimate $θ$ by maximizing $L(θ)$
- Risk interpretation:
- \[-\frac{\log L(θ)}{n} = \frac 1 n \sum\limits_{ i =1 }^n -\log p_θ (y_i)\]
NB: it is the empirical average of the loss $R_θ(y) ≝ - \log p_θ(y)$ over the sample.
In the “population case” (with “infinite” data): $y_1, ⋯, y_n$ are sampled from $p_{θ_\ast}$ ($θ_\ast$ is fixed but unknown), and the empirical risk above tends to $𝔼_y R_θ(y)$. So one tries to:
Minimize
\[\underbrace{𝔼_y R_θ(y)}_{ ≝ \; g(θ)} = - 𝔼_y \log p_θ(y)\]
Question: is $θ_\ast$ minimizing $g(θ)$?
\[g(θ) - g(θ_\ast) = 𝔼_{y \sim p_{θ_\ast}}(- \log p_θ(y)) - 𝔼_{y \sim p_{θ_\ast}}(- \log p_{θ_\ast}(y))\\ = 𝔼_{y \sim p_{θ_\ast}}\left(\log \frac{p_{θ_\ast}(y)}{p_θ(y)}\right) = \int p_{θ_\ast}(y) \log \frac{p_{θ_\ast}(y)}{p_θ(y)} \, dμ(y)\]
- Kullback-Leibler divergence between $p_{θ_\ast} ≝ p$ and $p_θ ≝ q$:
- \[KL(p, q) = \int p(y) \log \frac{p(y)}{q(y)} \, dμ(y)\]
Property: \(KL(p, q) ≥ 0\) with equality iff $p=q$
Proof: Jensen’s inequality (sketched below)
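A short version of the Jensen step (a sketch, assuming $p, q > 0$ so that the ratios are well defined): by concavity of $\log$,

\[-KL(p, q) = \int p(y) \log \frac{q(y)}{p(y)} \, dμ(y) ≤ \log \int p(y) \frac{q(y)}{p(y)} \, dμ(y) = \log \int q(y) \, dμ(y) = \log 1 = 0\]

Equality in Jensen requires $\frac{q}{p}$ to be constant ($p$-almost everywhere), hence $p = q$.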
Examples:
1. Bernoulli
- $𝒴 = \lbrace 0, 1 \rbrace$
- $ℙ(Y=1) = p ∈ [0, 1]$
- $p(y) = p^y (1-p)^{1-y}$
ML ⟺ minimizing the KL divergence between the empirical distribution $\hat{p}$ and $p_θ$:
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i) = - n \; 𝔼_{\text{empirical distrib.} \hat{p}} \log p_θ(y) \\ = - n \; \underbrace{𝔼_{\hat{p}(y)} \log \frac{p_θ(y)}{\hat{p}(y)}}_{ = -KL(\hat{p}, p_θ)} - n \; \underbrace{𝔼_{\hat{p}(y)} \log \hat{p}(y)}_{\text{independent of } θ} \\\]where
\[\hat{p}(y) = \frac 1 n \sum\limits_{ i } δ(y = y_i)\]
Bernoulli:
- $\hat{p}(y = 1) = \frac 1 n \sum\limits_{ i } \underbrace{y_i}_{= δ(y_i = 1)} = p$
- $\hat{p}(y = 0) = \frac 1 n \sum\limits_{ i } (1 - y_i)$
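A quick numerical sanity check (a sketch with made-up data; the grid search is only for illustration): the Bernoulli negative log-likelihood is minimized at the empirical frequency.

```python
import numpy as np

# Hypothetical 0/1 sample
y = np.array([1, 1, 0, 1, 0, 1, 1, 0])

def neg_log_likelihood(p, y):
    """Bernoulli negative log-likelihood  -sum_i log p^{y_i} (1-p)^{1-y_i}."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The minimizer over a grid matches the empirical frequency (the ML estimate)
grid = np.linspace(0.01, 0.99, 99)
p_ml = grid[np.argmin([neg_log_likelihood(p, y) for p in grid])]
print(p_ml, y.mean())  # ≈ 0.62 vs 0.625
```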
Multinomial:
- $𝒴 = \lbrace 1, ⋯, k \rbrace$
- $\hat{p}(y) = \text{frequency of observations equal to } y$
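In code (a minimal sketch with made-up observations), the multinomial ML estimate is just the vector of empirical frequencies:

```python
from collections import Counter

# Hypothetical observations in {1, ..., k}
y = [1, 3, 2, 3, 3, 1, 2, 3]
n = len(y)

# ML estimate of the multinomial parameters: empirical frequencies
theta_ml = {c: count / n for c, count in Counter(y).items()}
print(theta_ml)  # {1: 0.25, 3: 0.5, 2: 0.25}
```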
Gaussian in one variable:
- \[p_θ(y) = \frac{1}{\sqrt{2π σ^2}} \exp\left(- \frac 1 2 \frac{(y - μ)^2}{σ^2}\right)\]
- \[- \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - μ)^2}{σ^2}\right) \\ = α + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(y_i - μ)^2}{σ^2} \qquad \text{where } α ≝ \frac n 2 \log(2π)\]
Minimize wrt $μ$:
Derivative wrt $μ$ = 0 ⟹ \(\sum\limits_{ i } (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i\)
Minimize wrt $λ = \frac{1}{σ^2}$:
In terms of $λ$, the objective is $α - \frac n 2 \log λ + \frac λ 2 \sum\limits_{ i } (y_i - μ)^2$, which is convex wrt $λ$.
So that: derivative wrt $λ$ = 0 ⟹
\[-\frac{n}{2λ} + \frac 1 2 \sum\limits_{ i } (y_i - μ)^2 = 0 ⟹ λ^{-1} = σ^2 = \frac 1 n \sum\limits_{ i } (y_i - μ)^2\]
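A quick check of these closed-form estimates (a sketch with simulated data; note the ML variance uses $\frac 1 n$, not $\frac{1}{n-1}$):

```python
import numpy as np

# Hypothetical sample
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=10_000)

# Closed-form ML estimates derived above
mu_ml = y.mean()                       # (1/n) sum_i y_i
sigma2_ml = ((y - mu_ml) ** 2).mean()  # (1/n) sum_i (y_i - mu)^2
print(mu_ml, sigma2_ml)                # ≈ 2.0 and ≈ 9.0
```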
Multivariate Gaussian
- $μ ∈ ℝ^p$
- $Σ ∈ ℝ^{p×p}$ positive definite
Minimize wrt $μ$:
Derivative wrt $μ$ = 0 ⟹ \(\sum\limits_{ i } Σ^{-1} (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i\)
Minimize wrt $Λ = Σ^{-1}$:
The function is convex wrt $Λ$:
\[-\frac{n}{2} \log \det Λ + \frac 1 2 \sum\limits_{ i } \underbrace{\underbrace{(y_i - μ)^T}_{1 × p} \underbrace{Λ}_{p × p} \underbrace{(y_i - μ)}_{p × 1}}_{= \mathrm{Tr}(Λ \, (y_i - μ)(y_i - μ)^T)}\]
So that: derivative wrt $Λ$ = 0 (using $∇_Λ \log \det Λ = Λ^{-1}$ and $∇_Λ \mathrm{Tr}(Λ A) = A$ for symmetric $A$) ⟹
\[- \frac n 2 Λ^{-1} + \frac 1 2 \sum\limits_{ i } (y_i - μ)(y_i - μ)^T = 0\\ ⟹ Λ^{-1} = \frac 1 n \sum\limits_{ i } (y_i - μ)(y_i - μ)^T\]
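The multivariate estimates have the same flavour: the ML covariance is the empirical covariance (with $\frac 1 n$). A sketch with simulated data:

```python
import numpy as np

# Hypothetical 2D sample
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
Y = rng.multivariate_normal(true_mu, true_Sigma, size=20_000)   # shape (n, 2)

# Closed-form ML estimates derived above
mu_ml = Y.mean(axis=0)
centered = Y - mu_ml
Sigma_ml = centered.T @ centered / len(Y)   # (1/n) sum_i (y_i - mu)(y_i - mu)^T
print(mu_ml)
print(Sigma_ml)   # ≈ true_Sigma
```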
Conditional Maximum Likelihood (ML)
- $Y ∈ 𝒴$ output
- $X ∈ 𝒳$ input
- full ML model $p_θ(x, y)$
- conditional ML $p_θ(y \mid x)$
- $p(x)$ “unspecified”
Conditional ML for iid data:
\[\text{minimize} \; - \sum\limits_{ i } \log p_θ(y_i \mid x_i)\]
We don’t model the distribution of the $x_i$’s, as they are given (the downside is that, given $y$, we can’t say anything about $x$).
Linear regression
- $𝒴 = ℝ$
- $𝒳 = ℝ^d$
$p(y \mid x)$: Gaussian with mean $μ(x) = w^T x + b$ and constant variance $σ^2$
Conditional ML
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i)\]By writing
\[\begin{pmatrix} w \\ b \\ \end{pmatrix}^T \begin{pmatrix} x \\ 1 \\ \end{pmatrix}\]we make $μ$ linear. So in the following, we will assume $μ$ is linear.
\[- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i) \\ = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - w^T x_i)^2}{σ^2}\right) \\ = α + \frac n 2 \, \log(σ^2) + \frac 1 2 \sum\limits_{ i }\frac{(y_i - w^T x_i)^2}{σ^2}\]
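Since, up to constants, the conditional NLL above is the sum of squared residuals, maximizing the likelihood over $w$ is a least-squares problem. A minimal sketch with simulated data (variable names are my own):

```python
import numpy as np

# Hypothetical data y ≈ w^T x + b + Gaussian noise
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + rng.normal(scale=0.3, size=n)

# Append a constant 1 to x so the bias b is part of the linear map (trick above)
X_aug = np.hstack([X, np.ones((n, 1))])

# Minimizing the negative log-likelihood over w is exactly least squares
w_ml, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
sigma2_ml = np.mean((y - X_aug @ w_ml) ** 2)   # ML estimate of the noise variance
print(w_ml)        # ≈ [1.0, -2.0, 0.5, 0.7]
print(sigma2_ml)   # ≈ 0.09
```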
Logistic regression
- $𝒴 = \lbrace 0, 1 \rbrace$
- $𝒳 = ℝ^d$
- $p_θ(y = 1 \mid x) = σ(w^T x + b)$, where $σ$ is the sigmoid function $σ(z) = \frac{1}{1 + e^{-z}}$
The benefit of logistic regression is that it gives an uncertainty about our outputs (a probability rather than a hard label).
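Unlike the Gaussian examples, there is no closed form here; the conditional NLL (the cross-entropy) is convex in $w$ and can be minimized by gradient descent. A minimal sketch with made-up data (the learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    """Conditional NLL  -sum_i [ y_i log σ(w·x_i) + (1-y_i) log(1-σ(w·x_i)) ]."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data; the bias is folded into w via a constant 1-column
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=200) > 0).astype(float)

# Plain gradient descent on the NLL
w = np.zeros(3)
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the NLL wrt w
    w -= 0.01 * grad
print(w, neg_log_likelihood(w, X, y))
```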
Generative VS Discriminative
- pair $(x, y)$
- joint ML: $p_θ(x, y)$
- conditional ML
- discriminative method: $p_θ(y \mid x) ⟶ \max$
- generative method (LDA): $p(x, y) = p_θ(y) \underbrace{p_θ(x \mid y)}_{\text{ if } y \text{ is discrete, called the “class conditional density”}}$
Classification: one makes the assumption:
\[p(x \mid y) \sim \text{ Gaussian: mean } μ_y \text{, cov matrix } Σ_y\]
NB: if $𝒴 = \lbrace 0, 1 \rbrace$ and $𝒳 = ℝ^2$: $μ_0$ is the mean of the points labelled “0”, and $Σ_0$ describes the spread of their distribution.
But there’s a caveat: if the points labelled $0$ form two distinct clusters, the Gaussian assumption breaks down ($μ_0$ then lies between the two clusters, and the boundary separating the $0$-labelled from the $1$-labelled points may cut through one of the $0$-clusters).
Questions:
- How do we get $p(y \mid x)$?
- estimating parameters
- Bayes rule:
- \[p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}\]
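As a worked application of the Bayes rule above (assuming, as in LDA, a shared covariance matrix $Σ_0 = Σ_1 = Σ$ and class priors $π_y ≝ p(y)$), the posterior takes the logistic form seen earlier:

\[p(y = 1 \mid x) = \frac{π_1 \, 𝒩(x; μ_1, Σ)}{π_1 \, 𝒩(x; μ_1, Σ) + π_0 \, 𝒩(x; μ_0, Σ)} = σ(w^T x + b)\]

with $w = Σ^{-1}(μ_1 - μ_0)$ and $b = -\frac 1 2 μ_1^T Σ^{-1} μ_1 + \frac 1 2 μ_0^T Σ^{-1} μ_0 + \log \frac{π_1}{π_0}$: the generative model yields the same functional form as logistic regression, but its parameters are estimated differently (via the class means and covariance rather than by conditional ML).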