# Maximum likelihood

### Inference model:

Bernoulli model:

• Coin $ℙ(Y=1) = p∈[0,1]$ ($Y ∈ \lbrace 0, 1 \rbrace$)

• $p>1/2 ⟹ \text{predict } 1$
• $p≤1/2 ⟹ \text{predict } 0$

i.e. $\hat y ∈ \mathop{\mathrm{argmax}}_y \; ℙ(Y=y)$

Other approach (goal of this course):

### Estimation:

$y_1, ⋯, y_n ∈ \lbrace 0, 1 \rbrace ⟹ \text{ compute } p ≝ \frac 1 n \sum\limits_{ i } y_i$
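A minimal sketch of this estimator (the toy sample below is made up):

```python
# The ML estimate of a Bernoulli parameter is the sample mean
# p̂ = (1/n) Σ y_i; the observations here are illustrative.
y = [1, 0, 1, 1, 0, 1, 0, 1]

p_hat = sum(y) / len(y)
print(p_hat)  # 0.625
```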

## Model

$𝒴$ measurable space with reference measure: $μ$

• Lebesgue measure if $𝒴 ⊆ ℝ^d$
• Counting measure if $𝒴$ is finite

Collection:

• $p_θ(\bullet): 𝒴 ⟶ ℝ$ densities wrt $μ$

• $p_θ ≥ 0$
• $\int p_θ dμ = 1$

Examples:

Bernoulli:

• $𝒴 = \lbrace 0, 1 \rbrace$
• $θ ≝ p ∈ [0,1]$

• $p_θ(y) = ℙ(Y=y)$

Multinomial:

• $𝒴 = \lbrace 1, ⋯, k \rbrace$
• $θ ∈ \text{ simplex } \lbrace θ ∈ ℝ^k \mid θ ≥ 0, \sum\limits_{ i } θ_i = 1\rbrace$

• $p_θ(y)= ℙ(Y=y) = θ_y$

Gaussian:

• $𝒴 = ℝ$
• $p(y) = \frac{1}{\sqrt{2π}σ} \exp(- \frac 1 2 \frac{(y-μ)^2}{σ^2})$

• $μ$: mean
• $σ$: standard deviation
• $σ^2$: variance

Multivariate:

• $𝒴 = ℝ^d$
• $p(y) = \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp(- \frac 1 2 (y-μ)^T Σ^{-1} (y-μ))$

• $Σ$: covariance matrix (positive definite)

• $μ$: mean vector in $ℝ^d$

• $d=2$: if $Σ = σ^2 Id$: level sets are circles

## Likelihood

Likelihood (French: vraisemblance):
• of an observation $y$: $L(θ) = p_θ(y)$

• of an iid sample: $y_1, ⋯, y_n$: $L(θ) = \prod_i p_θ(y_i)$

Log likelihood:
$\log L(θ) = \sum\limits_{ i } \log p_θ(y_i)$

NB: e.g. for a sample of $n$ Bernoulli observations with $p = 1/2$: the likelihood is $\frac{1}{2^n}$ ⟹ quickly below machine precision. That’s why one takes the $\log$ (and also to turn products into sums).
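This underflow is easy to reproduce; a small sketch (the sample size $n = 2000$ is arbitrary):

```python
import math

# The raw likelihood of n fair-coin flips is 2^-n, which underflows
# to 0.0 in double precision long before n = 2000; the sum of
# log-densities stays perfectly representable.
n = 2000
p = 0.5

likelihood = p ** n               # product of n densities
log_likelihood = n * math.log(p)  # sum of n log-densities

print(likelihood)      # 0.0 — underflow
print(log_likelihood)  # ≈ -1386.29
```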

### Maximum likelihood principle

Estimate $θ$ by maximizing $L(θ)$

Risk interpretation:
$-\frac{\log L(θ)}{n} = \frac 1 n \sum\limits_{ i =1 }^n -\log p_θ (y_i)$

NB: it is the average of $R_θ(y_i)$, where $R_θ(y) ≝ - \log p_θ(y)$

With “infinite” data (the “population case”): $y_1, ⋯, y_n$ sampled iid from $p_{θ_\ast}$ ($θ_\ast$ is fixed but unknown), the empirical risk above tends to $𝔼_y R_θ(y)$ by the law of large numbers. So one tries to:

Minimize

$\underbrace{𝔼_y R_θ(y)}_{ ≝ \; g(θ)} = - 𝔼_y \log p_θ(y)$

Question: is $θ_\ast$ minimizing $g(θ)$?

$g(θ) - g(θ_\ast) = 𝔼_{y \sim p_{θ_\ast}}(- \log p_θ(y)) - 𝔼_{y \sim p_{θ_\ast}}(- \log p_{θ_\ast}(y))\\ = 𝔼_{y \sim p_{θ_\ast}}\left(\log \frac{p_{θ_\ast}(y)}{p_θ(y)}\right) = \int p_{θ_\ast}(y) \log \frac{p_{θ_\ast}(y)}{p_θ(y)} \, dμ(y)$

Kullback-Leibler divergence between $p_{θ_\ast} ≝ p$ and $p_θ ≝ q$:
$KL(p, q) = \int p(y) \log \frac{p(y)}{q(y)} dμ(y)$

Property: $KL(p, q) ≥ 0$ with equality iff $p=q$

Proof: Jensen’s inequality, since $-\log$ is convex: $KL(p, q) = 𝔼_p\left(- \log \frac{q(y)}{p(y)}\right) ≥ - \log 𝔼_p\left(\frac{q(y)}{p(y)}\right) = - \log 1 = 0$
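A quick numerical sanity check of this property, sketched for two Bernoulli densities (the parameter values are arbitrary):

```python
import math

def kl_bernoulli(p, q):
    """KL(p, q) between two Bernoulli distributions (counting measure)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# KL ≥ 0, with equality iff the two distributions coincide
print(kl_bernoulli(0.3, 0.7))  # > 0
print(kl_bernoulli(0.4, 0.4))  # 0.0
```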

Examples:

## 1. Bernoulli

• $𝒴 = \lbrace 0, 1 \rbrace$

• $ℙ(Y=1) = p ∈ [0, 1]$

• $p(y) = p^y (1-p)^{1-y}$

$- \log L(θ) = - \sum\limits_{ i=1 }^n \log (p^{y_i} (1-p)^{1-y_i}) \\ = - \left(\sum\limits_{ i } y_i\right) \log p - \left(\sum\limits_{ i } 1 - y_i\right) \log(1-p)$
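A small numerical check (toy data made up) that the closed-form estimate $\hat p = \frac 1 n \sum_i y_i$ indeed minimizes this negative log-likelihood:

```python
import math

def bernoulli_nll(p, ys):
    """-log L(p) = -(Σ y_i) log p - (Σ (1 - y_i)) log(1 - p)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

ys = [1, 0, 1, 1, 0]
p_hat = sum(ys) / len(ys)  # 0.6

# p̂ beats every other candidate value of p on the NLL
print(all(bernoulli_nll(p_hat, ys) < bernoulli_nll(p, ys)
          for p in [0.2, 0.3, 0.4, 0.5, 0.7, 0.8]))  # True
```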

### ML ⟺ minimizing KL with the empirical distribution $\hat{p}$

$- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i) = - n \; 𝔼_{\text{empirical distrib.} \hat{p}} \log p_θ(y) \\ = n \; \underbrace{𝔼_{\hat{p}(y)} \log \frac{\hat{p}(y)}{p_θ(y)}}_{ = KL(\hat{p}, p_θ)} - n \; 𝔼_{\hat{p}(y)} \log \hat{p}(y) \\$

where

$\hat{p}(y) = \frac 1 n \sum\limits_{ i } δ(y = y_i)$

## Bernoulli:

• $\hat{p}(y = 1) = \frac 1 n \sum\limits_{ i } \underbrace{y_i}_{= δ(y_i = 1)} = p$

• $\hat{p}(y = 0) = \frac 1 n \sum\limits_{ i } (1 - y_i)$

## Multinomial:

• $𝒴 = \lbrace 1, ⋯, k \rbrace$

• $\hat{p}(y) = \text{frequency of observations = } y$

## Gaussian in one variable:

• $p_θ(y) = \frac{1}{\sqrt{2π σ^2}} \exp\left(- \frac 1 2 \frac{(y - μ)^2}{σ^2}\right)$
• $- \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - μ)^2}{σ^2}\right) \\ = α + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(y_i - μ)^2}{σ^2}$

#### Minimize wrt $μ$:

Derivative wrt $μ$ = 0 ⟹ $\sum\limits_{ i } (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i$

#### Minimize wrt $λ = \frac{1}{σ^2}$:

The function is convex wrt $λ$:

So that: derivative wrt $λ$ = 0 ⟹

$-\frac{n}{2λ} + \frac 1 2 \sum\limits_{ i } (y_i - μ)^2 = 0 ⟹ λ^{-1} = σ^2 = \frac 1 n \sum\limits_{ i } (y_i - μ)^2$
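The two closed-form estimates can be checked on toy data (the values below are made up):

```python
# ML estimates for a univariate Gaussian: the sample mean and the
# (biased, 1/n) sample variance, matching the formulas above.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(ys)

mu_hat = sum(ys) / n
sigma2_hat = sum((y - mu_hat) ** 2 for y in ys) / n

print(mu_hat)      # 5.0
print(sigma2_hat)  # 4.0
```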

## Multivariate Gaussian

• $μ ∈ ℝ^d$
• $Σ ∈ ℝ^{d×d}$ positive definite
$- \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp \left(- \frac 1 2 (y_i - μ)^T Σ^{-1} (y_i - μ) \right) \\ = α + \frac n 2 \, \log \det Σ + \frac 1 2 \sum\limits_{ i } (y_i - μ)^T Σ^{-1} (y_i - μ)$

#### Minimize wrt $μ$:

Derivative wrt $μ$ = 0 ⟹ $\sum\limits_{ i } Σ^{-1} (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i$

#### Minimize wrt $Λ = Σ^{-1}$:

The function is convex wrt $Λ$:

$-\frac{n}{2} \log \det Λ + \frac 1 2 \sum\limits_{ i } \underbrace{\underbrace{(y_i - μ)^T}_{1 × d} \underbrace{Λ}_{d × d} \underbrace{(y_i - μ)}_{d × 1}}_{= \mathrm{Tr}(Λ (y_i - μ)(y_i - μ)^T)}$

So that: derivative wrt $Λ$ = 0 ⟹

$- \frac n 2 Λ^{-1} + \frac 1 2 \sum\limits_{ i } (y_i - μ)(y_i - μ)^T = 0\\ ⟹ Λ^{-1} = \frac 1 n \sum\limits_{ i } (y_i - μ)(y_i - μ)^T$
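The same check in the multivariate case, using plain lists on a small made-up 2-D sample:

```python
# ML estimates for a multivariate Gaussian: the sample mean vector and
# the 1/n outer-product covariance Λ^{-1} = (1/n) Σ_i (y_i - μ)(y_i - μ)^T.
ys = [(1.0, 2.0), (3.0, 2.0), (1.0, 4.0), (3.0, 4.0)]
n, d = len(ys), len(ys[0])

mu = [sum(y[j] for y in ys) / n for j in range(d)]
cov = [[sum((y[j] - mu[j]) * (y[k] - mu[k]) for y in ys) / n
        for k in range(d)] for j in range(d)]

print(mu)   # [2.0, 3.0]
print(cov)  # [[1.0, 0.0], [0.0, 1.0]]
```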

# Conditional Maximum Likelihood (ML)

• $Y ∈ 𝒴$ output
• $X ∈ 𝒳$ input
• full ML model $p_θ(x, y)$
• conditional ML $p_θ(y \mid x)$
• $p(x)$ “unspecified”
$\log p(y, x) = \log(p(x) p_θ(y \mid x)) = \log(p(x)) + \log(p_θ(y \mid x))$

Conditional ML for iid data:

$\text{minimize} \; - \sum\limits_{ i } \log p_θ(y_i \mid x_i)$

We don’t worry about the distribution of the $x_i$, as they are given (the downside is that, given $y$, we cannot say anything about $x$).

## Linear regression

• $𝒴 = ℝ$
• $𝒳 = ℝ^d$

$p(y \mid x)$: Gaussian with mean $μ(x) = w^T x + b$ and constant variance $σ^2$

### Conditional ML

$- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i)$

By writing

$\begin{pmatrix} w \\ b \\ \end{pmatrix}^T \begin{pmatrix} x \\ 1 \\ \end{pmatrix}$

we absorb the intercept $b$ into $w$ and make $μ$ linear in the augmented input. So in the following, we will assume $μ(x) = w^T x$ is linear.

$- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i) \\ = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - w^T x_i)^2}{σ^2}\right) \\ = α + \frac n 2 \, \log(σ^2) + \frac 1 2 \sum\limits_{ i }\frac{(y_i - w^T x_i)^2}{σ^2}$
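Since $σ^2$ is constant, minimizing this in $w$ is ordinary least squares; a 1-D sketch with an intercept (data made up to lie exactly on a line):

```python
# Conditional ML for the Gaussian linear model reduces to least squares:
# minimizing Σ (y_i - w x_i - b)^2 gives the usual 1-D closed form.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - w * x_bar

print(w, b)  # 2.0 1.0
```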

## Logistic regression

• $𝒴 = \lbrace 0, 1 \rbrace$
• $𝒳 = ℝ^d$
$p(y = 1 \mid x) = σ(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}$

where $σ$ is the sigmoid function

The benefit of logistic regression is that it gives an uncertainty estimate for the outputs.
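A minimal sketch of that predicted probability (the parameters $w$, $b$ and the input $x$ are made-up values):

```python
import math

def sigmoid(t):
    """Sigmoid σ(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

w, b = [2.0, -1.0], 0.5  # illustrative parameters
x = [1.0, 3.0]           # illustrative input

score = sum(wi * xi for wi, xi in zip(w, x)) + b
print(sigmoid(score))  # P(y = 1 | x), a value in (0, 1)
print(sigmoid(0.0))    # 0.5 — the decision boundary w^T x + b = 0
```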

## Generative VS Discriminative

• pair $(x, y)$
• joint ML: $p_θ(x, y)$
• conditional ML

• discriminative method: $p_θ(y \mid x) ⟶ \max$
• generative method (LDA): $p(x, y) = p_θ(y) \underbrace{p_θ(x \mid y)}_{\text{ if } y \text{ is discrete, called the “class conditional density”}}$

Classification: one makes the assumption:

$p(x \mid y) \sim \text{ Gaussian: mean } μ_y \text{, cov matrix } Σ_y$

NB: if $𝒴 = \lbrace 0, 1 \rbrace$ in 2D: $μ_0$ is the mean of the points labelled “0”, and $Σ_0$ gives the spread of their distribution

But there’s a problem: if the $x$’s labelled $0$ form two distinct clusters, a single Gaussian does not fit them ($μ_0$ then lies in the middle, and the hyperplane separating the $0$-labelled $x$’s from the $1$-labelled $x$’s may cut through the $0$-labelled points).

Questions:

1. How do we get $p(y \mid x)$?
2. How do we estimate the parameters?

Bayes rule answers the first:
$p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}$
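A 1-D sketch of this generative pipeline with Gaussian class-conditional densities (all parameters below are made up; the shared variance is the LDA assumption):

```python
import math

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

# Generative classification via Bayes rule: p(y|x) ∝ p(x|y) p(y)
prior = {0: 0.5, 1: 0.5}
mu = {0: -1.0, 1: 1.0}
sigma2 = 1.0  # shared covariance across classes (LDA assumption)

def posterior(x):
    joint = {y: gauss_pdf(x, mu[y], sigma2) * prior[y] for y in (0, 1)}
    z = joint[0] + joint[1]  # p(x), the normaliser
    return {y: joint[y] / z for y in (0, 1)}

print(posterior(0.0))  # symmetric point: both classes get 0.5
print(posterior(2.0))  # well inside class 1: p(y = 1 | x) close to 1
```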
