Lecture 7: Maximum likelihood

Maximum likelihood

Inference model:

Bernoulli model:

  • Coin $ℙ(Y=1) = p∈[0,1]$ ($Y ∈ \lbrace 0, 1 \rbrace$)

    • $p>1/2 ⟹ 1$
    • $p≤1/2 ⟹ 0$
\hat{y} ∈ \arg\max_y \; ℙ(Y=y)

Other approach (goal of this course):

Estimation:

y_1, ⋯, y_n ∈ \lbrace 0, 1 \rbrace ⟹ \text{ compute } p ≝ \frac 1 n \sum\limits_{ i } y_i
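
This estimation step can be sketched in a couple of lines (a minimal illustration, with made-up data):

```python
# ML estimate of p from Bernoulli samples: the empirical frequency of ones.
y = [1, 0, 1, 1, 0, 1]      # observed coin flips (illustrative data)
p_hat = sum(y) / len(y)     # (1/n) * sum_i y_i
print(p_hat)                # 0.666...
```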

Model

$𝒴$ measurable space with reference measure: $μ$

  • Lebesgue measure if $𝒴 ⊆ ℝ^d$
  • Counting measure if $𝒴$ is finite

Collection:

  • $p_θ(\bullet): 𝒴 ⟶ ℝ$ densities wrt $μ$

    • $p_θ ≥ 0$
    • $\int p_θ dμ = 1$

Examples:

Bernoulli:

  • $𝒴 = \lbrace 0, 1 \rbrace$
  • $θ ≝ p ∈ [0,1]$

    • $p_θ(y) = ℙ(Y=y)$

Multinomial:

  • $𝒴 = \lbrace 1, ⋯, k \rbrace$
  • $θ ∈ \text{ simplex } \lbrace θ ∈ ℝ^k \mid θ ≥ 0, \sum\limits_{ i } θ_i = 1\rbrace$

    • $p_θ(y)= ℙ(Y=y) = θ_y$

Gaussian:

  • $𝒴 = ℝ$
  • $p(y) = \frac{1}{\sqrt{2π}σ} \exp\left(- \frac 1 2 \frac{(y-μ)^2}{σ^2}\right)$

    • $μ$: mean
    • $σ$: standard deviation
    • $σ^2$: variance

Multivariate:

  • $𝒴 = ℝ^d$
  • $p(y) = \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp(- \frac 1 2 (y-μ)^T Σ^{-1} (y-μ))$

    • $Σ$: covariance matrix (positive definite)

    • $μ$: mean vector in $ℝ^d$

    • $d=2$: if $Σ = σ^2 Id$: level sets are circles

Likelihood

Likelihood (french: vraisemblance):
  • of a single observation $y$: $L(θ) = p_θ(y)$

  • of an iid sample $y_1, ⋯, y_n$: $L(θ) = \prod\limits_{ i } p_θ(y_i)$

Log likelihood:
\log L(θ) = \sum\limits_{ i } \log p_θ(y_i)

NB: e.g. for a Bernoulli distribution with $p = 1/2$, the likelihood of $n$ observations is $\frac{1}{2^n}$, which falls below machine precision for large $n$. That’s why one takes the $\log$ (and also to turn products into sums).
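
A quick numerical check of this underflow (for $n = 1000$ the likelihood $2^{-1000}$ is still representable in float64, so we take $n = 2000$):

```python
import math

# The likelihood of n = 2000 fair-coin flips underflows to 0.0 in float64,
# while the log-likelihood stays perfectly representable.
n = 2000
likelihood = 0.5 ** n              # 2^{-2000}: below the smallest float64
log_likelihood = n * math.log(0.5)

print(likelihood)      # 0.0
print(log_likelihood)  # about -1386.3
```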

Maximum likelihood principle

Estimate $θ$ by maximizing $L(θ)$

Risk interpretation:
-\frac{\log L(θ)}{n} = \frac 1 n \sum\limits_{ i =1 }^n -\log p_θ (y_i)

NB: it is the average of $R_θ(y_i)$, where $R_θ(y) ≝ - \log p_θ(y)$

With “infinite” data (the “population case”): $y_1, ⋯, y_n$ sampled iid from $p_{θ_\ast}$ ($θ_\ast$ is fixed but unknown), the empirical risk above tends to $𝔼_y R_θ(y)$. So one tries to:

Minimize

\underbrace{𝔼_y R_θ(y)}_{ ≝ \; g(θ)} = - 𝔼_y \log p_θ(y)

Question: is $θ_\ast$ minimizing $g(θ)$?

g(θ) - g(θ_\ast) = 𝔼_{y \sim p_{θ_\ast}}(- \log p_θ(y)) - 𝔼_{y \sim p_{θ_\ast}}(- \log p_{θ_\ast}(y))\\ = 𝔼_{y \sim p_{θ_\ast}}\left(\log \frac{p_{θ_\ast}(y)}{p_θ(y)}\right) = \int p_{θ_\ast}(y) \log \frac{p_{θ_\ast}(y)}{p_θ(y)} \, dμ(y)
Kullback-Leibler divergence between $p_{θ_\ast} ≝ p$ and $p_θ ≝ q$:
KL(p, q) = \int p(y) \log \frac{p(y)}{q(y)} \, dμ(y)

Property: KL(p, q) ≥ 0 with equality iff $p=q$

Proof: Jensen’s inequality applied to the convex function $- \log$
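
A numerical sanity check of this property for two Bernoulli distributions (a sketch; the helper name is made up):

```python
import math

# KL(p, q) = sum_y p(y) log(p(y)/q(y)) over y in {0, 1}:
# nonnegative, and zero iff p == q.
def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_bernoulli(0.3, 0.7))   # > 0
print(kl_bernoulli(0.3, 0.3))   # 0.0
```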

Examples:

1. Bernoulli

  • $𝒴 = \lbrace 0, 1 \rbrace$

  • $ℙ(Y=1) = p ∈ [0, 1]$

  • $p(y) = p^y (1-p)^{1-y}$

- \log L(θ) = - \sum\limits_{ i=1 }^n \log (p^{y_i} (1-p)^{1-y_i}) \\ = - \left(\sum\limits_{ i } y_i\right) \log p - \left(\sum\limits_{ i } 1 - y_i\right) \log(1-p)

Setting the derivative wrt $p$ to $0$ gives the ML estimate $\hat{p} = \frac 1 n \sum\limits_{ i } y_i$.
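
The Bernoulli negative log-likelihood can be checked to be minimized at the empirical frequency (a brute-force grid check, with illustrative data):

```python
import math

# Numerical check: the Bernoulli negative log-likelihood is minimized
# at p_hat = (1/n) sum_i y_i.
y = [1, 1, 0, 1, 0, 1, 1, 0]

def nll(p):
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

grid = [i / 1000 for i in range(1, 1000)]
p_best = min(grid, key=nll)
print(p_best, sum(y) / len(y))   # both 0.625
```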

ML ⟺ minimizing KL between the empirical distribution $\hat{p}$ and $p_θ$

- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i) = - n \; 𝔼_{\text{empirical distrib.} \hat{p}} \log p_θ(y) \\ = - n \; \underbrace{𝔼_{\hat{p}(y)} \log \frac{p_θ(y)}{\hat{p}(y)}}_{ = -KL(\hat{p}, p_θ)} - n \; 𝔼_{\hat{p}(y)} \log \hat{p}(y) \\

where

\hat{p}(y) = \frac 1 n \sum\limits_{ i } δ(y = y_i)

Bernoulli:

  • $\hat{p}(y = 1) = \frac 1 n \sum\limits_{ i } \underbrace{y_i}_{= δ(y_i = 1)}$, the empirical frequency of $1$s (the ML estimate of $p$)

  • $\hat{p}(y = 0) = \frac 1 n \sum\limits_{ i } (1 - y_i)$

Multinomial:

  • $𝒴 = \lbrace 1, ⋯, k \rbrace$

  • $\hat{p}(y) = \text{frequency of observations = } y$

Gaussian in one variable:

  • p_θ(y) = \frac{1}{\sqrt{2π σ^2}} \exp\left(- \frac 1 2 \frac{(y - μ)^2}{σ^2}\right)
  • - \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - μ)^2}{σ^2}\right) \\ = α + n \, \log σ + \frac 1 2 \sum\limits_{ i }\frac{(y_i - μ)^2}{σ^2}

Minimize wrt $μ$:

Derivative wrt $μ$ = 0 ⟹ \sum\limits_{ i } (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i

Minimize wrt $λ = \frac{1}{σ^2}$:

The function is convex wrt $λ$:

So that: derivative wrt $λ$ = 0 ⟹

-\frac{n}{2λ} + \frac 1 2 \sum\limits_{ i } (y_i - μ)^2 = 0 ⟹ λ^{-1} = σ^2 = \frac 1 n \sum\limits_{ i } (y_i - μ)^2
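
The two estimates above are just the sample mean and the biased ($1/n$) sample variance, which can be verified numerically (a sketch with synthetic data):

```python
import random
import statistics

# ML estimates for a 1D Gaussian: sample mean and (1/n) sample variance.
random.seed(0)
ys = [random.gauss(2.0, 1.5) for _ in range(10_000)]   # true mu=2, sigma^2=2.25

n = len(ys)
mu_hat = sum(ys) / n
sigma2_hat = sum((yi - mu_hat) ** 2 for yi in ys) / n  # 1/n, not 1/(n-1)

print(mu_hat, sigma2_hat)   # close to 2.0 and 2.25
```

Note the $1/n$ normalization: the ML variance estimate is `statistics.pvariance`, not the unbiased `statistics.variance`.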

Multivariate Gaussian

  • $μ ∈ ℝ^d$
  • $Σ ∈ ℝ^{d×d}$ positive definite
- \log L(θ) = - \sum\limits_{ i=1 }^n \log \frac{1}{(2π)^{d/2}} \frac{1}{\sqrt{\det Σ}} \exp \left(- \frac 1 2 (y_i - μ)^T Σ^{-1} (y_i - μ) \right) \\ = α + \frac n 2 \, \log \det Σ + \frac 1 2 \sum\limits_{ i } (y_i - μ)^T Σ^{-1} (y_i - μ)

Minimize wrt $μ$:

Derivative wrt $μ$ = 0 ⟹ \sum\limits_{ i } Σ^{-1} (y_i - μ) = 0 ⟹ μ = \frac 1 n \sum\limits_{ i } y_i

Minimize wrt $Λ = Σ^{-1}$:

The function is convex wrt $Λ$:

-\frac{n}{2} \log \det Λ + \frac 1 2 \sum\limits_{ i } \underbrace{\underbrace{(y_i - μ)^T}_{1 × d} \underbrace{Λ}_{d × d} \underbrace{(y_i - μ)}_{d × 1}}_{= Tr(Λ (y_i - μ)(y_i - μ)^T)}

So that: derivative wrt $Λ$ = 0 ⟹

- \frac n 2 Λ^{-1} + \frac 1 2 \sum\limits_{ i } (y_i - μ)(y_i - μ)^T = 0\\ ⟹ Λ^{-1} = \frac 1 n \sum\limits_{ i } (y_i - μ)(y_i - μ)^T
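
These closed-form estimates (sample mean, $1/n$ sample covariance) can be checked against synthetic data (a sketch; parameters are illustrative):

```python
import numpy as np

# ML estimates for a multivariate Gaussian:
#   mu_hat    = (1/n) sum_i y_i
#   Sigma_hat = (1/n) sum_i (y_i - mu_hat)(y_i - mu_hat)^T
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Y = rng.multivariate_normal(true_mu, true_Sigma, size=50_000)   # n x d

n = len(Y)
mu_hat = Y.mean(axis=0)
centered = Y - mu_hat
Sigma_hat = centered.T @ centered / n   # same as np.cov(Y.T, bias=True)

print(mu_hat)      # close to true_mu
print(Sigma_hat)   # close to true_Sigma
```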

Conditional Maximum Likelihood (ML)

  • $Y ∈ 𝒴$ output
  • $X ∈ 𝒳$ input
  • full ML model $p_θ(x, y)$
  • conditional ML $p_θ(y \mid x)$
  • $p(x)$ “unspecified”
\log p(y, x) = \log(p(x) p_θ(y \mid x)) = \log(p(x)) + \log(p_θ(y \mid x))

Conditional ML for iid data:

minimize \; - \sum\limits_{ i } \log p_θ(y_i \mid x_i)

We don’t worry about the distribution of $x$, since the $x_i$ are given (the downside is that the model says nothing about $x$: given $y$, we cannot infer anything about $x$).

Linear regression

  • $𝒴 = ℝ$
  • $𝒳 = ℝ^d$

$p(y \mid x)$: Gaussian with mean $μ(x) = w^T x + b$ and constant variance $σ^2$

Conditional ML

- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i)

By writing

\begin{pmatrix} w \\ b \\ \end{pmatrix}^T \begin{pmatrix} x \\ 1 \\ \end{pmatrix}

we make $μ$ linear. So in the following, we will assume $μ$ is linear.

- \log L(θ) = - \sum\limits_{ i } \log p_θ(y_i \mid x_i) \\ = - \sum\limits_{ i=1 }^n \log \frac{1}{\sqrt{2π σ^2}} \exp \left(- \frac 1 2 \frac{(y_i - w^T x_i)^2}{σ^2}\right) \\ = α + \frac n 2 \, \log(σ^2) + \frac 1 2 \sum\limits_{ i }\frac{(y_i - w^T x_i)^2}{σ^2}
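
For fixed $σ^2$, minimizing this in $w$ is exactly ordinary least squares, and the ML estimate of $σ^2$ is the mean squared residual (a sketch on synthetic data, using NumPy’s least-squares solver):

```python
import numpy as np

# Conditional ML for linear regression: least squares in w,
# then sigma2_hat = mean squared residual.
rng = np.random.default_rng(1)
n, d = 2_000, 3
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.3, size=n)   # noise variance 0.09

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ w_hat) ** 2)       # ML noise-variance estimate

print(w_hat)        # close to w_true
print(sigma2_hat)   # close to 0.09
```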

Logistic regression

  • $𝒴 = \lbrace 0, 1 \rbrace$
  • $𝒳 = ℝ^d$
p(y = 1 \mid x) = σ(w^T x + b) = \frac{1}{1 + \exp(-(w^T x + b))}

where $σ$ is the sigmoid function

A benefit of logistic regression is that it gives an uncertainty estimate on the predicted labels.
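
Conditional ML for logistic regression has no closed form; a standard approach (not fixed by the lecture) is gradient descent on the negative log-likelihood, sketched here on synthetic data:

```python
import numpy as np

# Fit logistic regression by gradient descent on the average NLL.
rng = np.random.default_rng(2)
n, d = 5_000, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0])
p = 1 / (1 + np.exp(-(X @ w_true)))        # sigma(w^T x_i)
y = (rng.random(n) < p).astype(float)      # sample labels

w = np.zeros(d)
for _ in range(2_000):
    q = 1 / (1 + np.exp(-(X @ w)))
    grad = X.T @ (q - y) / n               # gradient of the average NLL
    w -= 0.5 * grad

print(w)   # close to w_true
```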

Generative VS Discriminative

  • pair $(x, y)$
  • joint ML: $p_θ(x, y)$
  • conditional ML

    • discriminative method: $p_θ(y \mid x) ⟶ \max$
    • generative method (LDA): $p(x, y) = p_θ(y) \underbrace{p_θ(x \mid y)}_{\text{ if } y \text{ is discrete, called the “class conditional density”}}$

Classification: one makes the assumption:

p(x \mid y) \sim \text{ Gaussian: mean } μ_y \text{, cov matrix } Σ_y

NB: if $𝒴 = \lbrace 0, 1 \rbrace$ in 2D: $μ_0$ is the mean of the points labelled “0”, and $Σ_0$ describes the spread of their distribution

But there is a pitfall: if the $x$’s labelled $0$ form two distinct clusters, a single Gaussian fits them poorly ($μ_0$ then falls between the clusters, and the separating surface between the $0$- and $1$-labelled points may cut through the $0$-labelled points).

Questions:

  1. How do we get $p(y \mid x)$?
  2. estimating parameters
Bayes rule:
p(y \mid x) = \frac{p(x \mid y) p(y)}{p(x)}
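Putting the generative pieces together, Bayes’ rule turns class-conditional Gaussian densities and class priors into $p(y \mid x)$ (a sketch; the means, covariances, and priors below are illustrative):

```python
import numpy as np

# Generative classification via Bayes' rule with Gaussian class-conditionals.
def gauss_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
Sigma = {0: np.eye(2), 1: np.eye(2)}
prior = {0: 0.5, 1: 0.5}

x = np.array([2.5, 2.5])
joint = {c: prior[c] * gauss_pdf(x, mu[c], Sigma[c]) for c in (0, 1)}
evidence = sum(joint.values())                       # p(x)
posterior = {c: joint[c] / evidence for c in (0, 1)} # p(y | x)
print(posterior)   # class 1 dominates for x near (3, 3)
```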
