Lecture 3: $k$-nearest neighbors

kNN: k-nearest neighbors

cf. the book by Devroye, Györfi and Lugosi, "A Probabilistic Theory of Pattern Recognition"

Predictor of 1-NN:
f_n(x) ≝ Y_{argmin_{1 ≤ i ≤ n} \Vert x - X_i \Vert}
Risk:
ℛ(f) ≝ 𝔼_{X,Y}(1_{f(X) ≠ Y})
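
To make the two definitions above concrete, here is a minimal sketch of the 1-NN predictor and of an empirical estimate of the 0-1 risk. The function names `predict_1nn` and `empirical_risk`, the use of NumPy and the toy data are assumptions made for illustration; they are not part of the lecture.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """Return, for each query point, the label of its nearest training point."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argmin(dists, axis=1)  # index of the nearest X_i for each query
    return y_train[nearest]

def empirical_risk(y_pred, y_true):
    """Fraction of misclassified points: a sample estimate of the 0-1 risk."""
    return float(np.mean(y_pred != y_true))

# Toy data (hypothetical): three training points on the real line.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0, 1, 1])
X_test = np.array([[0.2], [0.9], [1.8]])
y_test = np.array([0, 1, 0])

y_hat = predict_1nn(X_train, y_train, X_test)
risk = empirical_risk(y_hat, y_test)
print(y_hat, risk)
```

On this toy data, the third test point is closest to a training point labeled 1 while its own label is 0, so one prediction out of three is wrong.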

Bayes predictor: the optimal one

ℛ^\ast = \inf_f ℛ(f) = ℛ(f^\ast)

where f^\ast ≝ \begin{cases} 1 \text{ if } P(Y=1 \mid X) ≥ 1/2\\ 0 \text{ else} \end{cases}

A predictor $f_n$ (built from the sample $D_n$) is consistent if:

\lim_{n ⟶ ∞} 𝔼_{D_n}(ℛ(f_n)) = ℛ^\ast

What is $𝔼_{D_n}(ℛ(f_n))$?

Intuition:

  • when $n⟶∞$, \Vert \underbrace{X_n^{(1)}}_{ ≝ argmin_{X_i ∈ D_n} \Vert X_i - X \Vert} - X \Vert ⟶_{n⟶∞} 0

Let

η(x) ≝ P(Y=1 \mid X=x)
η(X) ≃ η(X_n^{(1)})
𝔼(ℛ(f_n)) = 𝔼 \Big(P(Y ≠ Y_n^{(1)} \mid X, X_n^{(1)}) \Big) = 𝔼 \Big( η(X)(1-η(X_n^{(1)})) + (1-η(X))η(X_n^{(1)}) \Big) \\ ≃ 𝔼 \Big( η(X)(1-η(X)) + (1-η(X))η(X) \Big) = 2 𝔼(η(X)(1-η(X)))

𝔼(R(f)) = 𝔼(1_{f(X)≠Y}) = 𝔼(𝔼(1_{f(X)≠ Y} \mid X)) = 𝔼(P(f(X)≠ Y \mid X))\\ = 𝔼(P(f(X) = 0 \mid X) η(X) + P(f(X) = 1 \mid X)(1 - η(X))) \\ = 𝔼((1 - P(f(X) = 1 \mid X)) η(X) + P(f(X) = 1 \mid X)(1 - η(X))) \\ = 𝔼(P(f(X) = 1 \mid X)(1 - 2 η(X)) + η(X)) ≥ 𝔼(\min(η(X), 1 - η(X))) = R^\ast
\lim_{n ⟶ ∞} 𝔼(R(f_n)) ≝ R_{NN}
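
As a numerical illustration of the gap between the Bayes risk $𝔼(\min(η(X), 1 - η(X)))$ and the heuristic 1-NN limit $2𝔼(η(X)(1-η(X)))$, the sketch below evaluates both on one example. The choices $X$ uniform on $[0, 1]$ and $η(x) = x$ are assumptions picked for illustration, as is the helper name `bayes_and_nn_risk`.

```python
import numpy as np

def bayes_and_nn_risk(eta, n_grid=100_000):
    """Approximate E[min(eta, 1 - eta)] and 2 E[eta (1 - eta)] for X uniform on [0, 1]."""
    # Midpoint rule: average the integrand over the cell midpoints of a regular grid.
    x = (np.arange(n_grid) + 0.5) / n_grid
    e = eta(x)
    bayes = float(np.mean(np.minimum(e, 1.0 - e)))
    nn = float(np.mean(2.0 * e * (1.0 - e)))
    return bayes, nn

# Assumed example: eta(x) = x, for which the two integrals are 1/4 and 1/3.
r_star, r_nn = bayes_and_nn_risk(lambda x: x)
print(r_star, r_nn)
```

In this example the 1-NN limit exceeds the Bayes risk, in line with the inequality derived above.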

$k$-nearest neighbors: the more neighbors you take, the smaller the asymptotic risk:

\lim_{n ⟶ ∞} 𝔼(ℛ(f_n^{kNN})) ≤ ℛ^\ast + \frac{1}{\sqrt{ke}}

$k$ should be chosen as a function of $n$:

  • $k(n) ⟶_{n⟶∞} ∞$
  • $\frac{k(n)}{n} ⟶_{n⟶∞} 0$
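
One schedule satisfying both conditions is $k(n)$ of order $\sqrt{n}$. The sketch below uses that rule and forces $k$ to be odd so that binary votes cannot tie. Both the rule and the name `choose_k` are illustrative assumptions; the lecture does not prescribe a particular choice.

```python
import math

def choose_k(n):
    """Hypothetical rule: k(n) close to sqrt(n), made odd to avoid tied votes."""
    k = max(1, math.isqrt(n))
    return k if k % 2 == 1 else k + 1

# k grows without bound while k / n shrinks towards 0.
for n in (100, 10_000, 1_000_000):
    k = choose_k(n)
    print(n, k, k / n)
```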

Bayes predictor for labels $Y ∈ \lbrace 1, \ldots, n \rbrace$: predict the most probable class.

predicted class: i^\ast ≝ argmax_{1 ≤ i ≤n} P(Y=i \mid X=x)

Binary: $Y ∈ \lbrace 1, 2 \rbrace$ (relabeled $\lbrace 1, 0 \rbrace$ below)

i^\ast ≝ argmax_{1 ≤ i ≤ 2} P(Y=i \mid X=x)

As $P(Y=1 \mid X=x)+ P(Y=2 \mid X=x) =1$, this reduces to:

i^\ast = \begin{cases} 1 \text{ if } P(Y=1 \mid X=x) > 1/2 \\ 2 \text{ else}\end{cases}
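
The Bayes rule for several classes amounts to picking the most probable class given $X = x$. A minimal sketch, assuming the posterior probabilities are already known and passed as a list (the helper `bayes_class` is hypothetical):

```python
import numpy as np

def bayes_class(posteriors):
    """posteriors[i - 1] = P(Y = i | X = x); returns the 1-based most probable class."""
    posteriors = np.asarray(posteriors, dtype=float)
    # np.argmax is 0-based and returns the first maximizer in case of ties.
    return int(np.argmax(posteriors)) + 1

multi = bayes_class([0.2, 0.5, 0.3])   # three classes
binary = bayes_class([0.7, 0.3])       # two classes: same as thresholding at 1/2
print(multi, binary)
```

With two classes, the rule returns class 1 when $P(Y=1 \mid X=x) > 1/2$. Ties at exactly $1/2$ can be broken either way without changing the risk.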

Bayes predictor for binary classifier:

f^\ast(x) = 1_{η(x) ≥ 1/2} = \begin{cases} 1 \text{ if } P(Y=1 \mid X=x) ≥ 1/2\\ 0 \text{ else} \end{cases}

Here: $η(x) = α > 1/2 ⟹ f^\ast = 1$, so:

Bayes risk:
ℛ(f^\ast) = 𝔼(1_{f^\ast(X) ≠ Y}) = P(Y ≠ 1) = P(Y ≠ 1 \mid X = x) = 1 - α

because $η(x) = α$ doesn’t depend on $x$, so $P(Y=1) = P(Y=1 \mid X=x) = α$


Expected risk of binary classifier:
ℛ(f) = 𝔼_{X, Y}(l(f(X), Y)) = 𝔼_{X, Y}(1_{f(X)≠Y}) = P_{X, Y}(f(X)≠Y)\\ = P_{X, Y}(f(X)=1, Y = 0) + P_{X, Y}(f(X)=0, Y = 1) \\ = P_{X, Y}(Y = 0) P(f(X) = 1) + P_{X, Y}(Y = 1)P_{X, Y}(f(X) = 0)

because $Y$ is independent of $X$ (and thus of $f(X)$)

so that:

ℛ(f) = P(f(X) = 1) (1 - α)+ (1 - P(f(X) = 1)) α\\ = α - (2α -1) \underbrace{P(f(X) = 1)}_{𝔼(f(X))}
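
The closed form above can be checked numerically. The sketch below computes the risk both ways for a classifier that predicts 1 with probability `p_one`, assuming $Y \sim B(α)$ independent of $X$ as in this exercise. The helper name `risk_constant_eta` and the value $α = 0.7$ are illustrative.

```python
def risk_constant_eta(alpha, p_one):
    """Risk of a classifier with P(f(X) = 1) = p_one when Y ~ B(alpha) is independent of X."""
    direct = p_one * (1 - alpha) + (1 - p_one) * alpha   # two-term decomposition
    closed = alpha - (2 * alpha - 1) * p_one             # simplified form
    assert abs(direct - closed) < 1e-12
    return closed

always_one = risk_constant_eta(0.7, 1.0)   # f = 1 everywhere: risk 1 - alpha
like_1nn = risk_constant_eta(0.7, 0.7)     # P(f(X) = 1) = alpha: risk 2 alpha (1 - alpha)
print(always_one, like_1nn)
```

Since $2α - 1 > 0$, the risk decreases as $P(f(X) = 1)$ grows, and it is minimal for the constant predictor $f = 1$.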

$Y_i$ is independent of $(X_1, ⋯, X_n)$:

$(X_i, Y_i)$ are iid ⟹ $(X_i, Y_i)$ and $(X_j, Y_j)$ are independent if $i ≠ j$

But $Y_i$ is also independent of $X_i$ ⟶ we get the expected result


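\hat{f}^1(x) = \sum\limits_{ i =1 }^n Y_i B_i(x)

where $B_i(x) ≝ 1$ if $X_i$ is the nearest neighbor of $x$ (and $0$ otherwise), and $\sum\limits_{ i =1 }^n B_i(x) = 1$ because $x$ almost surely has exactly one nearest neighbor (because $X$ has a density w.r.t. the Lebesgue measure).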


Independence:

𝔼(Y_i \mid X_1, ⋯, X_n) = 𝔼(Y_i) = α

So

\begin{align*} 𝔼(\hat{f}^1(X) \mid X_1, ⋯, X_n) &= 𝔼(\sum\limits_{ i =1 }^n B_i(X) Y_i \mid X_1, ⋯, X_n) \\ &= \sum\limits_{ i =1 }^n 𝔼( B_i(X) Y_i \mid X_1, ⋯, X_n) \\ &= \sum\limits_{ i =1 }^n 𝔼( B_i(X) \mid X_1, ⋯, X_n) \underbrace{𝔼( Y_i \mid X_1, ⋯, X_n)}_{ = α} &&\text{independence given } X_1, ⋯, X_n\\ &= α \underbrace{𝔼( \sum\limits_{ i =1 }^n B_i(X) \mid X_1, ⋯, X_n)}_{ = 1} \\ &= α \end{align*}

Non-consistency:

𝔼_{D_n}(ℛ(\hat{f}_n)) \not⟶_{n ⟶ ∞} ℛ^\ast

By question 3:

𝔼_{D_n}(ℛ(\hat{f}^1)) = 2α(1-α) > 1-α \text{ if } 1/2 < α < 1

So $𝔼_{D_n}(ℛ(\hat{f}^1)) > ℛ^\ast$ for every $n$: not consistent
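
A small simulation can illustrate this non-consistency. The sketch below averages the 1-NN test error over several training sets when the labels are drawn as $B(α)$ independently of $X$. The uniform distribution of $X$ on $[0, 1]$, the sample sizes, the seed and the name `expected_1nn_risk` are all assumptions chosen for illustration.

```python
import numpy as np

def expected_1nn_risk(alpha, n_train=100, n_test=2_000, n_rep=50, seed=0):
    """Monte Carlo estimate of E_{D_n}[R(f_n)] for 1-NN with Y ~ B(alpha) independent of X."""
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(n_rep):
        X_train = rng.uniform(size=n_train)
        y_train = rng.random(n_train) < alpha   # labels ignore X entirely
        X_test = rng.uniform(size=n_test)
        y_test = rng.random(n_test) < alpha
        # Nearest training point of each test point (1-D, so |x - x_i| is the distance).
        nearest = np.argmin(np.abs(X_test[:, None] - X_train[None, :]), axis=1)
        risks.append(np.mean(y_train[nearest] != y_test))
    return float(np.mean(risks))

alpha = 0.7
estimate = expected_1nn_risk(alpha)
nn_value = 2 * alpha * (1 - alpha)   # value derived above
bayes = 1 - alpha                    # Bayes risk
print(estimate, nn_value, bayes)
```

The estimate should land near $2α(1-α)$ rather than near $1 - α$, and increasing `n_train` should not close the gap.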


  • $V_k(x)$: neighborhood of $x$, that is: \lbrace i_1, ⋯, i_k \text{ s.t. } x_{i_1}, ⋯, x_{i_k} \text{ are the closest to } x\rbrace

Decision rule in $k$-NN is a “vote”:

\hat{f}^k(x) ≝ 1_{\hat{η}^k(x) > 1/2}

where

\hat{η}^k(x) ≝ \frac 1 k \sum\limits_{ i ∈ V_k(x)} y_i
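
The vote can be sketched as follows: find the indices $V_k(x)$ of the $k$ closest training points, average their labels to get $\hat{η}^k(x)$, and threshold at $1/2$. The function name `knn_vote`, the Euclidean distance and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_vote(X_train, y_train, x, k):
    """Return (predicted label, eta_hat) for a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]            # indices forming V_k(x)
    eta_hat = float(np.mean(y_train[neighbors])) # fraction of neighbors labeled 1
    return int(eta_hat > 0.5), eta_hat

# Toy data (hypothetical): five training points on the real line.
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([1, 0, 1, 0, 0])
label, eta_hat = knn_vote(X_train, y_train, np.array([0.05]), k=3)
print(label, eta_hat)
```

Here the three nearest neighbors carry labels 1, 0 and 1, so two votes out of three go to class 1. Taking $k$ odd avoids ties in the binary case.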

𝔼_{D_n} (ℛ(\hat{f}^k)) = α - (2α -1) 𝔼_{D_n} \underbrace{ 𝔼_X(\hat{f}^k(X) \mid D_n)}_{< 1}
𝔼_X(\hat{f}^k(X) \mid D_n) = 𝔼(1_{\hat{η}^k(X) > 1/2} \mid D_n) \\ = P(\hat{η}^k(X) > 1/2 \mid D_n) \\ = P(\frac 1 k \sum\limits_{ i ∈ V_k(X) } y_i > 1/2 \mid D_n) \\ = P(\sum\limits_{ i ∈ V_k(X) } y_i > k/2 \mid D_n)

As the $Y_i \sim B(α)$ are i.i.d. and independent of the $X_i$: $\sum\limits_{ i ∈ V_k(X) } y_i \sim B(α, k)$ (binomial with $k$ trials and success probability $α$)

𝔼(\hat{f}^k(X) \mid D_n) = P(Z > k/2) = β

where $Z \sim B(α, k)$

So $𝔼(ℛ(\hat{f}^k)) = α - (2α -1) β > 1-α = ℛ(f^\ast)$, since $β < 1$ (for $1/2 < α < 1$)

Conclusion: for a fixed $k$, not consistent
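
The quantity $β = P(Z > k/2)$ can be computed exactly from the binomial distribution, which gives the expected risk $α - (2α - 1)β$ for any fixed $k$. A minimal sketch, with the helper names `beta` and `expected_knn_risk` and the value $α = 0.7$ chosen for illustration:

```python
from math import comb

def beta(alpha, k):
    """P(Z > k/2) for Z binomial with k trials and success probability alpha."""
    return sum(comb(k, j) * alpha**j * (1 - alpha)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def expected_knn_risk(alpha, k):
    """Expected k-NN risk alpha - (2 alpha - 1) beta when Y ~ B(alpha) is independent of X."""
    return alpha - (2 * alpha - 1) * beta(alpha, k)

alpha = 0.7
for k in (1, 3, 5, 11, 51):
    print(k, expected_knn_risk(alpha, k), 1 - alpha)
```

For every fixed $k$ the risk stays strictly above $1 - α$, but it approaches $1 - α$ as $k$ grows, which fits the requirement that $k(n) ⟶ ∞$ for consistency.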
