Lecture 3: $k$-nearest neighbors
kNN: k-nearest neighbors
cf. the book by Devroye, Györfi and Lugosi, *A Probabilistic Theory of Pattern Recognition*
- Predictor of 1-NN:
- \[f_n(x) ≝ Y_{argmin_{1 ≤ i ≤ n} \Vert x - X_i \Vert}\]
- Risk:
- \[ℛ(f) ≝ 𝔼_{X,Y}(1_{f(X) ≠ Y})\]
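To make these definitions concrete, here is a minimal Python sketch of the 1-NN rule and a Monte-Carlo estimate of the risk (numpy assumed; `X_train` is an $(n, d)$ array, and the function names are mine, not from the lecture):

```python
import numpy as np

def predict_1nn(x, X_train, Y_train):
    """1-NN rule: return the label of the training point nearest to x."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return Y_train[i]

def empirical_risk(predict, X_test, Y_test):
    """Monte-Carlo estimate of R(f) = P(f(X) != Y) on held-out pairs."""
    predictions = np.array([predict(x) for x in X_test])
    return np.mean(predictions != Y_test)
```

For instance, `empirical_risk(lambda x: predict_1nn(x, X, Y), X_test, Y_test)` estimates $ℛ(f_n)$ for a given training sample.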
Bayes predictor: the optimal one
\[ℛ^\ast = \inf_f ℛ(f) = ℛ(f^\ast)\]where \(f^\ast ≝ \begin{cases} 1 \text{ if } P(Y=1 \mid X) ≥ 1/2\\ 0 \text{ else} \end{cases}\)
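A minimal sketch of the Bayes rule in code (it assumes $η$ is known, which is never the case in practice, and uses the expression $ℛ^\ast = 𝔼(\min(η(X), 1 - η(X)))$ derived below):

```python
import numpy as np

def bayes_predict(eta_values):
    """f*(x) = 1 iff eta(x) >= 1/2, vectorized over precomputed values of eta."""
    return (np.asarray(eta_values) >= 0.5).astype(int)

def bayes_risk(eta_values):
    """Monte-Carlo estimate of R* = E[min(eta(X), 1 - eta(X))] from samples of X."""
    eta_values = np.asarray(eta_values)
    return np.mean(np.minimum(eta_values, 1.0 - eta_values))
```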
- A sequence of predictors $(f_n)_{n ≥ 1}$ is consistent if \(\lim_{n ⟶ ∞} 𝔼_{D_n}(ℛ(f_n)) = ℛ^\ast\)
What is $𝔼_{D_n}(ℛ(f_n))$?
Intuition:
- when $n⟶∞$, \(\Vert \underbrace{X_n^{(1)}}_{ ≝ argmin_{X_i ∈ D_n} \Vert X_i - X \Vert} - X \Vert ⟶_{n⟶∞} 0\)
Let
\[η(x) ≝ P(Y=1 \mid X=x)\]
For large $n$, $η(X_n^{(1)}) ≃ η(X)$, hence
\[ℛ(f_n) = 𝔼 \Big(P(Y ≠ Y_n^{(1)} \mid X, X_n^{(1)}) \Big) ≃ 𝔼 \Big( η(X)(1-η(X)) + (1-η(X))η(X) \Big) = 2 \, 𝔼\big(η(X)(1-η(X))\big)\]
Compare with the risk of an arbitrary predictor $f$:
\[ℛ(f) = 𝔼(1_{f(X)≠ Y}) = 𝔼\big(𝔼(1_{f(X)≠ Y} \mid X)\big) = 𝔼\big(P(f(X)≠ Y \mid X)\big)\\ = 𝔼\big(P(f(X) = 0 \mid X)\, η(X) + P(f(X) = 1 \mid X)(1 - η(X))\big) \\ = 𝔼\big((1 - P(f(X) = 1 \mid X))\, η(X) + P(f(X) = 1 \mid X)(1 - η(X))\big) \\ = 𝔼\big(P(f(X) = 1 \mid X)(1 - 2 η(X)) + η(X)\big) ≥ 𝔼\big(\min(η(X), 1 - η(X))\big) = ℛ^\ast\]
(the infimum over $P(f(X)=1 \mid X) ∈ [0,1]$ is attained at $1_{η(X) ≥ 1/2}$, i.e. at the Bayes predictor). The asymptotic 1-NN risk is denoted
\[ℛ_{NN} ≝ \lim_{n ⟶ ∞} 𝔼(ℛ(f_n)) = 2 \, 𝔼\big(η(X)(1-η(X))\big)\]
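The limit $ℛ_{NN} = 2𝔼(η(X)(1-η(X)))$ can be checked by simulation. Below is a small sketch (my own illustration, not from the lecture) with the arbitrary choice $η(x) = x$ and $X \sim 𝒰([0,1])$, for which $ℛ^\ast = 1/4$ and $ℛ_{NN} = 1/3$:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: x                      # illustrative choice of eta on [0, 1]

def sample(m):
    X = rng.uniform(0, 1, m)
    Y = (rng.uniform(0, 1, m) < eta(X)).astype(int)
    return X, Y

def risk_1nn(n, n_test=1000):
    X, Y = sample(n)                   # training sample D_n
    X_test, Y_test = sample(n_test)    # fresh pairs to estimate R(f_n)
    nn = np.abs(X_test[:, None] - X[None, :]).argmin(axis=1)  # brute-force 1-NN in 1D
    return np.mean(Y[nn] != Y_test)

for n in (100, 1000, 10000):
    print(n, risk_1nn(n))              # approaches R_NN = 1/3, stays above R* = 1/4
```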
$k$-nearest neighbors: the larger $k$ is, the closer the asymptotic risk guarantee gets to the Bayes risk:
\[\lim_{n ⟶ ∞} 𝔼(ℛ(f_n^{kNN})) ≤ ℛ^\ast + \frac{1}{\sqrt{ke}}\]For consistency, $k$ should be chosen as a function of $n$ such that (a sketch of one such schedule follows this list):
- $k(n) ⟶_{n ⟶ ∞} ∞$
- $\frac{k(n)}{n} ⟶_{n ⟶ ∞} 0$
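For instance, $k(n) = \lceil \sqrt{n} \rceil$ satisfies both conditions. A minimal sketch of the corresponding majority-vote rule (names are mine; `X_train` is an $(n, d)$ numpy array):

```python
import numpy as np

def k_of_n(n):
    """One schedule with k(n) -> infinity and k(n)/n -> 0."""
    return int(np.ceil(np.sqrt(n)))

def predict_knn(x, X_train, Y_train, k):
    """Majority vote over the k training points nearest to x."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return int(Y_train[nearest].mean() > 0.5)
```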
- Bayes predictor for labels $Y ∈ \lbrace 1, \ldots, M \rbrace$: the predicted class is \(i^\ast ≝ argmax_{1 ≤ i ≤ M} P(Y=i \mid X=x)\)
Binary case ($M = 2$): $Y ∈ \lbrace 1, 2 \rbrace$
\[i^\ast ≝ argmax_{1 ≤ i ≤ 2} P(Y=i \mid X=x)\]So, as $P(Y=1 \mid X=x) + P(Y=2 \mid X=x) = 1$:
\[i^\ast = \begin{cases} 1 \text{ if } P(Y=1 \mid X=x) > 1/2 \\ 2 \text{ else}\end{cases}\]Relabeling so that $Y ∈ \lbrace 0, 1 \rbrace$, the Bayes predictor of a binary classifier is:
\[f^\ast(x) = 1_{η(x) ≥ 1/2} = \begin{cases} 1 \text{ if } P(Y=1 \mid X=x) ≥ 1/2\\ 0 \text{ else} \end{cases}\]Here, in the exercise setting, $η(x) = α > 1/2$ for all $x$ ($Y$ is independent of $X$), so $f^\ast ≡ 1$ and:
- Bayes risk:
- \[ℛ(f^\ast) = 𝔼(1_{f^\ast(X) ≠ Y}) = P(Y ≠ 1) = 1 - α\]
because $η(x) = α$ doesn’t depend on $x$, so that $P(Y=1) = P(Y=1 \mid X=x) = α$
- Expected risk of binary classifier:
- \[ℛ(f) = 𝔼_{X, Y}(l(f(X), Y)) = 𝔼_{X, Y}(1_{f(X)≠Y}) = P_{X, Y}(f(X)≠Y)\\ = P_{X, Y}(f(X)=1, Y = 0) + P_{X, Y}(f(X)=0, Y = 1) \\ = P_{X, Y}(Y = 0) P(f(X) = 1) + P_{X, Y}(Y = 1)P_{X, Y}(f(X) = 0)\]
because $Y$ is independent of $X$ (and thus of $f(X)$)
so that:
\[ℛ(f) = P(f(X) = 1) (1 - α)+ (1 - P(f(X) = 1)) α\\ = α - (2α -1) \underbrace{P(f(X) = 1)}_{𝔼(f(X))}\]Claim: each $Y_i$ is independent of $(X_1, ⋯, X_n)$.
Indeed, the $(X_i, Y_i)$ are iid ⟹ $(X_i, Y_i)$ and $(X_j, Y_j)$ are independent if $i ≠ j$,
and $Y_i$ is independent of $X_i$ in this setting ⟶ we get the claimed independence.
For the 1-NN predictor, write:
\[\hat{f}^1(x) = \sum\limits_{ i = 1 }^n Y_i B_i(x)\]
where $B_i(x) ≝ 1_{X_i \text{ is the nearest neighbor of } x}$, and $\sum\limits_{ i = 1 }^n B_i(x) = 1$ because $x$ has exactly one nearest neighbor (ties occur with probability zero, since $X$ has a density w.r.t. the Lebesgue measure).
Independence:
\[𝔼(Y_i \mid X_1, ⋯, X_n) = 𝔼(Y_i) = α\]So
\[\begin{align*} 𝔼(\hat{f}^1(X) \mid X_1, ⋯, X_n) &= 𝔼\Big(\sum\limits_{ i = 1 }^n B_i(X) Y_i \mid X_1, ⋯, X_n\Big) \\ &= \sum\limits_{ i = 1 }^n 𝔼( B_i(X) Y_i \mid X_1, ⋯, X_n) \\ &= \sum\limits_{ i = 1 }^n 𝔼( B_i(X) \mid X_1, ⋯, X_n) \underbrace{𝔼( Y_i \mid X_1, ⋯, X_n)}_{ = α} &&\text{independence given } X_1, ⋯, X_n\\ &= α \underbrace{𝔼\Big( \sum\limits_{ i = 1 }^n B_i(X) \mid X_1, ⋯, X_n\Big)}_{ = 1} = α \end{align*}\]Non-consistency:
\[𝔼_{D_n}(ℛ(\hat{f}_n)) \not⟶_{n ⟶ ∞} ℛ^\ast\]By question 3:
\[𝔼_{D_n}(ℛ(\hat{f}^1)) = 2α(1-α) > 1-α \text{ if } α > 1/2\]So $\lim_{n} 𝔼_{D_n}(ℛ(\hat{f}^1)) > ℛ^\ast$: the 1-NN predictor is not consistent here (see the simulation sketch below).
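A quick simulation of this setting (the value of $α$ and the sample sizes below are arbitrary choices of mine) makes the gap visible:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n, n_test = 0.7, 500, 5000

# Y is independent of X, with P(Y = 1) = alpha
X = rng.uniform(0, 1, n)
Y = (rng.uniform(0, 1, n) < alpha).astype(int)
X_test = rng.uniform(0, 1, n_test)
Y_test = (rng.uniform(0, 1, n_test) < alpha).astype(int)

nn = np.abs(X_test[:, None] - X[None, :]).argmin(axis=1)  # 1-NN indices
print(np.mean(Y[nn] != Y_test))   # ~ 2 * alpha * (1 - alpha) = 0.42
print(1 - alpha)                  # Bayes risk R* = 0.3
```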
- $V_k(x)$: neighborhood of $x$, that is: \(\lbrace i_1, ⋯, i_k \text{ s.t. } x_{i_1}, ⋯, x_{i_k} \text{ are the closest to } x\rbrace\)
The decision rule in $k$-NN is a “vote”:
\[\hat{f}^k(x) ≝ 1_{\hat{η}^k(x) > 1/2}\]where
\[\hat{η}^k(x) ≝ \frac 1 k \sum\limits_{ i ∈ V_k(x)} Y_i\]Then, as before:
\[𝔼_{D_n} (ℛ(\hat{f}^k)) = α - (2α -1)\, 𝔼_{D_n}\big(𝔼_X(\hat{f}^k(X) \mid D_n)\big) = α - (2α -1)\, 𝔼(\hat{f}^k(X))\]and, conditioning on $X_1, ⋯, X_n$ only (the labels being integrated out, as in the 1-NN case):
\[𝔼(\hat{f}^k(X) \mid X_1, ⋯, X_n) = 𝔼(1_{\hat{η}^k(X) > 1/2} \mid X_1, ⋯, X_n) \\ = P(\hat{η}^k(X) > 1/2 \mid X_1, ⋯, X_n) \\ = P\Big(\frac 1 k \sum\limits_{ i ∈ V_k(X) } Y_i > 1/2 \mid X_1, ⋯, X_n\Big) \\ = P\Big(\sum\limits_{ i ∈ V_k(X) } Y_i > k/2 \mid X_1, ⋯, X_n\Big)\]
As the $Y_i$ are iid $B(α)$ (Bernoulli) and independent of $(X, X_1, ⋯, X_n)$: $\sum\limits_{ i ∈ V_k(X) } Y_i \sim B(k, α)$ (binomial with $k$ trials and success probability $α$)
\[𝔼(\hat{f}^k(X)) = 𝔼\big(𝔼(\hat{f}^k(X) \mid X_1, ⋯, X_n)\big) = P(Z > k/2) ≝ β\]where $Z \sim B(k, α)$, and $β < 1$ strictly
So $𝔼_{D_n}(ℛ(\hat{f}^k)) = α - (2α -1) β > 1-α = ℛ(f^\ast)$, since $α - (2α -1)β - (1-α) = (2α - 1)(1 - β) > 0$ when $α > 1/2$.
Conclusion: for any fixed $k$, $\hat{f}^k$ is not consistent (a numeric check follows).
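A numeric check of this limit risk, using scipy for the binomial tail ($α$ and $k$ below are arbitrary choices; taking $k$ odd avoids ties in the vote):

```python
from scipy.stats import binom

alpha, k = 0.7, 5                        # alpha > 1/2, k odd
beta = 1 - binom.cdf(k // 2, k, alpha)   # beta = P(Z > k/2), Z ~ Binomial(k, alpha)
limit_risk = alpha - (2 * alpha - 1) * beta
print(beta, limit_risk, 1 - alpha)       # 0.8369..., 0.3652... > 0.3 = R*
```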