Lecture 3: Supervised and Unsupervised learning

Lecturer: Mirjana Maras

  • Input variables: $\textbf{x} ≝ (x_1, ⋯, x_N)$
  • Hidden variables: $\textbf{h} ≝ (h_1, ⋯, h_K)$
  • Output variables: $\textbf{y} ≝ (y_1, ⋯, y_L)$

  • Parameters: $w_i$
  1. Supervised learning:

    • Classification (discrete $y$)

      • recognizing labels of images
      • recognizing handwritten digits
    • Regression (continuous $y$)

      • predicting the size/weight of animals
  2. Unsupervised learning (no $y$):

    • Clustering ($h$ = different groups or types of data)

    • Density estimation ($h$ = parameters of probability distribution)

    • Dimensionality reduction ($h$ = a few latent variables describing high-dimensional data)

  3. Reinforcement learning ($y$=actions)

Regression (Supervised)

Polynomial curve fitting

  • \[y(x, \textbf{w}) ≝ \sum\limits_{ j=0 }^M w_j x^j\]
  • Target output: $t$

where $\textbf{w}$ is the vector of unknown parameters

NB: here, basis functions are the $x^i$, but it could be anything else.

More generally:

\[y(\textbf{x}, \textbf{w}) ≝ \sum\limits_{ j=0 }^{M-1} w_j ϕ_j(\textbf{x}) = \textbf{w}^T \textbf{ϕ}(\textbf{x})\]

where $\textbf{ϕ} ≝ (ϕ_0, ⋯, ϕ_{M-1})$ is the vector of basis functions.

NB: Typically:

  • $ϕ_0(\textbf{x}) = 1$ so that $w_0$ acts as a bias
  • in the polynomial case: $ϕ_j(x) ≝ x^j$

Polynomial basis functions

\[ϕ_j(x) ≝ x^j\]

This is problematic: polynomials are global functions of the input, so a slight change in $x$ leads to a change in all of the $ϕ_j(x)$; we want something more local.

Gaussian basis functions

\[ϕ_j(x) ≝ \exp\left(\frac{-(x - μ_j)^2}{σ^2}\right)\]

⟹ local ⟹ better to use

Same thing for sigmoid basis functions
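
As an illustration, here is a minimal numpy sketch of how the corresponding design matrices could be built (the function names, the grid of centres and the value of $σ$ are just illustrative choices):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are phi_j(x) = x**j for j = 0, ..., M-1 (phi_0 = 1 acts as the bias)."""
    return np.stack([x**j for j in range(M)], axis=1)

def gaussian_design_matrix(x, centers, sigma):
    """Columns are phi_j(x) = exp(-(x - mu_j)**2 / sigma**2), plus a constant bias column."""
    phi = np.exp(-(x[:, None] - centers[None, :])**2 / sigma**2)
    bias = np.ones((x.shape[0], 1))   # phi_0(x) = 1
    return np.hstack([bias, phi])

x = np.linspace(0, 1, 10)
Phi_poly = polynomial_design_matrix(x, M=4)                              # shape (10, 4)
Phi_gauss = gaussian_design_matrix(x, np.linspace(0, 1, 5), sigma=0.2)   # shape (10, 6)
```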

Maximum likelihood and Least Squares

\[t ≝ y(\textbf{x}, \textbf{w}) + ε\]

where

\[p(ε \mid β) = 𝒩(ε \mid 0, β^{-1})\]

which is tantamount to saying

\[p(t \mid \textbf{x}, \textbf{w}, β) = 𝒩(t \mid y(\textbf{x}, \textbf{w}), β^{-1})\]

For $\textbf{X} = (\textbf{x}_1, ⋯, \textbf{x}_N)$, $\textbf{t} = (t_1, ⋯, t_N)$:

Likelihood function:
\[p(\textbf{t} \mid \textbf{X}, \textbf{w}, β) = \prod\limits_{ n=1 }^N 𝒩(t_n \mid \underbrace{\textbf{w}^T \textbf{ϕ}(\textbf{x}_n)}_{= y(\textbf{x}_n, \textbf{w})}, β^{-1})\]

Now, take the logarithm of the likelihood:

\[\ln p(\textbf{t} \mid \textbf{X}, \textbf{w}, β) = \sum\limits_{ n=1 }^N \ln \underbrace{𝒩(t_n \mid \textbf{w}^T \textbf{ϕ}(\textbf{x}_n), β^{-1})}_{= \left(\frac{β}{2π}\right)^{1/2} \exp\left(-\frac{β}{2} (t_n - \textbf{w}^T \textbf{ϕ}(\textbf{x}_n))^2\right)} \\ = \frac{N}{2} \ln β - \frac{N}{2} \ln(2 π) - β E_D(\textbf{w})\]

where

\[E_D(\textbf{w}) ≝ \frac 1 2 \sum\limits_{ n=1 }^N (t_n - \textbf{w}^T \textbf{ϕ}(\textbf{x}_n))^2\]

By computing the gradient and setting it to zero:

\[\nabla_{\textbf{w}} \ln p(\textbf{t} \mid \textbf{X}, \textbf{w}, β) = β \sum\limits_{ n=1 }^N (t_n - \textbf{w}^T \textbf{ϕ}(\textbf{x}_n))\textbf{ϕ}(\textbf{x}_n)^T = 0\]

Solving for $\textbf{w}$, we get:

\[\textbf{w}_{ML} = (\textbf{Φ}^T \textbf{Φ})^{-1} \textbf{Φ}^T \textbf{t}\]

where $\textbf{Φ}$ is the design matrix, defined by \(\textbf{Φ}_{n, j} ≝ ϕ_j(\textbf{x}_n)\)
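
A minimal numpy sketch of this solution (the helper names are illustrative; a least-squares solver is used rather than forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

def fit_ml(Phi, t):
    """w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via a least-squares solver."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ml

def fit_beta(Phi, t, w_ml):
    """ML estimate of the noise precision: 1/beta is the mean squared residual."""
    residuals = t - Phi @ w_ml
    return 1.0 / np.mean(residuals**2)
```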

Sequential learning

Stochastic gradient descent:

\[\textbf{w}^{(τ+1)} = \textbf{w}^{(τ)} - η \nabla E_n\\ = \textbf{w}^{(τ)} + η(t_n - (\textbf{w}^{(τ)})^T \textbf{ϕ}(\textbf{x}_n)) \textbf{ϕ}(\textbf{x}_n)\]
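
A minimal numpy sketch of this update, assuming the inputs have already been mapped to feature vectors $\textbf{ϕ}(\textbf{x}_n)$ stored as the rows of `Phi` (names are illustrative):

```python
import numpy as np

def sgd_step(w, phi_n, t_n, eta):
    """One stochastic gradient step on E_n = 0.5 * (t_n - w^T phi_n)^2."""
    error = t_n - w @ phi_n
    return w + eta * error * phi_n

def sgd_epoch(w, Phi, t, eta=0.01):
    """One pass over the data, presenting the examples one at a time."""
    for phi_n, t_n in zip(Phi, t):
        w = sgd_step(w, phi_n, t_n, eta)
    return w
```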

Avoiding overfitting

Regularization:

\[E_D(\textbf{w}) + λ E_W(\textbf{w})\] \[= \frac 1 2 \sum\limits_{ n=1 }^N (t_n - \textbf{w}^T \textbf{ϕ}(\textbf{x}_n))^2 + \frac λ 2 \sum\limits_{ j=1 }^M \vert w_j \vert^q\]

NB: the Lasso ($q = 1$) ⇒ sparser solutions than a quadratic regularizer ($q = 2$).
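
For the quadratic regularizer ($q = 2$), the regularized error is still quadratic in $\textbf{w}$ and is minimized in closed form by $\textbf{w} = (λ \textbf{I} + \textbf{Φ}^T \textbf{Φ})^{-1} \textbf{Φ}^T \textbf{t}$. A minimal sketch:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Quadratic regularizer (q = 2): w = (lambda I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```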

Bayesian Linear Regression

Define a conjugate prior over $\textbf{w}$:

\[p(\textbf{w}) = 𝒩(\textbf{w} \mid \textbf{m}_0, α^{-1} \textbf{I})\]

Prior belief acts as a regularizer.

  1. You have a prior (e.g. a Gaussian distribution over $w_0$ and $w_1$)

  2. You draw parameters from this distribution and plot the corresponding predictors $w_0 + w_1 x$ (as functions of $x$)

  3. You observe an input $x_0$, and an output $t_0$

  4. You have a likelihood (as a function of $w_0$ and $w_1$), which you multiply by the prior to get the posterior

  5. You reiterate, using the posterior as the new prior
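
A minimal sketch of one such update, assuming a zero-mean isotropic prior ($\textbf{m}_0 = 0$), for which the standard conjugate-Gaussian posterior is $\textbf{S}_N^{-1} = α\textbf{I} + β\textbf{Φ}^T\textbf{Φ}$ and $\textbf{m}_N = β\textbf{S}_N\textbf{Φ}^T\textbf{t}$:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior over w for the prior N(w | 0, alpha^{-1} I) and Gaussian noise
    with precision beta (the zero prior mean is an assumption made here)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # posterior mean
    return m_N, S_N
```

Feeding the data points in one at a time and reusing each posterior as the prior for the next point reproduces the sequential procedure above.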

⟹ Application to neural coding: function approximation with tuning curves

Binary Classification

Linear classifier:

\[y(\textbf{x}) = \textbf{w}^T \textbf{x} + w_0\]

Then: apply the Heaviside function $y ⟼ 1_{y > 0}$ to classify.
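
A minimal sketch of this classifier (names are illustrative; the rows of `X` are data points):

```python
import numpy as np

def classify(X, w, w0):
    """Linear discriminant y(x) = w^T x + w0, thresholded by the Heaviside step."""
    return (X @ w + w0 > 0).astype(int)
```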

Linear classification with least squares is sensitive to outliers, unlike logistic regression.

Linear classification with Fisher linear discriminants (a sketch follows the list below) minimizes class overlap by:

  • maximizing the separation between projected class means
  • minimizing the variance within each projected class
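
For two classes, the resulting direction is $\textbf{w} ∝ \textbf{S}_W^{-1}(\textbf{m}_2 - \textbf{m}_1)$, with $\textbf{S}_W$ the within-class scatter matrix. A minimal sketch, assuming each class is given as an array with one sample per row:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant direction w ∝ S_W^{-1} (m2 - m1),
    where S_W is the total within-class scatter matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (np.cov(X1, rowvar=False) * (len(X1) - 1)
           + np.cov(X2, rowvar=False) * (len(X2) - 1))
    return np.linalg.solve(S_W, m2 - m1)
```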

Neural network interpretation: each neuron is a classifier

Experiment: monkey looking at moving dots

Unsupervised: Principal Component Analysis (PCA)

The goal is to reduce the dimension of the data points, i.e. to find out which directions are the important ones.

Find an orthogonal basis of eigenvectors of the data covariance matrix (the first one associated with the largest eigenvalue, etc.) so that the variance of the data projected onto the span of these vectors is maximized.
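
A minimal sketch of PCA via an eigendecomposition of the sample covariance matrix (the interface is an illustrative choice):

```python
import numpy as np

def pca(X, k):
    """Project the centred data onto the k eigenvectors of the sample covariance
    matrix with the largest eigenvalues (the first k principal components)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    components = eigvecs[:, order]           # orthonormal basis, shape (D, k)
    return Xc @ components, components
```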

PCA Neural implementation: Oja’s rule

\[y_j = \sum\limits_{ i } w_{i,j} x_i \quad \text{(outputs)}\] \[x_i = \sum\limits_{ j } \underbrace{w_{i,j}}_{\text{orthonormal}} h_j \quad \text{(generative model)}\] \[Δ w_{i, j} = α y_j \left(x_i - \sum\limits_{ k } w_{i,k} y_k\right) \quad \text{(learning rule)}\]
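
A minimal sketch of one step of this rule, assuming the weights are stored as a matrix `W` with one column per output unit:

```python
import numpy as np

def oja_update(W, x, alpha=0.01):
    """One step of the rule above, with W[i, j] = w_{i,j} (shape: inputs x outputs).
    y_j = sum_i w_{i,j} x_i ;  dw_{i,j} = alpha * y_j * (x_i - sum_k w_{i,k} y_k)."""
    y = W.T @ x                      # outputs y_j
    x_hat = W @ y                    # reconstruction sum_k w_{i,k} y_k
    return W + alpha * np.outer(x - x_hat, y)
```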

Independent Component Analysis (ICA): better than PCA at recovering the hidden variables that generated the data.
