Lecture 1: Bayesian perception

Lecturer: Pantelis Leptourgos

Machine Learning Introduction

The field goes back to Alan Turing ⟶ Turing Test (1950)

Can machines do what we can do, as thinking entities?

Supervised learning:

  • Task: recognition
  • Experience: comparing the predicted label and the true label
  • Measure of efficiency: how often does it fail?


  • Input variables: $\mathbb{x} ∈ ℝ^N$
  • Hidden variables: $\mathbb{h} ∈ ℝ^K$
  • Output variables: $\mathbb{y} ∈ ℝ^K$


  • Supervised learning: $\mathbb{y}$ is given

    • Classification: discrete values
    • Regression: continuous ones
  • Unsupervised learning: $\mathbb{y}$ is not known

    • Find patterns in data (e.g. clustering, density estimation, dimensionality reduction)
  • Reinforcement learning: $\mathbb{y}$ is actions ⟶ the system learns from rewards/punishments by acting

When do we not need ML?

⟶ when the relationship between $\mathbb{x}$ and $\mathbb{y}$ is already known/can be analytically solved

Steps of Supervised ML:

  1. Pre-processing on Initial Data (feature extraction, etc…)

  2. Learn model parameters on Training Set

  3. Test the model on the Test Set

  4. Make predictions with your model
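
The four steps above can be sketched on a toy problem. This is only an illustration under made-up assumptions: the data generator (`make_data`), the one-parameter threshold classifier and all numeric values are hypothetical and not from the lecture.

```python
import random

def make_data(n, seed=0):
    """Toy 1-D data (assumption): class 0 centred at -1, class 1 centred at +1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2 * label - 1, 0.5)
        data.append((x, label))
    return data

def fit_threshold(train):
    """Step 2: learn a single parameter, the midpoint between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2

def predict(threshold, x):
    """Step 4: predict the label of a new input."""
    return 1 if x > threshold else 0

def error_rate(threshold, dataset):
    """Measure of efficiency: fraction of wrong predictions."""
    wrong = sum(predict(threshold, x) != y for x, y in dataset)
    return wrong / len(dataset)

data = make_data(400)                     # step 1 is trivial here: x is already the feature
train, test = data[:300], data[300:]      # keep a held-out test set
threshold = fit_threshold(train)          # step 2: learn on the training set
test_error = error_rate(threshold, test)  # step 3: evaluate on the test set
new_label = predict(threshold, 0.8)       # step 4: predict for a new input
```

The test error is computed only on data the model never saw during training, which is what makes it an estimate of how often the model fails on new inputs.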

Challenges in ML and Cognition

Three very well-known problems:

  1. Playing Chess/Games
  2. Moving an arm
  3. Computer vision

⟹ Curse of dimensionality ⟶ to learn, we need a lot of data (often intractable)

⟹ Computer Vision: Going from 2D-images to 3D-representations (need to use some priors)

Ex: Polynomial fitting

  • Minimize the Sum-of-Squares Error Function

  • Beware of under/over-fitting ⟶ over-fitting: you end up fitting the noise (no generalization anymore)
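
A minimal sketch of least-squares polynomial fitting, assuming the common convention $E(\mathbf{w}) = \frac{1}{2}\sum_n (y(x_n, \mathbf{w}) - t_n)^2$ for the error function. The names `fit_polynomial` and `sum_of_squares_error` and the toy data are hypothetical; solving the normal equations directly is fine for low orders but becomes numerically fragile for high ones.

```python
def fit_polynomial(xs, ts, order):
    """Least-squares coefficients w[0..order], obtained from the normal equations."""
    m = order + 1
    # Normal equations (A^T A) w = A^T t, with A[n][j] = xs[n] ** j
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    atb = [sum(t * x ** i for x, t in zip(xs, ts)) for i in range(m)]
    # Gaussian elimination with partial pivoting on the augmented matrix
    aug = [row + [b] for row, b in zip(ata, atb)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    w = [0.0] * m
    for i in reversed(range(m)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (aug[i][m] - s) / aug[i][i]
    return w

def sum_of_squares_error(w, xs, ts):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 for the polynomial with coefficients w."""
    return 0.5 * sum(
        (sum(wj * x ** j for j, wj in enumerate(w)) - t) ** 2
        for x, t in zip(xs, ts)
    )

xs = [i / 10 for i in range(11)]
ts = [1.0 + 2.0 * x - 3.0 * x ** 2 for x in xs]  # noiseless quadratic (made-up data)
w = fit_polynomial(xs, ts, order=2)              # recovers roughly [1, 2, -3]
```

With noisy targets, raising `order` keeps lowering the training error while the error on held-out points eventually rises: that gap is the over-fitting the notes warn about.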

How to get rid of over-fitting?

  • Increase the training set
  • Remove some outliers
  • Increase the data compared to the number of features
  • Regularization ⟶ penalize large coefficient values
    • Ridge/Lasso regression
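
As a sketch of what "penalizing large coefficients" means, one common way to write the regularized error is shown below. The symbols ($\mathbf{w}$ for the coefficients, $t_n$ for the targets, $\lambda$ for the penalty strength) are assumptions, not notation from the lecture, and the scaling of the penalty term varies between textbooks.

\[\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \lambda \sum_{j} |w_j|^q\]

With $q = 2$ this is Ridge regression, with $q = 1$ it is Lasso; $\lambda = 0$ gives back the plain sum-of-squares error.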

How to choose the order of the polynomial?

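  • Cross Validation: hold out part of the training data as a validation set (an intermediate step between training and testing) and keep the order with the lowest validation error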

  • $BIC$ score: the lower, the better

    \[BIC = \ln(n)\,\underbrace{k}_{\text{number of model parameters}} - 2 \ln(\underbrace{\hat{L}}_{\text{maximized likelihood}})\]

    where $n$ is the number of data points.
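
The score can be computed directly from the formula. In the sketch below, `bic` follows the definition above; `gaussian_max_log_likelihood` is a hypothetical helper that assumes i.i.d. Gaussian noise with the variance set to its maximum-likelihood value, which is one common choice for regression residuals and not something the lecture specifies.

```python
import math

def bic(n_samples, n_params, log_likelihood):
    """BIC = ln(n) * k - 2 * ln(L_hat); the lower, the better."""
    return math.log(n_samples) * n_params - 2.0 * log_likelihood

def gaussian_max_log_likelihood(residuals):
    """ln(L_hat) under i.i.d. Gaussian noise, variance fixed at its ML estimate.

    Assumes the residuals are not all zero (otherwise the variance is 0).
    """
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Same fit quality, more parameters: the more complex model gets the worse score.
score_simple = bic(n_samples=100, n_params=3, log_likelihood=-50.0)
score_complex = bic(n_samples=100, n_params=9, log_likelihood=-50.0)
```

To choose a polynomial order, one would compute the score for each candidate order on the same data and keep the order with the lowest value.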

Bayesian perception

Bayes' theorem: indicates how to update our beliefs in light of new evidence

  • Sensation: is the detection of external stimulation
  • Perception: is how we interpret/integrate these sensations

Why do we need perception? Sensation alone is not enough, because there is ambiguity everywhere:

  • Uncertainty: Noise, Ambiguity
  • Cue combination: combination of sensations
  • Accumulation of evidence
  • Latent variables: things are not directly observable

Core concepts of Bayesian perception:

  • Priors: expectations, implicit knowledge
  • Sensory data
  • The world (what you want to know something about)
  • Prediction ⟶ Decision

Goal: go from the sensory data to predictions about their cause (inversion of the generative model)

3 steps:

  1. Define generative model
  2. Bayesian inference
  3. Compute the observer’s estimate distribution
  digraph {
    X1 -> S[label="  P(S | X)"];
    S -> "X"[label="  P(X | S)"];
  }

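Likelihood = noise distribution $P(S \mid X)$. Why is it called a “likelihood” and not a probability? For a fixed measurement $S$ it is read as a function of $X$, and as such it does not have to sum (or integrate) to $1$.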

Bayesian inference to invert the generative model
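
Written out with the symbols used in these notes ($X$ for the state of the world, $S$ for the sensory data), the inversion is Bayes' theorem:

\[\underbrace{P(X \mid S)}_{\text{posterior}} = \frac{\overbrace{P(S \mid X)}^{\text{likelihood}}\;\overbrace{P(X)}^{\text{prior}}}{P(S)}, \qquad P(S) = \int P(S \mid X)\,P(X)\,dX\]

The denominator does not depend on $X$: it only rescales the product of likelihood and prior so that the posterior integrates to $1$.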

Difference between Prior Distribution and the Real Distribution?

⟶ The prior is assumed by the observer, based on their beliefs; it does not have to match the real distribution of $X$ in the world

The Likelihood function is centered at the measurement.

Variance of the likelihood function = reliability of Sensory Evidence (the smaller, the more we trust our measurements).

  • The posterior probability of the value of $X$ at the peak is called confidence.
  • Uncertainty of the posterior = the variance of the distribution
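
A minimal sketch of how the variance of the likelihood controls the posterior, assuming both the prior and the likelihood are Gaussian (an assumption made here for illustration). The function names and the numbers are hypothetical.

```python
import math

def gaussian_posterior(mu_prior, var_prior, s, var_lik):
    """Posterior mean and variance for a Gaussian prior and a Gaussian likelihood.

    The likelihood is centred at the measurement s with variance var_lik.
    """
    w = var_prior / (var_prior + var_lik)  # weight given to the measurement
    mu_post = w * s + (1 - w) * mu_prior   # reliability-weighted average
    var_post = var_prior * var_lik / (var_prior + var_lik)
    return mu_post, var_post

def peak_density(var_post):
    """Height of the Gaussian posterior at its peak (the 'confidence' above)."""
    return 1.0 / math.sqrt(2 * math.pi * var_post)

# Reliable measurement (small variance): the posterior sits close to s = 2.
reliable = gaussian_posterior(mu_prior=0.0, var_prior=1.0, s=2.0, var_lik=0.1)
# Unreliable measurement (large variance): the posterior stays close to the prior mean 0.
unreliable = gaussian_posterior(mu_prior=0.0, var_prior=1.0, s=2.0, var_lik=10.0)
```

In this Gaussian case the posterior variance is always smaller than both the prior variance and the likelihood variance, so a narrower posterior also has a higher peak.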

Decision Criteria

Maximum Likelihood:
\[\hat{X}_{ML} = \operatorname*{arg\,max}_X \big(\underbrace{L(X)}_{\stackrel{\text{def}}{=}\, P(S \mid X)}\big)\]
Maximum a Posteriori:
\[\hat{X}_{MAP} = \operatorname*{arg\,max}_X \big(P(X \mid S)\big)\]
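
The two criteria can be compared numerically by evaluating the likelihood and the posterior on a grid of candidate values of $X$. This is a sketch assuming a Gaussian likelihood and a Gaussian prior; `gaussian_pdf`, `ml_and_map`, the grid and the parameter values are all hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def ml_and_map(s, sigma_lik, mu_prior, sigma_prior, grid):
    """Return (X_ML, X_MAP) for measurement s, searching over the grid."""
    likelihood = [gaussian_pdf(s, x, sigma_lik) for x in grid]       # L(X) = P(S | X)
    prior = [gaussian_pdf(x, mu_prior, sigma_prior) for x in grid]   # P(X)
    # Unnormalized posterior: dividing by P(S) would not move the argmax.
    posterior = [l * p for l, p in zip(likelihood, prior)]
    x_ml = grid[max(range(len(grid)), key=likelihood.__getitem__)]
    x_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
    return x_ml, x_map

grid = [i / 100 for i in range(-500, 501)]  # candidate values of X in [-5, 5]
x_ml, x_map = ml_and_map(s=2.0, sigma_lik=1.0, mu_prior=0.0, sigma_prior=1.0, grid=grid)
```

Here the ML estimate lands on the measurement itself, while the MAP estimate is pulled toward the prior mean; with a flat prior the two would coincide.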

Other: Softmax, etc…

Distribution of MAP estimate:
\[P(\hat{X}_{MAP} \mid X_{true})\]

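NB: when both the prior and the likelihood are Gaussian, this distribution is Gaussian too, because $\hat{X}_{MAP}$ is then a linear function of the noisy measurement $S$.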
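
The distribution of the estimate can be approximated by simulation: fix $X_{true}$, draw many noisy measurements, and compute the MAP estimate for each. The sketch below assumes a Gaussian prior and Gaussian sensory noise; the function names, the seed and the parameter values are hypothetical.

```python
import math
import random

def map_estimate(s, sigma_lik, mu_prior, sigma_prior):
    """MAP estimate for a Gaussian prior and likelihood (equal to the posterior mean)."""
    w = sigma_prior ** 2 / (sigma_prior ** 2 + sigma_lik ** 2)
    return w * s + (1 - w) * mu_prior

def simulate_map_estimates(x_true, sigma_lik, mu_prior, sigma_prior,
                           n_trials=20000, seed=0):
    """Draw S ~ N(x_true, sigma_lik^2) on each trial and return the MAP estimates."""
    rng = random.Random(seed)
    return [
        map_estimate(rng.gauss(x_true, sigma_lik), sigma_lik, mu_prior, sigma_prior)
        for _ in range(n_trials)
    ]

estimates = simulate_map_estimates(x_true=2.0, sigma_lik=1.0,
                                   mu_prior=0.0, sigma_prior=1.0)
mean_est = sum(estimates) / len(estimates)
std_est = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / len(estimates))
# Under these assumptions: mean close to w * x_true + (1 - w) * mu_prior,
# standard deviation close to w * sigma_lik.
```

With these values the weight is $w = 0.5$, so the estimates centre near $1$ rather than at $X_{true} = 2$: the prior makes the observer biased, but less variable than the raw measurements.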
