Week 4: Large Language Models

Younesse Kaddar

Philosophy Seminar, University of Oxford

Topics in Minds and Machines: Perception, Cognition, and ChatGPT


Recap of Last Week’s Session

  • Foundations and milestones of Deep Learning
  • Architecture and functioning of Neural Networks
  • Backpropagation algorithm
  • Bias-Variance Tradeoff
  • Optimizers and Learning Rate Schedulers
  • Adversarial Attacks
  • Understanding Neural Networks through visualizations

This week:

  • Self-supervised learning as the “dark matter of intelligence” (Yann LeCun)
  • “Attention is all you need”: the Transformer architecture
  • From Transformers to Large Language Models (LLMs) like ChatGPT
  • Mechanistic interpretability
  • Prompt engineering
  • Shoggoth and “AI doomers”
  • Theory of mind for LLMs
    • Philipp and Vincent’s paper: Humans in Humans Out

1. “Dark Matter of Intelligence”

Self-Supervised Learning (SSL)

A model trained on unlabeled data learns to predict missing parts of the input (e.g. the next word in a sentence, or a missing part of an image); a toy sketch follows below.

Advantages over supervised learning:

  1. Does not require labeled data (which is expensive, time-consuming to produce, and often scarce)
  2. Scales to very large unlabeled datasets, which typically yields better performance

⟶ Yann LeCun: “Dark Matter of Intelligence”
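
To make the objective concrete, here is a toy, word-level sketch of next-token prediction: the training pairs (and hence the “labels”) come entirely from the raw text itself. The tiny corpus and the bigram-count “model” are illustrative placeholders, not how real LLMs are built (they use subword tokenizers and deep networks).

```python
# Toy word-level sketch of the self-supervised objective: the "labels" are
# simply the next tokens of the raw text, so no manual annotation is needed.
# (Illustrative only: real LLMs use subword tokenizers and deep networks,
# not a bigram count table.)
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Build (context, next-word) training pairs directly from the unlabeled text.
pairs = [(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)]

# A minimal "model": bigram counts estimating P(next word | current word).
counts = defaultdict(Counter)
for prev, nxt in pairs:
    counts[prev][nxt] += 1

def predict_next(word):
    """Most likely next word under the bigram counts."""
    return counts[word].most_common(1)[0][0]

print(predict_next("sat"))   # 'on', learned from the raw text alone
```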


2. Introduction to Attention Mechanism



Self-attention: Key Components

  • Query, Key, and Value vectors
  • Attention scores and weights (see the sketch below)

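A minimal NumPy sketch of single-head scaled dot-product attention, showing how the query, key, and value projections give rise to attention scores and weights. The dimensions and random weights are illustrative placeholders, not trained parameters.

```python
# Minimal single-head self-attention in NumPy (illustrative sketch).
# Shapes and the 1/sqrt(d_k) scaling follow Vaswani et al. (2017);
# the random weights below are placeholders, not trained parameters.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8           # 4 tokens, model width 8

X = rng.normal(size=(n, d_model))   # token embeddings (one row per token)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores  = Q @ K.T / np.sqrt(d_k)    # attention scores (n x n)
weights = softmax(scores, axis=-1)  # attention weights: each row sums to 1
output  = weights @ V               # each token becomes a weighted mix of values

print(weights.round(2))             # who attends to whom
```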


Benefits:

  • Better handling of long-range dependencies
  • Improved interpretability of model decisions



3. “Attention is all you need”: Transformer Architecture

  • “Attention is All You Need” (Vaswani et al., 2017)
    • Revolutionized NLP, and was later adopted in Computer Vision (Vision Transformers), unifying the architectures used in the two fields
    • Recurrent and Convolutional layers ⟶ replaced by self-attention
  • Key Components:
    • Multi-head attention block
    • MLP block
    • Residual path and Layer Normalization



The Math, Reminder: MLP

\[z_j ≝ \sum_{i=1}^n w_{ij} x_i + b_j \qquad o_k ≝ \sum_{j=1}^m w_{jk} σ(z_j) + b'_k\]


If $X$ is the design matrix, the NN (without the bias terms) is given by:

\[σ(X W^1) W^2\]
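
As a sanity check, the same two-layer computation in NumPy (biases omitted as in the formula; ReLU stands in as one common choice of $σ$, and the weights are random placeholders):

```python
# One-hidden-layer MLP forward pass matching the slide's sigma(X W^1) W^2
# (bias terms omitted, as in the formula; weights are random placeholders).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)       # the nonlinearity sigma

rng = np.random.default_rng(0)
N, n_in, m, k = 5, 3, 4, 2          # 5 examples, 3 inputs, 4 hidden units, 2 outputs

X  = rng.normal(size=(N, n_in))     # design matrix: one example per row
W1 = rng.normal(size=(n_in, m))
W2 = rng.normal(size=(m, k))

out = relu(X @ W1) @ W2             # sigma(X W^1) W^2
print(out.shape)                    # (5, 2)
```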

The Math, Transformer

  • Multi-head attention block (one head shown; here $σ$ denotes the row-wise softmax)
    • \[σ\big(X W_Q (X W_K)^T\big) X W_V = σ\big(X W_Q W_K^T X^T\big) X W_V\]
  • MLP block:
    • \[σ(X W^1) W^2\]
  • Layer Normalization:
    • Every row of $X$ is normalized to have mean $0$ and variance $1$ (up to learned scaling and shifting parameters)
  • Skip connection: we keep adding to the residual path
    • $X + \text{MHA}(X)$
    • $X + \text{MLP}(X)$
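
Putting the pieces together, a sketch of one Transformer block in NumPy. It uses a single attention head rather than multi-head attention, adds the $1/\sqrt{d_k}$ scaling from Vaswani et al. (2017), and applies LayerNorm before each sublayer (the pre-LN ordering discussed on the next slide); all weights are random placeholders.

```python
# One Transformer block assembled from the pieces on this slide
# (single attention head, pre-LN ordering; random weights as placeholders).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    # Normalize every row of X to mean 0 and variance 1
    # (learned scale/shift parameters omitted for brevity).
    mu  = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def attention(X, W_Q, W_K, W_V):
    d_k = W_K.shape[1]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)   # = X W_Q W_K^T X^T, scaled
    return softmax(scores) @ (X @ W_V)

def mlp(X, W1, W2):
    return np.maximum(0.0, X @ W1) @ W2               # sigma(X W^1) W^2

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

# Residual path: keep adding the sublayers' outputs back onto X.
X = X + attention(layer_norm(X), W_Q, W_K, W_V)   # X + MHA(X)
X = X + mlp(layer_norm(X), W1, W2)                # X + MLP(X)
print(X.shape)                                    # (4, 8)
```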

Pre-Layer Normalization

On Layer Normalization in the Transformer Architecture (Xiong et al., 2020): LayerNorm is applied before each sublayer (inside the residual branch) rather than after the residual addition, which gives better-behaved gradients at initialization and more stable training.



Transformers: Special case of Graph Neural Networks

  • Graph Neural Networks (GNNs)
    • NNs operating on graph-structured data
    • Transformers: a special case of GNNs. Unordered sequence ⟺ a complete graph (see the sketch below)
  • Similarities: Message-passing-like architectures

  • Differences: Transformers for sequence data (self-attention), GNNs for arbitrary graph structures

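The correspondence can be made concrete by masking the attention scores with a graph adjacency matrix: messages are only passed along edges, and the all-ones adjacency of a complete graph recovers ordinary self-attention. A small illustrative sketch (random weights, toy graphs):

```python
# Attention as message passing on a graph: mask the attention scores with an
# adjacency matrix A. The complete graph (all ones) recovers ordinary
# self-attention over the unordered sequence. (Illustrative sketch only.)
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(X, A, W_Q, W_K, W_V):
    d_k = W_K.shape[1]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    scores = np.where(A > 0, scores, -np.inf)   # only pass messages along edges
    return softmax(scores) @ (X @ W_V)

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

complete = np.ones((n, n))                      # every token "sees" every token
# Ring graph with self-loops: each token sees itself and its two neighbours.
ring = np.eye(n) + np.roll(np.eye(n), 1, 0) + np.roll(np.eye(n), -1, 0)

out_transformer = graph_attention(X, complete, W_Q, W_K, W_V)
out_ring_graph  = graph_attention(X, ring, W_Q, W_K, W_V)
print(out_transformer.shape, out_ring_graph.shape)
```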


4. From Transformers to Chat Large Language Models (LLMs)

Example: ChatGPT

  • Fine-tuned for conversational AI tasks

  • Further trained with Reinforcement Learning from Human Feedback (RLHF); see the reward-model sketch below


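At the heart of the RLHF reward-modelling stage is a pairwise preference loss: the reward assigned to the human-preferred response should exceed that of the rejected one. A minimal sketch of that loss follows; the scores below are toy placeholders, whereas in practice they come from a neural reward model evaluated on (prompt, response) pairs.

```python
# Minimal sketch of the pairwise preference loss used to train the reward
# model in RLHF: the reward of the human-preferred response should exceed
# the reward of the rejected one. The scores here are placeholder numbers.
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), averaged over comparison pairs."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # = -mean log sigmoid(margin)

# Toy scores for three comparison pairs labelled by human annotators.
print(preference_loss(r_chosen=[2.1, 0.3, 1.5], r_rejected=[0.4, 0.2, -1.0]))
```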


PEFT, Adapters, and LoRA

  • Parameter-Efficient Fine-Tuning (PEFT)
    • A family of fine-tuning methods for Large Language Models (LLMs) that update only a small fraction of the parameters.
    • Reduces computation and resource requirements, uses adapters.
  • Adapters
    • Small, specialized modules enhancing LLM performance for specific tasks.
    • Trained on task-specific data.
    • Faster and cheaper to train compared to entire LLMs.
  • LoRA (Low-Rank Adaptation)
    • Uses a low-rank update to the frozen pretrained weights to minimize the trainable parameter count (see the sketch after this list).
  • Examples: Alpaca, Vicuna, Koala models, etc.
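
A sketch of the LoRA idea in NumPy, with names and shapes adapted to a row-vector convention (illustrative, not any library's API): the pretrained weight W is frozen and only the low-rank factors A and B are trained, scaled by alpha/r as in Hu et al. (2021).

```python
# LoRA in one line of linear algebra: freeze the pretrained weight W and
# learn a low-rank update A @ B, so only r * (d_in + d_out) parameters are
# trained instead of d_in * d_out. (Sketch; values are placeholders.)
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 1024, 1024, 8, 16

W = rng.normal(size=(d_in, d_out))      # pretrained weight, kept frozen
A = rng.normal(size=(d_in, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d_out))                # trainable, zero-initialized so the
                                        # adapter starts as a no-op

def lora_forward(x):
    """x @ (W + (alpha / r) * A @ B), without materializing the full update."""
    return x @ W + (alpha / r) * ((x @ A) @ B)

x = rng.normal(size=(2, d_in))
print(lora_forward(x).shape)            # (2, 1024)
print(f"trainable params: {A.size + B.size} vs frozen: {W.size}")
```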

5. Opinions on Transformers

  • Aidan Gomez:
    • Did not realize the impact of the “Attention is All You Need” paper at first
  • Andrej Karpathy:
    • Considers the Transformer the best idea in recent AI
    • The Transformer as a General-Purpose, Efficient, Optimizable Computer:
      • Expressive (in the forward pass)
      • Optimizable (via backpropagation and gradient descent)
      • Efficient (high parallelism compute graph)

6. Mechanistic Interpretability



7. Recent Developments

  • Task-Specific Skill Localization in Fine-tuned Language Models
    • Grafting technique: a small subset of the fine-tuned parameters accounts for most of the model's performance on a given task
  • Emergent Deception and Emergent Optimization (Jacob Steinhardt)

  • Prompt engineering to achieve desired model behavior
    • Zero-shot vs few-shot learning
    • Substantial performance gains from techniques such as (see the examples below):
      • chain of thought (e.g. “Let’s work this out step by step”)
      • reflection (finding its own errors)
      • and dialoguing with itself (several personas).
    • Cat-and-mouse game: jailbreaks (e.g. wrapping a disallowed request in a simple Python print statement)
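
For illustration, the three prompting styles as plain strings (no particular model or API is assumed; the arithmetic puzzles are just stock examples):

```python
# Illustrative prompt templates (plain strings, no particular API assumed)
# contrasting zero-shot, few-shot, and chain-of-thought prompting.

zero_shot = ("Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 "
             "more than the ball. How much does the ball cost?\nA:")

few_shot = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: 11
Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
A:"""

chain_of_thought = zero_shot + " Let's work this out step by step."

for name, prompt in [("zero-shot", zero_shot),
                     ("few-shot", few_shot),
                     ("chain-of-thought", chain_of_thought)]:
    print(f"--- {name} ---\n{prompt}\n")
```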

8. Emergence




AI Safety crowd and the Shoggoth



Other Examples


Theory of Mind (ToM)

Theory of mind (ToM): the ability to attribute mental states to ourselves and others, serving as one of the foundational elements for social interaction.





“AI Explained” YT channel.



New words

The thing that blew me away was when I told ChatGPT about a “new word” - I told it that “wibble” is defined as: a sequence of four digits that are in neither ascending nor descending order. I asked it to give me an example of a wibble - and it did: 3524. I asked it for a sequence that is NOT a wibble and it said 4321. Then I asked it for an anti-wibble and no problem, 2345. Then I asked it for an example of an alpha-wibble and it said FRDS… which is amazing. It was able to understand an entirely new word… which is clever - but it was able to extrapolate from it… which is far more than I thought possible.
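
A quick sanity check of the definition in plain Python, reading “ascending” and “descending” as strictly increasing and strictly decreasing (an assumption; the anecdote does not specify):

```python
# Check the "wibble" definition from the anecdote: a sequence of four digits
# that are neither in ascending nor in descending order (read strictly here).
def is_wibble(s):
    digits = [int(c) for c in s]
    if len(digits) != 4:
        return False
    ascending  = all(a < b for a, b in zip(digits, digits[1:]))
    descending = all(a > b for a, b in zip(digits, digits[1:]))
    return not ascending and not descending

print(is_wibble("3524"))   # True: ChatGPT's example is indeed a wibble
print(is_wibble("4321"))   # False: descending, so not a wibble
print(is_wibble("2345"))   # False: ascending, the "anti-wibble"
```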

https://www.youtube.com/watch?v=cP5zGh2fui0

