Transformers Explained: How Attention Is All You Need
Neural Networks Guide
In 2017, a paper titled "Attention Is All You Need" by Vaswani et al. introduced the Transformer architecture, fundamentally changing the landscape of natural language processing and, eventually, all of AI. Let's explore how this remarkable architecture works.
The Problem with RNNs
Before Transformers, recurrent neural networks (RNNs) and their variants (LSTMs, GRUs) were the go-to architectures for sequence processing. However, they had significant limitations:
- Sequential processing: RNNs process tokens one at a time, making training impossible to parallelize.
- Vanishing gradients: Information from early tokens gets diluted over long sequences, despite LSTM gates.
- Limited context: Capturing relationships between distant tokens remains difficult.
Self-Attention: The Core Innovation
The key idea behind Transformers is self-attention — a mechanism that allows every token in a sequence to directly attend to every other token, regardless of distance. This solves all three problems of RNNs at once.
Self-attention computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). The attention score between two tokens is the dot product of the query of one and the key of the other:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Scaled dot-product attention
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Compute Q, K, V matrices
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    # Scaled dot-product attention
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax to get attention weights
    weights = softmax(scores)
    return weights @ V

Multi-Head Attention
Instead of computing a single attention function, Transformers use multi-head attention — running several attention functions in parallel with different learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
For example, one head might learn to focus on syntactic relationships (subject-verb agreement), while another focuses on semantic relationships (word meaning in context). The outputs of all heads are concatenated and linearly projected to produce the final output.
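The split-attend-concatenate scheme described above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function and weight names (`multi_head_attention`, `W_o`, etc.) are chosen for this example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (seq_len, d_model); each weight matrix: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Split each projection into heads: (n_heads, seq_len, d_head)
    def split(M):
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention within each head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh              # (n_heads, seq_len, d_head)

    # Concatenate heads and apply the final linear projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

In practice the per-head projections are learned as slices of the full-width weight matrices, which is exactly what the reshape above exploits.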
Positional Encoding
Since self-attention has no inherent notion of token order (it processes all tokens simultaneously), Transformers add positional encodings to the input embeddings. The original paper used sinusoidal functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Sinusoidal positional encoding for position pos and dimension i
This clever encoding allows the model to learn relative positions and generalize to sequence lengths not seen during training. Modern models like GPT use learned positional embeddings instead, but the principle remains the same.
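The sinusoidal formulas above translate directly into a few lines of NumPy. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # pos: (seq_len, 1); dimension index i: (1, d_model // 2)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    # One frequency per pair of dimensions: pos / 10000^(2i/d)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe
```

Each dimension pair oscillates at a different frequency, so any fixed offset between two positions corresponds to a fixed linear transformation of the encoding, which is what makes relative positions learnable.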
The Full Transformer Block
A complete Transformer block combines multi-head attention with a feed-forward network, layer normalization, and residual connections:
- Multi-head self-attention — compute attention across all positions
- Add & normalize — residual connection + layer normalization
- Feed-forward network — two linear layers with a ReLU/GELU activation
- Add & normalize — another residual connection + layer norm
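The four steps above can be sketched end to end in NumPy. For brevity this uses a single attention head and the post-norm layout from the original paper; the helper names and weight shapes are assumptions for this example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def transformer_block(X, W_q, W_k, W_v, W1, W2):
    # 1. Self-attention  2. Add & normalize
    h = layer_norm(X + self_attention(X, W_q, W_k, W_v))
    # 3. Feed-forward: two linear layers with a ReLU in between
    ffn = np.maximum(0, h @ W1) @ W2
    # 4. Add & normalize again
    return layer_norm(h + ffn)
```

Because the block maps a (seq_len, d_model) array to another of the same shape, blocks can be stacked by simply feeding each output into the next.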
GPT-3, for example, stacks 96 of these blocks. Each block refines the representation, building increasingly abstract and contextual understanding of the input.
Why Transformers Won
Transformers dominate modern AI for several compelling reasons:
- Parallelizable: All tokens are processed simultaneously during training, enabling massive speedups on GPUs.
- Long-range dependencies: Self-attention connects any two tokens directly, regardless of distance.
- Scalability: Transformers scale remarkably well — more parameters and more data consistently yield better results.
Continue Learning
Ready to see how Transformers power large language models? Our LLM module covers GPT, BERT, tokenization, and the training process in detail.