Attention Is All You Need — A Practitioner's Guide to the Transformer
Last year, Vaswani et al. published a paper that I think will matter for a long time: Attention Is All You Need. The title is deliberately provocative. For years, the dominant approach to sequence modeling has been recurrent neural networks — LSTMs, GRUs, and their bidirectional variants. This paper argues that you can throw all of that away and replace it with a single mechanism: attention.
I have been studying this paper closely during my M.Tech at NUS, and what follows is my attempt to explain the core ideas in practical terms — what the Transformer does, why it works, and why it matters. I am still working through some of the mathematical details, but the architecture is clear enough to write about.
The Problem with Recurrence
Recurrent neural networks process sequences one token at a time, left to right. The hidden state at position t depends on the hidden state at position t-1. This sequential dependency creates two problems.
First, it is slow. You cannot process position 100 until you have processed positions 1 through 99. This makes RNNs fundamentally difficult to parallelize on modern hardware. GPUs are designed for massively parallel computation, but recurrence forces sequential execution.
Second, long-range dependencies are hard. By the time the hidden state has been passed through 200 positions, information from the beginning of the sequence has been compressed, distorted, and often lost. LSTMs and GRUs mitigate this with gating mechanisms, but they do not solve it. In practice, LSTMs struggle with dependencies beyond a few hundred tokens.
The attention mechanism, introduced by Bahdanau et al. in 2014, was the first crack in the wall. Instead of relying solely on the hidden state to carry information forward, attention allows the decoder to look directly at all encoder positions and decide which ones are relevant for each output step. This was a breakthrough for machine translation — but it was still layered on top of an RNN. The recurrence remained.
The Transformer removes it entirely.
Self-Attention: The Core Idea
The central insight of the Transformer is self-attention: every position in a sequence attends to every other position in the same sequence. No recurrence. No convolution. Just attention.
For each token, the model computes three vectors: a query, a key, and a value. These are linear projections of the input embedding. The attention score between two positions is the dot product of the query at one position with the key at the other, scaled by the square root of the dimension and passed through a softmax. The output for each position is a weighted sum of all value vectors, where the weights are the attention scores.
Concretely, for a sequence of length n with model dimension d:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
This single equation replaces the entire recurrence mechanism. Every position can attend to every other position in a single step. The computation is a matrix multiplication — and matrix multiplications are exactly what GPUs are good at.
Self-attention: each token computes Q, K, V vectors. Attention scores determine how much each position contributes to the output.
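To make the equation concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names and the toy dimensions are mine, not the paper's, and the learned projections that produce Q, K, V from the embeddings are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per position
```

Note that the whole computation is two matrix multiplications and a softmax; there is no loop over positions.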
Multi-Head Attention
A single attention function computes one set of attention weights. But language has many types of relationships: syntactic, semantic, positional, coreference. A single attention head cannot capture all of them simultaneously.
The Transformer solves this by running multiple attention functions in parallel — the paper uses 8 heads. Each head has its own learned query, key, and value projections. Each head learns to attend to different types of relationships. The outputs of all heads are concatenated and projected through a final linear layer.
In practice, you can inspect what each head learns. Some heads attend to the previous word. Some attend to the syntactic head of a clause. Some attend to coreferent mentions. The model discovers these patterns without being told to look for them.
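A rough NumPy sketch of multi-head attention follows. For brevity I use one large projection per role and slice it into heads, which is equivalent to separate per-head projections; all weight names and sizes here are illustrative, not from a real implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=8):
    """Split d_model into n_heads subspaces, attend in each, concat, project."""
    n, d_model = X.shape
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_k, (h + 1) * d_k)          # this head's subspace
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])
    # Concatenate all heads and apply the final output projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model = 6, 64
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 64)
```

Each head computes its own attention weights over its own d_k-dimensional subspace, which is what lets different heads specialize.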
Positional Encoding
Without recurrence, the model has no inherent sense of word order. The sentence "the cat sat on the mat" and "mat the on sat cat the" would produce identical representations. This is clearly wrong — word order matters.
The solution is positional encoding. The paper adds a fixed sinusoidal signal to the input embeddings, where each dimension uses a sine or cosine function of different frequency. Position pos and dimension i get:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Why sinusoids? The authors hypothesize that sinusoidal encodings allow the model to learn relative positions, because for any fixed offset k, the encoding at position pos + k can be expressed as a linear function of the encoding at position pos. This is a clever mathematical property, though whether it is optimal remains an open question. I expect we will see learned positional encodings become more common.
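The two formulas above can be implemented in a few lines of NumPy. This is a sketch following the paper's definition; the function name is my own:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims
    pe[:, 1::2] = np.cos(angles)             # odd dims
    return pe

pe = positional_encoding(max_len=512, d_model=64)
print(pe.shape)   # (512, 64)
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 alternating
```

Each dimension pair oscillates at a different frequency, from one full cycle per 2π positions down to one per 10000·2π, so every position gets a unique fingerprint.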
The Architecture
The full Transformer is an encoder-decoder architecture. The encoder is a stack of 6 identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer. The decoder is also 6 layers, but with an additional cross-attention sublayer that attends to the encoder output. Residual connections and layer normalization surround each sublayer.
The Transformer: encoder (left) and decoder (right). Cross-attention connects them.
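The sublayer pattern is easier to see in code than in the diagram. Below is a heavily simplified single encoder layer in NumPy: single-head attention without learned projections, no dropout, and no learned LayerNorm gain/bias, purely to show the residual-plus-normalization wiring:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(X):
    # Simplified: single head, no learned Q/K/V projections.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ X

def encoder_layer(X, W1, b1, W2, b2):
    """One encoder layer: each sublayer is wrapped as LayerNorm(x + Sublayer(x))."""
    X = layer_norm(X + self_attention(X))              # sublayer 1: attention
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2         # two-layer ReLU MLP
    return layer_norm(X + ffn)                         # sublayer 2: feed-forward

rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 64, 256
X = rng.standard_normal((n, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
out = encoder_layer(X, W1, b1, W2, b2)
print(out.shape)  # (6, 64)
```

The full model stacks six of these, and the decoder adds a third sublayer (cross-attention) between the two shown here.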
A few architectural details are worth noting:
- The feed-forward network is a two-layer MLP with a ReLU activation and inner dimension of 2048, applied identically to each position. This is where the model stores factual knowledge — a point that is not obvious from the paper but has become clear in subsequent work.
- Masked self-attention in the decoder prevents positions from attending to future positions. This is essential for autoregressive generation — you cannot look ahead when predicting the next word.
- Label smoothing with a value of 0.1 is applied during training. This hurts perplexity (the model becomes less confident) but improves BLEU score. The trade-off between confidence and accuracy is an important practical consideration.
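The masking mentioned above is implemented by adding negative infinity to the attention scores at forbidden (future) positions before the softmax, which zeroes their weights. A small sketch, with names of my own choosing:

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular mask: position i may attend only to positions <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention_weights(Q, K):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + causal_mask(Q.shape[0])
    # softmax: exp(-inf) = 0, so masked positions get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
W = masked_attention_weights(Q, K)
print(np.round(W, 2))  # lower-triangular: row i is zero after column i
```

Because the mask is applied before the softmax, each row still sums to one over the allowed positions; the first row attends entirely to itself.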
Why This Works: Parallelism and Path Length
The Transformer outperforms recurrent models on machine translation (WMT 2014 English-to-German: 28.4 BLEU, English-to-French: 41.8 BLEU) while training significantly faster. The English-to-French model trained in 3.5 days on 8 GPUs. Comparable RNN-based results required weeks.
Two properties explain this:
Parallelism. Self-attention computes all pairwise interactions in a single matrix multiplication. There is no sequential dependency between positions, so the number of sequential operations per layer is constant rather than growing with sequence length. This allows full utilization of GPU parallelism.
Path length. In an RNN, information from position 1 must traverse n-1 steps to reach position n. In self-attention, any position can attend to any other position in a single step. The maximum path length is O(1), compared to O(n) for recurrence and O(log n) for stacked dilated convolutions. Shorter paths make it easier for gradients to propagate and thus easier to learn long-range dependencies.
RNNs require O(n) sequential steps for distant positions. Self-attention connects any two positions in O(1).
The trade-off is memory. Self-attention is O(n²) in sequence length — every position attends to every other position. For long sequences, this becomes expensive. The paper works with sequences up to 512 tokens. How to scale attention to longer sequences is an important open problem.
What This Means for Practitioners
I see three immediate implications for anyone building NLP systems.
Sequence-to-sequence is not just for translation anymore. The Transformer architecture is general. Anything that can be framed as a sequence-to-sequence problem — summarization, question answering, text generation, even code generation — can potentially benefit from this architecture. I expect the next year will see Transformers applied to tasks far beyond machine translation.
The feature engineering era for NLP is ending. With sufficient data and compute, attention-based models learn their own representations. Hand-crafted features, syntactic parsers, and task-specific architectures may become less important. This does not mean domain knowledge becomes irrelevant — it means domain knowledge shifts from feature design to data curation, evaluation design, and system architecture.
Compute is becoming the bottleneck. The base Transformer has 65 million parameters. The big model has 213 million. These are large by current standards, but I suspect they are small by future standards. If attention mechanisms scale, the limiting factor will be compute and data, not architectural innovation. This has implications for who can compete in NLP research — organizations with large compute budgets will have a structural advantage.
Open Questions
The paper leaves several questions open that I find interesting:
- Can Transformers work without the encoder-decoder split? The paper uses the full encoder-decoder for translation, but language modeling only needs the decoder. A decoder-only Transformer that generates text autoregressively is an obvious next step. OpenAI's recent GPT work explores this direction.
- How far can transfer learning go? If a Transformer learns general language representations during pre-training, can those representations transfer to downstream tasks? The ELMo paper from earlier this year suggests yes. I expect pre-trained Transformer models to become the standard starting point for NLP tasks, similar to how ImageNet pre-training transformed computer vision.
- What are the limits of self-attention? Self-attention is O(n²). Processing documents of thousands of tokens will require either approximate attention, hierarchical approaches, or architectural modifications. This is likely where the next wave of innovation will come.
- What exactly do the attention heads learn? Visualizations suggest they learn interpretable patterns, but we lack a complete understanding. Better interpretability tools for attention would help both researchers and practitioners.
Practical Notes
A few practical observations from implementing and experimenting with this architecture:
Learning rate warmup is essential. The paper uses a custom learning rate schedule that increases linearly for the first 4000 steps, then decays proportionally to the inverse square root of the step number. Without warmup, training is unstable. The Adam optimizer with the paper's exact hyperparameters (beta1=0.9, beta2=0.98, epsilon=1e-9) works well. Deviating from these defaults can cause training to diverge.
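The schedule from the paper is compact enough to write out in full. This sketch uses the formula lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5)); the function name is my own:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Learning rate: linear warmup for `warmup` steps, then 1/sqrt(step) decay."""
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises linearly to its peak at step 4000, then decays as 1/sqrt(step).
for s in [1, 1000, 4000, 40000, 400000]:
    print(s, transformer_lr(s))
```

The peak learning rate works out to roughly 7e-4 for d_model = 512; note that it is tied to the model dimension, so scaling the model implicitly rescales the schedule.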
Dropout matters more than you think. The paper applies dropout (0.1 for the base model, 0.3 for small datasets) to attention weights, sublayer outputs, and embeddings. Without dropout, the model overfits quickly on smaller datasets. For tasks with limited training data, aggressive dropout and smaller models are more practical than scaling up.
Byte-pair encoding is a reasonable default. The paper uses BPE with a shared vocabulary of 37,000 tokens for translation. BPE handles rare words and morphology gracefully, which is important for languages with rich morphology. For English-only tasks, WordPiece (as used in BERT) is a comparable alternative.
Related
- Building NLP Pipelines Before Transformers Were Easy
- From NER Pipelines to LLM Agents: How Production NLP Changed in Seven Years
- Classifying 7,000 Product Codes from Four Words of Text
The Transformer is one of those papers where the idea is simpler than the architecture it replaces, and yet it works better. Replacing recurrence with attention is a conceptual simplification that also happens to be a practical improvement. I think we are at the beginning of understanding what this architecture can do. The next few years will be interesting.
The original paper: Vaswani, A. et al., Attention Is All You Need, NeurIPS 2017. If you read one machine learning paper this year, make it this one.