Transformers have revolutionized natural language processing (NLP) by introducing an attention mechanism that lets models capture context and relationships between tokens efficiently. Unlike traditional sequence-based models such as RNNs and LSTMs, however, transformers have no built-in way of handling the order of sequential data. Enter positional encoding, a clever solution that helps transformers understand the order of tokens in input sequences.
Why Do We Need Positional Encoding?
Transformers process input as a whole rather than step-by-step. This means they view all words in a sentence simultaneously, making their architecture order-agnostic. Traditional models, like RNNs, inherently track sequence information, as they process data step-by-step. Transformers, on the other hand, need help understanding which word comes first, second, or last.
This problem is addressed by positional encoding, a technique that injects sequence order into the model. Without it, a transformer would treat “The cat sat on the mat” and “On the mat sat the cat” as identical sequences, simply because the set of words is the same.
How Does Positional Encoding Work?
Positional encoding works by adding a vector of continuous values, typically derived from sine and cosine functions, to each input embedding. These values carry information about the position (or index) of each token in the sequence.
- Input Embedding: Each token is converted to a high-dimensional vector.
- Positional Encoding: A vector of the same dimension as the token embedding is generated for each position.
- Addition: The positional encoding is added to the input embedding. The resulting vector is what the transformer layers actually process (see the sketch below).
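To make the flow concrete, here is a minimal NumPy sketch of these three steps. The vocabulary, embedding table, and 4-dimensional size are arbitrary toy choices, and random values stand in for the positional encodings; the sinusoidal way of generating them is shown in the Mathematical Formulation section below.

```python
import numpy as np

d_model, seq_len = 4, 3  # toy sizes for illustration

# 1. Input embedding: each token id is mapped to a d_model-dimensional vector.
vocab = {"I": 0, "love": 1, "AI": 2}
embedding_table = np.random.randn(len(vocab), d_model)
token_embeddings = embedding_table[[vocab[w] for w in ["I", "love", "AI"]]]

# 2. Positional encoding: one vector of the same size per position
#    (random here as a stand-in; in practice sinusoidal or learned).
positional_encodings = np.random.randn(seq_len, d_model)

# 3. Addition: the sum is the input to the transformer layers.
transformer_input = token_embeddings + positional_encodings
print(transformer_input.shape)  # (3, 4)
```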
Visual Step-by-Step Example
- Suppose you have a sentence: “I love AI”.
- Token embeddings might look like random numbers (say, 3-dimensional for simplicity):
I: [0.2, 0.8, 0.5]
love: [0.7, 0.1, 0.6]
AI: [0.5, 0.4, 0.9]
- Generate positional encodings for each position (index 0, 1, 2). These are calculated using sine and cosine functions, as described in the original “Attention Is All You Need” paper.
Example encodings (for demonstration):
Position 0: [0, 1, 0]
Position 1: [0.84, 0.54, 0.91]
Position 2: [0.91, 0.41, 0.14]
- Add positional encodings to token embeddings (the short snippet after this list checks the arithmetic):
- I: [0.2, 0.8, 0.5] + [0, 1, 0] = [0.2, 1.8, 0.5]
- love: [0.7, 0.1, 0.6] + [0.84, 0.54, 0.91] = [1.54, 0.64, 1.51]
- AI: [0.5, 0.4, 0.9] + [0.91, 0.41, 0.14] = [1.41, 0.81, 1.04]
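If you want to verify the addition yourself, the same element-wise sum in NumPy, using the illustrative numbers above:

```python
import numpy as np

token_embeddings = np.array([[0.2, 0.8, 0.5],     # I
                             [0.7, 0.1, 0.6],     # love
                             [0.5, 0.4, 0.9]])    # AI
positional_encodings = np.array([[0.0,  1.0,  0.0],    # position 0
                                 [0.84, 0.54, 0.91],   # position 1
                                 [0.91, 0.41, 0.14]])  # position 2

# Element-wise addition produces the vectors fed into the transformer.
print(token_embeddings + positional_encodings)
# [[0.2  1.8  0.5 ]
#  [1.54 0.64 1.51]
#  [1.41 0.81 1.04]]
```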
An Intuitive Visualization
Imagine the input sequence as beads on a string—each bead is a word. Embeddings give each word an identity; positional encoding tells the model the position of each bead along the string.
If you want a dynamic visualization, The Illustrated Transformer provides excellent graphics showing how positional encoding enriches the model’s understanding of word order.
Mathematical Formulation
The original transformer paper by Vaswani et al. defines positional encodings as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here, pos is the position, i is the dimension index, and d_model is the embedding size. This design gives each position in the sequence a unique signature, and the authors hypothesized it would allow the encoding to generalize to sequences longer than those seen during training. For a deeper dive into the math, see Google’s AI Blog.
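As a sketch, the formula can be implemented in a few lines of NumPy. The dimension indexing follows the paper: even indices use sine, odd indices use cosine; function and variable names here are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE matrix of shape (max_len, d_model) following Vaswani et al. (2017)."""
    pos = np.arange(max_len)[:, np.newaxis]                    # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                      # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model) # pos / 10000^(2i/d_model)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # PE(pos, 2i+1) = cos(...)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```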
Learned vs. Fixed Positional Encoding
The classical approach (from the original paper) involves fixed (non-trainable) encodings. Recent research also explores learned positional embeddings, which are trainable vectors optimized by the model itself—this is now common in models like BERT. Both approaches have their merits; you can read more in this research paper about embedding methods.
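For contrast, a learned positional embedding is simply a trainable lookup table that the model optimizes along with everything else. Below is a minimal PyTorch-style sketch; the class and attribute names are illustrative and not taken from any specific model.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable positional embeddings, similar in spirit to what BERT uses."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)  # one trainable vector per position

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_embed(positions)  # broadcast over the batch
```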
Key Takeaways
- Transformers need positional encoding to capture order in sequences.
- Sinusoidal functions provide a simple, effective way to generate positional encodings.
- Both fixed and learned encodings have applications depending on the use case.
Further Reading
- Attention Is All You Need (the original transformer paper)
- The Illustrated Transformer by Jay Alammar
- Analytics Vidhya: Positional Encoding in Transformers
Understanding positional encoding is crucial for anyone interested in the inner workings of transformers and modern NLP. It’s a testament to how smart design choices can help machines make sense of the data in an almost “human” way.