Machine Learning Academy · Lesson

Transformer Architecture: Attention, Tokens, and Context

Learners will trace the self-attention mechanism, understand how BERT reads a full sentence at once rather than left to right, and interpret the [CLS] and [SEP] special tokens.

What Is a Transformer?

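The Transformer is a neural network architecture, introduced in 2017 in the paper 'Attention Is All You Need', that has replaced recurrent networks for most NLP tasks. Unlike RNNs, which process tokens one at a time and pass a hidden state from each step to the next, Transformers process the entire sequence in parallel using a mechanism called self-attention. Because no position has to wait for the previous one, training parallelizes well on GPUs. Because any two tokens are connected directly rather than through a chain of intermediate steps, the model also captures long-range dependencies more effectively.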
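
The contrast between step-by-step and parallel processing can be sketched in a few lines. This is an illustrative toy, not a full RNN or Transformer layer: the embeddings are random stand-ins, and the names W_x, W_h, and pairwise_scores are made up for this example.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 5, 4
x = torch.randn(seq_len, d_model)  # one embedding per token (random stand-ins)

# Recurrent style: each step needs the previous hidden state,
# so the positions must be visited in order.
W_x = torch.randn(d_model, d_model)  # hypothetical input weights
W_h = torch.randn(d_model, d_model)  # hypothetical recurrent weights
h = torch.zeros(d_model)
rnn_states = []
for t in range(seq_len):
    h = torch.tanh(x[t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = torch.stack(rnn_states)  # shape (seq_len, d_model)

# Transformer style: a single matrix product compares every token
# with every other token, with no loop over positions.
pairwise_scores = x @ x.T  # shape (seq_len, seq_len)

print(rnn_states.shape, pairwise_scores.shape)
```

The loop cannot be parallelized across positions because step t depends on step t-1, whereas the matrix product has no such dependency.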

Self-Attention: Relating Every Token

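Self-attention allows each token in a sequence to attend to every token in that sequence, itself included, in a single step. For the sentence 'The bank by the river was steep', the word 'bank' can attend strongly to 'river' to resolve its meaning. Each token produces three vectors: a Query (Q), a Key (K), and a Value (V). The dot product of one token's query with another token's key scores how relevant the second token is to the first. A softmax turns each token's scores into weights that sum to 1, and the token's output is the weighted sum of all value vectors.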
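
In a trained model, the three vectors are typically produced by learned linear projections of each token's embedding. A minimal sketch follows, under several assumptions: the sentence is split into 7 word-level tokens (BERT's real tokenizer may split words differently), the embeddings are random stand-ins, and the layer names to_q, to_k, and to_v are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 7, 8  # 7 word-level tokens; sizes chosen for illustration
embeddings = torch.randn(seq_len, d_model)  # stand-ins for learned token embeddings

# Three separate projections map each embedding to its query, key, and value.
# The weights here are randomly initialized, not trained.
to_q = nn.Linear(d_model, d_model, bias=False)
to_k = nn.Linear(d_model, d_model, bias=False)
to_v = nn.Linear(d_model, d_model, bias=False)

Q = to_q(embeddings)  # shape (seq_len, d_model)
K = to_k(embeddings)
V = to_v(embeddings)
print(Q.shape, K.shape, V.shape)
```

Every token gets its own row in Q, K, and V, which is what lets all token pairs be scored with one matrix product.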

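import torch
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so repeated runs print the same weights

# Simplified self-attention for 3 tokens, d_model=4.
# Q, K, and V are random stand-ins for illustration.
Q = torch.randn(3, 4)  # queries, one row per token
K = torch.randn(3, 4)  # keys
V = torch.randn(3, 4)  # values

d_k = Q.shape[-1]  # key dimension, used to scale the scores
scores = torch.matmul(Q, K.T) / (d_k ** 0.5)  # scaled dot product, shape (3, 3)
weights = F.softmax(scores, dim=-1)  # attention weights; each row sums to 1
output = torch.matmul(weights, V)    # weighted sum of values, shape (3, 4)
print('Attention weights:', weights)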
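
The difference between reading the full sentence at once and reading left to right comes down to whether the scores are masked before the softmax. The sketch below compares the two, assuming random Q and K for 4 tokens. The causal mask is a common way to enforce left-to-right attention and is shown only for contrast; BERT-style attention applies no such mask.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 4
Q = torch.randn(seq_len, d_k)  # random stand-ins for queries
K = torch.randn(seq_len, d_k)  # random stand-ins for keys
scores = Q @ K.T / (d_k ** 0.5)

# Bidirectional (BERT-style): no mask, so every token sees the whole sequence.
bidirectional = F.softmax(scores, dim=-1)

# Left-to-right (causal): positions after the current token are set to -inf,
# which the softmax turns into a weight of exactly zero.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
causal = F.softmax(scores.masked_fill(future, float('-inf')), dim=-1)

print('Bidirectional weights:', bidirectional)
print('Causal weights:', causal)
```

In the causal matrix the entries above the diagonal are zero, so the first token can attend only to itself. In the bidirectional matrix every entry is positive, so each token draws on context from both sides.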
