Transformer Architecture: Attention, Tokens, and Context
Learners will trace the self-attention mechanism, understand how BERT reads the full sentence at once rather than left-to-right, and interpret the [CLS] and [SEP] special tokens.
What Is a Transformer?
The Transformer is a neural network architecture introduced in 2017 that has replaced recurrent networks for most NLP tasks. Unlike RNNs, which process tokens one at a time, Transformers process the entire sequence in parallel using a mechanism called self-attention. This parallelism makes training much faster on modern hardware, and because any two tokens are connected in a single attention step, the model captures long-range dependencies more effectively.
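To make the contrast concrete, here is a minimal sketch of the two processing styles. The sizes, the random weights, and the names `W_x`, `W_h`, and `rnn_states` are illustrative assumptions, and the attention half reuses the raw embeddings directly instead of learned projections, so treat it as a shape-level illustration and not a real model.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)  # stand-in token embeddings (hypothetical values)

# Recurrent style: each hidden state depends on the previous one,
# so the time steps have to be computed in order.
W_x = torch.randn(d_model, d_model) * 0.1  # illustrative input weights
W_h = torch.randn(d_model, d_model) * 0.1  # illustrative recurrent weights
h = torch.zeros(d_model)
states = []
for t in range(seq_len):
    h = torch.tanh(x[t] @ W_x + h @ W_h)
    states.append(h)
rnn_states = torch.stack(states)  # shape: (seq_len, d_model)

# Attention style: every token is compared with every other token
# in one matrix multiplication, with no loop over positions.
scores = x @ x.T / (d_model ** 0.5)      # shape: (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
attn_out = weights @ x                   # shape: (seq_len, d_model)

print(rnn_states.shape, attn_out.shape)
```

Both halves produce one vector per token, but only the recurrent loop forces position `t` to wait for position `t - 1`.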
Self-Attention: Relating Every Token
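Self-attention allows each token in a sequence to attend to every other token simultaneously. For the sentence 'The bank by the river was steep', the word 'bank' can attend strongly to 'river' to resolve its meaning. Each token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). The dot product of one token's query with another token's key, scaled by the square root of the key dimension and normalized with softmax, gives the attention weight between that pair. Each token's output is then the weighted sum of all value vectors.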
import torch
import torch.nn.functional as F
# Simplified self-attention for 3 tokens, d_model=4
Q = torch.randn(3, 4) # queries
K = torch.randn(3, 4) # keys
V = torch.randn(3, 4) # values
d_k = Q.shape[-1]
scores = torch.matmul(Q, K.T) / (d_k ** 0.5) # scaled dot product
weights = F.softmax(scores, dim=-1) # attention weights
output = torch.matmul(weights, V) # weighted values
print('Attention weights:', weights)
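The snippet above draws Q, K, and V at random. In a trained Transformer they come from the same token embeddings, passed through three learned linear projections. The following is a minimal sketch of that step, assuming a single attention head, no bias terms, and randomly initialized layers named `W_q`, `W_k`, and `W_v` (names chosen here for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

seq_len, d_model = 3, 4
x = torch.randn(seq_len, d_model)  # stand-in token embeddings (hypothetical values)

# Three separate projections of the same input (untrained here).
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)  # each has shape (seq_len, d_model)

d_k = Q.shape[-1]
scores = Q @ K.T / (d_k ** 0.5)      # pairwise query-key similarity
weights = F.softmax(scores, dim=-1)  # row i: how much token i attends to each token
output = weights @ V                 # weighted sum of value vectors

print('Row sums:', weights.sum(dim=-1))
print('Output shape:', output.shape)
```

Row `i` of `weights` is a probability distribution over the sequence, so each output vector is a blend of all value vectors, weighted by how relevant token `i` finds each position.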