Multi-Head Attention and Positions
Many views plus order awareness.
One Head Is Limiting
A single attention head produces one set of attention weights, so it can track only one kind of relationship at a time. Real language requires noticing many patterns at once. 🧠
Many Heads, Many Views
Multi-head attention runs several attention computations in parallel, each with its own learned projections and its own focus.
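As a rough illustration of the idea above, here is a minimal NumPy sketch of multi-head self-attention. The function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, the toy sizes, and the random weights are all assumptions made for this example. It also assumes the common convention of splitting the model dimension evenly across heads and mixing the heads with a final output projection. Treat it as a sketch, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Hypothetical helper: self-attention over x with num_heads parallel heads.

    x: (seq_len, d_model); each weight matrix: (d_model, d_model).
    Assumes d_model is divisible by num_heads.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Each head gets its own slice of the learned projections.
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each slice `weights[h]` is a separate attention pattern over the same five tokens, which is what lets different heads focus on different relationships.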