Scaled Dot-Product & Multi-Head
Attend in parallel subspaces.
The Scaling Problem
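When key vectors have many dimensions (a large d_k), dot-product scores grow large in magnitude and softmax puts almost all of its weight on a single key. A saturated softmax has gradients close to zero, so training stalls unless the scores are scaled down.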
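The effect can be checked numerically. The sketch below uses only the standard library and assumes query and key components are drawn independently from a standard normal distribution; the helper names `softmax` and `mean_peak_weight` are made up for this illustration. It reports the average largest softmax weight over unscaled scores as the dimension grows.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_peak_weight(d_k, n_keys=8, trials=50, seed=0):
    """Average of the largest softmax weight over unscaled q.k scores.

    Assumption: q and k components are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        total += max(softmax(scores))
    return total / trials


if __name__ == "__main__":
    for d_k in (4, 64, 256):
        print(d_k, round(mean_peak_weight(d_k), 3))
```

A peak weight near 1.0 means the softmax output is close to one-hot, which is the regime where its gradients become very small.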
Divide by Root d
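The fix is to divide each score by the square root of the key dimension, sqrt(d_k). If the components of q and k are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) brings the scores back to unit variance. This keeps the numbers in a stable range before softmax.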
scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
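To show where the scaled scores fit, here is a minimal sketch of a complete scaled dot-product attention step in PyTorch. The function name, the tensor shapes, and the choice to return the weights alongside the output are assumptions made for illustration; masking and dropout are left out.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    """Sketch of scaled dot-product attention (no mask, no dropout).

    Assumed shapes: q (..., n_q, d_k), k (..., n_k, d_k), v (..., n_k, d_v).
    """
    d_k = q.size(-1)
    # Raw similarity scores, scaled by sqrt(d_k) before softmax.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
    # Each query's weights over the keys sum to 1.
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
q = torch.randn(2, 5, 64)  # batch of 2, 5 queries, d_k = 64
k = torch.randn(2, 7, 64)  # 7 keys per batch item
v = torch.randn(2, 7, 32)  # d_v = 32
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([2, 5, 32])
print(weights.shape)  # torch.Size([2, 5, 7])
```

Reading `d_k` from the last dimension of `q` keeps the scale factor in sync with the tensors, so it does not need to be passed separately.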