Deep Learning Academy · Lesson

Scaled Dot-Product & Multi-Head

Attend in parallel subspaces.

The Scaling Problem

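When key vectors are high-dimensional, dot-product scores can grow large in magnitude, and softmax then puts nearly all of its weight on a single position. In that saturated regime the gradients through softmax become vanishingly small, so we need to scale the scores down.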
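To make the effect concrete, here is a small sketch that uses only the Python standard library. It assumes query and key components are independent draws from a standard normal distribution, which is an idealization and not something the lesson specifies. The helper names (`score_std`, `mean_peak_weight`) are hypothetical and chosen for this illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def random_vector(d_k, rng):
    """Assumption: components are i.i.d. standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(d_k)]


def score_std(d_k, n_samples, rng):
    """Empirical standard deviation of unscaled q . k scores."""
    scores = [
        dot(random_vector(d_k, rng), random_vector(d_k, rng))
        for _ in range(n_samples)
    ]
    mean = sum(scores) / n_samples
    var = sum((s - mean) ** 2 for s in scores) / n_samples
    return math.sqrt(var)


def mean_peak_weight(d_k, n_keys, n_trials, rng):
    """Average of the largest softmax weight over unscaled scores."""
    total = 0.0
    for _ in range(n_trials):
        q = random_vector(d_k, rng)
        scores = [dot(q, random_vector(d_k, rng)) for _ in range(n_keys)]
        total += max(softmax(scores))
    return total / n_trials


if __name__ == "__main__":
    rng = random.Random(0)
    for d_k in (4, 64, 256):
        std = score_std(d_k, 1000, rng)
        peak = mean_peak_weight(d_k, 8, 200, rng)
        print(f"d_k={d_k:4d}  score std={std:6.2f}  mean peak weight={peak:.3f}")
```

Under these assumptions the spread of the raw scores grows roughly like the square root of the key dimension, and the largest softmax weight moves toward 1 as the dimension increases.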

Divide by Root d_k

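The fix is to divide each score by the square root of the key dimension, d_k. If the components of the query and key are roughly independent with zero mean and unit variance, their dot product has variance d_k, so dividing by the square root of d_k brings the variance back to about 1. This keeps the scores in a stable range before softmax.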

# d_k is the size of the last dimension of q and k
scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
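The score line above is only the first step of attention. As a rough sketch of how it fits into the whole computation, the plain-Python version below scales the scores, applies softmax, and takes the weighted sum of the values. It works on lists of lists rather than tensors so that it runs without a framework. The function name `scaled_dot_product_attention` and the argument layout are assumptions made for this example, not the course's own implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Attention over plain lists (illustrative layout).

    queries: n_q rows of length d_k
    keys:    n_k rows of length d_k
    values:  n_k rows of length d_v
    Returns (outputs, weights): n_q rows of length d_v and n_q rows of length n_k.
    """
    d_k = len(keys[0])
    scale = d_k ** 0.5  # same scaling as the one-line tensor expression
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[10.0, 0.0], [0.0, 10.0]]
    out, w = scaled_dot_product_attention(q, k, v)
    print("weights:", w[0])
    print("output: ", out[0])
```

Each row of weights sums to 1, and a query that lines up with one key pulls the output toward that key's value row.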
