Deep Learning Academy · Lesson

Scaled Dot-Product & Multi-Head

Attend in parallel subspaces.

The Scaling Problem

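When query and key vectors are high-dimensional, their dot products tend to grow in magnitude. If the components are roughly independent with zero mean and unit variance, the dot product has variance d_k, so typical scores are on the order of the square root of d_k. Feeding scores that large into softmax pushes almost all the weight onto a single key, and in that saturated regime the gradients become vanishingly small. So we need to scale the scores down before softmax.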
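To make the problem concrete, here is a minimal sketch in plain Python (standard library only, no tensors). It assumes query and key components are independent draws from a standard normal distribution, which is an idealization and not something real models guarantee. The helper names `softmax` and `dot_product_std` and the example score list are made up for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_std(d_k, n_samples=1000, seed=0):
    """Empirical std of q . k when every component is an i.i.d. N(0, 1) draw (assumption)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / n_samples)

# The spread of raw scores grows roughly like sqrt(d_k).
std_small = dot_product_std(4)    # expected to land near sqrt(4) = 2
std_large = dot_product_std(256)  # expected to land near sqrt(256) = 16

# The same relative scores at a larger magnitude make softmax far sharper.
base = [0.5, -0.3, 1.0, 0.1]             # hypothetical raw scores
mild = softmax(base)                     # weight is spread across the keys
sharp = softmax([16 * s for s in base])  # nearly all weight on one key

print(std_small, std_large)
print(max(mild), max(sharp))
```

With the small scores no key takes even half of the weight, while the 16x version puts nearly everything on one key. That is the razor-sharp regime the scaling is meant to avoid.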

Divide by Root d

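The fix is to divide each score by the square root of the key dimension, d_k. If the query and key components have roughly unit variance, the scaled scores also have variance close to 1 regardless of d_k, which keeps softmax out of its saturated range.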

d_k = k.shape[-1]  # key dimension (last axis of k); assumes q and k are tensors
scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
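For context, the scaled scores are normally passed through softmax and used to take a weighted average of the value vectors. The sketch below shows that full step on plain nested lists, so it runs without any tensor library. It is an illustration under assumptions: a single head, no masking and no batching. The function name `scaled_dot_product_attention` and the toy inputs are chosen here for the example.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """q: n_q rows of length d_k, k: n_k rows of length d_k, v: n_k rows of length d_v."""
    d_k = len(k[0])
    scale = math.sqrt(d_k)
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    # Each row of weights is a probability distribution over the keys.
    weights = [softmax(row) for row in scores]
    # output_i = sum over j of weights[i][j] * v_j
    d_v = len(v[0])
    out = [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(d_v)]
           for row in weights]
    return out, weights

# Toy example: one query that matches the first key more closely than the second.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out, weights = scaled_dot_product_attention(q, k, v)
print(weights[0])  # more weight on the first key
print(out[0])      # output is pulled toward the first value vector
```

The division by `scale` is the only line that differs from unscaled dot-product attention. The rest is the usual softmax followed by a weighted sum of values.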
