CUDA Academy · Lesson

Multi-Block Final Reduction

Combining per-block partial sums.

Blocks Cannot Talk

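A reduction within a block is straightforward: its threads share memory and can wait for each other with __syncthreads(). Blocks are different. They run independently, in no guaranteed order, and an ordinary kernel launch gives them no barrier to synchronize on mid-kernel. So a single tree reduction cannot fold every block's result into one total.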

Each Block Produces a Partial

Instead, every block reduces its own chunk to a single number, its partial sum, and writes it to a small output array at index blockIdx.x. Those partials are then combined in a final step.

// After the in-block reduction, data[0] holds this block's total.
if (tid == 0)
  out[blockIdx.x] = data[0];
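
To make the final step concrete, here is a minimal sketch of one common way to combine the partials: launch the same kernel again with the partial sums as its input, and repeat until a single block remains. Several things here are assumptions and not taken from the lesson: the kernel name block_sum, the host helper sum_on_gpu, the block size of 256, and the choice of a shared array named data to match the snippet above. Error handling is reduced to a single assert, so treat this as an illustration, not production code.

```cuda
#include <cassert>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed value)
static_assert((BLOCK & (BLOCK - 1)) == 0, "BLOCK must be a power of two");

// Each block reduces its chunk of `in` in shared memory, then thread 0
// writes the block's partial sum to out[blockIdx.x].
__global__ void block_sum(const float* in, float* out, int n) {
  __shared__ float data[BLOCK];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  data[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
  __syncthreads();

  // Sequential addressing: halve the active range on every step.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) data[tid] += data[tid + s];
    __syncthreads();
  }

  if (tid == 0)
    out[blockIdx.x] = data[0];
}

// Hypothetical host helper: keep reducing the partials until one value is left.
float sum_on_gpu(const std::vector<float>& host) {
  int n = static_cast<int>(host.size());
  if (n == 0) return 0.0f;

  int blocks = (n + BLOCK - 1) / BLOCK;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, blocks * sizeof(float));
  cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  while (true) {
    blocks = (n + BLOCK - 1) / BLOCK;
    block_sum<<<blocks, BLOCK>>>(in, out, n);
    if (blocks == 1) break;  // out[0] now holds the grand total
    // Launches on the same stream run in order, so the kernel boundary
    // acts as the grid-wide barrier that blocks lack inside a kernel.
    std::swap(in, out);      // the partials become the next pass's input
    n = blocks;
  }

  float result = 0.0f;
  cudaError_t err =
      cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);  // minimal check; real code should check every call
  (void)err;
  cudaFree(in);
  cudaFree(out);
  return result;
}
```

Relaunching the kernel is only one option. With a small number of partials it can be just as reasonable to copy the out array back and add it up on the CPU. The relaunch version is shown here because it reuses the in-block reduction unchanged.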
