CUDA Academy · Lesson

Multi-Block Final Reduction

Combining per-block partial sums.

Blocks Cannot Talk

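Reducing within a block is straightforward: its threads share memory and can synchronize with __syncthreads(). Blocks, however, run independently and have no ordinary barrier between them mid-kernel, so a single pass of the block-level reduction cannot sum the whole array.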

Each Block Produces a Partial

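Instead, every block reduces its own chunk of the input to a single number, its partial sum, and writes that number to a small output array at index blockIdx.x. The output array holds one entry per block, and those entries still have to be combined in a final step.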

// After the in-block reduction, data[0] holds this block's partial sum.
if (tid == 0)
  out[blockIdx.x] = data[0];
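
To see how the partials might be combined, here is a minimal sketch rather than a definitive implementation. It assumes that `data` in the snippet above is a shared-memory array with one float per thread, that the block size is a power of two, and that the final step is done by relaunching the same kernel on the array of partials until one value remains. The names `block_sum` and `reduce_sum` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Each block sums up to blockDim.x elements of `in` and writes one partial
// sum to out[blockIdx.x]. Assumes blockDim.x is a power of two.
__global__ void block_sum(const float* in, float* out, int n) {
    extern __shared__ float data[];            // assumed: one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    data[tid] = (i < n) ? in[i] : 0.0f;        // pad the last block with zeros
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) data[tid] += data[tid + s];
        __syncthreads();
    }

    // Thread 0 publishes this block's partial sum.
    if (tid == 0) out[blockIdx.x] = data[0];
}

// Hypothetical host helper: launch block_sum repeatedly, feeding each pass's
// partials into the next, until a single block produces the total.
float reduce_sum(const std::vector<float>& host_in) {
    const int threads = 256;
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;

    int blocks = (n + threads - 1) / threads;
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + threads - 1) / threads;
        block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
        if (blocks == 1) break;                // d_out[0] now holds the total
        std::swap(d_in, d_out);                // partials become the next input
        n = blocks;
    }

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `reduce_sum` on a host vector returns the total after as many launches as it takes to shrink the partials to one value. Kernel launches on the same stream run in order, so each launch boundary acts as the grid-wide synchronization point that blocks lack inside a kernel. Other ways to finish, such as copying the partials back and summing them on the CPU or accumulating with atomicAdd, trade off differently and are not shown here.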

