Multi-Block Final Reduction
Combining per-block partial sums.
Blocks Cannot Talk
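Reducing within a block is easy because its threads share memory and can wait for each other with __syncthreads(). Blocks are different: in an ordinary launch they run independently, in no guaranteed order, and cannot synchronize with each other mid-kernel. So the in-block tree alone cannot sum an array that spans several blocks.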
Each Block Produces a Partial
Instead, every block reduces its own chunk to a single number, its partial sum, and thread 0 writes that number to a small output array indexed by blockIdx.x.
if (tid == 0)
    out[blockIdx.x] = data[0];  // data[0] holds this block's sum after the in-block reduction
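The partials still have to be combined. One common way, sketched here under assumptions rather than taken from the lesson, is to launch the same kernel again on the array of partials and repeat until one block is enough. Kernels launched on the same stream run in order, so the launch boundary serves as the grid-wide synchronization that blocks lack inside a kernel. The names block_sum, sum_on_device and kThreads are hypothetical, the block size of 256 is an arbitrary power of two, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces its chunk in shared memory and writes one partial sum.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float data[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    data[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Sequential-addressing tree: halve the active range each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) data[tid] += data[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = data[0];
}

// Hypothetical host helper: relaunch on the partials until one value is left.
float sum_on_device(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;

    int blocks = (n + kThreads - 1) / kThreads;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + kThreads - 1) / kThreads;
        block_sum<<<blocks, kThreads>>>(d_in, d_out, n);
        if (blocks == 1) break;   // d_out[0] now holds the total
        std::swap(d_in, d_out);   // the partials become the next input
        n = blocks;
    }

    float total = 0.0f;
    cudaMemcpy(&total, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return total;
}
```

Calling sum_on_device on 1000 elements launches 4 blocks in the first pass and 1 block in the second. When only a few partials remain, copying them to the host and adding them there is a reasonable alternative to another launch.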