CUDA Academy · Lesson

__shfl_down_sync for Reductions

Warp reductions without barriers.

Shuffle Moves Registers

A shuffle lets one lane read another lane's register value directly. Data hops between threads of the same warp with no shared memory and no __syncthreads() barrier in between.

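The workhorse for reductions is __shfl_down_sync. Each lane reads val from the lane offset positions above it, so lane i gets the value held by lane i + offset. A lane whose source would fall past the end of the warp gets its own value back.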

float v = __shfl_down_sync(mask, val, offset);
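
To show how that single call becomes a reduction, here is a minimal sketch of a warp-wide sum. It assumes a 32-lane warp with every lane active (hence the full mask 0xffffffff). The names warp_reduce_sum, warp_sum_kernel, and run_warp_sum are made up for this example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Sum `val` across the 32 lanes of a warp. After the loop, lane 0 holds
// the full total; the other lanes hold partial results and are ignored.
// Assumption: all 32 lanes are active, so the full mask is valid.
__device__ float warp_reduce_sum(float val) {
    const unsigned mask = 0xffffffffu;
    // Halve the offset each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1) {
        // Lane i adds the value currently held by lane i + offset.
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;
}

// One warp: lane i starts with the value i + 1, so the lanes hold 1..32.
__global__ void warp_sum_kernel(float* out) {
    float val = static_cast<float>(threadIdx.x + 1);
    float total = warp_reduce_sum(val);
    if (threadIdx.x == 0) {
        *out = total;  // only lane 0 has the complete sum
    }
}

// Hypothetical host helper: launches a single warp and returns lane 0's result.
float run_warp_sum() {
    float* d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    warp_sum_kernel<<<1, 32>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling run_warp_sum() from host code should return 528, the sum of 1 through 32. Five shuffle steps cover all 32 lanes because each step halves the number of lanes that still hold a useful partial sum.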