CUDA Academy · Lesson

The Load-Sync-Compute Pattern

Staging tiles before computing on them.

Three Simple Phases

Tiling follows the same three-phase rhythm in every kernel: load a tile into shared memory, synchronize the block with __syncthreads(), then compute from the fast on-chip copy.

In the load phase, each thread reads one element from global memory and stores it into a shared-memory tile that every thread in its block can see.

tile[threadIdx.x] = in[globalIndex];
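
To see where that load line sits, here is a minimal sketch of all three phases in one kernel. It is an illustration under stated assumptions, not the lesson's reference code: the kernel name neighborSum, the host helper runNeighborSum, the tile size TILE = 256, and the "add your left neighbor" computation are all hypothetical choices made for the example. Error checking on the CUDA runtime calls is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed tile size: one tile element per thread in the block.
constexpr int TILE = 256;

// Hypothetical kernel: each output is an element plus its left neighbor
// inside the same tile (the first thread of a tile has no left neighbor).
__global__ void neighborSum(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1, load: one element per thread, padding past the end with 0.
    tile[threadIdx.x] = (globalIndex < n) ? in[globalIndex] : 0.0f;

    // Phase 2, sync: every thread in the block reaches this barrier,
    // so the whole tile is written before anyone reads from it.
    __syncthreads();

    // Phase 3, compute: read neighbors from shared memory, not global memory.
    if (globalIndex < n) {
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
        out[globalIndex] = tile[threadIdx.x] + left;
    }
}

// Hypothetical host helper: copies data over, launches the kernel, copies back.
std::vector<float> runNeighborSum(const std::vector<float>& hostIn)
{
    int n = static_cast<int>(hostIn.size());
    std::vector<float> hostOut(n, 0.0f);
    if (n == 0) return hostOut;

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* devIn = nullptr;
    float* devOut = nullptr;
    cudaMalloc(&devIn, bytes);
    cudaMalloc(&devOut, bytes);
    cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;  // round up so every element is covered
    neighborSum<<<blocks, TILE>>>(devIn, devOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hostOut.data(), devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return hostOut;
}

int main()
{
    std::vector<float> in(1000, 1.0f);
    std::vector<float> out = runNeighborSum(in);
    printf("out[0] = %.1f, out[1] = %.1f\n", out[0], out[1]);
    return 0;
}
```

Note that the barrier sits outside the bounds check: the out-of-range threads in the last block still write a padding value and still reach __syncthreads(), which avoids having part of a block skip the barrier. The first thread of each tile only sees its own element here, because its left neighbor lives in a different block's tile.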