CUDA Academy · Lesson

Looping Over Tile Phases

Accumulating partial sums across tiles.

The Dot Product Is Split

The dot product of a full row and a full column spans more elements than one tile can hold. So you split that long sum into chunks of width TILE, one chunk per phase, and add each chunk's partial sum to a running total.

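If the matrices are N wide and tiles are TILE wide, you need N / TILE phases, rounded up, to cover the whole inner dimension. The rounding matters when N is not a multiple of TILE: the last phase then covers only a partial tile.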

int numPhases = (N + TILE - 1) / TILE;
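
To see how the phase count drives the accumulation, here is a minimal host-side sketch in plain C++ rather than a kernel. It is not the lesson's own code: the names numPhasesFor and phasedDot are hypothetical, and the bounds check on the last phase is an assumed way of handling an N that is not a multiple of TILE.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: how many tile-wide phases cover n elements.
// (Hypothetical helper name, used only for this illustration.)
int numPhasesFor(int n, int tile) {
    assert(n >= 0 && tile > 0);
    return (n + tile - 1) / tile;
}

// Dot product of one row and one column, accumulated one phase at a time.
// (Hypothetical function; a CPU stand-in for what one thread would compute.)
float phasedDot(const std::vector<float>& row,
                const std::vector<float>& col,
                int tile) {
    assert(row.size() == col.size());
    const int n = static_cast<int>(row.size());
    const int numPhases = numPhasesFor(n, tile);

    float acc = 0.0f;  // running sum carried across all phases
    for (int phase = 0; phase < numPhases; ++phase) {
        const int base = phase * tile;  // first inner index of this chunk
        for (int k = 0; k < tile; ++k) {
            const int idx = base + k;
            if (idx < n) {  // guard: the last phase may be a partial tile
                acc += row[idx] * col[idx];
            }
        }
    }
    return acc;
}
```

With 10 elements and a tile width of 4, numPhasesFor returns 3, and the third phase touches only the last 2 elements. In an actual CUDA kernel, each phase would typically first load its chunk into shared memory and synchronize with __syncthreads() before and after the inner loop; that part is left out here to keep the sketch runnable on the CPU.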