CUDA Academy · Lesson

Tiling the Inner Product

Loading sub-tiles of A and B per phase.

The Tiling Idea

Tiling splits A and B into small square TILE × TILE sub-matrices that fit in fast on-chip memory. The threads of a block cooperate to load each tile from global memory once, and every loaded value is then reused by TILE threads in that block.
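
To make the reuse concrete, here is a back-of-the-envelope sketch that counts global-memory load instructions for a square N × N multiply. The helper names, and the assumption that the tile width divides N evenly, are illustrative and not part of the lesson. The counts also ignore hardware caches, so treat them as a rough model rather than a measurement.

```cuda
// Hypothetical host-side helpers for illustration only.
// Assumes square N x N matrices and a tile width that divides N evenly.

// Naive kernel: each of the N*N output elements reads a full row of A
// and a full column of B from global memory, i.e. 2*N loads per element.
long long naive_global_loads(long long n) {
    return n * n * 2 * n;
}

// Tiled kernel: each output element's thread loads one A value and one
// B value per phase, and there are N / tile phases.
long long tiled_global_loads(long long n, long long tile) {
    return n * n * 2 * (n / tile);
}
```

For N = 1024 and a tile width of 16, this model gives 2,147,483,648 loads for the naive kernel and 134,217,728 for the tiled one: a 16x reduction, equal to the tile width.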

Why Shared Memory

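A tile lives in __shared__ memory, an on-chip memory space visible to every thread in the block. Reading from it is far faster than fetching the same values from global memory again and again. ⚡ Because the threads fill the tile together, the block must call __syncthreads() after loading and before any thread reads a value that another thread wrote.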

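// Declared inside the kernel; TILE must be a compile-time constant.
__shared__ float As[TILE][TILE];  // current tile of A
__shared__ float Bs[TILE][TILE];  // current tile of B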
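
The declarations above might be used in a kernel along the following lines. This is a minimal sketch, not the course's reference implementation. It assumes square row-major N × N matrices, N a multiple of TILE, and TILE = 16 (the lesson does not fix a value). The names tiled_matmul and tiled_matmul_host are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define TILE 16  // assumed tile width; the lesson does not specify one

// Sketch: C = A * B for square, row-major N x N matrices, N % TILE == 0.
// Expects a TILE x TILE thread block and an (N/TILE) x (N/TILE) grid.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int phase = 0; phase < N / TILE; ++phase) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + phase * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(phase * TILE + threadIdx.y) * N + col];
        __syncthreads();  // the whole tile must be loaded before anyone reads it

        // Partial inner product over this phase, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next phase overwrites
    }
    C[row * N + col] = acc;
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and copies the result back. Returns false on any CUDA error
// or if N is not a positive multiple of TILE.
bool tiled_matmul_host(const float* hA, const float* hB, float* hC, int N) {
    if (N <= 0 || N % TILE != 0) return false;
    size_t bytes = size_t(N) * N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid(N / TILE, N / TILE);
        tiled_matmul<<<grid, block>>>(dA, dB, dC, N);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

The second __syncthreads() matters as much as the first: without it, a fast thread could start overwriting As and Bs for the next phase while slower threads are still reading the current one. Matrices whose size is not a multiple of TILE would need bounds checks on both the loads and the final store, which this sketch leaves out.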
