Tiling the Inner Product
Loading sub-tiles of A and B per phase.
The Tiling Idea
Tiling breaks the matrices into small square tiles that fit in fast on-chip memory. Threads cooperate to load a tile once and reuse it many times.
Why Shared Memory
A tile lives in __shared__ memory, visible to every thread in the block. Reading it is far faster than hitting global memory again and again. ⚡
__shared__ float As[TILE][TILE];
__shared__ float Bs[TILE][TILE];All lessons in this course
- The Naive Matmul Kernel
- Tiling the Inner Product
- Looping Over Tile Phases
- Measuring the Speedup