CUDA Academy · Lesson

Tiling the Inner Product

Loading sub-tiles of A and B per phase.

The Tiling Idea

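Tiling splits A and B into small square TILE × TILE sub-matrices (tiles) that fit in fast on-chip memory. The threads of a block cooperate to load each tile from global memory once, and every loaded value is then reused by TILE threads in that block instead of being fetched again.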
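To make the reuse concrete, here is a rough count of global-memory loads for a square N × N matmul. It is a back-of-the-envelope sketch, not a measurement: it ignores hardware caches and assumes N is a multiple of TILE. The helper names and the sizes 1024 and 16 are hypothetical, chosen only for illustration.

```cuda
#include <cstdio>

// Hypothetical helpers for illustration; not part of the lesson's code.

// Naive kernel: each of the n*n output elements reads a full row of A
// and a full column of B, i.e. 2*n floats from global memory.
constexpr long long naive_global_loads(long long n) {
    return n * n * 2 * n;
}

// Tiled kernel (assumes n % tile == 0): the thread for each output element
// loads one float of A and one of B per phase, and there are n / tile phases.
constexpr long long tiled_global_loads(long long n, long long tile) {
    return n * n * 2 * (n / tile);
}

int main() {
    const long long n = 1024, tile = 16;  // assumed sizes
    std::printf("naive loads: %lld\n", naive_global_loads(n));
    std::printf("tiled loads: %lld\n", tiled_global_loads(n, tile));
    std::printf("reduction:   %lldx\n",
                naive_global_loads(n) / tiled_global_loads(n, tile));
    return 0;
}
```

Under these assumptions the reduction factor equals the tile width, which is why a larger tile (as long as it still fits in shared memory) means less global traffic.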

Why Shared Memory

Each tile lives in an array declared __shared__, which is stored on-chip and visible to every thread in the block. Reading a value from it has much lower latency than going back to global memory for the same value again and again. ⚡

__shared__ float As[TILE][TILE];
__shared__ float Bs[TILE][TILE];
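A minimal sketch of how these two arrays might be used inside a kernel follows. It assumes square n × n row-major matrices with n a multiple of TILE, and a TILE value of 16 that the lesson itself does not fix. The names tiled_matmul and max_error_vs_cpu are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the lesson does not specify one

// Sketch: C = A * B for n x n row-major matrices, n a multiple of TILE.
// Launch with block(TILE, TILE) and grid(n / TILE, n / TILE).
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int phase = 0; phase < n / TILE; ++phase) {
        // Each thread loads one element of A and one of B for this phase.
        As[threadIdx.y][threadIdx.x] = A[row * n + phase * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(phase * TILE + threadIdx.y) * n + col];
        __syncthreads();  // the whole tile must be loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next phase overwrites the tiles
    }
    C[row * n + col] = acc;
}

// Hypothetical host helper: runs the kernel and returns the largest
// absolute difference from a plain CPU reference.
float max_error_vs_cpu(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;
        hB[i] = (i % 5) * 0.25f;
    }
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    tiled_matmul<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            max_err = std::fmax(max_err, std::fabs(ref - hC[i * n + j]));
        }
    return max_err;
}

int main() {
    std::printf("max abs error: %g\n", max_error_vs_cpu(64));
    return 0;
}
```

The two __syncthreads() calls matter because the tiles are shared: the first keeps any thread from reading a tile that is only partly loaded, and the second keeps a fast thread from overwriting values that slower threads are still using.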

All lessons in this course

  1. The Naive Matmul Kernel
  2. Tiling the Inner Product
  3. Looping Over Tile Phases
  4. Measuring the Speedup