Vectorized Loads with float4
Wider transactions for more bandwidth.
Move More Per Instruction
A scalar load fetches one value per instruction. A vectorized load fetches several adjacent values with a single wider instruction, so each thread issues fewer load instructions to move the same number of bytes.
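To make the contrast concrete, here is a minimal sketch of the same copy written both ways. The names `copy_scalar`, `copy_vec4`, and `run_copy_demo` are illustrative and not part of the lesson. The sketch assumes the element count is a multiple of 4 and that the buffers come from `cudaMalloc`, whose allocations are aligned well beyond the 16 bytes a `float4` access needs.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Scalar version: each thread moves one 4-byte float per load.
__global__ void copy_scalar(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Vectorized version: each thread moves four adjacent floats (16 bytes)
// with one load and one store. Assumes n4 == n / 4 and 16-byte-aligned pointers.
__global__ void copy_vec4(const float4* in, float4* out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Hypothetical helper: runs both kernels and reports whether each copy
// reproduced the input. Error handling is kept minimal for brevity.
bool run_copy_demo(int n) {
    if (n <= 0 || n % 4 != 0) return false;  // this sketch has no tail handling

    std::vector<float> h_in(n), h_a(n, 0.0f), h_b(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;

    // One thread per float.
    copy_scalar<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_a.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // One thread per group of four floats, so a quarter as many threads.
    cudaMemset(d_out, 0, bytes);
    int n4 = n / 4;
    copy_vec4<<<(n4 + threads - 1) / threads, threads>>>(
        reinterpret_cast<const float4*>(d_in),
        reinterpret_cast<float4*>(d_out), n4);
    cudaMemcpy(h_b.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_a == h_in && h_b == h_in;
}
```

Calling `run_copy_demo(1 << 20)` from `main` exercises both kernels. Whether the vectorized version is actually faster depends on the GPU and the access pattern, so it is worth measuring rather than assuming.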
Meet float4
CUDA ships built-in vector types. A float4 packs four floats into one 16-byte bundle that a thread can load or store with a single instruction, provided the address is 16-byte aligned.
float4 v = make_float4(1, 2, 3, 4);
float first = v.x; // also .y .z .w
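The float4 snippet above can be extended into a small kernel that reads a bundle, works on its components, and writes it back. This is a sketch under stated assumptions, not a tuned implementation: `scale_vec4` and `scale_on_gpu` are hypothetical names, and the buffer is assumed to come from `cudaMalloc` so that reinterpreting it as `float4*` is safely aligned. Elements left over when the length is not a multiple of 4 are handled with ordinary scalar accesses.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Multiplies n floats by s in place. Complete groups of four go through
// float4; the remaining n % 4 elements use scalar accesses.
__global__ void scale_vec4(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;  // number of complete float4 groups

    if (i < n4) {
        float4* data4 = reinterpret_cast<float4*>(data);
        float4 v = data4[i];  // one 16-byte load
        data4[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);  // one 16-byte store
    }

    // Tail: the first n % 4 threads each handle one leftover element.
    int tail = 4 * n4 + i;
    if (tail < n) data[tail] *= s;
}

// Hypothetical host wrapper: copies h to the GPU, scales it, copies it back.
// Returns an empty vector if the device allocation fails.
std::vector<float> scale_on_gpu(std::vector<float> h, float s) {
    int n = static_cast<int>(h.size());
    if (n == 0) return h;

    size_t bytes = h.size() * sizeof(float);
    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return {};
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (n / 4) / threads + 1;  // at least one block, so the tail is always covered
    scale_vec4<<<blocks, threads>>>(d, n, s);

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

For example, `scale_on_gpu({1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f}, 2.0f)` processes the first four values as one float4 and the last two through the scalar tail, returning `{2, 4, 6, 8, 10, 12}`.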