CUDA Academy · Lesson

Vectorized Loads with float4

Wider transactions for more bandwidth.

Move More Per Instruction

A scalar load fetches one value per thread per instruction. A vectorized load grabs several adjacent values in a single wider instruction, so the same data moves with fewer load instructions.

Meet float4

CUDA ships built-in vector types. A float4 packs four floats (x, y, z, w) into one 16-byte bundle you can load or store together, provided the address is 16-byte aligned.

float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f); // x = 1, y = 2, z = 3, w = 4
float first = v.x; // the other components are v.y, v.z, v.w
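
To see float4 used for wider loads and stores, here is a minimal sketch of a copy kernel that moves four floats per instruction. The names copy_vec4 and count_copy_mismatches are hypothetical and chosen only for illustration. The sketch assumes the buffers come from cudaMalloc (so they are at least 16-byte aligned) and handles element counts that are not a multiple of four with a scalar tail loop.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// float4 is four packed floats and requires 16-byte alignment.
static_assert(sizeof(float4) == 16, "float4 should be 16 bytes");
static_assert(alignof(float4) == 16, "float4 should be 16-byte aligned");

// Hypothetical kernel: copies n floats, four at a time where possible.
__global__ void copy_vec4(const float* __restrict__ in,
                          float* __restrict__ out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int n_vec  = n / 4;  // number of whole float4 groups

    // Valid only because the host verifies both pointers are 16-byte aligned.
    const float4* in4  = reinterpret_cast<const float4*>(in);
    float4*       out4 = reinterpret_cast<float4*>(out);

    // Grid-stride loop: one 16-byte load and one 16-byte store per iteration.
    for (int i = tid; i < n_vec; i += stride) {
        out4[i] = in4[i];
    }

    // Scalar tail: the last n % 4 elements that do not fill a float4.
    for (int i = n_vec * 4 + tid; i < n; i += stride) {
        out[i] = in[i];
    }
}

// Hypothetical host helper: runs the copy and returns the number of
// mismatched elements, or -1 if setup or any CUDA call fails.
int count_copy_mismatches(int n)
{
    if (n <= 0) return -1;

    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in  = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    // Check the alignment the kernel's reinterpret_cast relies on.
    bool ok = reinterpret_cast<std::uintptr_t>(d_in)  % 16 == 0 &&
              reinterpret_cast<std::uintptr_t>(d_out) % 16 == 0;

    ok = ok && cudaMemcpy(d_in, h_in.data(), bytes,
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        // One thread per float4 group, rounded up; at least one block for n > 0.
        const int blocks = (n + threads * 4 - 1) / (threads * 4);
        copy_vec4<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // The device-to-host copy also waits for the kernel to finish.
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }
    return mismatches;
}
```

On a machine with a working CUDA device, calling count_copy_mismatches(1003) from a host main exercises both the float4 path (250 groups) and the 3-element scalar tail. If the pointer were offset so it was no longer 16-byte aligned, the float4 cast would be invalid, which is why the sketch checks alignment before launching.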
