CUDA Academy · Lesson

Vectorized Loads with float4

Wider transactions for more bandwidth.

Move More Per Instruction

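A scalar load fetches one value per instruction; for a float, that is 4 bytes per thread. A vectorized load fetches several adjacent values with a single wider instruction, for example 16 bytes per thread, so the same data moves with fewer load instructions.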
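
As a baseline for comparison, here is a minimal sketch of the one-value-at-a-time pattern, where each thread issues one 4-byte load and one 4-byte store. The names copy_scalar and run_copy_scalar are hypothetical, chosen for illustration, and the helper assumes n is greater than zero.

```cuda
#include <cuda_runtime.h>

// Hypothetical baseline kernel: each thread moves exactly one float,
// so every load instruction requests 4 bytes per thread.
__global__ void copy_scalar(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Hypothetical host helper: copies n floats (n > 0) through the GPU.
// Returns true if the allocations and the final copy back succeeded.
bool run_copy_scalar(const float* h_in, float* h_out, int n) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // one thread per float
    copy_scalar<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Calling run_copy_scalar(h_in, h_out, n) with two host arrays of n floats should leave h_out equal to h_in. Copying n floats this way takes n load instructions across the grid, which is the cost a vectorized load reduces.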

Meet float4

CUDA ships built-in vector types such as float2 and float4. A float4 packs four floats into one 16-byte value, aligned to 16 bytes, that a thread can load or store with a single instruction.

float4 v = make_float4(1, 2, 3, 4);
float first = v.x; // also .y .z .w
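
To connect float4 to an actual memory access, here is a minimal sketch of a copy kernel in which each thread moves one float4 instead of one float. The names copy_float4 and run_copy_float4 are hypothetical. The sketch assumes both pointers are 16-byte aligned (pointers returned by cudaMalloc are) and handles element counts that are not a multiple of four with a scalar tail.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread moves one float4 (16 bytes).
// Assumes in and out are 16-byte aligned.
__global__ void copy_float4(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;          // number of complete float4 groups
    int rest = n - n4 * 4;   // 0 to 3 leftover floats

    if (i < n4) {
        // View the float arrays as float4 arrays for 16-byte accesses.
        const float4* in4 = reinterpret_cast<const float4*>(in);
        float4* out4 = reinterpret_cast<float4*>(out);
        out4[i] = in4[i];
    }

    // The first few threads copy the leftover elements one by one.
    if (i < rest) {
        out[n4 * 4 + i] = in[n4 * 4 + i];
    }
}

// Hypothetical host helper: copies n floats (n > 0) through the GPU.
// Returns true if the allocations and the final copy back succeeded.
bool run_copy_float4(const float* h_in, float* h_out, int n) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int n4 = n / 4;
    int blocks = (n4 + threads - 1) / threads;  // one thread per float4
    if (blocks == 0) blocks = 1;                // still need threads for the tail
    copy_float4<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

The grid needs only about a quarter as many threads as a one-float-per-thread copy, because each thread covers four elements. The alignment assumption matters: passing a pointer offset by a single float (for example d_in + 1) would make the float4 accesses misaligned, which is an error on the device.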
