Combining SIMD with Loops
Vectorize the kernel's core.
From Scalar to SIMD
To accelerate a kernel, swap the one-at-a-time loop for one that handles a whole pack of values per step with SIMD.
for i in range(n):
out[i] = a[i] + b[i]Pick a Width
Choose how many elements fit in one vector. The width is the number of lanes each step processes together.
alias width = 4All lessons in this course
- Anatomy of a Compute Kernel
- Combining SIMD with Loops
- Reducing Memory Traffic
- Tiling for Cache Locality