CUDA Academy · Lesson

The Classic Index Formula

blockIdx.x * blockDim.x + threadIdx.x

One Thread, One Element

In the simplest CUDA kernels, each thread handles exactly one element of the data. To know which element is its own, every thread needs a unique global index.

Inside a block, threadIdx.x only runs from 0 to blockDim.x - 1. Every block in the grid reuses that same range, so threadIdx.x on its own cannot serve as a thread's global index.
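The CUDA built-ins are only available inside a kernel. To make the arithmetic easy to check on a plain CPU, here is a minimal sketch in ordinary C++ that emulates a one-dimensional launch. The nested loops stand in for the blocks and threads the GPU would run in parallel. The function names (globalIndex, claimCounts, eachElementClaimedOnce, claimCountsThreadIdxOnly) are hypothetical helpers for illustration, not part of the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The classic CUDA index formula, written as an ordinary host function.
// In a real kernel the three inputs would come from the built-ins
// blockIdx.x, blockDim.x and threadIdx.x.
unsigned globalIndex(unsigned blockIdxX, unsigned blockDimX, unsigned threadIdxX) {
    return blockIdxX * blockDimX + threadIdxX;
}

// Emulates a 1-D launch of gridDimX blocks with blockDimX threads each by
// looping over every (block, thread) pair the GPU would run in parallel.
// Returns, for each global index, how many threads computed that index.
std::vector<unsigned> claimCounts(unsigned gridDimX, unsigned blockDimX) {
    std::vector<unsigned> counts(static_cast<std::size_t>(gridDimX) * blockDimX, 0);
    for (unsigned b = 0; b < gridDimX; ++b) {
        for (unsigned t = 0; t < blockDimX; ++t) {
            unsigned i = globalIndex(b, blockDimX, t);
            assert(i < counts.size());  // the formula never exceeds gridDimX * blockDimX - 1
            ++counts[i];
        }
    }
    return counts;
}

// True when every element is claimed by exactly one thread.
bool eachElementClaimedOnce(unsigned gridDimX, unsigned blockDimX) {
    for (unsigned c : claimCounts(gridDimX, blockDimX)) {
        if (c != 1) return false;
    }
    return true;
}

// The naive alternative: using threadIdx.x alone as the index.
// Every block produces the same indices 0..blockDimX-1, so they collide.
std::vector<unsigned> claimCountsThreadIdxOnly(unsigned gridDimX, unsigned blockDimX) {
    std::vector<unsigned> counts(blockDimX, 0);
    for (unsigned b = 0; b < gridDimX; ++b) {
        for (unsigned t = 0; t < blockDimX; ++t) {
            ++counts[t];  // block number b is ignored, which is the bug
        }
    }
    return counts;
}
```

For a launch of 3 blocks of 4 threads, globalIndex(2, 4, 3) gives 11, the last of the 12 elements, and eachElementClaimedOnce(3, 4) is true. By contrast, claimCountsThreadIdxOnly(3, 4) reports each of the indices 0 to 3 claimed three times, which is exactly the collision described above.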