
CUDA Academy

C++ · Enterprise · Desktop · AI


🤖 AI-Powered · 📚 30 courses · 👥 100,000+ learners · ⭐ 4.9 rating

Course Overview

CUDA: GPU Programming with C++

Program the GPU with CUDA C++. From threads and memory hierarchy to kernels, optimization, and real parallel computing. This track covers 30 progressive mini-courses from absolute beginner (A1) through advanced (B2), with short focused lessons and quick quizzes to lock in each concept.

What You Will Learn

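You start with GPU fundamentals (host and device code, kernels, thread indexing, and device memory), move on to shared memory, tiling, reductions, atomics, streams, and unified memory, and finish with occupancy tuning, profiling, warp-level primitives, multi-GPU programming, tensor cores, and debugging. Each course builds on the last. Every lesson is practical and bite-sized, with a 24/7 AI tutor available when you need help.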

How It Works

Each course is broken into four focused, bite-sized lessons, for 120 lessons across the full track. Complete a few lessons a day and you can finish the track in weeks, not months.

Start Learning →

How You'll Learn

🎯 Interactive Lessons
Hands-on coding exercises with real-time feedback

🤖 AI Tutor
Get instant help from our AI when you're stuck

💻 Built-in Editor
Write and run code directly in your browser

🏆 Certificate
Earn a certificate when you complete the course

Curriculum

30 Courses

Every course in the CUDA Academy learning path.

01

Why GPUs Crush Parallel Work

A1 · 4 lessons

Explain what a GPU is and why thousands of small cores beat a few fast CPU cores on parallel problems.

  • CPU vs GPU: Latency vs Throughput
  • SIMT: The Same Instruction, Many Threads
  • What CUDA Actually Is
  • +1 more
02

Set Up Your CUDA Toolchain

A1 · 4 lessons

Install the CUDA Toolkit, verify your driver, and compile your first program with nvcc.

  • Driver, Runtime, and Toolkit Versions
  • Reading nvidia-smi Like a Pro
  • Compiling with nvcc
  • +1 more
03

Host vs Device: Two Worlds

A1 · 4 lessons · PRO

Distinguish CPU host code from GPU device code and know which memory each side can touch.

  • The __global__ Function Qualifier
  • __device__ and __host__ Functions
  • Separate Address Spaces
  • +1 more
04

Write Your First Kernel

A1 · 4 lessons · PRO

Define, launch, and run a GPU kernel that prints from inside the device.

  • Anatomy of a Kernel
  • The Triple-Angle-Bracket Launch
  • printf Inside a Kernel
  • +1 more
05

Threads, Blocks, and Grids

A1 · 4 lessons · PRO

Map a problem onto CUDA's hierarchy of threads, blocks, and the grid.

  • The Thread Hierarchy
  • threadIdx, blockIdx, blockDim
  • Why Blocks Exist
  • +1 more
06

Global Thread Indexing

A2 · 4 lessons · PRO

Compute a unique global index so each thread handles exactly one data element.

  • The Classic Index Formula
  • Guarding Against Out-of-Range
  • Rounding Up the Block Count
  • +1 more
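
As a rough illustration of the three ideas listed for this course, the sketch below emulates a 1D launch on the CPU in plain C++. In a real kernel the index would be computed from `blockIdx.x`, `blockDim.x`, and `threadIdx.x`; the names `blocks_for`, `global_index`, and `emulate_launch` are hypothetical helpers chosen for this example, not CUDA API calls.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover n elements (integer ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// 1D global index; mirrors blockIdx.x * blockDim.x + threadIdx.x in a kernel.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Emulates every thread of a launch and counts how often each element is touched.
std::vector<int> emulate_launch(int n, int threads_per_block) {
    std::vector<int> touched(n, 0);
    const int blocks = blocks_for(n, threads_per_block);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            const int i = global_index(b, threads_per_block, t);
            if (i < n) {  // guard: threads past the end of the data do nothing
                ++touched[i];
            }
        }
    }
    return touched;
}
```

With 1,000 elements and 256 threads per block, `blocks_for` rounds up to four blocks (1,024 threads), and the guard keeps the last 24 threads from writing out of range, so every element is handled exactly once.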
07

Manage Device Memory

A2 · 4 lessons · PRO

Allocate and free GPU memory and understand the device heap.

  • cudaMalloc and cudaFree
  • Pointers to GPU Memory
  • cudaMemset for Initialization
  • +1 more
08

Move Data with cudaMemcpy

A2 · 4 lessons · PRO

Transfer arrays between host and device in both directions reliably.

  • Host-to-Device Transfers
  • Device-to-Host Transfers
  • The Copy Direction Enum
  • +1 more
09

Build Vector Addition End to End

A2 · 4 lessons · PRO

Write a complete C = A + B program on the GPU from allocation to verification.

  • The Vector Add Kernel
  • Wiring Up the Host Side
  • Verifying the Result on the CPU
  • +1 more
10

Catch CUDA Errors Properly

A2 · 4 lessons · PRO

Detect and report CUDA failures from API calls and kernel launches.

  • Return Codes vs Async Errors
  • cudaGetLastError After Launch
  • A Reusable CUDA_CHECK Macro
  • +1 more
11

Map the CUDA Memory Hierarchy

B1 · 4 lessons · PRO

Choose the right memory space among global, shared, constant, local, and registers.

  • Registers and Local Memory
  • Global Memory Tradeoffs
  • Constant Memory and Its Cache
  • +1 more
12

Coalesce Global Memory Access

B1 · 4 lessons · PRO

Lay out access patterns so a warp reads contiguous memory in as few transactions as possible.

  • What a Memory Transaction Is
  • Coalesced vs Strided Reads
  • Structure of Arrays vs Array of Structs
  • +1 more
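
To make the layout question concrete, here is a small host-only C++ sketch that compares how far apart neighboring threads' reads land in each layout. The `ParticleAoS` and `ParticlesSoA` types and the helper functions are hypothetical, and the sketch assumes each thread `i` reads only the `x` field of element `i`.

```cpp
#include <cassert>
#include <cstddef>

// Array of Structs: the four fields of one particle are adjacent in memory.
struct ParticleAoS {
    float x, y, z, w;
};

// Structure of Arrays: each field gets its own contiguous array.
struct ParticlesSoA {
    float* x;
    float* y;
    float* z;
    float* w;
};

// Byte distance between the x values read by neighboring threads i and i + 1.
std::size_t aos_stride_bytes() { return sizeof(ParticleAoS); }
std::size_t soa_stride_bytes() { return sizeof(float); }

// Bytes spanned when `lanes` threads each read one float, `stride` bytes apart.
std::size_t span_bytes(std::size_t lanes, std::size_t stride) {
    assert(lanes > 0 && stride >= sizeof(float));
    return (lanes - 1) * stride + sizeof(float);
}
```

Assuming a 4-byte `float`, the 32 threads of a warp reading `x` cover 128 contiguous bytes in the SoA layout but are spread over 500 bytes in the AoS layout. How many memory transactions that turns into depends on the GPU architecture, but the strided layout always moves more unused data.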
13

Master On-Chip Shared Memory

B1 · 4 lessons · PRO

Use __shared__ memory and __syncthreads to share data within a block.

  • Declaring __shared__ Arrays
  • Synchronizing with __syncthreads
  • Avoiding Bank Conflicts
  • +1 more
14

Tile Algorithms in Shared Memory

B1 · 4 lessons · PRO

Apply the load-sync-compute tiling pattern to reuse data and cut global traffic.

  • The Data Reuse Problem
  • The Load-Sync-Compute Pattern
  • Stencil and Sliding Windows
  • +1 more
15

Tiled Matrix Multiplication

B1 · 4 lessons · PRO

Build a shared-memory tiled GEMM that vastly outperforms the naive version.

  • The Naive Matmul Kernel
  • Tiling the Inner Product
  • Looping Over Tile Phases
  • +1 more
16

Parallel Reduction Done Right

B1 · 4 lessons · PRO

Sum an array on the GPU efficiently using a tree-based reduction.

  • The Reduction Tree Idea
  • Killing Warp Divergence
  • Sequential Addressing
  • +1 more
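
The tree idea with sequential addressing can be sketched on the CPU in plain C++. This is an emulation under stated assumptions, not the course's kernel: the function name `tree_reduce` is hypothetical, the input length is assumed to be a power of two, and each pass of the outer loop stands in for one step that a block's threads would run in parallel, separated by a barrier such as `__syncthreads()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree reduction with sequential addressing, emulated on the CPU.
// Assumes data.size() is a power of two (a simplification for this sketch).
float tree_reduce(std::vector<float> data) {
    assert(!data.empty() && (data.size() & (data.size() - 1)) == 0);
    // The stride halves each step: n/2, n/4, ..., 1.
    for (std::size_t stride = data.size() / 2; stride > 0; stride /= 2) {
        // "Threads" 0..stride-1 are active; each adds its partner's value.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            data[tid] += data[tid + stride];
        }
    }
    return data[0];  // the total ends up in element 0
}
```

Because the active thread IDs in each step form the contiguous range `0..stride-1`, whole warps tend to be either fully active or fully idle, which is the property that limits divergence in the GPU version.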
17

Atomics for Safe Concurrency

B1 · 4 lessons · PRO

Use atomic operations to update shared results without race conditions.

  • Race Conditions on the GPU
  • atomicAdd and Friends
  • Building a Histogram
  • +1 more
18

Overlap Work with CUDA Streams

B1 · 4 lessons · PRO

Run kernels and transfers concurrently using non-default streams.

  • The Default Stream Trap
  • Creating and Using Streams
  • Events for Timing and Sync
  • +1 more
19

Asynchronous Transfers & Pinned Memory

B1 · 4 lessons · PRO

Speed up transfers with pinned host memory and async copies in streams.

  • Why Pageable Memory Is Slow
  • Pinned Memory with cudaMallocHost
  • cudaMemcpyAsync in a Stream
  • +1 more
20

Simplify with Unified Memory

B1 · 4 lessons · PRO

Use cudaMallocManaged for one pointer shared by host and device.

  • One Pointer, Both Sides
  • On-Demand Page Migration
  • Prefetching with cudaMemPrefetchAsync
  • +1 more
21

Tune Occupancy & Launch Config

B2 · 4 lessons · PRO

Pick block sizes and limit per-thread register and shared memory use to maximize SM occupancy.

  • What Occupancy Really Means
  • Registers and Shared Memory Limits
  • The Occupancy Calculator API
  • +1 more
22

Profile Kernels with Nsight

B2 · 4 lessons · PRO

Find bottlenecks using Nsight Systems and Nsight Compute metrics.

  • Timeline View in Nsight Systems
  • Kernel Metrics in Nsight Compute
  • Compute-Bound vs Memory-Bound
  • +1 more
23

Warp-Level Primitives & Shuffles

B2 · 4 lessons · PRO

Exchange data inside a warp with shuffle and vote intrinsics, without going through shared memory.

  • Warps, Lanes, and Masks
  • __shfl_down_sync for Reductions
  • Ballot and Vote Functions
  • +1 more
24

Advanced Kernel Optimization

B2 · 4 lessons · PRO

Apply ILP, loop unrolling, and vectorized loads to squeeze peak performance.

  • Instruction-Level Parallelism
  • Loop Unrolling with #pragma unroll
  • Vectorized Loads with float4
  • +1 more
25

Scale Across Multiple GPUs

B2 · 4 lessons · PRO

Split work over several GPUs and move data directly between them with P2P.

  • Enumerating and Selecting Devices
  • Partitioning Work Across GPUs
  • Peer-to-Peer Memory Access
  • +1 more
26

Accelerate with cuBLAS & Thrust

B2 · 4 lessons · PRO

Call NVIDIA's tuned libraries for GEMM, sorting, and parallel algorithms.

  • cuBLAS GEMM Done Right
  • Thrust Vectors and Transforms
  • Thrust Reduce, Scan, and Sort
  • +1 more
27

Dynamic Parallelism & CUDA Graphs

B2 · 4 lessons · PRO

Launch kernels from the device and capture work into reusable CUDA graphs.

  • Launching Kernels from a Kernel
  • When Dynamic Parallelism Pays
  • Capturing Work into a Graph
  • +1 more
28

Program Tensor Cores

B2 · 4 lessons · PRO

Use mixed-precision tensor cores via the WMMA API for fast matrix math.

  • What Tensor Cores Compute
  • Mixed Precision: FP16, BF16, TF32
  • The WMMA Fragment API
  • +1 more
29

Debug with cuda-gdb & Sanitizer

B2 · 4 lessons · PRO

Hunt down crashes, races, and memory errors with CUDA's debugging tools.

  • Stepping Kernels in cuda-gdb
  • Finding Leaks with memcheck
  • Hunting Races with racecheck
  • +1 more
30

Capstone: A GPU Image Pipeline

B2 · 4 lessons · PRO

Combine kernels, streams, and profiling into a real GPU-accelerated image processor.

  • Designing the Processing Pipeline
  • Fusing Filters into One Kernel
  • Streaming Tiles for Big Images
  • +1 more
FAQ

Frequently Asked Questions

Is the CUDA Academy course free?

Yes. You can start the CUDA Academy course for free and complete its interactive lessons at no cost. An optional PRO subscription unlocks advanced AI tools and a shareable certificate.

Do I need prior experience to learn C++?

No. The course begins with the fundamentals and gradually moves to more advanced topics, so you can start even with no prior C++ experience.

How will I learn C++ on CoddyKit?

You learn by doing. Short interactive lessons pair a clear explanation with a hands-on coding exercise that runs in real time, and a 24/7 AI tutor gives personalized help whenever you get stuck.

Do I get a certificate for completing CUDA Academy?

Yes. PRO learners can take an exam and earn a shareable certificate of completion with a verifiable code for the CUDA Academy course.

Can I learn C++ on my phone?

Yes. CoddyKit is available on the web and as native iOS and Android apps, so you can learn C++ on any device, and your progress syncs across them.

Start CUDA Academy Now

Join 100,000+ learners mastering GPU programming with AI-powered lessons.

Get Started Free →

Browse All Courses