CUDA Academy

Program the GPU with CUDA C++. From threads and memory hierarchy to kernels, optimization, and real parallel computing.

🤖 AI-Powered · 📚 30 courses · 👥 100,000+ learners · ⭐ 4.9 rating

Course Overview

CUDA: GPU Programming with C++

This track covers 30 progressive mini-courses, from absolute beginner (A1) through advanced (B2), with short, focused lessons and quick quizzes to reinforce each concept.

What You Will Learn

You start with the fundamentals and build up through intermediate and advanced topics, each course building on the last. Every lesson is practical and bite-sized, with a 24/7 AI tutor available when you need help.

How It Works

Each course is broken into four focused, bite-sized lessons. Complete a few lessons a day and you will master the full track in weeks, not months.

Tools

Explore Course Tools

Supercharge your learning with AI-powered tools and features

Ask Super Question (PRO)

Get expert-level answers

📝

Create Lesson (PRO)

AI-powered lesson builder

🏆

Start Certification Exam (PRO)

Earn your certificate

🎯

Start Challenge

Test your skills

How You'll Learn

🎯

Interactive Lessons

Hands-on coding exercises with real-time feedback

🤖

AI Tutor

Get instant help from our AI when you're stuck

💻

Built-in Editor

Write and run code directly in your browser

🏆

Certificate

Earn a certificate when you complete the course

Curriculum

30 Courses

Every course in the CUDA Academy learning path.

Why GPUs Crush Parallel Work

A1 · 4 lessons

Explain what a GPU is and why thousands of small cores beat a few fast CPU cores on parallel problems.

CPU vs GPU: Latency vs Throughput
SIMT: The Same Instruction, Many Threads
What CUDA Actually Is
+1 more

Set Up Your CUDA Toolchain

A1 · 4 lessons

Install the CUDA Toolkit, verify your driver, and compile your first program with nvcc.

Driver, Runtime, and Toolkit Versions
Reading nvidia-smi Like a Pro
Compiling with nvcc
+1 more

Host vs Device: Two Worlds

A1 · 4 lessons · PRO

Distinguish CPU host code from GPU device code and know which memory each side can touch.

The __global__ Function Qualifier
__device__ and __host__ Functions
Separate Address Spaces
+1 more

Write Your First Kernel

A1 · 4 lessons · PRO

Define, launch, and run a GPU kernel that prints from inside the device.

Anatomy of a Kernel
The Triple-Angle-Bracket Launch
printf Inside a Kernel
+1 more

Threads, Blocks, and Grids

A1 · 4 lessons · PRO

Map a problem onto CUDA's hierarchy of threads, blocks, and the grid.

The Thread Hierarchy
threadIdx, blockIdx, blockDim
Why Blocks Exist
+1 more

Global Thread Indexing

A2 · 4 lessons · PRO

Compute a unique global index so each thread handles exactly one data element.

The Classic Index Formula
Guarding Against Out-of-Range
Rounding Up the Block Count
+1 more

Manage Device Memory

A2 · 4 lessons · PRO

Allocate and free GPU memory and understand the device heap.

cudaMalloc and cudaFree
Pointers to GPU Memory
cudaMemset for Initialization
+1 more

Move Data with cudaMemcpy

A2 · 4 lessons · PRO

Transfer arrays between host and device in both directions reliably.

Host-to-Device Transfers
Device-to-Host Transfers
The Copy Direction Enum
+1 more

Build Vector Addition End to End

A2 · 4 lessons · PRO

Write a complete C = A + B program on the GPU from allocation to verification.

The Vector Add Kernel
Wiring Up the Host Side
Verifying the Result on the CPU
+1 more

Catch CUDA Errors Properly

A2 · 4 lessons · PRO

Detect and report CUDA failures from API calls and kernel launches.

Return Codes vs Async Errors
cudaGetLastError After Launch
A Reusable CUDA_CHECK Macro
+1 more

Map the CUDA Memory Hierarchy

B1 · 4 lessons · PRO

Choose the right memory space among global, shared, constant, local, and registers.

Registers and Local Memory
Global Memory Tradeoffs
Constant Memory and Its Cache
+1 more

Coalesce Global Memory Access

B1 · 4 lessons · PRO

Lay out access patterns so a warp reads contiguous memory in as few transactions as possible.

What a Memory Transaction Is
Coalesced vs Strided Reads
Structure of Arrays vs Array of Structs
+1 more

Master On-Chip Shared Memory

B1 · 4 lessons · PRO

Use __shared__ memory and __syncthreads to share data within a block.

Declaring __shared__ Arrays
Synchronizing with __syncthreads
Avoiding Bank Conflicts
+1 more

Tile Algorithms in Shared Memory

B1 · 4 lessons · PRO

Apply the load-sync-compute tiling pattern to reuse data and cut global traffic.

The Data Reuse Problem
The Load-Sync-Compute Pattern
Stencil and Sliding Windows
+1 more

Tiled Matrix Multiplication

B1 · 4 lessons · PRO

Build a shared-memory tiled GEMM that vastly outperforms the naive version.

The Naive Matmul Kernel
Tiling the Inner Product
Looping Over Tile Phases
+1 more

Parallel Reduction Done Right

B1 · 4 lessons · PRO

Sum an array on the GPU efficiently using a tree-based reduction.

The Reduction Tree Idea
Killing Warp Divergence
Sequential Addressing
+1 more

Atomics for Safe Concurrency

B1 · 4 lessons · PRO

Use atomic operations to update shared results without race conditions.

Race Conditions on the GPU
atomicAdd and Friends
Building a Histogram
+1 more

Overlap Work with CUDA Streams

B1 · 4 lessons · PRO

Run kernels and transfers concurrently using non-default streams.

The Default Stream Trap
Creating and Using Streams
Events for Timing and Sync
+1 more

Asynchronous Transfers & Pinned Memory

B1 · 4 lessons · PRO

Speed up transfers with pinned host memory and async copies in streams.

Why Pageable Memory Is Slow
Pinned Memory with cudaMallocHost
cudaMemcpyAsync in a Stream
+1 more

Simplify with Unified Memory

B1 · 4 lessons · PRO

Use cudaMallocManaged for one pointer shared by host and device.

One Pointer, Both Sides
On-Demand Page Migration
Prefetching with cudaMemPrefetchAsync
+1 more

Tune Occupancy & Launch Config

B2 · 4 lessons · PRO

Pick block sizes and limit resources to maximize SM occupancy.

What Occupancy Really Means
Registers and Shared Memory Limits
The Occupancy Calculator API
+1 more

Profile Kernels with Nsight

B2 · 4 lessons · PRO

Find bottlenecks using Nsight Systems and Nsight Compute metrics.

Timeline View in Nsight Systems
Kernel Metrics in Nsight Compute
Compute-Bound vs Memory-Bound
+1 more

Warp-Level Primitives & Shuffles

B2 · 4 lessons · PRO

Exchange data inside a warp with shuffle and vote intrinsics, no shared memory.

Warps, Lanes, and Masks
__shfl_down_sync for Reductions
Ballot and Vote Functions
+1 more

Advanced Kernel Optimization

B2 · 4 lessons · PRO

Apply ILP, loop unrolling, and vectorized loads to squeeze peak performance.

Instruction-Level Parallelism
Loop Unrolling with #pragma unroll
Vectorized Loads with float4
+1 more

Scale Across Multiple GPUs

B2 · 4 lessons · PRO

Split work over several GPUs and move data directly between them with P2P.

Enumerating and Selecting Devices
Partitioning Work Across GPUs
Peer-to-Peer Memory Access
+1 more

Accelerate with cuBLAS & Thrust

B2 · 4 lessons · PRO

Call NVIDIA's tuned libraries for GEMM, sorting, and parallel algorithms.

cuBLAS GEMM Done Right
Thrust Vectors and Transforms
Thrust Reduce, Scan, and Sort
+1 more

Dynamic Parallelism & CUDA Graphs

B2 · 4 lessons · PRO

Launch kernels from the device and capture work into reusable CUDA graphs.

Launching Kernels from a Kernel
When Dynamic Parallelism Pays
Capturing Work into a Graph
+1 more

Program Tensor Cores

B2 · 4 lessons · PRO

Use mixed-precision tensor cores via the WMMA API for fast matrix math.

What Tensor Cores Compute
Mixed Precision: FP16, BF16, TF32
The WMMA Fragment API
+1 more

Debug with cuda-gdb & Sanitizer

B2 · 4 lessons · PRO

Hunt down crashes, races, and memory errors with CUDA's debugging tools.

Stepping Kernels in cuda-gdb
Finding Leaks with memcheck
Hunting Races with racecheck
+1 more

Capstone: A GPU Image Pipeline

B2 · 4 lessons · PRO

Combine kernels, streams, and profiling into a real GPU-accelerated image processor.

Designing the Processing Pipeline
Fusing Filters into One Kernel
Streaming Tiles for Big Images
+1 more

FAQ

Frequently Asked Questions

Is the CUDA Academy course free?

Yes. You can start the CUDA Academy course for free and complete its interactive lessons at no cost. An optional PRO subscription unlocks advanced AI tools and a shareable certificate.

Do I need prior experience to learn C++?

No. The course begins with the fundamentals and gradually moves to more advanced topics, so you can start even with no prior C++ experience.

How will I learn C++ on CoddyKit?

You learn by doing. Short interactive lessons pair a clear explanation with a hands-on coding exercise that runs in real time, and a 24/7 AI tutor gives personalized help whenever you get stuck.

Do I get a certificate for completing CUDA Academy?

Yes. PRO learners can take an exam and earn a shareable certificate of completion with a verifiable code for the CUDA Academy course.

Can I learn C++ on my phone?

Yes. CoddyKit is available on the web and as native iOS and Android apps, so you can learn C++ on any device, and your progress syncs across them.

Start CUDA Academy Now

Join thousands of learners mastering programming with AI-powered lessons.