This repository contains a self-contained, incremental tour of NVIDIA CUDA and GPU programming — from a refresher on C/C++ all the way to hand-tuned matrix-multiplication kernels. It grew out of personal study notes that were kept in a handful of Org-mode files and a collection of small, focused code snippets. Everything is now consolidated here so that the notes render nicely on GitHub and the code can be compiled or copy-pasted straight into your projects.
The material is not a full course and it does not try to compete with the excellent official CUDA documentation. Instead, think of it as a curated checklist of the things you usually need when you start writing your own kernels:
- where GPU hardware shines compared to CPUs,
- the basic programming model (thread / warp / block / grid; a minimal kernel is sketched right after this list),
- how to compile and profile simple kernels,
- the high-level CUDA libraries you should know (cuBLAS, cuDNN, NCCL, …),
- and finally a deep dive into fast GEMM / SGEMM implementations.
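If the thread / warp / block / grid vocabulary is new to you, the following sketch shows how the hierarchy maps onto data. It is a generic vector-add kernel written for this README, not code from the repo; the names are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread computes one element. blockIdx, blockDim and threadIdx
// combine into a unique global index for each thread.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; cudaMalloc + cudaMemcpy work too.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```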
| Section | What you will find | Source file |
|---|---|---|
| 1. Introduction | A high-level overview of today’s deep-learning software stack and where CUDA fits in | 001-intro.org |
| 2. Setup | One-liner instructions for getting a CUDA-capable environment up and running (spoiler: use the official Docker images) | 002-setup.org |
| 3. C/C++ Refresher | How to compile C, C++ and CUDA, plus a primer on the C pre-processor | 003-cpp-overview.org |
| 4. GPU Basics | CPU vs GPU architecture, a brief hardware history and the vocabulary every CUDA programmer must know | 004-intro-gpu.org |
| 5. Your First Kernel | A worked example that walks from deviceQuery to a hand-written “Hello GPU” kernel | 005-first-kernel.org |
| 6. CUDA Libraries | cuBLAS (and its Lt/Xt/Dx variants), cuDNN, NCCL, MIG and more, including handy error-checking macros (sketched after this table) | 006-cuda-api.org |
| 7. Faster MatMul | From a naïve O(N³) kernel to register / shared-memory / tensor-core madness, heavily inspired by the fantastic write-up from Siboehm (Anthropic) | 007-faster-matmul.org |
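As a taste of the error-checking macros covered in section 6, here is a minimal sketch. The macro name `CUDA_CHECK` is illustrative; the version in the repo may differ:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call: errors are returned, not thrown,
// and are easy to miss without a check like this.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    float* d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```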
The original Org files are kept verbatim for people who prefer Org-mode. You can open them with Emacs or view them on GitHub directly.
The repository pairs each note with one or more minimal, buildable examples:
```
cpp-overview/           # -> C / C++ examples that also run under nvcc
writing-first-kernels/  # -> Kernels that accompany section 5
cuda-api/               # -> cuBLAS / cuDNN samples and Makefiles
faster-matmul/          # -> Every optimisation step discussed in section 7
```
Most directories have a Makefile or a one-liner comment that shows how to compile the code, e.g.

```bash
# compile a plain CUDA file
nvcc -arch=sm_90 -o main main.cu

# run nvcc in “pass-through” mode so that host code is built by g++
nvcc -x cu -Xcompiler="-O3 -Wall" -arch=sm_90 *.cu -o example
```

Tip: Always pass `--generate-line-info` when you plan to profile with Nsight Compute (`ncu`). Source-line mapping in the GUI is priceless.
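For example (file names are placeholders), a line-info build and a profiling run look like:

```bash
# build with source-line info, then profile with Nsight Compute
nvcc --generate-line-info -arch=sm_90 -o main main.cu
ncu -o profile ./main    # writes profile.ncu-rep, open it in the ncu GUI
```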
Requirements:

- CUDA 12 or newer (all samples were tested with CUDA 12.3).
- A GPU with compute capability 8.0 or later will run everything, but most code will happily compile for earlier SM versions as well (drop `-arch=sm_90`, or embed several targets as sketched below).
- CMake is not required; plain `nvcc`/`make` is used throughout to keep the focus on the kernels.
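If you want a single binary that covers several GPU generations, nvcc can embed code for each target via `-gencode` (a sketch; pick the architectures you actually own):

```bash
# fatbinary with native code for Ampere (sm_80) and Hopper (sm_90),
# plus compute_90 PTX so future GPUs can JIT-compile it
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o main main.cu
```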
If you don’t want to install the toolkit locally, spin up the official CUDA Docker image and mount this repo:
```bash
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    nvcr.io/nvidia/cuda:12.3.2-devel-ubuntu22.04
```
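Inside the container the repository is mounted at /workspace, so a session might look like this (directory name taken from the layout above; adjust to taste):

```bash
# inside the container
cd /workspace/writing-first-kernels
make    # or use the one-liner comment at the top of each .cu file
```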
Further reading:

- NVIDIA CUDA Toolkit documentation: https://docs.nvidia.com/cuda/
- Siboehm’s blog post on writing a fast CUDA matmul kernel (the inspiration for section 7): https://siboehm.com/articles/22/CUDA-MMM
Images in the assets/ folder are re-used from the CUDA docs and public blog
posts for educational purposes. All other code and text in this repository is
released under the MIT License (see LICENSE).
This is a learning project first and foremost. If you spot an inaccuracy or a spelling mistake, or have a performance tweak to share, you are very welcome to open a pull request or an issue.
Happy GPU hacking! 🚀