This repository contains a self-contained, incremental tour of NVIDIA CUDA and GPU programming — from a refresher on C/C++ all the way to hand-tuned matrix-multiplication kernels. It grew out of personal study notes that were kept in a handful of Org-mode files and a collection of small, focused code snippets. Everything is now consolidated here so that the notes render nicely on GitHub and the code can be compiled or copy-pasted straight into your projects.
The material is not a full course and it does not try to compete with the excellent official CUDA documentation. Instead, think of it as a curated checklist of the things you usually need when you start writing your own kernels:
- where GPU hardware shines compared to CPUs,
- the basic programming model (thread / warp / block / grid; a minimal kernel is sketched right after this list),
- how to compile and profile simple kernels,
- the high-level CUDA libraries you should know (cuBLAS, cuDNN, NCCL, …),
- and finally a deep dive into fast GEMM / SGEMM implementations.
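If the thread / warp / block / grid vocabulary is new to you, the following sketch shows how the hierarchy maps onto data. It is a generic vector-add kernel written for this README, not code from the repo; the names are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread computes one element. blockIdx, blockDim and threadIdx
// combine into a unique global index for each thread.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; cudaMalloc + cudaMemcpy work too.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```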
| Section | What you will find | Source file |
|---|---|---|
| 1. Introduction | A high-level overview of today’s deep-learning software stack and where CUDA fits in | 001-intro.org |
| 2. Setup | One-liner instructions for getting a CUDA-capable environment up and running (spoiler: use the official Docker images) | 002-setup.org |
| 3. C/C++ Refresher | How to compile C, C++ and CUDA, plus a primer on the C pre-processor | 003-cpp-overview.org |
| 4. GPU Basics | CPU vs GPU architecture, a brief hardware history and the vocabulary every CUDA programmer must know | 004-intro-gpu.org |
| 5. Your First Kernel | A worked example that walks from deviceQuery to a hand-written “Hello GPU” kernel | 005-first-kernel.org |
| 6. CUDA Libraries | cuBLAS (and its Lt/Xt/Dx variants), cuDNN, NCCL, MIG and more, including handy error-checking macros (sketched after this table) | 006-cuda-api.org |
| 7. Faster MatMul | From a naïve O(N³) kernel to register / shared-memory / tensor-core madness, heavily inspired by the fantastic write-up from Siboehm (Anthropic) | 007-faster-matmul.org |
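As a taste of the error-checking macros covered in section 6, here is a minimal sketch. The macro name `CUDA_CHECK` is illustrative; the version in the repo may differ:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call: errors are returned, not thrown,
// and are easy to miss without a check like this.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    float* d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```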
The original Org files are kept verbatim for people who prefer Org-mode. You can open them with Emacs or view them on GitHub directly.
The repository pairs each note with one or more minimal, buildable examples:
```
cpp-overview/           # -> C / C++ examples that also run under nvcc
writing-first-kernels/  # -> Kernels that accompany section 5
cuda-api/               # -> cuBLAS / cuDNN samples and Makefiles
faster-matmul/          # -> Every optimisation step discussed in section 7
```
Most directories have a Makefile or a one-liner comment that shows how to compile the code, e.g.

```bash
# compile a plain CUDA file
nvcc -arch=sm_90 -o main main.cu

# run nvcc in “pass-through” mode so that host code is built by g++
nvcc -x cu -Xcompiler="-O3 -Wall" -arch=sm_90 *.cu -o example
```

Tip: Always pass `--generate-line-info` when you plan to profile with Nsight Compute (`ncu`). Source-line mapping in the GUI is priceless.
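For example (file names are placeholders), a line-info build and a profiling run look like:

```bash
# build with source-line info, then profile with Nsight Compute
nvcc --generate-line-info -arch=sm_90 -o main main.cu
ncu -o profile ./main    # writes profile.ncu-rep, open it in the ncu GUI
```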
Requirements:

- CUDA 12 or newer (all samples were tested with CUDA 12.3).
- A GPU with compute capability 8.0 or later will run everything, but most code will happily compile for earlier SM versions as well (drop `-arch=sm_90`, or embed several targets as sketched below).
- CMake is not required; plain `nvcc`/`make` is used throughout to keep the focus on the kernels.
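If you want a single binary that covers several GPU generations, nvcc can embed code for each target via `-gencode` (a sketch; pick the architectures you actually own):

```bash
# fatbinary with native code for Ampere (sm_80) and Hopper (sm_90),
# plus compute_90 PTX so future GPUs can JIT-compile it
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o main main.cu
```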
If you don’t want to install the toolkit locally, spin up the official CUDA Docker image and mount this repo:
```bash
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    nvcr.io/nvidia/cuda:12.3.2-devel-ubuntu22.04
```
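Inside the container the repository is mounted at /workspace, so a session might look like this (directory name taken from the layout above; adjust to taste):

```bash
# inside the container
cd /workspace/writing-first-kernels
make    # or use the one-liner comment at the top of each .cu file
```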
Further reading:

- NVIDIA CUDA Toolkit documentation: https://docs.nvidia.com/cuda/
- Siboehm’s blog post on writing a fast CUDA matmul kernel (the inspiration for section 7): https://siboehm.com/articles/22/CUDA-MMM
Images in the assets/ folder are re-used from the CUDA docs and public blog
posts for educational purposes. All other code and text in this repository is
released under the MIT License (see LICENSE).
This is a learning project first and foremost. If you spot an inaccuracy or a spelling mistake, or have a performance tweak to share, you are very welcome to open a pull request or an issue.
Happy GPU hacking! 🚀