Description
Opening this issue to describe and track tasks related to implementing `nvcc` support in `sccache-dist`.

tl;dr: `sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work.
Background
`sccache-dist` mode relies on a compiler's ability to compile preprocessed input. The source file is preprocessed on the client and looked up in the cache; on a cache miss, the toolchain + preprocessed file is sent to the `sccache-dist` scheduler for compilation.
This model is not supported by NVIDIA's CUDA compiler `nvcc`, because `nvcc` cannot compile preprocessed input. This is not a deficit in `nvcc`; rather, the feature fundamentally doesn't align with what `nvcc` actually does under the hood.
A CUDA C++ file contains standard C++ (CPU-executed) code and CUDA device code side by side. Internally, `nvcc` runs a number of preprocessor steps to separate this code into host and device code, each of which is compiled by a different compiler. `nvcc` can also be instructed to compile the same CUDA device code for multiple architectures and bundle the results into a "fat binary".

The preprocessor output for each device architecture is potentially different, so there is no single preprocessed input file `nvcc` could produce that could later be fed back into the compiler. (A rough analogy: imagine if `gcc` supported compiling and assembling objects for x86 + ARM that could be executed on either platform.)
Rather than attempt to trick `nvcc` into compiling preprocessed input, `sccache` can decompose and distribute `nvcc`'s constituent sub-compiler invocations.
Proposal
`sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work. `sccache` should change its `nvcc` compiler to run the underlying host and device compiler steps individually, with each step re-entering the standard hash + cache + (distributed) compilation pipeline.
`sccache` can do this by utilizing the `nvcc --dryrun` flag, which outputs the underlying calls executed by `nvcc`:
`nvcc --dryrun` output:
$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_60,code=[sm_60] -gencode=arch=compute_70,code=[compute_70,sm_70] --dryrun
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ CICC_PATH=/usr/local/cuda/bin/../nvvm/bin
#$ CICC_NEXT_PATH=/usr/local/cuda/bin/../nvvm-next/bin
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/home/ptaylor/.nvm/versions/node/v22.4.0/bin:/home/ptaylor/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ptaylor/.fzf/bin:/usr/local/cuda/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx" -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin"
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin"
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o"
This output represents a sequence of preprocessing steps that must run on the `sccache` client, followed by compilation steps on the preprocessed result that can be distributed to `sccache-dist` workers.
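To make that shape concrete, here is a minimal sketch of how a client could recover the command list programmatically. It is illustrative only (not sccache's API); it assumes the `#$ `-prefixed lines land on stderr, which may vary by CUDA version, so it also scans stdout.

```rust
// Minimal sketch, not sccache's implementation: re-run the original nvcc
// invocation with --dryrun appended and collect the "#$ "-prefixed commands.
use std::process::Command;

fn nvcc_dryrun_commands(nvcc: &str, args: &[&str]) -> std::io::Result<Vec<String>> {
    let output = Command::new(nvcc).args(args).arg("--dryrun").output()?;
    // Assumption: the "#$ " lines are printed to stderr; also scan stdout in
    // case a given CUDA version behaves differently.
    let mut text = String::from_utf8_lossy(&output.stderr).into_owned();
    text.push_str(&String::from_utf8_lossy(&output.stdout));
    Ok(text
        .lines()
        .filter_map(|line| line.strip_prefix("#$ "))
        // Drop the leading environment-variable assignments (e.g. "_CUDART_=cudart",
        // "PATH=..."); keep the actual gcc/cudafe++/cicc/ptxas/fatbinary commands.
        .filter(|cmd| cmd.split_whitespace().next().map_or(false, |t| !t.contains('=')))
        .map(str::to_string)
        .collect())
}

fn main() -> std::io::Result<()> {
    let cmds = nvcc_dryrun_commands(
        "nvcc",
        &["-c", "x.cu", "-o", "x.cu.o", "-gencode=arch=compute_60,code=[sm_60]"],
    )?;
    for cmd in &cmds {
        println!("{cmd}");
    }
    Ok(())
}
```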
Explanation
Here's a rough breakdown of the command stages above:
#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
These two lines run the host preprocessor to resolve host-side macros and inline `#include`s, then run the CUDA front-end (`cudafe++`) to separate the source into host and device source files. The `sccache` client should run both of these steps before requesting any compilation jobs.
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx" -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin"
In this phase, `nvcc`:

1. runs the host compiler to preprocess the input file (`x.cu`)
2. runs `cicc` on the output of step 1 to generate a `.ptx` file
3. runs `ptxas` on the output of step 2 to assemble the PTX into a `.cubin`

All of these steps must run sequentially. Step 1 must run on the `sccache` client, but steps 2 and 3 can be executed by `sccache-dist` workers.
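As a rough illustration of what "hashing their inputs" could mean for these device steps, here is a hypothetical sketch. It is not sccache's actual hashing code; `DefaultHasher` merely keeps the example self-contained where a real implementation would use a cryptographic hash, and the paths in `main` are placeholders.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};

// Hypothetical cache key for a single cicc or ptxas invocation: covers the
// tool binary, the argument list from the `nvcc --dryrun` plan (temporary
// paths would need to be normalized first; omitted here), and the contents
// of the preprocessed input (.ii for cicc, .ptx for ptxas).
fn device_step_cache_key(tool_path: &str, args: &[String], input_path: &str) -> std::io::Result<u64> {
    let mut h = DefaultHasher::new();
    fs::read(tool_path)?.hash(&mut h);  // tool identity
    args.hash(&mut h);                  // normalized arguments
    fs::read(input_path)?.hash(&mut h); // preprocessed input contents
    Ok(h.finish())
}

fn main() -> std::io::Result<()> {
    let key = device_step_cache_key(
        "/usr/local/cuda/nvvm/bin/cicc",
        &["-arch".into(), "compute_60".into(), "-m64".into()],
        "/tmp/x.compute_60.cpp1.ii",
    )?;
    println!("cicc step key: {key:x}");
    Ok(())
}
```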
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin"
This is similar to the prior commands, except for a different GPU architecture (`sm_70`). These commands must still run sequentially with respect to each other, but they can run in parallel with the commands from the prior stage.
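Sketched below (with hypothetical helper names, not sccache code) is how a client could run the two architecture chains concurrently while keeping each chain's preprocess → `cicc` → `ptxas` steps in order:

```rust
use std::thread;

// Hypothetical stand-in for one architecture's preprocess -> cicc -> ptxas chain.
// Each step would go through the hash + cache + (distributed) compile pipeline.
fn run_arch_chain(arch: &str) -> String {
    format!("x.{arch}.cubin")
}

fn main() {
    let arches = ["compute_60", "compute_70"];
    // The per-arch chains are independent of each other, so they can run in
    // parallel; within a chain the steps stay sequential.
    let cubins: Vec<String> = thread::scope(|s| {
        let handles: Vec<_> = arches
            .iter()
            .map(|arch| s.spawn(move || run_arch_chain(arch)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // All resulting cubins (plus any PTX kept for JIT) feed the single
    // fatbinary step that follows.
    println!("{cubins:?}");
}
```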
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o"
In this stage, the outputs from the prior two stages are assembled into a `.fatbin` via the `fatbinary` invocation, then the original preprocessed host code is combined with the `.fatbin` and assembled into the final `.o` by the host compiler. These stages must run sequentially, but can be executed by `sccache-dist` workers (the final host compiler call can use the existing `sccache-dist` logic for preprocessing + distributing the work).
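A small sketch of that ordering, using hypothetical placeholder functions rather than sccache APIs:

```rust
// Sketch only; function names are hypothetical placeholders, not sccache APIs.
fn run_fatbinary(cubins: &[String]) -> String {
    // Bundles every per-arch image into the embedded fatbin header that the
    // host code includes. Cacheable/distributable like the other device steps.
    format!("x.fatbin.c (from {} images)", cubins.len())
}

fn run_host_compile(cudafe_cpp: &str, _fatbin_header: &str) -> String {
    // Final host compile of the cudafe++-generated C++ (which pulls in the
    // fatbin header). This call can reuse sccache's existing C/C++ handling.
    format!("{cudafe_cpp}.o")
}

fn main() {
    // Both per-arch chains must have finished before this stage starts.
    let cubins = vec![
        "x.compute_60.cubin".to_string(),
        "x.compute_70.sm_70.cubin".to_string(),
    ];
    let fatbin = run_fatbinary(&cubins);
    let object = run_host_compile("x.compute_70.cudafe1.cpp", &fatbin);
    println!("{object}");
}
```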
Additional Benefits
In addition to supporting `sccache-dist` for `nvcc`, this new behavior also benefits `sccache` clients that aren't configured to use distributed compilation, because `sccache` can now avoid recompiling the underlying `.ptx` and `.cubin` device compilation artifacts that are assembled into the final `.o`.
For example, a CI job could compile code for all supported device architectures:
$ nvcc ... \
-gencode=arch=compute_60,code=[sm_60] \
-gencode=arch=compute_70,code=[sm_70] \
-gencode=arch=compute_80,code=[sm_80] \
-gencode=arch=compute_90,code=[compute_90,sm_90]
The above produces an object file with a certain hash; let's call it `hash_all`.
A developer may want to compile the same code with the same options, but for a smaller subset of architectures that match the GPU on their machine:
$ nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]
Since the above produces an object file with a different hash (`hash_subset`), today `sccache` yields a cache miss on this `.o` file and re-runs `nvcc` (which itself runs `cicc` and `ptxas`), because the arguments + input don't match the `hash_all` produced in CI.
However, with the proposed changes, while `sccache` would still yield a cache miss for the `.o` produced by the `nvcc` command, it would yield cache hits on the underlying `.ptx` and `.cubin` files produced by `cicc` and `ptxas` respectively, skipping the lion's share of the actual compilation done by `nvcc`.
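A toy illustration of that claim, using a stand-in key function (not sccache's real hashing) and assuming the preprocessed device input for the shared architecture is identical across the two invocations:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in key function for illustration only.
fn key<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Whole-object keys include the full nvcc command line, so the CI build
    // (all architectures) and the developer build (compute_90 only) differ.
    let ci_cmdline = "nvcc ... -gencode=arch=compute_60,... -gencode=arch=compute_90,...";
    let dev_cmdline = "nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]";
    assert_ne!(key(&ci_cmdline), key(&dev_cmdline));

    // Per-arch device step keys (tool + args + preprocessed input for that
    // arch) are the same in both builds, so the compute_90 .ptx/.cubin hit.
    let ci_cicc_step = ("cicc", "-arch compute_90", "<preprocessed compute_90 input>");
    let dev_cicc_step = ("cicc", "-arch compute_90", "<preprocessed compute_90 input>");
    assert_eq!(key(&ci_cicc_step), key(&dev_cicc_step));

    println!("object keys differ; shared device-step keys match");
}
```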
Tasks
Work is ongoing in this branch.
- Add `cicc` and `ptxas` as first-class compilers supported by `sccache`
- Support bundling `cicc` and `ptxas` toolchains from the client's CUDA toolkit
- Refactor the `nvcc` compiler to call `nvcc --dryrun`, and run each sub-command through `sccache` as appropriate