Description
Opening this issue to describe and track tasks related to implementing `nvcc` support in `sccache-dist`.

tl;dr: `sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work.
Background
`sccache-dist` mode relies on a compiler's ability to compile preprocessed input. The source file is preprocessed on the client and looked up in the cache; on a cache miss, the toolchain + preprocessed file is sent to the `sccache-dist` scheduler for compilation.
This model is not supported by NVIDIA's CUDA compiler `nvcc`, because `nvcc` cannot compile preprocessed input. This is not a deficit in `nvcc`; rather, the feature fundamentally doesn't align with what `nvcc` actually does under the hood.
A CUDA C++ file contains standard C++ (CPU-executed) code and CUDA device code side by side. Internally, `nvcc` runs a number of preprocessor steps to separate this code into host and device code, each of which is compiled by a different compiler. `nvcc` can also be instructed to compile the same CUDA device code for multiple architectures and bundle the results into a "fat binary".

The preprocessor output for each device architecture is potentially different, so there is no single preprocessed input file `nvcc` could produce that could later be fed back into the compiler. (A rough analogy: imagine if `gcc` supported compiling and assembling objects for x86 + ARM that could be executed on either platform.)
Rather than attempt to trick `nvcc` into compiling preprocessed input, `sccache` can decompose and distribute `nvcc`'s constituent sub-compiler invocations.
Proposal
`sccache` should add `cicc` and `ptxas` as first-class compilers, complete with support for hashing their inputs, caching their outputs, and distributing their work. `sccache` should change its `nvcc` compiler to run the underlying host and device compiler steps individually, with each step re-entering the standard hash + cache + (distributed) compilation pipeline.
`sccache` can do this by utilizing the `nvcc --dryrun` flag, which outputs the underlying calls executed by `nvcc`:
`nvcc --dryrun` output:
$ nvcc -c x.cu -o x.cu.o -gencode=arch=compute_60,code=[sm_60] -gencode=arch=compute_70,code=[compute_70,sm_70] --dryrun
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ CICC_PATH=/usr/local/cuda/bin/../nvvm/bin
#$ CICC_NEXT_PATH=/usr/local/cuda/bin/../nvvm-next/bin
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/home/ptaylor/.nvm/versions/node/v22.4.0/bin:/home/ptaylor/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ptaylor/.fzf/bin:/usr/local/cuda/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"
#$ LIBRARIES= "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx" -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin"
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin"
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o"
This output represents a sequence of preprocessing steps that must run on the `sccache` client, followed by compilation steps on the preprocessed result that can be distributed to `sccache-dist` workers.
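To make that shape concrete, here is a minimal sketch of how a client could recover the command list programmatically. It is illustrative only (not sccache's API); it assumes the `#$ `-prefixed lines land on stderr, which may vary by CUDA version, so it also scans stdout.

```rust
// Minimal sketch, not sccache's implementation: re-run the original nvcc
// invocation with --dryrun appended and collect the "#$ "-prefixed commands.
use std::process::Command;

fn nvcc_dryrun_commands(nvcc: &str, args: &[&str]) -> std::io::Result<Vec<String>> {
    let output = Command::new(nvcc).args(args).arg("--dryrun").output()?;
    // Assumption: the "#$ " lines are printed to stderr; also scan stdout in
    // case a given CUDA version behaves differently.
    let mut text = String::from_utf8_lossy(&output.stderr).into_owned();
    text.push_str(&String::from_utf8_lossy(&output.stdout));
    Ok(text
        .lines()
        .filter_map(|line| line.strip_prefix("#$ "))
        // Drop the leading environment-variable assignments (e.g. "_CUDART_=cudart",
        // "PATH=..."); keep the actual gcc/cudafe++/cicc/ptxas/fatbinary commands.
        .filter(|cmd| cmd.split_whitespace().next().map_or(false, |t| !t.contains('=')))
        .map(str::to_string)
        .collect())
}

fn main() -> std::io::Result<()> {
    let cmds = nvcc_dryrun_commands(
        "nvcc",
        &["-c", "x.cu", "-o", "x.cu.o", "-gencode=arch=compute_60,code=[sm_60]"],
    )?;
    for cmd in &cmds {
        println!("{cmd}");
    }
    Ok(())
}
```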
Explanation
Here's a rough breakdown of the command stages above:
#$ gcc -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
#$ cudafe++ --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" --stub_file_name "tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" "/tmp/tmpxft_00003437_00000000-5_x.cpp4.ii"
These two lines run the host preprocessor to resolve host-side macros and inline `#include`s, then run the CUDA front-end (`cudafe++`) to separate the source into host and device source files. The `sccache` client should run both of these steps before requesting any compilation jobs.
#$ gcc -D__CUDA_ARCH__=600 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_60 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-7_x.compute_60.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-8_x.compute_60.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx"
#$ ptxas -arch=sm_60 -m64 "/tmp/tmpxft_00003437_00000000-7_x.compute_60.ptx" -o "/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin"
In this phase, `nvcc`:

1. runs the host compiler to preprocess the input file (`x.cu`)
2. runs `cicc` on the output of step 1 to generate a `.ptx` file
3. runs `ptxas` on the output of step 2 to assemble the PTX into a `.cubin`

All of these steps must run sequentially. Step 1 must run on the `sccache` client, but steps 2 and 3 can be executed by `sccache-dist` workers.
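As a rough illustration of what "hashing their inputs" could mean for these device steps, here is a hypothetical sketch. It is not sccache's actual hashing code; `DefaultHasher` merely keeps the example self-contained where a real implementation would use a cryptographic hash, and the paths in `main` are placeholders.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};

// Hypothetical cache key for a single cicc or ptxas invocation: covers the
// tool binary, the argument list from the `nvcc --dryrun` plan (temporary
// paths would need to be normalized first; omitted here), and the contents
// of the preprocessed input (.ii for cicc, .ptx for ptxas).
fn device_step_cache_key(tool_path: &str, args: &[String], input_path: &str) -> std::io::Result<u64> {
    let mut h = DefaultHasher::new();
    fs::read(tool_path)?.hash(&mut h);  // tool identity
    args.hash(&mut h);                  // normalized arguments
    fs::read(input_path)?.hash(&mut h); // preprocessed input contents
    Ok(h.finish())
}

fn main() -> std::io::Result<()> {
    let key = device_step_cache_key(
        "/usr/local/cuda/nvvm/bin/cicc",
        &["-arch".into(), "compute_60".into(), "-m64".into()],
        "/tmp/x.compute_60.cpp1.ii",
    )?;
    println!("cicc step key: {key:x}");
    Ok(())
}
```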
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=82 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "x.cu" -o "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii"
#$ cicc --c++17 --gnu_version=130200 --display_error_number --orig_src_file_name "x.cu" --orig_src_path_name "x.cu" --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_00003437_00000000-3_x.fatbin.c" -tused --module_id_file_name "/tmp/tmpxft_00003437_00000000-4_x.module_id" --gen_c_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.c" --stub_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.gpu" "/tmp/tmpxft_00003437_00000000-10_x.compute_70.cpp1.ii" -o "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx"
#$ ptxas -arch=sm_70 -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" -o "/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin"
This is similar to the prior commands, except for a different GPU architecture (`sm_70`). These commands must still run sequentially with respect to each other, but they can run in parallel with the commands from the prior stage.
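Sketched below (with hypothetical helper names, not sccache code) is how a client could run the two architecture chains concurrently while keeping each chain's preprocess → `cicc` → `ptxas` steps in order:

```rust
use std::thread;

// Hypothetical stand-in for one architecture's preprocess -> cicc -> ptxas chain.
// Each step would go through the hash + cache + (distributed) compile pipeline.
fn run_arch_chain(arch: &str) -> String {
    format!("x.{arch}.cubin")
}

fn main() {
    let arches = ["compute_60", "compute_70"];
    // The per-arch chains are independent of each other, so they can run in
    // parallel; within a chain the steps stay sequential.
    let cubins: Vec<String> = thread::scope(|s| {
        let handles: Vec<_> = arches
            .iter()
            .map(|arch| s.spawn(move || run_arch_chain(arch)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // All resulting cubins (plus any PTX kept for JIT) feed the single
    // fatbinary step that follows.
    println!("{cubins:?}");
}
```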
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=60,file=/tmp/tmpxft_00003437_00000000-9_x.compute_60.cubin" "--image3=kind=ptx,sm=70,file=/tmp/tmpxft_00003437_00000000-6_x.compute_70.ptx" "--image3=kind=elf,sm=70,file=/tmp/tmpxft_00003437_00000000-11_x.compute_70.sm_70.cubin" --embedded-fatbin="/tmp/tmpxft_00003437_00000000-3_x.fatbin.c"
#$ rm /tmp/tmpxft_00003437_00000000-3_x.fatbin
#$ gcc -D__CUDA_ARCH__=700 -D__CUDA_ARCH_LIST__=600,700 -D__NV_LEGACY_LAUNCH -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -Wno-psabi "-I/usr/local/cuda/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_00003437_00000000-6_x.compute_70.cudafe1.cpp" -o "x.cu.o"
In this stage, the outputs from the prior two stages are assembled into a `.fatbin` via the `fatbinary` invocation, then the original preprocessed host code is combined with the `.fatbin` and assembled into the final `.o` by the host compiler. These stages must run sequentially, but can be executed by `sccache-dist` workers (the final host compiler call can use the existing `sccache-dist` logic for preprocessing + distributing the work).
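A small sketch of that ordering, using hypothetical placeholder functions rather than sccache APIs:

```rust
// Sketch only; function names are hypothetical placeholders, not sccache APIs.
fn run_fatbinary(cubins: &[String]) -> String {
    // Bundles every per-arch image into the embedded fatbin header that the
    // host code includes. Cacheable/distributable like the other device steps.
    format!("x.fatbin.c (from {} images)", cubins.len())
}

fn run_host_compile(cudafe_cpp: &str, _fatbin_header: &str) -> String {
    // Final host compile of the cudafe++-generated C++ (which pulls in the
    // fatbin header). This call can reuse sccache's existing C/C++ handling.
    format!("{cudafe_cpp}.o")
}

fn main() {
    // Both per-arch chains must have finished before this stage starts.
    let cubins = vec![
        "x.compute_60.cubin".to_string(),
        "x.compute_70.sm_70.cubin".to_string(),
    ];
    let fatbin = run_fatbinary(&cubins);
    let object = run_host_compile("x.compute_70.cudafe1.cpp", &fatbin);
    println!("{object}");
}
```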
Additional Benefits
In addition to supporting `sccache-dist` for `nvcc`, this new behavior also benefits `sccache` clients that aren't configured to use distributed compilation, because `sccache` can now avoid recompiling the underlying `.ptx` and `.cubin` device compilation artifacts that are assembled into the final `.o`.
For example, a CI job could compile code for all supported device architectures:
$ nvcc ... \
-gencode=arch=compute_60,code=[sm_60] \
-gencode=arch=compute_70,code=[sm_70] \
-gencode=arch=compute_80,code=[sm_80] \
-gencode=arch=compute_90,code=[compute_90,sm_90]
The above produces an object file with a certain hash; let's call it `hash_all`.
A developer may want to compile the same code with the same options, but for a smaller subset of architectures that match the GPU on their machine:
$ nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]
Since the above produces an object file with a different hash (`hash_subset`), today `sccache` yields a cache miss on this `.o` file and re-runs `nvcc` (which itself runs `cicc` and `ptxas`), because the arguments + input don't match the `hash_all` produced in CI.
However, with the proposed changes, while `sccache` would still yield a cache miss for the `.o` produced by the `nvcc` command, it would yield cache hits on the underlying `.ptx` and `.cubin` files produced by `cicc` and `ptxas` respectively, skipping the lion's share of the actual compilation done by `nvcc`.
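A toy illustration of that claim, using a stand-in key function (not sccache's real hashing) and assuming the preprocessed device input for the shared architecture is identical across the two invocations:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in key function for illustration only.
fn key<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // Whole-object keys include the full nvcc command line, so the CI build
    // (all architectures) and the developer build (compute_90 only) differ.
    let ci_cmdline = "nvcc ... -gencode=arch=compute_60,... -gencode=arch=compute_90,...";
    let dev_cmdline = "nvcc ... -gencode=arch=compute_90,code=[compute_90,sm_90]";
    assert_ne!(key(&ci_cmdline), key(&dev_cmdline));

    // Per-arch device step keys (tool + args + preprocessed input for that
    // arch) are the same in both builds, so the compute_90 .ptx/.cubin hit.
    let ci_cicc_step = ("cicc", "-arch compute_90", "<preprocessed compute_90 input>");
    let dev_cicc_step = ("cicc", "-arch compute_90", "<preprocessed compute_90 input>");
    assert_eq!(key(&ci_cicc_step), key(&dev_cicc_step));

    println!("object keys differ; shared device-step keys match");
}
```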
Tasks
Work is ongoing in this branch.
- Add `cicc` and `ptxas` as first-class compilers supported by `sccache`
- Support bundling `cicc` and `ptxas` toolchains from the client's CUDA toolkit
- Refactor the `nvcc` compiler to call `nvcc --dryrun`, and run each sub-command through `sccache` as appropriate