Proposal: Weekly compute-sanitizer runs for cuDF
See implementation in #20542.
Summary
Add a GitHub Actions workflow that runs all NVIDIA compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck) on libcudf tests. The workflow dynamically discovers test executables and runs them in parallel on separate GPU runners, providing comprehensive memory safety and concurrency validation.
Motivation
We currently have nightly CI tests for memcheck but no coverage for other tools (racecheck, initcheck, synccheck).
We would like to increase our visibility into low-level correctness of libcudf code.
Goals:
- Improve visibility into correctness: Each tool/test gets its own job with individual pass/fail status
- Enable comprehensive checking: Support all four compute-sanitizer tools, not just memcheck
- Facilitate targeted testing: Allow manual triggering with specific tools and tests
- Provide weekly validation: Scheduled runs catch issues before release
Implementation Overview
The proposed implementation consists of two GitHub Actions workflows and two bash scripts:
Workflows
- `compute-sanitizer-trigger.yaml`: Orchestration workflow
  - Scheduled weekly (Saturday 10:00 UTC, after RAPIDS nightlies complete) and manually dispatchable
  - Runs test discovery once
  - Launches four parallel jobs (one per tool: `memcheck`, `racecheck`, `initcheck`, `synccheck`)
  - Each tool runs on all discovered tests via `compute-sanitizer-run.yaml`
- `compute-sanitizer-run.yaml`: Reusable execution workflow
  - Takes tool name and test list as inputs
  - Creates a matrix job (one job per test)
  - Runs on `linux-amd64-gpu-l4-latest-1` runners
  - Uses `continue-on-error: true` to ensure all tests run
Scripts
- `ci/discover_libcudf_tests.sh`: Test discovery
  - Installs test environment from conda
  - Finds all `*_TEST` executables in `$CONDA_PREFIX/bin/gtests/libcudf/`
  - Outputs JSON array to `$GITHUB_OUTPUT`
- `ci/run_compute_sanitizer_test.sh`: Single test execution
  - Takes tool name and test name as arguments
  - Installs test environment from conda
  - Sets required environment variables
  - Runs: `compute-sanitizer --tool <TOOL> --kernel-name-exclude kns=nvcomp --error-exitcode=1 <TEST>`
Key Design Decisions
1. Multi-Tool Support
Decision: Support all four compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck)
Rationale:
- memcheck: Detects out-of-bounds and misaligned memory access errors
- racecheck: Identifies shared memory data race hazards
- initcheck: Finds uses of uninitialized device global memory
- synccheck: Detects invalid usage of synchronization primitives
Each tool provides unique validation capabilities for different classes of CUDA programming errors.
2. Reusable Workflow Pattern
Decision: Split into trigger and run workflows
Rationale:
- `compute-sanitizer-run.yaml` is reusable via `workflow_call` and manual dispatch
- Enables targeted testing: developers can trigger specific tool + test combinations
- Reduces duplication: a single workflow definition is used by all four tools
- Follows GitHub Actions best practices for composability
3. Dynamic Test Discovery
Decision: Discover tests at runtime using bash script
Rationale:
- Automatically adapts when new tests are added to libcudf
- No manual maintenance of test lists in workflow files
- Ensures complete coverage without hardcoded assumptions
- Discovery runs on CPU runner (minimal cost)
4. Continue-on-Error Strategy
Decision: Set `continue-on-error: true` on matrix jobs
Rationale:
- Run all tests even if some fail
- Provides complete picture of failures in single run
- Individual test results visible in GitHub Actions UI
- Trade-off: Consumes more GPU resources if many tests fail, but provides better debugging information
5. Kernel Exclusions
Decision: Exclude nvcomp kernels from checking
Rationale:
- nvcomp is a third-party compression library with known compute-sanitizer issues
- Exclusion prevents false positives from library code outside cuDF's control
- Implemented via `--kernel-name-exclude kns=nvcomp`
Environment Variables
Required environment variables (borrowed from `ci/run_cudf_memcheck_ctests.sh`):
- `GTEST_CUDF_RMM_MODE=cuda`: Configures the RMM memory allocator to use CUDA mode
- `GTEST_BRIEF=1`: Enables concise Google Test output (TODO: we may remove this)
- `LIBCUDF_MEMCHECK_ENABLED=1`: Workaround for compute-sanitizer bug 4553815
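The same configuration can be reproduced when invoking compute-sanitizer by hand; a minimal sketch, assuming the libcudf-tests conda package is installed in the active environment:

```bash
# Run one test under memcheck with the proposed environment configuration
# (assumption: libcudf-tests is installed, so $CONDA_PREFIX/bin/gtests/libcudf exists).
GTEST_CUDF_RMM_MODE=cuda GTEST_BRIEF=1 LIBCUDF_MEMCHECK_ENABLED=1 \
  compute-sanitizer --tool memcheck \
    --kernel-name-exclude kns=nvcomp \
    --error-exitcode=1 \
    "${CONDA_PREFIX}/bin/gtests/libcudf/AST_TEST"
```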
Architecture Details
Test Discovery Flow
ci/discover_libcudf_tests.sh
├── Install test environment via rapids-dependency-file-generator
├── Find tests: ls $CONDA_PREFIX/bin/gtests/libcudf/*_TEST
├── Generate JSON: ["AST_TEST", "BINARYOP_TEST", ...]
└── Output to $GITHUB_OUTPUT
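For illustration, a minimal sketch of the discovery logic under the assumptions above (test environment already installed, `jq` available on the runner); the actual script in #20542 may differ:

```bash
#!/bin/bash
# Hypothetical sketch of ci/discover_libcudf_tests.sh (illustrative only).
set -euo pipefail

TEST_DIR="${CONDA_PREFIX}/bin/gtests/libcudf"

# List test executables by basename, e.g. AST_TEST, BINARYOP_TEST, ...
tests=$(ls "${TEST_DIR}"/*_TEST | xargs -n1 basename | sort)

# Convert the newline-separated list into a compact JSON array for the job matrix.
tests_json=$(printf '%s\n' "${tests}" | jq --raw-input . | jq --compact-output --slurp .)

# Expose the list to downstream jobs, e.g. tests=["AST_TEST","BINARYOP_TEST",...]
echo "tests=${tests_json}" >> "${GITHUB_OUTPUT}"
```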
Test Execution Flow (per tool, per test)
ci/run_compute_sanitizer_test.sh <tool> <test>
├── Install test environment via rapids-dependency-file-generator
├── Set environment variables (GTEST_CUDF_RMM_MODE, etc.)
├── Verify GPU availability (nvidia-smi)
├── Execute: compute-sanitizer --tool <tool> \
│ --kernel-name-exclude kns=nvcomp \
│ --error-exitcode=1 \
│ $CONDA_PREFIX/bin/gtests/libcudf/<test>
└── Exit with compute-sanitizer's exit code
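A corresponding sketch of the per-test execution script, again assuming the test environment is already installed and activated; argument handling and details are illustrative:

```bash
#!/bin/bash
# Hypothetical sketch of ci/run_compute_sanitizer_test.sh (illustrative only).
# Usage: run_compute_sanitizer_test.sh <tool> <test> [extra gtest args...]
set -euo pipefail

TOOL="$1"    # one of: memcheck, racecheck, initcheck, synccheck
TEST="$2"    # test executable name, e.g. AST_TEST
shift 2      # remaining arguments (e.g. --gtest_filter=...) are passed to the test

# Environment variables borrowed from ci/run_cudf_memcheck_ctests.sh
export GTEST_CUDF_RMM_MODE=cuda
export GTEST_BRIEF=1
export LIBCUDF_MEMCHECK_ENABLED=1

# Fail fast if no GPU is visible on the runner.
nvidia-smi

# With `set -e`, the script exits with compute-sanitizer's exit code.
compute-sanitizer --tool "${TOOL}" \
  --kernel-name-exclude kns=nvcomp \
  --error-exitcode=1 \
  "${CONDA_PREFIX}/bin/gtests/libcudf/${TEST}" "$@"
```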
Workflow Topology
compute-sanitizer-trigger.yaml (weekly schedule + manual)
├── discover-sanitizer-tests (CPU runner)
│ └── ci/discover_libcudf_tests.sh
├── run-sanitizer-tests-memcheck
│ └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-racecheck
│ └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-initcheck
│ └── compute-sanitizer-run.yaml (matrix: all tests)
└── run-sanitizer-tests-synccheck
└── compute-sanitizer-run.yaml (matrix: all tests)
Usage Examples
Manual Trigger: Run memcheck on a few tests
- Navigate to Actions → Compute Sanitizer Run
- Click "Run workflow"
- Select tool: `memcheck`
- Specify tests: `["AST_TEST", "BINARYOP_TEST"]` (or use discovery output)
Manual Trigger: Run racecheck on specific test
- Navigate to Actions → Compute Sanitizer Run
- Click "Run workflow"
- Select tool: `racecheck`
- Specify tests: `["JOIN_TEST"]`
Local Testing
# Install libcudf-tests conda package
conda create -n libcudf-test -c rapidsai-nightly -c conda-forge libcudf-tests libcudf
# Run single test with memcheck
./ci/run_compute_sanitizer_test.sh memcheck AST_TEST
# Run single test with racecheck and gtest filter
./ci/run_compute_sanitizer_test.sh racecheck GROUPBY_TEST --gtest_filter="GroupByTest.Sum*"
Open Questions
Resource limits: Should we implement per-job timeouts to prevent hanging tests? The default GitHub Actions timeout is 6 hours. We probably need to manually skip any tests that we expect to exceed some limit, or possibly break them up into smaller test executables.