
Weekly compute-sanitizer runs for cuDF #20699

@bdice

Description


Proposal: Weekly compute-sanitizer runs for cuDF

See implementation in #20542.

Summary

Add a GitHub Actions workflow that runs all NVIDIA compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck) on libcudf tests. The workflow dynamically discovers test executables and runs them in parallel on separate GPU runners, providing comprehensive memory safety and concurrency validation.

Motivation

We currently have nightly CI tests for memcheck, but no coverage for the other compute-sanitizer tools (racecheck, initcheck, synccheck).
We would like more visibility into the low-level correctness of libcudf code.

Goals:

  1. Improve visibility into correctness: Each tool/test gets its own job with individual pass/fail status
  2. Enable comprehensive checking: Support all four compute-sanitizer tools, not just memcheck
  3. Facilitate targeted testing: Allow manual triggering with specific tools and tests
  4. Provide weekly validation: Scheduled runs catch issues before release

Implementation Overview

The proposed implementation consists of two GitHub Actions workflows and two bash scripts:

Workflows

  1. compute-sanitizer-trigger.yaml: Orchestration workflow

    • Scheduled weekly (Saturday 10:00 UTC, after RAPIDS nightlies complete) and manually dispatchable
    • Runs test discovery once
    • Launches four parallel jobs (one per tool: memcheck, racecheck, initcheck, synccheck)
    • Each tool runs on all discovered tests via compute-sanitizer-run.yaml
  2. compute-sanitizer-run.yaml: Reusable execution workflow

    • Takes tool name and test list as inputs
    • Creates matrix job (one job per test)
    • Runs on linux-amd64-gpu-l4-latest-1 runners
    • Uses continue-on-error: true to ensure all tests run

Scripts

  1. ci/discover_libcudf_tests.sh: Test discovery

    • Installs test environment from conda
    • Finds all *_TEST executables in $CONDA_PREFIX/bin/gtests/libcudf/
    • Outputs JSON array to $GITHUB_OUTPUT
  2. ci/run_compute_sanitizer_test.sh: Single test execution

    • Takes tool name and test name as arguments
    • Installs test environment from conda
    • Sets required environment variables
    • Runs: compute-sanitizer --tool <TOOL> --kernel-name-exclude kns=nvcomp --error-exitcode=1 <TEST>
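Under the design above, the core of ci/run_compute_sanitizer_test.sh could look roughly like the following sketch. The function body is illustrative, not the actual implementation; the DRY_RUN switch is a device added here so the sketch can run on machines without a GPU or compute-sanitizer installed.

```shell
# Illustrative sketch of ci/run_compute_sanitizer_test.sh, written as a
# function for demonstration. With DRY_RUN=1 it prints the command it would
# execute instead of running it (compute-sanitizer requires a GPU).
run_sanitizer_test() {
  local tool="$1" test_name="$2"
  shift 2  # remaining args (e.g. --gtest_filter=...) pass through

  # Environment variables borrowed from ci/run_cudf_memcheck_ctests.sh
  export GTEST_CUDF_RMM_MODE=cuda
  export GTEST_BRIEF=1
  export LIBCUDF_MEMCHECK_ENABLED=1

  local cmd=(compute-sanitizer
             --tool "${tool}"
             --kernel-name-exclude kns=nvcomp
             --error-exitcode=1
             "${CONDA_PREFIX}/bin/gtests/libcudf/${test_name}" "$@")

  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[@]}"
  else
    "${cmd[@]}"
  fi
}

# Demonstration (dry run; prints the command that would be executed):
DRY_RUN=1 CONDA_PREFIX="${CONDA_PREFIX:-/opt/conda}" run_sanitizer_test memcheck AST_TEST
```

Exiting with compute-sanitizer's own exit code (via --error-exitcode=1) is what lets each matrix job report an individual pass/fail status.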

Key Design Decisions

1. Multi-Tool Support

Decision: Support all four compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck)

Rationale:

  • memcheck: Detects out-of-bounds and misaligned memory access errors
  • racecheck: Identifies shared memory data race hazards
  • initcheck: Finds uses of uninitialized device global memory
  • synccheck: Detects invalid usage of synchronization primitives

Each tool provides unique validation capabilities for different classes of CUDA programming errors.
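The fan-out this implies (every tool against every discovered test) can be sketched as a pair of nested loops. The test names below are placeholders for the discovery output, and the real sanitizer invocation is reduced to building a "tool:test" plan entry so the loop runs anywhere:

```shell
# Sketch of the tool x test fan-out the trigger workflow performs via four
# parallel jobs. Test names are stand-ins for the discovery output.
tools=(memcheck racecheck initcheck synccheck)
tests=(AST_TEST BINARYOP_TEST)

planned=()
for tool in "${tools[@]}"; do
  for t in "${tests[@]}"; do
    planned+=("${tool}:${t}")  # in CI, one matrix job per entry
  done
done
printf '%s\n' "${planned[@]}"
```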

2. Reusable Workflow Pattern

Decision: Split into trigger and run workflows

Rationale:

  • compute-sanitizer-run.yaml is reusable via workflow_call and manual dispatch
  • Enables targeted testing: developers can trigger specific tool + test combinations
  • Reduces duplication: single workflow definition used by all four tools
  • Follows GitHub Actions best practices for composability

3. Dynamic Test Discovery

Decision: Discover tests at runtime using bash script

Rationale:

  • Automatically adapts when new tests are added to libcudf
  • No manual maintenance of test lists in workflow files
  • Ensures complete coverage without hardcoded assumptions
  • Discovery runs on CPU runner (minimal cost)
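As a sketch of the discovery step, the JSON-array assembly might look like the following (illustrative, not the actual script; a temporary directory with fake executables stands in for $CONDA_PREFIX/bin/gtests/libcudf/):

```shell
# Illustrative sketch of test discovery: find *_TEST executables and emit a
# JSON array suitable for a GitHub Actions matrix.
set -euo pipefail

test_dir="$(mktemp -d)"  # stand-in for $CONDA_PREFIX/bin/gtests/libcudf/
touch "${test_dir}/AST_TEST" "${test_dir}/BINARYOP_TEST" "${test_dir}/JOIN_TEST"
chmod +x "${test_dir}"/*_TEST

# Build a JSON array of basenames, e.g. ["AST_TEST","BINARYOP_TEST",...]
names=()
for f in "${test_dir}"/*_TEST; do
  names+=("\"$(basename "${f}")\"")
done
json="[$(IFS=,; echo "${names[*]}")]"
echo "${json}"

# In CI this would be written to the step output file, e.g.:
# echo "tests=${json}" >> "${GITHUB_OUTPUT}"
```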

4. Continue-on-Error Strategy

Decision: Set continue-on-error: true on matrix jobs

Rationale:

  • Run all tests even if some fail
  • Provides complete picture of failures in single run
  • Individual test results visible in GitHub Actions UI
  • Trade-off: Consumes more GPU resources if many tests fail, but provides better debugging information
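In GitHub Actions this is a per-job setting, but the intended behavior can be illustrated with a shell loop that records failures instead of aborting on the first one (hypothetical sketch; the per-test command is stubbed out):

```shell
# Shell analogue of continue-on-error: true -- run everything, collect
# failures, and only report an overall failure at the end.
failed=()
for t in AST_TEST BINARYOP_TEST JOIN_TEST; do
  # Stub: pretend BINARYOP_TEST fails; a real job would run the sanitizer here.
  if [[ "${t}" == "BINARYOP_TEST" ]]; then result=1; else result=0; fi
  if [[ ${result} -ne 0 ]]; then
    failed+=("${t}")
  fi
done

echo "failed: ${failed[*]:-none}"
overall=0
[[ ${#failed[@]} -eq 0 ]] || overall=1
echo "overall exit code would be ${overall}"
```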

5. Kernel Exclusions

Decision: Exclude nvcomp kernels from checking

Rationale:

  • nvcomp is a third-party compression library with known compute-sanitizer issues
  • Exclusion prevents false positives from library code outside cuDF's control
  • Implemented via: --kernel-name-exclude kns=nvcomp

Environment Variables

Required environment variables (borrowed from ci/run_cudf_memcheck_ctests.sh):

  • GTEST_CUDF_RMM_MODE=cuda: Configures RMM to use the plain CUDA memory resource rather than a pooled allocator, so device allocations stay visible to compute-sanitizer
  • GTEST_BRIEF=1: Enables concise Google Test output (TODO: we may remove this)
  • LIBCUDF_MEMCHECK_ENABLED=1: Workaround for compute-sanitizer bug 4553815

Architecture Details

Test Discovery Flow

ci/discover_libcudf_tests.sh
├── Install test environment via rapids-dependency-file-generator
├── Find tests: ls $CONDA_PREFIX/bin/gtests/libcudf/*_TEST
├── Generate JSON: ["AST_TEST", "BINARYOP_TEST", ...]
└── Output to $GITHUB_OUTPUT

Test Execution Flow (per tool, per test)

ci/run_compute_sanitizer_test.sh <tool> <test>
├── Install test environment via rapids-dependency-file-generator
├── Set environment variables (GTEST_CUDF_RMM_MODE, etc.)
├── Verify GPU availability (nvidia-smi)
├── Execute: compute-sanitizer --tool <tool> \
│              --kernel-name-exclude kns=nvcomp \
│              --error-exitcode=1 \
│              $CONDA_PREFIX/bin/gtests/libcudf/<test>
└── Exit with compute-sanitizer's exit code

Workflow Topology

compute-sanitizer-trigger.yaml (weekly schedule + manual)
├── discover-sanitizer-tests (CPU runner)
│   └── ci/discover_libcudf_tests.sh
├── run-sanitizer-tests-memcheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-racecheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-initcheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
└── run-sanitizer-tests-synccheck
    └── compute-sanitizer-run.yaml (matrix: all tests)

Usage Examples

Manual Trigger: Run memcheck on a few tests

  1. Navigate to Actions → Compute Sanitizer Run
  2. Click "Run workflow"
  3. Select tool: memcheck
  4. Specify tests: ["AST_TEST", "BINARYOP_TEST"] (or use discovery output)

Manual Trigger: Run racecheck on specific test

  1. Navigate to Actions → Compute Sanitizer Run
  2. Click "Run workflow"
  3. Select tool: racecheck
  4. Specify tests: ["JOIN_TEST"]
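The same manual dispatch can be scripted with the GitHub CLI's gh workflow run. The input names (tool, tests) below are assumptions based on this proposal, so verify them against the workflow's actual workflow_dispatch inputs; the command is printed rather than executed here so the sketch runs without gh authentication:

```shell
# Hypothetical: dispatch the reusable workflow from the command line.
# Input names 'tool' and 'tests' are assumed from this proposal.
cmd=(gh workflow run compute-sanitizer-run.yaml
     -f tool=racecheck
     -f 'tests=["JOIN_TEST"]')
printf '%s ' "${cmd[@]}"; echo
```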

Local Testing

# Install libcudf-tests conda package
conda create -n libcudf-test -c rapidsai-nightly -c conda-forge libcudf-tests libcudf

# Run single test with memcheck
./ci/run_compute_sanitizer_test.sh memcheck AST_TEST

# Run single test with racecheck and gtest filter
./ci/run_compute_sanitizer_test.sh racecheck GROUPBY_TEST --gtest_filter="GroupByTest.Sum*"

Open Questions

Resource limits: Should we implement per-job timeouts to prevent hanging tests? The default GitHub Actions timeout is 6 hours. We probably need to manually skip any tests that we expect to exceed some limit, or possibly break them up into smaller test executables.
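If per-job timeouts prove necessary, one option besides GitHub Actions' per-job timeout-minutes setting is wrapping the sanitizer invocation in coreutils timeout. In the sketch below the limit is an arbitrary placeholder and sleep stands in for the real test:

```shell
# Sketch: bound a single run with coreutils 'timeout', which exits with
# status 124 when the command is killed for overrunning. 'sleep 5' stands in
# for the sanitizer invocation; the 1s limit is a placeholder.
if timeout 1s sleep 5; then
  status=ok
else
  rc=$?
  [[ ${rc} -eq 124 ]] && status=timed_out || status="failed(${rc})"
fi
echo "status=${status}"
```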
