
Weekly compute-sanitizer runs for cuDF #20699

@bdice

Description


Proposal: Weekly compute-sanitizer runs for cuDF

See implementation in #20542.

Summary

Add a GitHub Actions workflow that runs all NVIDIA compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck) on libcudf tests. The workflow dynamically discovers test executables and runs them in parallel on separate GPU runners, providing comprehensive memory safety and concurrency validation.

Motivation

We currently have nightly CI tests for memcheck, but no coverage for the other compute-sanitizer tools (racecheck, initcheck, synccheck).
We would like more visibility into the low-level correctness of libcudf code.

Goals:

  1. Improve visibility into correctness: Each tool/test gets its own job with individual pass/fail status
  2. Enable comprehensive checking: Support all four compute-sanitizer tools, not just memcheck
  3. Facilitate targeted testing: Allow manual triggering with specific tools and tests
  4. Provide weekly validation: Scheduled runs catch issues before release

Implementation Overview

The proposed implementation consists of two GitHub Actions workflows and two bash scripts:

Workflows

  1. compute-sanitizer-trigger.yaml: Orchestration workflow

    • Scheduled weekly (Saturday 10:00 UTC, after RAPIDS nightlies complete) and manually dispatchable
    • Runs test discovery once
    • Launches four parallel jobs (one per tool: memcheck, racecheck, initcheck, synccheck)
    • Each tool runs on all discovered tests via compute-sanitizer-run.yaml
  2. compute-sanitizer-run.yaml: Reusable execution workflow

    • Takes tool name and test list as inputs
    • Creates matrix job (one job per test)
    • Runs on linux-amd64-gpu-l4-latest-1 runners
    • Uses continue-on-error: true to ensure all tests run

Scripts

  1. ci/discover_libcudf_tests.sh: Test discovery

    • Installs test environment from conda
    • Finds all *_TEST executables in $CONDA_PREFIX/bin/gtests/libcudf/
    • Outputs JSON array to $GITHUB_OUTPUT
  2. ci/run_compute_sanitizer_test.sh: Single test execution

    • Takes tool name and test name as arguments
    • Installs test environment from conda
    • Sets required environment variables
    • Runs: compute-sanitizer --tool <TOOL> --kernel-name-exclude kns=nvcomp --error-exitcode=1 <TEST>
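Under the design above, the core of ci/run_compute_sanitizer_test.sh could look roughly like the following sketch. The function body is illustrative, not the actual implementation; the DRY_RUN switch is a device added here so the sketch can run on machines without a GPU or compute-sanitizer installed.

```shell
# Illustrative sketch of ci/run_compute_sanitizer_test.sh, written as a
# function for demonstration. With DRY_RUN=1 it prints the command it would
# execute instead of running it (compute-sanitizer requires a GPU).
run_sanitizer_test() {
  local tool="$1" test_name="$2"
  shift 2  # remaining args (e.g. --gtest_filter=...) pass through

  # Environment variables borrowed from ci/run_cudf_memcheck_ctests.sh
  export GTEST_CUDF_RMM_MODE=cuda
  export GTEST_BRIEF=1
  export LIBCUDF_MEMCHECK_ENABLED=1

  local cmd=(compute-sanitizer
             --tool "${tool}"
             --kernel-name-exclude kns=nvcomp
             --error-exitcode=1
             "${CONDA_PREFIX}/bin/gtests/libcudf/${test_name}" "$@")

  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[@]}"
  else
    "${cmd[@]}"
  fi
}

# Demonstration (dry run; prints the command that would be executed):
DRY_RUN=1 CONDA_PREFIX="${CONDA_PREFIX:-/opt/conda}" run_sanitizer_test memcheck AST_TEST
```

Exiting with compute-sanitizer's own exit code (via --error-exitcode=1) is what lets each matrix job report an individual pass/fail status.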

Key Design Decisions

1. Multi-Tool Support

Decision: Support all four compute-sanitizer tools (memcheck, racecheck, initcheck, synccheck)

Rationale:

  • memcheck: Detects out-of-bounds and misaligned memory access errors
  • racecheck: Identifies shared memory data race hazards
  • initcheck: Finds uses of uninitialized device global memory
  • synccheck: Detects invalid usage of synchronization primitives

Each tool provides unique validation capabilities for different classes of CUDA programming errors.
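The fan-out this implies (every tool against every discovered test) can be sketched as a pair of nested loops. The test names below are placeholders for the discovery output, and the real sanitizer invocation is reduced to building a "tool:test" plan entry so the loop runs anywhere:

```shell
# Sketch of the tool x test fan-out the trigger workflow performs via four
# parallel jobs. Test names are stand-ins for the discovery output.
tools=(memcheck racecheck initcheck synccheck)
tests=(AST_TEST BINARYOP_TEST)

planned=()
for tool in "${tools[@]}"; do
  for t in "${tests[@]}"; do
    planned+=("${tool}:${t}")  # in CI, one matrix job per entry
  done
done
printf '%s\n' "${planned[@]}"
```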

2. Reusable Workflow Pattern

Decision: Split into trigger and run workflows

Rationale:

  • compute-sanitizer-run.yaml is reusable via workflow_call and manual dispatch
  • Enables targeted testing: developers can trigger specific tool + test combinations
  • Reduces duplication: single workflow definition used by all four tools
  • Follows GitHub Actions best practices for composability

3. Dynamic Test Discovery

Decision: Discover tests at runtime using bash script

Rationale:

  • Automatically adapts when new tests are added to libcudf
  • No manual maintenance of test lists in workflow files
  • Ensures complete coverage without hardcoded assumptions
  • Discovery runs on CPU runner (minimal cost)
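As a sketch of the discovery step, the JSON-array assembly might look like the following (illustrative, not the actual script; a temporary directory with fake executables stands in for $CONDA_PREFIX/bin/gtests/libcudf/):

```shell
# Illustrative sketch of test discovery: find *_TEST executables and emit a
# JSON array suitable for a GitHub Actions matrix.
set -euo pipefail

test_dir="$(mktemp -d)"  # stand-in for $CONDA_PREFIX/bin/gtests/libcudf/
touch "${test_dir}/AST_TEST" "${test_dir}/BINARYOP_TEST" "${test_dir}/JOIN_TEST"
chmod +x "${test_dir}"/*_TEST

# Build a JSON array of basenames, e.g. ["AST_TEST","BINARYOP_TEST",...]
names=()
for f in "${test_dir}"/*_TEST; do
  names+=("\"$(basename "${f}")\"")
done
json="[$(IFS=,; echo "${names[*]}")]"
echo "${json}"

# In CI this would be written to the step output file, e.g.:
# echo "tests=${json}" >> "${GITHUB_OUTPUT}"
```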

4. Continue-on-Error Strategy

Decision: Set continue-on-error: true on matrix jobs

Rationale:

  • Run all tests even if some fail
  • Provides complete picture of failures in single run
  • Individual test results visible in GitHub Actions UI
  • Trade-off: Consumes more GPU resources if many tests fail, but provides better debugging information
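In GitHub Actions this is a per-job setting, but the intended behavior can be illustrated with a shell loop that records failures instead of aborting on the first one (hypothetical sketch; the per-test command is stubbed out):

```shell
# Shell analogue of continue-on-error: true -- run everything, collect
# failures, and only report an overall failure at the end.
failed=()
for t in AST_TEST BINARYOP_TEST JOIN_TEST; do
  # Stub: pretend BINARYOP_TEST fails; a real job would run the sanitizer here.
  if [[ "${t}" == "BINARYOP_TEST" ]]; then result=1; else result=0; fi
  if [[ ${result} -ne 0 ]]; then
    failed+=("${t}")
  fi
done

echo "failed: ${failed[*]:-none}"
overall=0
[[ ${#failed[@]} -eq 0 ]] || overall=1
echo "overall exit code would be ${overall}"
```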

5. Kernel Exclusions

Decision: Exclude nvcomp kernels from checking

Rationale:

  • nvcomp is a third-party compression library with known compute-sanitizer issues
  • Exclusion prevents false positives from library code outside cuDF's control
  • Implemented via: --kernel-name-exclude kns=nvcomp

Environment Variables

Required environment variables (borrowed from ci/run_cudf_memcheck_ctests.sh):

  • GTEST_CUDF_RMM_MODE=cuda: Configures RMM to use the plain CUDA memory resource rather than a pooled allocator, so device allocations stay visible to compute-sanitizer
  • GTEST_BRIEF=1: Enables concise Google Test output (TODO: we may remove this)
  • LIBCUDF_MEMCHECK_ENABLED=1: Workaround for compute-sanitizer bug 4553815

Architecture Details

Test Discovery Flow

ci/discover_libcudf_tests.sh
├── Install test environment via rapids-dependency-file-generator
├── Find tests: ls $CONDA_PREFIX/bin/gtests/libcudf/*_TEST
├── Generate JSON: ["AST_TEST", "BINARYOP_TEST", ...]
└── Output to $GITHUB_OUTPUT

Test Execution Flow (per tool, per test)

ci/run_compute_sanitizer_test.sh <tool> <test>
├── Install test environment via rapids-dependency-file-generator
├── Set environment variables (GTEST_CUDF_RMM_MODE, etc.)
├── Verify GPU availability (nvidia-smi)
├── Execute: compute-sanitizer --tool <tool> \
│              --kernel-name-exclude kns=nvcomp \
│              --error-exitcode=1 \
│              $CONDA_PREFIX/bin/gtests/libcudf/<test>
└── Exit with compute-sanitizer's exit code

Workflow Topology

compute-sanitizer-trigger.yaml (weekly schedule + manual)
├── discover-sanitizer-tests (CPU runner)
│   └── ci/discover_libcudf_tests.sh
├── run-sanitizer-tests-memcheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-racecheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
├── run-sanitizer-tests-initcheck
│   └── compute-sanitizer-run.yaml (matrix: all tests)
└── run-sanitizer-tests-synccheck
    └── compute-sanitizer-run.yaml (matrix: all tests)

Usage Examples

Manual Trigger: Run memcheck on a few tests

  1. Navigate to Actions → Compute Sanitizer Run
  2. Click "Run workflow"
  3. Select tool: memcheck
  4. Specify tests: ["AST_TEST", "BINARYOP_TEST"] (or use discovery output)

Manual Trigger: Run racecheck on specific test

  1. Navigate to Actions → Compute Sanitizer Run
  2. Click "Run workflow"
  3. Select tool: racecheck
  4. Specify tests: ["JOIN_TEST"]
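The same manual dispatch can be scripted with the GitHub CLI's gh workflow run. The input names (tool, tests) below are assumptions based on this proposal, so verify them against the workflow's actual workflow_dispatch inputs; the command is printed rather than executed here so the sketch runs without gh authentication:

```shell
# Hypothetical: dispatch the reusable workflow from the command line.
# Input names 'tool' and 'tests' are assumed from this proposal.
cmd=(gh workflow run compute-sanitizer-run.yaml
     -f tool=racecheck
     -f 'tests=["JOIN_TEST"]')
printf '%s ' "${cmd[@]}"; echo
```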

Local Testing

# Install libcudf-tests conda package
conda create -n libcudf-test -c rapidsai-nightly -c conda-forge libcudf-tests libcudf

# Run single test with memcheck
./ci/run_compute_sanitizer_test.sh memcheck AST_TEST

# Run single test with racecheck and gtest filter
./ci/run_compute_sanitizer_test.sh racecheck GROUPBY_TEST --gtest_filter="GroupByTest.Sum*"

Open Questions

Resource limits: Should we implement per-job timeouts to prevent hanging tests? The default GitHub Actions timeout is 6 hours. We probably need to manually skip any tests that we expect to exceed some limit, or possibly break them up into smaller test executables.
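If per-job timeouts prove necessary, one option besides GitHub Actions' per-job timeout-minutes setting is wrapping the sanitizer invocation in coreutils timeout. In the sketch below the limit is an arbitrary placeholder and sleep stands in for the real test:

```shell
# Sketch: bound a single run with coreutils 'timeout', which exits with
# status 124 when the command is killed for overrunning. 'sleep 5' stands in
# for the sanitizer invocation; the 1s limit is a placeholder.
if timeout 1s sleep 5; then
  status=ok
else
  rc=$?
  [[ ${rc} -eq 124 ]] && status=timed_out || status="failed(${rc})"
fi
echo "status=${status}"
```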
