TMT: run tests with GPUs #1101

Merged
merged 3 commits into from
Jun 20, 2025

Conversation

@lsm5 lsm5 commented Apr 2, 2025

This commit adds TMT test jobs, triggered via Packit, that fetch an instance with an NVIDIA GPU, as specified in `plans/no-rpm.fmf`; GPU availability can be verified in the gpu_info test result.

In addition, system tests (nocontainer), validate, and unit tests are also triggered via TMT.

Fixes: #1054

TODO:

  1. Enable bats-docker tests
  2. Resolve f41 validate test failures
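For context, requesting a GPU-backed instance is done through the plan's provision step. A minimal sketch of what such an FMF plan can look like follows; the key names come from the tmt hardware specification, but treat the exact values as illustrative assumptions rather than the actual contents of `plans/no-rpm.fmf`:

```yaml
# Sketch of a TMT plan requesting an NVIDIA GPU instance (assumed values)
summary: Run non-RPM tests on a GPU instance
provision:
    hardware:
        gpu:
            vendor-name: NVIDIA
execute:
    how: tmt
```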

Summary by Sourcery

Enable GPU-accelerated and comprehensive TMT-based test workflows via Packit and new FMF plans, updating configuration and test scripts to support the enhanced testing pipeline.

New Features:

  • Add TMT-driven GPU tests via Packit using new /plans/rpm and /plans/no-rpm FMF plans for Fedora and CentOS
  • Provide a bats-tests.sh script to manually run docker or nocontainer bats tests under TMT

Enhancements:

  • Consolidate Fedora Copr build targets into fedora-all in .packit.yaml
  • Standardize container build path resolution in container_build.sh

CI:

  • Configure Packit jobs to trigger TMT plans for system, validate, and unit tests

Tests:

  • Mock versioned image lookup in test_accel_image for stable unit testing
  • Update system help test to handle both rootless and root default store paths
  • Add FMF plan files to register TMT-based system and unit test runs

Contributor

sourcery-ai bot commented Apr 2, 2025

Reviewer's Guide

This PR configures Packit to trigger TMT tests on NVIDIA GPU instances by adding dedicated RPM and no-RPM jobs, updates unit and system tests for deterministic behavior, fixes container build paths, and provides a TMT orchestration script alongside FMF plans for automated test runs.

Class diagram for new and updated TMT test job configuration

classDiagram
    class PackitJob {
        +string job
        +string trigger
        +list packages
        +list targets
        +string tmt_plan
        +string identifier
        +bool skip_build
    }
    class FMFPlan {
        +string name
        +list tests
        +string hardware_requirements
    }
    PackitJob "*" -- "*" FMFPlan : uses

    class GPUInstance {
        +string type
        +string vendor
    }
    FMFPlan "1" -- "*" GPUInstance : requests

    %% Highlight new/modified jobs
    class PackitJob {
        <<new/modified>>
    }
    class FMFPlan {
        <<new/modified>>
    }

File-Level Changes

Updated Packit CI configuration to define TMT test jobs with GPU support
  • Replaced multiple Fedora targets with 'fedora-all'
  • Added 'tests' jobs with tmt_plan and identifiers for rpm/no-rpm scenarios
.packit.yaml
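A Packit `tests` job pointing at a TMT plan generally looks like the sketch below; the `tmt_plan` and `identifier` values mirror this PR's description, while the remaining fields are illustrative assumptions, not the exact contents of this repo's `.packit.yaml`:

```yaml
# Illustrative sketch of a Packit "tests" job, not the repo's actual config
jobs:
  - job: tests
    trigger: pull_request
    packages: [ramalama]
    targets:
      - fedora-all
    tmt_plan: "/plans/no-rpm"
    identifier: no-rpm
```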
Isolated GPU image logic in unit tests by mocking version checks
  • Decorated test_accel_image with patch to force versioned image fallback
  • Mocked attempt_to_use_versioned to False for independent testing
test/unit/test_common.py
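The mocking pattern can be sketched roughly as follows; `attempt_to_use_versioned` is the helper named in this PR, but the module layout and the `accel_image` body here are simplified stand-ins, not the real code:

```python
import types
from unittest.mock import patch

# Stand-in module: the real helper inspects the installed package version.
ramalama_common = types.SimpleNamespace(attempt_to_use_versioned=lambda: True)

def accel_image(config):
    # Simplified stand-in for the real tag-selection logic.
    tag = "0.9" if ramalama_common.attempt_to_use_versioned() else "latest"
    return f"quay.io/ramalama/rocm:{tag}"

# Forcing the check to False pins the test to the ":latest" fallback,
# so it no longer depends on which version happens to be installed.
with patch.object(ramalama_common, "attempt_to_use_versioned", return_value=False):
    result = accel_image({})

print(result)  # quay.io/ramalama/rocm:latest
```

Decorating the test function with `@patch(...)` achieves the same isolation as the context manager shown here.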
Enhanced system test to handle rootless vs. non-rootless environments
  • Wrapped default store assertion with rootless conditional
  • Fallback to '/var/lib/ramalama' when not rootless
test/system/015-help.bats
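In plain shell, the rootless-vs-root distinction amounts to the sketch below; the actual bats test uses the suite's own rootless helper rather than a bare `id -u` check:

```shell
# Sketch: the expected default store depends on whether we run as root.
if [ "$(id -u)" -eq 0 ]; then
    expected_store="/var/lib/ramalama"
else
    expected_store="${HOME}/.local/share/ramalama"
fi
echo "expected default store: ${expected_store}"
```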
Fixed container build script path resolution
  • Prefixed container-images path with './' in build loop
container_build.sh
Added TMT orchestration script and FMF plans for rpm/no-rpm workflows
  • Introduced bats-tests.sh for manual docker/nocontainer TMT runs
  • Added FMF plans for rpm and no-rpm testing
test/tmt/bats-tests.sh
plans/no-rpm.fmf
plans/rpm.fmf
test/tmt/no-rpm.fmf
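A wrapper like bats-tests.sh can be sketched as a thin suite selector, so the same entry point works both under TMT and in a local checkout; the make target names here are assumptions:

```shell
# Sketch of a suite-selecting wrapper; make target names are assumed.
set -eu
suite="${1:-nocontainer}"
case "$suite" in
    docker)      target="bats-docker" ;;
    nocontainer) target="bats-nocontainer" ;;
    *) echo "usage: $0 [docker|nocontainer]" >&2; exit 2 ;;
esac
echo "selected make target: ${target}"
# make "${target}"  # run the suite in a real invocation
```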

Possibly linked issues

  • GPU testing with TMT #1054: The PR adds TMT test jobs and configuration to run tests with GPUs, directly addressing the issue's goal.


@lsm5 lsm5 force-pushed the tmt-gpu branch 7 times, most recently from 8e6be74 to 8b8828a Compare April 2, 2025 14:08
lsm5 commented Apr 2, 2025

@ericcurtin @rhatdan we're able to access GPU instances via TMT, and that can be verified through the TMT log. See the /test/tmt/gpu_info results in the TMT log (check "Show passed tests" to see them).

But `make tests` is failing because it can't find the `llama-run` and `llama-bench` commands. How do I get those?

@lsm5 lsm5 linked an issue Apr 2, 2025 that may be closed by this pull request
rhatdan commented Apr 18, 2025

Look at:

.github/workflows/ci.yml: sudo ./container-images/scripts/build_llama_and_whisper.sh

This builds the released version of llama.cpp and whisper.cpp and installs them in the host or in a container.

@lsm5 lsm5 force-pushed the tmt-gpu branch 2 times, most recently from 1b9f459 to eaa58c5 Compare April 23, 2025 10:33
@lsm5 lsm5 force-pushed the tmt-gpu branch 4 times, most recently from 54cc323 to 5116408 Compare June 18, 2025 12:40
lsm5 commented Jun 18, 2025

I'm now seeing these two errors in bats-nocontainer:

not ok 11 [015] ramalama verify default store in 552ms
# (from function `bail-now' in file test/system/helpers.podman.bash, line 122,
#  from function `is' in file test/system/helpers.podman.bash, line 1016,
#  in test file test/system/015-help.bats, line 174)
#   `is "$output" ".*default: ${HOME}/.local/share/ramalama"  "Verify default store"' failed
#
# [13:04:52.219867373] # /var/ARTIFACTS/work-bats-nocontainerd837lq0d/plans/bats-nocontainer/tree/bin/ramalama --help
# [13:04:52.429558178] usage: ramalama [-h] [--container] [--debug] [--dryrun] [--engine ENGINE]
#                 [--image IMAGE] [--keep-groups] [--nocontainer] [--quiet]
#                 [--runtime {llama.cpp,vllm}] [--store STORE]
#                 [--use-model-store]
#                 {bench,benchmark,chat,client,containers,ps,convert,help,info,inspect,list,ls,login,logout,perplexity,pull,push,rag,rm,run,serve,stop,version} ...

----snip ramalama command output----

# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: Verify default store
# #| expected: '.*default: /root/.local/share/ramalama' (using expr)
# #|   actual: 'usage: ramalama [-h] [--container] [--debug] [--dryrun] [--engine ENGINE]'

and this one as well (it looks like there are issues accessing the URL):

not ok 37 [050] ramalama pull huggingface in 9460ms
# tags: distro-integration
# (from function `bail-now' in file test/system/helpers.podman.bash, line 122,
#  from function `die' in file test/system/helpers.podman.bash, line 848,
#  from function `run_ramalama' in file test/system/helpers.bash, line 186,
#  in test file test/system/050-pull.bats, line 80)
#   `run_ramalama pull hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0' failed
#
# [13:08:45.808278621] # /var/ARTIFACTS/work-bats-nocontainerd837lq0d/plans/bats-nocontainer/tree/bin/ramalama pull hf://Felladrin/gguf-smollm-360M-instruct-add-basics/smollm-360M-instruct-add-basics.IQ2_XXS.gguf
# [13:08:48.288509401] Downloading huggingface://Felladrin/gguf-smollm-360M-instruct-add-basics/smollm-360M-instruct-add-basics.IQ2_XXS.gguf:latest ...
# Trying to pull huggingface://Felladrin/gguf-smollm-360M-instruct-add-basics/smollm-360M-instruct-add-basics.IQ2_XXS.gguf:latest ...

 ---- snip similar looking messages----

# [13:08:54.407791981] NAME                                                                                             MODIFIED     SIZE
# hf://Felladrin/gguf-smollm-360M-instruct-add-basics/smollm-360M-instruct-add-basics.IQ2_XXS.gguf 1 second ago 196.31 MB
#
# [13:08:54.424280234] # /var/ARTIFACTS/work-bats-nocontainerd837lq0d/plans/bats-nocontainer/tree/bin/ramalama rm huggingface://Felladrin/gguf-smollm-360M-instruct-add-basics/smollm-360M-instruct-add-basics.IQ2_XXS.gguf
#
# [13:08:54.710105260] # /var/ARTIFACTS/work-bats-nocontainerd837lq0d/plans/bats-nocontainer/tree/bin/ramalama pull hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0
# [13:08:55.018142961] Downloading huggingface://TinyLlama/TinyLlama-1.1B-Chat-v1.0:latest ...
# Trying to pull huggingface://TinyLlama/TinyLlama-1.1B-Chat-v1.0:latest ...
# URL pull failed and huggingface-cli not available
# Error: Failed to pull model: HTTP Error 400: Bad Request
# [13:08:55.022949511] [ rc=1 (** EXPECTED 0 **) ]
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: exit code is 1; expected 0
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The unit tests are failing on:

>                   assert accel_image(config) == expected_result
E                   AssertionError: assert 'quay.io/ramalama/rocm:0.9' == 'quay.io/ramalama/rocm:latest'
E                     
E                     - quay.io/ramalama/rocm:latest
E                     ?                       ^^^^^^
E                     + quay.io/ramalama/rocm:0.9
E                     ?                       ^^^

See the detailed logs at: https://artifacts.dev.testing-farm.io/74a1da74-2417-4d94-ab38-e067214441d5/

lsm5 commented Jun 18, 2025

I see one issue was that python3-huggingface-hub wasn't installed.

@lsm5 lsm5 force-pushed the tmt-gpu branch 7 times, most recently from a342d35 to 42d7a7e Compare June 18, 2025 19:22
For the rootful case, the default store is at /var/lib/ramalama.

Signed-off-by: Lokesh Mandvekar <[email protected]>
This commit adds TMT test jobs triggered via Packit that fetches an
instance with NVIDIA GPU, specified in `plans/no-rpm.fmf`, and can be
verified in the gpu_info test result.

In addition, system tests (nocontainer), validate, and unit tests are
also triggered via TMT.

Fixes: containers#1054

TODO:
1. Enable bats-docker tests
2. Resolve f41 validate test failures

Signed-off-by: Lokesh Mandvekar <[email protected]>
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @lsm5 - I've reviewed your changes - here's some feedback:

  • plans/no-rpm.fmf and plans/rpm.fmf are added but empty—please populate them with the FMF metadata needed for TMT to pick up those test plans.
  • Replacing the two Fedora targets with fedora-all may pull in unintended variants—please verify that it matches the original scope of development and latest-stable.
  • The new bats-tests.sh script duplicates existing CI orchestration logic—consider reusing or refactoring current CI scripts to avoid maintaining parallel test runners.


lsm5 commented Jun 19, 2025

  • plans/no-rpm.fmf and plans/rpm.fmf are added but empty—please populate them with the FMF metadata needed for TMT to pick up those test plans.

They are not empty.

  • Replacing the two Fedora targets with fedora-all may pull in unintended variants—please verify that it matches the original scope of development and latest-stable.

This was intentional, because the validate test breaks on F41.

  • The new bats-tests.sh script duplicates existing CI orchestration logic—consider reusing or refactoring current CI scripts to avoid maintaining parallel test runners.

This is needed for TMT tests so that the scripts can also be run locally without any TMT environment. Ideally this config should live inside the Makefile, but that can come later.

lsm5 commented Jun 19, 2025

@ericcurtin @rhatdan @smooge PTAL. There's one commit from @sarroutbi from #1567 as well to fix a unit test issue.

rhatdan commented Jun 20, 2025

LGTM

@rhatdan rhatdan merged commit 3f87444 into containers:main Jun 20, 2025
20 of 21 checks passed
@lsm5 lsm5 deleted the tmt-gpu branch June 20, 2025 13:10
Successfully merging this pull request may close these issues.

GPU testing with TMT