-
Notifications
You must be signed in to change notification settings - Fork 62
[ci] Rework test_runs_on plumbing for release workflows. #1100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments, but looks good! will wait until runs in PR description finish
test_pytorch_wheels: | ||
if: ${{ needs.generate_target_to_run.outputs.test_runs_on != '' }} | ||
needs: [build_pytorch_wheels, generate_target_to_run] | ||
name: Test | ${{ inputs.amdgpu_family }} | ${{ inputs.test_runs_on }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could probably remove ${{ inputs.amdgpu_family }}
,as it seems redundant like so: https://github.com/ROCm/TheRock/actions/runs/16454016664/job/46508939246
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for build! the release workflow file has Release | gfx1151 | Python 3.12
and I think build can be build, while test can just be Test | ${{ inputs.test_runs_on }}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I tried a few variations on names here but settled for consistency across the workflows for now. Figured we could rename in a follow-up change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is an improvement to have a single source for this, I think this makes it way harder to manually dispatch e.g. .github/workflows/build_portable_linux_pytorch_wheels.yml
as one needs to know the runner label / string if I am not mistaken?
How about we default to |
Thinking through this a bit more... I think we could keep the existing TheRock/build_tools/github_actions/amdgpu_family_matrix.py Lines 31 to 40 in 6e7430e
|
Yes, that would be really valuable. Having in mind the latest discussion about moving runners and labels think it's best to only need to know the arch one wants to run on. Not sure if
➕ |
Splitting this off from #1100. Not tested recently.
…dows-release-plumbing
didn't have too much time last week but thinking about some refactoring right now and hoping to land some upgrades soon |
Synced, added some options/defaults, and kicked off a few more test runs (see updated PR description). I can also look back at
if we want. |
type: string | ||
default: linux-mi325-1gpu-ossci-rocm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned that having a default will result in building for one family from the choices but missing to set the correct label here and testing on the wrong family afterwards. Not having a default would mean to know / look up one more label. Though I am not blocking on this even though still think it would be nice to not know the labels at all.
If testing can be skipped here, how does this colludes with #1072?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I can try restoring the script we had and fixing it. We're still not running pytorch tests on Windows until this lands though, since gfx1151 is not triggering test jobs :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed some changes. I'll test them then re-request review once they are ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lovely thanks for the test!
options: | ||
- gfx110X-dgpu | ||
- gfx1151 | ||
- gfx120X-all | ||
- gfx94X-dcgpu | ||
- gfx950-dcgpu | ||
default: gfx94X-dcgpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also
- keep this as a freeform
string
field - add xfail families from https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py
I think limiting to only families that we have here makes sense though:
TheRock/build_tools/third_party/s3_management/manage.py
Lines 38 to 44 in 231dd86
PREFIXES = [ | |
"v2/gfx110X-dgpu", | |
"v2/gfx1151", | |
"v2/gfx120X-all", | |
"v2/gfx94X-dcgpu", | |
"v2/gfx950-dcgpu", | |
] |
(but we will want to expand that list, and now we have multiple locations to update)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, needing to update multiple locations is unfortunate but I agree that it makes sense to limit it here instead of having a freeform string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for all the work to improve this @ScottTodd!
Progress on #589 and #1097.
This changes
configure_target_run.py
to look for the target family in either the "inner family" or the "outer key", so it correctly chooses runners for gfx1151 instead of skipping that test configuration sincegfx115X
andgfx1151
do not match. I also considered explicit data flow oftest_runs_on
through the workflows, but we preferred to keep some automatic detection so developers triggering workflows do not need to manually line up test families with test runners.Recent test runs:
7.0.0rc20250804
: https://github.com/ROCm/TheRock/actions/runs/167877408577.0.0rc20250805
: https://github.com/ROCm/TheRock/actions/runs/167877474327.0.0rc20250805
: https://github.com/ROCm/TheRock/actions/runs/16787789191