[ci] Rework test_runs_on plumbing for release workflows. #1100

Merged: 11 commits, Aug 6, 2025

ScottTodd
Member

@ScottTodd ScottTodd commented Jul 22, 2025

Progress on #589 and #1097.

This changes configure_target_run.py to look for the target family in either the "inner family" or the "outer key", so it correctly chooses runners for gfx1151. Previously that test configuration was skipped because gfx115X and gfx1151 do not match. I also considered explicit data flow of test_runs_on through the workflows, but we preferred to keep some automatic detection so that developers triggering workflows do not need to manually line up test families with test runners.
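
A minimal sketch of the "inner family or outer key" lookup described above. This is not the actual configure_target_run.py: the function name `find_test_runs_on`, the `FAMILY_MATRIX` excerpt, and the outer key "gfx94x" are assumptions made for illustration; only the gfx115x entry and the runner labels come from this thread.

```python
# Hypothetical excerpt of the family matrix. The real data lives in
# build_tools/github_actions/amdgpu_family_matrix.py and may differ.
FAMILY_MATRIX = {
    "gfx94x": {  # assumed outer key
        "linux": {
            "test-runs-on": "linux-mi300-1gpu-ossci-rocm",
            "family": "gfx94X-dcgpu",
        },
    },
    "gfx115x": {
        "linux": {"test-runs-on": "", "family": "gfx1151"},
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },
}


def find_test_runs_on(target: str, platform: str, matrix: dict = FAMILY_MATRIX) -> str:
    """Return the runner label for `target`, or "" if no test runner is known."""
    wanted = target.lower()
    for outer_key, platforms in matrix.items():
        info = platforms.get(platform)
        if info is None:
            continue
        # Accept a match on either the outer key ("gfx115x") or the
        # inner family ("gfx1151"), compared case-insensitively.
        if wanted in (outer_key.lower(), info["family"].lower()):
            return info["test-runs-on"]
    return ""
```

With this shape, both `find_test_runs_on("gfx1151", "windows")` and `find_test_runs_on("gfx115X", "windows")` resolve to the same runner label, while an empty label still means "skip tests".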

Recent test runs:

Contributor

@geomin12 geomin12 left a comment

some comments, but looks good! will wait until runs in PR description finish

test_pytorch_wheels:
  if: ${{ needs.generate_target_to_run.outputs.test_runs_on != '' }}
  needs: [build_pytorch_wheels, generate_target_to_run]
  name: Test | ${{ inputs.amdgpu_family }} | ${{ inputs.test_runs_on }}
Contributor

you could probably remove ${{ inputs.amdgpu_family }}, as it seems redundant, like so: https://github.com/ROCm/TheRock/actions/runs/16454016664/job/46508939246

Contributor

same for build! the release workflow file has Release | gfx1151 | Python 3.12, and I think build can just be Build, while test can just be Test | ${{ inputs.test_runs_on }}

Member Author

Agreed. I tried a few variations on names here but settled for consistency across the workflows for now. Figured we could rename in a follow-up change.

Member

@marbre marbre left a comment

While it is an improvement to have a single source for this, I think this makes it way harder to manually dispatch e.g. .github/workflows/build_portable_linux_pytorch_wheels.yml as one needs to know the runner label / string if I am not mistaken?

@ScottTodd
Member Author

> While it is an improvement to have a single source for this, I think this makes it way harder to manually dispatch e.g. .github/workflows/build_portable_linux_pytorch_wheels.yml as one needs to know the runner label / string if I am not mistaken?

How about we default to gfx94X-dcgpu + linux-mi300-1gpu-ossci-rocm on Linux and gfx1151 + windows-strix-halo-gpu-rocm on Windows? We could also treat an empty string as "auto", or have an empty string mean no testing and default to "auto".
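
A rough sketch of the second variant (empty string means no testing, the input defaults to "auto"). The function name `resolve_test_runs_on` and the `DEFAULTS` table are hypothetical, and the comment does not define what "auto" does; here it is assumed to pick the per-platform default pair proposed above.

```python
# Per-platform (family, runner label) defaults proposed in this comment.
DEFAULTS = {
    "linux": ("gfx94X-dcgpu", "linux-mi300-1gpu-ossci-rocm"),
    "windows": ("gfx1151", "windows-strix-halo-gpu-rocm"),
}


def resolve_test_runs_on(platform: str, requested: str = "auto") -> str:
    """Resolve a workflow input to a runner label; "" means skip testing."""
    if requested == "":
        # Explicit empty string: the caller opted out of testing.
        return ""
    if requested == "auto":
        # Assumption: "auto" falls back to the platform default label.
        _family, label = DEFAULTS[platform]
        return label
    # Anything else is treated as an explicit runner label.
    return requested
```

The trade-off is that someone dispatching the workflow manually only needs to know a label when overriding the default.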

@ScottTodd
Member Author

ScottTodd commented Jul 23, 2025

Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info. That would be gfx115x and gfx1151 here:

    "gfx115x": {
        "linux": {
            "test-runs-on": "",
            "family": "gfx1151",
        },
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },

We could also have the script log its inputs and what it is considering for each option, since debugging without that information is rather difficult.
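
One way the logging idea could look, as a sketch only. The function `pick_runner`, the logger name, and the message wording are all hypothetical; the matrix shape follows the snippet above.

```python
import logging

logger = logging.getLogger("configure_target_run")


def pick_runner(target: str, platform: str, matrix: dict) -> str:
    """Pick a runner label, logging the inputs and every candidate considered."""
    logger.info("inputs: target=%r platform=%r", target, platform)
    for outer_key, platforms in matrix.items():
        info = platforms.get(platform)
        if info is None:
            logger.info("skip %r: no %r entry", outer_key, platform)
            continue
        # Match on the outer key or the inner family, case-insensitively.
        matched = target.lower() in (outer_key.lower(), info["family"].lower())
        logger.info(
            "consider %r: family=%r test-runs-on=%r matched=%s",
            outer_key, info["family"], info["test-runs-on"], matched,
        )
        if matched:
            return info["test-runs-on"]
    logger.info("no match for %r; tests will be skipped", target)
    return ""
```

Calling `logging.basicConfig(level=logging.INFO)` before `pick_runner(...)` writes one line per candidate to stderr, which should make a silently skipped test job easier to trace.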

@marbre
Member

marbre commented Jul 24, 2025

> Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info. That would be gfx115x and gfx1151 here:
>
>     "gfx115x": {
>         "linux": {
>             "test-runs-on": "",
>             "family": "gfx1151",
>         },
>         "windows": {
>             "test-runs-on": "windows-strix-halo-gpu-rocm",
>             "family": "gfx1151",
>         },
>     },

Yes, that would be really valuable. Keeping in mind the latest discussion about moving runners and labels, I think it's best to only need to know the arch one wants to run on. I'm not sure family and family_info are the right names, though. Until now gfx1151 has been a family, but I rather think it is the target and gfx115x should be the family?

> We could also have the script log its inputs and what it is considering for each option, since debugging without that information is rather difficult.

ScottTodd added a commit that referenced this pull request Jul 24, 2025
Splitting this off from #1100.

Not tested recently.
@ScottTodd
Member Author

I'm just about ready to pick this up again. Planning to merge and resolve conflicts then take another look at the scripts and plumbing. @geomin12 were you able to find time to think about #1097 or work on any other refactoring approaches?

@geomin12
Contributor

> I'm just about ready to pick this up again. Planning to merge and resolve conflicts then take another look at the scripts and plumbing. @geomin12 were you able to find time to think about #1097 or work on any other refactoring approaches?

didn't have too much time last week but thinking about some refactoring right now and hoping to land some upgrades soon

@ScottTodd
Member Author

Synced, added some options/defaults, and kicked off a few more test runs (see updated PR description).

I can also look back at

> Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info.

if we want.

@ScottTodd ScottTodd marked this pull request as ready for review August 5, 2025 17:30
@ScottTodd ScottTodd requested review from geomin12 and marbre August 5, 2025 17:30
@ScottTodd ScottTodd requested a review from geomin12 August 5, 2025 20:50
type: string
default: linux-mi325-1gpu-ossci-rocm
Member

I am a bit concerned that having a default will result in someone building for one family from the choices but forgetting to set the correct label here, and then testing on the wrong family afterwards. Not having a default would mean needing to know or look up one more label. I am not blocking on this, though I still think it would be nice to not need to know the labels at all.

If testing can be skipped here, how does this collide with #1072?

Member Author

Okay, I can try restoring the script we had and fixing it. We're still not running pytorch tests on Windows until this lands though, since gfx1151 is not triggering test jobs :/

Member Author

Pushed some changes. I'll test them then re-request review once they are ready.

Member

Thanks!

@ScottTodd ScottTodd marked this pull request as draft August 6, 2025 20:24
@ScottTodd ScottTodd marked this pull request as ready for review August 6, 2025 21:26
@ScottTodd ScottTodd requested review from geomin12 and marbre August 6, 2025 21:26
Contributor

@geomin12 geomin12 left a comment

lovely thanks for the test!

Comment on lines +39 to +45
options:
  - gfx110X-dgpu
  - gfx1151
  - gfx120X-all
  - gfx94X-dcgpu
  - gfx950-dcgpu
default: gfx94X-dcgpu
Member Author

We could also

  1. keep this as a freeform string field
  2. add xfail families from https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py

I think limiting to only families that we have here makes sense though:

PREFIXES = [
    "v2/gfx110X-dgpu",
    "v2/gfx1151",
    "v2/gfx120X-all",
    "v2/gfx94X-dcgpu",
    "v2/gfx950-dcgpu",
]

(but we will want to expand that list, and now we have multiple locations to update)
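
Since the family list now lives in more than one place, a small check could flag drift between them. The sketch below is hypothetical: `find_drift` and `WORKFLOW_OPTIONS` are illustrative names, and in practice the options would have to be read from the workflow YAML rather than hard-coded. It uses `str.removeprefix`, so it assumes Python 3.9 or newer.

```python
# Copied from the comment above.
PREFIXES = [
    "v2/gfx110X-dgpu",
    "v2/gfx1151",
    "v2/gfx120X-all",
    "v2/gfx94X-dcgpu",
    "v2/gfx950-dcgpu",
]

# The `options:` list under review, hard-coded here for illustration.
WORKFLOW_OPTIONS = [
    "gfx110X-dgpu",
    "gfx1151",
    "gfx120X-all",
    "gfx94X-dcgpu",
    "gfx950-dcgpu",
]


def find_drift(prefixes: list[str], options: list[str]) -> tuple[set[str], set[str]]:
    """Return (families missing from options, families missing from prefixes)."""
    from_prefixes = {p.removeprefix("v2/") for p in prefixes}
    from_options = set(options)
    return from_prefixes - from_options, from_options - from_prefixes


missing_in_options, missing_in_prefixes = find_drift(PREFIXES, WORKFLOW_OPTIONS)
```

Both sets are empty while the two lists agree; a non-empty set names the families that were added in one location only.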

Member

Yeah, needing to update multiple locations is unfortunate but I agree that it makes sense to limit it here instead of having a freeform string.

Member

@marbre marbre left a comment

Looks good to me. Thanks for all the work to improve this @ScottTodd!

@ScottTodd ScottTodd merged commit 21439d0 into main Aug 6, 2025
30 of 32 checks passed
@ScottTodd ScottTodd deleted the users/scotttodd/windows-release-plumbing branch August 6, 2025 23:42
@github-project-automation github-project-automation bot moved this from TODO to Done in TheRock Triage Aug 6, 2025