[ci] Rework test_runs_on plumbing for release workflows. #1100

ScottTodd · 2025-07-22T20:57:37Z

Progress on #589 and #1097.

This changes configure_target_run.py to look for the target family in either the "inner family" or the "outer key", so it correctly chooses runners for gfx1151 instead of skipping that test configuration since gfx115X and gfx1151 do not match. I also considered explicit data flow of test_runs_on through the workflows, but we preferred to keep some automatic detection so developers triggering workflows do not need to manually line up test families with test runners.
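The lookup described above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of configure_target_run.py: the function name `find_test_runs_on` and the `FAMILY_MATRIX` excerpt are assumptions, with the matrix shape copied from the amdgpu_family_matrix.py snippet quoted later in this thread.

```python
# Hypothetical excerpt; mirrors the shape of the amdgpu_family_matrix.py
# snippet quoted in this thread. Only the gfx115x entry is reproduced.
FAMILY_MATRIX = {
    "gfx115x": {
        "linux": {"test-runs-on": "", "family": "gfx1151"},
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },
}


def find_test_runs_on(target, platform, matrix=FAMILY_MATRIX):
    """Return the runner label for `target`, or "" if none is configured.

    `target` may name either the outer key (e.g. "gfx115X") or the inner
    "family" value (e.g. "gfx1151"). Comparison is case-insensitive.
    """
    wanted = target.lower()
    for outer_key, platforms in matrix.items():
        info = platforms.get(platform)
        if info is None:
            # This family has no entry for the requested platform.
            continue
        if wanted in (outer_key.lower(), info["family"].lower()):
            return info["test-runs-on"]
    return ""


if __name__ == "__main__":
    print(find_test_runs_on("gfx1151", "windows"))
    print(find_test_runs_on("gfx115X", "windows"))
```

An empty return value is meant to signal "no runner, skip the test job", which matches how the workflows compare test_runs_on against an empty string.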

Recent test runs:

"Build Windows PyTorch Wheels" with default inputs and ROCm version 7.0.0rc20250804: https://github.com/ROCm/TheRock/actions/runs/16787740857
"Build Portable Linux PyTorch Wheels" with default inputs and ROCm version 7.0.0rc20250805: https://github.com/ROCm/TheRock/actions/runs/16787747432
"Build Portable Linux PyTorch Wheels" with gfx110X-dgpu and ROCm version 7.0.0rc20250805: https://github.com/ROCm/TheRock/actions/runs/16787789191

geomin12

some comments, but looks good! will wait until runs in PR description finish

geomin12 · 2025-07-22T21:48:40Z

.github/workflows/build_portable_linux_pytorch_wheels.yml

  test_pytorch_wheels:
-    if: ${{ needs.generate_target_to_run.outputs.test_runs_on != '' }}
-    needs: [build_pytorch_wheels, generate_target_to_run]
+    name: Test | ${{ inputs.amdgpu_family }} | ${{ inputs.test_runs_on }}


you could probably remove ${{ inputs.amdgpu_family }}, as it seems redundant, like so: https://github.com/ROCm/TheRock/actions/runs/16454016664/job/46508939246

same for build! the release workflow file has Release | gfx1151 | Python 3.12 and I think build can be build, while test can just be Test | ${{ inputs.test_runs_on }}

Agreed. I tried a few variations on names here but settled for consistency across the workflows for now. Figured we could rename in a follow-up change.

marbre

While it is an improvement to have a single source for this, I think this makes it way harder to manually dispatch e.g. .github/workflows/build_portable_linux_pytorch_wheels.yml as one needs to know the runner label / string if I am not mistaken?

ScottTodd · 2025-07-22T23:03:25Z

While it is an improvement to have a single source for this, I think this makes it way harder to manually dispatch e.g. .github/workflows/build_portable_linux_pytorch_wheels.yml as one needs to know the runner label / string if I am not mistaken?

How about we default to gfx94X-dcgpu + linux-mi300-1gpu-ossci-rocm on Linux and gfx1151 + windows-strix-halo-gpu-rocm on Windows? We could also treat an empty string as "auto", or have an empty string mean no testing and default to "auto".

ScottTodd · 2025-07-23T15:07:43Z

Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info. That would be gfx115x and gfx1151 here:

TheRock/build_tools/github_actions/amdgpu_family_matrix.py

Lines 31 to 40 in 6e7430e

    "gfx115x": {
        "linux": {
            "test-runs-on": "",
            "family": "gfx1151",
        },
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },

We could also have the script log its inputs and what it is considering for each option, since debugging without that information is rather difficult.
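One way the logging idea could look, as a rough sketch only: the function name `choose_runner`, the logger name, and the `EXAMPLE_MATRIX` excerpt are all hypothetical, and the real script may structure its inputs differently. The point is that every candidate is logged with the reason it was accepted or skipped.

```python
import logging

logger = logging.getLogger("configure_target_run_sketch")  # hypothetical name

# Hypothetical excerpt in the same shape as the matrix quoted above.
EXAMPLE_MATRIX = {
    "gfx115x": {
        "linux": {"test-runs-on": "", "family": "gfx1151"},
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },
}


def choose_runner(target, platform, matrix=EXAMPLE_MATRIX):
    """Pick a runner label, logging the inputs and every candidate considered."""
    logger.info("inputs: target=%r platform=%r", target, platform)
    for outer_key, platforms in matrix.items():
        info = platforms.get(platform)
        if info is None:
            logger.info("skipping key=%r: no %r entry", outer_key, platform)
            continue
        matched = target.lower() in (outer_key.lower(), info["family"].lower())
        logger.info(
            "considering key=%r family=%r test-runs-on=%r matched=%s",
            outer_key, info["family"], info["test-runs-on"], matched,
        )
        if matched:
            return info["test-runs-on"]
    logger.info("no match for target=%r; tests would be skipped", target)
    return ""


if __name__ == "__main__":
    # INFO-level messages are hidden by default, so enable them explicitly.
    logging.basicConfig(level=logging.INFO)
    choose_runner("gfx1151", "windows")
```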

marbre · 2025-07-24T22:00:21Z

Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info. That would be gfx115x and gfx1151 here:

TheRock/build_tools/github_actions/amdgpu_family_matrix.py

Lines 31 to 40 in 6e7430e

    "gfx115x": {
        "linux": {
            "test-runs-on": "",
            "family": "gfx1151",
        },
        "windows": {
            "test-runs-on": "windows-strix-halo-gpu-rocm",
            "family": "gfx1151",
        },
    },

Yes, that would be really valuable. Keeping in mind the latest discussion about moving runners and labels, I think it's best to only need to know the arch one wants to run on. I'm not sure family and family_info are the right names, though. Until now gfx1151 has been a family, but I rather think it is the target and gfx115x should be the family?

We could also have the script log its inputs and what it is considering for each option, since debugging without that information is rather difficult.

➕

ScottTodd · 2025-07-28T18:06:28Z

I'm just about ready to pick this up again. Planning to merge and resolve conflicts then take another look at the scripts and plumbing. @geomin12 were you able to find time to think about #1097 or work on any other refactoring approaches?

geomin12 · 2025-07-28T18:15:55Z

I'm just about ready to pick this up again. Planning to merge and resolve conflicts then take another look at the scripts and plumbing. @geomin12 were you able to find time to think about #1097 or work on any other refactoring approaches?

didn't have too much time last week but thinking about some refactoring right now and hoping to land some upgrades soon

ScottTodd · 2025-08-04T22:58:00Z

Synced, added some options/defaults, and kicked off a few more test runs (see updated PR description).

I can also look back at

Thinking through this a bit more... I think we could keep the existing build_tools/github_actions/configure_target_run.py, if we change it to look at both the underlying family and the top level family_info.

if we want.

.github/workflows/build_portable_linux_pytorch_wheels.yml

marbre · 2025-08-06T11:51:49Z

.github/workflows/release_portable_linux_pytorch_wheels.yml

        type: string
+        default: linux-mi325-1gpu-ossci-rocm


I am a bit concerned that having a default will result in building for one family from the choices but forgetting to set the correct label here, and then testing on the wrong family afterwards. Not having a default would mean having to know or look up one more label. I am not blocking on this, though I still think it would be nice not to need to know the labels at all.

If testing can be skipped here, how does this collide with #1072?

Okay, I can try restoring the script we had and fixing it. We're still not running pytorch tests on Windows until this lands though, since gfx1151 is not triggering test jobs :/

Pushed some changes. I'll test them then re-request review once they are ready.

geomin12

lovely thanks for the test!

ScottTodd · 2025-08-06T23:36:00Z

.github/workflows/build_portable_linux_pytorch_wheels.yml

+        options:
+          - gfx110X-dgpu
+          - gfx1151
+          - gfx120X-all
+          - gfx94X-dcgpu
+          - gfx950-dcgpu
+        default: gfx94X-dcgpu


We could also

- keep this as a freeform string field
- add xfail families from https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py

I think limiting to only families that we have here makes sense though:

TheRock/build_tools/third_party/s3_management/manage.py

Lines 38 to 44 in 231dd86

    PREFIXES = [
        "v2/gfx110X-dgpu",
        "v2/gfx1151",
        "v2/gfx120X-all",
        "v2/gfx94X-dcgpu",
        "v2/gfx950-dcgpu",
    ]

(but we will want to expand that list, and now we have multiple locations to update)

Yeah, needing to update multiple locations is unfortunate but I agree that it makes sense to limit it here instead of having a freeform string.
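Since the family list now lives in more than one place, a small consistency check could catch drift between them. The sketch below is only an illustration: `WORKFLOW_FAMILY_OPTIONS` and `find_drift` are hypothetical names, and both lists are copied by hand from the snippets in this thread rather than parsed from the real workflow file or manage.py.

```python
# Copied from the manage.py snippet quoted above.
PREFIXES = [
    "v2/gfx110X-dgpu",
    "v2/gfx1151",
    "v2/gfx120X-all",
    "v2/gfx94X-dcgpu",
    "v2/gfx950-dcgpu",
]

# Copied from the workflow `options:` list in this review (hypothetical name).
WORKFLOW_FAMILY_OPTIONS = [
    "gfx110X-dgpu",
    "gfx1151",
    "gfx120X-all",
    "gfx94X-dcgpu",
    "gfx950-dcgpu",
]


def find_drift(options, prefixes):
    """Return (options without a prefix, prefix families missing from options)."""
    # Strip the leading "v2/" style directory from each prefix.
    prefix_families = {p.split("/", 1)[1] for p in prefixes}
    option_set = set(options)
    return (
        sorted(option_set - prefix_families),
        sorted(prefix_families - option_set),
    )


if __name__ == "__main__":
    extra, missing = find_drift(WORKFLOW_FAMILY_OPTIONS, PREFIXES)
    if extra or missing:
        raise SystemExit(f"family lists differ: extra={extra} missing={missing}")
    print("family lists agree")
```

A check like this could run in CI or as a unit test, so that adding a family in one location without the other fails early.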

marbre

Looks good to me. Thanks for all the work to improve this @ScottTodd!

ScottTodd added 4 commits July 22, 2025 11:40

Add more documentation to fetch_package_targets.py.

caf9445

Plumb test_runs_on through Windows release workflows.

661b390

Write torch version to github action outputs.

f5f51e9

Cleanup, port changes to Linux too.

e0d25c8

github-project-automation bot added this to TheRock Triage Jul 22, 2025

github-project-automation bot moved this to TODO in TheRock Triage Jul 22, 2025

ScottTodd requested review from marbre and geomin12 July 22, 2025 21:08

geomin12 reviewed Jul 22, 2025

marbre reviewed Jul 22, 2025

marbre mentioned this pull request Jul 24, 2025

Add gating mechanism to ensure PyTorch wheels pass tests before releasing #1110

Open

ScottTodd mentioned this pull request Jul 24, 2025

[torch] Replace inline bash with write_torch_version.py. #1120

Merged

ScottTodd added a commit that referenced this pull request Jul 24, 2025

[torch] Replace inline bash with write_torch_version.py. (#1120)

1b403be

Splitting this off from #1100. Not tested recently.

Merge remote-tracking branch 'upstream/main' into users/scotttodd/win…

58f28cc

…dows-release-plumbing

ScottTodd added 3 commits July 28, 2025 11:28

Delete write_torch_version.py (replaced with plural version).

5c29800

Merge remote-tracking branch 'upstream/main' into users/scotttodd/win…

7cee3eb

…dows-release-plumbing

Add choices and defaults for amdgpu_family and test_runs_on.

f668ede

Set default python version for Windows too.

7106d5b

ScottTodd marked this pull request as ready for review August 5, 2025 17:30

ScottTodd requested review from geomin12 and marbre August 5, 2025 17:30

geomin12 reviewed Aug 5, 2025

Expand on descriptions for test_runs_on workflow inputs.

3aeb303

ScottTodd requested a review from geomin12 August 5, 2025 20:50

geomin12 approved these changes Aug 5, 2025

marbre reviewed Aug 6, 2025

Restore fetch_package_targets.py, fix for gfx1151.

eb7a048

ScottTodd marked this pull request as draft August 6, 2025 20:24

ScottTodd marked this pull request as ready for review August 6, 2025 21:26

ScottTodd requested review from geomin12 and marbre August 6, 2025 21:26

geomin12 approved these changes Aug 6, 2025

ScottTodd commented Aug 6, 2025

marbre approved these changes Aug 6, 2025

ScottTodd merged commit 21439d0 into main Aug 6, 2025
30 of 32 checks passed

ScottTodd deleted the users/scotttodd/windows-release-plumbing branch August 6, 2025 23:42

github-project-automation bot moved this from TODO to Done in TheRock Triage Aug 6, 2025
