richardsliu
Contributor

Why are these changes needed?

Adding TPU usage metrics for tensorcore_utilization and hbm_utilization to reporter_agent.
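
As a rough illustration of what reporting these two values could look like, the sketch below models per-chip usage and flattens it into gauge-style records. The class `TpuChipUsage`, the function `to_gauge_records`, the `tpu_`-prefixed metric names, and the tag keys are all hypothetical and are not taken from this PR; the actual reporter_agent code may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TpuChipUsage:
    """Hypothetical per-chip usage record (values assumed to be percentages)."""

    chip_index: int
    tensorcore_utilization: float
    hbm_utilization: float


# (metric name, value, tags); a stand-in for whatever gauge type the agent uses.
GaugeRecord = Tuple[str, float, Dict[str, str]]


def to_gauge_records(usages: List[TpuChipUsage], node_ip: str) -> List[GaugeRecord]:
    """Flatten per-chip usage into one record per metric per chip."""
    records: List[GaugeRecord] = []
    for usage in usages:
        # Tag keys are assumptions made for this sketch.
        tags = {"ip": node_ip, "tpu_index": str(usage.chip_index)}
        records.append(
            ("tpu_tensorcore_utilization", usage.tensorcore_utilization, dict(tags))
        )
        records.append(("tpu_hbm_utilization", usage.hbm_utilization, dict(tags)))
    return records
```

Each chip yields two records, so a host with four chips would produce eight records per collection cycle under this layout.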

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

richardsliu and others added 6 commits June 9, 2025 21:17
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
richardsliu and others added 4 commits June 9, 2025 14:18
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Contributor

@alanwguo alanwguo left a comment

Nice!

@can-anyscale, does this PR interfere at all with your OTEL migration PR?

@can-anyscale
Collaborator

Thanks for tagging, @alanwguo. Not at all, it's looking great.

Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
@richardsliu
Contributor Author

Added a few additional runtime metrics, please take another look.

Contributor

@alanwguo alanwguo left a comment

Can you also add some visualizations of these metrics to default_dashboard_panels.py?

That would add them to the core Grafana dashboard used in the Ray Dashboard.
We can also embed them in the Ray Dashboard UI via the Metrics.tsx file.

This can be done as a follow-up PR as well.

Overall, this PR looks good. I'm testing e2e before approving.

@richardsliu
Contributor Author

Sounds good, I'll modify the dashboard in a follow-up PR.

@masoudcharkhabi masoudcharkhabi added the k8s-proj K8s and Ray OSS label Jun 12, 2025
@alanwguo alanwguo added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
@alanwguo
Contributor

@jjyao can you review and merge?

@jjyao jjyao unassigned alanwguo and jjyao Jun 12, 2025
@jjyao
Collaborator

jjyao commented Jun 12, 2025

@can-anyscale @MengjinYan could you review this PR, since you are working on metrics right now?

Collaborator

@can-anyscale can-anyscale left a comment

try:
    for family in text_string_to_metric_families(metrics):
        for sample in family.samples:
            if sample.name == "memory_bandwidth_utilization":
Collaborator

It seems odd to have all of these sample names hardcoded here. How often do these names change? If they change, will we end up collecting malformed metrics?
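
One way to address this kind of concern is to keep the expected sample names in a single lookup table and skip anything that is not in it, so a renamed sample is dropped rather than recorded under the wrong field. The sketch below is only an illustration and not the fix applied in this PR: `Sample` is a plain stand-in for the sample objects yielded by `text_string_to_metric_families`, and the table `SAMPLE_NAME_TO_FIELD`, the function `collect_tpu_fields`, and the `accelerator_id` label are assumptions.

```python
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    """Stand-in for a parsed Prometheus sample (name, labels, value)."""

    name: str
    labels: Dict[str, str]
    value: float


# Hypothetical table: exported sample name -> internal field name.
# A rename upstream then becomes a one-line change here.
SAMPLE_NAME_TO_FIELD = {
    "tensorcore_utilization": "tensorcore_utilization",
    "hbm_utilization": "hbm_utilization",
    "memory_bandwidth_utilization": "memory_bandwidth_utilization",
}


def collect_tpu_fields(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Group known samples by chip; ignore sample names missing from the table."""
    per_chip: Dict[str, Dict[str, float]] = {}
    for sample in samples:
        field = SAMPLE_NAME_TO_FIELD.get(sample.name)
        if field is None:
            continue  # unknown or renamed sample: skip instead of mis-recording
        # The label key used to identify a chip is an assumption of this sketch.
        chip = sample.labels.get("accelerator_id", "unknown")
        per_chip.setdefault(chip, {})[field] = float(sample.value)
    return per_chip
```

With this layout, an unexpected sample name produces a missing field for that chip rather than a malformed one, which is easier to detect downstream.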

Contributor Author

Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
@richardsliu
Contributor Author

@can-anyscale Fixed, please take another look.

Collaborator

@can-anyscale can-anyscale left a comment

just one additional question

Signed-off-by: Richard Liu <[email protected]>
@richardsliu
Contributor Author

@can-anyscale Done.

Collaborator

@can-anyscale can-anyscale left a comment

Thanks, did you test this end-to-end and see the right metric reported, etc.?

@richardsliu
Contributor Author

Yes, I can see the metrics reported in Prometheus. I am adding the dashboard changes in a follow-up PR.

@can-anyscale can-anyscale merged commit b030d5b into ray-project:master Jun 14, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in K8s and Ray (go/k8s-ray-oss) Jun 14, 2025
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Adding TPU usage metrics for tensorcore_utilization and hbm_utilization
to reporter_agent.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've included any doc changes needed for https://docs.ray.io/en/master/.
   - [X] I've added any new APIs to the API Reference. For example, if I added a
     method in Tune, I've added it in `doc/source/tune/api/` under the
     corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [X] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Labels
go add ONLY when ready to merge, run all tests k8s-proj K8s and Ray OSS
Projects
Status: Done
6 participants