Add tpu usage metrics to reporter_agent #53678

richardsliu · 2025-06-09T21:16:50Z

Why are these changes needed?

Adding TPU usage metrics for tensorcore_utilization and hbm_utilization to reporter_agent.

Related issue number

Checks

- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [X] I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in `doc/source/tune/api/` under the
  corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [X] Unit tests
  - [X] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Richard Liu <[email protected]>

alanwguo

Nice!

@can-anyscale, does this PR interfere at all with your OTEL migration PR?

python/ray/dashboard/consts.py

python/ray/dashboard/modules/reporter/reporter_agent.py

can-anyscale · 2025-06-10T17:43:37Z

Thanks for tagging me, @alanwguo. Not at all; this looks great.

Signed-off-by: Richard Liu <[email protected]>

richardsliu · 2025-06-10T22:43:50Z

Added a few additional runtime metrics, please take another look.

alanwguo

Can you also add some visualizations of these metrics to default_dashboard_panels.py?

This will add this to the core grafana dashboard used in the ray dashboard.
We can also add it to be embedded into the Ray Dashboard UI in the Metrics.tsx file.

This can be done as a follow-up PR as well.

Overall, this PR looks good. I'm testing e2e before approving.

richardsliu · 2025-06-11T04:02:48Z

Sounds good, I'll modify the dashboard in a follow up PR.

alanwguo · 2025-06-12T23:28:19Z

@jjyao can you review and merge?

jjyao · 2025-06-12T23:37:19Z

@can-anyscale @MengjinYan could you review this PR since you are working on metrics right now.

can-anyscale

python/ray/dashboard/modules/reporter/reporter_agent.py

can-anyscale · 2025-06-12T23:57:25Z

python/ray/dashboard/modules/reporter/reporter_agent.py

+        try:
+            for family in text_string_to_metric_families(metrics):
+                for sample in family.samples:
+                    if sample.name == "memory_bandwidth_utilization":


It seems odd to have all of these sample names hardcoded here. How often do these names change? If they change, will we end up collecting malformed metrics?

The schema should not change: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-tpu
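One way to address the hardcoding concern is to keep the sample names in a single lookup table and skip anything unrecognized, so a renamed sample is dropped rather than reported under the wrong gauge. The sketch below is illustrative only and is not the code in this PR. The PR parses with `text_string_to_metric_families`; to stay dependency-free, this sketch swaps in a small hand-rolled parser for simple exposition lines. The gauge names on the right-hand side of the table, the `accelerator_id` label, and both helper functions are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lookup table. The keys are the sample names mentioned in this
# PR thread; the values are made-up gauge names used only for illustration.
SAMPLE_TO_GAUGE: Dict[str, str] = {
    "tensorcore_utilization": "tpu_tensorcore_utilization",
    "hbm_utilization": "tpu_hbm_utilization",
    "memory_bandwidth_utilization": "tpu_memory_bandwidth_utilization",
}

# Matches `name{labels} value` or `name value` on one exposition line.
_SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
)
# Simplified label matcher: does not handle escaped quotes inside values.
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(text: str) -> List[Sample]:
    """Parse simple Prometheus text lines into (name, labels, value)."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser cannot read
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # skip samples whose value is not numeric
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


def collect_tpu_gauges(text: str) -> List[Sample]:
    """Keep only known samples, renamed through the lookup table."""
    records: List[Sample] = []
    for name, labels, value in parse_samples(text):
        gauge = SAMPLE_TO_GAUGE.get(name)
        if gauge is None:
            continue  # unknown or renamed sample: drop instead of guessing
        records.append((gauge, labels, value))
    return records
```

With this shape, a schema change shows up as a missing metric rather than a malformed one, and adding a sample means adding one table entry.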

python/ray/dashboard/modules/reporter/reporter_agent.py

python/ray/dashboard/modules/reporter/tests/test_reporter.py

Signed-off-by: Richard Liu <[email protected]>

richardsliu · 2025-06-13T21:31:28Z

@can-anyscale Fixed, please take another look.

can-anyscale

just one additional question

python/ray/dashboard/modules/reporter/reporter_agent.py

Signed-off-by: Richard Liu <[email protected]>

richardsliu · 2025-06-13T23:03:50Z

@can-anyscale Done.

can-anyscale

Thanks, did you test this end-to-end and see the right metric reported, etc.?

richardsliu · 2025-06-14T00:49:32Z

Yes, I can see the metrics reported in Prometheus. I am adding the dashboard changes in a follow up PR.
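For anyone who wants to repeat this end-to-end check, a rough sketch follows. It builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and extracts values from the JSON response. The metric name `ray_tpu_tensorcore_utilization`, the base URL, and both helper functions are assumptions for illustration; the actual exported name depends on how the agent registers the gauge.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def extract_vector(payload: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs from an instant-vector response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data.get("resultType") != "vector":
        raise ValueError(f"unexpected result type: {data.get('resultType')}")
    # Each result carries its value as [unix_timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


if __name__ == "__main__":
    # Hypothetical metric name and a local Prometheus address.
    print(build_query_url("http://localhost:9090", "ray_tpu_tensorcore_utilization"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `extract_vector` should return one entry per reporting TPU chip if the metric is being scraped.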

## Why are these changes needed?

Adding TPU usage metrics for tensorcore_utilization and hbm_utilization to reporter_agent.

## Related issue number

## Checks

- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [X] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [X] Unit tests
  - [X] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: elliot-barn <[email protected]>

richardsliu and others added 6 commits June 9, 2025 21:17

add TPU metrics

64f4464

Signed-off-by: Richard Liu <[email protected]>

remove log

d9a9dd3

Signed-off-by: Richard Liu <[email protected]>

fix bugs

140530a

Signed-off-by: Richard Liu <[email protected]>

handle device plugin failures

a0ce742

Signed-off-by: Richard Liu <[email protected]>

format

f907ee0

Signed-off-by: Richard Liu <[email protected]>

catch exception

913ef74

Signed-off-by: Richard Liu <[email protected]>

richardsliu force-pushed the metric branch from 506c196 to 913ef74 Compare June 9, 2025 21:17

richardsliu and others added 4 commits June 9, 2025 14:18

Merge branch 'master' into metric

ad3fea0

Signed-off-by: Richard Liu <[email protected]>

fix

1736fbc

Signed-off-by: Richard Liu <[email protected]>

add test

0c4563c

Signed-off-by: Richard Liu <[email protected]>

add test

a454867

Signed-off-by: Richard Liu <[email protected]>

alanwguo reviewed Jun 10, 2025

richardsliu added 2 commits June 10, 2025 21:37

add comments

71431f0

Signed-off-by: Richard Liu <[email protected]>

add runtime metrics

bf311dd

Signed-off-by: Richard Liu <[email protected]>

alanwguo reviewed Jun 10, 2025

masoudcharkhabi added the k8s-proj K8s and Ray OSS label Jun 12, 2025

github-project-automation bot added this to K8s and Ray (go/k8s-ray-oss) Jun 12, 2025

masoudcharkhabi moved this to In Progress in K8s and Ray (go/k8s-ray-oss) Jun 12, 2025

masoudcharkhabi assigned masoudcharkhabi, jjyao and alanwguo and unassigned masoudcharkhabi Jun 12, 2025

alanwguo added the `go` label (add ONLY when ready to merge, run all tests) Jun 12, 2025

alanwguo approved these changes Jun 12, 2025

jjyao assigned MengjinYan and can-anyscale Jun 12, 2025

jjyao unassigned alanwguo and jjyao Jun 12, 2025

can-anyscale reviewed Jun 13, 2025

richardsliu added 5 commits June 13, 2025 19:41

add unit test

5d2d732

Signed-off-by: Richard Liu <[email protected]>

lint

0b7dc95

Signed-off-by: Richard Liu <[email protected]>

lint

6a077eb

Signed-off-by: Richard Liu <[email protected]>

lint

4996ee0

Signed-off-by: Richard Liu <[email protected]>

lint

7d775fc

Signed-off-by: Richard Liu <[email protected]>

can-anyscale reviewed Jun 13, 2025

readability

1a947ce

Signed-off-by: Richard Liu <[email protected]>

can-anyscale approved these changes Jun 14, 2025

can-anyscale merged commit b030d5b into ray-project:master Jun 14, 2025
5 checks passed

github-project-automation bot moved this from In Progress to Done in K8s and Ray (go/k8s-ray-oss) Jun 14, 2025

andrewsykim mentioned this pull request Aug 15, 2025

Set TPU_DEVICE_PLUGIN_ADDR by default ai-on-gke/kuberay-tpu-webhook#11

Open
