Add tpu usage metrics to reporter_agent #53678
Signed-off-by: Richard Liu <[email protected]>
Nice!
@can-anyscale, does this PR interfere at all with your OTEL migration PR?
Thanks for tagging, @alanwguo. Not at all, looking great.
Added a few additional runtime metrics, please take another look.
Can you also add some visualizations of these metrics to default_dashboard_panels.py?
This will add them to the core Grafana dashboard used in the Ray dashboard.
We can also embed them in the Ray Dashboard UI via the Metrics.tsx file.
This can be done as a follow-up PR as well.
Overall, this PR looks good. I'm testing e2e before approving.
Sounds good, I'll modify the dashboard in a follow-up PR.
@jjyao can you review and merge?
@can-anyscale @MengjinYan could you review this PR, since you are working on metrics right now?
try:
    for family in text_string_to_metric_families(metrics):
        for sample in family.samples:
            if sample.name == "memory_bandwidth_utilization":
It seems odd to have all of these sample names hardcoded here. How often do these names change? If they change, will we end up collecting malformed metrics?
The schema should not change: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-tpu
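To make the concern about hardcoded names concrete, here is a minimal sketch of one way to keep the names in a single allow-list so that an unexpected or renamed sample is skipped rather than recorded under the wrong metric. It is not the code from this PR. `Sample` is a stand-in for the sample objects the Prometheus text parser yields, and `KNOWN_TPU_SAMPLES` and `collect_known_samples` are hypothetical names; only the three metric names come from this thread.

```python
from collections import namedtuple

# Stand-in for the parser's sample objects; only the fields used here are
# modelled. This is an assumption for illustration, not the library type.
Sample = namedtuple("Sample", ["name", "labels", "value"])

# Metric names mentioned in this PR. The allow-list itself is hypothetical.
KNOWN_TPU_SAMPLES = {
    "memory_bandwidth_utilization",
    "tensorcore_utilization",
    "hbm_utilization",
}


def collect_known_samples(samples):
    """Group values by sample name, skipping names outside the allow-list."""
    collected = {name: [] for name in KNOWN_TPU_SAMPLES}
    skipped = set()
    for sample in samples:
        if sample.name in KNOWN_TPU_SAMPLES:
            collected[sample.name].append(float(sample.value))
        else:
            # An unknown name is reported back instead of being stored,
            # so a schema change shows up as a gap, not as bad data.
            skipped.add(sample.name)
    return collected, skipped
```

With this shape, a renamed upstream sample would leave the corresponding list empty and show up in `skipped`, which a caller could log.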
@can-anyscale Fixed, please take another look.
just one additional question
@can-anyscale Done.
Yes, I can see the metrics reported in Prometheus. I am adding the dashboard changes in a follow-up PR.
## Why are these changes needed?

Adding TPU usage metrics for tensorcore_utilization and hbm_utilization to reporter_agent.

## Related issue number

## Checks

- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [X] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [X] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [X] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [X] Unit tests
   - [X] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: Richard Liu <[email protected]>
Signed-off-by: elliot-barn <[email protected]>