Description
Suggestion Description
Feature request: the out-of-the-box rvs CLI should offer diagnostics levels via an -r flag, where each level takes a different amount of time to run. In addition to our own experience with rvs (and as frequent users of NVIDIA dcgm), we have heard from multiple AMD clouds that they would appreciate a feature like this.
cc: @hliuca
Context Of Current Poor RVS User Experience
Currently RVS doesn't provide out-of-the-box diagnostics levels, and the only ways to select test suites are:
- run the whole test suite (which is impractical in many common situations, such as a job prolog)
- define a custom yaml that sets which tests are part of the test suite (depending on how long you want the run to take)
Compare this with NVIDIA DCGM diagnostics, which provides out-of-the-box levels via a simple CLI, in addition to allowing a custom yaml:
sudo dcgmi diag -r <INSERT_LEVEL>
where <INSERT_LEVEL> is one of 1, 2, 3, 4:
Level 1 test suite takes 1 sec & is used as a readiness check that runs as a prolog before the job starts
Level 2 test suite takes < 2 mins & is used as an epilogue on a job failure
Level 3 test suite takes < 30 mins & is used for failure post-mortem diag
Level 4 test suite takes 1-2 hours & is used for failure post-mortem diag
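
To make the request concrete, here is a minimal sketch of how a level flag could sit on top of the existing config-file workflow. It is only an illustration: the config file names (level1.conf and so on), the conf directory, and the choice of which tests belong to which level are hypothetical, and the time budgets simply mirror the DCGM levels listed above. It assumes rvs accepts a config file through its -c option.

```python
import argparse

# Hypothetical level table: the time budgets mirror the DCGM levels above,
# and the config file names are placeholders, not files shipped with RVS.
LEVELS = {
    1: {"budget": "~1 sec", "use": "job prolog readiness check", "configs": ["level1.conf"]},
    2: {"budget": "< 2 mins", "use": "epilogue on job failure", "configs": ["level2.conf"]},
    3: {"budget": "< 30 mins", "use": "failure post-mortem", "configs": ["level3.conf"]},
    4: {"budget": "1-2 hours", "use": "failure post-mortem", "configs": ["level4.conf"]},
}


def parse_level(argv):
    """Parse a DCGM-style '-r <level>' flag from an argument list."""
    parser = argparse.ArgumentParser(description="Sketch of an RVS level selector")
    parser.add_argument("-r", dest="level", type=int, required=True,
                        help="diagnostics level (1-4)")
    return parser.parse_args(argv).level


def build_commands(level, conf_dir="conf"):
    """Return one 'rvs -c <config>' command (as an argv list) per config of the level."""
    if level not in LEVELS:
        raise ValueError(f"unknown diagnostics level: {level}")
    # Assumption: rvs takes a config file via -c; conf_dir is a placeholder path.
    return [["rvs", "-c", f"{conf_dir}/{name}"] for name in LEVELS[level]["configs"]]
```

A prolog script could then call build_commands(parse_level(["-r", "1"])) and hand each command to subprocess.run, failing the job if any command returns a non-zero exit code. Having this mapping ship with rvs itself, rather than being maintained by every cluster operator, is the point of the request.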


Operating System
No response
GPU
No response
ROCm Component
No response