
[Feature]: Parity With NVIDIA DCGM - Out of the Box Diagnostics Levels #961

@functionstackx

Suggestion Description

Feature Request

Add out-of-the-box diagnostic levels to the rvs CLI: an -r flag that selects among predefined levels, each of which takes a different amount of time to run. In addition to my personal experience with rvs (as a frequent user of NVIDIA DCGM), we have heard from multiple AMD clouds that they would appreciate a feature like this.

cc: @hliuca

Context Of Current Poor RVS User Experience

Currently RVS doesn't provide out-of-the-box diagnostic levels, and the only ways to select test suites are:

  1. run the whole test suite (which is impractical in many common situations, such as a job prolog)
  2. define a custom yaml that sets which tests are part of the test suite to run (depending on how long you want the test to run)

Compare this to NVIDIA DCGM diagnostics, which provides out-of-the-box levels via a simple CLI, in addition to allowing custom yaml:

sudo dcgmi diag -r <INSERT_LEVEL>

where <INSERT_LEVEL> is one of 1, 2, 3, or 4:

Level 1 test suite takes 1 sec & is used as a readiness check that runs as a prolog before the job starts
Level 2 test suite takes < 2 mins & is used as an epilogue on a job failure
Level 3 test suite takes < 30 mins & is used for failure post-mortem diagnostics
Level 4 test suite takes 1-2 hours & is used for failure post-mortem diagnostics
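
To make the request concrete, here is a minimal sketch of how such levels could be mapped onto RVS configuration files by a thin wrapper. It is an illustration only, not an existing RVS feature: the config file names, the `conf_dir` path, and the `build_commands` helper are all hypothetical, and the sketch assumes that rvs accepts a configuration file through its `-c` option.

```python
import posixpath

# Hypothetical mapping from a DCGM-style level to RVS config files.
# The file names are placeholders, not files shipped with RVS.
LEVEL_CONFIGS = {
    1: ["level1_readiness.conf"],
    2: ["level1_readiness.conf", "level2_short.conf"],
    3: ["level1_readiness.conf", "level2_short.conf", "level3_medium.conf"],
    4: ["level1_readiness.conf", "level2_short.conf", "level3_medium.conf",
        "level4_long.conf"],
}


def build_commands(level, conf_dir="/path/to/rvs/conf"):
    """Return one rvs command (as an argv list) per config file in the level.

    Assumes rvs takes its config file via -c; conf_dir is a placeholder.
    """
    if level not in LEVEL_CONFIGS:
        raise ValueError(f"level must be one of {sorted(LEVEL_CONFIGS)}, got {level!r}")
    return [["rvs", "-c", posixpath.join(conf_dir, name)]
            for name in LEVEL_CONFIGS[level]]


if __name__ == "__main__":
    # Print the commands a prolog script would run for the quickest level.
    for argv in build_commands(1):
        print(" ".join(argv))
```

A job prolog could then call `build_commands(1)` and an epilogue `build_commands(2)`, passing each argv list to `subprocess.run`. Having this mapping ship with rvs itself, behind an -r flag, is what this issue is asking for.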

Operating System

No response

GPU

No response

ROCm Component

No response
