[Feature]: Parity With NVIDIA DCGM - Pulse Test #963

@functionstackx

Suggestion Description

To catch reliability issues earlier, NVIDIA DCGM has an advanced test that creates spikes in the current flow on the board to ensure the VRM and PSU can handle fluctuations (which may be caused by CPU kernel-launch-bound applications or by kernels whose power draw inherently fluctuates at fine time scales).

I have searched ROCmValidationSuite's documentation and codebase and haven't found an equivalent test:

https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html

When you get the chance, can you look into implementing this?

cc: @hliuca

Excerpt about the Pulse Test from https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#pulse-test-diagnostic

The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.

By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.

The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPU is synchronized to create extra stress on the power supply.
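
To make the "synchronized across GPUs" part concrete, here is a rough sketch of how per-GPU workers could line up their load bursts without talking to each other. This is not DCGM's implementation and not an existing RVS feature; the function names (`phase_at`, `schedule`) and the parameters (`period_s`, `duty`) are hypothetical, and it only computes the timing. The idea: every worker snaps its burst/idle phases to boundaries of a shared clock, so all GPUs ramp up and drop off at the same instants.

```python
import math


def phase_at(t: float, period_s: float, duty: float) -> str:
    """Return 'burst' or 'idle' for time t on a square wave.

    period_s and duty are hypothetical tuning knobs: the wave repeats every
    period_s seconds and spends the first `duty` fraction of each period
    in the burst phase.
    """
    if period_s <= 0 or not 0.0 < duty < 1.0:
        raise ValueError("need period_s > 0 and 0 < duty < 1")
    position = (t % period_s) / period_s
    return "burst" if position < duty else "idle"


def schedule(start: float, cycles: int, period_s: float, duty: float):
    """List (phase, begin, end) tuples for `cycles` whole periods.

    The first period begins at the first period boundary at or after
    `start`, so workers that start at slightly different times still
    produce identical schedules and therefore spike together.
    """
    if period_s <= 0 or not 0.0 < duty < 1.0:
        raise ValueError("need period_s > 0 and 0 < duty < 1")
    first = math.ceil(start / period_s) * period_s
    out = []
    for i in range(cycles):
        begin = first + i * period_s
        burst_end = begin + duty * period_s
        out.append(("burst", begin, burst_end))
        out.append(("idle", burst_end, begin + period_s))
    return out


if __name__ == "__main__":
    # Two workers that came up 0.2 s apart still agree on every edge.
    gpu0 = schedule(start=0.1, cycles=2, period_s=0.5, duty=0.5)
    gpu1 = schedule(start=0.3, cycles=2, period_s=0.5, duty=0.5)
    print(gpu0 == gpu1)
```

In a real test, each burst window would launch a high-power kernel and each idle window would leave the GPU quiet. Choosing the kernels and tuning the period and duty cycle against oscilloscope measurements, as the excerpt describes, is the hard part and is out of scope for this sketch.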

Operating System

No response

GPU

No response

ROCm Component

No response
