Open
Suggestion Description
To catch reliability issues earlier, NVIDIA DCGM includes an advanced test that creates spikes in current flow on the board to ensure the VRM and PSU can handle fluctuations. Such fluctuations can be caused by applications that are bound by CPU kernel launches or by kernels that inherently produce high micro-fluctuations in power draw.
I have searched ROCmValidationSuite's documentation and codebase and haven't found anything related to this:
https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/conceptual/rvs-modules.html
When you get the chance, could you look into implementing this?
cc: @hliuca
Excerpt about the Pulse Test from https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#pulse-test-diagnostic
The Pulse Test is part of the new level 4 tests. The pulse test is meant to fluctuate the power usage to create spikes in current flow on the board to ensure that the power supply is fully functional and can handle wide fluctuations in current.
By default, the test runs kernels with high transiency in order to create spikes in the current running to the GPU. Default parameters have been verified to create worst-case scenario failures by measuring with oscilloscopes.
The test iteratively runs different kernels while tweaking internal parameters to ensure that spikes are produced; work across GPU is synchronized to create extra stress on the power supply.
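To make the request more concrete, here is a rough sketch of the kind of load pattern the excerpt describes: a square wave of busy and idle phases, with every GPU following the same schedule so the spikes line up. This is only an illustration of the idea, not DCGM's implementation and not an existing RVS feature. The names `PulsePhase`, `pulse_schedule`, and `schedules_for_gpus`, and the parameters `period_s`, `duty_cycle`, and `n_periods`, are hypothetical. The actual high-power kernel launches are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PulsePhase:
    """One segment of the load pattern (hypothetical type, for illustration only)."""
    start_s: float     # offset from the shared start time, in seconds
    duration_s: float  # how long this phase lasts, in seconds
    load_on: bool      # True = run a high-power kernel, False = stay idle


def pulse_schedule(period_s: float, duty_cycle: float, n_periods: int) -> List[PulsePhase]:
    """Build a square-wave schedule: busy for duty_cycle of each period, idle for the rest."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    if not 0.0 < duty_cycle < 1.0:
        raise ValueError("duty_cycle must be strictly between 0 and 1")
    if n_periods <= 0:
        raise ValueError("n_periods must be positive")

    on_s = period_s * duty_cycle
    phases: List[PulsePhase] = []
    for i in range(n_periods):
        start = i * period_s
        # Busy phase first, then the idle phase that completes the period.
        phases.append(PulsePhase(start, on_s, True))
        phases.append(PulsePhase(start + on_s, period_s - on_s, False))
    return phases


def schedules_for_gpus(gpu_ids: Iterable[int], period_s: float,
                       duty_cycle: float, n_periods: int) -> Dict[int, List[PulsePhase]]:
    """Give every GPU the same schedule so the load transitions happen in phase."""
    schedule = pulse_schedule(period_s, duty_cycle, n_periods)
    return {gpu: schedule for gpu in gpu_ids}
```

In a real test, each phase with `load_on=True` would presumably launch a power-hungry kernel on its GPU, and all workers would wait on a shared start time before stepping through the schedule. The period and duty cycle that actually stress the VRM and PSU would need to be tuned against hardware measurements.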
Operating System
No response
GPU
No response
ROCm Component
No response