Commit ff1195e

charis-poag-amd authored and JeniferC99 committed
[SWDEV-535159] Add support for GPU partition metrics (#490)
Changes include:
- Internal logic to smart-switch between gpu_metrics/xcp_metrics files
- [WIP] Initial plumbing for new partition metric API

Change-Id: I4340fb1b48bac0117d80d5d486b9e871430d5cd8
Signed-off-by: Charis Poag <[email protected]>

Add amdsmi_get_gpu_partition_metrics_info() + minor cleanup

Change-Id: I5d60604f18baddbd03852dc90e88aa0b8107d50e
Signed-off-by: Charis Poag <[email protected]>

Fix partition metric logic + update logging/tests

Change-Id: I9e89b19ead17694c54e224f8e13ff8ee3eb2e22a
Signed-off-by: Charis Poag <[email protected]>

Adjust amd-smi metric/monitor/default to show (some) partition information

Change-Id: I2e8d2745876a19bdaec3c039daa97345c9f701b5
Signed-off-by: Charis Poag <[email protected]>

Add C++ tests

Change-Id: Ib9eb0b57a6d7a280992e05a4c6eba632826952ef
Signed-off-by: Charis Poag <[email protected]>

Remove modification of energy counter, not needed

Change-Id: I5c48eaaae248ee6dc79abba609d837ec35d78022
Signed-off-by: Charis Poag <[email protected]>

[CLI] amd-smi metric: cleaned up N/A'd multi-valued to show just N/A

Changes:
1. amd-smi metric: cleaned up N/A'd multi-valued to show just N/A
   ex. JPEG_ACTIVITY: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
   Now just shows: N/A
2. [Python Unit Test] Changed testname TestAmdSmiPythonBDF(unittest.TestCase) -> AmdSmiPythonUnitTest
   Test name was confusing.

Change-Id: Ieb3b036f30002fd22362508eb9fc5d443df395ae
Signed-off-by: Charis Poag <[email protected]>

Log cleanup

Change-Id: I1b1a95f1844d35bec7a7bd8cb996f87e4914c069
Signed-off-by: Charis Poag <[email protected]>

Add amd-smi partition-metrics CLI + general cleanup

Change-Id: Ia91488e6cb3a4d62b4087afbddfe0b3bb9378fdc
Signed-off-by: Charis Poag <[email protected]>

[1.3 metrics] Remove forwards compatibility for partition metrics

Change-Id: Iab928983e6f6f1587bc9307f6f3fa2b2696ca6f7
Signed-off-by: Charis Poag <[email protected]>

Fixed violation output not showing % + general cleanup

Change-Id: Icac1b0a55b18c7628b07109ae0c377d17e0825f1
Signed-off-by: Charis Poag <[email protected]>

Clean up amdsmi_get_gpu_partition_metrics_info & amd-smi partition-metric outputs

Change-Id: I6427028b980874641e9ffb3b5d88ad493dbf9cf4
Signed-off-by: Charis Poag <[email protected]>

* Fix metrics not found + extra logging/formatting

Change-Id: I841a27bb2c305e97ec7579a13ac915e5be497c3a
Signed-off-by: Charis Poag <[email protected]>

* Update license to current default

Change-Id: I0de9b8a2d5dbbeab4491097f0354ba17b0d30866
Signed-off-by: Charis Poag <[email protected]>

* Cleanup for review

Change-Id: I96ed25c3f2b8968eea1af24c5e5860c2b4e74e6e
Signed-off-by: Charis Poag <[email protected]>

* Modernize updated/new internal APIs.

Change-Id: I3c48a250eeb703709b14cb5ffa68268d8321626c
Signed-off-by: Charis Poag <[email protected]>

* Remove extra logging in dynamic metrics

Change-Id: Idb97547bcbe143d6fa1cb5cb278ffe4da615ce14
Signed-off-by: Charis Poag <[email protected]>

* Remove amd-smi partition-metric command

Change-Id: Ib83c17e5cd7e0da3798198943bddd46c296b411c
Signed-off-by: Charis Poag <[email protected]>

* Move new CLI updates to another PR + minor fixes

Change-Id: I3b1163eec12f9b5f7d95ee33de08e168cec1b1fe
Signed-off-by: Charis Poag <[email protected]>

* Allow dynamic metrics to work for gpu/xcp metrics 1.9+/1.1+

Updated some logging as well.

Change-Id: I2ed9f5a5ef8afb1520508820ca6153525f0644b4
Signed-off-by: Charis Poag <[email protected]>

* Allow dyn gpu/xcp metric v1.9+/v1.1+

Added tests for quick check

Change-Id: I576d6f6582a55afb08e5ac57791ce95e2fa184a2
Signed-off-by: Charis Poag <[email protected]>

* Update tests for larger subset of version checks

Change-Id: I3cdf4f8bb4fc6161f4c76566939f90545d0f362a
Signed-off-by: Charis Poag <[email protected]>

* Fix XCP metrics in gpu/partition metric pre-v1.9/v1.1 (dynamic)

Change-Id: I4dabc1ed6bef6b86c8e7f92bf9cb5992f3966fe2
Signed-off-by: Charis Poag <[email protected]>

---------

Change-Id: I8ab1752743b04f1c7791d0405a7bccd7128b01ae
Signed-off-by: Charis Poag <[email protected]>
1 parent 701e3ff commit ff1195e

22 files changed: +2282 -892 lines changed

amdsmi_cli/amdsmi_commands.py

Lines changed: 16 additions & 34 deletions
@@ -1672,6 +1672,7 @@ def metric_gpu(self, args, multiple_devices=False, watching_output=False, gpu=No
         # Add timestamp and store values for specified arguments
         values_dict = {}
 
+        is_partition_metrics = False # True if we get the metrics from xcp_metrics file (amdsmi_get_gpu_partition_metrics_info)
         #get metric info only once per gpu, this will speed up data output
         try:
             # Get GPU Metrics table
@@ -1680,19 +1681,10 @@ def metric_gpu(self, args, multiple_devices=False, watching_output=False, gpu=No
             logging.debug("#3 - Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
             gpu_metric = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
 
-        # Workaround for XCP (partition) metrics not providing num_partition in v1.0
-        # Confirmed with driver team that we can default to 1 if num_partition is not defined.
-        # Pending partitions exist, ie. partition_id > 0. See logic below.
-        try:
-            partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
-        except amdsmi_exception.AmdSmiLibraryException as e:
-            logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
-            partition_id = "N/A"
-
-        num_partition = gpu_metric['num_partition']
-        if num_partition == "N/A":
-            num_partition = 1 # Workaround for XCP metrics not providing num_partition in v1.0
-            logging.debug(f"num_partition is N/A and partition_id: {partition_id} (greater > 0).\nModified num_partition: {num_partition} to adjust for XCP metrics.")
+        # Workaround for XCP (partition) metrics not providing num_partition in v1.9+/v1.1+
+        # Provides original formatting for earlier metric versions
+        partition_metric_info = self.helpers._get_metric_version_and_partition_info(gpu_metric, is_partition_metrics, gpu_id, args.gpu)
+        num_partition = partition_metric_info['num_partition']
 
         if self.logger.is_json_format():
             values_dict['gpu'] = int(gpu_id)
@@ -2719,7 +2711,7 @@ def metric_gpu(self, args, multiple_devices=False, watching_output=False, gpu=No
                                 value[k][index] = self.helpers.unit_format(self.logger, activity, activity_unit)
                             value[k] = '[' + ", ".join(value[k]) + ']'
                     elif value != "N/A":
-                        value = self.helpers.unit_format(self.logger, value, activity_unit)
+                        throttle_status[key] = self.helpers.unit_format(self.logger, value, activity_unit)
                     if self.logger.is_json_format():
                         if isinstance(value, (list, dict)):
                             for k, v in value.items():
@@ -3130,7 +3122,6 @@ def metric_core(self, args, multiple_devices=False, core=None, core_boost_limit=
         if not self.logger.is_json_format():
             self.logger.print_output(multiple_device_enabled=multiple_devices_csv_override)
 
-
     def metric(self, args, multiple_devices=False, watching_output=False, gpu=None,
                usage=None, watch=None, watch_time=None, iterations=None, power=None,
                clock=None, temperature=None, ecc=None, ecc_blocks=None, pcie=None,
@@ -5744,6 +5735,7 @@ def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
         except amdsmi_exception.AmdSmiLibraryException as e:
             logging.debug("#5 - Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
 
+        is_partition_metrics = False # True if we get the metrics from xcp_metrics file (amdsmi_get_gpu_partition_metrics_info)
         #get metric info only once per gpu, this will speed up data output
         try:
             # Get GPU Metrics table
@@ -5755,25 +5747,15 @@ def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
             gpu_metrics_info = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
             logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
 
-        # Workaround for XCP (partition) metrics not providing num_partition in v1.0
-        # Confirmed with driver team that we can default to 1 if num_partition is not defined.
-        # Pending partitions exist, ie. partition_id > 0. See logic below.
-        try:
-            partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
-        except amdsmi_exception.AmdSmiLibraryException as e:
-            logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
-            partition_id = "N/A"
+        # Workaround for XCP (partition) metrics not providing num_partition in v1.9+/v1.1+
+        # Provides original formatting for earlier metric versions
+        partition_metric_info = self.helpers._get_metric_version_and_partition_info(gpu_metrics_info, is_partition_metrics, gpu_id, args.gpu)
+        partition_id = partition_metric_info['partition_id']
+        num_partition = partition_metric_info['num_partition']
 
-        num_partition = gpu_metrics_info['num_partition']
-        if num_partition == "N/A":
-            num_partition = partition_id
-
-        num_xcp = num_partition # used later for XCP metrics
+        # Update logger for XCP display (only if applicable)
         self.logger.table_header += 'XCP'.rjust(5, ' ')
-        self.logger.store_output(args.gpu, 'xcp', partition_id) # Starting with partition_id.
-                                                                # Outputs which have xcp details
-                                                                # will update this value via num_xcp.
-                                                                # This value will help map to primary device.
+        self.logger.store_output(args.gpu, 'xcp', partition_id) # Store partition_id initially; can be updated via num_xcp
 
         # Store the pcie_bw values due to possible increase in bandwidth due to repeated gpu_metrics calls
         if args.pcie:
@@ -6013,7 +5995,7 @@ def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
                                             "unit" : freq_unit}
             except (KeyError, amdsmi_exception.AmdSmiLibraryException) as e:
                 monitor_values['dclock'] = "N/A"
-                logging.debug("Failed to get vclock on gpu %s | %s", gpu_id, e)
+                logging.debug("Failed to get dclock on gpu %s | %s", gpu_id, e)
 
             self.logger.table_header += 'DCLOCK'.rjust(10)
 
@@ -6356,7 +6338,7 @@ def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
                 self.logger.store_multiple_device_output()
                 current_xcp += 1
         else:
-            self.logger.store_output(args.gpu, 'xcp', num_xcp)
+            self.logger.store_output(args.gpu, 'xcp', partition_id)
             self.logger.store_output(args.gpu, 'values', monitor_values)
 
         # Store typical output for all commands (XCP data will be handled separately, eg. violation status)

amdsmi_cli/amdsmi_helpers.py

Lines changed: 70 additions & 1 deletion
@@ -1017,7 +1017,6 @@ def unit_format(self, logger, value, unit):
         """This function will format output with unit based on the logger output format
 
         params:
-            args - argparser args to pass to subcommand
             logger (AMDSMILogger) - Logger to print out output
             value - the value to be formatted
             unit - the unit to be formatted with the value
@@ -1040,6 +1039,9 @@ def unit_format(self, logger, value, unit):
                 return {"value": value, "unit": unit}
             else:
                 return value
+        if logger.is_csv_format():
+            # For CSV, return the raw value (number or "N/A"), not a string
+            return value
         if logger.is_human_readable_format():
             if unit:
                 return f"{value} {unit}".rstrip()
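The hunk above adds a CSV branch between the JSON and human-readable branches of unit_format. A minimal sketch of that dispatch order follows. The `OutputFormat` enum and `format_with_unit` name are hypothetical stand-ins for the logger's format checks, and the condition guarding the JSON dictionary (a non-empty unit) is an assumption, since the hunk does not show it.

```python
from enum import Enum


class OutputFormat(Enum):
    """Hypothetical stand-in for the logger's is_*_format() checks."""
    JSON = "json"
    CSV = "csv"
    HUMAN = "human"


def format_with_unit(fmt, value, unit):
    """Shape a metric value for the given output format."""
    if fmt is OutputFormat.JSON:
        # Assumption: the unit is attached only when one is supplied.
        return {"value": value, "unit": unit} if unit else value
    if fmt is OutputFormat.CSV:
        # CSV keeps the raw value (number or "N/A"), mirroring the added branch.
        return value
    # Human-readable: append the unit and trim the space left by an empty unit.
    return f"{value} {unit}".rstrip()
```

Keeping CSV values raw means a downstream parser can read a column as numbers without stripping unit suffixes first.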
@@ -1637,3 +1639,70 @@ def average_flattened_ints(data, context="data"):
         # Flatten nested lists and filter integers
         flat = [v for value in data for v in (value if isinstance(value, list) else [value]) if isinstance(v, int)]
         return round(sum(flat) / len(flat)) if flat else "N/A"
+
+    def _get_metric_version_and_partition_info(self, gpu_metrics_info, is_partition_metrics, gpu_id, gpu_handle):
+        """
+        Helper method to compute metric version, partition ID, and num_partition for dynamic metrics.
+        Handles logging updates internally for reusability.
+
+        Args:
+            gpu_metrics_info (dict): GPU metrics info from amdsmi_get_gpu_metrics_info.
+            is_partition_metrics (bool): Whether this is for partition metrics.
+            gpu_id (int): GPU ID for logging.
+            gpu_handle: GPU device handle for KFD info retrieval.
+
+        Returns:
+            dict: {
+                'metric_version': float or "N/A",
+                'partition_id': int or "N/A",
+                'num_partition': int or "N/A",
+                'num_xcp': int or "N/A" # Alias for num_partition
+            }
+        """
+        # Compute metric version from header revisions
+        metric_version = "N/A"
+        format_rev = gpu_metrics_info.get('common_header.format_revision', "N/A")
+        content_rev = gpu_metrics_info.get('common_header.content_revision', "N/A")
+        if format_rev != "N/A" and content_rev != "N/A":
+            try:
+                metric_version = float(f"{format_rev}.{content_rev}")
+            except ValueError:
+                metric_version = "N/A" # Fallback if conversion fails
+
+        # Retrieve partition ID from KFD info
+        partition_id = "N/A"
+        try:
+            kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(gpu_handle)
+            partition_id = kfd_info.get('current_partition_id', "N/A")
+        except amdsmi_exception.AmdSmiLibraryException as e:
+            logging.debug("Failed to get current partition ID for GPU %s | %s", gpu_id, e.get_error_info())
+
+        # Determine num_partition with fallback logic for dynamic metrics
+        num_partition = gpu_metrics_info.get('num_partition', "N/A")
+        if metric_version != "N/A" and num_partition == "N/A":
+            # Workaround: Default to 1 for newer metric versions if num_partition is missing
+            # (Confirmed with driver team; applies to GPU and partition metrics)
+            if not is_partition_metrics and metric_version >= 1.9:
+                num_partition = 1
+            elif is_partition_metrics and metric_version >= 1.1:
+                num_partition = 1
+            elif partition_id != "N/A" and partition_id > 0:
+                # Fallback to partition_id if partitions exist but num_partition is unavailable
+                num_partition = partition_id
+            # Else: Remains "N/A" if no conditions match
+
+        # Alias num_xcp for XCP metrics usage
+        num_xcp = num_partition
+
+        # Debug logging
+        logging.debug(
+            "GPU %s | Metric version: %s, num_partition: %s, partition_id: %s, num_xcp: %s",
+            gpu_id, metric_version, num_partition, partition_id, num_xcp
+        )
+
+        return {
+            'metric_version': metric_version,
+            'partition_id': partition_id,
+            'num_partition': num_partition,
+            'num_xcp': num_xcp
+        }
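The helper above gates the num_partition fallback on a metric version built with float(f"{format_rev}.{content_rev}"). The sketch below expresses the same decision as a standalone function, but compares integer tuples instead. That swap is not in the commit; it is used here because a float maps a hypothetical revision 1.10 to 1.1, which would sort below 1.9. The names `parse_version` and `resolve_num_partition` are illustrative, and whether a content revision ever reaches 10 is an assumption.

```python
NA = "N/A"


def parse_version(format_rev, content_rev):
    """Return (format, content) as an int tuple, or None if either part is unusable."""
    try:
        return (int(format_rev), int(content_rev))
    except (TypeError, ValueError):
        return None


def resolve_num_partition(num_partition, version, is_partition_metrics, partition_id):
    """Mirror the helper's fallback order for a missing num_partition."""
    # A reported value, or an unknown version, leaves num_partition untouched.
    if num_partition != NA or version is None:
        return num_partition
    # Thresholds from the commit: v1.9+ for gpu_metrics, v1.1+ for xcp_metrics.
    threshold = (1, 1) if is_partition_metrics else (1, 9)
    if version >= threshold:
        return 1
    # Earlier versions fall back to the partition ID when partitions exist.
    if partition_id != NA and partition_id > 0:
        return partition_id
    return NA
```

With tuples, (1, 10) >= (1, 9) holds, whereas float("1.10") >= 1.9 does not; for single-digit content revisions the two approaches agree.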

amdsmi_cli/amdsmi_parser.py

Lines changed: 0 additions & 1 deletion
@@ -918,7 +918,6 @@ def _add_bad_pages_parser(self, subparsers: argparse._SubParsersAction, func):
         self._add_device_arguments(bad_pages_parser, required=False)
         self._add_command_modifiers(bad_pages_parser)
 
-
     def _add_metric_parser(self, subparsers: argparse._SubParsersAction, func):
         # Subparser help text
         metric_help = "Gets metric/performance information about the specified GPU"

include/amd_smi/amdsmi.h

Lines changed: 24 additions & 0 deletions
@@ -4055,6 +4055,30 @@ amdsmi_get_gpu_metrics_header_info(amdsmi_processor_handle processor_handle, amd
 amdsmi_status_t amdsmi_get_gpu_metrics_info(amdsmi_processor_handle processor_handle,
                                             amdsmi_gpu_metrics_t *pgpu_metrics);
 
+/**
+ * @brief This function retrieves the partition metrics information.
+ *
+ * @ingroup tagClkPowerPerfQuery
+ *
+ * @platform{gpu_bm_linux} @platform{guest_1vf}
+ *
+ * @details Given a processor handle @p processor_handle and a pointer to a
+ * ::amdsmi_gpu_metrics_t structure @p pgpu_metrics, this function will populate
+ * @p pgpu_metrics. See ::amdsmi_gpu_metrics_t for more details.
+ *
+ * @param[in] processor_handle a processor handle
+ *
+ * @param[in,out] pgpu_metrics a pointer to an ::amdsmi_gpu_metrics_t structure
+ * If this parameter is nullptr, this function will return
+ * ::AMDSMI_STATUS_INVAL if the function is supported with the provided,
+ * arguments and ::AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the
+ * provided arguments.
+ *
+ * @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
+ */
+amdsmi_status_t amdsmi_get_gpu_partition_metrics_info(amdsmi_processor_handle processor_handle,
+                                                      amdsmi_gpu_metrics_t *pgpu_metrics);
+
 /**
  * @brief Get the pm metrics table with provided device index.
  *

py-interface/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -170,6 +170,7 @@
 from .amdsmi_interface import amdsmi_get_clk_freq
 from .amdsmi_interface import amdsmi_get_gpu_od_volt_info
 from .amdsmi_interface import amdsmi_get_gpu_metrics_info
+from .amdsmi_interface import amdsmi_get_gpu_partition_metrics_info
 from .amdsmi_interface import amdsmi_get_gpu_od_volt_curve_regions
 from .amdsmi_interface import amdsmi_is_gpu_power_management_enabled
 
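With the new name exported from the package, a caller could query partition metrics in the same way as amdsmi_get_gpu_metrics_info. The sketch below is illustrative only. It assumes the Python wrapper takes a single processor handle and returns a dict, which this diff does not show. The `read_partition_metrics` helper and its `smi` parameter are hypothetical, and the module is passed in as an argument so the function can be exercised without a GPU.

```python
def read_partition_metrics(smi):
    """Collect partition metrics for every processor handle.

    `smi` is any module-like object exposing the amdsmi-style functions
    used below (hypothetical injection point, not part of the commit).
    """
    results = []
    smi.amdsmi_init()
    try:
        for handle in smi.amdsmi_get_processor_handles():
            try:
                # Assumed signature: one processor handle in, a dict of metrics out.
                results.append(smi.amdsmi_get_gpu_partition_metrics_info(handle))
            except smi.AmdSmiLibraryException:
                # Follow the CLI convention of reporting "N/A" on library errors.
                results.append("N/A")
    finally:
        # Always release the library, even if enumeration fails.
        smi.amdsmi_shut_down()
    return results
```

On a machine with the library installed, passing the imported `amdsmi` module as `smi` would be the expected usage, under the same assumptions.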
