Only enumerate ROCm-capable AMD GPUs #1500


Merged
merged 4 commits into from
Jun 10, 2025

Conversation

alaviss
Contributor

@alaviss alaviss commented Jun 10, 2025

Discover AMD graphics devices using AMDKFD topology instead of enumerating the PCIe bus. This interface exposes a lot more information about potential devices, allowing RamaLama to filter out unsupported devices.

Currently, devices older than GFX9 are filtered, as they are no longer supported by ROCm.

Ref: #1482

Summary by Sourcery

Use the AMDKFD sysfs topology interface to detect and select ROCm-capable AMD GPUs by architecture and VRAM capacity.

New Features:

  • Introduce an amdkfd module to parse KFD properties and enumerate GPUs via /sys/devices/virtual/kfd
  • Replace PCI-based enumeration with AMDKFD for ROCm GPU detection
  • Filter out GPUs older than GFX9 (gfx_target_version < 90000)
  • Aggregate VRAM across memory banks and require a minimum of 1 GiB (MIN_VRAM_BYTES)

Signed-off-by: Leorize <[email protected]>
@alaviss alaviss force-pushed the push-pwxuznmnqptr branch from 7bf6748 to fab8765 Compare June 10, 2025 19:54
@alaviss
Contributor Author

alaviss commented Jun 10, 2025

At the moment, RamaLama will fall back to CPU inference for unsupported AMD GPUs, since -ngl is not passed when RamaLama doesn't detect an accelerator. I'm not sure how to detect whether a Vulkan-capable accelerator is available in order to force passing -ngl.

@alaviss
Contributor Author

alaviss commented Jun 10, 2025

Ideally this should be tested on AMD APU systems. I have only verified this implementation with my dGPU.

@ericcurtin ericcurtin requested a review from Copilot June 10, 2025 20:02
@ericcurtin
Member

Not a big fan of inner functions; they encourage indenting for little benefit. Can we just make them normal functions?

@ericcurtin
Member

Thanks for completing this. The logic seems to make sense, though I haven't tested it.

@ericcurtin
Member

As regards "I'm not sure how to detect if a Vulkan-capable accelerator is available to force passing -ngl":

We recently integrated this change:

ggml-org/llama.cpp#14099

Maybe we want to default to --ngl 999 across the board and see how that goes. My concern would be machines with limited VRAM: would they play nicely with that default?

We could parse this just before we execute llama-server also potentially:

$ vulkaninfo 2>&1 | grep deviceType
	deviceType        = PHYSICAL_DEVICE_TYPE_CPU

@alaviss
Contributor Author

alaviss commented Jun 10, 2025

Maybe we want to default to --ngl 999 all round and see how that goes

I actually tested that, and it tries to do repacks for CPU. I'm not sure if that's a good or a bad thing. But either way I think that change should be done separately.

We could parse this just before we execute llama-server also potentially:

$ vulkaninfo 2>&1 | grep deviceType
	deviceType        = PHYSICAL_DEVICE_TYPE_CPU

vulkaninfo is not installed by default anywhere afaict.

@ericcurtin
Member

Maybe we want to default to --ngl 999 all round and see how that goes

I actually tested that, and it tries to do repacks for CPU. I'm not sure if that's a good or a bad thing. But either way I think that change should be done separately.

Yeah... There's a patch from llama.cpp merged hours ago that might change this behaviour though. Did you try with that patch?

We could parse this just before we execute llama-server also potentially:

$ vulkaninfo 2>&1 | grep deviceType
	deviceType        = PHYSICAL_DEVICE_TYPE_CPU

vulkaninfo is not installed by default anywhere afaict.

It's installed in all our container images with Vulkan, so if it's done just before llama-server is executed it should be fine in the containerized case.
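A sketch (not RamaLama's actual code) of the check floated above: run vulkaninfo, pull out the deviceType lines, and treat any device that is not PHYSICAL_DEVICE_TYPE_CPU as an accelerator. The function names here are hypothetical:

```python
import shutil
import subprocess

def parse_device_types(output):
    """Extract deviceType values from vulkaninfo output lines like
    'deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU'."""
    return [line.split('=')[-1].strip()
            for line in output.splitlines()
            if 'deviceType' in line]

def has_non_cpu_device(output):
    """True if any reported Vulkan device is not a CPU implementation."""
    types = parse_device_types(output)
    return any(t and t != 'PHYSICAL_DEVICE_TYPE_CPU' for t in types)

def vulkan_accelerator_available():
    """Best-effort probe: False when vulkaninfo is missing or fails."""
    if shutil.which('vulkaninfo') is None:
        return False
    try:
        proc = subprocess.run(['vulkaninfo'], capture_output=True,
                              text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    return has_non_cpu_device(proc.stdout + proc.stderr)
```

As noted above, this would only be reliable inside the containers where vulkaninfo is known to be installed; on a bare host the probe simply returns False.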

@ericcurtin ericcurtin requested a review from Copilot June 10, 2025 21:22
@Copilot Copilot AI left a comment

Pull Request Overview

This PR replaces PCIe-based enumeration with AMDKFD topology to detect ROCm-capable AMD GPUs, filters out unsupported older architectures, and accounts for VRAM across memory banks.

  • Add ramalama.amdkfd module for parsing KFD properties and listing GPU nodes
  • Update check_rocm_amd() to use KFD topology, filter gfx versions <90000, and sum VRAM from public/private heaps
  • Introduce MIN_VRAM_BYTES constant (1 GiB) for minimum VRAM threshold
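The selection logic in those bullets can be sketched as follows. `MIN_VRAM_BYTES` and the 90000 cutoff come from the PR description; the function shape and the `(index, props, vram_bytes)` input format are assumptions for illustration, and the VRAM comparison is indented the way the review below says it should be:

```python
MIN_VRAM_BYTES = 1 << 30  # 1 GiB minimum to consider a GPU usable

def best_rocm_gpu(gpus):
    """Pick the supported GPU with the most VRAM.

    gpus: iterable of (index, props, vram_bytes) tuples.
    Returns the winning index, or None if nothing qualifies.
    """
    best = None
    best_vram = MIN_VRAM_BYTES
    for i, props, vram in gpus:
        # ROCm dropped architectures older than GFX9
        if int(props['gfx_target_version']) < 90000:
            continue
        if vram > best_vram:
            best_vram = vram
            best = i  # only updated when a larger-VRAM GPU is found
    return best
```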

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ramalama/common.py Import new module, add VRAM threshold constant, overhaul GPU scan
ramalama/amdkfd.py New utilities for reading KFD sysfs properties and listing GPUs
Comments suppressed due to low confidence (3)

ramalama/common.py:453

  • [nitpick] The variable name np is ambiguous and commonly associated with NumPy; consider renaming to node_path or similar for clarity.
for i, (np, props) in enumerate(amdkfd.gpus()):

ramalama/common.py:470

  • This assignment is outside the if mem_bytes > ... block, so gpu_num will be set on every iteration rather than only when a larger VRAM GPU is found. It should be indented inside the if.
gpu_num = i

ramalama/amdkfd.py:12

  • The new gpus() function and parse_props() logic lack tests; consider adding unit tests or mocks for sysfs reads to verify correct parsing and filtering.
def gpus():

mem_banks_count = int(props['mem_banks_count'])
mem_bytes = 0
for bank in range(mem_banks_count):
    bank_props = amdkfd.parse_props(np + f'/mem_banks/{bank}/properties')
Copilot AI Jun 10, 2025

[nitpick] Path construction via string concatenation can be fragile; prefer os.path.join(np, 'mem_banks', str(bank), 'properties').

Suggested change
bank_props = amdkfd.parse_props(np + f'/mem_banks/{bank}/properties')
bank_props = amdkfd.parse_props(os.path.join(np, 'mem_banks', str(bank), 'properties'))


# See /usr/include/linux/kfd_sysfs.h for possible heap types
#
# Count public and private framebuffer memory as VRAM
if bank_props['heap_type'] in [1, 2]:
Copilot AI Jun 10, 2025

[nitpick] Magic numbers 1 and 2 for heap types are unclear; consider defining named constants or an enum for readability and future maintenance.

Suggested change
if bank_props['heap_type'] in [1, 2]:
if bank_props['heap_type'] in [HEAP_TYPE_PUBLIC, HEAP_TYPE_PRIVATE]:


Member

If this is correct we should try and take this in another PR, @alaviss. I have no idea what 1 and 2 mean.
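For reference, the code comment above points at /usr/include/linux/kfd_sysfs.h, where the heap types are enumerated; per that comment, 1 and 2 are the public (CPU-visible) and private (GPU-only) framebuffer heaps, i.e. the two kinds of VRAM. A sketch of the named-constant suggestion, with the constant names chosen here for illustration:

```python
# Heap types as enumerated in linux/kfd_sysfs.h; names are illustrative
HEAP_TYPE_SYSTEM = 0      # system memory, not VRAM
HEAP_TYPE_FB_PUBLIC = 1   # CPU-visible framebuffer memory (VRAM)
HEAP_TYPE_FB_PRIVATE = 2  # GPU-only framebuffer memory (VRAM)

def is_vram_bank(bank_props):
    """True when a KFD memory bank's heap is framebuffer memory."""
    return int(bank_props['heap_type']) in (HEAP_TYPE_FB_PUBLIC,
                                            HEAP_TYPE_FB_PRIVATE)
```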

@ericcurtin ericcurtin merged commit 4808a49 into containers:main Jun 10, 2025
14 of 16 checks passed
@alaviss alaviss deleted the push-pwxuznmnqptr branch June 10, 2025 22:21