
Fix tests for GPUs lacking memory info from nvidia-smi #10391


Merged
6 commits merged into pyg-team:master from dgx_spark on Aug 5, 2025

Conversation

drivanov (Contributor) commented on Aug 1, 2025

nvidia-smi does not report memory usage on certain newer GPU cards, likely due to changes in driver or hardware support:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.10.07              Driver Version: 580.10.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA Graphics Device         On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   29C    P8              2W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

As a result, two PyG tests are now failing:

2025-07-29 10:39:11,273 - INFO - FAILED test/profile/test_profile.py::test_profileit_cuda - ValueError: invalid literal for int() with base 10: '[N/A]'
2025-07-29 10:39:11,273 - INFO - FAILED test/profile/test_profile_utils.py::test_get_gpu_memory_from_nvidia_smi - ValueError: invalid literal for int() with base 10: '[N/A]'

This PR fixes the issue; a sketch of the general approach follows below.
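For context, the failure comes from converting the nvidia-smi memory readout to an integer. The sketch below is a hypothetical illustration, not the merged patch: it shows one defensive way to query free GPU memory so that a '[N/A]' reading yields None instead of raising ValueError. The helper name query_gpu_free_memory_mib is chosen here for illustration only.

# Hypothetical sketch (not the merged patch): tolerate GPUs whose
# nvidia-smi memory readout is '[N/A]' instead of an integer.
from typing import Optional
import subprocess
import warnings


def query_gpu_free_memory_mib(device: int = 0) -> Optional[int]:
    """Return free GPU memory in MiB, or None if nvidia-smi cannot report it."""
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=memory.free',
        '--format=csv,noheader,nounits',
        f'--id={device}',
    ]).decode().strip()
    try:
        return int(out)
    except ValueError:  # e.g. '[N/A]' on GPUs without memory reporting
        warnings.warn(f"nvidia-smi returned {out!r} for GPU {device}; "
                      "skipping memory readout")
        return None

With a helper of this shape, callers and tests can skip or relax their memory assertions when None is returned rather than crashing on int('[N/A]').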

drivanov requested review from wsad1 and rusty1s as code owners on August 1, 2025, 18:31
puririshi98 (Contributor) commented on Aug 1, 2025

Please test that this works on A100, H100, and B100, and attach the logs. Also, please make CI green.

codecov bot commented Aug 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.83%. Comparing base (c211214) to head (f3e6c7e).
⚠️ Report is 79 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10391      +/-   ##
==========================================
- Coverage   86.11%   85.83%   -0.28%     
==========================================
  Files         496      501       +5     
  Lines       33655    34454     +799     
==========================================
+ Hits        28981    29574     +593     
- Misses       4674     4880     +206     


drivanov (Contributor, Author) commented on Aug 4, 2025

Providing log files per @puririshi98’s request:
GeForce_RTX_5080.log.TXT
A100.log.TXT
digit.log.TXT
H100.log.TXT
B200.log.TXT

drivanov (Contributor, Author) commented on Aug 4, 2025

Additionally, I fixed the following CI issues as requested by @puririshi98:

Run uv run --no-project mypy --cache-dir=/dev/null
error: --install-types failed (an error blocked analysis of which types to install)
torch_geometric/utils/smiles.py:151: error: Module has no attribute "DisableLog"  [attr-defined]
torch_geometric/datasets/qm9.py:205: error: Module has no attribute "DisableLog"  [attr-defined]
torch_geometric/datasets/git_mol_dataset.py:105: error: Module has no attribute "DisableLog"  [attr-defined]
Found 3 errors in 3 files (checked 642 source files)
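For reference, RDLogger.DisableLog exists at runtime in RDKit but is not declared in the type stubs that mypy sees, which is why the calls above are flagged as attr-defined. A hedged sketch of one common way to keep the call while satisfying mypy (not necessarily the change that was merged):

# Hedged sketch: the call works at runtime, but the RDKit stubs do not
# declare RDLogger.DisableLog, so a targeted ignore silences only this error.
from rdkit import RDLogger

RDLogger.DisableLog('rdApp.*')  # type: ignore[attr-defined]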

puririshi98 (Contributor) left a comment:
lgtm thnx

puririshi98 merged commit 7ec88b9 into pyg-team:master on Aug 5, 2025
19 checks passed
drivanov deleted the dgx_spark branch on August 5, 2025, 17:15