Conversation

@Delaunay (Collaborator)

No description provided.

@Delaunay changed the title from "Pytorch 2.8 + cu129" to "Staging" on Aug 21, 2025
@Delaunay (Collaborator, Author) commented:

ppo.D3
======
  * no training rate retrieved
  * Error codes = 1, 1
  * 1 exceptions found
    * 1 x AttributeError: jax.interpreters.xla.pytype_aval_mappings was deprecated in JAX v0.5.0 and removed in JAX v0.7.0. jax.core.pytype_aval_mappings can be used as a replacement in most cases.
        | Traceback (most recent call last):
        |   File "/tmp/7532512/cuda/results/venv/torch/bin/voir", line 10, in <module>
        |     sys.exit(main())
        |              ^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/cli.py", line 128, in main
        |     ov(sys.argv[1:] if argv is None else argv)
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/phase.py", line 331, in __call__
        |     self._run(*args, **kwargs)
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/overseer.py", line 238, in _run
        |     script, argv, func = _resolve_function(self.options)
        |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/overseer.py", line 271, in _resolve_function
        |     return script, options.ARGV, resolve_script(script)
        |                                  ^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/scriptutils.py", line 36, in resolve_script
        |     exec(prep, glb, glb)
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/main.py", line 12, in <module>
        |     from ppo import add_ppo_command, main as ppo_main
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/ppo.py", line 12, in <module>
        |     import distrax
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/distrax/__init__.py", line 18, in <module>
        |     from distrax._src.bijectors.bijector import Bijector
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/distrax/_src/bijectors/bijector.py", line 27, in <module>
        |     tfb = tfp.bijectors
        |           ^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 56, in __getattr__
        |     module = self._load()
        |              ^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/lazy_loader.py", line 43, in _load
        |     module = importlib.import_module(self.__name__)
        |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/network/scratch/d/delaunap/shared/cuda/.env/3.12/lib/python3.12/importlib/__init__.py", line 90, in import_module
        |     return _bootstrap._gcd_import(name[level:], package, level)
        |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/substrates/jax/__init__.py", line 42, in <module>
        |     from tensorflow_probability.substrates.jax import bijectors
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/substrates/jax/bijectors/__init__.py", line 19, in <module>
        |     from tensorflow_probability.substrates.jax.bijectors.absolute_value import AbsoluteValue
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/substrates/jax/bijectors/absolute_value.py", line 17, in <module>
        |     from tensorflow_probability.python.internal.backend.jax.compat import v2 as tf
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/backend/jax/__init__.py", line 19, in <module>
        |     from tensorflow_probability.python.internal.backend.jax import compat
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/backend/jax/compat.py", line 17, in <module>
        |     from tensorflow_probability.python.internal.backend.jax import v1
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/backend/jax/v1.py", line 23, in <module>
        |     from tensorflow_probability.python.internal.backend.jax import linalg_impl
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/backend/jax/linalg_impl.py", line 23, in <module>
        |     from tensorflow_probability.python.internal.backend.jax import ops
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/tensorflow_probability/python/internal/backend/jax/ops.py", line 681, in <module>
        |     jax.interpreters.xla.pytype_aval_mappings[onp.ndarray])
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/jax/_src/deprecations.py", line 54, in getattr
        |     raise AttributeError(message)
        | AttributeError: jax.interpreters.xla.pytype_aval_mappings was deprecated in JAX v0.5.0 and removed in JAX v0.7.0. jax.core.pytype_aval_mappings can be used as a replacement in most cases.
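For context, the crash originates in tensorflow-probability's JAX backend, which still reads an alias removed in JAX v0.7.0; the deprecation message names `jax.core.pytype_aval_mappings` as the replacement. Below is a minimal runtime shim, sketched under the assumption that it runs before `import distrax`; it is a workaround sketch, not the fix adopted in this PR, which pins dependencies instead:

```python
# Hypothetical shim: re-expose the removed alias under its pre-0.7.0 name so
# tensorflow-probability's `jax.interpreters.xla.pytype_aval_mappings` lookup
# resolves. Must execute before distrax (and hence tfp) is imported.
import jax.core
import jax.interpreters.xla as xla

if not hasattr(xla, "pytype_aval_mappings"):
    # jax.core.pytype_aval_mappings is the replacement named in the
    # deprecation message above; setting a real module attribute takes
    # precedence over the module-level __getattr__ that raises.
    xla.pytype_aval_mappings = jax.core.pytype_aval_mappings

import distrax  # noqa: E402  # the lookup in distrax -> tfp now succeeds
```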

@Delaunay (Collaborator, Author) commented:

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 9654 96-Core Processor
n_cpu:    192
product:  NVIDIA H100 80GB HBM3
n_gpu:    8
memory:   81559.0 MiB

Breakdown
---------
bench                    | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
brax                     |    0 |   1 |    8 | 1034789.57 |   0.3% |   2.6% |        2488 | 1034789.57 |   1.00
diffusion-gpus           |    0 |   1 |    8 |     408.14 |   0.1% |   1.1% |       57816 |     408.14 |   1.00
diffusion-single         |    0 |   8 |    1 |      54.60 |   0.4% |   9.5% |       55166 |     441.76 |   0.00
dimenet                  |    2 |   8 |    1 |     502.12 |   1.3% |  27.4% |        5960 |    2280.50 |   1.00
dinov2-giant-gpus        |    0 |   1 |    8 |     939.72 |   0.8% |   6.6% |       68872 |     939.72 |   1.00
dinov2-giant-single      |    0 |   8 |    1 |     113.24 |   0.5% |  11.4% |       78338 |     916.44 |   0.00
dqn                      |    8 |   8 |    1 |        nan |   nan% |   nan% |         nan |        nan |   0.00
bf16                     |    0 |   8 |    1 |     789.72 |   0.3% |   7.7% |         nan |    6358.54 |   0.00
fp16                     |    0 |   8 |    1 |     752.18 |   0.2% |   4.8% |        1312 |    6044.59 |   0.00
fp32                     |    0 |   8 |    1 |      51.86 |   0.1% |   1.5% |        1696 |     415.45 |   0.00
tf32                     |    0 |   8 |    1 |     437.46 |   0.3% |   7.2% |        1696 |    3517.66 |   0.00
bert-fp16                |    0 |   8 |    1 |     496.44 |   0.5% |  11.7% |       15630 |    4015.18 |   0.00
bert-fp32                |    0 |   8 |    1 |     111.85 |   0.2% |   5.6% |       20770 |     900.02 |   0.00
bert-tf32                |    0 |   8 |    1 |     257.16 |   0.3% |   8.1% |       20770 |    2074.15 |   0.00
bert-tf32-fp16           |    0 |   8 |    1 |     497.75 |   0.5% |  11.6% |       15630 |    4022.46 |   1.00
reformer                 |    0 |   8 |    1 |     105.95 |   0.3% |   7.2% |       13020 |     854.25 |   1.00
t5                       |    0 |   8 |    1 |      94.30 |   0.4% |   8.3% |       33986 |     761.08 |   0.00
whisper                  |    0 |   8 |    1 |     925.12 |   0.5% |  11.7% |        8898 |    7478.68 |   0.00
lightning                |    0 |   8 |    1 |    1083.54 |   0.3% |   8.7% |       26824 |    8739.51 |   0.00
lightning-gpus           |    0 |   1 |    8 |    8264.13 |   0.6% |   6.1% |       28220 |    8264.13 |   1.00
llava-single             |    0 |   8 |    1 |       4.14 |   0.4% |  10.2% |       72708 |      33.45 |   1.00
llama                    |    0 |   8 |    1 |     511.79 |   1.6% |  28.5% |       27388 |    4080.56 |   1.00
llm-full-mp-gpus         |    0 |   1 |    8 |     154.83 |   3.0% |  15.9% |       21264 |     154.83 |   1.00
llm-lora-ddp-gpus        |    0 |   1 |    8 |   38771.29 |   0.7% |   3.8% |       18328 |   38771.29 |   1.00
llm-lora-mp-gpus         |    0 |   1 |    8 |    4418.98 |   2.3% |  12.1% |       41886 |    4418.98 |   1.00
llm-lora-single          |    0 |   8 |    1 |    7010.25 |   0.2% |   3.1% |       31314 |   56127.26 |   1.00
pna                      |    0 |   8 |    1 |    6764.34 |   0.6% |  15.1% |       39320 |   54209.56 |   1.00
ppo                      |    8 |   8 |    1 |        nan |   nan% |   nan% |         nan |        nan |   1.00
recursiongfn             |    0 |   8 |    1 |   11733.33 |   0.9% |  21.6% |       16996 |   94578.39 |   1.00
rlhf-gpus                |    0 |   1 |    8 |    5423.80 |   3.6% |  19.5% |       20832 |    5423.80 |   0.00
rlhf-single              |    0 |   8 |    1 |    1501.63 |   1.4% |  33.8% |       18674 |   12014.84 |   1.00
focalnet                 |    0 |   8 |    1 |     707.40 |   0.5% |  12.1% |       23212 |    5734.86 |   0.00
convnext_large-fp16      |    0 |   8 |    1 |     747.63 |   0.9% |  15.8% |       27012 |    6131.22 |   0.00
convnext_large-fp32      |    0 |   8 |    1 |      98.98 |   0.7% |  11.6% |       46862 |     806.45 |   0.00
convnext_large-tf32      |    0 |   8 |    1 |     246.30 |   0.8% |  14.1% |       49234 |    2014.73 |   0.00
convnext_large-tf32-fp16 |    0 |   8 |    1 |     742.90 |   0.9% |  15.8% |       27012 |    6092.09 |   1.00
regnet_y_128gf           |    0 |   8 |    1 |     223.70 |   0.4% |   9.6% |       30696 |    1809.47 |   1.00
resnet152-ddp-gpus       |    0 |   1 |    8 |    8359.97 |   0.0% |   0.4% |       28012 |    8359.97 |   0.00
resnet50                 |    0 |   8 |    1 |    2632.69 |   0.6% |  14.1% |       13444 |   21311.40 |   1.00
resnet50-noio            |    0 |   8 |    1 |    3975.92 |   0.1% |   3.9% |       26956 |   31853.61 |   0.00
vjepa-gpus               |    0 |   1 |    8 |     263.67 |   5.3% |  41.8% |       66118 |     263.67 |   1.00
vjepa-single             |    0 |   8 |    1 |      42.16 |   0.8% |  17.8% |       64558 |     341.12 |   1.00

Scores
------
Failure rate:       6.77% (FAIL)
Score:            2804.83

@Delaunay force-pushed the staging branch 4 times, most recently from 4c928a8 to 4fd868c on August 25, 2025 at 17:36. The accompanying commit message:

* ignore tensorflow-probability

* Pin Dependencies

---------

Co-authored-by: pierre.delaunay <[email protected]>
@Delaunay (Collaborator, Author) commented:

dqn.D7
======
  * no training rate retrieved
  * Error codes = 1, 1
  * 1 exceptions found
    * 1 x AttributeError: jax.tree_map was removed in JAX v0.6.0: use jax.tree.map (jax v0.4.25 or newer) or jax.tree_util.tree_map (any JAX version).
        | Traceback (most recent call last):
        |   File "/tmp/7562516/cuda/results/venv/torch/bin/voir", line 10, in <module>
        |     sys.exit(main())
        |              ^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/cli.py", line 128, in main
        |     ov(sys.argv[1:] if argv is None else argv)
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/phase.py", line 331, in __call__
        |     self._run(*args, **kwargs)
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/overseer.py", line 242, in _run
        |     set_value(func())
        |               ^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/voir/scriptutils.py", line 37, in <lambda>
        |     return lambda: exec(mainsection, glb, glb)
        |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/main.py", line 37, in <module>
        |     main()
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/main.py", line 30, in main
        |     benchmark(args)
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/dqn.py", line 357, in main
        |     compiled_fn = train_vjit.lower(rngs).compile()
        |                   ^^^^^^^^^^^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/milabench/benchmarks/purejaxrl/dqn.py", line 111, in train
        |     target_network_params=jax.tree_map(lambda x: jnp.copy(x), network_params),
        |                           ^^^^^^^^^^^^
        |   File "/home/mila/d/delaunap/scratch/shared/cuda/venv/torch/lib/python3.12/site-packages/jax/_src/deprecations.py", line 54, in getattr
        |     raise AttributeError(message)
        | AttributeError: jax.tree_map was removed in JAX v0.6.0: use jax.tree.map (jax v0.4.25 or newer) or jax.tree_util.tree_map (any JAX version).
        | --------------------
        | For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
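The fix is mechanical: the error message itself names the replacements. A minimal sketch of the change for `benchmarks/purejaxrl/dqn.py` (the pytree below is illustrative, not milabench's actual network parameters):

```python
import jax
import jax.numpy as jnp

def make_target_params(network_params):
    # Before (alias removed in JAX v0.6.0):
    #     jax.tree_map(lambda x: jnp.copy(x), network_params)
    # After, per the deprecation message (jax >= 0.4.25):
    return jax.tree.map(lambda x: jnp.copy(x), network_params)

# Illustrative parameter pytree standing in for the DQN network.
params = {"w": jnp.ones((2, 2)), "b": jnp.zeros(2)}
target = make_target_params(params)
assert target["w"] is not params["w"]  # copied leaves, not aliases
```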

@Delaunay (Collaborator, Author) commented:

System
------
cpu:      AMD EPYC 7742 64-Core Processor
n_cpu:    256
product:  NVIDIA A100-SXM4-80GB
n_gpu:    8
memory:   81920.0 MiB

Breakdown
---------
bench                    | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |       score | weight
brax                     |    0 |   1 |    8 |  930605.80 |   0.4% |   3.2% |        2228 |   930605.80 |   1.00
diffusion-gpus           |    0 |   1 |    8 |     212.40 |   0.1% |   0.5% |       59498 |      212.40 |   1.00
diffusion-single         |    0 |   8 |    1 |      27.53 |   0.4% |   9.5% |       53894 |      222.77 |   0.00
dimenet                  |    0 |   8 |    1 |     313.60 |   1.2% |  29.0% |        5460 |     2534.72 |   1.00
dinov2-giant-gpus        |    0 |   1 |    8 |     452.68 |   0.6% |   4.9% |       69012 |      452.68 |   1.00
dinov2-giant-single      |    0 |   8 |    1 |      54.93 |   0.5% |  10.6% |       78644 |      444.50 |   0.00
dqn                      |    8 |   8 |    1 |        nan |   nan% |   nan% |         nan |         nan |   0.00
bf16                     |    0 |   8 |    1 |     291.93 |   0.3% |   7.5% |        1788 |     2351.65 |   0.00
fp16                     |    0 |   8 |    1 |     287.43 |   0.2% |   4.9% |        1788 |     2310.08 |   0.00
fp32                     |    0 |   8 |    1 |      19.13 |   0.0% |   1.0% |        2166 |      153.16 |   0.00
tf32                     |    0 |   8 |    1 |     144.99 |   0.2% |   4.8% |        2166 |     1165.04 |   0.00
bert-fp16                |    0 |   8 |    1 |     275.38 |   0.5% |  11.0% |       16038 |     2227.69 |   0.00
bert-fp32                |    0 |   8 |    1 |      45.41 |   0.2% |   5.0% |       21170 |      365.24 |   0.00
bert-tf32                |    0 |   8 |    1 |     144.67 |   0.4% |   9.2% |       21170 |     1169.07 |   0.00
bert-tf32-fp16           |    0 |   8 |    1 |     274.19 |   0.5% |  12.3% |       16038 |     2217.88 |   1.00
reformer                 |    0 |   8 |    1 |      63.00 |   0.3% |   8.3% |       13450 |      508.49 |   1.00
t5                       |    0 |   8 |    1 |      52.52 |   0.4% |   9.8% |       34388 |      424.62 |   0.00
whisper                  |    0 |   8 |    1 |     500.23 |   0.5% |  11.2% |        9262 |     4046.23 |   0.00
lightning                |    0 |   8 |    1 |     627.58 |   0.3% |   8.9% |       26614 |     5062.36 |   0.00
lightning-gpus           |    0 |   1 |    8 |    4807.85 |   0.8% |   7.6% |       27456 |     4807.85 |   1.00
llava-single             |    0 |   8 |    1 |       2.25 |   0.5% |  10.9% |       15558 |       18.19 |   1.00
llama                    |    0 |   8 |    1 |     329.80 |   1.6% |  28.4% |       27836 |     2633.16 |   1.00
llm-full-mp-gpus         |    0 |   1 |    8 |      21.12 |   3.2% |  16.8% |       22218 |       21.12 |   1.00
llm-lora-ddp-gpus        |    0 |   1 |    8 |   17306.11 |   0.7% |   3.7% |       19472 |    17306.11 |   1.00
llm-lora-mp-gpus         |    0 |   1 |    8 |    1916.11 |   2.3% |  12.4% |       41924 |     1916.11 |   1.00
llm-lora-single          |    0 |   8 |    1 |    3116.12 |   0.1% |   1.9% |       31624 |    24967.77 |   1.00
pna                      |    0 |   8 |    1 |    4264.74 |   0.5% |  11.1% |       39724 |    34282.26 |   1.00
ppo                      |    0 |   8 |    1 | 1501238.49 |   2.6% |  57.1% |        1502 | 12016153.18 |   1.00
recursiongfn             |    0 |   8 |    1 |    6254.37 |   1.7% |  38.6% |       10934 |    50306.32 |   1.00
rlhf-gpus                |    0 |   1 |    8 |    2275.09 |   2.3% |  12.2% |       20636 |     2275.09 |   0.00
rlhf-single              |    0 |   8 |    1 |     739.72 |   0.6% |  13.2% |       18858 |     5915.91 |   1.00
focalnet                 |    0 |   8 |    1 |     410.53 |   0.6% |  12.9% |       23550 |     3328.24 |   0.00
convnext_large-fp16      |    0 |   8 |    1 |     371.35 |   0.9% |  15.8% |       27378 |     3044.48 |   0.00
convnext_large-fp32      |    0 |   8 |    1 |      44.49 |   0.6% |  10.7% |       47202 |      361.98 |   0.00
convnext_large-tf32      |    0 |   8 |    1 |     159.03 |   0.9% |  14.3% |       49636 |     1301.36 |   0.00
convnext_large-tf32-fp16 |    0 |   8 |    1 |     372.25 |   0.9% |  15.8% |       27378 |     3051.93 |   1.00
regnet_y_128gf           |    0 |   8 |    1 |     118.71 |   0.4% |   9.9% |       30730 |      960.49 |   1.00
resnet152-ddp-gpus       |    0 |   1 |    8 |    4846.03 |   0.1% |   0.8% |       27252 |     4846.03 |   0.00
resnet50                 |    0 |   8 |    1 |    1509.35 |   0.6% |  13.2% |       13820 |    12223.16 |   1.00
resnet50-noio            |    0 |   8 |    1 |    2148.71 |   0.1% |   4.6% |       27446 |    17214.80 |   0.00
vjepa-gpus               |    0 |   1 |    8 |     139.72 |   4.7% |  37.5% |       61968 |      139.72 |   1.00
vjepa-single             |    0 |   8 |    1 |      21.95 |   0.8% |  18.3% |       63684 |      177.57 |   1.00

Scores
------
Failure rate:       3.01% (FAIL)
Score:            3197.04

Errors
------
8 errors, details in HTML report.

Comment on lines +466 to +469:

    return {
        "status": "offline",
        "reason": result["stderr"]
    }

Check warning: Code scanning / CodeQL

Information exposure through an exception (Medium)

Stack trace information flows to this location and may be exposed to an external user.
Copilot Autofix (AI):

To fix the information exposure issue, ensure that exception details (such as those from str(e)) are not sent back to the client in HTTP responses. Instead, return a generic error message, and log the full details (stack trace and exception message) only on the server side. This change should be applied in all relevant exception handlers, especially those that formulate API responses with details from caught exceptions. Specifically, edit remote_command() to return a generic "Internal error occurred" message in its 'stderr' field in the event of an unexpected exception, while still printing/logging exception details server-side. No changes to functionality or API structure are required; only the error message needs to be sanitized.


Suggested changeset 1: milabench/web/slurm.py

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/milabench/web/slurm.py b/milabench/web/slurm.py
--- a/milabench/web/slurm.py
+++ b/milabench/web/slurm.py
@@ -415,10 +415,11 @@
     except Exception as e:
         import traceback
         traceback.print_exc()
+        # Do not expose internal error details to the client
         return {
             'success': False,
             'stdout': '',
-            'stderr': str(e),
+            'stderr': 'Internal error occurred',
             'returncode': -1
         }
 
EOF
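For illustration, here is a self-contained sketch of the pattern the autofix applies; the function name and the subprocess call are placeholders (the real remote_command() presumably wraps SSH), but the return shape matches the diff above:

```python
import logging
import subprocess
import traceback

logger = logging.getLogger("milabench.web.slurm")

def remote_command_sketch(cmd: list[str]) -> dict:
    """Run a command without leaking exception details to the caller."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return {
            "success": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "returncode": proc.returncode,
        }
    except Exception:
        # Full stack trace stays in the server-side log only.
        logger.error("remote command failed:\n%s", traceback.format_exc())
        # Clients receive a generic message, as in the autofix.
        return {
            "success": False,
            "stdout": "",
            "stderr": "Internal error occurred",
            "returncode": -1,
        }
```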

This autofix suggestion is now outdated and can no longer be committed.

@Delaunay merged commit c6cecc2 into master on Aug 26, 2025 (3 of 7 checks passed).
@Delaunay deleted the staging branch on August 26, 2025 at 15:54.