Onnx Export issues #5485

@JinxedGrim

Description

I have been unable to export the model to ONNX. I am using a sample image that is 1344 x 1344.
Interestingly, if I set RPN_NMS_THRESHOLD in the config to 0.00, the export succeeds, but the exported model produces garbage output. The resize augmentation is set to:

        aug = T.ResizeShortestEdge(
            [1344, 1344], 1344
        )
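
As a sanity check on the resize settings above, here is a minimal pure-Python sketch of how a shortest-edge resize is generally computed. It is an approximation written for illustration, not detectron2's code, and the function name `shortest_edge_output_shape` is hypothetical. Under these assumptions, a 1344 x 1344 input with a short edge of 1344 and a max size of 1344 should pass through unchanged.

```python
def shortest_edge_output_shape(h, w, short_edge, max_size):
    """Approximate the (height, width) produced by a shortest-edge resize.

    Hypothetical helper for illustration only; not part of detectron2.
    """
    # Scale so that the shorter side becomes `short_edge`.
    scale = short_edge / min(h, w)
    if h < w:
        new_h, new_w = float(short_edge), scale * w
    else:
        new_h, new_w = scale * h, float(short_edge)
    # If the longer side now exceeds `max_size`, shrink both sides to fit.
    if max(new_h, new_w) > max_size:
        shrink = max_size / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink
    # Round to the nearest integer pixel size.
    return int(new_h + 0.5), int(new_w + 0.5)


print(shortest_edge_output_shape(1344, 1344, 1344, 1344))  # (1344, 1344)
```

So the traced input should be 1344 x 1344, which suggests the resize step by itself is not changing the image size.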

python export_model.py --sample-image ./lvis_sample_1344.jpg --config-file ../../configs/LVISv0.5-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml --export-method tracing --format onnx --output ./
[08/21 12:09:18 detectron2]: Command line arguments: Namespace(format='onnx', export_method='tracing', config_file='../../configs/LVISv0.5-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', sample_image='./lvis_sample_1344.jpg', run_eval=False, output='./', opts=[])
[W821 12:09:18.102378298 init.cpp:855] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 1) (function operator())
[08/21 12:09:19 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-50.pkl ...
[08/21 12:09:19 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[08/21 12:09:19 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up - Total num: 54
Some model parameters or buffers are not found in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
The checkpoint state_dict contains keys that are not used by the model:
  fc1000.{bias, weight}
  stem.conv1.bias
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/image_list.py:86: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert t.shape[:-2] == tensors[0].shape[:-2], t.shape
/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/functional.py:554: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /pytorch/aten/src/ATen/native/TensorShape.cpp:4314.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:151: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if tensor.numel() == 0:
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:155: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert tensor.dim() == 2 and tensor.size(-1) == 4, tensor.size()
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:151: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if tensor.numel() == 0:
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:155: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert tensor.dim() == 2 and tensor.size(-1) == 4, tensor.size()
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/proposal_generator/proposal_utils.py:106: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not valid_mask.all():
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:191: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!"
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:192: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  h, w = box_size
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/layers/nms.py:17: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert boxes.shape[-1] == 4
/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/__init__.py:2150: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert condition, message
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/layers/roi_align.py:55: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert rois.dim() == 2 and rois.size(1) == 5
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py:138: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if not valid_mask.all():
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:151: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if tensor.numel() == 0:
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:155: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert tensor.dim() == 2 and tensor.size(-1) == 4, tensor.size()
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:191: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!"
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/structures/boxes.py:192: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  h, w = box_size
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py:155: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if num_bbox_reg_classes == 1:
/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/layers/nms.py:17: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert boxes.shape[-1] == 4
Traceback (most recent call last):
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/export_model.py", line 247, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/export_model.py", line 228, in main
    exported_model = export_tracing(torch_model, sample_inputs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/export_model.py", line 134, in export_tracing
    torch.onnx.export(traceable_model, (image,), f, opset_version=STABLE_ONNX_OPSET_VERSION)
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/__init__.py", line 396, in export
    export(
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/utils.py", line 529, in export
    _export(
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/utils.py", line 1467, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/utils.py", line 1087, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/utils.py", line 971, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/onnx/utils.py", line 878, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/jit/_trace.py", line 1501, in _get_trace_graph
    outs = ONNXTracedModule(
           ^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, _out = torch._C._create_graph_by_tracing(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/jit/_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1741, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/export/flatten.py", line 294, in forward
    outputs = self.inference_func(self.model, *inputs_orig_format)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/export_model.py", line 119, in inference
    inst = model.inference(inputs, do_postprocess=False)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/meta_arch/rcnn.py", line 213, in inference
    results, _ = self.roi_heads(images, features, proposals, None)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1741, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/roi_heads.py", line 747, in forward
    pred_instances = self._forward_box(features, proposals)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/roi_heads.py", line 815, in _forward_box
    pred_instances, _ = self.box_predictor.inference(predictions, proposals)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py", line 479, in inference
    return fast_rcnn_inference(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py", line 79, in fast_rcnn_inference
    result_per_image = [
                       ^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py", line 80, in <listcomp>
    fast_rcnn_inference_single_image(
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/modeling/roi_heads/fast_rcnn.py", line 162, in fast_rcnn_inference_single_image
    keep = batched_nms(boxes, scores, filter_inds[:, 1], nms_thresh)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/../../detectron2/layers/nms.py", line 22, in batched_nms
    return box_ops.batched_nms(boxes.float(), scores, idxs, iou_threshold)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torchvision/ops/boxes.py", line 76, in batched_nms
    return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/jit/_trace.py", line 1448, in wrapper
    return compiled_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torchvision/ops/boxes.py", line 95, in _batched_nms_coordinate_trick
    offsets = idxs.to(boxes) * (max_coordinate + torch.tensor(1).to(boxes))
    boxes_for_nms = boxes + offsets[:, None]
    keep = nms(boxes_for_nms, scores, iou_threshold)
           ~~~ <--- HERE
    return keep
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torchvision/ops/boxes.py", line 41, in nms
        _log_api_usage_once(nms)
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: Trying to create tensor with negative dimension -2137697952: [-2137697952]
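
For context on the frame where the trace fails: the `_batched_nms_coordinate_trick` lines shown above shift each box by an offset derived from its class index, so that boxes of different classes can never overlap in a single NMS call. The following is a pure-Python sketch of that offsetting step only (the helper names are hypothetical, and plain lists stand in for tensors). It does not reproduce or explain the negative-dimension error itself.

```python
def offset_boxes_by_class(boxes, class_ids):
    """Shift every box by class_id * (max_coordinate + 1).

    Hypothetical stand-in for the offsetting step; boxes are [x1, y1, x2, y2].
    """
    max_coordinate = max(c for box in boxes for c in box)
    shifted = []
    for box, cls in zip(boxes, class_ids):
        offset = cls * (max_coordinate + 1)
        shifted.append([c + offset for c in box])
    return shifted


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Two identical boxes that belong to different classes.
boxes = [[0, 0, 10, 10], [0, 0, 10, 10]]
shifted = offset_boxes_by_class(boxes, class_ids=[0, 1])
print(iou(boxes[0], boxes[1]))      # 1.0 before the shift
print(iou(shifted[0], shifted[1]))  # 0.0 after the shift
```

After the shift, a single class-agnostic NMS pass behaves like a per-class NMS, which is why the failing call is a plain `nms` on `boxes_for_nms`.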

Steps to reproduce:

  1. Install detectron2 and its dependencies (torch, etc.).
  2. Run the export command shown above.

Expected behavior:

I should be able to export the model to ONNX.

Environment:

Provide your environment information using the following command:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py && python collect_env.py

sys.platform linux
Python 3.11.13 (main, Jun 5 2025, 13:12:00) [GCC 11.2.0]
numpy 2.3.0
detectron2 0.6 @/home/***/Development/***/detectron2/detectron2
Compiler GCC 11.2
CUDA compiler CUDA 12.9
detectron2 arch flags 5.0, 8.9
DETECTRON2_ENV_MODULE
PyTorch 2.7.1+cu126 @/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI True
GPU available Yes
GPU 0 NVIDIA GeForce RTX 4090 (arch=8.9)
GPU 1 Quadro K620 (arch=5.0)
Driver version 570.124.06
CUDA_HOME /home/***/miniconda3/envs/DtcService
Pillow 11.2.1
torchvision 0.22.1+cu126 @/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torchvision
torchvision arch flags 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.11.0


PyTorch built with:

  • GCC 11.2
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 12.6
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 90.5.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=e2d141dbde55c2a4370fac5165b0561b6af4798b, CUDA_VERSION=12.6, CUDNN_VERSION=9.5.1, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

Testing NCCL connectivity ... this should not hang.
W0821 12:11:08.805000 1505018 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 1505054 via signal SIGTERM
Traceback (most recent call last):
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/collect_env.py", line 263, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/collect_env.py", line 259, in main
    test_nccl_ops()
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/collect_env.py", line 226, in test_nccl_ops
    mp.spawn(_test_nccl_worker, nprocs=num_gpu, args=(num_gpu, dist_url), daemon=False)
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/***/Development/***/Dependencies/detectron2/tools/deploy/collect_env.py", line 234, in _test_nccl_worker
    dist.barrier(device_ids=[rank])
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/***/miniconda3/envs/DtcService/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4635, in barrier
    work = group.barrier(opts=opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'operation not supported'
