Description
Search before asking
- I have searched the YOLOv5 issues and found no similar bug report.
YOLOv5 Component
Detection, Export
Bug
After successfully exporting to yolov5s.engine, I ran code similar to detect.py to get per-frame results and got CUDA error: an illegal memory access was encountered.
[TensorRT] ERROR: ../rtSafe/cuda/caskConvolutionRunner.cpp (373) - Cask Error in checkCaskExecError<false>: 7 (Cask Convolution execution)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nvidia/alphapose/alphapose/utils/webcam_detector.py", line 131, in frame_preprocess
    img_det = self.image_detection((img_k, orig_img, im_name, im_dim_list_k))
  File "/home/nvidia/alphapose/alphapose/utils/webcam_detector.py", line 140, in image_detection
    dets = self.detector.images_detection(img, im_dim_list)
  File "/home/nvidia/alphapose/detector/yolov5_hub_api.py", line 108, in images_detection
    pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, classes, agnostic_nms, max_det=500)
  File "/home/nvidia/alphapose/detector/yolov5/utils/general.py", line 685, in non_max_suppression
    xc = prediction[..., 4] > conf_thres  # candidates
RuntimeError: CUDA error: an illegal memory access was encountered
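Note that CUDA kernel launches are asynchronous, so the line flagged in the traceback (the candidate mask in non_max_suppression) is only the first synchronization point, not necessarily the faulting kernel. A minimal sketch of how to localize the real failure, by forcing synchronous launches before torch is first used:

# Sketch: make CUDA launches synchronous so the traceback points at the
# actual faulting kernel rather than the first sync point.
# Must run before the first CUDA call.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the variable so the CUDA context sees it
# ... then run the same detection code as in the example below ...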
Environment
YOLOv5 v6.1, Jetson Xavier NX, CUDA 10.2, cuDNN 8.0, TensorRT 7.1.3
Minimal Reproducible Example
# Imports assumed from the YOLOv5 v6.1 repo layout (module paths may differ inside AlphaPose)
from models.common import DetectMultiBackend
from utils.general import check_img_size, non_max_suppression
from utils.torch_utils import select_device

def load_model(self):
    args = self.detector_opt
    device = select_device(0)
    model = DetectMultiBackend(self.model_weights, device=device,
                               data='/home/nvidia/alphapose/detector/yolov5/data/coco128.yaml')
    stride, names, pt, jit, onnx, engine = model.stride, model.names, model.pt, model.jit, model.onnx, model.engine
    imgsz = check_img_size(self.inp_dim, s=stride)  # check image size

    # Half
    self.half &= (pt or jit or onnx or engine) and device.type != 'cpu'  # FP16 supported on limited backends with CUDA
    if pt or jit:
        model.model.half() if self.half else model.model.float()  # cast pt/jit weights to match the input dtype
    self.model = model
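One observation before the second half of the example: the exported yolov5s.engine is built with static input bindings, so a shape or dtype mismatch between the tensors fed at runtime and the bindings baked into the engine can by itself produce this kind of illegal memory access. A quick sanity check (a sketch assuming the engine path; TensorRT 7.x binding API) is to dump what the engine expects:

# Sketch: print the input/output bindings the engine was serialized with
# ('yolov5s.engine' path is an assumption).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open('yolov5s.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(i) else 'output'
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i), engine.get_binding_dtype(i))

If the input binding reports, say, (1, 3, 640, 640) while the code below feeds (1, 3, 448, 448) tensors, that mismatch alone could explain the crash.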
def images_detection(self, imgs, orig_dim_list=[]):
    """
    Feed the img data into object detection network and 
    collect bbox w.r.t original image size
    Input: imgs(torch.FloatTensor,(b,3,h,w)): pre-processed mini-batch image input
           orig_dim_list(torch.FloatTensor, (b,(w,h,w,h))): original mini-batch image size
    Output: dets(torch.cuda.FloatTensor,(n,(batch_idx,x1,y1,x2,y2,c,s,idx of cls))): human detection results
    """
    if not self.model:
        self.load_model()
    dets = []
    # [b, c, h, w]
    print(imgs.shape)
    # self.model.warmup(imgsz=(1 if pt else bs, 3, *imgsz), half=half)  # warmup
    dt, seen = [0.0, 0.0, 0.0], 0
    for idx, im in enumerate(imgs):
        dets.append([])
        #t1 = time_sync()
        #im = torch.from_numpy(im).to(device)
        print(im.shape)  # should be torch.Size([3, 448, 448])
        im = im.half() if self.half else im.float()  # uint8 to fp16/32
        im /= 255  # 0 - 255 to 0.0 - 1.0
        if len(im.shape) == 3:
            im = im[None]  # expand for batch dim
        #t2 = time_sync()
        #dt[0] += t2 - t1
        # Inference
        visualize = False  # save_dir and path are not defined in this method; feature-map visualization disabled
        pred = self.model(im, augment=False, visualize=visualize)
        #t3 = time_sync()
        #dt[1] += t3 - t2
        # NMS
        agnostic_nms = False
        classes = 0  # keep only COCO class 0 (person)
        pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, classes, agnostic_nms, max_det=500)
        #dt[2] += time_sync() - t3
        for det in pred:  # one (n, 6) tensor of (x1, y1, x2, y2, conf, cls) per image
            for *xyxy, conf, cls in det:
                if int(cls) == 0:  # person
                    dets[-1].append([idx, *(float(v) for v in xyxy), float(conf), float(conf), 0])
    
    return dets
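For what it's worth, the warmup call commented out in images_detection above matches the v6.1 DetectMultiBackend API; a sketch of running it once right after load_model (the 448 input size is carried over from the shapes printed in this report, and self.half is set in load_model):

# Sketch: one-time warmup before the frame loop, mirroring the commented line above.
self.model.warmup(imgsz=(1, 3, 448, 448), half=self.half)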
Additional
No response
Are you willing to submit a PR?
- Yes I'd like to help by submitting a PR!