Search before asking
- I have searched the YOLOv5 issues and found no similar feature requests.
Description
Hi YOLOv5 Community,
I want to share our gains on TensorRT with a new pipeline for deploying YOLOv5: we embed the whole post-processing step (namely the EfficientNMS_TRT plugin) into the graph with onnx-graphsurgeon. The ablation results are below. The first run is without EfficientNMS_TRT, and the second is with EfficientNMS_TRT embedded. As you can see, the inference time is even reduced; we guess this is because far less data has to be copied back to the host once EfficientNMS_TRT has filtered the detections on the device. (The mean D2H latency drops from 0.868048 ms to 0.0102295 ms, running on an NVIDIA GeForce GTX 1080 Ti with TensorRT 8.2, yolov5n6, and images scaled to 512x640.) Check https://zhiqwang.com/yolov5-rt-stack/notebooks/onnx-graphsurgeon-inference-tensorrt.html for more details.
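For reference, here is a minimal onnx-graphsurgeon sketch of the embedding step (not the exact code from the notebook above). It assumes the exported ONNX graph already exposes two outputs, boxes ([batch, num_boxes, 4], center format) and scores ([batch, num_boxes, num_classes]); the file names, thresholds, and box_coding value are placeholders to adapt to your own export.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

BATCH, TOPK = 1, 100  # engine batch size and max detections kept by NMS

graph = gs.import_onnx(onnx.load("yolov5n6-no-nms.onnx"))
boxes, scores = graph.outputs  # assumed output order: boxes, then class scores

# Output tensors defined by the EfficientNMS_TRT plugin spec.
num_dets    = gs.Variable("num_dets",    dtype=np.int32,   shape=[BATCH, 1])
det_boxes   = gs.Variable("det_boxes",   dtype=np.float32, shape=[BATCH, TOPK, 4])
det_scores  = gs.Variable("det_scores",  dtype=np.float32, shape=[BATCH, TOPK])
det_classes = gs.Variable("det_classes", dtype=np.int32,   shape=[BATCH, TOPK])

# Append the plugin node; TensorRT resolves it by op name at build time.
graph.nodes.append(gs.Node(
    op="EfficientNMS_TRT",
    name="efficient_nms",
    attrs={
        "plugin_version": "1",
        "background_class": -1,   # no background class to skip
        "max_output_boxes": TOPK,
        "score_threshold": 0.25,  # placeholder confidence threshold
        "iou_threshold": 0.45,    # placeholder IoU threshold
        "score_activation": 0,    # scores are already sigmoid-activated
        "box_coding": 1,          # 1 = (cx, cy, w, h), matching the YOLOv5 head
    },
    inputs=[boxes, scores],
    outputs=[num_dets, det_boxes, det_scores, det_classes],
))

# The NMS outputs replace the raw head outputs.
graph.outputs = [num_dets, det_boxes, det_scores, det_classes]
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov5n6-efficient-nms.onnx")
The resulting ONNX can then be built into an engine with trtexec exactly as in the logs below.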
Use case
Deploy YOLOv5 on TensorRT quickly: with the NMS embedded in the engine, you do not need to write any C++ code for post-processing, as the sketch below shows.
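A minimal Python sketch of such a deployment (not the exact pipeline from the notebook linked above): it deserializes an engine built from the NMS-embedded ONNX and reads the filtered detections directly. The engine path and the binding names (images, num_dets, det_boxes, det_scores, det_classes) are assumptions to adapt to your export.
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")  # registers EfficientNMS_TRT

# Deserialize an engine built from the NMS-embedded ONNX (assumed path).
with open("yolov5n6-efficient-nms.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate a host/device buffer pair for every binding.
host, dev, bindings = {}, {}, []
for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host[name] = np.empty(tuple(engine.get_binding_shape(i)), dtype=dtype)
    dev[name] = cuda.mem_alloc(host[name].nbytes)
    bindings.append(int(dev[name]))

# Assumed input binding "images": replace the zeros with a preprocessed frame.
host["images"][...] = 0
cuda.memcpy_htod(dev["images"], host["images"])
context.execute_v2(bindings)

# Only the small, already-filtered NMS outputs cross back to the host,
# which is where the D2H latency saving reported above comes from.
for name in ("num_dets", "det_boxes", "det_scores", "det_classes"):
    cuda.memcpy_dtoh(host[name], dev[name])

n = int(host["num_dets"][0, 0])
print(host["det_boxes"][0, :n], host["det_scores"][0, :n], host["det_classes"][0, :n])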
Known cons
We have to update TensorRT to 8.2 to call the EfficientNMS_TRT plugin. And there seems to be a bug in the float16 mode of this plugin: NVIDIA/TensorRT#1758 (comment).
Related Issue
Additional
onnx-graphsurgeon is easy to install; you can just use NVIDIA's prebuilt wheels:
python3 -m pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com
The detailed results:
[I] === Performance summary w/o EfficientNMS_TRT plugin ===
[I] Throughput: 383.298 qps
[I] Latency: min = 3.66479 ms, max = 5.41199 ms, mean = 4.00543 ms, median = 3.99316 ms, percentile(99%) = 4.23831 ms
[I] End-to-End Host Latency: min = 3.76599 ms, max = 6.45874 ms, mean = 5.08597 ms, median = 5.07544 ms, percentile(99%) = 5.50839 ms
[I] Enqueue Time: min = 0.743408 ms, max = 5.27966 ms, mean = 0.940805 ms, median = 0.924805 ms, percentile(99%) = 1.37329 ms
[I] H2D Latency: min = 0.502045 ms, max = 0.62674 ms, mean = 0.538255 ms, median = 0.537354 ms, percentile(99%) = 0.582153 ms
[I] GPU Compute Time: min = 2.23233 ms, max = 3.92395 ms, mean = 2.59913 ms, median = 2.58661 ms, percentile(99%) = 2.8201 ms
[I] D2H Latency: min = 0.851807 ms, max = 0.900421 ms, mean = 0.868048 ms, median = 0.867676 ms, percentile(99%) = 0.889191 ms
[I] Total Host Walltime: 3.0081 s
[I] Total GPU Compute Time: 2.99679 s
[I] Explanations of the performance metrics are printed in the verbose logs.
[I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=yolov5n6-no-nms.onnx --workspace=8096
[I] === Performance summary w/ EfficientNMS_TRT plugin ===
[I] Throughput: 389.234 qps
[I] Latency: min = 2.81482 ms, max = 9.77234 ms, mean = 3.1062 ms, median = 3.07642 ms, percentile(99%) = 3.33548 ms
[I] End-to-End Host Latency: min = 2.82202 ms, max = 11.6749 ms, mean = 4.939 ms, median = 4.95587 ms, percentile(99%) = 5.45207 ms
[I] Enqueue Time: min = 0.999878 ms, max = 11.3833 ms, mean = 1.28942 ms, median = 1.18579 ms, percentile(99%) = 4.53088 ms
[I] H2D Latency: min = 0.488159 ms, max = 0.633881 ms, mean = 0.546754 ms, median = 0.546631 ms, percentile(99%) = 0.570557 ms
[I] GPU Compute Time: min = 2.30298 ms, max = 9.21094 ms, mean = 2.54921 ms, median = 2.51904 ms, percentile(99%) = 2.78528 ms
[I] D2H Latency: min = 0.00610352 ms, max = 0.302734 ms, mean = 0.0102295 ms, median = 0.00976562 ms, percentile(99%) = 0.0151367 ms
[I] Total Host Walltime: 3.00591 s
[I] Total GPU Compute Time: 2.98258 s
[I] Explanations of the performance metrics are printed in the verbose logs.
[I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=yolov5n6-efficient-nms.onnx --workspace=8096
Are you willing to submit a PR?
- Yes I'd like to help by submitting a PR!