Question about the cuda graph acceleration #558

Jianghanxiao · 2025-03-01T07:50:32Z

Hi, I notice a slowdown when I launch a captured graph multiple times. The first launch is very fast, but later launches become slow. After some time it becomes fast again, and then slow again... Is there a potential reason for this? I guess it may be related to the CUDA memory overwriting strategy? Is there a way in Warp to manually release a previously launched graph?

AnkaChan · 2025-03-01T23:10:49Z

Maintainer

Hi Jianghanxiao, thank you for the question.

First, can we see the code that you use for profiling? Accurate profiling of GPU programs is quite non-trivial, and what you are experiencing sounds like an issue with the profiling rather than with the graph launch itself.

For example, when you launch CUDA work, it is not necessarily executed immediately: the launch call returns to the host right away, and sometimes the scheduler waits until enough kernel launches have queued up or a wait time elapses.

Sometimes the profiler itself adds a non-negligible overhead to the CUDA launch. Sometimes you need to run a few warm-up iterations to bring the GPU's clock up. It is very hard to analyze what is happening unless we see your code.
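To illustrate the point about launches not executing immediately, here is a minimal plain-Python analogy. It does not involve Warp or a GPU: a worker thread stands in for the device, and `fake_kernel` and `measure` are hypothetical names invented for this sketch. It only shows why a host-side timer that stops right after the launch call can report a much shorter time than one that waits for the work to finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s):
    # Stand-in for GPU work: it just sleeps on a worker thread.
    time.sleep(duration_s)
    return duration_s


def measure(duration_s=0.05):
    """Return (launch_only, launch_and_wait) host-side timings in seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        future = pool.submit(fake_kernel, duration_s)  # returns without waiting
        launch_only = time.perf_counter() - t0

        future.result()  # plays the role of an explicit synchronization
        launch_and_wait = time.perf_counter() - t0
    return launch_only, launch_and_wait


if __name__ == "__main__":
    launch_only, launch_and_wait = measure()
    print(f"timer stopped after launch: {launch_only * 1e3:.3f} ms")
    print(f"timer stopped after wait:   {launch_and_wait * 1e3:.3f} ms")
```

The two numbers differ because the first timer only covers the cost of submitting the work. On a real GPU the gap can also fluctuate from run to run as the launch queue fills and drains, which may look like alternating fast and slow launches.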

3 replies

Jianghanxiao Mar 4, 2025
Author

Thanks for the reply! I use wp.ScopedTimer to measure the time, but I'm not sure it captures the real time. If I explicitly call the CUDA synchronization, the measured time is different, so I suspect the timer alone doesn't capture the real time either.

So what I actually want to understand is whether there is some known overhead when I launch my captured forward CUDA graph many times. Is there something that would help, such as manually cleaning up the graph memory? Or is there a tool I can use to analyze the current performance bottleneck?

Sorry for the unclear questions, I'm quite confused right now. Thanks a lot for all the help!

shi-eric Mar 4, 2025
Maintainer

Sounds like you’re running into the same issue as #277 with the wp.ScopedTimer not measuring the same time you were expecting it to measure. See the profiling docs for how to measure GPU performance more accurately, e.g. using CUDA events.
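As a rough sketch of the CUDA-event approach mentioned above, the snippet below times repeated launches of a captured graph. It is an illustration under assumptions, not code from this thread: `time_graph_launches` is a hypothetical helper, the `wp.copy` call is a stand-in for the real forward pass, and the Warp calls (`wp.ScopedCapture`, `wp.capture_launch`, `wp.Event`, `wp.record_event`, `wp.get_event_elapsed_time`) should be checked against the profiling docs for your installed version. The import guard is only there so the script returns `None` instead of failing on a machine without Warp or a CUDA device.

```python
try:
    import warp as wp  # NVIDIA Warp, assumed to be installed
except ImportError:
    wp = None


def time_graph_launches(n_launches=100, n=1 << 20, n_warmup=10):
    """Return mean GPU milliseconds per graph launch, or None without Warp/CUDA."""
    if wp is None or not wp.is_cuda_available():
        return None

    with wp.ScopedDevice("cuda:0"):
        src = wp.zeros(n, dtype=float)
        dst = wp.zeros(n, dtype=float)

        # Capture the workload once; the copy stands in for a real forward pass.
        with wp.ScopedCapture() as capture:
            wp.copy(dst, src)
        graph = capture.graph

        # Warm-up launches so clock ramp-up is not included in the measurement.
        for _ in range(n_warmup):
            wp.capture_launch(graph)
        wp.synchronize_device()

        # Events are recorded on the GPU timeline, so the elapsed time covers
        # execution of the launches rather than just the host-side launch calls.
        start = wp.Event(enable_timing=True)
        end = wp.Event(enable_timing=True)

        wp.record_event(start)
        for _ in range(n_launches):
            wp.capture_launch(graph)
        wp.record_event(end)

        # Waits for the end event by default before reading the timing.
        elapsed_ms = wp.get_event_elapsed_time(start, end)

    return elapsed_ms / n_launches


if __name__ == "__main__":
    print("mean ms per launch:", time_graph_launches())
```

If you prefer to keep using wp.ScopedTimer, passing `synchronize=True` should make it wait for outstanding GPU work before starting and stopping the timer, at the cost of adding synchronization to the measured code path.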

Answer selected by Jianghanxiao

Jianghanxiao Mar 4, 2025
Author

Thanks so much!!
