GPU memory usage and parallel inferences #25369
Unanswered
x3lif asked this question in Performance Q&A
Replies: 0 comments
Hello everyone,
I have a small project running on Windows with onnxruntime-gpu 1.20.1 in Python.
In this project, I create one InferenceSession per model I want to use.
A server is then responsible for running the inferences using these sessions.
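For context, the sessions are created roughly like this (model names and paths are placeholders):

```python
import onnxruntime as ort

# One InferenceSession per model, created once at startup.
# Model names and paths below are placeholders for illustration.
MODEL_PATHS = {
    "model_a": "models/model_a.onnx",
    "model_b": "models/model_b.onnx",
}

sessions = {
    name: ort.InferenceSession(
        path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    for name, path in MODEL_PATHS.items()
}
```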
When inferences execute in parallel, I see GPU memory usage grow until everything is consumed, which negatively impacts the performance of the application (I suspect data is being swapped between host and device).
Is there any documentation I'm missing on how to configure a maximum number of parallel requests per InferenceSession?
What is the recommended way to limit the number of parallel calls to an InferenceSession?
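As a workaround I am considering gating each session with a semaphore, roughly as sketched below (the limit of 2 concurrent calls is arbitrary), but I would rather use a built-in mechanism if one exists:

```python
import threading

# One semaphore per session; the limit of 2 concurrent run() calls is arbitrary.
run_limits = {name: threading.Semaphore(2) for name in sessions}

def run_inference(name, inputs):
    # Block until a slot is free for this session, then run the inference.
    with run_limits[name]:
        return sessions[name].run(None, inputs)
```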
Thank you