GPU memory usage and parallel inferences #25369
Unanswered
x3lif asked this question in Performance Q&A
Replies: 0 comments
Hello everyone,
I have a small project running on Windows with onnxruntime-gpu 1.20.1 in Python.
In this project, I create one InferenceSession per model I want to use.
A server is then responsible for running the inferences using these sessions.
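For context, the sessions are created roughly like this (model names and paths are placeholders):

```python
import onnxruntime as ort

# One InferenceSession per model, created once at startup.
# Model names and paths below are placeholders for illustration.
MODEL_PATHS = {
    "model_a": "models/model_a.onnx",
    "model_b": "models/model_b.onnx",
}

sessions = {
    name: ort.InferenceSession(
        path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    for name, path in MODEL_PATHS.items()
}
```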
When inferences execute in parallel, I see GPU memory usage grow until everything is consumed, which negatively impacts the performance of the application (I suspect data is being swapped between host and device).
Is there any documentation I'm missing on how to configure a maximum number of parallel requests per InferenceSession?
What is the recommended way to limit the number of parallel calls to an InferenceSession?
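As a workaround I am considering gating each session with a semaphore, roughly as sketched below (the limit of 2 concurrent calls is arbitrary), but I would rather use a built-in mechanism if one exists:

```python
import threading

# One semaphore per session; the limit of 2 concurrent run() calls is arbitrary.
run_limits = {name: threading.Semaphore(2) for name in sessions}

def run_inference(name, inputs):
    # Block until a slot is free for this session, then run the inference.
    with run_limits[name]:
        return sessions[name].run(None, inputs)
```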
Thank you