Don't warmup by default #1477
Conversation
Reviewer's Guide: This PR disables the default llama-server warm-up by appending the --no-warmup flag to the llama-server invocation.
From touvan in Matrix: I'm running one instance on Bazzite, on all-AMD hardware (5700X, 32GB RAM, RX 6800 16GB). It doesn't have ROCm, but ROCm seems to work in the ramalama container. It does seem to start up and load into the GPU, but then it stalls forever on this message: common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Hey @ericcurtin - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Is this a reversal? I thought we wanted to warm up the server so that when the user first types a query, it responds quicker.
It was breaking a user's machine. I'm not sure we want warmup on by default. Is the performance difference even noticeable to the human eye? It's also slowing down our startup, and I think llama-server startup performance is even more important for RamaLama than warming up. But it's ok if I'm told otherwise 😄 After a few words are printed we will likely be up to full speed anyway. For me, if users want warmed-up, speedy inferencing, they should leave llama-server processes pre-forked and lying around ready for execution, so that they are always warm. I see two use cases moving forward:
I should have posted the full Matrix message above. --no-warmup didn't just accelerate things, it fixed a bug for that user: his GPU started working. A micro-optimization like this that breaks some GPUs doesn't seem worth it to me. And it's not just an optimization, it's a tradeoff: we lose performance in terms of startup time.
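As a rough sketch of the "warm up ourselves with a real query" idea discussed here: after llama-server comes up, the client could send one tiny completion request so the first user prompt is served from warm caches. The /health and /completion endpoints and their parameters follow llama.cpp's HTTP server API as I understand it; the host, port, timeout, and helper name are illustrative assumptions, not actual RamaLama code.

```python
import json
import time
import urllib.request


def warmup_with_real_query(base_url="http://127.0.0.1:8080", timeout=60):
    """Send one tiny real request so the first user prompt is served warm."""
    # Wait until llama-server reports it is ready (model loaded).
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    break
        except OSError:
            time.sleep(0.5)

    # One-token completion: enough to populate caches without a long run.
    payload = json.dumps({"prompt": "Hi", "n_predict": 1}).encode()
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200
```

This keeps the warm-up off the server's critical startup path: the process becomes usable as soon as the model is loaded, and the warm-up happens (or is skipped) at the caller's discretion.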
llama-server by default warms up the model with an empty run for performance reasons. We can warm up ourselves with a real query. Warming up was causing issues and delaying start time. Signed-off-by: Eric Curtin <[email protected]>
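The change itself is essentially one extra argument on the serve command line. A minimal sketch of that shape, assuming a hypothetical helper that builds the llama-server argument list (function and variable names are illustrative, not the actual RamaLama code):

```python
def build_llama_server_args(model_path, port=8080, warmup=False):
    """Build the llama-server command line; warmup is now off by default."""
    args = [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
    ]
    if not warmup:
        # Skip llama-server's default empty warm-up run: it stalled on at
        # least one GPU setup and adds to startup time.
        args.append("--no-warmup")
    return args
```

Calling build_llama_server_args("/models/foo.gguf") would then yield a command ending in --no-warmup; a real-query warm-up, as sketched above, can follow once the server reports healthy.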
Summary by Sourcery
Disable model warmup by default in llama-server invocation
Bug Fixes:
- Avoid the warm-up stall reported on some GPU setups by skipping llama-server's default warm-up run.
Enhancements:
- Append the --no-warmup flag to disable automatic warmup and allow real-query warmup.