Don't warmup by default #1477

Merged
merged 1 commit into main on Jun 5, 2025

Conversation

@ericcurtin (Member) commented Jun 5, 2025

llama-server by default warms up the model with an empty run for performance reasons. We can warm up ourselves with a real query instead. The default warmup was causing issues and delaying start time.
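
"Warming up ourselves with a real query" amounts to sending one small request once llama-server is listening. The sketch below shows the idea in Python, assuming the default OpenAI-compatible /v1/chat/completions endpoint on localhost:8080; the host, port, and prompt are illustrative and not taken from this PR or from RamaLama's code.

```python
# Hypothetical warmup-by-real-query sketch, not part of this PR: send one tiny
# chat completion so the model is exercised by a real request instead of by
# llama-server's empty-run warmup.
import json
import urllib.request


def warmup(base_url: str = "http://127.0.0.1:8080") -> None:
    payload = {
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 1,  # one generated token is enough to warm the model
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()  # discard the body; the request itself does the warming


if __name__ == "__main__":
    warmup()
```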

Summary by Sourcery

Disable model warmup by default in llama-server invocation

Bug Fixes:

  • Prevent default empty-run warmup that caused startup delays and issues

Enhancements:

  • Add --no-warmup flag to disable automatic warmup and allow real-query warmup

sourcery-ai bot (Contributor) commented Jun 5, 2025

Reviewer's Guide

This PR disables the default llama-server warm-up by appending a --no-warmup flag to the server execution command, preventing the automatic empty run that was causing startup delays.

File-Level Changes

Change: Disable automatic model warm-up in server launch
Details: Added the --no-warmup flag to exec_args generation for the server command (see the sketch below)
Files: ramalama/model.py
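
The change itself is small: the --no-warmup flag is appended to the generated llama-server command. The helper below is a hypothetical sketch of that shape, not the actual exec_args code in ramalama/model.py.

```python
# Illustrative sketch only; the real argument assembly in ramalama/model.py differs.
def build_server_args(model_path: str, port: int = 8080) -> list[str]:
    exec_args = [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
    ]
    # Skip llama-server's default empty-run warmup: it delayed startup and
    # stalled on some GPUs; the first real query warms the model instead.
    exec_args.append("--no-warmup")
    return exec_args
```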

@ericcurtin (Member, Author) commented:

From touvan in Matrix:

touvan:
I'll leave a couple of questions, and I'll answer them myself if I figure it out before someone else does:

  • I'm running one instance on Bazzite, on all-AMD hardware (5700X, 32 GB RAM, RX 6800 16 GB). The host doesn't have ROCm, but ROCm seems to work in the ramalama container. It does seem to start up and load onto the GPU, but then it stalls forever on this message: common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
  • I'm also trying to run it on macOS on an M1 Max with 64 GB RAM. This was pretty easy to get running, but it doesn't use the GPU, so it's quite slow (less than half as fast as the same model running on my 5700X). I had to tell it to use Docker on macOS, but that's the only thing I changed.
  • Also, is there a way to access other ports, or some way to find that information? I'd like to test running this for codegen with tools like cline. I know it'll be slow - just curious.

sourcery-ai bot (Contributor) left a comment

Hey @ericcurtin - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@rhatdan (Member) commented Jun 5, 2025

Is this a reversal? I thought we wanted to warm up the server so that when the user first types a query, it responds quicker?

@ericcurtin (Member, Author) commented Jun 5, 2025

Is this a reversal? I thought we wanted to warm up the server so that when the user first types a query, it responds quicker?

It was breaking a user's machine. I'm not sure we want warmup on by default. Is the performance difference even noticeable to the human eye? It's also slowing down our startup. I think llama-server startup performance matters even more than warmup for RamaLama, but it's OK if I'm told otherwise 😄 After a few words are printed we will likely be up to full speed anyway.

For me, if users want warmed-up, speedy inferencing, they should leave llama-server processes pre-forked and lying around ready for execution, so that they are always warm.

I see two use cases moving forward:

  1. Model-swapping server: for this case you want to favour startup time over warmup, since loading a model can already take seconds and we don't want to delay it any further.
  2. Pre-forked server: processes should be warm anyway, as they stay around for a long time.

@ericcurtin (Member, Author) commented:

I should have posted the full Matrix message above. --no-warmup didn't just speed things up, it fixed a bug for that user: his GPU started working. A micro-optimization like this that breaks some GPUs doesn't seem worth it to me. And it's not just an optimization, it's a trade-off; we lose performance in terms of startup time.

Commit message:

llama-server by default warms up the model with an empty run for
performance reasons. We can warm up ourselves with a real query.
Warming up was causing issues and delaying start time.

Signed-off-by: Eric Curtin <[email protected]>
@rhatdan merged commit fc9b33e into main on Jun 5, 2025
12 of 15 checks passed