Don't warmup by default #1477

Merged
merged 1 commit into main on Jun 5, 2025

Conversation

@ericcurtin (Member) commented Jun 5, 2025

llama-server by default warms up the model with an empty run for performance reasons. We can warm up ourselves with a real query instead. The default warmup was causing issues and delaying start time.
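
"Warming up ourselves with a real query" amounts to sending one small request once llama-server is listening. The sketch below shows the idea in Python, assuming the default OpenAI-compatible /v1/chat/completions endpoint on localhost:8080; the host, port, and prompt are illustrative and not taken from this PR or from RamaLama's code.

```python
# Hypothetical warmup-by-real-query sketch, not part of this PR: send one tiny
# chat completion so the model is exercised by a real request instead of by
# llama-server's empty-run warmup.
import json
import urllib.request


def warmup(base_url: str = "http://127.0.0.1:8080") -> None:
    payload = {
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 1,  # one generated token is enough to warm the model
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()  # discard the body; the request itself does the warming


if __name__ == "__main__":
    warmup()
```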

Summary by Sourcery

Disable model warmup by default in llama-server invocation

Bug Fixes:

  • Prevent default empty-run warmup that caused startup delays and issues

Enhancements:

  • Add --no-warmup flag to disable automatic warmup and allow real-query warmup

sourcery-ai bot (Contributor) commented Jun 5, 2025

Reviewer's Guide

This PR disables the default llama-server warm-up by appending a --no-warmup flag to the server execution command, preventing the automatic empty run that was causing startup delays.

File-Level Changes

Change: Disable automatic model warm-up in server launch
Details: Added the --no-warmup flag to exec_args generation for the server command (see the sketch below)
Files: ramalama/model.py
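
The change itself is small: the --no-warmup flag is appended to the generated llama-server command. The helper below is a hypothetical sketch of that shape, not the actual exec_args code in ramalama/model.py.

```python
# Illustrative sketch only; the real argument assembly in ramalama/model.py differs.
def build_server_args(model_path: str, port: int = 8080) -> list[str]:
    exec_args = [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
    ]
    # Skip llama-server's default empty-run warmup: it delayed startup and
    # stalled on some GPUs; the first real query warms the model instead.
    exec_args.append("--no-warmup")
    return exec_args
```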

@ericcurtin (Member, Author) commented:

From touvan in Matrix:

touvan:
I'll leave a couple of questions, and I'll answer them myself if I figure it out before someone else does:

  • I'm running one instance on Bazzite, on all-AMD hardware (5700X, 32 GB RAM, RX 6800 16 GB). The host doesn't have ROCm, but ROCm seems to work in the ramalama container. It does seem to start up and load onto the GPU, but then it stalls forever on this message: common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
  • I'm also trying to run it on macOS on an M1 Max with 64 GB RAM. This was pretty easy to get running, but it doesn't use the GPU, so it's quite slow (less than half as fast as the same model running on my 5700X). I had to tell it to use Docker on macOS, but that's the only thing I changed.
  • Also, is there a way to access other ports, or some way to find that information? I'd like to test running this for codegen with tools like cline. I know it'll be slow - just curious.

sourcery-ai bot (Contributor) left a comment

Hey @ericcurtin - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@rhatdan (Member) commented Jun 5, 2025

Is this a reversal? I thought we wanted to warm up the server so that when the user first types a query, it responds quicker?

@ericcurtin (Member, Author) commented Jun 5, 2025

Is this a reversal? I thought we wanted to warm up the server so that when the user first types a query, it responds quicker?

It was breaking a user's machine. I'm not sure we want warmup on by default. Is the performance difference even noticeable to the human eye? It's also slowing down our startup. I think llama-server startup performance matters even more than warmup for RamaLama, but it's OK if I'm told otherwise 😄 After a few words are printed we will likely be up to full speed anyway.

For me, if users want warmed-up, speedy inferencing, they should leave llama-server processes pre-forked and lying around ready for execution, so that they are always warm.

I see two use cases moving forward:

  1. Model-swapping server: for this case you want to favour startup time over warmup, since loading a model can already take seconds and we don't want to delay it any further.
  2. Pre-forked server: processes should be warm anyway, as they stay around for a long time.

@ericcurtin (Member, Author) commented:

I should have posted the full Matrix message above. --no-warmup didn't just speed things up, it fixed a bug for that user: his GPU started working. A micro-optimization like this that breaks some GPUs doesn't seem worth it to me. And it's not just an optimization, it's a trade-off; we lose performance in terms of startup time.

Commit message:

llama-server by default warms up the model with an empty run for
performance reasons. We can warm up ourselves with a real query.
Warming up was causing issues and delaying start time.

Signed-off-by: Eric Curtin <[email protected]>
@rhatdan merged commit fc9b33e into main on Jun 5, 2025
12 of 15 checks passed