
Conversation

@ngxson
Owner

@ngxson ngxson commented Dec 20, 2025

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

  • New Features

    • Added a --sleep-idle-seconds CLI option to enable automatic idle sleep (use -1 to disable). While sleeping, the server unloads model memory and reloads it on demand; generation requests wake the server. /health remains available, and /props reports is_sleeping.
  • Documentation

    • Added "Sleeping on Idle" section describing behavior and exempt endpoints.
  • Tests

    • Added unit test validating sleep/wake behavior.


@coderabbitai

coderabbitai bot commented Dec 20, 2025

Walkthrough

Adds an idle-sleep feature: a new CLI option and config field for the sleep timeout; queue idle detection with sleep/wake transitions that unload and reload model memory; public cached JSON props/model_meta exposing is_sleeping; removal of an explicit init() call during startup; routing/generator adjustments; and test and docs updates.

Changes

Cohort / File(s): Change Summary

  • CLI & Common config (common/arg.cpp, common/common.h):
    Add the --sleep-idle-seconds CLI option (rejects 0 and values < -1) and a new int sleep_idle_seconds = -1 in common_params.
  • CLI startup flow (tools/cli/cli.cpp):
    Removed the explicit ctx_cli.ctx_server.init() call after model load; startup proceeds without that init step.
  • Server queue & idle detection (tools/server/server-queue.h, tools/server/server-queue.cpp):
    Add sleeping state (sleeping, req_stop_sleeping), time_last_task, is_sleeping(), wait_until_no_sleep(), and an on_sleeping_state() callback; change start_loop() → start_loop(int64_t idle_sleep_ms) and implement idle → sleep → wake transitions; update task timestamping; remove public on_new_task/on_update_slots. (See the sketch after this table.)
  • Server context & HTTP responses (tools/server/server-context.h, tools/server/server-context.cpp):
    Add json_server_props, json_server_model_meta, a sleeping flag, destroy(), handle_sleeping_state(bool), and populate_json_responses(); adjust load_model() for resume semantics and conditional wiring; change the server_res_generator signature to accept bypass_sleep; remove the init() declaration and document that get_llama_context() may return nullptr when sleeping.
  • Server runtime / shutdown (tools/server/server.cpp):
    Remove the ctx_server.init() call from the main flow; null-check get_llama_context() before calling llama_memory_breakdown_print() on shutdown.
  • Docs (tools/server/README.md, tools/server/README-dev.md):
    Add a "Sleeping on Idle" section describing the behavior and --sleep-idle-seconds; add a dev note referencing the sleeping-mode PR.
  • Tests & test utils (tools/server/tests/unit/test_sleep.py, tools/server/tests/utils.py):
    Add test_server_sleep() and an autouse fixture to exercise sleep/wake; add a sleep_idle_seconds attribute to ServerProcess and include the CLI flag when launching test servers.
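
To make the queue-side row concrete, below is a minimal hypothetical sketch of the idle → sleep → wake loop, reusing the member names listed above (sleeping, req_stop_sleeping, time_last_task); the locking layout and signatures are assumptions for illustration, not the PR's actual code:

// Hypothetical sketch of start_loop(idle_sleep_ms); names follow the
// walkthrough above, everything else is an assumption.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>

struct server_queue_sketch {
    std::mutex                mutex_tasks;
    std::condition_variable   cv;
    bool                      sleeping          = false;
    bool                      req_stop_sleeping = false;
    bool                      has_tasks         = false; // stand-in for the real task queue
    int64_t                   time_last_task    = 0;     // ms, refreshed on every post()/defer()
    std::function<void(bool)> callback_sleeping_state;   // registered via on_sleeping_state()

    static int64_t now_ms() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
    }

    void start_loop(int64_t idle_sleep_ms) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        time_last_task = now_ms();
        while (true) {
            // sleeping is enabled only when idle_sleep_ms >= 0
            const bool should_sleep = idle_sleep_ms >= 0 && !has_tasks &&
                                      now_ms() - time_last_task >= idle_sleep_ms;
            if (should_sleep && callback_sleeping_state) {
                sleeping          = true;
                req_stop_sleeping = false;
                callback_sleeping_state(true);                // unload model memory
                cv.wait(lock, [&] { return req_stop_sleeping; });
                callback_sleeping_state(false);               // reload the model
                sleeping       = false;
                time_last_task = now_ms();
                cv.notify_all();                              // release wait_until_no_sleep() callers
            }
            // ... pop and process the next task, refreshing time_last_task ...
            break; // sketch only; the real loop runs until terminate()
        }
    }
};

A request that must wake the server would call a counterpart such as wait_until_no_sleep() (sketched later in this thread), which sets req_stop_sleeping under the same mutex and notifies the loop.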

Sequence Diagram(s)

sequenceDiagram
    participant HTTP as HTTP Server (routes)
    participant Queue as Server Queue
    participant Context as Server Context
    participant Model as Model/Memory

    Note over Queue,Context: start_loop(idle_sleep_ms) monitors activity

    HTTP->>Queue: POST /completion (task)
    activate Queue
    Queue->>Context: ensure model loaded / enqueue task
    Context->>Model: load if needed
    Queue->>Queue: update time_last_task
    deactivate Queue

    Note over Queue: No tasks for idle_sleep_ms → enter sleeping
    Queue->>Queue: set sleeping = true
    Queue->>Context: handle_sleeping_state(true)
    Context->>Model: destroy() / unload model & free memory

    HTTP->>HTTP: GET /props (allowed while sleeping) — returns is_sleeping = true

    HTTP->>Queue: POST /completion (wake request)
    Queue->>Queue: wait_until_no_sleep() sets req_stop_sleeping
    Queue->>Context: handle_sleeping_state(false)
    Context->>Model: load_model(resume=true)
    Queue->>Queue: set sleeping = false
    Queue->>Context: process pending task

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review focus:
    • Race conditions and synchronization around time_last_task, sleeping, and req_stop_sleeping in server-queue.*
    • Thread-safety of json_server_props / json_server_model_meta updates and server_res_generator bypass handling
    • Correctness and resource safety in destroy() / unload-reload paths and effects of removing init()
    • Timing sensitivity of the tests in tools/server/tests/unit/test_sleep.py

Poem

🐰 I hush the kernels, tuck them tight,

Counting quiet seconds in the night,
When pings return I wake, I sweep,
Reload my caches from their sleep,
Hoppity servers, ready and bright 🥕

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check ⚠️ Warning: the PR description contains only the template boilerplate and lacks any substantive information about the changes, objectives, motivation, or technical details. Resolution: provide a detailed description of the feature, including what sleep mode does, how it works, its configuration options, and why the change is needed.
  • Docstring coverage ⚠️ Warning: docstring coverage is 25.71%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
  • Title check ❓ Inconclusive: the title 'Xsn/server sleep' is vague and appears to reference a branch name rather than summarize the feature. Resolution: rename the title to something more descriptive, e.g., 'Add server idle sleep mode' or 'Implement automatic server sleeping on inactivity'.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tools/server/server-queue.cpp (1)

176-180: Double reset of req_stop_sleeping is redundant.

req_stop_sleeping is set to false on line 167 before waiting, and again on line 176 after waking. The second reset is unnecessary since the flag was already false when entering the wait, and the only place it's set to true is in wait_until_no_sleep() which then waits for sleeping to become false.

🔎 Proposed simplification
                 QUE_INF("%s", "exiting sleeping state\n");
-                req_stop_sleeping = false;
                 callback_sleeping_state(false);
                 sleeping = false;
                 time_last_task = ggml_time_ms();
tools/server/server-context.cpp (2)

584-597: Consider graceful error handling instead of exit(1) on model reload failure.

Calling exit(1) when model reload fails after waking from sleep is abrupt. While it's difficult to recover from this state, the server could potentially:

  1. Log the error and stay in an error state
  2. Return error responses until manually restarted
  3. Allow retry attempts

The current approach is pragmatic given the complexity of recovery, but consider at least using a named constant or a more descriptive log message indicating this is an unrecoverable error.

🔎 Proposed improvement
         } else {
             SRV_INF("%s", "server is exiting sleeping state\n");
             if (!load_model(params_base)) {
-                SRV_ERR("%s", "fatal: failed to reload model after sleeping\n");
-                exit(1);
+                SRV_ERR("%s", "fatal: failed to reload model after sleeping - server cannot continue\n");
+                std::abort(); // Use abort() to indicate unrecoverable error
             }
         }

3422-3424: Track TODO: get_models should bypass sleep mode.

The TODO comment indicates that get_models should be accessible without waking the server, similar to get_props. Consider implementing this if model metadata is cached (which appears to be partially done via ctx_server.model_name).

Since model_name is cached in server_context_impl and persists through sleep, this endpoint could potentially bypass sleep for the cached metadata. However, model_meta() accesses model which may be null during sleep.
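
If the metadata were cached at load time the way json_server_props is (the walkthrough lists a json_server_model_meta member), the handler could serve the cache while sleeping; a hypothetical sketch, not the PR's actual code:

    // Hypothetical sketch: serve cached model metadata while the server sleeps.
    // Assumes ctx_server.json_server_model_meta is populated once in load_model().
    this->get_models = [this](const server_http_req &) {
        auto res = std::make_unique<server_res_generator>(ctx_server, /*bypass_sleep=*/true);
        // the cached copy is safe to return even while the model pointer is null
        res->ok(ctx_server.json_server_model_meta);
        return res;
    };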

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10b4f82 and 44a5a26.

📒 Files selected for processing (11)
  • common/arg.cpp (1 hunks)
  • common/common.h (1 hunks)
  • tools/cli/cli.cpp (0 hunks)
  • tools/server/README.md (1 hunks)
  • tools/server/server-context.cpp (13 hunks)
  • tools/server/server-context.h (0 hunks)
  • tools/server/server-queue.cpp (5 hunks)
  • tools/server/server-queue.h (4 hunks)
  • tools/server/server.cpp (0 hunks)
  • tools/server/tests/unit/test_sleep.py (1 hunks)
  • tools/server/tests/utils.py (2 hunks)
💤 Files with no reviewable changes (3)
  • tools/cli/cli.cpp
  • tools/server/server-context.h
  • tools/server/server.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • tools/server/tests/utils.py
  • tools/server/tests/unit/test_sleep.py
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/tests/utils.py
  • common/arg.cpp
  • tools/server/server-context.cpp
  • tools/server/server-queue.h
  • common/common.h
  • tools/server/tests/unit/test_sleep.py
  • tools/server/server-queue.cpp
tools/server/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Run server unit tests using the provided test infrastructure: cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh

Files:

  • tools/server/tests/utils.py
  • tools/server/tests/unit/test_sleep.py
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • common/arg.cpp
  • tools/server/server-context.cpp
  • tools/server/server-queue.h
  • common/common.h
  • tools/server/server-queue.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (2)
tools/server/tests/unit/test_sleep.py (1)
tools/server/tests/utils.py (3)
  • tinyllama2 (454-465)
  • start (117-275)
  • make_request (288-319)
tools/server/server-queue.cpp (1)
tools/server/server-context.cpp (5)
  • terminate (2760-2762)
  • terminate (2760-2760)
  • start_loop (2755-2758)
  • start_loop (2755-2755)
  • res (1051-1076)
🪛 Ruff (0.14.8)
tools/server/tests/unit/test_sleep.py

3-3: from utils import * used; unable to detect undefined names (F403)

5-5: ServerPreset may be undefined, or defined from star imports (F405)

11-11: ServerPreset may be undefined, or defined from star imports (F405)

27-27: Avoid equality comparisons to True; use if res.body["is_sleeping"]: for truth checks. Replace with res.body["is_sleeping"] (E712)

39-39: Avoid equality comparisons to False; use if not res.body["is_sleeping"]: for false checks. Replace with not res.body["is_sleeping"] (E712)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: ios-xcode-build
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: WebUI Checks
  • GitHub Check: server-build (Release)
🔇 Additional comments (20)
tools/server/tests/utils.py (1)

103-103: Sleep-idle flag is correctly threaded through ServerProcess

The new sleep_idle_seconds option and its propagation into server_args are consistent with other optional numeric flags and with the CLI handler; None meaning “use server default” vs. -1/>0 being explicit values is a clean contract. No changes needed here.

Also applies to: 234-235

tools/server/README.md (1)

1624-1633: Sleeping-on-idle docs align with implementation

The description of --sleep-idle-seconds, sleep behavior, and the /health and /props exemptions matches the exposed API and test behavior; this section looks accurate and well-scoped.

common/arg.cpp (1)

2890-2899: CLI wiring and validation for --sleep-idle-seconds look correct

The new server-only option cleanly maps into params.sleep_idle_seconds, with a sensible default and validation (-1 or >0 only). This integrates well with the rest of the argument machinery.

common/common.h (1)

478-479: New sleep_idle_seconds field is consistent with server params

Adding sleep_idle_seconds with a default of -1 and the accompanying comment lines up with the CLI flag semantics and keeps server-specific configuration grouped coherently in common_params.

tools/server/server-queue.h (3)

15-18: LGTM! Sleep state tracking fields are well-organized.

The new fields sleeping, req_stop_sleeping, and time_last_task provide a clean mechanism for idle sleep management. The time_last_task initialization to 0 is fine since it gets properly set in start_loop() before any idle checks occur.


48-55: LGTM! Clean API for sleep state management.

The wait_until_no_sleep() and is_sleeping() methods provide a clear interface for external components to interact with the sleep state. The mutex lock in is_sleeping() ensures thread-safe access.


82-101: Thread-safety warning is helpful but consider defensive null-check for callback.

The comment about non-thread-safe usage is valuable. However, callback_sleeping_state could be null if on_sleeping_state() is never called before start_loop().

Please verify that on_sleeping_state() is always registered before start_loop() is called, or add a null check before invoking the callback in start_loop(). Based on the code in server-context.cpp, it appears the callback is registered in load_model() which is called before start_loop(), but this ordering dependency should be documented or enforced.

tools/server/server-queue.cpp (4)

36-37: LGTM! Consistent time tracking across all queue mutation operations.

Updating time_last_task in post(), post(vector), defer(), and pop_deferred_task() ensures accurate idle time calculation.

Also applies to: 58-59, 67-68, 83-84


87-102: LGTM! Well-implemented wake-from-sleep mechanism.

The wait_until_no_sleep() function correctly handles the synchronization:

  • Early return if not sleeping
  • Sets req_stop_sleeping and notifies the main loop
  • Waits until sleeping becomes false

The condition variable usage is correct for coordinating between the caller and the main loop.
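
For reference, the handshake described above can be pictured with this hypothetical sketch, continuing the queue sketch from the Changes section (member names follow the review; the actual code differs):

    // Hypothetical sketch of wait_until_no_sleep(); pairs with the loop above:
    // wake the main loop, then block until the model has been reloaded.
    void wait_until_no_sleep() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (!sleeping) {
            return;                                 // early return if not sleeping
        }
        req_stop_sleeping = true;                   // request wake-up
        cv.notify_all();                            // unblock the loop's cv.wait()
        cv.wait(lock, [&] { return !sleeping; });   // wait until reload finished
    }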


163-181: Callback invoked while holding mutex - verify no deadlock risk.

callback_sleeping_state(true) and callback_sleeping_state(false) are called while mutex_tasks is held (lines 166, 177). If the callback implementation attempts to call any server_queue method that acquires this mutex, it will deadlock.

Based on server-context.cpp, handle_sleeping_state() calls destroy() and load_model(), which don't appear to directly call queue methods that would re-acquire the mutex. However, this is a subtle contract.

Consider documenting this constraint or releasing the lock before invoking callbacks. The current implementation works because handle_sleeping_state() doesn't re-enter server_queue methods, but this is fragile.
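
If the lock were to be released around the callback as suggested, the transition could be restructured along these lines (a hypothetical sketch under the same assumptions as the queue sketch above; it changes the locking contract and is not what the PR does):

    // Hypothetical sketch: run the sleeping-state callback without holding
    // mutex_tasks, so a callback that re-enters queue methods cannot deadlock.
    void enter_sleep(std::unique_lock<std::mutex> & lock) {
        sleeping = true;               // publish the state while still locked
        lock.unlock();                 // drop the lock for the expensive callback
        callback_sleeping_state(true); // e.g. destroy() / model unload
        lock.lock();                   // re-acquire before touching queue state
    }

The trade-off is that other threads could then observe sleeping == true while the unload is still in progress, so readers of the flag would have to tolerate that window.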


149-153: Good defensive update of time_last_task after slot updates.

This ensures that long-running inference operations don't falsely trigger idle sleep immediately after completion.

tools/server/server-context.cpp (9)

547-550: LGTM! Cached JSON responses improve efficiency.

Caching server properties in json_server_props avoids rebuilding the same JSON on every /props request.


561-565: Destructor correctly skips cleanup when sleeping.

When the server is in sleeping state, the model has already been destroyed by handle_sleeping_state(), so skipping destroy() prevents double-free issues.


600-615: LGTM! Resume path correctly skips queue wiring.

When resuming from sleep (is_resume == true), the queue callbacks are already registered from the initial load_model() call, so skipping re-registration is correct.


749-749: Clearing slots on resume is correct but verify no dangling references.

slots.clear() before re-populating ensures fresh slot state on resume. Since the slots are owned by the server_context_impl and referenced by pointer/index in tasks, ensure no tasks are in flight when entering sleep state.

The sleep is entered only when the queue is idle (no tasks), so this should be safe. However, verify that queue_tasks is empty and no deferred tasks reference old slots when entering sleep.


806-811: LGTM! Resume path correctly skips one-time initialization.

Skipping metrics.init(), prompt cache setup, and model name derivation on resume is correct since these are one-time setup operations that should persist across sleep cycles.


2784-2794: LGTM! Bypass sleep logic is clean and efficient.

The bypass_sleep flag with the fast path check for disabled sleep (sleep_idle_seconds < 0) is well-designed. This allows health checks and property queries to respond immediately without waking the server.


3026-3026: LGTM! Health endpoint correctly bypasses sleep.

Health checks should always respond immediately regardless of sleep state, which this achieves with bypass_sleep = true.


3207-3212: LGTM! Props endpoint returns sleep status.

Including is_sleeping in the response allows clients to know the server's current state, which is useful for monitoring and debugging.


2755-2757: Verify unit conversion: seconds to milliseconds.

params.sleep_idle_seconds * 1000 converts seconds to milliseconds. Ensure that sleep_idle_seconds is indeed in seconds (the name suggests it is). If sleep_idle_seconds is negative (disabled), the multiplication preserves the negative value correctly.

@ngxson
Owner Author

ngxson commented Dec 20, 2025

@coderabbitai any potential thread-safety issues?

@coderabbitai

This comment was marked as resolved.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

3427-3472: Incomplete: get_models should bypass sleep per TODO.

Line 3427 has a TODO indicating this endpoint should bypass sleep mode, but Line 3429 doesn't pass bypass_sleep=true. This means /models requests will block while the server is sleeping or loading.

🔎 Proposed fix
     // TODO: allow this endpoint to be accessed bypassing sleep mode, same method as get_props
     this->get_models = [this](const server_http_req &) {
-        auto res = std::make_unique<server_res_generator>(ctx_server);
+        auto res = std::make_unique<server_res_generator>(ctx_server, true);
         json model_meta = nullptr;
         if (is_ready()) {
             model_meta = ctx_server.model_meta();

You'll also need to handle the case where ctx_server.ctx might be nullptr during sleep (similar to how get_props handles it by returning cached data).

🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)

567-587: Consider making destroy() idempotent.

The destroy() method is called from multiple places (destructor and sleep transition). While the current implementation appears safe, adding an early-return check at the beginning would make it explicitly idempotent and more robust:

void destroy() {
    if (!llama_init && !mctx) {
        return; // already destroyed
    }
    // ... rest of implementation
}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44a5a26 and e6ab62c.

📒 Files selected for processing (3)
  • tools/server/server-context.cpp (13 hunks)
  • tools/server/server-context.h (1 hunks)
  • tools/server/server.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/server.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.h
  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.h
  • tools/server/server-context.cpp
🧠 Learnings (3)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-context.h
📚 Learning: 2025-03-13T22:57:04.482Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-13T22:57:04.482Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.

Applied to files:

  • tools/server/server-context.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: server-windows
  • GitHub Check: pyright type-check
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-build (Release)
  • GitHub Check: WebUI Checks
🔇 Additional comments (8)
tools/server/server-context.h (1)

35-36: LGTM: Comment accurately reflects new sleeping behavior.

The updated documentation correctly indicates that get_llama_context() may return nullptr during sleep transitions, which aligns with the sleeping state management introduced in this PR.

tools/server/server-context.cpp (7)

547-549: Good: JSON response caching improves performance.

Caching json_server_props avoids re-computing server properties on every request, which is a sensible optimization for read-heavy endpoints like /props.


754-754: Verify intended behavior: slots.clear() on resume.

Line 754 calls slots.clear() even when resuming from sleep (is_resume=true). This erases all slot state including cached prompts and context.

Is this intentional? If the goal is to preserve slot state across sleep/wake cycles, this clear operation defeats that purpose. If the goal is to start fresh on wake, then this is correct.

Please clarify the intended behavior and add a comment explaining the decision.


876-926: LGTM: JSON response population logic is sound.

The populate_json_responses() method correctly builds cached server properties. The JSON parsing includes proper error handling (lines 881-886).

Note: This is only called during initial load, not on resume from sleep. Verify this is acceptable given that server properties shouldn't change between sleep cycles.


2760-2762: LGTM: start_loop() correctly passes sleep configuration.

The sleep idle timeout is properly passed from params_base to the queue's start_loop() method.


2788-2808: LGTM: bypass_sleep mechanism enables health checks during sleep.

The bypass_sleep parameter correctly allows certain endpoints (health, props) to respond immediately even when the server is sleeping, while blocking other requests via wait_until_no_sleep().

The fast-path optimization on Line 2795 (when sleep is disabled) is a nice touch.


3029-3034: LGTM: Health endpoint correctly bypasses sleep.

Using bypass_sleep=true ensures health checks return immediately even during sleep transitions, which is the expected behavior for health monitoring.


3212-3218: LGTM: Props endpoint exposes sleeping state correctly.

The endpoint uses bypass_sleep=true and includes is_sleeping in the response, allowing clients to monitor the server's sleep state without blocking.

Comment on lines 561 to 565
~server_context_impl() {
    if (!sleeping) {
        destroy();
    }
}

⚠️ Potential issue | 🟠 Major

Potential resource leak if destroyed while sleeping.

When the server is in sleeping state, the destructor skips calling destroy(). However, destroy() also cleans up resources beyond llama_init, including:

  • mctx (multimodal context)
  • Draft contexts and speculative decoding state in slots (ctx_dft, spec)
  • batch memory

These resources will leak if the server is destroyed while sleeping.

🔎 Proposed fix
 ~server_context_impl() {
-    if (!sleeping) {
-        destroy();
-    }
+    destroy();
 }

And update destroy() to be idempotent:

 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;

     mtmd_free(mctx);
     mctx = nullptr;

     // ... rest of cleanup
 }
🤖 Prompt for AI Agents
In tools/server/server-context.cpp around lines 561 to 565, the destructor
currently skips calling destroy() when sleeping which allows resources (mctx,
per-slot ctx_dft and spec, batch memory, etc.) to leak; always invoke destroy()
from the destructor and make destroy() idempotent: have destroy() check and
early-return if already destroyed, free/release mctx, iterate slots to free
ctx_dft and spec and clear any speculative state, free batch memory, stop/join
any background threads, null out pointers and reset flags after freeing, and
guard all frees with null checks to prevent double-free.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

569-589: Make destroy() fully idempotent.

The method should be safe to call multiple times to support the destructor fix (always calling destroy()). Add null checks before freeing resources:

  • Check mctx != nullptr before calling mtmd_free(mctx) (line 574)
  • Check slot.ctx_dft != nullptr before calling llama_free(slot.ctx_dft) (line 579)
  • Check slot.spec != nullptr before calling common_speculative_free(slot.spec) (line 582)
  • Consider adding an early return if llama_init is already null to avoid redundant cleanup

This makes the method safe to call from both the destructor and handle_sleeping_state().

♻️ Duplicate comments (2)
tools/server/server-context.cpp (2)

561-567: Resource leak risk when destroyed while sleeping.

As previously noted, skipping destroy() when sleeping=true can leak resources like mctx, per-slot ctx_dft and spec, and batch memory. The recommended fix is to always call destroy() and make it idempotent.


591-603: Critical thread-safety issue confirmed (per PR discussion).

As noted in the PR comments, this callback is invoked while holding mutex_tasks in server-queue.cpp, blocking all queue operations during expensive destroy() and load_model() calls.

Additionally, GGML_ABORT on Line 599 terminates the process. Consider whether graceful degradation or retry logic would be more appropriate.

🧹 Nitpick comments (2)
tools/server/server-context.cpp (2)

610-621: Potential double-registration of queue callbacks.

If load_model() is called multiple times when sleeping=false, the queue callbacks will be registered multiple times, causing each callback to fire multiple times per event. Consider adding a guard or assertion to ensure this block only executes once:

if (!is_resume) {
    static bool callbacks_registered = false;
    GGML_ASSERT(!callbacks_registered && "callbacks already registered");
    callbacks_registered = true;
    // ... register callbacks ...
}

Or ensure at the architectural level that load_model() is only called once during initial startup and once per wake from sleep.


3428-3430: Consider implementing bypass for get_models endpoint.

For consistency with get_props and get_health, the /models endpoint should also bypass sleep mode, as it returns mostly static metadata that doesn't require a loaded model.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6ab62c and 937b064.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (13 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.

Applied to files:

  • tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server-windows
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-build (Release)
  • GitHub Check: WebUI Checks
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (11)
tools/server/server-context.cpp (11)

547-549: LGTM - Caching server properties for performance.

The cached json_server_props avoids rebuilding server metadata on every request, which is a good optimization for the /props endpoint.


559-559: LGTM - State tracking for sleep mode.

The sleeping flag cleanly tracks whether the server is in sleep mode.


694-696: LGTM - Avoiding redundant logging setup on resume.

Skipping mtmd_helper_log_set() on resume is correct since logging configuration is global and doesn't need to be reapplied.


755-755: LGTM - Clearing slots for fresh initialization.

The slots.clear() is necessary to remove old slot objects before repopulating. This works correctly with destroy() which frees per-slot resources before this line executes.


772-772: LGTM - Proper error handling for slot initialization failures.

Returning false when draft context or speculator creation fails correctly signals load_model() failure.

Also applies to: 778-778


812-814: Verify prompt_cache validity after sleep/wake cycle.

The early return skips prompt_cache recreation. However, destroy() frees the context but doesn't clear prompt_cache, leaving it with stale state referencing the old context. When the model is reloaded on wake, the cache may contain invalid data.

Consider clearing prompt_cache in destroy() to avoid stale state:

void destroy() {
    // ... existing cleanup ...
    prompt_cache.reset();  // Clear stale cache
}

Or recreate the cache on resume before the early return.


869-875: LGTM - Proper initialization and error handling.

The call to populate_json_responses() correctly builds cached server properties, with appropriate error handling.


877-927: LGTM - Comprehensive server properties builder.

The method correctly builds and caches server metadata, with appropriate error handling for JSON parsing failures.


762-763: LGTM - Enabling idle sleep with correct time conversion.

The sleep idle timeout is correctly passed in milliseconds to start_loop().


2790-2809: LGTM - Clean bypass mechanism for sleep mode.

The bypass_sleep parameter elegantly allows certain endpoints (health, props) to respond without waiting for wake, while most endpoints block until the server is ready.


3032-3032: LGTM - Appropriate bypass usage for health and props.

Both endpoints correctly use bypass_sleep=true:

  • Health checks should always respond quickly for monitoring
  • Props endpoint reports the current is_sleeping state by merging it with cached properties

Also applies to: 3214-3218

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
tools/server/server-context.cpp (2)

597-600: Consider graceful degradation on model reload failure.

Using GGML_ABORT when model reload fails provides a clear failure mode, but it's a harsh response to potentially transient issues (e.g., temporary memory pressure).

Given the author's preference for simplicity (per learnings), this is acceptable for the initial implementation, but consider whether returning to a "permanently sleeping" state with an error response might be more user-friendly in the future.


3425-3427: Note on TODO for get_models bypass.

The TODO is valid, but bypassing sleep for /models would require caching the model metadata (similar to how json_server_props is cached) since the model APIs used in model_meta() require the model to be loaded. Consider caching model metadata in populate_json_responses() if this endpoint should be accessible during sleep.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 937b064 and 105e2f3.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (13 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.

Applied to files:

  • tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-build (Release)
  • GitHub Check: WebUI Checks
  • GitHub Check: server-windows
  • GitHub Check: pyright type-check
🔇 Additional comments (9)
tools/server/server-context.cpp (9)

547-549: LGTM!

The cached JSON properties member is appropriately initialized and provides a clean way to avoid rebuilding the response on each request.


561-567: The destructor logic is intentional but verify edge cases.

The comment clarifies that destroy() was already called when entering sleep. However, there's a subtle edge case: if the server is destroyed during the wake transition (after sleeping is set to false in handle_sleeping_state but before load_model completes), destroy() will be called on a partially-initialized state.

Since destroy() already handles null pointers gracefully (e.g., llama_init.reset() on nullptr is safe), this should be fine in practice.


755-755: Important fix: clearing slots before reinitializing.

The slots.clear() ensures that slots from a previous session are properly discarded before creating new ones during resume. This prevents slot accumulation across sleep/wake cycles.


874-924: LGTM!

Clean extraction of JSON response population into a dedicated function with proper error handling for webui config parsing.


2787-2797: LGTM!

The bypass_sleep parameter provides a clean mechanism for endpoints that don't require the model (like /health and /props) to respond immediately without waiting for model reload. The fast-path optimization when sleep is disabled avoids unnecessary synchronization overhead.


3027-3032: LGTM!

Health endpoint correctly bypasses sleep wait, allowing health checks to respond immediately. The sleeping status is appropriately exposed via the /props endpoint instead.


3210-3216: LGTM!

The /props endpoint correctly:

  1. Bypasses sleep wait to remain responsive during sleep
  2. Returns the cached properties plus the current is_sleeping status
  3. Creates a copy before modifying to avoid mutating the cached object

770-779: LGTM!

Proper error handling for speculative decoding initialization - if draft context or speculator fails to initialize, the function correctly returns false to signal the failure.


2759-2760: No action needed. The queue implementation already properly handles negative idle_sleep_ms values—when negative, the should_sleep() lambda returns false, disabling the sleeping procedure as documented. The concern about unexpected behavior with negative timeouts is unfounded.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)

569-589: Consider adding idempotency check for defensive programming.

The destroy() method is called from two controlled locations (destructor and handle_sleeping_state), but adding an early-return idempotency check would make it more robust against future changes or accidental double-calls.

🔎 Optional enhancement
 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;

     mtmd_free(mctx);
     mctx = nullptr;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 105e2f3 and fd09f88.

📒 Files selected for processing (2)
  • tools/server/README.md (1 hunks)
  • tools/server/server-context.cpp (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.

Applied to files:

  • tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: ggml-ci-x64-cpu-high-perf
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: pyright type-check
  • GitHub Check: server-windows
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-build (Release)
  • GitHub Check: WebUI Checks
🔇 Additional comments (3)
tools/server/server-context.cpp (3)

874-936: LGTM - Well-structured JSON response caching.

The populate_json_responses() method cleanly separates the JSON response building logic and is correctly called after model initialization on both fresh load and resume from sleep. The error handling in the caller (lines 866-872) properly validates the result.


2788-2798: LGTM - Bypass sleep logic is sound.

The server_res_generator constructor correctly implements the bypass logic:

  • Read-only endpoints (health, props, models) pass bypass_sleep=true and use cached JSON data
  • Endpoints requiring an active model wait via wait_until_no_sleep(), which triggers wake-up when needed
  • The fast-path check for disabled sleeping (sleep_idle_seconds < 0) avoids unnecessary overhead

607-622: LGTM - Resume logic correctly preserves queue callbacks.

The is_resume flag appropriately skips re-wiring queue callbacks when waking from sleep, since the lambda captures of this remain valid across the sleep/wake cycle. This avoids creating duplicate callbacks while ensuring the sleeping state handler remains functional.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd09f88 and 0bb9bc4.

📒 Files selected for processing (2)
  • tools/server/README.md (1 hunks)
  • tools/server/server-context.cpp (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

Applied to files:

  • tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.

Applied to files:

  • tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-build (Release)
  • GitHub Check: WebUI Checks
  • GitHub Check: server-build (ADDRESS, RelWithDebInfo)
  • GitHub Check: server-windows
  • GitHub Check: pyright type-check

Comment on lines +547 to +549
// cached responses for HTTP API (read-only from HTTP threads)
json json_server_props = json::object();
json json_server_model_meta = json::object();

⚠️ Potential issue | 🔴 Critical

Race condition on JSON response cache access.

The json_server_props and json_server_model_meta members are written by populate_json_responses() (called from load_model() at Line 866) but read directly by HTTP threads in route handlers (Lines 3217, 3445) without synchronization. When the server wakes from sleep, handle_sleeping_state() calls load_model(), which modifies these JSON objects concurrently with HTTP handlers reading them.

C++ does not provide atomic guarantees for complex types like nlohmann::json, so concurrent read/write access creates a data race.

🔎 Proposed fix

Wrap access to these cached JSON objects with a mutex:

+    std::mutex mutex_json_cache;
+
     // cached responses for HTTP API (read-only from HTTP threads)
     json json_server_props      = json::object();
     json json_server_model_meta = json::object();

Then protect writes in populate_json_responses():

bool populate_json_responses() {
    // ... build json_webui_settings, default_generation_settings_for_props, etc. ...
    
+    std::lock_guard<std::mutex> lock(mutex_json_cache);
    json_server_props = { /* ... */ };
    json_server_model_meta = { /* ... */ };
    return true;
}

And protect reads in route handlers:

this->get_props = [this](const server_http_req &) {
    auto res = std::make_unique<server_res_generator>(ctx_server, true);
+    json props;
+    {
+        std::lock_guard<std::mutex> lock(ctx_server.mutex_json_cache);
+        props = ctx_server.json_server_props;
+    }
-    auto props = ctx_server.json_server_props;
    props["is_sleeping"] = ctx_server.queue_tasks.is_sleeping();
    res->ok(props);
    return res;
};

Apply similar protection for get_models and any other readers.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

569-589: Make destroy() idempotent to support unconditional destructor calls.

To resolve the resource-leak issue in the destructor (lines 561-567), destroy() must become idempotent. Add an early-return check at the start:

void destroy() {
    if (!llama_init && !mctx) {
        return; // already destroyed
    }
    // ... rest of cleanup
}

Additionally, guard individual resource frees with null-checks to prevent double-free if destroy() is invoked multiple times.

🔎 Proposed fix
 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
+
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;

-    mtmd_free(mctx);
-    mctx = nullptr;
+    if (mctx) {
+        mtmd_free(mctx);
+        mctx = nullptr;
+    }

     // Clear any sampling context
     for (server_slot & slot : slots) {
-        llama_free(slot.ctx_dft);
-        slot.ctx_dft = nullptr;
+        if (slot.ctx_dft) {
+            llama_free(slot.ctx_dft);
+            slot.ctx_dft = nullptr;
+        }

-        common_speculative_free(slot.spec);
-        slot.spec = nullptr;
+        if (slot.spec) {
+            common_speculative_free(slot.spec);
+            slot.spec = nullptr;
+        }

         llama_batch_free(slot.batch_spec);
     }

     llama_batch_free(batch);
 }
♻️ Duplicate comments (2)
tools/server/server-context.cpp (2)

547-549: Race condition on JSON cache access remains unresolved.

As previously flagged, json_server_props and json_server_model_meta are accessed by HTTP threads without synchronization. This review confirms:

  1. Writers: populate_json_responses() (lines 877-939) assigns these objects, called from load_model() during wake-up
  2. Readers: Multiple HTTP endpoints (e.g., get_props at line 3220, get_models at line 3448) read directly without locks
  3. Aggravating factor: get_props uses bypass_sleep=true (line 3219), allowing reads during model reload when writes are occurring

The previously suggested fix (adding a mutex_json_cache and protecting all read/write sites) should be implemented before merging.

Also applies to: 877-939, 3218-3224


561-567: Resource leak when destroyed during sleep (previously flagged).

The destructor still skips destroy() when sleeping=true, leaving mctx, per-slot ctx_dft/spec/batch_spec, and the main batch un-freed. The previously suggested fix—always invoke destroy() and make it idempotent by checking and early-returning if resources are already freed—should be implemented.
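
For reference, a minimal sketch of that shape, reusing the member names from the diffs above (illustration only, not the PR's code):

// Sketch: always tear down, and let destroy() bail out when already torn down.
~server_context() {
    destroy(); // safe even after a sleep transition already freed resources
}

void destroy() {
    if (!llama_init && !mctx) {
        return; // already destroyed (e.g. freed when entering sleep)
    }
    // ... free mctx, per-slot ctx_dft / spec / batch_spec, then the main batch ...
}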

📜 Review details
📥 Commits

Reviewing files that changed from the base of the PR and between 0bb9bc4 and d850082.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (20 hunks)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
tools/server/server-context.cpp (1)

547-549: Critical race condition on JSON cache remains unaddressed (duplicate from previous review).

The previous review identified a data race where json_server_props and json_server_model_meta are written by populate_json_responses() (called from load_model() at line 874) but read directly by HTTP threads (lines 3225, 3453) without synchronization. This creates undefined behavior when the server wakes from sleep.

The comment "read-only from HTTP threads" is misleading—these fields are writable during model reload.

🔎 Recommended fix

Add a mutex to protect concurrent access:

+    std::mutex mutex_json_cache;
+
     // cached responses for HTTP API (read-only from HTTP threads)
     json json_server_props      = json::object();
     json json_server_model_meta = json::object();

Protect writes in populate_json_responses():

bool populate_json_responses() {
    // ... build json objects ...
    
+    std::lock_guard<std::mutex> lock(mutex_json_cache);
    json_server_props = { /* ... */ };
    json_server_model_meta = { /* ... */ };
    return true;
}

Protect reads in route handlers (lines 3225, 3453):

json props;
{
    std::lock_guard<std::mutex> lock(ctx_server.mutex_json_cache);
    props = ctx_server.json_server_props;
}
props["is_sleeping"] = ctx_server.queue_tasks.is_sleeping();

Based on learnings, this fix is necessary before merging.

🧹 Nitpick comments (2)
tools/server/README-dev.md (1)

110-111: Use markdown link syntax for bare URLs.

The newly added PR references follow the existing formatting pattern in the file, but all URLs in this section (lines 100–111) are bare URLs. Consider wrapping them in proper markdown link syntax ([text](url)) for better markup consistency and structure, as flagged by markdownlint.

This is an optional refactor that could improve the entire "Notable Related PRs" section uniformly.

🔎 Proposed improvement for lines 110–111 (alternative: apply to all lines 100–111)
- INI presets: https://github.com/ggml-org/llama.cpp/pull/17859 (+ refactoring: https://github.com/ggml-org/llama.cpp/pull/18169)
- Sleeping mode: https://github.com/ggml-org/llama.cpp/pull/18228
+ INI presets: [PR 17859](https://github.com/ggml-org/llama.cpp/pull/17859) (+ refactoring: [PR 18169](https://github.com/ggml-org/llama.cpp/pull/18169))
+ Sleeping mode: [PR 18228](https://github.com/ggml-org/llama.cpp/pull/18228)
tools/server/server-context.cpp (1)

569-589: Consider making destroy() idempotent for defensive programming.

While the current design prevents double-calls through the sleeping flag, adding early-return guards would make destroy() safer and more maintainable:

void destroy() {
    if (!llama_init && !mctx && slots.empty()) {
        return; // already destroyed
    }
    
    llama_init.reset();
    ctx = nullptr;
    model = nullptr;
    
    mtmd_free(mctx);
    mctx = nullptr;
    
    // ... rest of cleanup
}

This prevents potential issues if the calling logic changes in the future.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d850082 and 1663d2f.

📒 Files selected for processing (2)
  • tools/server/README-dev.md (1 hunks)
  • tools/server/server-context.cpp (19 hunks)
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-context.cpp

[error] 854-854: Trailing whitespace detected by editorconfig-checker at line 854. Step: editorconfig-checker.

🪛 markdownlint-cli2 (0.18.1)
tools/server/README-dev.md

110-110: Bare URL used

(MD034, no-bare-urls)


110-110: Bare URL used

(MD034, no-bare-urls)


111-111: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (2)
tools/server/server-context.cpp (2)

3037-3039: Good documentation of the sleeping-aware response pattern.

The comment clearly explains why all routes must use server_res_generator, ensuring consistent handling of the sleeping state across all endpoints.


3040-3045: Correct use of bypass_sleep for health endpoint.

Health checks properly bypass the sleep wait, ensuring monitoring systems can query server status without blocking. This is the right design for operational health checks.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
tools/server/server-context.cpp (5)

547-549: Data race on cached JSON objects remains unaddressed.

The json_server_props and json_server_model_meta members are written by populate_json_responses() (called during model reload on wake from sleep) while HTTP threads read them concurrently without synchronization. This creates undefined behavior.


561-567: Destructor skips cleanup when sleeping—potential resource leak.

When sleeping is true, the destructor skips destroy(), but mctx, per-slot ctx_dft/spec, and batch may still hold resources that need cleanup. The destroy() method should be made idempotent and always called.


591-603: Thread-safety issue: callback invoked while holding mutex (duplicate).

As noted in previous reviews, handle_sleeping_state() is called from server-queue.cpp while mutex_tasks is held. Both destroy() and load_model() are expensive operations that can block for seconds/minutes, causing all queue operations to stall.
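
For illustration, the usual shape of that workaround is to flip the flag under the lock and run the callback only after releasing it; a hypothetical sketch (set_sleeping is an invented name; mutex_tasks and callback_sleeping_state follow the queue code). Note that per the maintainer, blocking the queue during this transition is intentional, so this sketches the reviewer's suggestion only:

// Hypothetical: decide under the lock, run the heavy callback outside it.
void server_queue::set_sleeping(bool new_state) {
    bool changed = false;
    {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (sleeping != new_state) {
            sleeping = new_state;
            changed  = true;
        }
    }
    if (changed && callback_sleeping_state) {
        callback_sleeping_state(new_state); // may take minutes; queue stays responsive
    }
}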


3223-3228: Unsynchronized read of json_server_props (duplicate).

While bypass_sleep=true is correct for the props endpoint, reading ctx_server.json_server_props without synchronization creates a data race with populate_json_responses() writes during model reload.


3447-3459: TODO remains unresolved; race condition persists (duplicate).

The TODO at lines 3447-3448 acknowledges the endpoint is unsafe during model reload. While the comment at lines 3457-3459 warns against accessing dynamic model info, json_server_model_meta is still read without synchronization, and ctx_server.mctx access at line 3455 may observe inconsistent state during sleep transitions.

🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)

559-559: Consider atomic or mutex protection for sleeping flag.

The sleeping flag is accessed from handle_sleeping_state() (which may be called from the queue thread), load_model(), and potentially inspected via queue_tasks.is_sleeping() from HTTP threads. While current usage appears safe due to sequencing, using std::atomic<bool> would provide explicit thread-safety guarantees.

🔎 Suggested change
-    bool sleeping = false;
+    std::atomic<bool> sleeping{false};
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1663d2f and b51da9a.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (19 hunks)
🔇 Additional comments (9)
tools/server/server-context.cpp (9)

742-742: LGTM: Slot clearing before re-initialization.

Clearing slots before the loop ensures a clean slate on both fresh load and resume, preventing stale slot state from accumulating.


855-880: Well-structured initialization separation.

The separation of init() for one-time setup (queue wiring, metrics) from load_model() for repeatable model loading is a good design choice that cleanly supports the sleep/wake lifecycle.


882-944: populate_json_responses() correctly builds cached responses.

The function properly extracts model metadata and server properties. However, as noted in prior comments, writes to json_server_props and json_server_model_meta require synchronization since HTTP threads read them concurrently.


2767-2770: LGTM: Sleep idle timeout integration.

The idle timeout is correctly converted from seconds to milliseconds and passed to the queue's main loop.
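
Roughly, the call site is a one-liner (sketch; a negative sleep_idle_seconds stays negative after the multiply, which the queue reads as "idle sleep disabled"):

queue_tasks.start_loop((int64_t) params_base.sleep_idle_seconds * 1000);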


2796-2806: Good design for sleep-aware response generation.

The bypass_sleep parameter allows read-only endpoints (health, props) to respond immediately without waiting for model load, while inference endpoints correctly block via wait_until_no_sleep(). The fast-path optimization when sleeping is disabled (sleep_idle_seconds < 0) avoids unnecessary synchronization overhead.
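
A hypothetical sketch of that constructor logic, with member names assumed where the code is not quoted above:

// Sketch only; exact members are assumptions.
server_res_generator(server_context & ctx_server, bool bypass_sleep = false)
        : ctx_server(ctx_server) {
    if (ctx_server.params_base.sleep_idle_seconds < 0) {
        return; // fast path: sleeping disabled, nothing to wait for
    }
    if (!bypass_sleep) {
        // inference endpoints block here until the model has been reloaded
        ctx_server.queue_tasks.wait_until_no_sleep();
    }
}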


2823-2834: Clean refactor to support sleep-aware generation.

Passing the pre-constructed server_res_generator as the first parameter allows callers to control the bypass_sleep behavior before entering this function. The move semantics are used correctly.
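
The resulting handler pattern, sketched (the argument list of handle_completions_impl is abbreviated here):

this->post_completions = [this](const server_http_req & req) {
    // constructing the generator first performs the sleep/wake wait
    auto res = std::make_unique<server_res_generator>(ctx_server);
    return handle_completions_impl(std::move(res), req /*, ... */);
};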


3040-3045: Correct: Health endpoint bypasses sleep.

Health checks should always respond immediately regardless of sleep state, which is correctly implemented with bypass_sleep=true.


3354-3366: Consistent refactoring across completion endpoints.

The pattern of constructing server_res_generator first (which handles sleep/wake synchronization) and then passing it to handle_completions_impl is applied consistently across all completion-related endpoints.


3036-3039: Good documentation of the sleep-aware pattern.

The comment clearly explains the requirement that all route handlers must start with std::make_unique<server_res_generator>, ensuring consistent sleep handling across endpoints.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
tools/server/server-context.cpp (3)

569-589: Consider making destroy() fully idempotent.

Currently, destroy() relies on the sleeping flag in the destructor to prevent double-free. While this works for the current design, making destroy() itself idempotent would improve robustness.

🔎 Suggested approach

Add an early-return guard at the start of destroy():

 void destroy() {
+    if (!llama_init && !mctx && batch.n_tokens == 0) {
+        return; // already destroyed
+    }
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;

     mtmd_free(mctx);
     mctx = nullptr;
     // ... rest of cleanup
 }

This allows destroy() to be safely called multiple times without relying on external state checks.


598-600: Consider more graceful error handling than GGML_ABORT.

If load_model() fails during wake (e.g., model file moved/deleted, OOM), GGML_ABORT will crash the entire server. While the server is in a difficult-to-recover state, consider alternatives:

  • Stay in sleeping state and log the error, allowing retry on next request
  • Set an error state that /health can report
  • Gracefully terminate the server process with an exit code

However, given the intentional blocking during state transition (per learnings), recovery complexity is high. If you decide to keep GGML_ABORT, consider adding a detailed error message explaining why the server cannot continue.

🔎 Suggested approach
         } else {
             SRV_INF("%s", "server is exiting sleeping state\n");
             if (!load_model(params_base)) {
-                GGML_ABORT("failed to reload model after sleeping");
+                SRV_ERR("Critical: failed to reload model after sleeping. Model path: %s\n", 
+                        params_base.model.path.c_str());
+                // Stay in sleeping state; next request will retry
+                return;
             }
         }
         sleeping = new_state;

3447-3453: Consider adding bypass_sleep=true to resolve TODO.

The TODO at lines 3447-3448 can be addressed by adding bypass_sleep=true to the server_res_generator constructor, similar to /health and /props endpoints. Since json_server_model_meta is populated once during initialization and represents static model metadata (vocab size, context length, etc.), it's safe to read during sleep state.

🔎 Proposed fix
     // TODO: this endpoint is unsafe to access during model reloading (i.e. wake up from sleeping)
     // how to make it work even during load_model()?
     this->get_models = [this](const server_http_req &) {
-        auto res = std::make_unique<server_res_generator>(ctx_server);
+        auto res = std::make_unique<server_res_generator>(ctx_server, true);
         json model_meta = nullptr;
         if (is_ready()) {
             model_meta = ctx_server.json_server_model_meta;
         }

This allows clients to query model information immediately, even when the server is sleeping, similar to how /props works.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b51da9a and 06a5ebe.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp (19 hunks)
🧠 Learnings
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).

🔇 Additional comments (9)
tools/server/server-context.cpp (9)

561-567: Destructor logic is correct per the sleep lifecycle.

The conditional call to destroy() is appropriate: when sleeping=true, resources have already been freed during the sleep entry (line 595). This avoids double-free issues.

Based on learnings, the intentional blocking behavior during sleep state transitions means the destructor will not be called mid-transition, eliminating race conditions between destructor and handle_sleeping_state().


848-852: Early return on resume is correct by design.

The comment at line 855 clarifies that init() is "only called once during initialization." On resume from sleep (is_resume=true), skipping init() is intentional because:

  1. populate_json_responses() was already called during initial startup (server properties are static)
  2. Queue callbacks (on_new_task, on_sleeping_state) are already wired
  3. metrics.init() already set the start time

The model/context/slots/batch recreation before the early return ensures all necessary runtime state is restored.
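
One plausible shape of that flow, sketched from the description above (how is_resume is derived is an assumption):

bool load_model(const common_params & params) {
    const bool is_resume = sleeping; // waking up: one-time setup already done
    // ... recreate model, context, slots and batch ...
    if (is_resume) {
        return true; // init() and populate_json_responses() ran at startup
    }
    return init(); // wire queue callbacks, metrics, cached JSON (once)
}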


855-880: Well-structured separation of concerns.

The new init() method correctly separates one-time server initialization (queue wiring, metrics, JSON population) from model loading. The assertions ensure it's only called in the right state, and the conditional call at line 849 ensures it runs once during startup but not on resume.

The sleeping state callback wiring at lines 868-870 properly integrates the sleep lifecycle with the queue system.


882-944: JSON response caching is correctly implemented.

The populate_json_responses() method builds static metadata once during initialization. Since it's called from init() (line 874), which only runs on initial startup (not on resume), this ensures:

  1. JSON caches are populated before HTTP routes become accessible
  2. No race conditions between writers and HTTP readers
  3. Properties remain valid across sleep/wake cycles (they're static)

The structured approach with separate blocks for webui settings, server properties, and model metadata improves maintainability.


2768-2769: LGTM!

The start_loop() method correctly wires up the sleep idle timeout by passing sleep_idle_seconds * 1000 (converting seconds to milliseconds) to the queue system. This integrates the idle-sleep feature with the task queue lifecycle.


2796-2806: Well-designed bypass mechanism for sleep-aware endpoints.

The bypass_sleep parameter allows selective endpoint access during sleep state:

  • Most endpoints block via wait_until_no_sleep() until the server wakes
  • Critical endpoints (health, props) can bypass this wait with bypass_sleep=true
  • Fast path optimization when sleep is disabled (sleep_idle_seconds < 0)

The blocking behavior in wait_until_no_sleep() is intentional per the design, ensuring request ordering and preventing race conditions during model reload.
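
A condition-variable sketch of that wait, assuming the queue pairs mutex_tasks with a condition_tasks condition variable:

void wait_until_no_sleep() {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    if (!sleeping) {
        return; // fast path: server is awake
    }
    req_stop_sleeping = true;      // ask the main loop to reload the model
    condition_tasks.notify_all();
    condition_tasks.wait(lock, [this] { return !sleeping; });
}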


3040-3044: Appropriate use of bypass_sleep for health endpoint.

Using bypass_sleep=true is correct here: the /health endpoint should remain responsive regardless of sleep state, allowing monitoring systems to track server status. The simple implementation returns {"status": "ok"} immediately without accessing model resources.


3223-3228: Correct implementation with appropriate bypass_sleep usage.

Using bypass_sleep=true allows clients to query server properties (including is_sleeping status) regardless of sleep state. The read of json_server_props is safe because it's populated once during initialization before HTTP routes become accessible.

The dynamic is_sleeping field provides real-time status information, which is valuable for clients to understand server state.


3036-3039: Consistent sleep-aware implementation across all route handlers.

The pattern is correctly applied across all endpoints:

  1. Inference endpoints (completions, chat, etc.) correctly use the default bypass_sleep=false, ensuring requests wait for the model to load before processing
  2. Status endpoints (/health, /props) use bypass_sleep=true to remain responsive during sleep
  3. The comment at lines 3037-3038 provides clear guidance for future endpoint implementations
  4. Moving res into handle_completions_impl ensures proper lifecycle management

This consistent approach ensures all endpoints participate correctly in the sleep/wake lifecycle while maintaining appropriate access controls.

Also applies to: 3354-3366, 3368-3380, 3382-3398, 3400-3416
