Xsn/server sleep #62
base: master
Conversation
Walkthrough
Adds an idle-sleep feature: a new CLI option and config field for the sleep timeout, queue idle detection with sleeping/wake transitions that unload and reload model memory, and public cached JSON props/model_meta.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant HTTP as HTTP Server (routes)
    participant Queue as Server Queue
    participant Context as Server Context
    participant Model as Model/Memory
    Note over Queue,Context: start_loop(idle_sleep_ms) monitors activity
    HTTP->>Queue: POST /completion (task)
    activate Queue
    Queue->>Context: ensure model loaded / enqueue task
    Context->>Model: load if needed
    Queue->>Queue: update time_last_task
    deactivate Queue
    Note over Queue: No tasks for idle_sleep_ms → enter sleeping
    Queue->>Queue: set sleeping = true
    Queue->>Context: handle_sleeping_state(true)
    Context->>Model: destroy() / unload model & free memory
    HTTP->>HTTP: GET /props (allowed while sleeping) — returns is_sleeping = true
    HTTP->>Queue: POST /completion (wake request)
    Queue->>Queue: wait_until_no_sleep() sets req_stop_sleeping
    Queue->>Context: handle_sleeping_state(false)
    Context->>Model: load_model(resume=true)
    Queue->>Queue: set sleeping = false
    Queue->>Context: process pending task
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings, 1 inconclusive)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tools/server/server-queue.cpp (1)
176-180: Double reset of `req_stop_sleeping` is redundant.
`req_stop_sleeping` is set to `false` on line 167 before waiting, and again on line 176 after waking. The second reset is unnecessary since the flag was already false when entering the wait, and the only place it's set to true is in `wait_until_no_sleep()`, which then waits for `sleeping` to become false.
🔎 Proposed simplification
```diff
     QUE_INF("%s", "exiting sleeping state\n");
-    req_stop_sleeping = false;
     callback_sleeping_state(false);
     sleeping = false;
     time_last_task = ggml_time_ms();
```
tools/server/server-context.cpp (2)
584-597: Consider graceful error handling instead of `exit(1)` on model reload failure.
Calling `exit(1)` when model reload fails after waking from sleep is abrupt. While it's difficult to recover from this state, the server could potentially:
- Log the error and stay in an error state
- Return error responses until manually restarted
- Allow retry attempts
The current approach is pragmatic given the complexity of recovery, but consider at least using a named constant or a more descriptive log message indicating this is an unrecoverable error.
🔎 Proposed improvement
```diff
     } else {
         SRV_INF("%s", "server is exiting sleeping state\n");
         if (!load_model(params_base)) {
-            SRV_ERR("%s", "fatal: failed to reload model after sleeping\n");
-            exit(1);
+            SRV_ERR("%s", "fatal: failed to reload model after sleeping - server cannot continue\n");
+            std::abort(); // Use abort() to indicate unrecoverable error
         }
     }
```
3422-3424: Track TODO: `get_models` should bypass sleep mode.
The TODO comment indicates that `get_models` should be accessible without waking the server, similar to `get_props`. Consider implementing this if model metadata is cached (which appears to be partially done via `ctx_server.model_name`).
Since `model_name` is cached in `server_context_impl` and persists through sleep, this endpoint could potentially bypass sleep for the cached metadata. However, `model_meta()` accesses `model`, which may be null during sleep.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- common/arg.cpp (1 hunks)
- common/common.h (1 hunks)
- tools/cli/cli.cpp (0 hunks)
- tools/server/README.md (1 hunks)
- tools/server/server-context.cpp (13 hunks)
- tools/server/server-context.h (0 hunks)
- tools/server/server-queue.cpp (5 hunks)
- tools/server/server-queue.h (4 hunks)
- tools/server/server.cpp (0 hunks)
- tools/server/tests/unit/test_sleep.py (1 hunks)
- tools/server/tests/utils.py (2 hunks)
💤 Files with no reviewable changes (3)
- tools/cli/cli.cpp
- tools/server/server-context.h
- tools/server/server.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
- tools/server/tests/utils.py
- tools/server/tests/unit/test_sleep.py
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/server/tests/utils.py
- common/arg.cpp
- tools/server/server-context.cpp
- tools/server/server-queue.h
- common/common.h
- tools/server/tests/unit/test_sleep.py
- tools/server/server-queue.cpp
tools/server/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Run server unit tests using the provided test infrastructure:
cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh
Files:
tools/server/tests/utils.pytools/server/tests/unit/test_sleep.py
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- common/arg.cpp
- tools/server/server-context.cpp
- tools/server/server-queue.h
- common/common.h
- tools/server/server-queue.cpp
🧠 Learnings (1)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
🧬 Code graph analysis (2)
tools/server/tests/unit/test_sleep.py (1)
tools/server/tests/utils.py (3)
tinyllama2 (454-465), start (117-275), make_request (288-319)
tools/server/server-queue.cpp (1)
tools/server/server-context.cpp (5)
terminate (2760-2762), terminate (2760-2760), start_loop (2755-2758), start_loop (2755-2755), res (1051-1076)
🪛 Ruff (0.14.8)
tools/server/tests/unit/test_sleep.py
3-3: from utils import * used; unable to detect undefined names
(F403)
5-5: ServerPreset may be undefined, or defined from star imports
(F405)
11-11: ServerPreset may be undefined, or defined from star imports
(F405)
27-27: Avoid equality comparisons to `True`; use `if res.body["is_sleeping"]:` for truth checks
Replace with `res.body["is_sleeping"]`
(E712)
39-39: Avoid equality comparisons to `False`; use `if not res.body["is_sleeping"]:` for false checks
Replace with `not res.body["is_sleeping"]`
(E712)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: ggml-ci-arm64-cpu-high-perf
- GitHub Check: ios-xcode-build
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: WebUI Checks
- GitHub Check: server-build (Release)
🔇 Additional comments (20)
tools/server/tests/utils.py (1)
103-103: Sleep-idle flag is correctly threaded through ServerProcess
The new `sleep_idle_seconds` option and its propagation into `server_args` are consistent with other optional numeric flags and with the CLI handler; `None` meaning "use server default" vs. `-1`/`>0` being explicit values is a clean contract. No changes needed here.
Also applies to: 234-235
tools/server/README.md (1)
1624-1633: Sleeping-on-idle docs align with implementation
The description of `--sleep-idle-seconds`, sleep behavior, and the `/health` and `/props` exemptions matches the exposed API and test behavior; this section looks accurate and well-scoped.
common/arg.cpp (1)
2890-2899: CLI wiring and validation for `--sleep-idle-seconds` look correct
The new server-only option cleanly maps into `params.sleep_idle_seconds`, with a sensible default and validation (`-1` or `>0` only). This integrates well with the rest of the argument machinery.
common/common.h (1)
478-479: New `sleep_idle_seconds` field is consistent with server params
Adding `sleep_idle_seconds` with a default of `-1` and the accompanying comment lines up with the CLI flag semantics and keeps server-specific configuration grouped coherently in `common_params`.
tools/server/server-queue.h (3)
15-18: LGTM! Sleep state tracking fields are well-organized.
The new fields `sleeping`, `req_stop_sleeping`, and `time_last_task` provide a clean mechanism for idle sleep management. The `time_last_task` initialization to 0 is fine since it gets properly set in `start_loop()` before any idle checks occur.
48-55: LGTM! Clean API for sleep state management.
The `wait_until_no_sleep()` and `is_sleeping()` methods provide a clear interface for external components to interact with the sleep state. The mutex lock in `is_sleeping()` ensures thread-safe access.
82-101: Thread-safety warning is helpful but consider defensive null-check for callback.
The comment about non-thread-safe usage is valuable. However, `callback_sleeping_state` could be null if `on_sleeping_state()` is never called before `start_loop()`.
Please verify that `on_sleeping_state()` is always registered before `start_loop()` is called, or add a null check before invoking the callback in `start_loop()`. Based on the code in `server-context.cpp`, it appears the callback is registered in `load_model()`, which is called before `start_loop()`, but this ordering dependency should be documented or enforced.
tools/server/server-queue.cpp (4)
36-37: LGTM! Consistent time tracking across all queue mutation operations.
Updating `time_last_task` in `post()`, `post(vector)`, `defer()`, and `pop_deferred_task()` ensures accurate idle time calculation.
Also applies to: 58-59, 67-68, 83-84
87-102: LGTM! Well-implemented wake-from-sleep mechanism.
The `wait_until_no_sleep()` function correctly handles the synchronization:
- Early return if not sleeping
- Sets `req_stop_sleeping` and notifies the main loop
- Waits until `sleeping` becomes false
The condition variable usage is correct for coordinating between the caller and the main loop.
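To make the described handshake concrete, here is a minimal, self-contained sketch of the pattern (the member names follow this description; the real `server_queue` carries more state and logic):
```cpp
#include <condition_variable>
#include <mutex>

// Minimal sketch of the wake handshake described above; not the actual
// implementation, only the two flags and the condition-variable dance.
struct sleep_state {
    std::mutex              mutex_tasks;
    std::condition_variable condition_tasks;
    bool sleeping          = false;
    bool req_stop_sleeping = false;

    // Called by a thread that needs the model before handling a request.
    void wait_until_no_sleep() {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        if (!sleeping) {
            return; // early return if not sleeping
        }
        req_stop_sleeping = true;     // ask the main loop to wake up
        condition_tasks.notify_all(); // interrupt its idle wait
        condition_tasks.wait(lock, [this] { return !sleeping; });
    }
};
```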
163-181: Callback invoked while holding mutex - verify no deadlock risk.
`callback_sleeping_state(true)` and `callback_sleeping_state(false)` are called while `mutex_tasks` is held (lines 166, 177). If the callback implementation attempts to call any `server_queue` method that acquires this mutex, it will deadlock.
Based on `server-context.cpp`, `handle_sleeping_state()` calls `destroy()` and `load_model()`, which don't appear to directly call queue methods that would re-acquire the mutex. However, this is a subtle contract.
Consider documenting this constraint or releasing the lock before invoking callbacks. The current implementation works because `handle_sleeping_state()` doesn't re-enter `server_queue` methods, but this is fragile.
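One hedged sketch of the suggested alternative (releasing the lock before the callback; it assumes members shaped like those described above and is not the actual code):
```cpp
#include <functional>
#include <mutex>

// Sketch only: enter the sleeping state without holding the queue mutex
// across the (potentially expensive) model unload callback.
void enter_sleep(std::mutex & mutex_tasks,
                 bool & sleeping,
                 const std::function<void(bool)> & callback_sleeping_state) {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    sleeping = true;
    auto cb = callback_sleeping_state; // copy while locked
    lock.unlock();                     // release before the long-running call
    if (cb) {                          // also covers a never-registered callback
        cb(true);                      // unload the model; queue stays usable
    }
}
```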
149-153: Good defensive update of `time_last_task` after slot updates.
This ensures that long-running inference operations don't falsely trigger idle sleep immediately after completion.
tools/server/server-context.cpp (9)
547-550: LGTM! Cached JSON responses improve efficiency.
Caching server properties in `json_server_props` avoids rebuilding the same JSON on every `/props` request.
561-565: Destructor correctly skips cleanup when sleeping.
When the server is in sleeping state, the model has already been destroyed by `handle_sleeping_state()`, so skipping `destroy()` prevents double-free issues.
600-615: LGTM! Resume path correctly skips queue wiring.
When resuming from sleep (`is_resume == true`), the queue callbacks are already registered from the initial `load_model()` call, so skipping re-registration is correct.
749-749: Clearing slots on resume is correct but verify no dangling references.
`slots.clear()` before re-populating ensures fresh slot state on resume. Since the slots are owned by the `server_context_impl` and referenced by pointer/index in tasks, ensure no tasks are in flight when entering sleep state.
The sleep is entered only when the queue is idle (no tasks), so this should be safe. However, verify that `queue_tasks` is empty and no deferred tasks reference old slots when entering sleep.
806-811: LGTM! Resume path correctly skips one-time initialization.
Skipping `metrics.init()`, prompt cache setup, and model name derivation on resume is correct since these are one-time setup operations that should persist across sleep cycles.
2784-2794: LGTM! Bypass sleep logic is clean and efficient.
The `bypass_sleep` flag with the fast-path check for disabled sleep (`sleep_idle_seconds < 0`) is well-designed. This allows health checks and property queries to respond immediately without waking the server.
3026-3026: LGTM! Health endpoint correctly bypasses sleep.
Health checks should always respond immediately regardless of sleep state, which this achieves with `bypass_sleep = true`.
3207-3212: LGTM! Props endpoint returns sleep status.
Including `is_sleeping` in the response allows clients to know the server's current state, which is useful for monitoring and debugging.
2755-2757: Verify unit conversion: seconds to milliseconds.
`params.sleep_idle_seconds * 1000` converts seconds to milliseconds. Ensure that `sleep_idle_seconds` is indeed in seconds (the name suggests it is). If `sleep_idle_seconds` is negative (disabled), the multiplication preserves the negative value correctly.
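A one-function sketch of this conversion and its disabled case (the parameter name comes from the PR; the helper itself is hypothetical):
```cpp
#include <cstdint>

// Hypothetical helper illustrating the conversion: a disabled value of -1
// scales to -1000, which remains negative and keeps sleep turned off.
int64_t to_idle_sleep_ms(int32_t sleep_idle_seconds) {
    return (int64_t) sleep_idle_seconds * 1000;
}
```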
@coderabbitai any potential thread-safety issues?
This comment was marked as resolved.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)
3427-3472: Incomplete: get_models should bypass sleep per TODO.
Line 3427 has a TODO indicating this endpoint should bypass sleep mode, but Line 3429 doesn't pass `bypass_sleep=true`. This means `/models` requests will block while the server is sleeping or loading.
🔎 Proposed fix
```diff
     // TODO: allow this endpoint to be accessed bypassing sleep mode, same method as get_props
     this->get_models = [this](const server_http_req &) {
-        auto res = std::make_unique<server_res_generator>(ctx_server);
+        auto res = std::make_unique<server_res_generator>(ctx_server, true);
         json model_meta = nullptr;
         if (is_ready()) {
             model_meta = ctx_server.model_meta();
```
You'll also need to handle the case where `ctx_server.ctx` might be nullptr during sleep (similar to how `get_props` handles it by returning cached data).
🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)
567-587: Consider making destroy() idempotent.
The `destroy()` method is called from multiple places (destructor and sleep transition). While the current implementation appears safe, adding an early-return check at the beginning would make it explicitly idempotent and more robust:
```cpp
void destroy() {
    if (!llama_init && !mctx) {
        return; // already destroyed
    }
    // ... rest of implementation
}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tools/server/server-context.cpp (13 hunks)
- tools/server/server-context.h (1 hunks)
- tools/server/server.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/server/server.cpp
🧠 Learnings
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
tools/server/server-context.h
📚 Learning: 2025-03-13T22:57:04.482Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-13T22:57:04.482Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
Applied to files:
tools/server/server-context.h
🔇 Additional comments (8)
tools/server/server-context.h (1)
35-36: LGTM: Comment accurately reflects new sleeping behavior.
The updated documentation correctly indicates that `get_llama_context()` may return nullptr during sleep transitions, which aligns with the sleeping state management introduced in this PR.
tools/server/server-context.cpp (7)
547-549: Good: JSON response caching improves performance.
Caching `json_server_props` avoids re-computing server properties on every request, which is a sensible optimization for read-heavy endpoints like `/props`.
754-754: Verify intended behavior: slots.clear() on resume.
Line 754 calls `slots.clear()` even when resuming from sleep (`is_resume=true`). This erases all slot state including cached prompts and context.
Is this intentional? If the goal is to preserve slot state across sleep/wake cycles, this clear operation defeats that purpose. If the goal is to start fresh on wake, then this is correct.
Please clarify the intended behavior and add a comment explaining the decision.
876-926: LGTM: JSON response population logic is sound.
The `populate_json_responses()` method correctly builds cached server properties. The JSON parsing includes proper error handling (lines 881-886).
Note: This is only called during initial load, not on resume from sleep. Verify this is acceptable given that server properties shouldn't change between sleep cycles.
2760-2762: LGTM: start_loop() correctly passes sleep configuration.
The sleep idle timeout is properly passed from `params_base` to the queue's `start_loop()` method.
2788-2808: LGTM: bypass_sleep mechanism enables health checks during sleep.
The `bypass_sleep` parameter correctly allows certain endpoints (health, props) to respond immediately even when the server is sleeping, while blocking other requests via `wait_until_no_sleep()`.
The fast-path optimization on Line 2795 (when sleep is disabled) is a nice touch.
3029-3034: LGTM: Health endpoint correctly bypasses sleep.
Using `bypass_sleep=true` ensures health checks return immediately even during sleep transitions, which is the expected behavior for health monitoring.
3212-3218: LGTM: Props endpoint exposes sleeping state correctly.
The endpoint uses `bypass_sleep=true` and includes `is_sleeping` in the response, allowing clients to monitor the server's sleep state without blocking.
```cpp
~server_context_impl() {
    if (!sleeping) {
        destroy();
    }
}
```
Potential resource leak if destroyed while sleeping.
When the server is in sleeping state, the destructor skips calling destroy(). However, destroy() also cleans up resources beyond llama_init, including:
- `mctx` (multimodal context)
- Draft contexts and speculative decoding state in slots (`ctx_dft`, `spec`)
- `batch` memory
These resources will leak if the server is destroyed while sleeping.
🔎 Proposed fix
```diff
 ~server_context_impl() {
-    if (!sleeping) {
-        destroy();
-    }
+    destroy();
 }
```
And update destroy() to be idempotent:
```diff
 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;
     mtmd_free(mctx);
     mctx = nullptr;
     // ... rest of cleanup
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
~server_context_impl() {
    destroy();
}
```
🤖 Prompt for AI Agents
In tools/server/server-context.cpp around lines 561 to 565, the destructor
currently skips calling destroy() when sleeping which allows resources (mctx,
per-slot ctx_dft and spec, batch memory, etc.) to leak; always invoke destroy()
from the destructor and make destroy() idempotent: have destroy() check and
early-return if already destroyed, free/release mctx, iterate slots to free
ctx_dft and spec and clear any speculative state, free batch memory, stop/join
any background threads, null out pointers and reset flags after freeing, and
guard all frees with null checks to prevent double-free.
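The pattern the prompt describes can be illustrated with a self-contained toy example (a single guarded resource standing in for `mctx`, per-slot contexts, and batches; this is not the server code):
```cpp
#include <cstdio>

// Toy illustration of the idempotent-destroy pattern recommended above:
// the free is guarded and the pointer is nulled, so a second destroy()
// call (e.g. the destructor after a sleep transition) is a harmless no-op.
struct resource_holder {
    int * batch = nullptr;

    void destroy() {
        if (!batch) {
            return; // already destroyed
        }
        delete batch;
        batch = nullptr;
    }

    ~resource_holder() {
        destroy(); // safe even if destroy() already ran
    }
};

int main() {
    resource_holder r;
    r.batch = new int(42);
    r.destroy(); // entering sleep: free resources
    r.destroy(); // second call: guarded, no double-free
    return 0;    // destructor runs destroy() a third time, still safe
}
```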
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)
569-589: Make `destroy()` fully idempotent.
The method should be safe to call multiple times to support the destructor fix (always calling `destroy()`). Add null checks before freeing resources:
- Check `mctx != nullptr` before calling `mtmd_free(mctx)` (line 574)
- Check `slot.ctx_dft != nullptr` before calling `llama_free(slot.ctx_dft)` (line 579)
- Check `slot.spec != nullptr` before calling `common_speculative_free(slot.spec)` (line 582)
- Consider adding an early return if `llama_init` is already null to avoid redundant cleanup
This makes the method safe to call from both the destructor and `handle_sleeping_state()`.
♻️ Duplicate comments (2)
tools/server/server-context.cpp (2)
561-567: Resource leak risk when destroyed while sleeping.
As previously noted, skipping `destroy()` when `sleeping=true` can leak resources like `mctx`, per-slot `ctx_dft` and `spec`, and `batch` memory. The recommended fix is to always call `destroy()` and make it idempotent.
591-603: Critical thread-safety issue confirmed (per PR discussion).
As noted in the PR comments, this callback is invoked while holding `mutex_tasks` in `server-queue.cpp`, blocking all queue operations during expensive `destroy()` and `load_model()` calls.
Additionally, `GGML_ABORT` on Line 599 terminates the process. Consider whether graceful degradation or retry logic would be more appropriate.
🧹 Nitpick comments (2)
tools/server/server-context.cpp (2)
610-621: Potential double-registration of queue callbacks.
If `load_model()` is called multiple times when `sleeping=false`, the queue callbacks will be registered multiple times, causing each callback to fire multiple times per event. Consider adding a guard or assertion to ensure this block only executes once:
```cpp
if (!is_resume) {
    static bool callbacks_registered = false;
    GGML_ASSERT(!callbacks_registered && "callbacks already registered");
    callbacks_registered = true;
    // ... register callbacks ...
}
```
Or ensure at the architectural level that `load_model()` is only called once during initial startup and once per wake from sleep.
3428-3430: Consider implementing bypass for get_models endpoint.
For consistency with `get_props` and `get_health`, the `/models` endpoint should also bypass sleep mode, as it returns mostly static metadata that doesn't require a loaded model.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tools/server/server-context.cpp (13 hunks)
🧠 Learnings
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
Applied to files:
tools/server/server-context.cpp
🔇 Additional comments (11)
tools/server/server-context.cpp (11)
547-549: LGTM - Caching server properties for performance.
The cached `json_server_props` avoids rebuilding server metadata on every request, which is a good optimization for the `/props` endpoint.
559-559: LGTM - State tracking for sleep mode.
The `sleeping` flag cleanly tracks whether the server is in sleep mode.
694-696: LGTM - Avoiding redundant logging setup on resume.
Skipping `mtmd_helper_log_set()` on resume is correct since logging configuration is global and doesn't need to be reapplied.
755-755: LGTM - Clearing slots for fresh initialization.
The `slots.clear()` is necessary to remove old slot objects before repopulating. This works correctly with `destroy()`, which frees per-slot resources before this line executes.
772-772: LGTM - Proper error handling for slot initialization failures.
Returning `false` when draft context or speculator creation fails correctly signals `load_model()` failure.
Also applies to: 778-778
812-814: Verify prompt_cache validity after sleep/wake cycle.
The early return skips prompt_cache recreation. However, `destroy()` frees the context but doesn't clear `prompt_cache`, leaving it with stale state referencing the old context. When the model is reloaded on wake, the cache may contain invalid data.
Consider clearing `prompt_cache` in `destroy()` to avoid stale state:
```cpp
void destroy() {
    // ... existing cleanup ...
    prompt_cache.reset(); // Clear stale cache
}
```
Or recreate the cache on resume before the early return.
869-875: LGTM - Proper initialization and error handling.
The call to `populate_json_responses()` correctly builds cached server properties, with appropriate error handling.
877-927: LGTM - Comprehensive server properties builder.
The method correctly builds and caches server metadata, with appropriate error handling for JSON parsing failures.
762-763: LGTM - Enabling idle sleep with correct time conversion.
The sleep idle timeout is correctly passed in milliseconds to `start_loop()`.
2790-2809: LGTM - Clean bypass mechanism for sleep mode.
The `bypass_sleep` parameter elegantly allows certain endpoints (health, props) to respond without waiting for wake, while most endpoints block until the server is ready.
3032-3032: LGTM - Appropriate bypass usage for health and props.
Both endpoints correctly use `bypass_sleep=true`:
- Health checks should always respond quickly for monitoring
- Props endpoint reports the current `is_sleeping` state by merging it with cached properties
Also applies to: 3214-3218
Actionable comments posted: 0
🧹 Nitpick comments (2)
tools/server/server-context.cpp (2)
597-600: Consider graceful degradation on model reload failure.
Using `GGML_ABORT` when model reload fails provides a clear failure mode, but it's a harsh response to potentially transient issues (e.g., temporary memory pressure).
Given the author's preference for simplicity (per learnings), this is acceptable for the initial implementation, but consider whether returning to a "permanently sleeping" state with an error response might be more user-friendly in the future.
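For illustration, a hedged sketch of what a "permanently sleeping" error state could look like (names and shape are hypothetical, not the server's API):
```cpp
#include <cstdio>

// Hypothetical alternative to aborting: record the failure and keep serving
// error responses until the operator restarts the server.
enum class server_state { running, sleeping, failed };

server_state handle_wake(bool reload_ok) {
    if (!reload_ok) {
        fprintf(stderr, "failed to reload model after sleeping; serving errors until restart\n");
        return server_state::failed; // requests get 5xx instead of a dead process
    }
    return server_state::running;
}
```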
3425-3427: Note on TODO for get_models bypass.
The TODO is valid, but bypassing sleep for `/models` would require caching the model metadata (similar to how `json_server_props` is cached) since the model APIs used in `model_meta()` require the model to be loaded. Consider caching model metadata in `populate_json_responses()` if this endpoint should be accessible during sleep.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tools/server/server-context.cpp (13 hunks)
🔇 Additional comments (9)
tools/server/server-context.cpp (9)
547-549: LGTM! The cached JSON properties member is appropriately initialized and provides a clean way to avoid rebuilding the response on each request.
561-567: The destructor logic is intentional but verify edge cases.
The comment clarifies that `destroy()` was already called when entering sleep. However, there's a subtle edge case: if the server is destroyed during the wake transition (after `sleeping` is set to `false` in `handle_sleeping_state` but before `load_model` completes), `destroy()` will be called on a partially-initialized state.
Since `destroy()` already handles null pointers gracefully (e.g., `llama_init.reset()` on nullptr is safe), this should be fine in practice.
755-755: Important fix: clearing slots before reinitializing.
The `slots.clear()` ensures that slots from a previous session are properly discarded before creating new ones during resume. This prevents slot accumulation across sleep/wake cycles.
874-924: LGTM! Clean extraction of JSON response population into a dedicated function with proper error handling for webui config parsing.
2787-2797: LGTM! The `bypass_sleep` parameter provides a clean mechanism for endpoints that don't require the model (like `/health` and `/props`) to respond immediately without waiting for model reload. The fast-path optimization when sleep is disabled avoids unnecessary synchronization overhead.
3027-3032: LGTM! Health endpoint correctly bypasses sleep wait, allowing health checks to respond immediately. The sleeping status is appropriately exposed via the `/props` endpoint instead.
3210-3216: LGTM! The `/props` endpoint correctly:
- Bypasses sleep wait to remain responsive during sleep
- Returns the cached properties plus the current `is_sleeping` status
- Creates a copy before modifying to avoid mutating the cached object
770-779: LGTM! Proper error handling for speculative decoding initialization - if draft context or speculator fails to initialize, the function correctly returns `false` to signal the failure.
2759-2760: No action needed. The queue implementation already properly handles negative `idle_sleep_ms` values—when negative, the `should_sleep()` lambda returns `false`, disabling the sleeping procedure as documented. The concern about unexpected behavior with negative timeouts is unfounded.
Likely an incorrect or invalid review comment.
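A minimal sketch of that idle check (the `should_sleep()` name appears in the queue code per this comment; the standalone function below is an illustration, with timestamps that would come from `ggml_time_ms()` in the real loop):
```cpp
#include <cstdint>

// Illustration of the disabled case: with idle_sleep_ms < 0 the predicate
// can never be true, so the sleeping procedure never triggers.
bool should_sleep(int64_t idle_sleep_ms, int64_t now_ms, int64_t time_last_task) {
    return idle_sleep_ms > 0 && (now_ms - time_last_task) >= idle_sleep_ms;
}
```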
Actionable comments posted: 0
🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)
569-589: Consider adding idempotency check for defensive programming.
The `destroy()` method is called from two controlled locations (destructor and `handle_sleeping_state`), but adding an early-return idempotency check would make it more robust against future changes or accidental double-calls.
🔎 Optional enhancement
```diff
 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;
     mtmd_free(mctx);
     mctx = nullptr;
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tools/server/README.md (1 hunks)
- tools/server/server-context.cpp (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/server/README.md
🔇 Additional comments (3)
tools/server/server-context.cpp (3)
874-936: LGTM - Well-structured JSON response caching.
The `populate_json_responses()` method cleanly separates the JSON response building logic and is correctly called after model initialization on both fresh load and resume from sleep. The error handling in the caller (lines 866-872) properly validates the result.
2788-2798: LGTM - Bypass sleep logic is sound.
The `server_res_generator` constructor correctly implements the bypass logic:
- Read-only endpoints (health, props, models) pass `bypass_sleep=true` and use cached JSON data
- Endpoints requiring an active model wait via `wait_until_no_sleep()`, which triggers wake-up when needed
- The fast-path check for disabled sleeping (`sleep_idle_seconds < 0`) avoids unnecessary overhead
607-622: LGTM - Resume logic correctly preserves queue callbacks.
The `is_resume` flag appropriately skips re-wiring queue callbacks when waking from sleep, since the lambda captures of `this` remain valid across the sleep/wake cycle. This avoids creating duplicate callbacks while ensuring the sleeping state handler remains functional.
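A small sketch of the resume-aware wiring this comment describes (the `queue_like` type and `wire_queue` helper are hypothetical stand-ins, not the server's types):
```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for the queue: a single sleep-state callback slot.
struct queue_like {
    std::function<void(bool)> on_sleeping_state;
};

// Register callbacks only on the first load; on wake (is_resume == true)
// the registrations from the initial load_model() are still valid because
// their lambdas capture a `this` pointer that outlives the sleep cycle.
void wire_queue(queue_like & q, bool is_resume, std::function<void(bool)> handler) {
    if (is_resume) {
        return; // skip re-wiring; avoids duplicate callbacks
    }
    q.on_sleeping_state = std::move(handler);
}
```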
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tools/server/README.md (1 hunks)
- tools/server/server-context.cpp (20 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/server/README.md
```cpp
// cached responses for HTTP API (read-only from HTTP threads)
json json_server_props = json::object();
json json_server_model_meta = json::object();
```
Race condition on JSON response cache access.
The json_server_props and json_server_model_meta members are written by populate_json_responses() (called from load_model() at Line 866) but read directly by HTTP threads in route handlers (Lines 3217, 3445) without synchronization. When the server wakes from sleep, handle_sleeping_state() calls load_model(), which modifies these JSON objects concurrently with HTTP handlers reading them.
C++ does not provide atomic guarantees for complex types like nlohmann::json, so concurrent read/write access creates a data race.
🔎 Proposed fix
Wrap access to these cached JSON objects with a mutex:
```diff
+    std::mutex mutex_json_cache;
+
     // cached responses for HTTP API (read-only from HTTP threads)
     json json_server_props = json::object();
     json json_server_model_meta = json::object();
```
Then protect writes in populate_json_responses():
```diff
 bool populate_json_responses() {
     // ... build json_webui_settings, default_generation_settings_for_props, etc. ...
+    std::lock_guard<std::mutex> lock(mutex_json_cache);
     json_server_props = { /* ... */ };
     json_server_model_meta = { /* ... */ };
     return true;
 }
```
And protect reads in route handlers:
```diff
 this->get_props = [this](const server_http_req &) {
     auto res = std::make_unique<server_res_generator>(ctx_server, true);
+    json props;
+    {
+        std::lock_guard<std::mutex> lock(ctx_server.mutex_json_cache);
+        props = ctx_server.json_server_props;
+    }
-    auto props = ctx_server.json_server_props;
     props["is_sleeping"] = ctx_server.queue_tasks.is_sleeping();
     res->ok(props);
     return res;
 };
```
Apply similar protection for get_models and any other readers.
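As a self-contained analogue of this locking scheme, a copy-on-read cache like the sketch below gives HTTP threads a snapshot they can mutate freely (a string stands in for the JSON payload; this is an illustration, not the proposed patch):
```cpp
#include <mutex>
#include <string>

// Copy-on-read cache: the wake path refreshes the value under the lock,
// while readers take a snapshot and release the lock before using it.
class cached_response {
  public:
    void set(std::string v) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = std::move(v);
    }
    std::string get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_; // copy; safe to modify after the lock is released
    }
  private:
    mutable std::mutex mutex_;
    std::string value_;
};
```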
🤖 Prompt for AI Agents
In tools/server/server-context.cpp around lines 547-549, the json_server_props
and json_server_model_meta objects are written without synchronization while
HTTP threads read them, causing a data race; to fix, add a std::mutex member
(e.g., mutex_json_cache) to the server context class, take a
std::lock_guard<std::mutex> when assigning/updating json_server_props and
json_server_model_meta inside populate_json_responses() (and any other writers
such as load_model()), and take a lock in every HTTP route handler that reads
these caches (get_props, get_models, etc.), copying the json into a local
variable while holding the lock and then release the lock before mutating the
local copy or returning it; ensure all reader/writer sites are updated to use
this mutex to eliminate concurrent read/write access.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)
569-589: Make `destroy()` idempotent to support unconditional destructor calls.
To resolve the resource-leak issue in the destructor (lines 561-567), `destroy()` must become idempotent. Add an early-return check at the start:
```cpp
void destroy() {
    if (!llama_init && !mctx) {
        return; // already destroyed
    }
    // ... rest of cleanup
}
```
Additionally, guard individual resource frees with null-checks to prevent double-free if `destroy()` is invoked multiple times.
🔎 Proposed fix
```diff
 void destroy() {
+    if (!llama_init && !mctx) {
+        return; // already destroyed
+    }
+
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;
-    mtmd_free(mctx);
-    mctx = nullptr;
+    if (mctx) {
+        mtmd_free(mctx);
+        mctx = nullptr;
+    }

     // Clear any sampling context
     for (server_slot & slot : slots) {
-        llama_free(slot.ctx_dft);
-        slot.ctx_dft = nullptr;
+        if (slot.ctx_dft) {
+            llama_free(slot.ctx_dft);
+            slot.ctx_dft = nullptr;
+        }

-        common_speculative_free(slot.spec);
-        slot.spec = nullptr;
+        if (slot.spec) {
+            common_speculative_free(slot.spec);
+            slot.spec = nullptr;
+        }

         llama_batch_free(slot.batch_spec);
     }

     llama_batch_free(batch);
 }
```
♻️ Duplicate comments (2)
tools/server/server-context.cpp (2)
547-549: Race condition on JSON cache access remains unresolved.
As previously flagged, `json_server_props` and `json_server_model_meta` are accessed by HTTP threads without synchronization. This review confirms:
- Writers: `populate_json_responses()` (lines 877-939) assigns these objects, called from `load_model()` during wake-up
- Readers: Multiple HTTP endpoints (e.g., `get_props` at line 3220, `get_models` at line 3448) read directly without locks
- Aggravating factor: `get_props` uses `bypass_sleep=true` (line 3219), allowing reads during model reload when writes are occurring
The previously suggested fix (adding a `mutex_json_cache` and protecting all read/write sites) should be implemented before merging.
Also applies to: 877-939, 3218-3224
561-567: Resource leak when destroyed during sleep (previously flagged).
The destructor still skips `destroy()` when `sleeping=true`, leaving `mctx`, per-slot `ctx_dft`/`spec`/`batch_spec`, and the main `batch` un-freed. The previously suggested fix—always invoke `destroy()` and make it idempotent by checking and early-returning if resources are already freed—should be implemented.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tools/server/server-context.cpp (20 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: macOS-latest-cmake-tvos
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-build (ADDRESS, RelWithDebInfo)
- GitHub Check: WebUI Checks
- GitHub Check: server-build (Release)
- GitHub Check: server-windows
- GitHub Check: pyright type-check
Actionable comments posted: 4
♻️ Duplicate comments (1)
tools/server/server-context.cpp (1)
547-549: Critical race condition on JSON cache remains unaddressed (duplicate from previous review).

The previous review identified a data race where `json_server_props` and `json_server_model_meta` are written by `populate_json_responses()` (called from `load_model()` at line 874) but read directly by HTTP threads (lines 3225, 3453) without synchronization. This creates undefined behavior when the server wakes from sleep. The comment "read-only from HTTP threads" is misleading: these fields are writable during model reload.

🔎 Recommended fix

Add a mutex to protect concurrent access:

```diff
+    std::mutex mutex_json_cache;
+
     // cached responses for HTTP API (read-only from HTTP threads)
     json json_server_props      = json::object();
     json json_server_model_meta = json::object();
```

Protect writes in `populate_json_responses()`:

```diff
 bool populate_json_responses() {
     // ... build json objects ...

+    std::lock_guard<std::mutex> lock(mutex_json_cache);
     json_server_props = { /* ... */ };
     json_server_model_meta = { /* ... */ };
     return true;
 }
```

Protect reads in route handlers (lines 3225, 3453):

```cpp
json props;
{
    std::lock_guard<std::mutex> lock(ctx_server.mutex_json_cache);
    props = ctx_server.json_server_props;
}
props["is_sleeping"] = ctx_server.queue_tasks.is_sleeping();
```

Based on learnings, this fix is necessary before merging.
🧹 Nitpick comments (2)
tools/server/README-dev.md (1)
110-111: Use markdown link syntax for bare URLs.

The newly added PR references follow the existing formatting pattern in the file, but all URLs in this section (lines 100–111) are bare URLs. Consider wrapping them in proper markdown link syntax (`[text](url)`) for better markup consistency and structure, as flagged by markdownlint. This is an optional refactor that could improve the entire "Notable Related PRs" section uniformly.

🔎 Proposed improvement for lines 110–111 (alternative: apply to all lines 100–111)

```diff
- INI presets: https://github.com/ggml-org/llama.cpp/pull/17859 (+ refactoring: https://github.com/ggml-org/llama.cpp/pull/18169)
- Sleeping mode: https://github.com/ggml-org/llama.cpp/pull/18228
+ INI presets: [PR 17859](https://github.com/ggml-org/llama.cpp/pull/17859) (+ refactoring: [PR 18169](https://github.com/ggml-org/llama.cpp/pull/18169))
+ Sleeping mode: [PR 18228](https://github.com/ggml-org/llama.cpp/pull/18228)
```

tools/server/server-context.cpp (1)
569-589: Consider making `destroy()` idempotent for defensive programming.

While the current design prevents double-calls through the `sleeping` flag, adding early-return guards would make `destroy()` safer and more maintainable:

```cpp
void destroy() {
    if (!llama_init && !mctx && slots.empty()) {
        return; // already destroyed
    }

    llama_init.reset();
    ctx = nullptr;
    model = nullptr;

    mtmd_free(mctx);
    mctx = nullptr;

    // ... rest of cleanup
}
```

This prevents potential issues if the calling logic changes in the future.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
`tools/server/README-dev.md` (1 hunks)
`tools/server/server-context.cpp` (19 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120-column limit, braces on the same line for functions, pointer/reference alignment in the middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/server/server-context.cpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
`tools/server/README-dev.md`
`tools/server/server-context.cpp`
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
tools/server/README-dev.md
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:57:04.482Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-13T22:57:04.482Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
Applied to files:
tools/server/server-context.cpp
🪛 GitHub Actions: EditorConfig Checker
tools/server/server-context.cpp
[error] 854-854: Trailing whitespace detected by editorconfig-checker at line 854. Step: editorconfig-checker.
🪛 markdownlint-cli2 (0.18.1)
tools/server/README-dev.md
110-110: Bare URL used
(MD034, no-bare-urls)
110-110: Bare URL used
(MD034, no-bare-urls)
111-111: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: server-build (ADDRESS, RelWithDebInfo)
- GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-build (Release)
- GitHub Check: WebUI Checks
- GitHub Check: pyright type-check
- GitHub Check: server-windows
🔇 Additional comments (2)
tools/server/server-context.cpp (2)
3037-3039: Good documentation of the sleeping-aware response pattern.

The comment clearly explains why all routes must use `server_res_generator`, ensuring consistent handling of the sleeping state across all endpoints.

3040-3045: Correct use of `bypass_sleep` for the health endpoint.

Health checks properly bypass the sleep wait, ensuring monitoring systems can query server status without blocking. This is the right design for operational health checks.
Actionable comments posted: 1
♻️ Duplicate comments (5)
tools/server/server-context.cpp (5)
547-549: Data race on cached JSON objects remains unaddressed.

The `json_server_props` and `json_server_model_meta` members are written by `populate_json_responses()` (called during model reload on wake from sleep) while HTTP threads read them concurrently without synchronization. This creates undefined behavior.

561-567: Destructor skips cleanup when sleeping, risking a resource leak.

When `sleeping` is true, the destructor skips `destroy()`, but `mctx`, the per-slot `ctx_dft`/`spec`, and `batch` may still hold resources that need cleanup. The `destroy()` method should be made idempotent and always called.

591-603: Thread-safety issue: callback invoked while holding mutex (duplicate).

As noted in previous reviews, `handle_sleeping_state()` is called from `server-queue.cpp` while `mutex_tasks` is held. Both `destroy()` and `load_model()` are expensive operations that can block for seconds or minutes, causing all queue operations to stall.
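To make the flagged shape concrete, here is a minimal stand-alone sketch of the pattern being described. The names mirror the review text, but the bodies are placeholders, not the actual queue code; per the maintainer learning recorded later in this thread, this blocking is intentional rather than a defect.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// stand-in for the server queue; only the locking shape matters here
struct queue_sketch {
    std::mutex mutex_tasks;
    std::function<void(bool)> callback_sleeping_state;

    void enter_sleep() {
        std::lock_guard<std::mutex> lock(mutex_tasks);
        // the callback runs destroy()/load_model(), which can take
        // seconds to minutes; every post()/defer()/wait_until_no_sleep()
        // that also locks mutex_tasks stalls until it returns
        callback_sleeping_state(true);
    }

    void post_task() {
        std::lock_guard<std::mutex> lock(mutex_tasks); // blocks during the transition
        // ... enqueue task ...
    }
};

int main() {
    queue_sketch q;
    q.callback_sleeping_state = [](bool sleeping) {
        // placeholder for model unload/reload
        std::this_thread::sleep_for(std::chrono::milliseconds(sleeping ? 50 : 0));
    };
    std::thread t([&] { q.enter_sleep(); });
    q.post_task(); // may wait for the whole sleep transition to finish
    t.join();
}
```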
3223-3228: Unsynchronized read of `json_server_props` (duplicate).

While `bypass_sleep=true` is correct for the props endpoint, reading `ctx_server.json_server_props` without synchronization creates a data race with `populate_json_responses()` writes during model reload.

3447-3459: TODO remains unresolved; race condition persists (duplicate).

The TODO at lines 3447-3448 acknowledges the endpoint is unsafe during model reload. While the comment at lines 3457-3459 warns against accessing dynamic model info, `json_server_model_meta` is still read without synchronization, and the `ctx_server.mctx` access at line 3455 may observe inconsistent state during sleep transitions.
🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)
559-559: Consider atomic or mutex protection for the `sleeping` flag.

The `sleeping` flag is accessed from `handle_sleeping_state()` (which may be called from the queue thread), `load_model()`, and potentially inspected via `queue_tasks.is_sleeping()` from HTTP threads. While current usage appears safe due to sequencing, using `std::atomic<bool>` would provide explicit thread-safety guarantees.

🔎 Suggested change

```diff
-    bool sleeping = false;
+    std::atomic<bool> sleeping{false};
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`tools/server/server-context.cpp` (19 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120-column limit, braces on the same line for functions, pointer/reference alignment in the middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/server/server-context.cpp
🧠 Learnings (5)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:57:04.482Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-13T22:57:04.482Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
Applied to files:
tools/server/server-context.cpp
🧬 Code graph analysis (1)
tools/server/server-context.cpp (1)
tools/server/server-common.cpp (4)
`oaicompat_chat_params_parse` (844-1083)
`oaicompat_chat_params_parse` (844-847)
`convert_anthropic_to_oai` (1085-1299)
`convert_anthropic_to_oai` (1085-1085)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: android-build
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: server-windows
🔇 Additional comments (9)
tools/server/server-context.cpp (9)
742-742: LGTM: slot clearing before re-initialization.

Clearing `slots` before the loop ensures a clean slate on both fresh load and resume, preventing stale slot state from accumulating.

855-880: Well-structured initialization separation.

The separation of `init()` for one-time setup (queue wiring, metrics) from `load_model()` for repeatable model loading is a good design choice that cleanly supports the sleep/wake lifecycle.

882-944: `populate_json_responses()` correctly builds cached responses.

The function properly extracts model metadata and server properties. However, as noted in prior comments, writes to `json_server_props` and `json_server_model_meta` require synchronization since HTTP threads read them concurrently.

2767-2770: LGTM: sleep idle timeout integration.

The idle timeout is correctly converted from seconds to milliseconds and passed to the queue's main loop.

2796-2806: Good design for sleep-aware response generation.

The `bypass_sleep` parameter allows read-only endpoints (health, props) to respond immediately without waiting for model load, while inference endpoints correctly block via `wait_until_no_sleep()`. The fast-path optimization when sleeping is disabled (`sleep_idle_seconds < 0`) avoids unnecessary synchronization overhead.

2823-2834: Clean refactor to support sleep-aware generation.

Passing the pre-constructed `server_res_generator` as the first parameter allows callers to control the `bypass_sleep` behavior before entering this function. The move semantics are used correctly.

3040-3045: Correct: health endpoint bypasses sleep.

Health checks should always respond immediately regardless of sleep state, which is correctly implemented with `bypass_sleep=true`.

3354-3366: Consistent refactoring across completion endpoints.

The pattern of constructing `server_res_generator` first (which handles sleep/wake synchronization) and then passing it to `handle_completions_impl` is applied consistently across all completion-related endpoints.

3036-3039: Good documentation of the sleep-aware pattern.

The comment clearly explains the requirement that all route handlers must start with `std::make_unique<server_res_generator>`, ensuring consistent sleep handling across endpoints.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tools/server/server-context.cpp (3)
569-589: Consider making `destroy()` fully idempotent.

Currently, `destroy()` relies on the `sleeping` flag in the destructor to prevent double-free. While this works for the current design, making `destroy()` itself idempotent would improve robustness.

🔎 Suggested approach

Add an early-return guard at the start of `destroy()`:

```diff
 void destroy() {
+    if (!llama_init && !mctx && batch.n_tokens == 0) {
+        return; // already destroyed
+    }
+
     llama_init.reset();
     ctx = nullptr;
     model = nullptr;

     mtmd_free(mctx);
     mctx = nullptr;

     // ... rest of cleanup
 }
```

This allows `destroy()` to be safely called multiple times without relying on external state checks.
598-600: Consider more graceful error handling than `GGML_ABORT`.

If `load_model()` fails during wake (e.g., model file moved/deleted, OOM), `GGML_ABORT` will crash the entire server. While the server is in a difficult-to-recover state, consider alternatives:

- Stay in the sleeping state and log the error, allowing a retry on the next request
- Set an error state that `/health` can report
- Gracefully terminate the server process with an exit code

However, given the intentional blocking during the state transition (per learnings), recovery complexity is high. If you decide to keep `GGML_ABORT`, consider adding a detailed error message explaining why the server cannot continue.

🔎 Suggested approach

```diff
 } else {
     SRV_INF("%s", "server is exiting sleeping state\n");
     if (!load_model(params_base)) {
-        GGML_ABORT("failed to reload model after sleeping");
+        SRV_ERR("Critical: failed to reload model after sleeping. Model path: %s\n",
+                params_base.model.path.c_str());
+        // Stay in sleeping state; next request will retry
+        return;
     }
 }
 sleeping = new_state;
```
3447-3453: Consider adding `bypass_sleep=true` to resolve the TODO.

The TODO at lines 3447-3448 can be addressed by adding `bypass_sleep=true` to the `server_res_generator` constructor, similar to the `/health` and `/props` endpoints. Since `json_server_model_meta` is populated once during initialization and represents static model metadata (vocab size, context length, etc.), it's safe to read during the sleep state.

🔎 Proposed fix

```diff
 // TODO: this endpoint is unsafe to access during model reloading (i.e. wake up from sleeping)
 // how to make it work even during load_model()?
 this->get_models = [this](const server_http_req &) {
-    auto res = std::make_unique<server_res_generator>(ctx_server);
+    auto res = std::make_unique<server_res_generator>(ctx_server, true);
     json model_meta = nullptr;
     if (is_ready()) {
         model_meta = ctx_server.json_server_model_meta;
     }
```

This allows clients to query model information immediately, even when the server is sleeping, similar to how `/props` works.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
`tools/server/server-context.cpp` (19 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120-column limit, braces on the same line for functions, pointer/reference alignment in the middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/server/server-context.cpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:47.417Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:56:06.114Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/speculative-simple/speculative-simple.cpp:172-172
Timestamp: 2025-03-13T22:56:06.114Z
Learning: For the speculative-simple example in llama.cpp, ngxson prefers deferring error handling of decode functions to maintain code simplicity in the initial implementation.
Applied to files:
tools/server/server-context.cpp
📚 Learning: 2025-03-13T22:57:04.482Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/imatrix/imatrix.cpp:500-500
Timestamp: 2025-03-13T22:57:04.482Z
Learning: The function llama_batch_ext_init never returns nullptr in the llama.cpp codebase, so null checks after calling this function are unnecessary.
Applied to files:
tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: ggml-ci-x64-cpu-high-perf
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: windows-latest-cmake-sycl
- GitHub Check: ubuntu-22-cmake-musa
- GitHub Check: ubuntu-22-cmake-sycl
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: server-build (Release)
- GitHub Check: WebUI Checks
- GitHub Check: server-build (UNDEFINED, RelWithDebInfo)
- GitHub Check: server-build (ADDRESS, RelWithDebInfo)
- GitHub Check: server-windows
- GitHub Check: pyright type-check
🔇 Additional comments (9)
tools/server/server-context.cpp (9)
561-567: Destructor logic is correct per the sleep lifecycle.

The conditional call to `destroy()` is appropriate: when `sleeping=true`, resources have already been freed on sleep entry (line 595), so skipping the call avoids a double-free. Based on learnings, the intentional blocking during sleep state transitions means the destructor will not be called mid-transition, eliminating races between the destructor and `handle_sleeping_state()`.
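In condensed form, the lifecycle this comment approves looks roughly like the following; this is an illustrative sketch, not the actual class:

```cpp
#include <cstdio>

// minimal stand-in for server_context's teardown logic
struct context_sketch {
    bool sleeping = false;

    void destroy() {
        // in the real code: free llama_init, mctx, per-slot resources, batch
        std::puts("resources freed");
    }

    ~context_sketch() {
        // when sleeping, destroy() already ran on sleep entry,
        // so calling it again here would double-free
        if (!sleeping) {
            destroy();
        }
    }
};

int main() {
    context_sketch awake;   // destructor frees resources on scope exit
    context_sketch asleep;
    asleep.destroy();       // freed when entering the sleeping state
    asleep.sleeping = true; // destructor now skips destroy()
}
```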
848-852: Early return on resume is correct by design.

The comment at line 855 clarifies that `init()` is "only called once during initialization." On resume from sleep (`is_resume=true`), skipping `init()` is intentional because:

- `populate_json_responses()` was already called during initial startup (server properties are static)
- Queue callbacks (`on_new_task`, `on_sleeping_state`) are already wired
- `metrics.init()` already set the start time

The model/context/slots/batch recreation before the early return ensures all necessary runtime state is restored.
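A rough sketch of the described control flow; `is_resume` and the helper names come from the review text, while everything else here is an assumed stand-in:

```cpp
#include <cstdio>

// illustrative only; the real load_model() creates the model,
// context, slots and batch before this point
bool load_model_sketch(bool is_resume) {
    std::puts("model/context/slots/batch recreated");

    if (is_resume) {
        return true; // skip one-time setup when waking from sleep
    }

    // first load only: wire queue callbacks, start metrics,
    // populate cached JSON responses (init() in the review text)
    std::puts("init(): callbacks wired, metrics started, JSON cached");
    return true;
}

int main() {
    load_model_sketch(false); // initial startup
    load_model_sketch(true);  // resume from sleep
}
```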
855-880: Well-structured separation of concerns.

The new `init()` method correctly separates one-time server initialization (queue wiring, metrics, JSON population) from model loading. The assertions ensure it's only called in the right state, and the conditional call at line 849 ensures it runs once during startup but not on resume. The sleeping-state callback wiring at lines 868-870 properly integrates the sleep lifecycle with the queue system.

882-944: JSON response caching is correctly implemented.

The `populate_json_responses()` method builds static metadata once during initialization. Since it's called from `init()` (line 874), which only runs on initial startup (not on resume), this ensures:
- JSON caches are populated before HTTP routes become accessible
- No race conditions between writers and HTTP readers
- Properties remain valid across sleep/wake cycles (they're static)
The structured approach with separate blocks for webui settings, server properties, and model metadata improves maintainability.
2768-2769: LGTM!

The `start_loop()` method correctly wires up the sleep idle timeout by passing `sleep_idle_seconds * 1000` (converting seconds to milliseconds) to the queue system. This integrates the idle-sleep feature with the task queue lifecycle.
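The wiring amounts to a one-line conversion. A sketch under the assumption that `start_loop()` takes the idle timeout in milliseconds (the stand-in below is not the real signature):

```cpp
#include <cstdint>
#include <cstdio>

// stand-in for the queue's main loop entry point
static void start_loop(int64_t idle_sleep_ms) {
    if (idle_sleep_ms < 0) {
        std::puts("idle sleep disabled");
    } else {
        std::printf("will sleep after %lld ms of inactivity\n",
                    (long long) idle_sleep_ms);
    }
}

int main() {
    // from params_base.sleep_idle_seconds; a negative value disables sleeping
    int sleep_idle_seconds = 300;
    start_loop((int64_t) sleep_idle_seconds * 1000); // seconds -> milliseconds
}
```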
2796-2806: Well-designed bypass mechanism for sleep-aware endpoints.

The `bypass_sleep` parameter allows selective endpoint access during the sleep state:

- Most endpoints block via `wait_until_no_sleep()` until the server wakes
- Critical endpoints (health, props) can bypass this wait with `bypass_sleep=true`
- A fast path skips synchronization entirely when sleep is disabled (`sleep_idle_seconds < 0`)

The blocking behavior in `wait_until_no_sleep()` is intentional per the design, ensuring request ordering and preventing race conditions during model reload.
3040-3044: Appropriate use of `bypass_sleep` for the health endpoint.

Using `bypass_sleep=true` is correct here: the `/health` endpoint should remain responsive regardless of sleep state, allowing monitoring systems to track server status. The simple implementation returns `{"status": "ok"}` immediately without touching model resources.

3223-3228: Correct implementation with appropriate `bypass_sleep` usage.

Using `bypass_sleep=true` allows clients to query server properties (including the `is_sleeping` status) regardless of sleep state. The read of `json_server_props` is safe because it's populated once during initialization, before the HTTP routes become accessible. The dynamic `is_sleeping` field provides real-time status information, which is valuable for clients to understand server state.
3036-3039: Consistent sleep-aware implementation across all route handlers.

The pattern is correctly applied across all endpoints:

- Inference endpoints (completions, chat, etc.) correctly use the default `bypass_sleep=false`, ensuring requests wait for the model to load before processing
- Status endpoints (`/health`, `/props`) use `bypass_sleep=true` to remain responsive during sleep
- The comment at lines 3037-3038 provides clear guidance for future endpoint implementations
- Moving `res` into `handle_completions_impl` ensures proper lifecycle management

This consistent approach ensures all endpoints participate correctly in the sleep/wake lifecycle while maintaining appropriate access controls.
Also applies to: 3354-3366, 3368-3380, 3382-3398, 3400-3416
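For reference, a new endpoint following this pattern would look roughly like the sketch below. The stub types replace the real `server_res_generator`/`server_http_req`, whose exact signatures are not shown in this diff:

```cpp
#include <memory>
#include <string>

// stand-ins for the real server types
struct http_req_stub {};

struct res_generator_stub {
    explicit res_generator_stub(bool bypass_sleep) {
        // the real type waits via wait_until_no_sleep() unless bypassing
        (void) bypass_sleep;
    }
    std::string body;
    void ok(std::string payload) { body = std::move(payload); }
};

// every handler starts by constructing the generator, so the request
// participates in sleep/wake synchronization before touching model state
std::unique_ptr<res_generator_stub> handle_example(const http_req_stub &) {
    auto res = std::make_unique<res_generator_stub>(/*bypass_sleep=*/false);
    res->ok("{\"status\":\"ok\"}");
    return res;
}

int main() {
    http_req_stub req;
    auto res = handle_example(req);
    return res->body.empty() ? 1 : 0;
}
```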
Make sure to read the contributing guidelines before submitting a PR
Summary by CodeRabbit
New Features
Documentation
Tests