MLX runtime support #1642
Conversation
Reviewer's Guide

Implements a new MLX runtime by wiring the Python mlx_lm package into CLI parsing, Model execution, and chat flows—enforcing Apple Silicon macOS constraints, building subprocess arguments, updating documentation and CI, and covering behavior with unit and system tests.

Sequence diagram for MLX runtime model execution (run/serve/chat)

```mermaid
sequenceDiagram
    actor User
    participant CLI as CLI
    participant Model as Model
    participant MLX as mlx_lm subprocess
    participant Chat as Chat
    User->>CLI: ramalama --runtime=mlx run ...
    CLI->>Model: parse args, detect runtime=mlx
    Model->>Model: validate_args (Apple Silicon/macOS check)
    Model->>Model: build MLX exec args
    Model->>MLX: launch mlx_lm server subprocess
    Model->>Chat: connect to server (with retries)
    Chat->>MLX: send chat/completions request
    MLX-->>Chat: stream response
    Chat-->>User: display response
```
Entity relationship diagram for runtime configuration options

```mermaid
erDiagram
    CONFIG ||--o{ CLI : provides
    CLI ||--o{ Model : passes args
    Model ||--o{ MLX : launches subprocess
    CLI {
        string runtime
        bool container
    }
    Model {
        string model_store_path
        ModelStore model_store
    }
    MLX {
        string model_path
        list exec_args
    }
```
Class diagram for Model MLX runtime integration

```mermaid
classDiagram
    class Model {
        - _model_store: ModelStore | None
        + run(args)
        + _start_server(args)
        + _connect_and_chat(args, server_pid)
        + _handle_container_chat(args, server_pid)
        + _handle_mlx_chat(args)
        + _is_server_ready(port)
        + _cleanup_server_process(pid)
        + _build_mlx_exec_args(subcommand, model_path, args, extra)
        + mlx_serve(args, exec_model_path)
        + build_exec_args_serve(args, exec_model_path, ...)
        + validate_args(args)
    }
    class Chat {
        + _make_request_data()
        + _req()
        + kills()
    }
    Model "1" -- "1" Chat : uses
    Model <|-- ModelMLX : MLX runtime logic
    class ModelMLX {
        <<MLX runtime logic>>
    }
```
Summary of Changes
Hello @kush-gupt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's capabilities by introducing native support for Apple's MLX framework, specifically optimizing LLM inference on Apple Silicon. It integrates the mlx_lm package into the core command-line interface and execution paths, ensuring efficient and platform-aware model operation. The changes also include documentation updates and comprehensive testing to guarantee reliability and proper usage.
Highlights
- New MLX Runtime Support: Introduced a new `--runtime mlx` option, enabling optimized inference for Large Language Models (LLMs) on Apple Silicon Macs by integrating the `mlx_lm` Python package.
- Automatic Container Handling: The MLX runtime automatically enforces `--nocontainer` mode, as `mlx_lm` runs directly on the host system. If `--container` is explicitly provided, a warning is issued, but `--nocontainer` is still applied.
- Direct `mlx_lm` Integration: The `run` command now directly invokes `mlx_lm` for both interactive chat and one-shot generation, while the `serve` command uses `mlx_lm server` for REST API endpoints. This bypasses the previous daemonized service approach for MLX.
- Platform and Constraint Validation: Added validation to ensure the MLX runtime is used only on macOS systems with Apple Silicon hardware; attempts to use MLX on unsupported platforms result in an error (see the sketch after this list).
- Comprehensive Documentation and Testing: Updated the README, man pages (`ramalama-serve.1.md`, `ramalama.conf.5.md`), and configuration file (`ramalama.conf`) to reflect the new MLX runtime. A new, extensive system test suite (`080-mlx.bats`) and unit tests (`test_model.py`) have been added to cover MLX functionality, argument parsing, and constraint enforcement.
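For orientation, the platform gate described above amounts to checking for macOS ("Darwin") on arm64. A minimal sketch follows; the function name and error wording are hypothetical, not the PR's actual code:

```python
import platform

def validate_mlx_platform() -> None:
    """Reject the MLX runtime anywhere other than macOS on Apple Silicon (illustrative sketch)."""
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        raise ValueError("The MLX runtime requires macOS on Apple Silicon (arm64).")
```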
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Hey @kush-gupt - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- Avoid using `shlex.quote(model_path)` when building the `exec_args` list—passing the raw path will prevent embedding literal quotes into the subprocess arguments.
- There's duplicated logic between `mlx_run_chat` and `_mlx_generate_response`; consider extracting shared subprocess setup and I/O handling into a common helper to reduce maintenance overhead.
- Add an upfront check in CLI initialization or `validate_args` to verify that the `mlx_lm` package is installed and error early with a clear message if it's missing, instead of relying on a failing subprocess call (see the sketch below).
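A minimal sketch of that upfront check; the helper name and error wording are assumptions, not code from the PR:

```python
import importlib.util

def ensure_mlx_lm_installed() -> None:
    """Fail fast with a clear message when the mlx_lm package is missing (illustrative sketch)."""
    if importlib.util.find_spec("mlx_lm") is None:
        raise ValueError(
            "The MLX runtime requires the 'mlx_lm' Python package; install it with: pip install mlx-lm"
        )
```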
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Avoid using `shlex.quote(model_path)` when building the `exec_args` list—passing the raw path will prevent embedding literal quotes into the subprocess arguments.
- There’s duplicated logic between `mlx_run_chat` and `_mlx_generate_response`; consider extracting shared subprocess setup and I/O handling into a common helper to reduce maintenance overhead.
- Add an upfront check in CLI initialization or `validate_args` to verify that the `mlx_lm` package is installed and error early with a clear message if it’s missing, instead of relying on a failing subprocess call.
## Individual Comments
### Comment 1
<location> `ramalama/model.py:437` </location>
<code_context>
+ if subcommand not in allowed_subcommands:
+ raise ValueError(f"Invalid subcommand: {subcommand}")
+
+ exec_args = [
+ "python",
+ "-m",
+ "mlx_lm",
+ subcommand,
+ "--model",
+ shlex.quote(model_path),
+ ]
+
</code_context>
<issue_to_address>
Quoting model_path with shlex.quote in exec_args may cause issues.
Passing model_path through shlex.quote adds literal quotes, which can break argument parsing. Pass model_path as a raw string in exec_args instead.
</issue_to_address>
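To illustrate the point (a minimal sketch, not code from the PR): with a list argv there is no shell involved, so shlex.quote only bakes literal quote characters into the argument the server receives.

```python
import shlex

model_path = "/tmp/my model.safetensors"  # hypothetical path containing a space

# shlex.quote targets strings that will pass through a shell; in a list argv
# the added quotes become part of the argument itself.
print(shlex.quote(model_path))  # -> '/tmp/my model.safetensors' (literal quotes included)

# Passing the raw path keeps the argument intact.
exec_args = ["python", "-m", "mlx_lm", "server", "--model", model_path]
print(exec_args)
```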
### Comment 2
<location> `ramalama/model.py:488` </location>
<code_context>
+
+ # For interactive mode, we need to capture the response
+ # Consume stderr concurrently to avoid deadlocks if its buffer fills.
+ def _drain_stderr(proc):
+ if proc.stderr is None:
+ return
+ while True:
+ chunk = proc.stderr.read(1024)
+ if chunk == "" and proc.poll() is not None:
+ break
+ if chunk and "EOFError" not in chunk:
+ sys.stderr.write(chunk)
+ sys.stderr.flush()
</code_context>
<issue_to_address>
Filtering out 'EOFError' from stderr may hide other relevant errors.
Filtering out 'EOFError' may suppress important context in stderr. Please reconsider whether this exclusion is necessary, or ensure all relevant errors are still visible for debugging.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if chunk and "EOFError" not in chunk:
sys.stderr.write(chunk)
sys.stderr.flush()
=======
if chunk:
sys.stderr.write(chunk)
sys.stderr.flush()
>>>>>>> REPLACE
</suggested_fix>
### Comment 3
<location> `test/system/080-mlx.bats:161` </location>
<code_context>
+ is "$output" ".*python.*-m.*mlx_lm.*chat.*" "should use MLX chat command"
+}
+
+@test "ramalama --runtime=mlx rejects --name option" {
+ skip_if_not_apple_silicon
+ skip_if_no_mlx
+
+ # --name requires container mode, which MLX doesn't support
+ run_ramalama 1 --runtime=mlx run --name test ${MODEL}
+ is "$output" ".*--nocontainer.*--name.*conflict.*" "should show conflict error"
+}
+
</code_context>
<issue_to_address>
Edge case: Add a test for MLX runtime with both --container and --nocontainer flags.
Consider adding a test that provides both --container and --nocontainer flags together to verify that the CLI handles this conflict properly and returns a clear error message.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
+@test "ramalama --runtime=mlx rejects --name option" {
+ skip_if_not_apple_silicon
+ skip_if_no_mlx
+
+ # --name requires container mode, which MLX doesn't support
+ run_ramalama 1 --runtime=mlx run --name test ${MODEL}
+ is "$output" ".*--nocontainer.*--name.*conflict.*" "should show conflict error"
+}
=======
+@test "ramalama --runtime=mlx rejects --name option" {
+ skip_if_not_apple_silicon
+ skip_if_no_mlx
+
+ # --name requires container mode, which MLX doesn't support
+ run_ramalama 1 --runtime=mlx run --name test ${MODEL}
+ is "$output" ".*--nocontainer.*--name.*conflict.*" "should show conflict error"
+}
+
+@test "ramalama --runtime=mlx rejects both --container and --nocontainer flags" {
+ skip_if_not_apple_silicon
+ skip_if_no_mlx
+
+ # Providing both --container and --nocontainer should result in a conflict error
+ run_ramalama 1 --runtime=mlx run --container --nocontainer ${MODEL}
+ is "$output" ".*(--container.*--nocontainer|--nocontainer.*--container).*conflict.*" "should show conflict error for both flags"
+}
>>>>>>> REPLACE
</suggested_fix>
## Security Issues
### Issue 1
<location> `ramalama/model.py:501` </location>
<issue_to_address>
**security (opengrep-rules.python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Code Review

This pull request successfully integrates MLX runtime support for Apple Silicon, enhancing performance across various ramalama commands. The changes are well-structured, covering documentation, CLI, core logic, and testing. Feedback includes correcting a misleading error message, adding a missing newline in a test file, and improving code clarity for a specific mlx_lm behavior.
Force-pushed from bee40f5 to 3f4b9b5 (compare)
ramalama/chat.py (Outdated)

@@ -69,7 +69,11 @@ def __init__(self, args):
        self.request_in_process = False
        self.prompt = args.prefix

        self.url = f"{args.url}/chat/completions"
        # MLX server uses /v1/chat/completions endpoint
        if getattr(args, "runtime", None) == "mlx":
What I don't understand here is that the URL should include /v1/ for both; ramalama chat adds v1 by default:

$ ramalama chat -h | grep -A1 "the url"
  --url URL    the url to send requests to (default:
               http://127.0.0.1:8080/v1)
Ah, this was an artifact I had from a previous bug, I can take it out
And that bug was the server implementation lol!
llama.cpp url does not seem to have v1 in it from what I see:
Default runtime: http://127.0.0.1:8080/chat/completions
MLX runtime: http://127.0.0.1:8080/v1/chat/completions
llama.cpp runtime: http://127.0.0.1:8080/chat/completions
I think we should use mlx_lm.server for everything, like we use llama-server for everything llama.cpp based these days. I worry about vibe coding sometimes; it feels like there are at least 200 more lines of code here than necessary (although I expect switching to mlx_lm.server will eliminate many of these, so blaming vibe coding is probably wrong 😄). This code could be simpler.
We get the unified "ramalama chat/run" experience if we use the server also. We don't have to document: if it's mlx it works like this, if it's llama.cpp it works like that, etc.
I can't wait to play around and do some mlx vs llama.cpp performance comparisons for a few minutes once this is in, though.
You should have seen it before, lol! I think on one hand, yes, I could have made a smaller implementation that performed relatively well. But that would've taken me (a still slightly rusty coder) a lot more time than Claude did. It also created unit and bats tests, something I may have (potentially) omitted in the past at times :)
I actually did have this as my first draft, but I couldn't find a way to get streaming to work. I was deciding between consistency with llama.cpp and user experience; would run without streaming be worth the consistency?
Can we do with no streaming now? And open an issue on the mlx GitHub with streaming as a feature request. If mlx adds that feature we can then use it. I want to consolidate on OpenAI-api usage; it just makes maintenance easier.
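For reference, the unified OpenAI-style client call being discussed would look roughly like this. A sketch only: it assumes a local server on port 8080 exposing /v1/chat/completions, and the "model" value is a placeholder.

```python
import json
import urllib.request

payload = {
    "model": "default",  # server-dependent placeholder
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```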
New examples from client/server MLX:
@sourcery-ai dismiss
Automated Sourcery review dismissed.
@sourcery-ai review
/gemini review
Code Review

This pull request introduces MLX runtime support for Apple Silicon, which is a significant enhancement. The implementation is well-structured, with clear separation of concerns, especially the refactoring of the run method into smaller, more manageable functions. The platform validation and automatic --nocontainer mode for MLX are great additions for user experience. The new unit and system tests provide excellent coverage for the new functionality.

I have a few minor suggestions to improve maintainability by replacing magic numbers with named constants and adhering to Python's import conventions. Overall, this is a high-quality contribution.
@@ -142,6 +142,10 @@ def _req(self):
        i = 0.01
        total_time_slept = 0
        response = None

        # Adjust timeout based on whether we're in initial connection phase
        max_timeout = 30 if getattr(self.args, "initial_connection", False) else 16
The timeout values 30 and 16 are magic numbers. To improve readability and maintainability, they should be defined as constants with descriptive names at the module level.

For example:

# At module level
INITIAL_CONNECTION_TIMEOUT_S = 30
CHAT_RESPONSE_TIMEOUT_S = 16

Then you can use these constants here.
def _handle_mlx_chat(self, args):
    """Handle chat for MLX runtime with connection retries."""
    args.ignore = getattr(args, "dryrun", False)
    args.initial_connection = True
    max_retries = 10

    for i in range(max_retries):
        try:
            if self._is_server_ready(args.port):
                args.initial_connection = False
                time.sleep(1)  # Give server time to stabilize
                chat.chat(args)
                break
            else:
                if args.debug:
                    print(f"MLX server not ready, waiting... (attempt {i+1}/{max_retries})", file=sys.stderr)
                time.sleep(3)
                continue

        except Exception as e:
            if i >= max_retries - 1:
                print(f"Error: Failed to connect to MLX server after {max_retries} attempts: {e}", file=sys.stderr)
                self._cleanup_server_process(args.pid2kill)
                raise e
            if args.debug:
                print(f"Connection attempt failed, retrying... (attempt {i+1}/{max_retries}): {e}", file=sys.stderr)
            time.sleep(3)

    args.initial_connection = False
    return 0
This function uses several magic numbers for configuration (e.g., max_retries = 10 on line 415, time.sleep(1) on line 421, time.sleep(3) on lines 427 and 437). It would be better to define these as named constants at the module level. This improves readability and makes it easier to adjust these values in the future.

For example:

# At module level
MAX_MLX_CONNECTION_RETRIES = 10
MLX_SERVER_STABILIZATION_WAIT_S = 1
MLX_CONNECTION_RETRY_WAIT_S = 3
if not pid:
    return

import signal
According to PEP 8, imports should be at the top of the file. Moving import signal to the top of ramalama/model.py improves code style consistency and can avoid re-importing the module if this function is called multiple times. Placing imports inside functions is generally discouraged unless it's for resolving circular dependencies or for optional, platform-specific imports, which is not the case here.
Hey @kush-gupt - I've reviewed your changes and they look great!
Blocking issues:
- time.sleep() call; did you mean to leave this in? (link)
- time.sleep() call; did you mean to leave this in? (link)
- time.sleep() call; did you mean to leave this in? (link)
- time.sleep() call; did you mean to leave this in? (link)
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `ramalama/chat.py:168` </location>
<code_context>
- print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
- self.kills()
+ # Only show error and kill if not in initial connection phase
+ if not getattr(self.args, "initial_connection", False):
+ print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
+ self.kills()
</code_context>
<issue_to_address>
Suppressing error output during initial connection may hide useful diagnostics.
Consider logging connection errors at the debug level during the initial connection phase to aid troubleshooting without adding unnecessary noise.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Only show error and kill if not in initial connection phase
if not getattr(self.args, "initial_connection", False):
print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
self.kills()
=======
# Only show error and kill if not in initial connection phase
if not getattr(self.args, "initial_connection", False):
print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
self.kills()
else:
import logging
logging.debug(f"Could not connect to: {self.url}")
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `test/unit/test_model.py:213` </location>
<code_context>
+ def test_mlx_run_uses_server_client_model(
</code_context>
<issue_to_address>
Missing test for server process failure or connection timeout in MLX run.
Add a test case for server startup failure or connection timeout to verify proper error handling and resource cleanup.
</issue_to_address>
## Security Issues
### Issue 1
<location> `ramalama/model.py:421` </location>
<issue_to_address>
**security (opengrep-rules.python.lang.best-practice.arbitrary-sleep):** time.sleep() call; did you mean to leave this in?
*Source: opengrep*
</issue_to_address>
### Issue 2
<location> `ramalama/model.py:427` </location>
<issue_to_address>
**security (opengrep-rules.python.lang.best-practice.arbitrary-sleep):** time.sleep() call; did you mean to leave this in?
*Source: opengrep*
</issue_to_address>
### Issue 3
<location> `ramalama/model.py:437` </location>
<issue_to_address>
**security (opengrep-rules.python.lang.best-practice.arbitrary-sleep):** time.sleep() call; did you mean to leave this in?
*Source: opengrep*
</issue_to_address>
### Issue 4
<location> `ramalama/model.py:461` </location>
<issue_to_address>
**security (opengrep-rules.python.lang.best-practice.arbitrary-sleep):** time.sleep() call; did you mean to leave this in?
*Source: opengrep*
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
def test_mlx_run_uses_server_client_model(
    self, mock_socket_class, mock_chat, mock_fork, mock_compute_port, mock_machine, mock_system
):
    """Test that MLX runtime uses server-client model in run method"""
    mock_system.return_value = "Darwin"
    mock_machine.return_value = "arm64"
    mock_compute_port.return_value = "8080"
    mock_fork.return_value = 123  # Parent process

    # Mock socket to simulate successful connection (server ready)
suggestion (testing): Missing test for server process failure or connection timeout in MLX run.
Add a test case for server startup failure or connection timeout to verify proper error handling and resource cleanup.
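One way to cover that failure path, sketched against the _handle_mlx_chat code shown earlier in this thread. The patch target for chat.chat, the fixture attributes on self.model/self.args, and the test name are assumptions modeled on the passing test above, not code from the PR.

```python
from unittest.mock import patch

def test_mlx_chat_cleans_up_when_chat_keeps_failing(self):
    """If chat keeps raising, the retries are exhausted and the server process is cleaned up."""
    with patch("ramalama.model.chat.chat", side_effect=ConnectionError("boom")), \
         patch.object(type(self.model), "_is_server_ready", return_value=True), \
         patch.object(type(self.model), "_cleanup_server_process") as cleanup, \
         patch("time.sleep"):  # keep the retry loop from actually waiting
        with self.assertRaises(ConnectionError):
            self.model._handle_mlx_chat(self.args)
        cleanup.assert_called_once()
```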
ramalama/model.py (Outdated)

if pid == 0:
    args.host = CONFIG.host
    args.generate = ""
    args.detach = True
    self.serve(args, True)
    # Child process - start the server
    self._start_server(args)
    return 0
else:
    args.url = f"http://127.0.0.1:{args.port}"
    args.pid2kill = ""
    if args.container:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            raise ValueError(f"Failed to serve model {self.model_name}, for ramalama run command")
    args.ignore = args.dryrun
    for i in range(0, 6):
        try:
            chat.chat(args)
            break
        except Exception as e:
            if i > 5:
                raise e
            time.sleep(1)
    # Parent process - connect to server and start chat
    return self._connect_and_chat(args, pid)
issue (code-quality): We've found these issues:

- Swap if/else branches (swap-if-else-branches)
- Remove unnecessary else after guard condition (remove-unnecessary-else)
Force-pushed from 611ba70 to 6a15d36 (compare)
Could someone fill me in on why setting --host 0.0.0.0 in a container is a bad thing for llama.cpp serve?
It is not, as far as I know; my understanding was that this was the default, so it just becomes repetitive. If the server needs it, then add it.
By the way, I am glad we are getting this in, because mlx is compatible with safetensors. It means macOS users can test safetensors files, like what would be used in vLLM on RHEL AI or OpenShift AI. Up to now we had no way of doing this because llama.cpp is only compatible with gguf. Would be cool to see people running these on RamaLama on macOS, for example:
Needs a rebase.
Signed-off-by: Kush Gupta <[email protected]>
Signed-off-by: Kush Gupta <[email protected]>
I believe I did this correctly now!
Signed-off-by: Kush Gupta <[email protected]>
LGTM
Assisted by: Cursor (Claude 4 sonnet)
Enables a new 'mlx' runtime option that integrates the mlx_lm Python package for optimized Apple Silicon inference, wiring it into the CLI, model execution paths (run, serve), enforcing platform constraints, updating documentation, and covering it with unit and system tests.

New Features:
- … mlx_lm CLI

Enhancements:
- … --nocontainer mode and validation
- … ModelStore | None and add helper methods for building MLX subprocess arguments

CI:
- … mlx-lm package in CI workflows

Documentation:

Tests:

Examples:
Summary by Sourcery
Enable the new MLX runtime option using the mlx_lm package for optimized Apple Silicon inference, integrate it into run/serve/benchmark commands, enforce platform and container constraints, update CI and documentation, and add comprehensive tests.
New Features:
Enhancements:
CI:
Documentation:
Tests: