
Commit 6a15d36

mlx runtime with client/server

Signed-off-by: Kush Gupta <[email protected]>

1 parent 162e2e5, commit 6a15d36

11 files changed: +628 -42 lines changed

.github/workflows/ci.yml

Lines changed: 5 additions & 0 deletions
@@ -264,6 +264,11 @@ jobs:
         uses: astral-sh/setup-uv@v6
         with:
           activate-environment: true
+
+      - name: install mlx-lm
+        shell: bash
+        run: |
+          uv pip install mlx-lm

       - name: install golang
        shell: bash

README.md

Lines changed: 18 additions & 1 deletion
@@ -57,7 +57,7 @@ RamaLama then pulls AI Models from model registries, starting a chatbot or REST
 | :--------------------------------- | :-------------------------: |
 | CPU | &check; |
 | Apple Silicon GPU (Linux / Asahi) | &check; |
-| Apple Silicon GPU (macOS) | &check; |
+| Apple Silicon GPU (macOS) | &check; llama.cpp or MLX |
 | Apple Silicon GPU (podman-machine) | &check; |
 | Nvidia GPU (cuda) | &check; See note below |
 | AMD GPU (rocm, vulkan) | &check; |

@@ -87,6 +87,22 @@ See the [Intel hardware table](https://dgpu-docs.intel.com/devices/hardware-tabl
 ### Moore Threads GPUs
 On systems with Moore Threads GPUs, see [ramalama-musa](docs/ramalama-musa.7.md) documentation for the correct host system configuration.

+### MLX Runtime (macOS only)
+The MLX runtime provides optimized inference for Apple Silicon Macs. MLX requires:
+- macOS operating system
+- Apple Silicon hardware (M1, M2, M3, or later)
+- Usage with `--nocontainer` option (containers are not supported)
+- The `mlx-lm` Python package installed on the host system
+
+To install and run Phi-4 on MLX, use either `uv` or `pip`:
+```bash
+uv pip install mlx-lm
+# or pip:
+pip install mlx-lm
+
+ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
+```
+
 ## Install
 ### Install on Fedora
 RamaLama is available in [Fedora 40](https://fedoraproject.org/) and later. To install it, run:

@@ -1125,6 +1141,7 @@ This project wouldn't be possible without the help of other projects like:
 - [llama.cpp](https://github.com/ggml-org/llama.cpp)
 - [whisper.cpp](https://github.com/ggml-org/whisper.cpp)
 - [vllm](https://github.com/vllm-project/vllm)
+- [mlx-lm](https://github.com/ml-explore/mlx-examples)
 - [podman](https://github.com/containers/podman)
 - [huggingface](https://github.com/huggingface)

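The MLX requirements listed in the README hunk above are easy to sanity-check on a host. Below is a hypothetical pre-flight sketch, not part of this commit or of RamaLama itself, that mirrors those requirements (macOS, Apple Silicon, and an importable `mlx_lm` package):

```python
# Hypothetical pre-flight check mirroring the MLX requirements above;
# illustrative only, not RamaLama code.
import importlib.util
import platform


def mlx_available() -> bool:
    # MLX needs macOS on Apple Silicon with mlx-lm installed on the host.
    if platform.system() != "Darwin":
        return False
    if platform.machine() != "arm64":
        return False
    return importlib.util.find_spec("mlx_lm") is not None


if __name__ == "__main__":
    print("MLX runtime available:", mlx_available())
```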

docs/ramalama-serve.1.md

Lines changed: 26 additions & 2 deletions
@@ -29,9 +29,12 @@ Modify individual model transports by specifying the `huggingface://`, `oci://`,
 URL support means if a model is on a web site or even on your local system, you can run it directly.

 ## REST API ENDPOINTS
-Under the hood, `ramalama-serve` uses the `LLaMA.cpp` HTTP server by default.
+Under the hood, `ramalama-serve` uses the `llama.cpp` HTTP server by default. When using `--runtime=vllm`, it uses the vLLM server. When using `--runtime=mlx`, it uses the MLX LM server.

-For REST API endpoint documentation, see: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
+For REST API endpoint documentation, see:
+- llama.cpp: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
+- vLLM: [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
+- MLX LM: [https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)

 ## OPTIONS


@@ -462,6 +465,27 @@ WantedBy=multi-user.target default.target

 See **[ramalama-cuda(7)](ramalama-cuda.7.md)** for setting up the host Linux system for CUDA support.

+## MLX Support
+
+The MLX runtime is designed for Apple Silicon Macs and provides optimized performance on these systems. MLX support has the following requirements:
+
+- **Operating System**: macOS only
+- **Hardware**: Apple Silicon (M1, M2, M3, or later)
+- **Container Mode**: MLX requires `--nocontainer` as it cannot run inside containers
+- **Dependencies**: Requires `mlx-lm` package to be installed on the host system
+
+To install MLX dependencies, use either `uv` or `pip`:
+```bash
+uv pip install mlx-lm
+# or pip:
+pip install mlx-lm
+```
+
+Example usage:
+```bash
+ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
+```
+
 ## SEE ALSO
 **[ramalama(1)](ramalama.1.md)**, **[ramalama-stop(1)](ramalama-stop.1.md)**, **quadlet(1)**, **systemctl(1)**, **podman(1)**, **podman-ps(1)**, **[ramalama-cuda(7)](ramalama-cuda.7.md)**

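Since all three servers referenced above expose OpenAI-compatible chat-completion endpoints, a small client sketch can exercise whichever runtime is serving. This is illustrative only: the port (RamaLama's documented default is 8080) and the `/v1/chat/completions` path are assumptions to adjust for your setup.

```python
# Minimal OpenAI-compatible client sketch for a server started with
# `ramalama serve` (llama.cpp, vLLM, or MLX LM). Port and path are assumptions.
import json
import urllib.request

url = "http://127.0.0.1:8080/v1/chat/completions"  # adjust host/port as needed
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": False,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```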

docs/ramalama.conf

Lines changed: 2 additions & 2 deletions
@@ -86,8 +86,8 @@
 #
 #pull = "newer"

-# Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
-# Options: llama.cpp, vllm
+# Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
+# Options: llama.cpp, vllm, mlx
 #
 #runtime = "llama.cpp"


docs/ramalama.conf.5.md

Lines changed: 2 additions & 2 deletions
@@ -132,8 +132,8 @@ Specify default port for services to listen on

 **runtime**="llama.cpp"

-Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
-Options: llama.cpp, vllm
+Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
+Options: llama.cpp, vllm, mlx

 **store**="$HOME/.local/share/ramalama"

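For illustration, the runtime setting documented above can be read with the Python standard library, assuming the file parses as TOML with a `[ramalama]` table (as the `key = "value"` syntax suggests); the path below is only an example location:

```python
# Illustrative sketch: read the configured runtime from a ramalama.conf-style
# TOML file. The path and table name are assumptions for this example.
import tomllib

CONF_PATH = "/usr/share/ramalama/ramalama.conf"  # example location only

with open(CONF_PATH, "rb") as f:
    conf = tomllib.load(f)

runtime = conf.get("ramalama", {}).get("runtime", "llama.cpp")
if runtime not in ("llama.cpp", "vllm", "mlx"):
    raise ValueError(f"unsupported runtime: {runtime}")
print(f"configured runtime: {runtime}")
```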

ramalama/chat.py

Lines changed: 17 additions & 5 deletions
@@ -68,7 +68,6 @@ def __init__(self, args):
         self.args = args
         self.request_in_process = False
         self.prompt = args.prefix
-
         self.url = f"{args.url}/chat/completions"
         self.prep_rag_message()


@@ -118,8 +117,9 @@ def _make_request_data(self):
         data = {
             "stream": True,
             "messages": self.conversation_history,
-            "model": self.args.MODEL,
         }
+        if not (hasattr(self.args, 'runtime') and self.args.runtime == "mlx"):
+            data["model"] = self.args.MODEL
         json_data = json.dumps(data).encode("utf-8")
         headers = {
             "Content-Type": "application/json",

@@ -142,6 +142,10 @@ def _req(self):
         i = 0.01
         total_time_slept = 0
         response = None
+
+        # Adjust timeout based on whether we're in initial connection phase
+        max_timeout = 30 if getattr(self.args, "initial_connection", False) else 16
+
         for c in itertools.cycle(['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']):
             try:
                 response = urllib.request.urlopen(request)

@@ -150,7 +154,7 @@ def _req(self):
                 if sys.stdout.isatty():
                     print(f"\r{c}", end="", flush=True)

-                if total_time_slept > 16:
+                if total_time_slept > max_timeout:
                     break

                 total_time_slept += i

@@ -161,12 +165,20 @@ def _req(self):
         if response:
             return res(response, self.args.color)

-        print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
-        self.kills()
+        # Only show error and kill if not in initial connection phase
+        if not getattr(self.args, "initial_connection", False):
+            print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
+            self.kills()
+        else:
+            logger.debug(f"Could not connect to: {self.url}")

         return None

     def kills(self):
+        # Don't kill the server if we're still in the initial connection phase
+        if getattr(self.args, "initial_connection", False):
+            return
+
         if getattr(self.args, "pid2kill", False):
             os.kill(self.args.pid2kill, signal.SIGINT)
             os.kill(self.args.pid2kill, signal.SIGTERM)
ramalama/cli.py

Lines changed: 8 additions & 2 deletions
@@ -216,8 +216,8 @@ def configure_arguments(parser):
     parser.add_argument(
         "--runtime",
         default=CONFIG.runtime,
-        choices=["llama.cpp", "vllm"],
-        help="specify the runtime to use; valid options are 'llama.cpp' and 'vllm'",
+        choices=["llama.cpp", "vllm", "mlx"],
+        help="specify the runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx'",
     )
     parser.add_argument(
         "--store",

@@ -270,6 +270,12 @@ def post_parse_setup(args):
     if hasattr(args, "runtime_args"):
         args.runtime_args = shlex.split(args.runtime_args)

+    # MLX runtime automatically requires --nocontainer
+    if getattr(args, "runtime", None) == "mlx":
+        if getattr(args, "container", None) is True:
+            logger.info("MLX runtime automatically uses --nocontainer mode")
+        args.container = False
+
     configure_logger("DEBUG" if args.debug else "WARNING")

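As a usage-level illustration of these two hunks, here is a toy parser sketch, not RamaLama's actual CLI, showing that selecting `--runtime=mlx` ends up forcing host execution even when containers are the default:

```python
# Toy parser illustrating the new --runtime choice and the "mlx implies
# --nocontainer" post-parse rule; not RamaLama's actual CLI code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--runtime", default="llama.cpp", choices=["llama.cpp", "vllm", "mlx"])
parser.add_argument("--nocontainer", dest="container", action="store_false", default=True)

args = parser.parse_args(["--runtime=mlx"])

# Post-parse rule: MLX cannot run inside a container, so force host execution.
if args.runtime == "mlx":
    args.container = False

print(args)  # Namespace(runtime='mlx', container=False)
```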
