MLX runtime support #1642

Merged: 3 commits on Jul 4, 2025

5 changes: 5 additions & 0 deletions .github/workflows/ci.yml
@@ -264,6 +264,11 @@ jobs:
uses: astral-sh/setup-uv@v6
with:
activate-environment: true

- name: install mlx-lm
shell: bash
run: |
uv pip install mlx-lm

- name: install golang
shell: bash
19 changes: 18 additions & 1 deletion README.md
Expand Up @@ -57,7 +57,7 @@ RamaLama then pulls AI Models from model registries, starting a chatbot or REST
| :--------------------------------- | :-------------------------: |
| CPU | ✓ |
| Apple Silicon GPU (Linux / Asahi) | ✓ |
| Apple Silicon GPU (macOS) | ✓ |
| Apple Silicon GPU (macOS) | ✓ llama.cpp or MLX |
| Apple Silicon GPU (podman-machine) | ✓ |
| Nvidia GPU (cuda) | ✓ See note below |
| AMD GPU (rocm, vulkan) | ✓ |
@@ -87,6 +87,22 @@ See the [Intel hardware table](https://dgpu-docs.intel.com/devices/hardware-tabl
### Moore Threads GPUs
On systems with Moore Threads GPUs, see [ramalama-musa](docs/ramalama-musa.7.md) documentation for the correct host system configuration.

### MLX Runtime (macOS only)
The MLX runtime provides optimized inference for Apple Silicon Macs. MLX requires:
- macOS operating system
- Apple Silicon hardware (M1, M2, M3, or later)
- Usage with the `--nocontainer` option (containers are not supported)
- The `mlx-lm` Python package installed on the host system

To install `mlx-lm` (with either `uv` or `pip`) and serve Phi-4 with MLX:
```bash
uv pip install mlx-lm
# or pip:
pip install mlx-lm

ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
```

## Install
### Install on Fedora
RamaLama is available in [Fedora 40](https://fedoraproject.org/) and later. To install it, run:
@@ -1125,6 +1141,7 @@ This project wouldn't be possible without the help of other projects like:
- [llama.cpp](https://github.com/ggml-org/llama.cpp)
- [whisper.cpp](https://github.com/ggml-org/whisper.cpp)
- [vllm](https://github.com/vllm-project/vllm)
- [mlx-lm](https://github.com/ml-explore/mlx-examples)
- [podman](https://github.com/containers/podman)
- [huggingface](https://github.com/huggingface)

28 changes: 26 additions & 2 deletions docs/ramalama-serve.1.md
Expand Up @@ -29,9 +29,12 @@ Modify individual model transports by specifying the `huggingface://`, `oci://`,
URL support means if a model is on a web site or even on your local system, you can run it directly.

## REST API ENDPOINTS
Under the hood, `ramalama-serve` uses the `LLaMA.cpp` HTTP server by default.
Under the hood, `ramalama-serve` uses the `llama.cpp` HTTP server by default. When using `--runtime=vllm`, it uses the vLLM server. When using `--runtime=mlx`, it uses the MLX LM server.

For REST API endpoint documentation, see: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
For REST API endpoint documentation, see:
- llama.cpp: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
- vLLM: [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- MLX LM: [https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)
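
All three servers expose an OpenAI-compatible chat completions route, so a generic HTTP client works regardless of which runtime is selected. The sketch below uses only the Python standard library and assumes the default `ramalama serve` port of 8080 and the `/v1/chat/completions` path; adjust both if your configuration differs.

```python
import json
import urllib.request

# Assumption: `ramalama serve` is running locally on its default port (8080)
# and the selected runtime (llama.cpp, vLLM, or MLX LM) serves the
# OpenAI-compatible /v1/chat/completions endpoint.
URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": False,
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read().decode("utf-8"))
    print(body["choices"][0]["message"]["content"])
```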

## OPTIONS

@@ -462,6 +465,27 @@ WantedBy=multi-user.target default.target

See **[ramalama-cuda(7)](ramalama-cuda.7.md)** for setting up the host Linux system for CUDA support.

## MLX Support

The MLX runtime is designed for Apple Silicon Macs and provides optimized performance on these systems. MLX support has the following requirements:

- **Operating System**: macOS only
- **Hardware**: Apple Silicon (M1, M2, M3, or later)
- **Container Mode**: MLX requires `--nocontainer` as it cannot run inside containers
- **Dependencies**: Requires `mlx-lm` package to be installed on the host system

To install MLX dependencies, use either `uv` or `pip`:
```bash
uv pip install mlx-lm
# or pip:
pip install mlx-lm
```

Example usage:
```bash
ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
```
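
Once the MLX server is up, any OpenAI-compatible client can talk to it. The following is a minimal sketch using the `openai` Python package (an assumption; RamaLama does not require it) pointed at the default local port. Note that RamaLama's own chat client omits the `model` field when the MLX runtime is selected, so the value passed here is only a placeholder for the client library.

```python
from openai import OpenAI

# Assumptions: the server started by `ramalama --runtime=mlx serve ...` is
# listening on localhost:8080 and exposes the OpenAI-compatible /v1 API.
# The api_key is unused by the local server but required by the client.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="default",  # placeholder; adjust or ignore depending on the server
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(reply.choices[0].message.content)
```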

## SEE ALSO
**[ramalama(1)](ramalama.1.md)**, **[ramalama-stop(1)](ramalama-stop.1.md)**, **quadlet(1)**, **systemctl(1)**, **podman(1)**, **podman-ps(1)**, **[ramalama-cuda(7)](ramalama-cuda.7.md)**

4 changes: 2 additions & 2 deletions docs/ramalama.conf
Expand Up @@ -86,8 +86,8 @@
#
#pull = "newer"

# Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
# Options: llama.cpp, vllm
# Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
# Options: llama.cpp, vllm, mlx
#
#runtime = "llama.cpp"

4 changes: 2 additions & 2 deletions docs/ramalama.conf.5.md
Expand Up @@ -132,8 +132,8 @@ Specify default port for services to listen on

**runtime**="llama.cpp"

Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
Options: llama.cpp, vllm
Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
Options: llama.cpp, vllm, mlx

**store**="$HOME/.local/share/ramalama"

25 changes: 17 additions & 8 deletions ramalama/chat.py
Expand Up @@ -82,7 +82,6 @@ def __init__(self, args):
self.args = args
self.request_in_process = False
self.prompt = args.prefix

self.url = f"{args.url}/chat/completions"
self.prep_rag_message()

@@ -133,10 +132,8 @@ def _make_request_data(self):
"stream": True,
"messages": self.conversation_history,
}

if getattr(self.args, "model", False):
data["model"] = self.args.model

if not (hasattr(self.args, 'runtime') and self.args.runtime == "mlx"):
data["model"] = self.args.MODEL
json_data = json.dumps(data).encode("utf-8")
headers = {
"Content-Type": "application/json",
@@ -154,6 +151,10 @@ def _req(self):
i = 0.01
total_time_slept = 0
response = None

# Adjust timeout based on whether we're in initial connection phase
max_timeout = 30 if getattr(self.args, "initial_connection", False) else 16
Contributor review comment (severity: medium):

The timeout values 30 and 16 are magic numbers. To improve readability and maintainability, they should be defined as constants with descriptive names at the module level.

For example:

    # At module level
    INITIAL_CONNECTION_TIMEOUT_S = 30
    CHAT_RESPONSE_TIMEOUT_S = 16

Then you can use these constants here.


for c in itertools.cycle(['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']):
try:
response = urllib.request.urlopen(request)
@@ -162,7 +163,7 @@
if sys.stdout.isatty():
print(f"\r{c}", end="", flush=True)

if total_time_slept > 16:
if total_time_slept > max_timeout:
break

total_time_slept += i
@@ -173,12 +174,20 @@
if response:
return res(response, self.args.color)

print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
self.kills()
# Only show error and kill if not in initial connection phase
if not getattr(self.args, "initial_connection", False):
print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
self.kills()
else:
logger.debug(f"Could not connect to: {self.url}")

return None

def kills(self):
# Don't kill the server if we're still in the initial connection phase
if getattr(self.args, "initial_connection", False):
return

if getattr(self.args, "pid2kill", False):
os.kill(self.args.pid2kill, signal.SIGINT)
os.kill(self.args.pid2kill, signal.SIGTERM)
10 changes: 8 additions & 2 deletions ramalama/cli.py
@@ -216,8 +216,8 @@ def configure_arguments(parser):
parser.add_argument(
"--runtime",
default=CONFIG.runtime,
choices=["llama.cpp", "vllm"],
help="specify the runtime to use; valid options are 'llama.cpp' and 'vllm'",
choices=["llama.cpp", "vllm", "mlx"],
help="specify the runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx'",
)
parser.add_argument(
"--store",
@@ -270,6 +270,12 @@ def post_parse_setup(args):
if hasattr(args, "runtime_args"):
args.runtime_args = shlex.split(args.runtime_args)

# MLX runtime automatically requires --nocontainer
if getattr(args, "runtime", None) == "mlx":
if getattr(args, "container", None) is True:
logger.info("MLX runtime automatically uses --nocontainer mode")
args.container = False

configure_logger("DEBUG" if args.debug else "WARNING")

