
Commit 6a15d36

mlx runtime with client/server

Signed-off-by: Kush Gupta <[email protected]>

1 parent 162e2e5, commit 6a15d36

11 files changed: +628 -42 lines changed

.github/workflows/ci.yml

Lines changed: 5 additions & 0 deletions
@@ -264,6 +264,11 @@ jobs:
         uses: astral-sh/setup-uv@v6
         with:
           activate-environment: true
+
+      - name: install mlx-lm
+        shell: bash
+        run: |
+          uv pip install mlx-lm

       - name: install golang
        shell: bash

README.md

Lines changed: 18 additions & 1 deletion
@@ -57,7 +57,7 @@ RamaLama then pulls AI Models from model registries, starting a chatbot or REST
 | :--------------------------------- | :-------------------------: |
 | CPU | &check; |
 | Apple Silicon GPU (Linux / Asahi) | &check; |
-| Apple Silicon GPU (macOS) | &check; |
+| Apple Silicon GPU (macOS) | &check; llama.cpp or MLX |
 | Apple Silicon GPU (podman-machine) | &check; |
 | Nvidia GPU (cuda) | &check; See note below |
 | AMD GPU (rocm, vulkan) | &check; |

@@ -87,6 +87,22 @@ See the [Intel hardware table](https://dgpu-docs.intel.com/devices/hardware-tabl
 ### Moore Threads GPUs
 On systems with Moore Threads GPUs, see [ramalama-musa](docs/ramalama-musa.7.md) documentation for the correct host system configuration.

+### MLX Runtime (macOS only)
+The MLX runtime provides optimized inference for Apple Silicon Macs. MLX requires:
+- macOS operating system
+- Apple Silicon hardware (M1, M2, M3, or later)
+- Usage with `--nocontainer` option (containers are not supported)
+- The `mlx-lm` Python package installed on the host system
+
+To install and run Phi-4 on MLX, use either `uv` or `pip`:
+```bash
+uv pip install mlx-lm
+# or pip:
+pip install mlx-lm
+
+ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
+```
+
 ## Install
 ### Install on Fedora
 RamaLama is available in [Fedora 40](https://fedoraproject.org/) and later. To install it, run:

@@ -1125,6 +1141,7 @@ This project wouldn't be possible without the help of other projects like:
 - [llama.cpp](https://github.com/ggml-org/llama.cpp)
 - [whisper.cpp](https://github.com/ggml-org/whisper.cpp)
 - [vllm](https://github.com/vllm-project/vllm)
+- [mlx-lm](https://github.com/ml-explore/mlx-examples)
 - [podman](https://github.com/containers/podman)
 - [huggingface](https://github.com/huggingface)

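The MLX requirements listed in the README hunk above are easy to sanity-check on a host. Below is a hypothetical pre-flight sketch, not part of this commit or of RamaLama itself, that mirrors those requirements (macOS, Apple Silicon, and an importable `mlx_lm` package):

```python
# Hypothetical pre-flight check mirroring the MLX requirements above;
# illustrative only, not RamaLama code.
import importlib.util
import platform


def mlx_available() -> bool:
    # MLX needs macOS on Apple Silicon with mlx-lm installed on the host.
    if platform.system() != "Darwin":
        return False
    if platform.machine() != "arm64":
        return False
    return importlib.util.find_spec("mlx_lm") is not None


if __name__ == "__main__":
    print("MLX runtime available:", mlx_available())
```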

docs/ramalama-serve.1.md

Lines changed: 26 additions & 2 deletions
@@ -29,9 +29,12 @@ Modify individual model transports by specifying the `huggingface://`, `oci://`,
 URL support means if a model is on a web site or even on your local system, you can run it directly.

 ## REST API ENDPOINTS
-Under the hood, `ramalama-serve` uses the `LLaMA.cpp` HTTP server by default.
+Under the hood, `ramalama-serve` uses the `llama.cpp` HTTP server by default. When using `--runtime=vllm`, it uses the vLLM server. When using `--runtime=mlx`, it uses the MLX LM server.

-For REST API endpoint documentation, see: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
+For REST API endpoint documentation, see:
+- llama.cpp: [https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#api-endpoints)
+- vLLM: [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
+- MLX LM: [https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md)

 ## OPTIONS


@@ -462,6 +465,27 @@ WantedBy=multi-user.target default.target

 See **[ramalama-cuda(7)](ramalama-cuda.7.md)** for setting up the host Linux system for CUDA support.

+## MLX Support
+
+The MLX runtime is designed for Apple Silicon Macs and provides optimized performance on these systems. MLX support has the following requirements:
+
+- **Operating System**: macOS only
+- **Hardware**: Apple Silicon (M1, M2, M3, or later)
+- **Container Mode**: MLX requires `--nocontainer` as it cannot run inside containers
+- **Dependencies**: Requires `mlx-lm` package to be installed on the host system
+
+To install MLX dependencies, use either `uv` or `pip`:
+```bash
+uv pip install mlx-lm
+# or pip:
+pip install mlx-lm
+```
+
+Example usage:
+```bash
+ramalama --runtime=mlx serve hf://mlx-community/Unsloth-Phi-4-4bit
+```
+
 ## SEE ALSO
 **[ramalama(1)](ramalama.1.md)**, **[ramalama-stop(1)](ramalama-stop.1.md)**, **quadlet(1)**, **systemctl(1)**, **podman(1)**, **podman-ps(1)**, **[ramalama-cuda(7)](ramalama-cuda.7.md)**

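Since all three servers referenced above expose OpenAI-compatible chat-completion endpoints, a small client sketch can exercise whichever runtime is serving. This is illustrative only: the port (RamaLama's documented default is 8080) and the `/v1/chat/completions` path are assumptions to adjust for your setup.

```python
# Minimal OpenAI-compatible client sketch for a server started with
# `ramalama serve` (llama.cpp, vLLM, or MLX LM). Port and path are assumptions.
import json
import urllib.request

url = "http://127.0.0.1:8080/v1/chat/completions"  # adjust host/port as needed
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": False,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```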

docs/ramalama.conf

Lines changed: 2 additions & 2 deletions
@@ -86,8 +86,8 @@
 #
 #pull = "newer"

-# Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
-# Options: llama.cpp, vllm
+# Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
+# Options: llama.cpp, vllm, mlx
 #
 #runtime = "llama.cpp"


docs/ramalama.conf.5.md

Lines changed: 2 additions & 2 deletions
@@ -132,8 +132,8 @@ Specify default port for services to listen on

 **runtime**="llama.cpp"

-Specify the AI runtime to use; valid options are 'llama.cpp' and 'vllm' (default: llama.cpp)
-Options: llama.cpp, vllm
+Specify the AI runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx' (default: llama.cpp)
+Options: llama.cpp, vllm, mlx

 **store**="$HOME/.local/share/ramalama"

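For illustration, the runtime setting documented above can be read with the Python standard library, assuming the file parses as TOML with a `[ramalama]` table (as the `key = "value"` syntax suggests); the path below is only an example location:

```python
# Illustrative sketch: read the configured runtime from a ramalama.conf-style
# TOML file. The path and table name are assumptions for this example.
import tomllib

CONF_PATH = "/usr/share/ramalama/ramalama.conf"  # example location only

with open(CONF_PATH, "rb") as f:
    conf = tomllib.load(f)

runtime = conf.get("ramalama", {}).get("runtime", "llama.cpp")
if runtime not in ("llama.cpp", "vllm", "mlx"):
    raise ValueError(f"unsupported runtime: {runtime}")
print(f"configured runtime: {runtime}")
```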

ramalama/chat.py

Lines changed: 17 additions & 5 deletions
@@ -68,7 +68,6 @@ def __init__(self, args):
         self.args = args
         self.request_in_process = False
         self.prompt = args.prefix
-
         self.url = f"{args.url}/chat/completions"
         self.prep_rag_message()


@@ -118,8 +117,9 @@ def _make_request_data(self):
         data = {
             "stream": True,
             "messages": self.conversation_history,
-            "model": self.args.MODEL,
         }
+        if not (hasattr(self.args, 'runtime') and self.args.runtime == "mlx"):
+            data["model"] = self.args.MODEL
         json_data = json.dumps(data).encode("utf-8")
         headers = {
             "Content-Type": "application/json",

@@ -142,6 +142,10 @@ def _req(self):
         i = 0.01
         total_time_slept = 0
         response = None
+
+        # Adjust timeout based on whether we're in initial connection phase
+        max_timeout = 30 if getattr(self.args, "initial_connection", False) else 16
+
         for c in itertools.cycle(['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']):
             try:
                 response = urllib.request.urlopen(request)

@@ -150,7 +154,7 @@ def _req(self):
                 if sys.stdout.isatty():
                     print(f"\r{c}", end="", flush=True)

-                if total_time_slept > 16:
+                if total_time_slept > max_timeout:
                     break

                 total_time_slept += i

@@ -161,12 +165,20 @@ def _req(self):
         if response:
             return res(response, self.args.color)

-        print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
-        self.kills()
+        # Only show error and kill if not in initial connection phase
+        if not getattr(self.args, "initial_connection", False):
+            print(f"\rError: could not connect to: {self.url}", file=sys.stderr)
+            self.kills()
+        else:
+            logger.debug(f"Could not connect to: {self.url}")

         return None

     def kills(self):
+        # Don't kill the server if we're still in the initial connection phase
+        if getattr(self.args, "initial_connection", False):
+            return
+
         if getattr(self.args, "pid2kill", False):
             os.kill(self.args.pid2kill, signal.SIGINT)
             os.kill(self.args.pid2kill, signal.SIGTERM)
ramalama/cli.py

Lines changed: 8 additions & 2 deletions
@@ -216,8 +216,8 @@ def configure_arguments(parser):
     parser.add_argument(
         "--runtime",
         default=CONFIG.runtime,
-        choices=["llama.cpp", "vllm"],
-        help="specify the runtime to use; valid options are 'llama.cpp' and 'vllm'",
+        choices=["llama.cpp", "vllm", "mlx"],
+        help="specify the runtime to use; valid options are 'llama.cpp', 'vllm', and 'mlx'",
     )
     parser.add_argument(
         "--store",

@@ -270,6 +270,12 @@ def post_parse_setup(args):
     if hasattr(args, "runtime_args"):
         args.runtime_args = shlex.split(args.runtime_args)

+    # MLX runtime automatically requires --nocontainer
+    if getattr(args, "runtime", None) == "mlx":
+        if getattr(args, "container", None) is True:
+            logger.info("MLX runtime automatically uses --nocontainer mode")
+        args.container = False
+
     configure_logger("DEBUG" if args.debug else "WARNING")

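As a usage-level illustration of these two hunks, here is a toy parser sketch, not RamaLama's actual CLI, showing that selecting `--runtime=mlx` ends up forcing host execution even when containers are the default:

```python
# Toy parser illustrating the new --runtime choice and the "mlx implies
# --nocontainer" post-parse rule; not RamaLama's actual CLI code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--runtime", default="llama.cpp", choices=["llama.cpp", "vllm", "mlx"])
parser.add_argument("--nocontainer", dest="container", action="store_false", default=True)

args = parser.parse_args(["--runtime=mlx"])

# Post-parse rule: MLX cannot run inside a container, so force host execution.
if args.runtime == "mlx":
    args.container = False

print(args)  # Namespace(runtime='mlx', container=False)
```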
