
Running with MLX runtime cannot find an existing model #1735

@BobbyRadford

Issue Description

I'm running on an M2 Max with ramalama 0.11.1 and mlx 0.26.5. I'm having trouble running or serving a model from Hugging Face with the MLX runtime: even though I can pull the model just fine, I always get an error saying the model was not found.

Steps to reproduce the issue

  1. Pull the model
ramalama pull hf://mlx-community/Llama-3.2-1B-Instruct-4bit        
Downloading hf://mlx-community/Llama-3.2-1B-Instruct-4bit ...
Trying to pull hf://mlx-community/Llama-3.2-1B-Instruct-4bit ...
Fetching 8 files:   0%|                                                    | 0/8 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/HgM_lKo9sdSCfRtVg7MMFS7EKqo=.6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b.incomplete'
Downloading 'model.safetensors.index.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/yVzAsSxRSINSz-tQbpx-TLpfkLU=.32101c2481caabb396a3b36c3fd8b219b0da9c2c.incomplete'
Downloading '.gitattributes' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete'
Downloading 'tokenizer_config.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/vzaExXFZNBay89bvlQv-ZcI6BTg=.6568c91f9cdd35e8ac07b8ff0c201f7e835affc8.incomplete'
Downloading 'special_tokens_map.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/ahkChHUJFxEmOdq5GDFEmerRzCY=.02ee80b6196926a5ad790a004d9efd6ab1ba6542.incomplete'
model.safetensors.index.json: 26.2kB [00:00, 110MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/model.safetensors.index.json
.gitattributes: 1.57kB [00:00, 19.1MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.gitattributes
tokenizer_config.json: 54.6kB [00:00, 206MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/tokenizer_config.json
special_tokens_map.json: 100%|██████████████████████████████████| 296/296 [00:00<00:00, 2.67MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/special_tokens_map.json
Downloading 'config.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.25e549e5bb9b201031726870cec84fd9bef3d707.incomplete'
config.json: 1.12kB [00:00, 1.71MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/config.json
Downloading 'model.safetensors' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.35e396644bca888eec399f9c0f843ec7fa78b8f8c5e06841661be62b4edf96dd.incomplete'
Downloading 'README.md' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.f8c048069e67e62805503ba050832af2e69a210b.incomplete'
README.md: 16.3kB [00:00, 12.0MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/README.md
tokenizer.json: 100%|███████████████████████████████████████| 17.2M/17.2M [00:00<00:00, 20.2MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/tokenizer.json
model.safetensors: 100%|███████████████████████████████████████| 695M/695M [00:03<00:00, 191MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/model.safetensors
Fetching 8 files: 100%|████████████████████████████████████████████| 8/8 [00:03<00:00,  2.07it/s]

Listing the models confirms the pull:

ramalama list
NAME                                          MODIFIED     SIZE     
hf://mlx-community/Llama-3.2-1B-Instruct-4bit 55 years ago 0 B      
hf://mlx-community/gemma-3-12b-it-qat-4bit    55 years ago 0 B      
ollama://granite3-moe/granite3-moe:latest     2 weeks ago  783.77 MB

(Note that the size and modified values look odd; see the sketch just after these steps for a guess at why.)

  2. Run or serve the model with the MLX runtime
ramalama --runtime=mlx --nocontainer --debug --engine=docker run hf://mlx-community/Llama-3.2-1B-Instruct-4bit
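
As an aside, the odd values in the list output are consistent with the same problem: if the store finds no model files for a ref, a fallback size of 0 bytes and a zero (Unix epoch) timestamp rendered as a relative date would show up as exactly "0 B" and roughly "55 years ago". A minimal sketch of that guess (the fallback behaviour is my assumption, not taken from the ramalama code):

from datetime import datetime, timezone

# Assumption: with no model files found, the modified time falls back to
# timestamp 0 (the Unix epoch) and the size to 0 bytes.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)   # 1970-01-01 00:00:00 UTC
now = datetime.now(tz=timezone.utc)
print(f"{now.year - epoch.year} years ago, 0 B")     # prints "55 years ago, 0 B" in 2025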

Describe the results you received

I get the following error when running the run or serve commands with the MLX runtime:

ramalama --runtime=mlx --nocontainer --debug --engine=docker run hf://mlx-community/Llama-3.2-1B-Instruct-4bit
2025-07-23 16:50:44 - DEBUG - run_cmd: npu-smi info
2025-07-23 16:50:44 - DEBUG - Working directory: None
2025-07-23 16:50:44 - DEBUG - Ignore stderr: False
2025-07-23 16:50:44 - DEBUG - Ignore all: False
2025-07-23 16:50:44 - DEBUG - run_cmd: mthreads-gmi
2025-07-23 16:50:44 - DEBUG - Working directory: None
2025-07-23 16:50:44 - DEBUG - Ignore stderr: False
2025-07-23 16:50:44 - DEBUG - Ignore all: False
2025-07-23 16:50:44 - DEBUG - Checking if 8080 is available
2025-07-23 16:50:44 - DEBUG - MLX server not ready, waiting... (attempt 1/10)
2025-07-23 16:50:44 - DEBUG - Checking if 8080 is available
Traceback (most recent call last):
  File "/opt/homebrew/bin/ramalama", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/cli.py", line 1248, in main
    args.func(args)
    ~~~~~~~~~^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/cli.py", line 986, in run_cli
    model.serve(args, quiet=True) if args.rag else model.run(args)
                                                   ~~~~~~~~~^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 358, in run
    self._start_server(args)
    ~~~~~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 369, in _start_server
    self.serve(args, True)
    ~~~~~~~~~~^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 739, in serve
    exec_args = self.build_exec_args_serve(args)
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 642, in build_exec_args_serve
    exec_args = self.mlx_serve(args)
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 636, in mlx_serve
    return self._build_mlx_exec_args("server", args, extra)
           ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 472, in _build_mlx_exec_args
    shlex.quote(self._get_entry_model_path(args.container, args.generate, args.dryrun)),
                ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 189, in _get_entry_model_path
    raise NoRefFileFound(self.model)
ramalama.model.NoRefFileFound: No ref file or models found for 'mlx-community/Llama-3.2-1B-Instruct-4bit'. Please pull model.

Describe the results you expected

I should be able to run or serve the model with the MLX runtime without issue.

ramalama info output

{
    "Accelerator": "none",
    "Engine": {
        "Name": null
    },
    "Image": "quay.io/ramalama/ramalama:latest",
    "Runtime": "llama.cpp",
    "Selinux": false,
    "Shortnames": {
        "Files": [
            "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/share/ramalama/shortnames.conf"
        ],
        "Names": {
            "cerebrum": "huggingface://froggeric/Cerebrum-1.0-7b-GGUF/Cerebrum-1.0-7b-Q4_KS.gguf",
            "deepseek": "ollama://deepseek-r1",
            "dragon": "huggingface://llmware/dragon-mistral-7b-v0/dragon-mistral-7b-q4_k_m.gguf",
            "gemma3": "hf://ggml-org/gemma-3-4b-it-GGUF",
            "gemma3:12b": "hf://ggml-org/gemma-3-12b-it-GGUF",
            "gemma3:1b": "hf://ggml-org/gemma-3-1b-it-GGUF/gemma-3-1b-it-Q4_K_M.gguf",
            "gemma3:27b": "hf://ggml-org/gemma-3-27b-it-GGUF",
            "gemma3:4b": "hf://ggml-org/gemma-3-4b-it-GGUF",
            "gemma3n": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
            "gemma3n:e2b": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-Q8_0.gguf",
            "gemma3n:e2b-it-f16": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-f16.gguf",
            "gemma3n:e2b-it-q8_0": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-Q8_0.gguf",
            "gemma3n:e4b": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
            "gemma3n:e4b-it-f16": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-f16.gguf",
            "gemma3n:e4b-it-q8_0": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
            "granite": "ollama://granite3.1-dense",
            "granite-code": "hf://ibm-granite/granite-3b-code-base-2k-GGUF/granite-3b-code-base.Q4_K_M.gguf",
            "granite-code:20b": "hf://ibm-granite/granite-20b-code-base-8k-GGUF/granite-20b-code-base.Q4_K_M.gguf",
            "granite-code:34b": "hf://ibm-granite/granite-34b-code-base-8k-GGUF/granite-34b-code-base.Q4_K_M.gguf",
            "granite-code:3b": "hf://ibm-granite/granite-3b-code-base-2k-GGUF/granite-3b-code-base.Q4_K_M.gguf",
            "granite-code:8b": "hf://ibm-granite/granite-8b-code-base-4k-GGUF/granite-8b-code-base.Q4_K_M.gguf",
            "granite-lab-7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
            "granite-lab-8b": "huggingface://ibm-granite/granite-8b-code-base-GGUF/granite-8b-code-base.Q4_K_M.gguf",
            "granite-lab:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
            "granite:2b": "ollama://granite3.1-dense:2b",
            "granite:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
            "granite:8b": "ollama://granite3.1-dense:8b",
            "hermes": "huggingface://NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf",
            "ibm/granite": "ollama://granite3.1-dense:8b",
            "ibm/granite:2b": "ollama://granite3.1-dense:2b",
            "ibm/granite:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
            "ibm/granite:8b": "ollama://granite3.1-dense:8b",
            "merlinite": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
            "merlinite-lab-7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
            "merlinite-lab:7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
            "merlinite:7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
            "mistral": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
            "mistral-small3.1": "hf://bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ2_M.gguf",
            "mistral-small3.1:24b": "hf://bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ2_M.gguf",
            "mistral:7b": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
            "mistral:7b-v1": "huggingface://TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf",
            "mistral:7b-v2": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
            "mistral:7b-v3": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
            "mistral_code_16k": "huggingface://TheBloke/Mistral-7B-Code-16K-qlora-GGUF/mistral-7b-code-16k-qlora.Q4_K_M.gguf",
            "mistral_codealpaca": "huggingface://TheBloke/Mistral-7B-codealpaca-lora-GGUF/mistral-7b-codealpaca-lora.Q4_K_M.gguf",
            "mixtao": "huggingface://MaziyarPanahi/MixTAO-7Bx2-MoE-Instruct-v7.0-GGUF/MixTAO-7Bx2-MoE-Instruct-v7.0.Q4_K_M.gguf",
            "openchat": "huggingface://TheBloke/openchat-3.5-0106-GGUF/openchat-3.5-0106.Q4_K_M.gguf",
            "openorca": "huggingface://TheBloke/Mistral-7B-OpenOrca-GGUF/mistral-7b-openorca.Q4_K_M.gguf",
            "phi2": "huggingface://MaziyarPanahi/phi-2-GGUF/phi-2.Q4_K_M.gguf",
            "qwen2.5vl": "hf://ggml-org/Qwen2.5-VL-32B-Instruct-GGUF",
            "qwen2.5vl:2b": "hf://ggml-org/Qwen2.5-VL-2B-Instruct-GGUF",
            "qwen2.5vl:32b": "hf://ggml-org/Qwen2.5-VL-32B-Instruct-GGUF",
            "qwen2.5vl:3b": "hf://ggml-org/Qwen2.5-VL-3B-Instruct-GGUF",
            "qwen2.5vl:7b": "hf://ggml-org/Qwen2.5-VL-7B-Instruct-GGUF",
            "smollm:135m": "ollama://smollm:135m",
            "smolvlm": "hf://ggml-org/SmolVLM-500M-Instruct-GGUF",
            "smolvlm:256m": "hf://ggml-org/SmolVLM-256M-Instruct-GGUF",
            "smolvlm:2b": "hf://ggml-org/SmolVLM-Instruct-GGUF",
            "smolvlm:500m": "hf://ggml-org/SmolVLM-500M-Instruct-GGUF",
            "tiny": "ollama://tinyllama"
        }
    },
    "Store": "/Users/bobby/.local/share/ramalama",
    "UseContainer": false,
    "Version": "0.11.1"
}

Upstream Latest Release

Yes

Additional environment details

I have not changed the default store location or set any related environment variables.

Additional information

Here are the contents of Llama-3.2-1B-Instruct-4bit/refs/latest.json:

{
  "files": [
    {
      "hash": "sha256:35e396644bca888eec399f9c0f843ec7fa78b8f8c5e06841661be62b4edf96dd",
      "name": "model.safetensors",
      "type": "other"
    },
    {
      "hash": "sha256:6568c91f9cdd35e8ac07b8ff0c201f7e835affc8",
      "name": "tokenizer_config.json",
      "type": "other"
    },
    {
      "hash": "sha256:02ee80b6196926a5ad790a004d9efd6ab1ba6542",
      "name": "special_tokens_map.json",
      "type": "other"
    },
    {
      "hash": "sha256:25e549e5bb9b201031726870cec84fd9bef3d707",
      "name": "config.json",
      "type": "other"
    },
    {
      "hash": "sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b",
      "name": "tokenizer.json",
      "type": "other"
    },
    {
      "hash": "sha256:32101c2481caabb396a3b36c3fd8b219b0da9c2c",
      "name": "model.safetensors.index.json",
      "type": "other"
    }
  ],
  "hash": "sha256-f8c048069e67e62805503ba050832af2e69a210b",
  "path": "/Users/bobby/.local/share/ramalama/store/huggingface/mlx-community/Llama-3.2-1B-Instruct-4bit/refs/latest.json",
  "version": "v1.0"
}

I'm hypothesizing that ramalama thinks there are no model files because model_files() (declared as def model_files(self) -> list[StoreFile]:) returns an empty list for this ref. That is a total stab in the dark, however.
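
To make that concrete, here is a rough sketch of the kind of lookup that could come back empty. The names and the type filter below are my assumptions, not ramalama's actual code; the point is only that a refs/latest.json whose entries are all typed "other" would yield an empty list of model files:

import json
from pathlib import Path

# Hypothetical sketch, NOT ramalama's real implementation.
store = Path.home() / ".local/share/ramalama/store/huggingface/mlx-community/Llama-3.2-1B-Instruct-4bit"

def model_files(ref_path: Path) -> list[dict]:
    ref = json.loads(ref_path.read_text())
    # Assumption: only entries whose type marks them as model weights count as
    # model files. Every entry in the ref shown above is typed "other", so this
    # filter comes back empty.
    return [f for f in ref["files"] if f["type"] != "other"]

print(model_files(store / "refs/latest.json"))   # [] -> "No ref file or models found"

If the real implementation filters on a specific file type, safetensors files recorded as "other" would never count, which would line up with the error above.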
