Issue Description
I'm running on an M2 Max with ramalama 0.11.1 and mlx 0.26.5. I have trouble running or serving a model from Hugging Face: even though I can pull the model just fine, I always get an error saying the model is not found.
Steps to reproduce the issue
- Pull the model
ramalama pull hf://mlx-community/Llama-3.2-1B-Instruct-4bit
Downloading hf://mlx-community/Llama-3.2-1B-Instruct-4bit ...
Trying to pull hf://mlx-community/Llama-3.2-1B-Instruct-4bit ...
Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/HgM_lKo9sdSCfRtVg7MMFS7EKqo=.6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b.incomplete'
Downloading 'model.safetensors.index.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/yVzAsSxRSINSz-tQbpx-TLpfkLU=.32101c2481caabb396a3b36c3fd8b219b0da9c2c.incomplete'
Downloading '.gitattributes' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete'
Downloading 'tokenizer_config.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/vzaExXFZNBay89bvlQv-ZcI6BTg=.6568c91f9cdd35e8ac07b8ff0c201f7e835affc8.incomplete'
Downloading 'special_tokens_map.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/ahkChHUJFxEmOdq5GDFEmerRzCY=.02ee80b6196926a5ad790a004d9efd6ab1ba6542.incomplete'
model.safetensors.index.json: 26.2kB [00:00, 110MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/model.safetensors.index.json
.gitattributes: 1.57kB [00:00, 19.1MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.gitattributes
tokenizer_config.json: 54.6kB [00:00, 206MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/tokenizer_config.json
special_tokens_map.json: 100%|██████████████████████████████████| 296/296 [00:00<00:00, 2.67MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/special_tokens_map.json
Downloading 'config.json' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.25e549e5bb9b201031726870cec84fd9bef3d707.incomplete'
config.json: 1.12kB [00:00, 1.71MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/config.json
Downloading 'model.safetensors' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.35e396644bca888eec399f9c0f843ec7fa78b8f8c5e06841661be62b4edf96dd.incomplete'
Downloading 'README.md' to '/var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.f8c048069e67e62805503ba050832af2e69a210b.incomplete'
README.md: 16.3kB [00:00, 12.0MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/README.md
tokenizer.json: 100%|███████████████████████████████████████| 17.2M/17.2M [00:00<00:00, 20.2MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/tokenizer.json
model.safetensors: 100%|███████████████████████████████████████| 695M/695M [00:03<00:00, 191MB/s]
Download complete. Moving file to /var/folders/7_/3zhq5rhd78d4vc04h30qzmz80000gp/T/tmpdi28_w_v/model.safetensors
Fetching 8 files: 100%|████████████████████████████████████████████| 8/8 [00:03<00:00, 2.07it/s]
- List the models
ramalama list
NAME MODIFIED SIZE
hf://mlx-community/Llama-3.2-1B-Instruct-4bit 55 years ago 0 B
hf://mlx-community/gemma-3-12b-it-qat-4bit 55 years ago 0 B
ollama://granite3-moe/granite3-moe:latest 2 weeks ago 783.77 MB
(note that the MODIFIED and SIZE values for the Hugging Face models look wrong; see the sketch after these steps)
- Run or serve the model with the MLX runtime
ramalama --runtime=mlx --nocontainer --debug --engine=docker run hf://mlx-community/Llama-3.2-1B-Instruct-4bit
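As an aside on the weird MODIFIED values: "55 years ago" is exactly what you get when a modification time is unset and defaults to 0, i.e. the Unix epoch (1970), which is about 55 years before 2025. A minimal sketch of that arithmetic (my assumption about how the MODIFIED column is derived, not ramalama's actual code):

from datetime import datetime, timezone

# Assumption: with no model files recorded, the store falls back to
# mtime 0 (the Unix epoch) and a size of 0 bytes.
mtime = 0
modified = datetime.fromtimestamp(mtime, tz=timezone.utc)
age_years = (datetime.now(timezone.utc) - modified).days // 365
print(f"{modified.date()} -> ~{age_years} years ago")  # 1970-01-01 -> ~55 years ago

This would be consistent with ramalama list seeing the ref file but finding no model files behind it, matching the 0 B sizes.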
Describe the results you received
I get this error when running the run or serve commands with the MLX runtime:
ramalama --runtime=mlx --nocontainer --debug --engine=docker run hf://mlx-community/Llama-3.2-1B-Instruct-4bit
2025-07-23 16:50:44 - DEBUG - run_cmd: npu-smi info
2025-07-23 16:50:44 - DEBUG - Working directory: None
2025-07-23 16:50:44 - DEBUG - Ignore stderr: False
2025-07-23 16:50:44 - DEBUG - Ignore all: False
2025-07-23 16:50:44 - DEBUG - run_cmd: mthreads-gmi
2025-07-23 16:50:44 - DEBUG - Working directory: None
2025-07-23 16:50:44 - DEBUG - Ignore stderr: False
2025-07-23 16:50:44 - DEBUG - Ignore all: False
2025-07-23 16:50:44 - DEBUG - Checking if 8080 is available
2025-07-23 16:50:44 - DEBUG - MLX server not ready, waiting... (attempt 1/10)
2025-07-23 16:50:44 - DEBUG - Checking if 8080 is available
Traceback (most recent call last):
File "/opt/homebrew/bin/ramalama", line 8, in <module>
sys.exit(main())
~~~~^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/cli.py", line 1248, in main
args.func(args)
~~~~~~~~~^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/cli.py", line 986, in run_cli
model.serve(args, quiet=True) if args.rag else model.run(args)
~~~~~~~~~^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 358, in run
self._start_server(args)
~~~~~~~~~~~~~~~~~~^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 369, in _start_server
self.serve(args, True)
~~~~~~~~~~^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 739, in serve
exec_args = self.build_exec_args_serve(args)
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 642, in build_exec_args_serve
exec_args = self.mlx_serve(args)
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 636, in mlx_serve
return self._build_mlx_exec_args("server", args, extra)
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 472, in _build_mlx_exec_args
shlex.quote(self._get_entry_model_path(args.container, args.generate, args.dryrun)),
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ramalama/0.11.1/libexec/lib/python3.13/site-packages/ramalama/model.py", line 189, in _get_entry_model_path
raise NoRefFileFound(self.model)
ramalama.model.NoRefFileFound: No ref file or models found for 'mlx-community/Llama-3.2-1B-Instruct-4bit'. Please pull model.
Describe the results you expected
I should be able to run or serve the model with the MLX runtime without issue.
ramalama info output
{
"Accelerator": "none",
"Engine": {
"Name": null
},
"Image": "quay.io/ramalama/ramalama:latest",
"Runtime": "llama.cpp",
"Selinux": false,
"Shortnames": {
"Files": [
"/opt/homebrew/Cellar/ramalama/0.11.1/libexec/share/ramalama/shortnames.conf"
],
"Names": {
"cerebrum": "huggingface://froggeric/Cerebrum-1.0-7b-GGUF/Cerebrum-1.0-7b-Q4_KS.gguf",
"deepseek": "ollama://deepseek-r1",
"dragon": "huggingface://llmware/dragon-mistral-7b-v0/dragon-mistral-7b-q4_k_m.gguf",
"gemma3": "hf://ggml-org/gemma-3-4b-it-GGUF",
"gemma3:12b": "hf://ggml-org/gemma-3-12b-it-GGUF",
"gemma3:1b": "hf://ggml-org/gemma-3-1b-it-GGUF/gemma-3-1b-it-Q4_K_M.gguf",
"gemma3:27b": "hf://ggml-org/gemma-3-27b-it-GGUF",
"gemma3:4b": "hf://ggml-org/gemma-3-4b-it-GGUF",
"gemma3n": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
"gemma3n:e2b": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-Q8_0.gguf",
"gemma3n:e2b-it-f16": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-f16.gguf",
"gemma3n:e2b-it-q8_0": "hf://ggml-org/gemma-3n-E2B-it-GGUF/gemma-3n-E2B-it-Q8_0.gguf",
"gemma3n:e4b": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
"gemma3n:e4b-it-f16": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-f16.gguf",
"gemma3n:e4b-it-q8_0": "hf://ggml-org/gemma-3n-E4B-it-GGUF/gemma-3n-E4B-it-Q8_0.gguf",
"granite": "ollama://granite3.1-dense",
"granite-code": "hf://ibm-granite/granite-3b-code-base-2k-GGUF/granite-3b-code-base.Q4_K_M.gguf",
"granite-code:20b": "hf://ibm-granite/granite-20b-code-base-8k-GGUF/granite-20b-code-base.Q4_K_M.gguf",
"granite-code:34b": "hf://ibm-granite/granite-34b-code-base-8k-GGUF/granite-34b-code-base.Q4_K_M.gguf",
"granite-code:3b": "hf://ibm-granite/granite-3b-code-base-2k-GGUF/granite-3b-code-base.Q4_K_M.gguf",
"granite-code:8b": "hf://ibm-granite/granite-8b-code-base-4k-GGUF/granite-8b-code-base.Q4_K_M.gguf",
"granite-lab-7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
"granite-lab-8b": "huggingface://ibm-granite/granite-8b-code-base-GGUF/granite-8b-code-base.Q4_K_M.gguf",
"granite-lab:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
"granite:2b": "ollama://granite3.1-dense:2b",
"granite:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
"granite:8b": "ollama://granite3.1-dense:8b",
"hermes": "huggingface://NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf",
"ibm/granite": "ollama://granite3.1-dense:8b",
"ibm/granite:2b": "ollama://granite3.1-dense:2b",
"ibm/granite:7b": "huggingface://instructlab/granite-7b-lab-GGUF/granite-7b-lab-Q4_K_M.gguf",
"ibm/granite:8b": "ollama://granite3.1-dense:8b",
"merlinite": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
"merlinite-lab-7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
"merlinite-lab:7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
"merlinite:7b": "huggingface://instructlab/merlinite-7b-lab-GGUF/merlinite-7b-lab-Q4_K_M.gguf",
"mistral": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
"mistral-small3.1": "hf://bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ2_M.gguf",
"mistral-small3.1:24b": "hf://bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF/mistralai_Mistral-Small-3.1-24B-Instruct-2503-IQ2_M.gguf",
"mistral:7b": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
"mistral:7b-v1": "huggingface://TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf",
"mistral:7b-v2": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
"mistral:7b-v3": "hf://lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
"mistral_code_16k": "huggingface://TheBloke/Mistral-7B-Code-16K-qlora-GGUF/mistral-7b-code-16k-qlora.Q4_K_M.gguf",
"mistral_codealpaca": "huggingface://TheBloke/Mistral-7B-codealpaca-lora-GGUF/mistral-7b-codealpaca-lora.Q4_K_M.gguf",
"mixtao": "huggingface://MaziyarPanahi/MixTAO-7Bx2-MoE-Instruct-v7.0-GGUF/MixTAO-7Bx2-MoE-Instruct-v7.0.Q4_K_M.gguf",
"openchat": "huggingface://TheBloke/openchat-3.5-0106-GGUF/openchat-3.5-0106.Q4_K_M.gguf",
"openorca": "huggingface://TheBloke/Mistral-7B-OpenOrca-GGUF/mistral-7b-openorca.Q4_K_M.gguf",
"phi2": "huggingface://MaziyarPanahi/phi-2-GGUF/phi-2.Q4_K_M.gguf",
"qwen2.5vl": "hf://ggml-org/Qwen2.5-VL-32B-Instruct-GGUF",
"qwen2.5vl:2b": "hf://ggml-org/Qwen2.5-VL-2B-Instruct-GGUF",
"qwen2.5vl:32b": "hf://ggml-org/Qwen2.5-VL-32B-Instruct-GGUF",
"qwen2.5vl:3b": "hf://ggml-org/Qwen2.5-VL-3B-Instruct-GGUF",
"qwen2.5vl:7b": "hf://ggml-org/Qwen2.5-VL-7B-Instruct-GGUF",
"smollm:135m": "ollama://smollm:135m",
"smolvlm": "hf://ggml-org/SmolVLM-500M-Instruct-GGUF",
"smolvlm:256m": "hf://ggml-org/SmolVLM-256M-Instruct-GGUF",
"smolvlm:2b": "hf://ggml-org/SmolVLM-Instruct-GGUF",
"smolvlm:500m": "hf://ggml-org/SmolVLM-500M-Instruct-GGUF",
"tiny": "ollama://tinyllama"
}
},
"Store": "/Users/bobby/.local/share/ramalama",
"UseContainer": false,
"Version": "0.11.1"
}
Upstream Latest Release
Yes
Additional environment details
I have not changed the default store location or set any related environment variables.
Additional information
Here are the contents of Llama-3.2-1B-Instruct-4bit/refs/latest.json:
{
"files": [
{
"hash": "sha256:35e396644bca888eec399f9c0f843ec7fa78b8f8c5e06841661be62b4edf96dd",
"name": "model.safetensors",
"type": "other"
},
{
"hash": "sha256:6568c91f9cdd35e8ac07b8ff0c201f7e835affc8",
"name": "tokenizer_config.json",
"type": "other"
},
{
"hash": "sha256:02ee80b6196926a5ad790a004d9efd6ab1ba6542",
"name": "special_tokens_map.json",
"type": "other"
},
{
"hash": "sha256:25e549e5bb9b201031726870cec84fd9bef3d707",
"name": "config.json",
"type": "other"
},
{
"hash": "sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b",
"name": "tokenizer.json",
"type": "other"
},
{
"hash": "sha256:32101c2481caabb396a3b36c3fd8b219b0da9c2c",
"name": "model.safetensors.index.json",
"type": "other"
}
],
"hash": "sha256-f8c048069e67e62805503ba050832af2e69a210b",
"path": "/Users/bobby/.local/share/ramalama/store/huggingface/mlx-community/Llama-3.2-1B-Instruct-4bit/refs/latest.json",
"version": "v1.0"
}
I'm hypothesizing that ramalama thinks there are no model files because model_files() here returns an empty array:
ramalama/ramalama/model_store/reffile.py, line 123 in 95eeef9:

def model_files(self) -> list[StoreFile]:
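A minimal sketch of that hypothesis (assumed behavior, not the actual reffile.py implementation): every entry in the latest.json above is recorded with type "other", so if model_files() filters the entries by their recorded type, nothing matches and the list comes back empty, which would trigger NoRefFileFound in _get_entry_model_path():

from dataclasses import dataclass

@dataclass
class StoreFile:
    hash: str
    name: str
    type: str  # field names taken from the latest.json above

# Abbreviated from the ref file above; note every type is "other"
files = [
    StoreFile("sha256:35e3...", "model.safetensors", "other"),
    StoreFile("sha256:6b9e...", "tokenizer.json", "other"),
]

def model_files(files: list[StoreFile]) -> list[StoreFile]:
    # Hypothetical filter: keep only entries recorded as model files
    return [f for f in files if f.type == "model"]

print(model_files(files))  # [] -> nothing to serve, hence NoRefFileFound

If that's right, the bug would be in how the Hugging Face pull classifies safetensors files when writing the ref file, rather than in the MLX runtime itself.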