> Add "--device mps" if you are running on your Mac with Apple silicon (M1, M2, etc)
### Running Gorilla Inference Locally

K-quantized Gorilla models can be found on [Huggingface](https://huggingface.co/gorilla-llm): [Llama-based](https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf), [MPT-based](https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0-gguf), [Falcon-based](https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0-gguf)

K-quantized `gorilla-openfunctions-v0` and `gorilla-openfunctions-v1` models can be found on [Huggingface](https://huggingface.co/gorilla-llm): [`gorilla-openfunctions-v0-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v0-gguf), [`gorilla-openfunctions-v1-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v1-gguf)

For an in-depth walkthrough of how this quantization was done, follow the tutorial in this [Colab notebook](https://colab.research.google.com/drive/1JP_MN-J1rODo9k_-dR_9c9EnZRCfcVNe?usp=sharing). The notebook is a fully self-contained space to see an under-the-hood walkthrough of the quantization pipeline (using llama.cpp) and to test out your own prompts with different quantized versions of Gorilla. The models don't take up local space, and the notebook runs on a CPU runtime.
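
As a rough sketch of what that pipeline looks like (the script and binary names below are assumptions and vary across llama.cpp versions), the notebook converts the Hugging Face checkpoint to GGUF and then quantizes it:

```
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
# convert the HF checkpoint to a 16-bit GGUF file
python3 convert.py /path/to/gorilla-7b-hf-v1
# quantize the GGUF file down to 3-bit K-quants (q3_K_M)
./quantize /path/to/gorilla-7b-hf-v1/ggml-model-f16.gguf gorilla-7b-hf-v1-q3_K_M.gguf q3_K_M
```
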
Running local inference with Gorilla on a clean interface is simple. Follow the instructions below to set up [text-generation-webui](https://github.com/oobabooga/text-generation-webui), add your desired models, and run inference.

My specs (M1 MacBook Air, 2020):
```
Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Model Number: Z125000NMCH/A
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 10151.61.4
OS Loader Version: 10151.61.4
```
Step 1: Clone [text-generation-webui](https://github.com/oobabooga/text-generation-webui), a Gradio web UI for Large Language Models. It supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models. It hides many of the complexities of llama.cpp and has a well-defined interface that is easy to use.
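
For example:

```
git clone https://github.com/oobabooga/text-generation-webui.git
```
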
Step 2: Follow the [text-generation-webui](https://github.com/oobabooga/text-generation-webui) instructions to run the application locally.

1. Go to the cloned folder
2. Run `./start_macos.sh` to start the local server (see the sketch after this list).
3. Open a browser and go to the printed URL, e.g. `http://127.0.0.1:7860/`.
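
Concretely (the printed URL and port may differ on your machine):

```
cd text-generation-webui
./start_macos.sh
# once the server is up, open the printed URL in a browser, e.g. http://127.0.0.1:7860/
```
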
Step 3: Select the quantization method you want to use, download the quantized model, and run inference on the quantized Gorilla model (a command-line alternative with llama.cpp is sketched after the steps below).
1. Go to the `Model` tab and find `Download model or LoRA`. For example, to get the q3_K_M GGUF quantization of `gorilla-7b-hf-v1`, enter `gorilla-llm/gorilla-7b-hf-v1-gguf` as the model, `gorilla-7b-hf-v1-q3_K_M` as the filename, and click `Download`. It will report that it is downloading the file to `models/`.
2. After downloading the model, select it (`gorilla-7b-hf-v1-q3_K_M` for demonstration) and click `Load`. If you have a laptop GPU available, increasing `n-gpu-layers` accelerates inference.
3. After loading, the UI shows a confirmation message.
4. Then go to the `Chat` page; the default settings work for Llama-based quantized models.
5. *Real-time inference* video demo
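
If you prefer the command line over the web UI, the same GGUF file can be prompted directly with llama.cpp, as the Colab notebook above does. A minimal sketch, assuming the file downloaded in Step 3 (the CLI binary is `main` in older llama.cpp releases and `llama-cli` in newer ones; the prompt is just a placeholder):

```
./main -m models/gorilla-7b-hf-v1-q3_K_M.gguf -n 256 \
  -p "I would like to translate 'Hello, how are you?' from English to French."
```
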
Integration with Gorilla-CLI coming soon ...
### [Optional] Batch Inference on a Prompt File
After downloading the model, you need to make a JSONL file containing all the questions you want to run through Gorilla. Here is [one example](https://github.com/ShishirPatil/gorilla/blob/main/inference/example_questions/example_questions.jsonl):
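
A minimal sketch of what such a file might look like (the field names here are assumptions for illustration only; follow the linked example file for the exact schema the inference scripts expect):

```
{"question_id": 1, "text": "I want to translate a sentence from English to French."}
{"question_id": 2, "text": "Give me an API that can detect objects in an image."}
```
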