
Commit 95edaa0

Quantized Gorilla (#160)
Resolves #77: a demo displaying local inference with text-generation-webui. K-quantized Gorilla models can be found on [Huggingface](https://huggingface.co/gorilla-llm): [Llama-based](https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf), [MPT-based](https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0-gguf), [Falcon-based](https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0-gguf), [`gorilla-openfunctions-v0-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v0-gguf), [`gorilla-openfunctions-v1-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v1-gguf). A tutorial walkthrough on how to quantize a model using llama.cpp with different quantization methods is documented in this [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JP_MN-J1rODo9k_-dR_9c9EnZRCfcVNe?usp=sharing). Running local inference with Gorilla on a clean interface is simple: set up [text-generation-webui](https://github.com/oobabooga/text-generation-webui), add your desired models, and run inference. More details in the `/inference` README.

Co-authored-by: Pranav Ramesh <[email protected]>

---------

Co-authored-by: Pranav Ramesh <[email protected]>
Co-authored-by: Pranav Ramesh <[email protected]>
1 parent 988c3f9 commit 95edaa0

File tree

7 files changed: +45, −0 lines changed

inference/Presentation1_final.gif (66.3 MB)

inference/README.md

Lines changed: 45 additions & 0 deletions
@@ -58,6 +58,51 @@ python3 serve/gorilla_falcon_cli.py --model-path path/to/gorilla-falcon-7b-hf-v0
> Add "--device mps" if you are running on your Mac with Apple silicon (M1, M2, etc)

### Run Gorilla Inference Locally

K-quantized Gorilla models can be found on [Huggingface](https://huggingface.co/gorilla-llm): [Llama-based](https://huggingface.co/gorilla-llm/gorilla-7b-hf-v1-gguf), [MPT-based](https://huggingface.co/gorilla-llm/gorilla-mpt-7b-hf-v0-gguf), [Falcon-based](https://huggingface.co/gorilla-llm/gorilla-falcon-7b-hf-v0-gguf)

K-quantized `gorilla-openfunctions-v0` and `gorilla-openfunctions-v1` models can be found on [Huggingface](https://huggingface.co/gorilla-llm): [`gorilla-openfunctions-v0-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v0-gguf), [`gorilla-openfunctions-v1-gguf`](https://huggingface.co/gorilla-llm/gorilla-openfunctions-v1-gguf)
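If you'd rather grab one of these GGUF files from the command line than through a UI, `huggingface-cli` (installed with the `huggingface_hub` package) can fetch a single file. A minimal sketch, assuming the q3_K_M filename follows the naming used later in this walkthrough; check the repo's file list on Huggingface for the exact names:

```bash
# Install the CLI if needed: pip install -U huggingface_hub
# The GGUF filename below is an assumption; verify it against the
# file list in the gorilla-llm/gorilla-7b-hf-v1-gguf repo.
huggingface-cli download gorilla-llm/gorilla-7b-hf-v1-gguf \
    gorilla-7b-hf-v1-q3_K_M.gguf \
    --local-dir ./models
```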
For an in-depth walkthrough of how this quantization was done, follow the tutorial in this [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JP_MN-J1rODo9k_-dR_9c9EnZRCfcVNe?usp=sharing). The tutorial is a fully self-contained space to see an under-the-hood walkthrough of the quantization pipeline (using llama.cpp) and to test out your own prompts with different quantized versions of Gorilla. The models don't take up local space, and the notebook runs on a CPU runtime.
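For reference, the heart of that pipeline is only two llama.cpp steps: convert the Huggingface checkpoint to a full-precision GGUF file, then K-quantize it. A minimal sketch, assuming a 2023-era llama.cpp checkout (the script and binary names have since been renamed upstream) and illustrative paths:

```bash
# Build llama.cpp and its quantize tool.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make

# 1) Convert the HF checkpoint to an f16 GGUF file.
python3 convert.py /path/to/gorilla-7b-hf-v1 \
    --outtype f16 --outfile gorilla-7b-hf-v1-f16.gguf

# 2) K-quantize it, e.g. to q3_K_M (the variant used below).
./quantize gorilla-7b-hf-v1-f16.gguf gorilla-7b-hf-v1-q3_K_M.gguf q3_K_M
```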
Running local inference with Gorilla on a clean interface is simple. Follow the instructions below to set up [text-generation-webui](https://github.com/oobabooga/text-generation-webui), add your desired models, and run inference.

This walkthrough was done on an M1 MacBook Air (2020) with the following specs:
```
Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Model Number: Z125000NMCH/A
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 10151.61.4
OS Loader Version: 10151.61.4
```
Step 1: Clone [text-generation-webui](https://github.com/oobabooga/text-generation-webui), a Gradio web UI for Large Language Models. It supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models. It hides many of llama.cpp's complexities and has a well-defined interface that is easy to use.

`git clone https://github.com/oobabooga/text-generation-webui.git`
Step 2: Follow the [text-generation-webui](https://github.com/oobabooga/text-generation-webui) instructions to run the application locally.

1. Go to the cloned folder.
2. Run `./start_macos.sh`; it will output the following: ![Alt text](image.png)
3. Open a browser and go to the URL it prints, e.g. `http://127.0.0.1:7860/`.
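If port 7860 is already taken, the web UI's server accepts flags such as `--listen-port`; passing them through `start_macos.sh` is an assumption worth verifying against the project's README:

```bash
# Assumption: start_macos.sh forwards extra flags to server.py.
# --listen-port and --auto-launch are text-generation-webui server flags.
./start_macos.sh --listen-port 7861 --auto-launch
```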
Step 3: Select the quantization method you want to use, download the quantized model, and run inference on the quantized Gorilla model.

1. Go to the `Model` tab and find `Download model or LoRA`. For example, to get the q3_K_M GGUF quantization of `gorilla-7b-hf-v1`, input `gorilla-llm/gorilla-7b-hf-v1-gguf` as the repository and `gorilla-7b-hf-v1-q3_K_M` as the filename, then click `Download`. It will report that it is downloading the file to `models/`. ![Alt text](image-2.png)
2. After the download finishes, select the model (`gorilla-7b-hf-v1-q3_K_M` in this demonstration) and click `Load`. If you have a laptop GPU available, increasing `n-gpu-layers` accelerates inference. ![Alt text](image-1.png)
3. After loading, it will show a confirmation message like the following. ![Alt text](image-3.png)
4. Then go to the `Chat` page and use the default settings for Llama-based quantized models. ![Alt text](image-4.png)
5. *Real-time inference* video demo: ![Alt text](Presentation1_final.gif)
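If you'd rather skip the UI after downloading in Step 3, the same GGUF file also runs directly under llama.cpp's CLI. A minimal sketch, assuming the llama.cpp checkout from the quantization walkthrough above; the prompt is illustrative, not Gorilla's exact prompt template:

```bash
# Paths and prompt are illustrative assumptions.
./main -m models/gorilla-7b-hf-v1-q3_K_M.gguf \
    -p "I want to translate text from English to French." \
    -n 256 \
    --n-gpu-layers 1   # offload layers to the M1 GPU via Metal
```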
Integration with Gorilla-CLI coming soon ...
### [Optional] Batch Inference on a Prompt File
After downloading the model, you need to make a jsonl file containing all the questions you want to run through Gorilla. Here is [one example](https://github.com/ShishirPatil/gorilla/blob/main/inference/example_questions/example_questions.jsonl):
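A shell heredoc is a quick way to produce such a file; the field names below are illustrative assumptions, so mirror the schema of the linked example file rather than trusting this sketch:

```bash
# Illustrative schema only: copy the exact field names from the
# linked example_questions.jsonl before running batch inference.
cat > my_questions.jsonl <<'EOF'
{"question_id": 1, "text": "I want to generate an image from text."}
{"question_id": 2, "text": "I want to translate English to German."}
EOF
```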

inference/image-1.png (473 KB)

inference/image-2.png (206 KB)

inference/image-3.png (469 KB)

inference/image-4.png (368 KB)

inference/image.png (128 KB)
