This repository was archived by the owner on Mar 17, 2025. It is now read-only.

Commit 94421ea

renning22, Trangle, nathanstitt, merrymercy, and leiwen83 authored
Merge 1126 (#7)

* Remove hardcode flash-attn disable setting (lm-sys#2342)
* Document turning off proxy_buffering when api is streaming (lm-sys#2337)
* Simplify huggingface api example (lm-sys#2355)
* Update sponsor logos (lm-sys#2367)
* if LOGDIR is empty, then don't try output log to local file (lm-sys#2357) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* add best_of and use_beam_search for completions interface (lm-sys#2348) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* Extract upvote/downvote from log files (lm-sys#2369)
* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)
* Improve doc (lm-sys#2371)
* add best_of and use_beam_search for completions interface (lm-sys#2372) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* update monkey patch for llama2 (lm-sys#2379)
* Make E5 adapter more restrict to reduce mismatch (lm-sys#2381)
* Update UI and sponsers (lm-sys#2387)
* Use fsdp api for save save (lm-sys#2390)
* Release v0.2.27
* Spicyboros + airoboros 2.2 template update. (lm-sys#2392) Co-authored-by: Jon Durbin <[email protected]>
* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398) Co-authored-by: wuyongyu <[email protected]>
* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)
* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)
* Release a v0.2.28 with bug fixes and more test cases
* Fix model_worker error (lm-sys#2404)
* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression model (lm-sys#2402)
* Rename twitter to X (lm-sys#2406)
* Update huggingface_api.py (lm-sys#2409)
* Add support for baichuan2 models (lm-sys#2408)
* Fixed character overlap issue when api streaming output (lm-sys#2431)
* Support custom conversation template in multi_model_worker (lm-sys#2434)
* Add Ascend NPU support (lm-sys#2422)
* Add raw conversation template (lm-sys#2417) (lm-sys#2418)
* Improve docs & UI (lm-sys#2436)
* Fix Salesforce xgen inference (lm-sys#2350)
* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416) Co-authored-by: Lianmin Zheng <[email protected]>
* Add falcon 180B chat conversation template (lm-sys#2384)
* Improve docs (lm-sys#2438)
* add dtype and seed (lm-sys#2430)
* Data cleaning scripts for dataset release (lm-sys#2440)
* merge google/flan based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)
* Fix docs
* Update UI (lm-sys#2446)
* Add Optional SSL Support to controller.py (lm-sys#2448)
* Format & Improve docs
* Release v0.2.29 (lm-sys#2450)
* Show terms of use as an JS alert (lm-sys#2461)
* vllm worker awq quantization update (lm-sys#2463) Co-authored-by: 董晓龙 <[email protected]>
* Fix falcon chat template (lm-sys#2464)
* Fix chunk handling when partial chunks are returned (lm-sys#2485)
* Update openai_api_server.py to add an SSL option (lm-sys#2484)
* Update vllm_worker.py (lm-sys#2482)
* fix typo quantization (lm-sys#2469)
* fix vllm quanziation args
* Update README.md (lm-sys#2492)
* Huggingface api worker (lm-sys#2456)
* Update links to lmsys-chat-1m (lm-sys#2497)
* Update train code to support the new tokenizer (lm-sys#2498)
* Third Party UI Example (lm-sys#2499)
* Add metharme (pygmalion) conversation template (lm-sys#2500)
* Optimize for proper flash attn causal handling (lm-sys#2503)
* Add Mistral AI instruction template (lm-sys#2483)
* Update monitor & plots (lm-sys#2506)
* Release v0.2.30 (lm-sys#2507)
* Fix for single turn dataset (lm-sys#2509)
* replace os.getenv with os.path.expanduser because the first one doesn… (lm-sys#2515) Co-authored-by: khalil <[email protected]>
* Fix arena (lm-sys#2522)
* Update Dockerfile (lm-sys#2524)
* add Llama2ChangAdapter (lm-sys#2510)
* Add ExllamaV2 Inference Framework Support. (lm-sys#2455)
* Improve docs (lm-sys#2534)
* Fix warnings for new gradio versions (lm-sys#2538)
* revert the gradio change; now works for 3.40
* Improve chat templates (lm-sys#2539)
* Add Zephyr 7B Alpha (lm-sys#2535)
* Improve Support for Mistral-Instruct (lm-sys#2547)
* correct max_tokens by context_length instead of raise exception (lm-sys#2544)
* Revert "Improve Support for Mistral-Instruct" (lm-sys#2552)
* Fix Mistral template (lm-sys#2529)
* Add additional Informations from the vllm worker (lm-sys#2550)
* Make FastChat work with LMSYS-Chat-1M Code (lm-sys#2551)
* Create `tags` attribute to fix `MarkupError` in rich CLI (lm-sys#2553)
* move BaseModelWorker outside serve.model_worker to make it independent (lm-sys#2531)
* Misc style and bug fixes (lm-sys#2559)
* Fix README.md (lm-sys#2561)
* release v0.2.31 (lm-sys#2563)
* resolves lm-sys#2542 modify dockerfile to upgrade cuda to 12.2.0 and pydantic 1.10.13 (lm-sys#2565)
* Add airoboros_v3 chat template (llama-2 format) (lm-sys#2564)
* Add Xwin-LM V0.1, V0.2 support (lm-sys#2566)
* Fixed model_worker generate_gate may blocked main thread (lm-sys#2540) (lm-sys#2562)
* feat: add claude-v2 (lm-sys#2571)
* Update vigogne template (lm-sys#2580)
* Fix issue lm-sys#2568: --device mps led to TypeError: forward() got an unexpected keyword argument 'padding_mask'. (lm-sys#2579)
* Add Mistral-7B-OpenOrca conversation_temmplate (lm-sys#2585)
* docs: bit misspell comments model adapter default template name conversation (lm-sys#2594)
* Update Mistral template (lm-sys#2581)
* Fix <s> in mistral template
* Update README.md (vicuna-v1.3 -> vicuna-1.5) (lm-sys#2592)
* Update README.md to highlight chatbot arena (lm-sys#2596)
* Add Lemur model (lm-sys#2584) Co-authored-by: Roberto Ugolotti <[email protected]>
* add trust_remote_code=True in BaseModelAdapter (lm-sys#2583)
* Openai interface add use beam search and best of 2 (lm-sys#2442) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* Update qwen and add pygmalion (lm-sys#2607)
* feat: Support model AquilaChat2 (lm-sys#2616)
* Added settings vllm (lm-sys#2599) Co-authored-by: bodza <[email protected]> Co-authored-by: bodza <[email protected]>
* [Logprobs] Support logprobs=1 (lm-sys#2612)
* release v0.2.32
* fix: Fix for OpenOrcaAdapter to return correct conversation template (lm-sys#2613)
* Make fastchat.serve.model_worker to take debug argument (lm-sys#2628) Co-authored-by: hi-jin <[email protected]>
* openchat 3.5 model support (lm-sys#2638)
* xFastTransformer framework support (lm-sys#2615)
* feat: support custom models vllm serving (lm-sys#2635)
* kill only fastchat process (lm-sys#2641)
* Update server_arch.png
* Use conv.update_last_message api in mt-bench answer generation (lm-sys#2647)
* Improve Azure OpenAI interface (lm-sys#2651)
* Add required_temp support in jsonl format to support flexible temperature setting for gen_api_answer (lm-sys#2653)
* Pin openai version < 1 (lm-sys#2658)
* Remove exclude_unset parameter (lm-sys#2654)
* Revert "Remove exclude_unset parameter" (lm-sys#2666)
* added support for CodeGeex(2) (lm-sys#2645)
* add chatglm3 conv template support in conversation.py (lm-sys#2622)
* UI and model change (lm-sys#2672) Co-authored-by: Lianmin Zheng <[email protected]>
* train_flant5: fix typo (lm-sys#2673)
* Fix gpt template (lm-sys#2674)
* Update README.md (lm-sys#2679)
* feat: support template's stop_str as list (lm-sys#2678)
* Update exllama_v2.md (lm-sys#2680)
* save model under deepspeed (lm-sys#2689)
* Adding SSL support for model workers and huggingface worker (lm-sys#2687)
* Check the max_new_tokens <= 0 in openai api server (lm-sys#2688)
* Add Microsoft/Orca-2-7b and update model support docs (lm-sys#2714)
* fix tokenizer of chatglm2 (lm-sys#2711)
* Template for using Deepseek code models (lm-sys#2705)
* add support for Chinese-LLaMA-Alpaca (lm-sys#2700)
* Make --load-8bit flag work with weights in safetensors format (lm-sys#2698)
* Format code and minor bug fix (lm-sys#2716)
* Bump version to v0.2.33 (lm-sys#2717)
* fix tokenizer.pad_token attribute error (lm-sys#2710)
* support stable-vicuna model (lm-sys#2696)
* Exllama cache 8bit (lm-sys#2719)
* Add Yi support (lm-sys#2723)
* Add Hermes 2.5 [fixed] (lm-sys#2725)
* Fix Hermes2Adapter (lm-sys#2727)
* Fix YiAdapter (lm-sys#2730)
* add trust_remote_code argument (lm-sys#2715)
* Add revision arg to MT Bench answer generation (lm-sys#2728)
* Fix MPS backend 'index out of range' error (lm-sys#2737)
* add starling support (lm-sys#2738)

---------

Signed-off-by: Lei Wen <[email protected]>
Co-authored-by: Trangle <[email protected]>
Co-authored-by: Nathan Stitt <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: leiwen83 <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Rayrtfr <[email protected]>
Co-authored-by: wuyongyu <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Jeff (Zhen) Wang <[email protected]>
Co-authored-by: karshPrime <[email protected]>
Co-authored-by: obitolyz <[email protected]>
Co-authored-by: Shangwei Chen <[email protected]>
Co-authored-by: HyungJin Ahn <[email protected]>
Co-authored-by: zhangsibo1129 <[email protected]>
Co-authored-by: Tobias Birchler <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Mingdao Liu <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: Brandon Biggs <[email protected]>
Co-authored-by: dongxiaolong <[email protected]>
Co-authored-by: 董晓龙 <[email protected]>
Co-authored-by: Siddartha Naidu <[email protected]>
Co-authored-by: shuishu <[email protected]>
Co-authored-by: Andrew Aikawa <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: enochlev <[email protected]>
Co-authored-by: AlpinDale <[email protected]>
Co-authored-by: Lé <[email protected]>
Co-authored-by: Toshiki Kataoka <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: dubaoquan404 <[email protected]>
Co-authored-by: Chang W. Lee <[email protected]>
Co-authored-by: theScotchGame <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Stephen Horvath <[email protected]>
Co-authored-by: liunux4odoo <[email protected]>
Co-authored-by: Norman Mu <[email protected]>
Co-authored-by: Sebastian Bodza <[email protected]>
Co-authored-by: Tianle (Tim) Li <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jingcheng Hu <[email protected]>
Co-authored-by: lvxuan <[email protected]>
Co-authored-by: cOng <[email protected]>
Co-authored-by: bofeng huang <[email protected]>
Co-authored-by: Phil-U-U <[email protected]>
Co-authored-by: Wayne Spangenberg <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Rohan Gupta <[email protected]>
Co-authored-by: ugolotti <[email protected]>
Co-authored-by: Roberto Ugolotti <[email protected]>
Co-authored-by: edisonwd <[email protected]>
Co-authored-by: FangYin Cheng <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Srinath Janakiraman <[email protected]>
Co-authored-by: Jaeheon Jeong <[email protected]>
Co-authored-by: One <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: David <[email protected]>
Co-authored-by: Witold Wasiczko <[email protected]>
Co-authored-by: Peter Willemsen <[email protected]>
Co-authored-by: ZeyuTeng96 <[email protected]>
Co-authored-by: Forceless <[email protected]>
Co-authored-by: Jeff <[email protected]>
Co-authored-by: MrZhengXin <[email protected]>
Co-authored-by: Long Nguyen <[email protected]>
Co-authored-by: Elsa Granger <[email protected]>
Co-authored-by: Christopher Chou <[email protected]>
Co-authored-by: wangshuai09 <[email protected]>
Co-authored-by: amaleshvemula <[email protected]>
Co-authored-by: Zollty Tsou <[email protected]>
Co-authored-by: xuguodong1999 <[email protected]>
Co-authored-by: Michael J Kaye <[email protected]>
Co-authored-by: 152334H <[email protected]>
Co-authored-by: Jingsong-Yan <[email protected]>
Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
1 parent a887de7 commit 94421ea


62 files changed (+6801 −5987 lines)

assets/server_arch.png

Binary file changed (−10.1 KB); image diff not shown.

data/dummy_conversation.json

Lines changed: 4007 additions & 5345 deletions. Large diffs are not rendered by default.

docker/Dockerfile

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
-FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04
+FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04
 
 RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
 RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
 RUN python3.9 get-pip.py
-RUN pip3 install fschat
+RUN pip3 install fschat
+RUN pip3 install fschat[model_worker,webui] pydantic==1.10.13

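The Dockerfile change above swaps the CUDA base image and pins pydantic to 1.10.13. A quick way to sanity-check the rebuilt image is sketched below; the image tag is an arbitrary choice and the verification step is only an assumption about how one might confirm the pin, not part of this commit.

```bash
# Build the updated image from the repository root (tag name is arbitrary).
docker build -t fastchat:cuda12.2 -f docker/Dockerfile .

# Confirm the pinned pydantic version inside the image.
docker run --rm fastchat:cuda12.2 python3.9 -c "import pydantic; print(pydantic.VERSION)"
```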
docker/docker-compose.yml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ services:
             - driver: nvidia
               count: 1
               capabilities: [gpu]
-    entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.3}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.3}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
+    entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "${FASTCHAT_WORKER_MODEL_NAMES:-vicuna-7b-v1.5}", "--model-path", "${FASTCHAT_WORKER_MODEL_PATH:-lmsys/vicuna-7b-v1.5}", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "0.0.0.0", "--port", "21002"]
   fastchat-api-server:
     build:
       context: .

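The entrypoint above reads FASTCHAT_WORKER_MODEL_NAMES and FASTCHAT_WORKER_MODEL_PATH from the environment, so the new vicuna-7b-v1.5 default can be overridden without editing the file. A minimal sketch, assuming Docker Compose v2 and that `fastchat-model-worker` is the worker service name referenced by this compose file; the model values are illustrative:

```bash
# Run the worker with a different model than the vicuna-7b-v1.5 default.
export FASTCHAT_WORKER_MODEL_NAMES="vicuna-13b-v1.5"
export FASTCHAT_WORKER_MODEL_PATH="lmsys/vicuna-13b-v1.5"
docker compose -f docker/docker-compose.yml up -d fastchat-model-worker
```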
docs/commands/leaderboard.md

Lines changed: 11 additions & 0 deletions
@@ -24,3 +24,14 @@ scp atlas:/data/lmzheng/FastChat/fastchat/serve/monitor/elo_results_20230905.pkl
 ```
 wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/raw/main/leaderboard_table_20230905.csv
 ```
+
+### Update files on webserver
+```
+DATE=20231002
+
+rm -rf elo_results.pkl leaderboard_table.csv
+wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/elo_results_$DATE.pkl
+wget https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/leaderboard_table_$DATE.csv
+ln -s leaderboard_table_$DATE.csv leaderboard_table.csv
+ln -s elo_results_$DATE.pkl elo_results.pkl
+```

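The snippet added above hard-codes DATE; wrapping the same commands in a small script that takes the date as an argument makes repeated leaderboard updates less error-prone. A sketch only: the script name and argument handling are hypothetical, while the commands mirror the doc.

```bash
#!/usr/bin/env bash
# Hypothetical helper, e.g.: ./update_leaderboard.sh 20231002
set -euo pipefail
DATE="$1"

rm -rf elo_results.pkl leaderboard_table.csv
wget "https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/elo_results_${DATE}.pkl"
wget "https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/resolve/main/leaderboard_table_${DATE}.csv"
ln -s "leaderboard_table_${DATE}.csv" leaderboard_table.csv
ln -s "elo_results_${DATE}.pkl" elo_results.pkl
```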
docs/commands/webserver.md

Lines changed: 10 additions & 1 deletion
@@ -72,7 +72,16 @@ vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/temp
 <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.4.1/html2canvas.min.js"></script>
 ```
 
-2. Loading
+2. deprecation warnings
+```
+vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/deprecation.py
+```
+
+```
+def check_deprecated_parameters(
+```
+
+3. Loading
 ```
 vim /home/vicuna/anaconda3/envs/fastchat/lib/python3.9/site-packages/gradio/templates/frontend/assets/index-188ef5e8.js
 ```

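The doc above hard-codes the conda environment path when pointing `vim` at gradio's files. If the environment lives elsewhere, the install directory can be located programmatically first; a small sketch, assuming only that gradio is importable in the active environment and that the installed version still ships `deprecation.py`:

```bash
# Locate the installed gradio package, then open the file referenced in step 2.
GRADIO_DIR=$(python3 -c "import gradio, os; print(os.path.dirname(gradio.__file__))")
vim "$GRADIO_DIR/deprecation.py"
```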
docs/dataset_release.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+## Datasets
+We release the following datasets based on our projects and websites.
+
+- [LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
+- [Chatbot Arena Conversation Dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
+- [MT-bench Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)

docs/exllama_v2.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# ExllamaV2 GPTQ Inference Framework
+
+Integrated [ExllamaV2](https://github.com/turboderp/exllamav2) customized kernel into Fastchat to provide **Faster** GPTQ inference speed.
+
+**Note: Exllama not yet support embedding REST API.**
+
+## Install ExllamaV2
+
+Setup environment (please refer to [this link](https://github.com/turboderp/exllamav2#how-to) for more details):
+
+```bash
+git clone https://github.com/turboderp/exllamav2
+cd exllamav2
+pip install -e .
+```
+
+Chat with the CLI:
+```bash
+python3 -m fastchat.serve.cli \
+    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
+    --enable-exllama
+```
+
+Start model worker:
+```bash
+# Download quantized model from huggingface
+# Make sure you have git-lfs installed (https://git-lfs.com)
+git lfs install
+git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g
+
+# Load model with default configuration (max sequence length 4096, no GPU split setting).
+python3 -m fastchat.serve.model_worker \
+    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
+    --enable-exllama
+
+#Load model with max sequence length 2048, allocate 18 GB to CUDA:0 and 24 GB to CUDA:1.
+python3 -m fastchat.serve.model_worker \
+    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
+    --enable-exllama \
+    --exllama-max-seq-len 2048 \
+    --exllama-gpu-split 18,24
+```
+
+`--exllama-cache-8bit` can be used to enable 8-bit caching with exllama and save some VRAM.
+
+## Performance
+
+Reference: https://github.com/turboderp/exllamav2#performance
+
+
+| Model      | Mode         | Size  | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
+|------------|--------------|-------|-------|-----|------------|----------|------------|-------------|
+| Llama      | GPTQ         | 7B    | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
+| Llama      | GPTQ         | 13B   | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
+| Llama      | GPTQ         | 33B   | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
+| OpenLlama  | GPTQ         | 3B    | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
+| CodeLlama  | EXL2 4.0 bpw | 34B   | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
+| Llama2     | EXL2 3.0 bpw | 7B    | -     | -   | -          | -        | 195 t/s    | **224** t/s |
+| Llama2     | EXL2 4.0 bpw | 7B    | -     | -   | -          | -        | 164 t/s    | **197** t/s |
+| Llama2     | EXL2 5.0 bpw | 7B    | -     | -   | -          | -        | 144 t/s    | **160** t/s |
+| Llama2     | EXL2 2.5 bpw | 70B   | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
+| TinyLlama  | EXL2 3.0 bpw | 1.1B  | -     | -   | -          | -        | 536 t/s    | **635** t/s |
+| TinyLlama  | EXL2 4.0 bpw | 1.1B  | -     | -   | -          | -        | 509 t/s    | **590** t/s |

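Once an ExllamaV2-backed worker is registered with the controller, it is reachable through FastChat's OpenAI-compatible REST server like any other worker (embeddings excepted, per the note above). A hedged example, assuming `fastchat.serve.openai_api_server` is running on its default port 8000 and that the worker registered under the model name used below:

```bash
# Query the GPTQ worker through the OpenAI-compatible endpoint (model name is an assumption).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7B-1.1-GPTQ-4bit-128g",
        "messages": [{"role": "user", "content": "Hello, who are you?"}]
      }'
```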
docs/langchain_integration.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ Here, we use Vicuna as an example and use it for three endpoints: chat completio
 See a full list of supported models [here](../README.md#supported-models).
 
 ```bash
-python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.model_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-7b-v1.5
 ```
 
 Finally, launch the RESTful API server

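Because the worker above registers itself under OpenAI model names, LangChain or any OpenAI SDK client can be pointed at the local REST server through environment variables. A minimal sketch, assuming the API server from the following step listens on localhost:8000; the key value is a placeholder, since the local server does not validate it by default:

```bash
# Point OpenAI-compatible clients (including LangChain) at the local FastChat server.
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=EMPTY   # placeholder; not checked by the local server
```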
docs/model_support.md

Lines changed: 11 additions & 2 deletions
@@ -5,8 +5,10 @@
 - [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
   - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
 - Vicuna, Alpaca, LLaMA, Koala
-  - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3`
+  - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5`
 - [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
+- [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B)
+- [BAAI/AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B)
 - [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
 - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
 - [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
@@ -30,6 +32,8 @@
 - [NousResearch/Nous-Hermes-13b](https://huggingface.co/NousResearch/Nous-Hermes-13b)
 - [openaccess-ai-collective/manticore-13b-chat-pyg](https://huggingface.co/openaccess-ai-collective/manticore-13b-chat-pyg)
 - [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
+- [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5)
+- [Open-Orca/Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
 - [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
 - [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
 - [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
@@ -45,6 +49,11 @@
 - [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
 - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
 - [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
+- [Xwin-LM/Xwin-LM-7B-V0.1](https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1)
+- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
+- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
+- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
 - Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
   model above. To activate, must have `peft` in the model path. Note: If
@@ -64,7 +73,7 @@ python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
 You can run this example command to learn the code logic.
 
 ```
-python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.3
+python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
 ```
 
 You can add `--debug` to see the actual prompt sent to the model.

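For completeness, the `--debug` flag mentioned in the last context line is simply appended to the same CLI invocation; a short example using the vicuna-7b-v1.5 path referenced throughout this commit:

```bash
# Print the exact prompt sent to the model along with the generated reply.
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --debug
```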