This repository was archived by the owner on Mar 17, 2025. It is now read-only.

Commit dac3317

Authored by renning22, Trangle, nathanstitt, merrymercy, and leiwen83
Merge 0223 (#8)
* Remove hardcode flash-attn disable setting (lm-sys#2342)
* Document turning off proxy_buffering when api is streaming (lm-sys#2337)
* Simplify huggingface api example (lm-sys#2355)
* Update sponsor logos (lm-sys#2367)
* if LOGDIR is empty, then don't try output log to local file (lm-sys#2357) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* add best_of and use_beam_search for completions interface (lm-sys#2348) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* Extract upvote/downvote from log files (lm-sys#2369)
* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2370)
* Improve doc (lm-sys#2371)
* add best_of and use_beam_search for completions interface (lm-sys#2372) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* update monkey patch for llama2 (lm-sys#2379)
* Make E5 adapter more restrict to reduce mismatch (lm-sys#2381)
* Update UI and sponsers (lm-sys#2387)
* Use fsdp api for save save (lm-sys#2390)
* Release v0.2.27
* Spicyboros + airoboros 2.2 template update. (lm-sys#2392) Co-authored-by: Jon Durbin <[email protected]>
* bugfix of openai_api_server for fastchat.serve.vllm_worker (lm-sys#2398) Co-authored-by: wuyongyu <[email protected]>
* Revert "bugfix of openai_api_server for fastchat.serve.vllm_worker" (lm-sys#2400)
* Revert "add best_of and use_beam_search for completions interface" (lm-sys#2401)
* Release a v0.2.28 with bug fixes and more test cases
* Fix model_worker error (lm-sys#2404)
* Added google/flan models and fixed AutoModelForSeq2SeqLM when loading T5 compression model (lm-sys#2402)
* Rename twitter to X (lm-sys#2406)
* Update huggingface_api.py (lm-sys#2409)
* Add support for baichuan2 models (lm-sys#2408)
* Fixed character overlap issue when api streaming output (lm-sys#2431)
* Support custom conversation template in multi_model_worker (lm-sys#2434)
* Add Ascend NPU support (lm-sys#2422)
* Add raw conversation template (lm-sys#2417) (lm-sys#2418)
* Improve docs & UI (lm-sys#2436)
* Fix Salesforce xgen inference (lm-sys#2350)
* Add support for Phind-CodeLlama models (lm-sys#2415) (lm-sys#2416) Co-authored-by: Lianmin Zheng <[email protected]>
* Add falcon 180B chat conversation template (lm-sys#2384)
* Improve docs (lm-sys#2438)
* add dtype and seed (lm-sys#2430)
* Data cleaning scripts for dataset release (lm-sys#2440)
* merge google/flan based adapters: T5Adapter, CodeT5pAdapter, FlanAdapter (lm-sys#2411)
* Fix docs
* Update UI (lm-sys#2446)
* Add Optional SSL Support to controller.py (lm-sys#2448)
* Format & Improve docs
* Release v0.2.29 (lm-sys#2450)
* Show terms of use as an JS alert (lm-sys#2461)
* vllm worker awq quantization update (lm-sys#2463) Co-authored-by: 董晓龙 <[email protected]>
* Fix falcon chat template (lm-sys#2464)
* Fix chunk handling when partial chunks are returned (lm-sys#2485)
* Update openai_api_server.py to add an SSL option (lm-sys#2484)
* Update vllm_worker.py (lm-sys#2482)
* fix typo quantization (lm-sys#2469)
* fix vllm quanziation args
* Update README.md (lm-sys#2492)
* Huggingface api worker (lm-sys#2456)
* Update links to lmsys-chat-1m (lm-sys#2497)
* Update train code to support the new tokenizer (lm-sys#2498)
* Third Party UI Example (lm-sys#2499)
* Add metharme (pygmalion) conversation template (lm-sys#2500)
* Optimize for proper flash attn causal handling (lm-sys#2503)
* Add Mistral AI instruction template (lm-sys#2483)
* Update monitor & plots (lm-sys#2506)
* Release v0.2.30 (lm-sys#2507)
* Fix for single turn dataset (lm-sys#2509)
* replace os.getenv with os.path.expanduser because the first one doesn… (lm-sys#2515) Co-authored-by: khalil <[email protected]>
* Fix arena (lm-sys#2522)
* Update Dockerfile (lm-sys#2524)
* add Llama2ChangAdapter (lm-sys#2510)
* Add ExllamaV2 Inference Framework Support. (lm-sys#2455)
* Improve docs (lm-sys#2534)
* Fix warnings for new gradio versions (lm-sys#2538)
* revert the gradio change; now works for 3.40
* Improve chat templates (lm-sys#2539)
* Add Zephyr 7B Alpha (lm-sys#2535)
* Improve Support for Mistral-Instruct (lm-sys#2547)
* correct max_tokens by context_length instead of raise exception (lm-sys#2544)
* Revert "Improve Support for Mistral-Instruct" (lm-sys#2552)
* Fix Mistral template (lm-sys#2529)
* Add additional Informations from the vllm worker (lm-sys#2550)
* Make FastChat work with LMSYS-Chat-1M Code (lm-sys#2551)
* Create `tags` attribute to fix `MarkupError` in rich CLI (lm-sys#2553)
* move BaseModelWorker outside serve.model_worker to make it independent (lm-sys#2531)
* Misc style and bug fixes (lm-sys#2559)
* Fix README.md (lm-sys#2561)
* release v0.2.31 (lm-sys#2563)
* resolves lm-sys#2542 modify dockerfile to upgrade cuda to 12.2.0 and pydantic 1.10.13 (lm-sys#2565)
* Add airoboros_v3 chat template (llama-2 format) (lm-sys#2564)
* Add Xwin-LM V0.1, V0.2 support (lm-sys#2566)
* Fixed model_worker generate_gate may blocked main thread (lm-sys#2540) (lm-sys#2562)
* feat: add claude-v2 (lm-sys#2571)
* Update vigogne template (lm-sys#2580)
* Fix issue lm-sys#2568: --device mps led to TypeError: forward() got an unexpected keyword argument 'padding_mask'. (lm-sys#2579)
* Add Mistral-7B-OpenOrca conversation_temmplate (lm-sys#2585)
* docs: bit misspell comments model adapter default template name conversation (lm-sys#2594)
* Update Mistral template (lm-sys#2581)
* Fix <s> in mistral template
* Update README.md (vicuna-v1.3 -> vicuna-1.5) (lm-sys#2592)
* Update README.md to highlight chatbot arena (lm-sys#2596)
* Add Lemur model (lm-sys#2584) Co-authored-by: Roberto Ugolotti <[email protected]>
* add trust_remote_code=True in BaseModelAdapter (lm-sys#2583)
* Openai interface add use beam search and best of 2 (lm-sys#2442) Signed-off-by: Lei Wen <[email protected]> Co-authored-by: Lei Wen <[email protected]>
* Update qwen and add pygmalion (lm-sys#2607)
* feat: Support model AquilaChat2 (lm-sys#2616)
* Added settings vllm (lm-sys#2599) Co-authored-by: bodza <[email protected]> Co-authored-by: bodza <[email protected]>
* [Logprobs] Support logprobs=1 (lm-sys#2612)
* release v0.2.32
* fix: Fix for OpenOrcaAdapter to return correct conversation template (lm-sys#2613)
* Make fastchat.serve.model_worker to take debug argument (lm-sys#2628) Co-authored-by: hi-jin <[email protected]>
* openchat 3.5 model support (lm-sys#2638)
* xFastTransformer framework support (lm-sys#2615)
* feat: support custom models vllm serving (lm-sys#2635)
* kill only fastchat process (lm-sys#2641)
* Update server_arch.png
* Use conv.update_last_message api in mt-bench answer generation (lm-sys#2647)
* Improve Azure OpenAI interface (lm-sys#2651)
* Add required_temp support in jsonl format to support flexible temperature setting for gen_api_answer (lm-sys#2653)
* Pin openai version < 1 (lm-sys#2658)
* Remove exclude_unset parameter (lm-sys#2654)
* Revert "Remove exclude_unset parameter" (lm-sys#2666)
* added support for CodeGeex(2) (lm-sys#2645)
* add chatglm3 conv template support in conversation.py (lm-sys#2622)
* UI and model change (lm-sys#2672) Co-authored-by: Lianmin Zheng <[email protected]>
* train_flant5: fix typo (lm-sys#2673)
* Fix gpt template (lm-sys#2674)
* Update README.md (lm-sys#2679)
* feat: support template's stop_str as list (lm-sys#2678)
* Update exllama_v2.md (lm-sys#2680)
* save model under deepspeed (lm-sys#2689)
* Adding SSL support for model workers and huggingface worker (lm-sys#2687)
* Check the max_new_tokens <= 0 in openai api server (lm-sys#2688)
* Add Microsoft/Orca-2-7b and update model support docs (lm-sys#2714)
* fix tokenizer of chatglm2 (lm-sys#2711)
* Template for using Deepseek code models (lm-sys#2705)
* add support for Chinese-LLaMA-Alpaca (lm-sys#2700)
* Make --load-8bit flag work with weights in safetensors format (lm-sys#2698)
* Format code and minor bug fix (lm-sys#2716)
* Bump version to v0.2.33 (lm-sys#2717)
* fix tokenizer.pad_token attribute error (lm-sys#2710)
* support stable-vicuna model (lm-sys#2696)
* Exllama cache 8bit (lm-sys#2719)
* Add Yi support (lm-sys#2723)
* Add Hermes 2.5 [fixed] (lm-sys#2725)
* Fix Hermes2Adapter (lm-sys#2727)
* Fix YiAdapter (lm-sys#2730)
* add trust_remote_code argument (lm-sys#2715)
* Add revision arg to MT Bench answer generation (lm-sys#2728)
* Fix MPS backend 'index out of range' error (lm-sys#2737)
* add starling support (lm-sys#2738)
* Add deepseek chat (lm-sys#2760)
* a convenient script for spinning up the API with Model Workers (lm-sys#2790)
* Prevent returning partial stop string in vllm worker (lm-sys#2780)
* Update UI and new models (lm-sys#2762)
* Support MetaMath (lm-sys#2748)
* Use common logging code in the OpenAI API server (lm-sys#2758) Co-authored-by: Warren Francis <[email protected]>
* Show how to turn on experiment tracking for fine-tuning (lm-sys#2742) Co-authored-by: Morgan McGuire <[email protected]>
* Support xDAN-L1-Chat Model (lm-sys#2732)
* Format code
* Update the version to 0.2.34 (lm-sys#2793)
* add dolphin (lm-sys#2794)
* Fix tiny typo (lm-sys#2805)
* Add instructions for evaluating on MT bench using vLLM (lm-sys#2770)
* Update README.md
* Add SOLAR-10.7b Instruct Model (lm-sys#2826)
* Update README.md (lm-sys#2852)
* fix: 'compeletion' typo (lm-sys#2847)
* Add Tunnelmole as an open source alternative to ngrok and include usage instructions (lm-sys#2846)
* update readme
* update mt-bench readme
* Add support for CatPPT (lm-sys#2840)
* Add functionality to ping AI2 InferD endpoints for tulu 2 (lm-sys#2832) Co-authored-by: Sam Skjonsberg <[email protected]>
* add download models from www.modelscope.cn (lm-sys#2830) Co-authored-by: mulin.lyh <[email protected]>
* Fix conv_template of chinese alpaca 2 (lm-sys#2812)
* add bagel model adapter (lm-sys#2814)
* add root_path argument to gradio web server. (lm-sys#2807) Co-authored-by: bertls <[email protected]>
* Import `accelerate` locally to avoid it as a strong dependency (lm-sys#2820)
* Replace dict merge with unpacking for compatibility of 3.8 in vLLM worker (lm-sys#2824) Signed-off-by: rudeigerc <[email protected]>
* Format code (lm-sys#2854)
* Openai API migrate (lm-sys#2765)
* fix openai api server docs
* Add a16z as a sponser
* Add new models (Perplexity, gemini) & Separate GPT versions (lm-sys#2856) Co-authored-by: Wei-Lin Chiang <[email protected]>
* Clean error messages (lm-sys#2857)
* Update docs (lm-sys#2858)
* Modify doc description (lm-sys#2859)
* Fix the problem of not using the decoding method corresponding to the base model in peft mode (lm-sys#2865)
* update a new sota model on MT-Bench which touch an 8.8 scores. (lm-sys#2864)
* NPU needs to be initialized when starting a new process (lm-sys#2843)
* Fix the problem with "vllm + chatglm3" (lm-sys#2845) (lm-sys#2876) Co-authored-by: 姚峰 <[email protected]>
* Update token spacing for mistral conversation.py (lm-sys#2872)
* check if hm in models before deleting to avoid errors (lm-sys#2870) Co-authored-by: Your Name <[email protected]>
* Add TinyLlama (lm-sys#2889)
* Fix bug that model doesn't automatically switch peft adapter (lm-sys#2884)
* Update web server commands (lm-sys#2869)
* fix the tokenize process and prompt template of chatglm3 (lm-sys#2883) Co-authored-by: 章焕锭 <[email protected]>
* Add `Notus` support (lm-sys#2813) Co-authored-by: alvarobartt <[email protected]>
* feat: support anthropic api with api_dict (lm-sys#2879)
* Update model_adapter.py (lm-sys#2895)
* leaderboard code update (lm-sys#2867)
* fix: change order of SEQUENCE_LENGTH_KEYS (lm-sys#2925)
* fix baichuan:apply_prompt_template call args error (lm-sys#2921) Co-authored-by: Zheng Hao <[email protected]>
* Fix a typo in openai_api_server.py (lm-sys#2905)
* feat: use variables OPENAI_MODEL_LIST (lm-sys#2907)
* Add TenyxChat-7B-v1 model (lm-sys#2901) Co-authored-by: sarath@L3 <[omitted]>
* add support for iei yuan2.0 (https://huggingface.co/IEITYuan) (lm-sys#2919)
* nous-hermes-2-mixtral-dpo (lm-sys#2922)
* Bump the version to 0.2.35 (lm-sys#2927)
* fix specify local path issue use model from www.modelscope.cn (lm-sys#2934) Co-authored-by: mulin.lyh <[email protected]>
* support openai embedding for topic clustering (lm-sys#2729)
* Remove duplicate API endpoint (lm-sys#2949)
* Update Hermes Mixtral (lm-sys#2938)
* Enablement of REST API Usage within Google Colab Free Tier (lm-sys#2940)
* Create a new worker implementation for Apple MLX (lm-sys#2937)
* feat: support Model Yuan2.0, a new generation Fundamental Large Language Model developed by IEIT System (lm-sys#2936)
* Fix the pooling method of BGE embedding model (lm-sys#2926)
* format code
* SGLang Worker (lm-sys#2928)
* Fix sglang worker (lm-sys#2953)
* Update mlx_worker to be async (lm-sys#2958)
* Integrate LightLLM into serve worker (lm-sys#2888)
* Copy button (lm-sys#2963)
* feat: train with template (lm-sys#2951)
* fix content maybe a str (lm-sys#2968)
* Adding download folder information in README (lm-sys#2972)
* use cl100k_base as the default tiktoken encoding (lm-sys#2974) Signed-off-by: bjwswang <[email protected]>
* Update README.md (lm-sys#2975)
* Fix tokenizer for vllm worker (lm-sys#2984)
* update yuan2.0 generation (lm-sys#2989)
* fix: tokenization mismatch when training with different templates (lm-sys#2996)
* fix: inconsistent tokenization by llama tokenizer (lm-sys#3006)
* Fix type hint for play_a_match_single (lm-sys#3008)
* code update (lm-sys#2997)
* Update model_support.md (lm-sys#3016)
* Update lightllm_integration.md (lm-sys#3014)
* Upgrade gradio to 4.17 (lm-sys#3027)
* Update MLX integration to use new generate_step function signature (lm-sys#3021)
* Update readme (lm-sys#3028)
* Update gradio version in `pyproject.toml` and fix a bug (lm-sys#3029)
* Update gradio demo and API model providers (lm-sys#3030)
* Gradio Web Server for Multimodal Models (lm-sys#2960) Co-authored-by: Lianmin Zheng <[email protected]>
* Migrate the gradio server to openai v1 (lm-sys#3032)
* Update version to 0.2.36 (lm-sys#3033) Co-authored-by: Wei-Lin Chiang <[email protected]>
* Add llava 34b template (lm-sys#3034)
* Update model support (lm-sys#3040)
* Add psutil to pyproject.toml dependencies (lm-sys#3039)
* Fix SGLang worker (lm-sys#3045)
* Random VQA Sample button for VLM direct chat (lm-sys#3041)
* Update arena.md to fix link (lm-sys#3051)
* multi inference

---------

Signed-off-by: Lei Wen <[email protected]>
Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: bjwswang <[email protected]>
Co-authored-by: Trangle <[email protected]>
Co-authored-by: Nathan Stitt <[email protected]>
Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: leiwen83 <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Jon Durbin <[email protected]>
Co-authored-by: Rayrtfr <[email protected]>
Co-authored-by: wuyongyu <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Jeff (Zhen) Wang <[email protected]>
Co-authored-by: karshPrime <[email protected]>
Co-authored-by: obitolyz <[email protected]>
Co-authored-by: Shangwei Chen <[email protected]>
Co-authored-by: HyungJin Ahn <[email protected]>
Co-authored-by: zhangsibo1129 <[email protected]>
Co-authored-by: Tobias Birchler <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Mingdao Liu <[email protected]>
Co-authored-by: Ying Sheng <[email protected]>
Co-authored-by: Brandon Biggs <[email protected]>
Co-authored-by: dongxiaolong <[email protected]>
Co-authored-by: 董晓龙 <[email protected]>
Co-authored-by: Siddartha Naidu <[email protected]>
Co-authored-by: shuishu <[email protected]>
Co-authored-by: Andrew Aikawa <[email protected]>
Co-authored-by: Liangsheng Yin <[email protected]>
Co-authored-by: enochlev <[email protected]>
Co-authored-by: AlpinDale <[email protected]>
Co-authored-by: Lé <[email protected]>
Co-authored-by: Toshiki Kataoka <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: khalil <[email protected]>
Co-authored-by: dubaoquan404 <[email protected]>
Co-authored-by: Chang W. Lee <[email protected]>
Co-authored-by: theScotchGame <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Stephen Horvath <[email protected]>
Co-authored-by: liunux4odoo <[email protected]>
Co-authored-by: Norman Mu <[email protected]>
Co-authored-by: Sebastian Bodza <[email protected]>
Co-authored-by: Tianle (Tim) Li <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jingcheng Hu <[email protected]>
Co-authored-by: lvxuan <[email protected]>
Co-authored-by: cOng <[email protected]>
Co-authored-by: bofeng huang <[email protected]>
Co-authored-by: Phil-U-U <[email protected]>
Co-authored-by: Wayne Spangenberg <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Rohan Gupta <[email protected]>
Co-authored-by: ugolotti <[email protected]>
Co-authored-by: Roberto Ugolotti <[email protected]>
Co-authored-by: edisonwd <[email protected]>
Co-authored-by: FangYin Cheng <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: bodza <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Srinath Janakiraman <[email protected]>
Co-authored-by: Jaeheon Jeong <[email protected]>
Co-authored-by: One <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: David <[email protected]>
Co-authored-by: Witold Wasiczko <[email protected]>
Co-authored-by: Peter Willemsen <[email protected]>
Co-authored-by: ZeyuTeng96 <[email protected]>
Co-authored-by: Forceless <[email protected]>
Co-authored-by: Jeff <[email protected]>
Co-authored-by: MrZhengXin <[email protected]>
Co-authored-by: Long Nguyen <[email protected]>
Co-authored-by: Elsa Granger <[email protected]>
Co-authored-by: Christopher Chou <[email protected]>
Co-authored-by: wangshuai09 <[email protected]>
Co-authored-by: amaleshvemula <[email protected]>
Co-authored-by: Zollty Tsou <[email protected]>
Co-authored-by: xuguodong1999 <[email protected]>
Co-authored-by: Michael J Kaye <[email protected]>
Co-authored-by: 152334H <[email protected]>
Co-authored-by: Jingsong-Yan <[email protected]>
Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: Chris Kerwell Gresla <[email protected]>
Co-authored-by: pandada8 <[email protected]>
Co-authored-by: Isaac Ong <[email protected]>
Co-authored-by: Warren Francis <[email protected]>
Co-authored-by: Warren Francis <[email protected]>
Co-authored-by: Morgan McGuire <[email protected]>
Co-authored-by: Morgan McGuire <[email protected]>
Co-authored-by: xDAN-AI <[email protected]>
Co-authored-by: Ikko Eltociear Ashimine <[email protected]>
Co-authored-by: Robbie <[email protected]>
Co-authored-by: Rishiraj Acharya <[email protected]>
Co-authored-by: Nathan Lambert <[email protected]>
Co-authored-by: Sam Skjonsberg <[email protected]>
Co-authored-by: liuyhwangyh <[email protected]>
Co-authored-by: mulin.lyh <[email protected]>
Co-authored-by: stephanbertl <[email protected]>
Co-authored-by: bertls <[email protected]>
Co-authored-by: Chirag Jain <[email protected]>
Co-authored-by: Yuchen Cheng <[email protected]>
Co-authored-by: Shuo Yang <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: JQ <[email protected]>
Co-authored-by: yaofeng <[email protected]>
Co-authored-by: 姚峰 <[email protected]>
Co-authored-by: Michael <[email protected]>
Co-authored-by: Josh NE <[email protected]>
Co-authored-by: Your Name <[email protected]>
Co-authored-by: WHDY <[email protected]>
Co-authored-by: 章焕锭 <[email protected]>
Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Co-authored-by: alvarobartt <[email protected]>
Co-authored-by: Zheng Hao <[email protected]>
Co-authored-by: Ren Xuancheng <[email protected]>
Co-authored-by: Sarath Shekkizhar <[email protected]>
Co-authored-by: wangpengfei1013 <[email protected]>
Co-authored-by: Alexandre Strube <[email protected]>
Co-authored-by: Teknium <[email protected]>
Co-authored-by: Cristian Gutiérrez <[email protected]>
Co-authored-by: ali asaria <[email protected]>
Co-authored-by: wulixuan <[email protected]>
Co-authored-by: staoxiao <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: dheeraj-326 <[email protected]>
Co-authored-by: bjwswang <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Ted Li <[email protected]>
Co-authored-by: Shukant Pal <[email protected]>
Co-authored-by: Lisa Dunlap <[email protected]>
Co-authored-by: Logan Kilpatrick <[email protected]>
Parent: 94421ea · Commit: dac3317


60 files changed (+6,308 additions, -846 deletions)

README.md

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,10 @@ We are focused to support Llama2 at scale now. If you want any other models, ple
 
 ## Dev Log
 
+### 2024-02
+
+Sync upstream changes
+
 ### 2023-09
 
 Sync upstream changes

docs/arena.md

Lines changed: 7 additions & 6 deletions
@@ -5,10 +5,11 @@ We invite the entire community to join this benchmarking effort by contributing
 ## How to add a new model
 If you want to see a specific model in the arena, you can follow the methods below.
 
-- Method 1: Hosted by LMSYS.
-1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md#how-to-support-a-new-model).
-2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
+### Method 1: Hosted by 3rd party API providers or yourself
+If you have a model hosted by a 3rd party API provider or yourself, please give us the access to an API endpoint.
+- We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py) for calling OpenAI models.
+- If you have your own API protocol, please follow the [instructions](model_support.md) to add them. Contribute your code by sending a pull request.
 
-- Method 2: Hosted by 3rd party API providers or yourself.
-1. If you have a model hosted by a 3rd party API provider or yourself, please give us an API endpoint. We prefer OpenAI-compatible APIs, so we can reuse our [code](https://github.com/lm-sys/FastChat/blob/33dca5cf12ee602455bfa9b5f4790a07829a2db7/fastchat/serve/gradio_web_server.py#L333-L358) for calling OpenAI models.
-2. You can use FastChat's OpenAI API [server](openai_api.md) to serve your model with OpenAI-compatible APIs and provide us with the endpoint.
+### Method 2: Hosted by LMSYS
+1. Contribute the code to support this model in FastChat by submitting a pull request. See [instructions](model_support.md).
+2. After the model is supported, we will try to schedule some compute resources to host the model in the arena. However, due to the limited resources we have, we may not be able to serve every model. We will select the models based on popularity, quality, diversity, and other factors.
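To illustrate what an OpenAI-compatible endpoint means in practice, here is a minimal sketch of the kind of call arena code can reuse against a contributed endpoint; the base URL, API key, and model name below are placeholders, not values taken from this commit.

```python
# Minimal sketch of calling an OpenAI-compatible endpoint with the official
# openai client (>= 1.0). The URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-provider.example.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```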

docs/commands/webserver.md

Lines changed: 4 additions & 1 deletion
@@ -24,10 +24,13 @@ python3 -m fastchat.serve.test_message --model vicuna-13b --controller http://lo
 
 cd fastchat_logs/server0
 
+python3 -m fastchat.serve.huggingface_api_worker --model-info-file ~/elo_results/register_hf_api_models.json
+
 export OPENAI_API_KEY=
 export ANTHROPIC_API_KEY=
+export GCP_PROJECT_ID=
 
-python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 10 --add-chatgpt --add-claude --add-palm --anony-only --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
+python3 -m fastchat.serve.gradio_web_server_multi --controller http://localhost:21001 --concurrency 50 --add-chatgpt --add-claude --add-palm --elo ~/elo_results/elo_results.pkl --leaderboard-table-file ~/elo_results/leaderboard_table.csv --register ~/elo_results/register_oai_models.json --show-terms
 
 python3 backup_logs.py
 ```
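Between the worker launches and the web server launch above, it can help to confirm which models the controller has actually registered. A small sketch, assuming the controller is on its default port 21001 and exposes the `/list_models` route that the web server itself uses:

```python
# Sketch: ask the FastChat controller which models are currently registered.
# Assumes the controller listens on localhost:21001 and exposes POST /list_models.
import requests

controller_url = "http://localhost:21001"
ret = requests.post(controller_url + "/list_models")
print(ret.json()["models"])
```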

docs/lightllm_integration.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# LightLLM Integration
+You can use [LightLLM](https://github.com/ModelTC/lightllm) as an optimized worker implementation in FastChat.
+It offers advanced continuous batching and a much higher (~10x) throughput.
+See the supported models [here](https://github.com/ModelTC/lightllm?tab=readme-ov-file#supported-model-list).
+
+## Instructions
+1. Please refer to the [Get started](https://github.com/ModelTC/lightllm?tab=readme-ov-file#get-started) to install LightLLM. Or use [Pre-built image](https://github.com/ModelTC/lightllm?tab=readme-ov-file#container)
+
+2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the LightLLM worker (`fastchat.serve.lightllm_worker`). All other commands such as controller, gradio web server, and OpenAI API server are kept the same. Refer to [--max_total_token_num](https://github.com/ModelTC/lightllm/blob/4a9824b6b248f4561584b8a48ae126a0c8f5b000/docs/ApiServerArgs.md?plain=1#L23) to understand how to calculate the `--max_total_token_num` argument.
+```
+python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
+```
+
+If you want to use quantized weight and kv cache for inference, try
+
+```
+python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
+```
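The linked LightLLM documentation explains how to size `--max_total_token_num`; intuitively, it is bounded by how many KV-cache entries fit in the GPU memory left after loading the weights. The sketch below is a back-of-the-envelope estimate only, with made-up memory figures for a LLaMA-style 7B model, not the official LightLLM formula.

```python
# Back-of-the-envelope estimate of how many KV-cache tokens fit on one GPU.
# Illustrative only; follow the LightLLM ApiServerArgs docs for real sizing.
def estimate_max_total_token_num(
    gpu_mem_gb=80,       # total GPU memory (assumed)
    weight_mem_gb=13,    # rough fp16 weight size of a 7B model (assumed)
    n_layers=32,         # vicuna-7b-style architecture values
    n_kv_heads=32,
    head_dim=128,
    bytes_per_elem=2,    # fp16
    mem_fraction=0.9,    # headroom for activations and fragmentation
):
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    free_bytes = (gpu_mem_gb * mem_fraction - weight_mem_gb) * 1024**3
    return int(free_bytes // kv_bytes_per_token)

print(estimate_max_total_token_num())  # on the order of 1e5 tokens for a 7B model
```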

docs/mlx_integration.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Apple MLX Integration
+
+You can use [Apple MLX](https://github.com/ml-explore/mlx) as an optimized worker implementation in FastChat.
+
+It runs models efficiently on Apple Silicon
+
+See the supported models [here](https://github.com/ml-explore/mlx-examples/tree/main/llms#supported-models).
+
+Note that for Apple Silicon Macs with less memory, smaller models (or quantized models) are recommended.
+
+## Instructions
+
+1. Install MLX.
+
+```
+pip install "mlx-lm>=0.0.6"
+```
+
+2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the MLX worker (`fastchat.serve.mlx_worker`). Remember to launch a model worker after you have launched the controller ([instructions](../README.md))
+
+```
+python3 -m fastchat.serve.mlx_worker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0
+```
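Before wiring the worker into FastChat, it can be useful to confirm that mlx-lm can load and run the model on its own. A minimal sketch, assuming the `load` and `generate` helpers exposed by the `mlx_lm` package:

```python
# Sketch: standalone check that mlx-lm can run the model, independent of FastChat.
# Assumes the mlx_lm package provides load() and generate() as shown.
from mlx_lm import load, generate

model, tokenizer = load("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
text = generate(model, tokenizer, prompt="Hello, my name is", max_tokens=32)
print(text)
```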

docs/model_support.md

Lines changed: 67 additions & 26 deletions
@@ -1,15 +1,48 @@
 # Model Support
+This document describes how to support a new model in FastChat.
 
-## Supported models
+## Content
+- [Local Models](#local-models)
+- [API-Based Models](#api-based-models)
+
+## Local Models
+To support a new local model in FastChat, you need to correctly handle its prompt template and model loading.
+The goal is to make the following command run with the correct prompts.
+
+```
+python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
+```
+
+You can run this example command to learn the code logic.
+
+```
+python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
+```
+
+You can add `--debug` to see the actual prompt sent to the model.
+
+### Steps
+
+FastChat uses the `Conversation` class to handle prompt templates and `BaseModelAdapter` class to handle model loading.
+
+1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one. Please also add a link to the official reference code if possible.
+2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
+3. (Optional) add the model name to the "Supported models" [section](#supported-models) above and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
+
+After these steps, the new model should be compatible with most FastChat features, such as CLI, web UI, model worker, and OpenAI-compatible API server. Please do some testing with these features as well.
+
+### Supported models
 
 - [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
 - example: `python3 -m fastchat.serve.cli --model-path meta-llama/Llama-2-7b-chat-hf`
 - Vicuna, Alpaca, LLaMA, Koala
 - example: `python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5`
+- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
 - [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
 - [BAAI/AquilaChat2-7B](https://huggingface.co/BAAI/AquilaChat2-7B)
 - [BAAI/AquilaChat2-34B](https://huggingface.co/BAAI/AquilaChat2-34B)
 - [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en#using-huggingface-transformers)
+- [argilla/notus-7b-v1](https://huggingface.co/argilla/notus-7b-v1)
 - [baichuan-inc/baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B)
 - [BlinkDL/RWKV-4-Raven](https://huggingface.co/BlinkDL/rwkv-4-raven)
 - example: `python3 -m fastchat.serve.cli --model-path ~/model_weights/RWKV-4-Raven-7B-v11x-Eng99%-Other1%-20230429-ctx8192.pth`
@@ -18,13 +51,20 @@
 - [camel-ai/CAMEL-13B-Combined-Data](https://huggingface.co/camel-ai/CAMEL-13B-Combined-Data)
 - [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)
 - [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b)
+- [deepseek-ai/deepseek-llm-67b-chat](https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat)
+- [deepseek-ai/deepseek-coder-33b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct)
 - [FlagAlpha/Llama2-Chinese-13b-Chat](https://huggingface.co/FlagAlpha/Llama2-Chinese-13b-Chat)
 - [FreedomIntelligence/phoenix-inst-chat-7b](https://huggingface.co/FreedomIntelligence/phoenix-inst-chat-7b)
 - [FreedomIntelligence/ReaLM-7b-v1](https://huggingface.co/FreedomIntelligence/Realm-7b)
 - [h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b)
+- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
+- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
 - [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
+- [IEITYuan/Yuan2-2B/51B/102B-hf](https://huggingface.co/IEITYuan)
 - [lcw99/polyglot-ko-12.8b-chang-instruct-chat](https://huggingface.co/lcw99/polyglot-ko-12.8b-chang-instruct-chat)
 - [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5)
+- [meta-math/MetaMath-7B-V1.0](https://huggingface.co/meta-math/MetaMath-7B-V1.0)
+- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - [mosaicml/mpt-7b-chat](https://huggingface.co/mosaicml/mpt-7b-chat)
 - example: `python3 -m fastchat.serve.cli --model-path mosaicml/mpt-7b-chat`
 - [Neutralzz/BiLLa-7B-SFT](https://huggingface.co/Neutralzz/BiLLa-7B-SFT)
@@ -34,56 +74,57 @@
 - [OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5)
 - [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5)
 - [Open-Orca/Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)
-- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
+- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
 - [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
 - [project-baize/baize-v2-7b](https://huggingface.co/project-baize/baize-v2-7b)
 - [Qwen/Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat)
+- [rishiraj/CatPPT](https://huggingface.co/rishiraj/CatPPT)
 - [Salesforce/codet5p-6b](https://huggingface.co/Salesforce/codet5p-6b)
 - [StabilityAI/stablelm-tuned-alpha-7b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b)
+- [tenyx/TenyxChat-7B-v1](https://huggingface.co/tenyx/TenyxChat-7B-v1)
+- [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
 - [THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)
 - [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
 - [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b)
 - [tiiuae/falcon-180B-chat](https://huggingface.co/tiiuae/falcon-180B-chat)
 - [timdettmers/guanaco-33b-merged](https://huggingface.co/timdettmers/guanaco-33b-merged)
 - [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)
+- [VMware/open-llama-7b-v2-open-instruct](https://huggingface.co/VMware/open-llama-7b-v2-open-instruct)
 - [WizardLM/WizardLM-13B-V1.0](https://huggingface.co/WizardLM/WizardLM-13B-V1.0)
 - [WizardLM/WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)
-- [HuggingFaceH4/starchat-beta](https://huggingface.co/HuggingFaceH4/starchat-beta)
-- [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
 - [Xwin-LM/Xwin-LM-7B-V0.1](https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1)
-- [OpenLemur/lemur-70b-chat-v1](https://huggingface.co/OpenLemur/lemur-70b-chat-v1)
-- [allenai/tulu-2-dpo-7b](https://huggingface.co/allenai/tulu-2-dpo-7b)
-- [Microsoft/Orca-2-7b](https://huggingface.co/microsoft/Orca-2-7b)
 - Any [EleutherAI](https://huggingface.co/EleutherAI) pythia model such as [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b)
 - Any [Peft](https://github.com/huggingface/peft) adapter trained on top of a
 model above. To activate, must have `peft` in the model path. Note: If
 loading multiple peft models, you can have them share the base model weights by
 setting the environment variable `PEFT_SHARE_BASE_WEIGHTS=true` in any model
 worker.
 
-## How to support a new model
 
-To support a new model in FastChat, you need to correctly handle its prompt template and model loading.
-The goal is to make the following command run with the correct prompts.
+## API-Based Models
+To support an API-based model, consider learning from the existing OpenAI example.
+If the model is compatible with OpenAI APIs, then a configuration file is all that's needed without any additional code.
+For custom protocols, implementation of a streaming generator in [fastchat/serve/api_provider.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/api_provider.py) is required, following the provided examples. Currently, FastChat is compatible with OpenAI, Anthropic, Google Vertex AI, Mistral, and Nvidia NGC.
 
+### Steps to Launch a WebUI with an API Model
+1. Specify the endpoint information in a JSON configuration file. For instance, create a file named `api_endpoints.json`:
+```json
+{
+  "gpt-3.5-turbo": {
+    "model_name": "gpt-3.5-turbo",
+    "api_type": "openai",
+    "api_base": "https://api.openai.com/v1",
+    "api_key": "sk-******",
+    "anony_only": false
+  }
+}
 ```
-python3 -m fastchat.serve.cli --model [YOUR_MODEL_PATH]
-```
-
-You can run this example command to learn the code logic.
+- "api_type" can be one of the following: openai, anthropic, gemini, or mistral. For custom APIs, add a new type and implement it accordingly.
+- "anony_only" indicates whether to display this model in anonymous mode only.
 
+2. Launch the Gradio web server with the argument `--register api_endpoints.json`:
 ```
-python3 -m fastchat.serve.cli --model lmsys/vicuna-7b-v1.5
+python3 -m fastchat.serve.gradio_web_server --controller "" --share --register api_endpoints.json
 ```
 
-You can add `--debug` to see the actual prompt sent to the model.
-
-### Steps
-
-FastChat uses the `Conversation` class to handle prompt templates and `BaseModelAdapter` class to handle model loading.
-
-1. Implement a conversation template for the new model at [fastchat/conversation.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py). You can follow existing examples and use `register_conv_template` to add a new one. Please also add a link to the official reference code if possible.
-2. Implement a model adapter for the new model at [fastchat/model/model_adapter.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py). You can follow existing examples and use `register_model_adapter` to add a new one.
-3. (Optional) add the model name to the "Supported models" [section](#supported-models) above and add more information in [fastchat/model/model_registry.py](https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_registry.py).
-
-After these steps, the new model should be compatible with most FastChat features, such as CLI, web UI, model worker, and OpenAI-compatible API server. Please do some testing with these features as well.
+Now, you can open a browser and interact with the model.
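To make the two registration steps in this document concrete, here is an abbreviated sketch of a template plus adapter registration following the `register_conv_template` / `register_model_adapter` pattern the doc describes; the model name, roles, and separators are invented placeholders, and real entries live in `fastchat/conversation.py` and `fastchat/model/model_adapter.py`.

```python
# Sketch only: an illustrative conversation template and model adapter.
# "my-model" and its separators are made-up placeholders for a new model.
from fastchat.conversation import (
    Conversation, SeparatorStyle, register_conv_template, get_conv_template,
)
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter

register_conv_template(
    Conversation(
        name="my-model",
        system_message="You are a helpful assistant.",
        roles=("USER", "ASSISTANT"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
    )
)

class MyModelAdapter(BaseModelAdapter):
    """Route model paths containing 'my-model' to the template above."""

    def match(self, model_path: str):
        return "my-model" in model_path.lower()

    def get_default_conv_template(self, model_path: str):
        return get_conv_template("my-model")

register_model_adapter(MyModelAdapter)
```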

docs/openai_api.md

Lines changed: 8 additions & 7 deletions
@@ -8,6 +8,8 @@ The following OpenAI APIs are supported:
 - Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
 - Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)
 
+The REST API can be seamlessly operated from Google Colab, as demonstrated in the [FastChat_API_GoogleColab.ipynb](https://github.com/lm-sys/FastChat/blob/main/playground/FastChat_API_GoogleColab.ipynb) notebook, available in our repository. This notebook provides a practical example of how to utilize the API effectively within the Google Colab environment.
+
 ## RESTful API Server
 First, launch the controller
 
@@ -32,29 +34,28 @@ Now, let us test the API server.
 ### OpenAI Official SDK
 The goal of `openai_api_server.py` is to implement a fully OpenAI-compatible API server, so the models can be used directly with [openai-python](https://github.com/openai/openai-python) library.
 
-First, install openai-python:
+First, install OpenAI python package >= 1.0:
 ```bash
 pip install --upgrade openai
 ```
 
-Then, interact with model vicuna:
+Then, interact with the Vicuna model:
 ```python
 import openai
-# to get proper authentication, make sure to use a valid key that's listed in
-# the --api-keys flag. if no flag value is provided, the `api_key` will be ignored.
+
 openai.api_key = "EMPTY"
-openai.api_base = "http://localhost:8000/v1"
+openai.base_url = "http://localhost:8000/v1/"
 
 model = "vicuna-7b-v1.5"
 prompt = "Once upon a time"
 
 # create a completion
-completion = openai.Completion.create(model=model, prompt=prompt, max_tokens=64)
+completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
 # print the completion
 print(prompt + completion.choices[0].text)
 
 # create a chat completion
-completion = openai.ChatCompletion.create(
+completion = openai.chat.completions.create(
 model=model,
 messages=[{"role": "user", "content": "Hello! What is your name?"}]
 )
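The snippet in the diff above covers non-streaming calls; a streaming request with the same >= 1.0 client differs only in the `stream=True` flag and in how the chunks are read. A minimal sketch against the same local server and model name:

```python
# Sketch: streaming chat completion against the local FastChat
# OpenAI-compatible server, using the openai >= 1.0 client.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1/")

stream = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```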

docs/third_party_ui.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Third Party UI
+If you want to host it on your own UI or third party UI, you can launch the [OpenAI compatible server](openai_api.md) and host with a tunnelling service such as Tunnelmole or ngrok, and then enter the credentials appropriately.
+
+You can find suitable UIs from third party repos:
+- [WongSaang's ChatGPT UI](https://github.com/WongSaang/chatgpt-ui)
+- [McKayWrigley's Chatbot UI](https://github.com/mckaywrigley/chatbot-ui)
+
+- Please note that some third-party providers only offer the standard `gpt-3.5-turbo`, `gpt-4`, etc., so you will have to add your own custom model inside the code. [Here is an example of how to create a UI with any custom model name](https://github.com/ztjhz/BetterChatGPT/pull/461).
+
+##### Using Tunnelmole
+Tunnelmole is an open source tunnelling tool. You can find its source code on [Github](https://github.com/robbie-cahill/tunnelmole-client). Here's how you can use Tunnelmole:
+1. Install Tunnelmole with `curl -O https://install.tunnelmole.com/9Wtxu/install && sudo bash install`. (On Windows, download [tmole.exe](https://tunnelmole.com/downloads/tmole.exe)). Head over to the [README](https://github.com/robbie-cahill/tunnelmole-client) for other methods such as `npm` or building from source.
+2. Run `tmole 7860` (replace `7860` with your listening port if it is different from 7860). The output will display two URLs: one HTTP and one HTTPS. It's best to use the HTTPS URL for better privacy and security.
+```
+➜  ~ tmole 7860
+http://bvdo5f-ip-49-183-170-144.tunnelmole.net is forwarding to localhost:7860
+https://bvdo5f-ip-49-183-170-144.tunnelmole.net is forwarding to localhost:7860
+```
+
+##### Using ngrok
+ngrok is a popular closed source tunnelling tool. First download and install it from [ngrok.com](https://ngrok.com/downloads). Here's how to use it to expose port 7860.
+```
+ngrok http 7860
+```
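Since many third-party UIs need the exact model name served behind the tunnel, it can help to list the models the tunnelled server exposes before configuring the UI. A small sketch, where the tunnel URL is a placeholder standing in for the address printed by tmole or ngrok above:

```python
# Sketch: list the model names served behind the tunnel so they can be entered
# into a third-party UI. The tunnel URL below is a placeholder.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="https://your-tunnel-subdomain.tunnelmole.net/v1",
)
for model in client.models.list():
    print(model.id)
```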
