18 changes: 18 additions & 0 deletions docs/lightllm_integration.md
@@ -0,0 +1,18 @@
# LightLLM Integration
You can use [LightLLM](https://github.com/ModelTC/lightllm) as an optimized worker implementation in FastChat.
It offers advanced continuous batching and much higher (~10x) throughput.
See the supported models [here](https://github.com/ModelTC/lightllm?tab=readme-ov-file#supported-model-list).

## Instructions
1. Refer to the [Get started](https://github.com/ModelTC/lightllm?tab=readme-ov-file#get-started) guide to install LightLLM, or use a [pre-built image](https://github.com/ModelTC/lightllm?tab=readme-ov-file#container).

2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the LightLLM worker (`fastchat.serve.lightllm_worker`). All other commands, such as the controller, Gradio web server, and OpenAI API server, stay the same. Refer to [--max_total_token_num](https://github.com/ModelTC/lightllm/blob/4a9824b6b248f4561584b8a48ae126a0c8f5b000/docs/ApiServerArgs.md?plain=1#L23) to understand how to calculate the `--max_total_token_num` argument; a rough capacity estimate is sketched after the launch command below.
```
python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
```
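
As a rough guide, `--max_total_token_num` is capped by the GPU memory left over after the model weights, divided by the per-token KV-cache footprint. Below is a minimal back-of-the-envelope sketch; the GPU size, memory headroom, and model shape are all illustrative assumptions, and the authoritative formula is in LightLLM's ApiServerArgs.md linked above.

```
# Back-of-the-envelope only; real capacity depends on your GPU, dtype,
# and LightLLM's own memory accounting (see its ApiServerArgs.md).
gpu_mem_gb = 80      # assumption: one A100 80GB
weight_mem_gb = 14   # assumption: ~7B parameters in fp16
reserved_gb = 4      # assumption: activation/workspace headroom

# Per-token KV cache for the LLaMA-7B shape used by vicuna-7b-v1.5:
# 2 tensors (K and V) x 32 layers x 32 heads x 128 head_dim x 2 bytes (fp16)
kv_bytes_per_token = 2 * 32 * 32 * 128 * 2  # 524,288 bytes per token

free_bytes = (gpu_mem_gb - weight_mem_gb - reserved_gb) * 1024**3
print(free_bytes // kv_bytes_per_token)  # ~126,976 tokens for this setup
```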

If you want to use quantized weights and KV cache for inference, try

```
python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
```
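
Once the worker is up and registered with the controller, you can sanity-check it the same way as any other FastChat worker, e.g. with `python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5`.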
18 changes: 9 additions & 9 deletions fastchat/serve/base_model_worker.py
```
@@ -126,18 +126,18 @@ def send_heart_beat(self):
         self.register_to_controller()
 
     def get_queue_length(self):
-        if (
-            self.semaphore is None
-            or self.semaphore._value is None
-            or self.semaphore._waiters is None
-        ):
+        if self.semaphore is None:
             return 0
         else:
-            return (
-                self.limit_worker_concurrency
-                - self.semaphore._value
-                + len(self.semaphore._waiters)
+            semaphore_value = (
+                self.semaphore._value
+                if self.semaphore._value is not None
+                else self.limit_worker_concurrency
             )
+            waiter_count = (
+                0 if self.semaphore._waiters is None else len(self.semaphore._waiters)
+            )
+            return self.limit_worker_concurrency - semaphore_value + waiter_count
 
     def get_status(self):
         return {
```
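
The refactor matters because `asyncio.Semaphore` can hold `_waiters = None` until the first coroutine actually blocks (recent CPython versions create the waiter deque lazily), so the old all-or-nothing check reported a queue length of 0 even while request slots were occupied. A minimal standalone sketch of the new accounting, with the worker's fields mirrored as plain variables (the concurrency limit of 5 is an arbitrary illustration):

```
# Standalone sketch of the refactored get_queue_length() logic using a
# bare asyncio.Semaphore; limit_worker_concurrency mirrors the worker field.
import asyncio


async def main():
    limit_worker_concurrency = 5
    semaphore = asyncio.Semaphore(limit_worker_concurrency)

    await semaphore.acquire()  # simulate one in-flight request
    await semaphore.acquire()  # and another

    # No coroutine is blocked, so _waiters may still be None here; the old
    # check would have returned 0, but the new logic counts the 2 busy slots.
    value = (
        semaphore._value if semaphore._value is not None else limit_worker_concurrency
    )
    waiters = 0 if semaphore._waiters is None else len(semaphore._waiters)
    print(limit_worker_concurrency - value + waiters)  # -> 2

    semaphore.release()
    semaphore.release()


asyncio.run(main())
```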