Commit c19a953

Integrate LightLLM into serve worker (#2888)
1 parent c6d7acd commit c19a953

File tree

4 files changed: +554 −9 lines changed

docs/lightllm_integration.md

Lines changed: 18 additions & 0 deletions

# LightLLM Integration
You can use [LightLLM](https://github.com/ModelTC/lightllm) as an optimized worker implementation in FastChat.
It offers advanced continuous batching and much higher (~10x) throughput.
See the supported models [here](https://github.com/ModelTC/lightllm?tab=readme-ov-file#supported-model-list).

## Instructions
1. Refer to [Get started](https://github.com/ModelTC/lightllm?tab=readme-ov-file#get-started) to install LightLLM, or use the [pre-built image](https://github.com/ModelTC/lightllm?tab=readme-ov-file#container).

2. When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the LightLLM worker (`fastchat.serve.lightllm_worker`). All other commands, such as the controller, Gradio web server, and OpenAI API server, stay the same. Refer to [--max_total_token_num](https://github.com/ModelTC/lightllm/blob/4a9824b6b248f4561584b8a48ae126a0c8f5b000/docs/ApiServerArgs.md?plain=1#L23) to understand how to calculate the `--max_total_token_num` argument.

   ```
   python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000
   ```

   If you want to use quantized weights and a quantized KV cache for inference, try

   ```
   python3 -m fastchat.serve.lightllm_worker --model-path lmsys/vicuna-7b-v1.5 --tokenizer_mode "auto" --max_total_token_num 154000 --mode triton_int8weight triton_int8kv
   ```
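The linked ApiServerArgs document explains how `--max_total_token_num` relates to free GPU memory and per-token KV-cache size. As a rough illustration of that calculation (all model shapes and memory figures below are assumptions for a Vicuna-7B-like model on an 80 GB GPU, not values taken from the commit or from LightLLM's docs):

```python
# Back-of-the-envelope estimate of --max_total_token_num.
# All numbers passed in below are illustrative assumptions.
def estimate_max_total_token_num(
    gpu_mem_gb: float,          # total GPU memory
    weight_mem_gb: float,       # memory occupied by model weights
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,       # fp16 KV cache
    mem_fraction: float = 0.9,  # headroom for activations etc.
) -> int:
    # KV cache per token: key + value tensors, per layer.
    kv_bytes_per_token = 2 * num_kv_heads * head_dim * dtype_bytes * num_layers
    free_bytes = (gpu_mem_gb - weight_mem_gb) * mem_fraction * 1024**3
    return int(free_bytes // kv_bytes_per_token)

# Vicuna-7B-like shape (32 layers, 32 KV heads, head_dim 128), 80 GB GPU,
# ~14 GB of weights: roughly 121,651 tokens with these assumed numbers.
print(estimate_max_total_token_num(80, 14, num_layers=32, num_kv_heads=32, head_dim=128))
```

The real capacity depends on the model, tensor parallelism, and runtime overheads, so treat the linked LightLLM documentation as authoritative and this only as the shape of the arithmetic.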

fastchat/serve/base_model_worker.py

Lines changed: 9 additions & 9 deletions
@@ -126,18 +126,18 @@ def send_heart_beat(self):
             self.register_to_controller()
 
     def get_queue_length(self):
-        if (
-            self.semaphore is None
-            or self.semaphore._value is None
-            or self.semaphore._waiters is None
-        ):
+        if self.semaphore is None:
             return 0
         else:
-            return (
-                self.limit_worker_concurrency
-                - self.semaphore._value
-                + len(self.semaphore._waiters)
+            sempahore_value = (
+                self.semaphore._value
+                if self.semaphore._value is not None
+                else self.limit_worker_concurrency
             )
+            waiter_count = (
+                0 if self.semaphore._waiters is None else len(self.semaphore._waiters)
+            )
+            return self.limit_worker_concurrency - sempahore_value + waiter_count
 
     def get_status(self):
         return {
