Description
Anything you want to discuss about vllm.
I am profiling TTFT and TPOT on my machine, but I could not explain the behavior of TTFT, so I opened this issue to seek advice.
The figure below shows TTFT with respect to prompt length on my machine. The test conditions are:
- model: llama3-8B
- GPU type: V100; the figure below shows the result for TP=2
- dataset: ShareGPT
Steps taken for TTFT and TPOT profiling:
- Start the OpenAI API-compatible server:
python -m vllm.entrypoints.openai.api_server --args
- Run benchmark_serving.py iteratively to get the TTFT and TPOT; each iteration sends only a single request to the server, to eliminate the effect of queueing/waiting time (a minimal single-request sketch follows this list)
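For reference, here is a minimal sketch of what each iteration measures. It is not benchmark_serving.py itself; the endpoint URL, model name, prompts, and the measure_ttft helper are assumptions made up for illustration. It sends one streaming request to the OpenAI-compatible server and times the arrival of the first token.

```python
# Minimal sketch (not benchmark_serving.py itself) of measuring TTFT for one
# streaming request at a time against the OpenAI-compatible server.
# The URL, model name, prompt lengths, and measure_ttft helper are assumptions.
import time

import requests


def measure_ttft(prompt: str,
                 url: str = "http://localhost:8000/v1/completions",
                 model: str = "meta-llama/Meta-Llama-3-8B") -> float:
    """Send one streaming completion request and return seconds until the first token."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64, "stream": True}
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # The streamed response is server-sent events; the first non-empty
            # "data:" line that is not the terminator marks the first token.
            if line and line.startswith(b"data:") and b"[DONE]" not in line:
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token was produced")


if __name__ == "__main__":
    # One request at a time, so queueing does not inflate the measured TTFT.
    for n_words in (100, 200, 400, 800):
        ttft = measure_ttft("hello " * n_words)
        print(f"{n_words:5d} words -> {ttft * 1000:.1f} ms")
```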
The profiled TTFT is shown below:
Observation 1: when the prompt length is less than 400, the TTFT seems to be flat at ~100 ms. This value is consistent across different TP settings (TP=1, TP=2, and TP=4 were tried).
Observation 2: when the prompt length is greater than 400, TTFT is linear in prompt length. This result is in line with Figure 6b of this paper (https://arxiv.org/pdf/2405.06856).
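To make the two regimes concrete, here is a small fit sketch; the (prompt_len, TTFT) pairs are hypothetical placeholders, not my measured results, and should be replaced with the actual data points.

```python
# Sketch of checking the two regimes: a roughly constant floor below ~400 tokens
# and a linear trend above it, i.e. TTFT(n) ~ max(c, a * n + b).
# The sample values below are hypothetical placeholders, not measured results.
import numpy as np

prompt_lens = np.array([64, 128, 256, 400, 800, 1600, 3200])
ttft_ms = np.array([100, 100, 101, 105, 190, 370, 730])  # placeholder numbers

flat = prompt_lens <= 400
floor_ms = ttft_ms[flat].mean()                               # flat region (observation 1)
a, b = np.polyfit(prompt_lens[~flat], ttft_ms[~flat], deg=1)  # linear region (observation 2)

print(f"flat-region floor : ~{floor_ms:.0f} ms")
print(f"linear-region fit : TTFT = {a:.3f} * prompt_len + {b:.1f} ms")
```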
I don't understand the result of observation 1. Can anyone provide some insight into it? What causes TTFT to stay flat (a horizontal line) when the prompt length is less than 400?