Closed
Labels: performance
Description
OctoAI uses vLLM as the baseline in a benchmark to demonstrate how fast their stack is: https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost
(Benchmark charts from the post: Single User Throughput, Multi-user Throughput, and Inter-Token Latency.)
Their main optimisations appear to be:
- FP8 quantisation of the model weights (currently we only support FP8 for the KV cache)
- The CustomAllReduce kernel from NVIDIA TRT LLM
- CUDA graphs
- Speculative decoding (which we have thanks to @cadedaniel!)
- Dynamic SplitFuse (A.K.A. Chunked Prefill, which we have thanks to @rkooo567!)
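As a rough sketch of the speculative decoding idea above (greedy acceptance only; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models, not vLLM APIs, and a real implementation verifies all k proposals in a single batched target forward pass):

```python
# A minimal sketch of speculative decoding with greedy acceptance.
# `draft_next` / `target_next` are hypothetical stand-ins for the small
# draft model and the large target model: each maps a token sequence to
# the greedy next-token id. A real implementation scores all k draft
# tokens with one batched target forward pass instead of k serial calls.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new=16):
    tokens = list(prompt)
    total = len(prompt) + max_new
    while len(tokens) < total:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies the proposals left to right, keeping
        #    the longest prefix it agrees with.
        n_accept = 0
        for i, tok in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == tok:
                n_accept += 1
            else:
                break
        tokens.extend(proposal[:n_accept])
        # 3. The target's own prediction at the first disagreement (or
        #    after full acceptance) gives one guaranteed-correct token,
        #    so the output always matches plain greedy target decoding.
        tokens.append(target_next(tokens))
    return tokens[:total]
```

Note the output is identical to greedy decoding of the target alone; a good draft model only changes how many target calls are needed.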
My question is, what do we need to do to reach performance parity?
Some clear things are:
- Make all of these features compatible with each other
- See what can be learned from the TRT LLM CustomAllReduce
- Support executing models in FP8
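For the FP8 execution point, a simulation of what per-tensor E4M3 weight quantisation does to values may help build intuition (pure-numpy sketch; `fp8_e4m3_fake_quant` is an illustrative helper of my own, not a vLLM API, and it ignores subnormals, NaN encoding, and the minimum-exponent clamp of real E4M3):

```python
import numpy as np

def fp8_e4m3_fake_quant(x):
    """Simulate per-tensor FP8 E4M3 weight quantisation in float32.

    E4M3 keeps 4 significant binary digits (1 implicit + 3 mantissa
    bits) and has a max finite value of 448. A real kernel stores 8-bit
    codes plus a scale; here we only round in float32, to show the
    precision/range trade-off. Simplifications: subnormals, NaN
    encoding, and the minimum-exponent clamp of real E4M3 are ignored.
    """
    x = np.asarray(x, dtype=np.float32)
    amax = float(np.abs(x).max())
    scale = amax / 448.0 if amax > 0 else 1.0  # map |x|max onto E4M3 range
    y = np.clip(x / scale, -448.0, 448.0)
    m, e = np.frexp(y)               # y = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0    # round mantissa to 4 significant bits
    return np.ldexp(m, e) * scale    # dequantise back to float32
```

With 4 significant bits the worst-case relative rounding error is 1/16, which is why per-channel or per-tensor scaling (rather than raw casting) is what makes FP8 weights usable.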
Notable issues: