Closed
Labels: performance
Description
OctoAI uses vLLM as the baseline in a benchmark to demonstrate how fast their stack is: https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost
(Benchmark charts from the post: Single User Throughput, Multi-user Throughput, and Inter-Token Latency.)
Their main optimisations appear to be:
- FP8 quantisation of the model weights (currently we only support FP8 for the KV cache)
- The CustomAllReduce kernel from NVIDIA TRT LLM
- CUDA graphs
- Speculative decoding (which we have thanks to @cadedaniel!)
- Dynamic SplitFuse (A.K.A. Chunked Prefill, which we have thanks to @rkooo567!)
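As a rough sketch of the speculative decoding idea above (greedy acceptance only; `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models, not vLLM APIs, and a real implementation verifies all k proposals in a single batched target forward pass):

```python
# A minimal sketch of speculative decoding with greedy acceptance.
# `draft_next` / `target_next` are hypothetical stand-ins for the small
# draft model and the large target model: each maps a token sequence to
# the greedy next-token id. A real implementation scores all k draft
# tokens with one batched target forward pass instead of k serial calls.

def speculative_decode(prompt, draft_next, target_next, k=4, max_new=16):
    tokens = list(prompt)
    total = len(prompt) + max_new
    while len(tokens) < total:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies the proposals left to right, keeping
        #    the longest prefix it agrees with.
        n_accept = 0
        for i, tok in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == tok:
                n_accept += 1
            else:
                break
        tokens.extend(proposal[:n_accept])
        # 3. The target's own prediction at the first disagreement (or
        #    after full acceptance) gives one guaranteed-correct token,
        #    so the output always matches plain greedy target decoding.
        tokens.append(target_next(tokens))
    return tokens[:total]
```

Note the output is identical to greedy decoding of the target alone; a good draft model only changes how many target calls are needed.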
My question is, what do we need to do to reach performance parity?
Some clear things are:
- Make all of these features compatible with each other
- See what can be learned from the TRT LLM CustomAllReduce
- Support executing models in FP8
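For the FP8 execution point, a simulation of what per-tensor E4M3 weight quantisation does to values may help build intuition (pure-numpy sketch; `fp8_e4m3_fake_quant` is an illustrative helper of my own, not a vLLM API, and it ignores subnormals, NaN encoding, and the minimum-exponent clamp of real E4M3):

```python
import numpy as np

def fp8_e4m3_fake_quant(x):
    """Simulate per-tensor FP8 E4M3 weight quantisation in float32.

    E4M3 keeps 4 significant binary digits (1 implicit + 3 mantissa
    bits) and has a max finite value of 448. A real kernel stores 8-bit
    codes plus a scale; here we only round in float32, to show the
    precision/range trade-off. Simplifications: subnormals, NaN
    encoding, and the minimum-exponent clamp of real E4M3 are ignored.
    """
    x = np.asarray(x, dtype=np.float32)
    amax = float(np.abs(x).max())
    scale = amax / 448.0 if amax > 0 else 1.0  # map |x|max onto E4M3 range
    y = np.clip(x / scale, -448.0, 448.0)
    m, e = np.frexp(y)               # y = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0    # round mantissa to 4 significant bits
    return np.ldexp(m, e) * scale    # dequantise back to float32
```

With 4 significant bits the worst-case relative rounding error is 1/16, which is why per-channel or per-tensor scaling (rather than raw casting) is what makes FP8 weights usable.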
Notable issues: