[Performance]: What can we learn from OctoAI #5167

@hmellor

Description

OctoAI use vLLM as a baseline to demonstrate how fast their stack is: https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost. The post benchmarks the following metrics:

[Charts: Single User Throughput, Multi-user Throughput, Inter-Token Latency]

Their main optimisations appear to be:

  • FP8 quantisation of the whole model, not just the KV cache as we currently support (sketched after this list)
  • The CustomAllReduce kernel from Nvidia TRT LLM
  • CUDA graphs (sketch below)
  • Speculative decoding, which we have thanks to @cadedaniel! (sketch below)
  • Dynamic SplitFuse, a.k.a. Chunked Prefill, which we have thanks to @rkooo567! (sketch below)

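To make the first item concrete, here is a minimal per-tensor FP8 (E4M3) quantisation round trip using PyTorch's `torch.float8_e4m3fn` dtype (available since PyTorch 2.1). The helper names and the per-tensor scaling scheme are illustrative only, not vLLM's or OctoAI's actual implementation, which would also need FP8 GEMM kernels to see a speedup:

```python
# Minimal per-tensor FP8 (E4M3) quantisation sketch. Illustrative only:
# a real implementation also needs FP8 matmul kernels, not just storage.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale a float tensor into the E4M3 range and cast it to FP8."""
    scale = E4M3_MAX / w.abs().max().clamp(min=1e-12)
    w_fp8 = (w * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Upcast back to float32 (FP8 tensors do not support arithmetic) and undo the scale."""
    return w_fp8.to(torch.float32) / scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8(w)
err = (w - dequantize_fp8(w_fp8, scale)).abs().max()
print(f"max abs round-trip error: {err:.4f}")
```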
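For CUDA graphs, the standard PyTorch capture/replay pattern shows the flavour of the optimisation: record the decode step once, then launch the whole recorded kernel sequence with a single call, removing per-kernel launch overhead. The `torch.nn.Linear` stand-in for the model is an assumption, and this needs a CUDA GPU to run:

```python
# CUDA graph capture/replay sketch using torch.cuda.CUDAGraph.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so capture sees a clean allocator state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)  # recorded, not executed eagerly

# Replay: copy new data into the captured input buffer, then relaunch
# the entire recorded kernel sequence with one call.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```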
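A minimal greedy speculative-decoding step, assuming hypothetical `draft_next` and `target_argmax` model hooks (neither is a real vLLM API). The key point is that the target model verifies all k draft tokens in a single forward pass, so each step advances by at least one token:

```python
# Greedy speculative decoding sketch: a cheap draft model proposes k tokens,
# the large target model scores them in one forward pass, and we keep the
# longest prefix on which both agree.
from typing import Callable

def speculative_step(
    tokens: list[int],
    draft_next: Callable[[list[int]], int],          # hypothetical draft hook
    target_argmax: Callable[[list[int]], list[int]], # hypothetical target hook
    k: int = 4,
) -> list[int]:
    # 1) Draft proposes k tokens autoregressively (cheap model).
    proposal = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target scores context + proposal in ONE forward pass; assumed to
    #    return its greedy choice at each of the k proposed positions.
    target_choices = target_argmax(tokens + proposal)

    # 3) Accept the longest agreeing prefix; the first disagreement is
    #    replaced by the target's token, so we always gain >= 1 token.
    accepted = []
    for prop, tgt in zip(proposal, target_choices):
        if prop == tgt:
            accepted.append(prop)
        else:
            accepted.append(tgt)
            break
    return tokens + accepted
```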
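Finally, a toy scheduler illustrating Dynamic SplitFuse / chunked prefill: long prompts are split into chunks that share each step's token budget with decode tokens, so running decodes are never stalled behind one huge prefill. The 512-token budget and all names are made up for illustration, this is not vLLM's scheduler:

```python
# Chunked-prefill scheduling sketch: interleave prompt chunks with decodes
# under a fixed per-step token budget.
TOKEN_BUDGET = 512  # max tokens processed per model step (illustrative)

def schedule_step(prefill_queue: list[list[int]], decode_seqs: list[int]) -> list[tuple[str, int]]:
    """Return (kind, n_tokens) work items for one step within the budget."""
    step = []
    budget = TOKEN_BUDGET

    # Decodes first: each running sequence contributes exactly one token,
    # which keeps inter-token latency low for active users.
    n_decode = min(len(decode_seqs), budget)
    if n_decode:
        step.append(("decode", n_decode))
        budget -= n_decode

    # Spend whatever budget remains on a chunk of the oldest pending prefill.
    if prefill_queue and budget > 0:
        prompt = prefill_queue[0]
        chunk = min(len(prompt), budget)
        step.append(("prefill", chunk))
        del prompt[:chunk]        # consume the chunk
        if not prompt:
            prefill_queue.pop(0)  # prompt fully prefilled
    return step

# Example: a 1300-token prompt interleaves with 100 decoding sequences.
queue = [list(range(1300))]
while queue:
    print(schedule_step(queue, decode_seqs=list(range(100))))
```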
My question is, what do we need to do to reach performance parity?

Some clear things are:

  • Make all of these features compatible with each other
  • See what can be learned from the TRT LLM CustomAllReduce (one-shot pattern sketched after this list)
  • Support executing models in FP8

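On the CustomAllReduce point: the core idea in TRT LLM's kernel, as I understand it, is a latency-optimised "one-shot" all-reduce for small tensors, where every rank reads all peers' buffers directly (over NVLink P2P on real hardware) and reduces locally, instead of circulating chunks around a ring as NCCL does. A host-side NumPy emulation of just the data movement, with the CUDA/IPC mechanics omitted:

```python
# One-shot all-reduce pattern, emulated on the host. On GPUs the buffers
# would be peer-mapped device memory; here they are plain NumPy arrays.
import numpy as np

WORLD_SIZE = 4
shared_buffers = [np.random.randn(1024).astype(np.float32) for _ in range(WORLD_SIZE)]

def one_shot_all_reduce(rank: int) -> np.ndarray:
    # Single step: each rank reads every peer's buffer and sums locally.
    # Latency is ~one cross-device read, vs 2*(N-1) steps for a ring.
    return np.sum(shared_buffers, axis=0)

results = [one_shot_all_reduce(r) for r in range(WORLD_SIZE)]
assert all(np.allclose(results[0], r) for r in results)
print("all ranks agree:", results[0][:3])
```
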
Notable issues:
