@tjtanaavllm commented Sep 25, 2025

Purpose

This PR picks the best-performing implementation of the silu_and_mul kernel and sets it as the default.

We have tested three kernels:

  • torch inductor compiled silu_and_mul
  • aiter.silu_and_mul
  • vllm custom op silu_and_mul
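For reference, all three candidates compute the same SwiGLU-style activation: SiLU applied to the first half of the last dimension, multiplied elementwise by the second half. A minimal pure-Python sketch of the semantics (illustrative only; this is not any of the benchmarked kernels):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(x: list[float]) -> list[float]:
    # Input of length 2*d: SiLU on the first half (the "gate"),
    # multiplied elementwise by the second half (the "up" projection).
    d = len(x) // 2
    return [silu(x[i]) * x[d + i] for i in range(d)]
```

The benchmarked implementations fuse this into a single kernel (torch.compile / aiter / vLLM custom op) rather than materializing the intermediate halves.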

Test Plan

Evaluate serving performance in two settings: a maximum concurrency of 64, and a large number of requests.

Test Result

max concurrency 64

| Metric | aiter's | vLLM custom ops | torch inductor | Best |
|---|---|---|---|---|
| Request throughput (req/s) | 1.78 | 1.77 | 1.78 | aiter's / torch |
| Output token throughput (tok/s) | 207.32 | 206.51 | 207.28 | aiter's |
| Total token throughput (tok/s) | 375.10 | 373.48 | 374.75 | aiter's |
| Mean TTFT (ms) | 3938.11 | 3714.86 | 4254.20 | vLLM |
| Median TTFT (ms) | 2386.74 | 2204.05 | 1977.58 | torch |
| P99 TTFT (ms) | 24368.01 | 24284.53 | 29484.65 | vLLM |
| Mean TPOT (ms) | 295.16 | 297.56 | 285.95 | torch |
| Median TPOT (ms) | 286.63 | 287.77 | 287.71 | aiter's |
| P99 TPOT (ms) | 527.57 | 462.26 | 491.60 | vLLM |

large number of requests

| Metric | torch inductor | aiter's | Best |
|---|---|---|---|
| Benchmark duration (s) | 372.85 | 372.47 | aiter's |
| Request throughput (req/s) | 2.68 | 2.68 | Tie |
| Output token throughput (tok/s) | 313.84 | 313.71 | torch |
| Total token throughput (tok/s) | 566.82 | 566.96 | aiter's |
| Mean TTFT (ms) | 132,404 | 132,119 | aiter's |
| Median TTFT (ms) | 119,012 | 119,541 | torch |
| P99 TTFT (ms) | 351,168 | 355,388 | torch |
| Mean TPOT (ms) | 2,034 | 2,042 | torch |
| Median TPOT (ms) | 2,113 | 2,156 | torch |
| P99 TPOT (ms) | 4,089 | 4,032 | aiter's |


@zejunchen-zejun commented Sep 25, 2025

Great work @tjtanaavllm. We have also been wondering which backend has the most performant fusion kernel, so your experiment helps a lot.
@xytpai You can discuss the performance of the fusion kernel with tunjian here.

@tjtanaavllm merged commit f49f8d7 into llama_fp8_03122025 on Sep 26, 2025
3 of 4 checks passed