@tjtanaavllm commented Sep 25, 2025

Purpose

This PR picks the best-performing implementation of the silu_and_mul kernel and sets it as the default.

We have tested three kernels:

  • torch inductor compiled silu_and_mul
  • aiter.silu_and_mul
  • vllm custom op silu_and_mul
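For reference, all three candidates compute the same SwiGLU-style activation: SiLU applied to the first half of the last dimension, multiplied elementwise by the second half. A minimal pure-Python sketch of the semantics (illustrative only; this is not any of the benchmarked kernels):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(x: list[float]) -> list[float]:
    # Input of length 2*d: SiLU on the first half (the "gate"),
    # multiplied elementwise by the second half (the "up" projection).
    d = len(x) // 2
    return [silu(x[i]) * x[d + i] for i in range(d)]
```

The benchmarked implementations fuse this into a single kernel (torch.compile / aiter / vLLM custom op) rather than materializing the intermediate halves.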

Test Plan

Evaluate serving performance in two settings: a maximum concurrency of 64, and a large number of requests.

Test Result

max concurrency 64

| Metric | aiter's | vLLM custom ops | torch inductor | Best |
|---|---|---|---|---|
| Request throughput (req/s) | 1.78 | 1.77 | 1.78 | aiter's / torch |
| Output token throughput (tok/s) | 207.32 | 206.51 | 207.28 | aiter's |
| Total token throughput (tok/s) | 375.10 | 373.48 | 374.75 | aiter's |
| Mean TTFT (ms) | 3938.11 | 3714.86 | 4254.20 | vLLM |
| Median TTFT (ms) | 2386.74 | 2204.05 | 1977.58 | torch |
| P99 TTFT (ms) | 24368.01 | 24284.53 | 29484.65 | vLLM |
| Mean TPOT (ms) | 295.16 | 297.56 | 285.95 | torch |
| Median TPOT (ms) | 286.63 | 287.77 | 287.71 | aiter's |
| P99 TPOT (ms) | 527.57 | 462.26 | 491.60 | vLLM |

large number of requests

| Metric | torch inductor | aiter's | Best |
|---|---|---|---|
| Benchmark duration (s) | 372.85 | 372.47 | aiter's |
| Request throughput (req/s) | 2.68 | 2.68 | Tie |
| Output token throughput (tok/s) | 313.84 | 313.71 | torch |
| Total token throughput (tok/s) | 566.82 | 566.96 | aiter's |
| Mean TTFT (ms) | 132,404 | 132,119 | aiter's |
| Median TTFT (ms) | 119,012 | 119,541 | torch |
| P99 TTFT (ms) | 351,168 | 355,388 | torch |
| Mean TPOT (ms) | 2,034 | 2,042 | torch |
| Median TPOT (ms) | 2,113 | 2,156 | torch |
| P99 TPOT (ms) | 4,089 | 4,032 | aiter's |


@zejunchen-zejun commented Sep 25, 2025

Great work @tjtanaavllm. We have also been wondering which backend has the most performant fusion kernel, so your experiment helps a lot.
@xytpai You can discuss the performance of the fusion kernel with tunjian here.

@tjtanaavllm merged commit f49f8d7 into llama_fp8_03122025 on Sep 26, 2025
3 of 4 checks passed