[Perf] refactor attention backend for perf boost #713
+844
−126
Purpose
The current aiter_fa_backend performs a redundant fetch-KV operation on every run, which not only adds serious memory pressure but also costs time in every decode layer's attention computation. Based on that observation, we rewrote the attention backend to eliminate the unnecessary fetch-KV operation, and rewrote the fetch-KV Triton kernel for better occupancy in chunked prefill and similar scenarios. To avoid unnecessary memory reordering, there are also some changes to both model_runner and scheduler; in our tests these introduce negligible host overhead. The detailed analysis can be found in the following doc: https://amd.atlassian.net/wiki/spaces/MLSE/pages/1143506837/AITER+Attention+Backend+Proposal
We tested performance against the current attention backend on Qwen3, for both 30B and 235B. The improvement is substantial, especially in long-prompt scenarios. The test results are also attached in the doc above.
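As a rough illustration of the idea above (not the code in this PR), the sketch below models a paged KV cache in plain Python. It assumes that "fetch KV" means gathering a sequence's scattered cache blocks into one contiguous buffer, and that the gather is only needed when a multi-token chunk must attend to previously cached context. `BLOCK_SIZE`, `gather_kv`, and `needs_gather` are hypothetical names introduced for this example only.

```python
from typing import List

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def gather_kv(cache: List[List[int]], block_table: List[int], num_tokens: int) -> List[int]:
    """Copy the first `num_tokens` cached entries of one sequence from
    scattered blocks into a single contiguous list (the "fetch KV" step)."""
    out: List[int] = []
    for block_id in block_table:
        remaining = num_tokens - len(out)
        if remaining <= 0:
            break
        out.extend(cache[block_id][:remaining])
    return out


def needs_gather(num_cached_tokens: int, num_new_tokens: int) -> bool:
    """Assumed policy: gather only when a multi-token chunk has to attend
    to already-cached context (e.g. chunked prefill). A single-token decode
    step is assumed to read the paged cache in place and skip the copy."""
    return num_new_tokens > 1 and num_cached_tokens > 0


# Toy cache with three blocks; the sequence owns blocks 2 and 0, in that order.
cache = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
block_table = [2, 0]

if needs_gather(num_cached_tokens=6, num_new_tokens=8):
    contiguous = gather_kv(cache, block_table, num_tokens=6)
```

Under this assumed policy, a decode step never pays for the copy, so the per-layer cost of the gather is confined to the chunked-prefill path.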
Test Plan
Test Result
We observe a nearly 4.x throughput boost in the long-prompt, short-output case, along with a significant latency improvement. The benchmark results are below; the test script is also attached in the doc above:

prev:
new:
