Fix TorchAOConfig skip layers #19265
…atures) for faster loading Signed-off-by: mobicham <[email protected]>
Hello @mobicham, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here with a summary of this pull request. This PR, titled "Fix TorchAOConfig skip layers", addresses issues encountered when loading models quantized with TorchAO where not all layers have been converted. Specifically, it fixes crashes that occur when loading models that define certain modules to be skipped from quantization, either via the `modules_to_not_convert` list or by setting their configuration to `None` in `module_fqn_to_config`. The PR ensures that vLLM correctly identifies and handles these unquantized layers during loading. Additionally, a minor optimization improves the loading speed for TorchAO models by avoiding the creation of large dummy `nn.Linear` modules during parameter quantization.
Highlights
- TorchAO Quantization Fix: Resolves issues loading TorchAO quantized models that contain layers explicitly marked to be skipped from quantization.
- Support for Skipped Modules: Adds logic to correctly handle modules listed in `modules_to_not_convert` and modules with a `None` configuration in `module_fqn_to_config`, ensuring they are treated as unquantized.
- Loading Speed Optimization: Improves the speed of loading TorchAO models by optimizing the creation of dummy `nn.Linear` modules used during parameter quantization.
- New Test Case: Adds a test case specifically for loading a Qwen-VL model quantized with TorchAO, which helps validate the fix for models with unquantized components like vision heads.
Changelog
- tests/quantization/test_torchao.py
  - Added a new test function `test_qwenvl_int8wo_model_loading_with_params` (lines 62-74) to test loading a Qwen-VL model quantized with TorchAO, targeting the scenario with potentially skipped layers.
- vllm/model_executor/layers/quantization/torchao.py
  - Modified the `TorchAOConfig` constructor to accept an optional `skip_modules` list (lines 23-25, 41).
  - Updated the `from_config` class method to parse `modules_to_not_convert` and identify modules set to `None` in `module_fqn_to_config`, adding them to the `skip_modules` list (lines 78-85).
  - In `get_quant_method`, added a check to return `UnquantizedLinearMethod()` if the current module's prefix is in the `skip_modules` list (lines 96-97).
  - Ensured the `skip_modules` list is passed down when creating nested `TorchAOConfig` instances within `get_quant_method` (line 105).
  - Optimized `torchao_quantize_param_data` by creating a small `nn.Linear(1, 1)` and manually setting `in_features` and `out_features` instead of using the full parameter shape directly (lines 129-131).
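The skip-list handling described in the changelog can be illustrated with a minimal sketch. This is not the PR's code: the helper names `collect_skip_modules` and `should_skip` are hypothetical, and the inputs are assumed to be a plain list and a plain dict rather than the real Hugging Face or TorchAO config objects. Only the substring check itself, `any(s in prefix for s in skip_modules)`, is quoted from the review further down.

```python
from typing import Any, Optional


def collect_skip_modules(
    modules_to_not_convert: Optional[list[str]],
    module_fqn_to_config: Optional[dict[str, Any]],
) -> list[str]:
    """Hypothetical helper: merge both sources of 'do not quantize' markers."""
    # Start from the explicit skip list, if one was provided.
    skip: list[str] = list(modules_to_not_convert or [])
    # A module whose per-FQN config is None is also treated as unquantized.
    for fqn, cfg in (module_fqn_to_config or {}).items():
        if cfg is None and fqn not in skip:
            skip.append(fqn)
    return skip


def should_skip(prefix: str, skip_modules: list[str]) -> bool:
    """Substring check in the same form as the one quoted in the review."""
    return any(s in prefix for s in skip_modules)


skip = collect_skip_modules(
    ["visual"],
    {"model.layers.0.mlp": None, "lm_head": {"some_setting": 1}},
)
print(skip)
print(should_skip("visual.blocks.0.attn.qkv", skip))
print(should_skip("model.layers.1.mlp.down_proj", skip))
```

In this sketch a layer for which `should_skip` returns true would fall back to the unquantized linear method, while every other layer keeps its TorchAO configuration.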
Code Review
This pull request effectively addresses the issue of loading TorchAO models with partially quantized layers, particularly for Vision-Language models. The changes to handle `modules_to_not_convert` and `module_fqn_to_config` for skipping layers are well-implemented. The added test case for a Qwen-VL model is a good addition, and the optimization in `torchao_quantize_param_data` to reduce memory allocation during dummy linear layer creation is a nice improvement.
I have one suggestion regarding the string matching logic for skipping modules, which could be made more robust to prevent potential over-matching. Overall, this is a valuable fix.
Summary of Findings
- Module Skipping Logic Robustness: The logic for determining whether to skip a module (`any(s in prefix for s in self.skip_modules)`) uses a general substring check. This could potentially lead to over-matching if a skip pattern is a substring of an unrelated module's FQN (e.g., skipping `"layer.1"` might unintentionally affect `"layer.10"`). A more precise FQN-aware prefix matching or exact matching would be more robust.
- Test Coverage: A new test case (`test_qwenvl_int8wo_model_loading_with_params`) was added, which is good for verifying the fix for VL models with unquantized vision modules.
- Performance Improvement: The change in `torchao_quantize_param_data` to initialize `nn.Linear` with minimal dimensions (1, 1) and then update `in_features` and `out_features` is a good optimization to reduce temporary memory allocation.
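To make the over-matching concern concrete, here is a small sketch comparing the plain substring check with one possible component-aligned alternative. The function `matches_fqn` is a hypothetical illustration of the kind of FQN-aware matching the review suggests; it is not taken from the PR or from vLLM.

```python
def matches_substring(prefix: str, pattern: str) -> bool:
    # Plain substring test, the same form as the check quoted in the finding.
    return pattern in prefix


def matches_fqn(prefix: str, pattern: str) -> bool:
    """Hypothetical alternative: match only on whole dot-separated components."""
    p_parts = prefix.split(".")
    s_parts = pattern.split(".")
    n = len(s_parts)
    # Slide a window of n components over the prefix and compare exactly.
    return any(
        p_parts[i:i + n] == s_parts for i in range(len(p_parts) - n + 1)
    )


# "layer.1" is a substring of "layer.10", so the substring test over-matches.
print(matches_substring("model.layer.10.mlp", "layer.1"))
print(matches_fqn("model.layer.10.mlp", "layer.1"))
print(matches_fqn("model.layer.1.mlp", "layer.1"))
```

Comparing whole components keeps a pattern such as `layer.1` from touching `layer.10`, at the cost of no longer matching partial names; whether that trade-off suits real skip lists is a decision for the maintainers.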
Merge Readiness
The pull request is well-structured and addresses the core issues effectively. However, there's one medium-severity concern regarding the robustness of the module skipping logic that should be discussed and potentially addressed. Once that point is clarified or resolved, the PR should be in good shape for merging. As an AI, I am not authorized to approve pull requests; please ensure further review and approval from the maintainers.
cc: @jerryzh168
@mobicham thanks for the fix, can you talk a bit more about qkv fusion that you mentioned before? still didn't quite get it
The q,k,v fusion issue you mentioned makes sense, does this PR fix that?
@drisspg only if:
Otherwise, there's no clean way to merge qkv if they don't have the quant settings. Moreover, the merging is not happening in TorchAOConfig, it's happening in the QKV linear modules. The main focus of this PR is to handle layer skipping for layers that were not quantized though. So it's simply checking in the config if the prefix matches the skipped layers defined in the config.
Ohhh yea, fwiw I am planning on writing a little doc in AO on how to write a subclass that will work with vLLM, and the slice and copy is the main point. I guess it feels like if someone skipped one of the q, k projections then we should skip (via ModFQNconfig or skip list) the stacked variant. Was more curious if this is tested and expected to work in this PR.
I see! That can't work with the current code unfortunately because the metadata will mismatch during the slice/copy.
Actually DCO is only a soft requirement. :-)
Anything else to have this merged? Thank you!
Could you rebase to the latest main and fix the pre-commit linter?
@houseroad sorry didn't see that, fixed, thank you!
Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. 
(vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py 
<[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]> --------- Signed-off-by: youkaichao <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: raushan <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: Varun <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: chaunceyjiang <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: 许文卿 <[email protected]> Signed-off-by: Jon Swenson <[email protected]> Signed-off-by: Tyler 
Michael Smith <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: 22quinn <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Signed-off-by: Patrick von Platen <[email protected]> Signed-off-by: Chiyue Wei <[email protected]> Signed-off-by: Povilas Kanapickas <[email protected]> Signed-off-by: Luis Vega <[email protected]> Signed-off-by: Jerry Zhang <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Xu Song <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Dipika Sikka <[email protected]> Signed-off-by: rzou <[email protected]> Signed-off-by: Siqi Yan <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Signed-off-by: Chenyaaang <[email protected]> Signed-off-by: Alexei V. 
Ivanov <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Qiliang Cui <[email protected]> Signed-off-by: Aaruni Aggarwal <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Lifan Shen <[email protected]> Signed-off-by: pramkuma <[email protected]> Signed-off-by: luka <[email protected]> Signed-off-by: Richard Zou <[email protected]> Signed-off-by: Xu Wenqing <[email protected]> Signed-off-by: Akash Kaothalkar <[email protected]> Signed-off-by: yZhen <[email protected]> Signed-off-by: KsuParkhamchuk <[email protected]> Signed-off-by: cr7258 <[email protected]> Signed-off-by: Conroy Cheers <[email protected]> Signed-off-by: windsonsea <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: liusiqian <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Tianyu Guo <[email protected]> Signed-off-by: Xiongfei Wei <[email protected]> Signed-off-by: wangli <[email protected]> Signed-off-by: Anna Pendleton <[email protected]> Signed-off-by: Tsai, Louie <[email protected]> Signed-off-by: Yunqiu Guo <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: py-andy-c <[email protected]> Signed-off-by: niu_he <[email protected]> Signed-off-by: Junhao Li <[email protected]> Signed-off-by: artetaout <[email protected]> Signed-off-by: ximing.wxm <[email protected]> Signed-off-by: Runzhen Wang <[email protected]> Signed-off-by: David Xia <[email protected]> Signed-off-by: Bill Nell <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Andy Xie <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Brayden Zhong <[email protected]> 
Signed-off-by: strutive07 <[email protected]> Signed-off-by: 2niuhe <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: yewentao256 <[email protected]> Signed-off-by: mobicham <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: Boyuan Feng <[email protected]> Signed-off-by: qingjun <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Saheli Bhattacharjee <[email protected]> Signed-off-by: 刘全 <[email protected]> Signed-off-by: Francesco Bertolotti <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Shawn Tan <[email protected]> Signed-off-by: QscQ <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Raushan Turganbay <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: SorenDreano <[email protected]> Co-authored-by: Soren Dreano <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yong Hoon Shin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Russell Bryant <[email protected]> 
Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: Vadim Gimpelson <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Xu Wenqing <[email protected]> Co-authored-by: Lain <[email protected]> Co-authored-by: jmswen <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: vllmellm <[email protected]> Co-authored-by: 22quinn <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Povilas Kanapickas <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: Chengji Yao <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Jinghui Zhang <[email protected]> Co-authored-by: jinghui <[email protected]> Co-authored-by: Richard Zou <[email 
protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Yu Guo <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Chenyaaang <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: QiliangCui <[email protected]> Co-authored-by: Aaruni Aggarwal <[email protected]> Co-authored-by: Driss Guessous <[email protected]> Co-authored-by: Lifans <[email protected]> Co-authored-by: pramenku <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> Co-authored-by: jennyyyyzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: Kseniya Parkhamchuk <[email protected]> Co-authored-by: Se7en <[email protected]> Co-authored-by: Conroy Cheers <[email protected]> Co-authored-by: Michael Yao <[email protected]> Co-authored-by: Yinghai Lu <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: liusiqian-tal <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Tianyu Guo <[email protected]> Co-authored-by: XiongfeiWei <[email protected]> Co-authored-by: Li Wang <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Anna Pendleton <[email protected]> Co-authored-by: Louie Tsai <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rachel Guo <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: py-andy-c <[email protected]> Co-authored-by: niu_he <[email protected]> 
Co-authored-by: Junhao Li <[email protected]> Co-authored-by: leopardracer <[email protected]> Co-authored-by: artetaout <[email protected]> Co-authored-by: Ximingwang-09 <[email protected]> Co-authored-by: ximing.wxm <[email protected]> Co-authored-by: runzhen <[email protected]> Co-authored-by: David Xia <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Ning Xie <[email protected]> Co-authored-by: Brayden Zhong <[email protected]> Co-authored-by: wonjun Jang <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Co-authored-by: mobicham <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: qizixi <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Boyuan Feng <[email protected]> Co-authored-by: qscqesze <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Saheli Bhattacharjee <[email protected]> Co-authored-by: jiahanc <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: maobaolong <[email protected]> Co-authored-by: Ilya Markov <[email protected]> Co-authored-by: quanliu <[email protected]> Co-authored-by: 刘全 <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Szymon Ożóg <[email protected]> Co-authored-by: Navanit Dubey <[email protected]> Co-authored-by: Shawn Tan <[email protected]> Co-authored-by: qscqesze <[email protected]>
* [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]>
* [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]>
* [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]>
* [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]>
* [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]>
* [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]>
* [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]>
* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]>
* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]>
* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang <[email protected]>
* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)
* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) Signed-off-by: Chen Zhang <[email protected]>
* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]>
* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email protected]> Co-authored-by: Robert Shaw <[email protected]>
* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]>
* feat: add data parallel rank to KVEventBatch (vllm-project#18925)
* [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919)
* [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [CPU] V1 support for the CPU backend (vllm-project#16441)
* [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]>
* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)
* [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]>
* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)
* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]>
* Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]>
* [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]>
* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113) Signed-off-by: Seiji Eicher <[email protected]>
* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089) Signed-off-by: Isotr0py <[email protected]>
* Improve the output precision of embedding models (vllm-project#19092)
* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678) Signed-off-by: DarkLight1337 <[email protected]>
* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874) Signed-off-by: 许文卿 <[email protected]>
* Sm100 blockwise fp8 swap ab (vllm-project#18564)
* [Doc] Update V1 Guide for embedding models (vllm-project#19141) Signed-off-by: DarkLight1337 <[email protected]>
* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102) Signed-off-by: Jon Swenson <[email protected]>
* [Bugfix][EP+DP] Fix internode check (vllm-project#19112) Signed-off-by: Tyler Michael Smith <[email protected]>
* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778) Signed-off-by: mgoin <[email protected]>
* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108) Signed-off-by: Siyuan Liu <[email protected]>
* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)
* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]>
* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817) Signed-off-by: googs1025 <[email protected]>
* [P/D] Heterogeneous TP (vllm-project#18833) Signed-off-by: nicklucche <[email protected]>
* [doc] small fix (vllm-project#19167) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632) Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)
* [Torch Nightly]add missing dependency (vllm-project#18770) Signed-off-by: Yang Wang <[email protected]>
* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)
* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171) Signed-off-by: Woosuk Kwon <[email protected]>
* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135) Signed-off-by: chaunceyjiang <[email protected]>
* [Build] Annotate wheel and container path for release workflow (vllm-project#19162) Signed-off-by: simon-mo <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138) Signed-off-by: vllmellm <[email protected]>
* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105) Signed-off-by: 22quinn <[email protected]>
* [Frontend] improve vllm run-batch --help display (vllm-project#19187) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202) Signed-off-by: Guillaume Calmettes <[email protected]>
* [mistral_common] Add v11 tokenizer (vllm-project#19193) Signed-off-by: Patrick von Platen <[email protected]>
* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)
* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110) Signed-off-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]>
* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <[email protected]>
* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)
* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)
* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)
* [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]>
* Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]>
* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]>
* [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]>
* [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]>
* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]>
* [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]>
* [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]>
* [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]>
* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]>
* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]>
* Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]>
* Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]>
* [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]>
* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]>
* improve logits bias (vllm-project#19041)
* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]>
* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]>
* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]>
* [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]>
* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]>
* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]>
* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]>
* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]>
* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]>
* Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]>
* [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) Signed-off-by: Isotr0py <[email protected]>
* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]>
* [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]>
* [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]>
* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]>
* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]>
* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]>
* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]>
* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]>
* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)
* [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]>
* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)
* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]>
* [Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]>
* [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]>
* [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]>
* [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]>
* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]>
* [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]>
* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]>
* [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]>
* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]>
* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]>
* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]>
* [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]>
* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]>
* [TPU]Fix KV cache sharing tests (vllm-project#19371)
* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]>
* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]>
* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]>
* [Bugfix] Fix benchmark_moe.py (vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]>
* Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]>
* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)
* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]>
* [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>
* [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]>
* Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]>
* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Add clear documentation around the impact of debugging flag (vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]>
* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node.
(vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docs] Note that alternative structured output backends are supported (vllm-project#19426) Signed-off-by: Russell Bryant <[email protected]> * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Model] use AutoWeightsLoader for commandr (vllm-project#19399) Signed-off-by: py-andy-c <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401) Signed-off-by: 许文卿 <[email protected]> * [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390) Signed-off-by: rzou <[email protected]> * [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260) * [BugFix] Fix docker build cpu-dev image error (vllm-project#19394) Signed-off-by: niu_he <[email protected]> * Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451) Signed-off-by: Lu Fang <[email protected]> * [CI] Disable failing GGUF model test (vllm-project#19454) Signed-off-by: mgoin <[email protected]> * [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422) 
Signed-off-by: Lukas Geiger <[email protected]> * Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455) Signed-off-by: Junhao Li <[email protected]> * Fix Typo in Documentation and Function Name (vllm-project#19442) * [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Support deep_gemm for linear methods (vllm-project#19085) Signed-off-by: artetaout <[email protected]> * [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix quantization link titles (vllm-project#19478) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Support "important" and "announcement" admonitions (vllm-project#19479) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Reduce warning message introduced in env_override (vllm-project#19476) Signed-off-by: Lu Fang <[email protected]> * Support non-string values in JSON keys from CLI (vllm-project#19471) Signed-off-by: DarkLight1337 <[email protected]> * Add cache to cuda get_device_capability (vllm-project#19436) Signed-off-by: mgoin <[email protected]> * Fix some typo (vllm-project#19475) Signed-off-by: ximing.wxm <[email protected]> Co-authored-by: ximing.wxm <[email protected]> * Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241) Signed-off-by: Tsai, Louie <[email protected]> * [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453) Signed-off-by: Runzhen Wang <[email protected]> * [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297) Signed-off-by: mgoin <[email protected]> * [doc] fix "Other AI accelerators" getting started page (vllm-project#19457) Signed-off-by: David Xia <[email protected]> * [Misc] Fix misleading ROCm warning (vllm-project#19486) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] Remove WIP features in V1 guide (vllm-project#19498) Signed-off-by: Woosuk 
Kwon <[email protected]> * [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168) Signed-off-by: Bill Nell <[email protected]> * [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331) Signed-off-by: Randall Smith <[email protected]> * [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501) Signed-off-by: [email protected] <[email protected]> * [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505) Signed-off-by: Richard Zou <[email protected]> * [CI] change spell checker from codespell to typos (vllm-project#18711) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518) Signed-off-by: Brayden Zhong <[email protected]> * [Frontend] Improve error message in tool_choice validation (vllm-project#19239) Signed-off-by: 22quinn <[email protected]> * [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449) Signed-off-by: Nick Hill <[email protected]> * [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522) Signed-off-by: strutive07 <[email protected]> * [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509) Signed-off-by: Randall Smith <[email protected]> * Fix typo (vllm-project#19525) Signed-off-by: 2niuhe <[email protected]> * [Security] Prevent new imports of (cloud)pickle (vllm-project#18018) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492) Signed-off-by: mgoin <[email protected]> * [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503) 
Signed-off-by: Jon Swenson <[email protected]> * [Quantization] Improve AWQ logic (vllm-project#19431) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add V1 column to supported models list (vllm-project#19523) Signed-off-by: DarkLight1337 <[email protected]> * [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532) Signed-off-by: NickLucche <[email protected]> * [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233) Signed-off-by: yewentao256 <[email protected]> * Fix TorchAOConfig skip layers (vllm-project#19265) Signed-off-by: mobicham <[email protected]> * [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756) Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Sage Moore <[email protected]> * [doc] Make top navigation sticky (vllm-project#19540) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847) * [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506) * [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452) Signed-off-by: mgoin <[email protected]> * [Doc] Unify structured outputs examples (vllm-project#18196) Signed-off-by: Aaron Pham <[email protected]> * [V1] Resolve failed concurrent structured output requests (vllm-project#19565) Signed-off-by: Russell Bryant <[email protected]> * Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378) * [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570) Signed-off-by: qizixi <[email protected]> * [Doc] uses absolute links for structured outputs (vllm-project#19582) 
Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. 
(vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py 
<[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]> * fix Signed-off-by: Amog Kamsetty <[email protected]> * remove logging Signed-off-by: Amog Kamsetty <[email protected]> --------- Signed-off-by: raushan <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: Varun <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: chaunceyjiang <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: 许文卿 <[email protected]> Signed-off-by: Jon Swenson <[email 
protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: 22quinn <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Signed-off-by: Patrick von Platen <[email protected]> Signed-off-by: Chiyue Wei <[email protected]> Signed-off-by: Povilas Kanapickas <[email protected]> Signed-off-by: Luis Vega <[email protected]> Signed-off-by: Jerry Zhang <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Xu Song <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Dipika Sikka <[email protected]> Signed-off-by: rzou <[email protected]> Signed-off-by: Siqi Yan <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Signed-off-by: Chenyaaang <[email protected]> Signed-off-by: Alexei V. 
Ivanov <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Qiliang Cui <[email protected]> Signed-off-by: Aaruni Aggarwal <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Lifan Shen <[email protected]> Signed-off-by: pramkuma <[email protected]> Signed-off-by: luka <[email protected]> Signed-off-by: Richard Zou <[email protected]> Signed-off-by: Xu Wenqing <[email protected]> Signed-off-by: Akash Kaothalkar <[email protected]> Signed-off-by: yZhen <[email protected]> Signed-off-by: KsuParkhamchuk <[email protected]> Signed-off-by: cr7258 <[email protected]> Signed-off-by: Conroy Cheers <[email protected]> Signed-off-by: windsonsea <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: liusiqian <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Tianyu Guo <[email protected]> Signed-off-by: Xiongfei Wei <[email protected]> Signed-off-by: wangli <[email protected]> Signed-off-by: Anna Pendleton <[email protected]> Signed-off-by: Tsai, Louie <[email protected]> Signed-off-by: Yunqiu Guo <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: py-andy-c <[email protected]> Signed-off-by: niu_he <[email protected]> Signed-off-by: Junhao Li <[email protected]> Signed-off-by: artetaout <[email protected]> Signed-off-by: ximing.wxm <[email protected]> Signed-off-by: Runzhen Wang <[email protected]> Signed-off-by: David Xia <[email protected]> Signed-off-by: Bill Nell <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Andy Xie <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> 
Signed-off-by: Brayden Zhong <[email protected]> Signed-off-by: strutive07 <[email protected]> Signed-off-by: 2niuhe <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: yewentao256 <[email protected]> Signed-off-by: mobicham <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: Boyuan Feng <[email protected]> Signed-off-by: qingjun <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Saheli Bhattacharjee <[email protected]> Signed-off-by: 刘全 <[email protected]> Signed-off-by: Francesco Bertolotti <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Shawn Tan <[email protected]> Signed-off-by: QscQ <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Raushan Turganbay <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: SorenDreano <[email protected]> Co-authored-by: Soren Dreano <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yong Hoon Shin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Russell Bryant <[email protected]> 
Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: Vadim Gimpelson <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Xu Wenqing <[email protected]> Co-authored-by: Lain <[email protected]> Co-authored-by: jmswen <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: vllmellm <[email protected]> Co-authored-by: 22quinn <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Povilas Kanapickas <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: Chengji Yao <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Jinghui Zhang <[email protected]> Co-authored-by: jinghui <[email protected]> Co-authored-by: Richard Zou <[email 
protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Yu Guo <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Chenyaaang <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: QiliangCui <[email protected]> Co-authored-by: Aaruni Aggarwal <[email protected]> Co-authored-by: Driss Guessous <[email protected]> Co-authored-by: Lifans <[email protected]> Co-authored-by: pramenku <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> Co-authored-by: jennyyyyzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: Kseniya Parkhamchuk <[email protected]> Co-authored-by: Se7en <[email protected]> Co-authored-by: Conroy Cheers <[email protected]> Co-authored-by: Michael Yao <[email protected]> Co-authored-by: Yinghai Lu <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: liusiqian-tal <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Tianyu Guo <[email protected]> Co-authored-by: XiongfeiWei <[email protected]> Co-authored-by: Li Wang <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Anna Pendleton <[email protected]> Co-authored-by: Louie Tsai <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rachel Guo <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> 
Co-authored-by: py-andy-c <[email protected]> Co-authored-by: niu_he <[email protected]> Co-authored-by: Junhao Li <[email protected]> Co-authored-by: leopardracer <[email protected]> Co-authored-by: artetaout <[email protected]> Co-authored-by: Ximingwang-09 <[email protected]> Co-authored-by: ximing.wxm <[email protected]> Co-authored-by: runzhen <[email protected]> Co-authored-by: David Xia <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Ning Xie <[email protected]> Co-authored-by: Brayden Zhong <[email protected]> Co-authored-by: wonjun Jang <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Co-authored-by: mobicham <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: qizixi <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Boyuan Feng <[email protected]> Co-authored-by: qscqesze <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Saheli Bhattacharjee <[email protected]> Co-authored-by: jiahanc <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: maobaolong <[email protected]> Co-authored-by: Ilya Markov <[email protected]> Co-authored-by: quanliu <[email protected]> Co-authored-by: 刘全 <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Szymon Ożóg <[email protected]> Co-authored-by: Navanit Dubey <[email protected]> Co-authored-by: Shawn Tan <[email protected]> Co-authored-by: qscqesze <[email protected]>
FlashInfer by default on Blackwell GPUs (vllm-project#19118) * [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> * Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]> * [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]> * [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]> * [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]> * Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]> * [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]> * [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]> * [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]> * [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]> * [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]> * Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]> * Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> * [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]> * [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]> * improve logits bias (vllm-project#19041) * Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya 
<[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> * [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]> * [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]> * [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]> * [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]> * [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]> * [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]> * Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]> * [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) 
Signed-off-by: Isotr0py <[email protected]> * [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]> * [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]> * [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]> * [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]> * [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]> * [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> * [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312) * [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]> * [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201) * [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]> * [Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]> * [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]> * [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email 
protected]> Co-authored-by: reidliu41 <[email protected]> * [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]> * [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]> * [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]> * [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]> * [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]> * [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]> * [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]> * [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]> * [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]> * [TPU]Fix KV cache sharing tests (vllm-project#19371) * [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]> * [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]> * [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [Bugfix] Fix benchmark_moe.py 
(vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]> * Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]> * Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382) * [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]> * Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]> * [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Add clear documentation around the impact of debugging flag (vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]> * Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. 
(vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docs] Note that alternative structured output backends are supported (vllm-project#19426) Signed-off-by: Russell Bryant <[email protected]> * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Model] use AutoWeightsLoader for commandr (vllm-project#19399) Signed-off-by: py-andy-c <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401) Signed-off-by: 许文卿 <[email protected]> * [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390) Signed-off-by: rzou <[email protected]> * [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260) * [BugFix] Fix docker build cpu-dev image error (vllm-project#19394) Signed-off-by: niu_he <[email protected]> * Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451) Signed-off-by: Lu Fang <[email protected]> * [CI] Disable failing GGUF model test (vllm-project#19454) Signed-off-by: mgoin <[email protected]> * [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422) 
Signed-off-by: Lukas Geiger <[email protected]> * Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455) Signed-off-by: Junhao Li <[email protected]> * Fix Typo in Documentation and Function Name (vllm-project#19442) * [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Support deep_gemm for linear methods (vllm-project#19085) Signed-off-by: artetaout <[email protected]> * [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix quantization link titles (vllm-project#19478) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Support "important" and "announcement" admonitions (vllm-project#19479) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Reduce warning message introduced in env_override (vllm-project#19476) Signed-off-by: Lu Fang <[email protected]> * Support non-string values in JSON keys from CLI (vllm-project#19471) Signed-off-by: DarkLight1337 <[email protected]> * Add cache to cuda get_device_capability (vllm-project#19436) Signed-off-by: mgoin <[email protected]> * Fix some typo (vllm-project#19475) Signed-off-by: ximing.wxm <[email protected]> Co-authored-by: ximing.wxm <[email protected]> * Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241) Signed-off-by: Tsai, Louie <[email protected]> * [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453) Signed-off-by: Runzhen Wang <[email protected]> * [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297) Signed-off-by: mgoin <[email protected]> * [doc] fix "Other AI accelerators" getting started page (vllm-project#19457) Signed-off-by: David Xia <[email protected]> * [Misc] Fix misleading ROCm warning (vllm-project#19486) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] Remove WIP features in V1 guide (vllm-project#19498) Signed-off-by: Woosuk 
Kwon <[email protected]> * [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168) Signed-off-by: Bill Nell <[email protected]> * [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331) Signed-off-by: Randall Smith <[email protected]> * [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501) Signed-off-by: [email protected] <[email protected]> * [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505) Signed-off-by: Richard Zou <[email protected]> * [CI] change spell checker from codespell to typos (vllm-project#18711) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518) Signed-off-by: Brayden Zhong <[email protected]> * [Frontend] Improve error message in tool_choice validation (vllm-project#19239) Signed-off-by: 22quinn <[email protected]> * [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449) Signed-off-by: Nick Hill <[email protected]> * [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522) Signed-off-by: strutive07 <[email protected]> * [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509) Signed-off-by: Randall Smith <[email protected]> * Fix typo (vllm-project#19525) Signed-off-by: 2niuhe <[email protected]> * [Security] Prevent new imports of (cloud)pickle (vllm-project#18018) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492) Signed-off-by: mgoin <[email protected]> * [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503) 
Signed-off-by: Jon Swenson <[email protected]> * [Quantization] Improve AWQ logic (vllm-project#19431) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add V1 column to supported models list (vllm-project#19523) Signed-off-by: DarkLight1337 <[email protected]> * [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532) Signed-off-by: NickLucche <[email protected]> * [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233) Signed-off-by: yewentao256 <[email protected]> * Fix TorchAOConfig skip layers (vllm-project#19265) Signed-off-by: mobicham <[email protected]> * [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756) Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Sage Moore <[email protected]> * [doc] Make top navigation sticky (vllm-project#19540) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847) * [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506) * [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452) Signed-off-by: mgoin <[email protected]> * [Doc] Unify structured outputs examples (vllm-project#18196) Signed-off-by: Aaron Pham <[email protected]> * [V1] Resolve failed concurrent structured output requests (vllm-project#19565) Signed-off-by: Russell Bryant <[email protected]> * Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378) * [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570) Signed-off-by: qizixi <[email protected]> * [Doc] uses absolute links for structured outputs (vllm-project#19582) 
Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. 
(vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py 
<[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]> * fix Signed-off-by: Amog Kamsetty <[email protected]> --------- Signed-off-by: raushan <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: nicklucche <[email protected]> Signed-off-by: googs1025 <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: reidliu41 <[email protected]> Signed-off-by: Varun <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: chaunceyjiang <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Lukas Geiger <[email protected]> Signed-off-by: calvin chen <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: 许文卿 <[email protected]> Signed-off-by: Jon Swenson <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> 
Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Signed-off-by: 22quinn <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Signed-off-by: Patrick von Platen <[email protected]> Signed-off-by: Chiyue Wei <[email protected]> Signed-off-by: Povilas Kanapickas <[email protected]> Signed-off-by: Luis Vega <[email protected]> Signed-off-by: Jerry Zhang <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Xu Song <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Dipika Sikka <[email protected]> Signed-off-by: rzou <[email protected]> Signed-off-by: Siqi Yan <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Signed-off-by: Chenyaaang <[email protected]> Signed-off-by: Alexei V. 
Ivanov <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Qiliang Cui <[email protected]> Signed-off-by: Aaruni Aggarwal <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Lifan Shen <[email protected]> Signed-off-by: pramkuma <[email protected]> Signed-off-by: luka <[email protected]> Signed-off-by: Richard Zou <[email protected]> Signed-off-by: Xu Wenqing <[email protected]> Signed-off-by: Akash Kaothalkar <[email protected]> Signed-off-by: yZhen <[email protected]> Signed-off-by: KsuParkhamchuk <[email protected]> Signed-off-by: cr7258 <[email protected]> Signed-off-by: Conroy Cheers <[email protected]> Signed-off-by: windsonsea <[email protected]> Signed-off-by: Yinghai Lu <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: liusiqian <[email protected]> Signed-off-by: Pavani Majety <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Tianyu Guo <[email protected]> Signed-off-by: Xiongfei Wei <[email protected]> Signed-off-by: wangli <[email protected]> Signed-off-by: Anna Pendleton <[email protected]> Signed-off-by: Tsai, Louie <[email protected]> Signed-off-by: Yunqiu Guo <[email protected]> Signed-off-by: jiang.li <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: py-andy-c <[email protected]> Signed-off-by: niu_he <[email protected]> Signed-off-by: Junhao Li <[email protected]> Signed-off-by: artetaout <[email protected]> Signed-off-by: ximing.wxm <[email protected]> Signed-off-by: Runzhen Wang <[email protected]> Signed-off-by: David Xia <[email protected]> Signed-off-by: Bill Nell <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Andy Xie <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> 
Signed-off-by: Brayden Zhong <[email protected]> Signed-off-by: strutive07 <[email protected]> Signed-off-by: 2niuhe <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: yewentao256 <[email protected]> Signed-off-by: mobicham <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: Zerohertz <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: Boyuan Feng <[email protected]> Signed-off-by: qingjun <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Saheli Bhattacharjee <[email protected]> Signed-off-by: 刘全 <[email protected]> Signed-off-by: Francesco Bertolotti <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Shawn Tan <[email protected]> Signed-off-by: QscQ <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Raushan Turganbay <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: CYJiang <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: SorenDreano <[email protected]> Co-authored-by: Soren Dreano <[email protected]> Co-authored-by: Reid <[email protected]> Co-authored-by: reidliu41 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Yong Hoon Shin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Chauncey <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Russell Bryant <[email protected]> 
Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Lukas Geiger <[email protected]> Co-authored-by: Vadim Gimpelson <[email protected]> Co-authored-by: Calvin Chen <[email protected]> Co-authored-by: Kaixi Hou <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: wang.yuqi <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Xu Wenqing <[email protected]> Co-authored-by: Lain <[email protected]> Co-authored-by: jmswen <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Kebe <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Huy Do <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: vllmellm <[email protected]> Co-authored-by: 22quinn <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: Patrick von Platen <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> Co-authored-by: Povilas Kanapickas <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Benjamin Chislett <[email protected]> Co-authored-by: Chengji Yao <[email protected]> Co-authored-by: Xu Song <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Jinghui Zhang <[email protected]> Co-authored-by: jinghui <[email protected]> Co-authored-by: Richard Zou <[email 
protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Yu Guo <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Adolfo Victoria <[email protected]> Co-authored-by: Chenyaaang <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: QiliangCui <[email protected]> Co-authored-by: Aaruni Aggarwal <[email protected]> Co-authored-by: Driss Guessous <[email protected]> Co-authored-by: Lifans <[email protected]> Co-authored-by: pramenku <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Akash kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> Co-authored-by: jennyyyyzhen <[email protected]> Co-authored-by: yZhen <[email protected]> Co-authored-by: Kseniya Parkhamchuk <[email protected]> Co-authored-by: Se7en <[email protected]> Co-authored-by: Conroy Cheers <[email protected]> Co-authored-by: Michael Yao <[email protected]> Co-authored-by: Yinghai Lu <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: liusiqian-tal <[email protected]> Co-authored-by: Pavani Majety <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Tianyu Guo <[email protected]> Co-authored-by: XiongfeiWei <[email protected]> Co-authored-by: Li Wang <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Anna Pendleton <[email protected]> Co-authored-by: Louie Tsai <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Rachel Guo <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> 
Co-authored-by: py-andy-c <[email protected]> Co-authored-by: niu_he <[email protected]> Co-authored-by: Junhao Li <[email protected]> Co-authored-by: leopardracer <[email protected]> Co-authored-by: artetaout <[email protected]> Co-authored-by: Ximingwang-09 <[email protected]> Co-authored-by: ximing.wxm <[email protected]> Co-authored-by: runzhen <[email protected]> Co-authored-by: David Xia <[email protected]> Co-authored-by: bnellnm <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Ning Xie <[email protected]> Co-authored-by: Brayden Zhong <[email protected]> Co-authored-by: wonjun Jang <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: Wentao Ye <[email protected]> Co-authored-by: mobicham <[email protected]> Co-authored-by: Sage Moore <[email protected]> Co-authored-by: kourosh hakhamaneshi <[email protected]> Co-authored-by: qizixi <[email protected]> Co-authored-by: Hyogeun Oh (오효근) <[email protected]> Co-authored-by: Boyuan Feng <[email protected]> Co-authored-by: qscqesze <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Saheli Bhattacharjee <[email protected]> Co-authored-by: jiahanc <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: maobaolong <[email protected]> Co-authored-by: Ilya Markov <[email protected]> Co-authored-by: quanliu <[email protected]> Co-authored-by: 刘全 <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Szymon Ożóg <[email protected]> Co-authored-by: Navanit Dubey <[email protected]> Co-authored-by: Shawn Tan <[email protected]> Co-authored-by: qscqesze <[email protected]>
Signed-off-by: mobicham <[email protected]> Signed-off-by: minpeter <[email protected]>
Signed-off-by: mobicham <[email protected]>
* [doc] clarify windows support (vllm-project#19088) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove V0 LoRA test (vllm-project#19066) Signed-off-by: Jee Jee Li <[email protected]> * Fix underscores in dict keys passed via CLI (vllm-project#19030) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]> * [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]> * [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]> * [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]> * [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]> * [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]> * [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]> * [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]> * Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> * [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang <[email protected]> * [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971) * [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) 
Signed-off-by: Chen Zhang <[email protected]> * [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]> * feat: add data parallel rank to KVEventBatch (vllm-project#18925) * [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919) * [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CPU] V1 support for the CPU backend (vllm-project#16441) * [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]> * [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437) * [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]> * [NVIDIA] Add Cutlass MLA backend (vllm-project#17625) * [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]> * Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]> * [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]> * Fix ValueError: Missing value for tag key(s): model_name,engine. 
(vllm-project#19113) Signed-off-by: Seiji Eicher <[email protected]> * [Misc] Add packages for benchmark as extra dependency (vllm-project#19089) Signed-off-by: Isotr0py <[email protected]> * Improve the output precision of embedding models (vllm-project#19092) * [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678) Signed-off-by: DarkLight1337 <[email protected]> * Add DeepSeek-R1-0528 function call chat template (vllm-project#18874) Signed-off-by: 许文卿 <[email protected]> * Sm100 blockwise fp8 swap ab (vllm-project#18564) * [Doc] Update V1 Guide for embedding models (vllm-project#19141) Signed-off-by: DarkLight1337 <[email protected]> * Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102) Signed-off-by: Jon Swenson <[email protected]> * [Bugfix][EP+DP] Fix internode check (vllm-project#19112) Signed-off-by: Tyler Michael Smith <[email protected]> * [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778) Signed-off-by: mgoin <[email protected]> * [TPU] Update dynamo dump file name in compilation test (vllm-project#19108) Signed-off-by: Siyuan Liu <[email protected]> * [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121) * [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817) Signed-off-by: googs1025 <[email protected]> * [P/D] Heterogeneous TP (vllm-project#18833) Signed-off-by: nicklucche <[email protected]> * [doc] small fix (vllm-project#19167) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632) Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [Bugfix] Fix port handling in make_zmq_path 
(vllm-project#19117) * [Torch Nightly]add missing dependency (vllm-project#18770) Signed-off-by: Yang Wang <[email protected]> * Handle non-serializable objects when dumping benchmark results (vllm-project#19114) * [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135) Signed-off-by: chaunceyjiang <[email protected]> * [Build] Annotate wheel and container path for release workflow (vllm-project#19162) Signed-off-by: simon-mo <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138) Signed-off-by: vllmellm <[email protected]> * [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105) Signed-off-by: 22quinn <[email protected]> * [Frontend] improve vllm run-batch --help display (vllm-project#19187) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202) Signed-off-by: Guillaume Calmettes <[email protected]> * [mistral_common] Add v11 tokenizer (vllm-project#19193) Signed-off-by: Patrick von Platen <[email protected]> * Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205) * [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110) Signed-off-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> * [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <[email protected]> * [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090) * [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217) * [V1] Use 
FlashInfer by default on Blackwell GPUs (vllm-project#19118) * [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> * Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]> * [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]> * [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]> * [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]> * Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]> * [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]> * [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]> * [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]> * [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]> * [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]> * Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]> * Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> * [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]> * [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]> * improve logits bias (vllm-project#19041) * Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya 
<[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> * [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]> * [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]> * [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]> * [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]> * [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]> * [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]> * Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]> * [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) 
Signed-off-by: Isotr0py <[email protected]> * [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]> * [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]> * [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]> * [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]> * [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]> * [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> * [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312) * [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]> * [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201) * [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]> * [Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]> * [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]> * [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email 
protected]> Co-authored-by: reidliu41 <[email protected]> * [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]> * [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]> * [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]> * [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]> * [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]> * [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]> * [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]> * [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]> * [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]> * [TPU]Fix KV cache sharing tests (vllm-project#19371) * [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]> * [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]> * [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [Bugfix] Fix benchmark_moe.py 
(vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]> * Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]> * Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382) * [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]> * Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]> * [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Add clear documentation around the impact of debugging flag (vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]> * Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. 
(vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docs] Note that alternative structured output backends are supported (vllm-project#19426) Signed-off-by: Russell Bryant <[email protected]> * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Model] use AutoWeightsLoader for commandr (vllm-project#19399) Signed-off-by: py-andy-c <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401) Signed-off-by: 许文卿 <[email protected]> * [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390) Signed-off-by: rzou <[email protected]> * [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260) * [BugFix] Fix docker build cpu-dev image error (vllm-project#19394) Signed-off-by: niu_he <[email protected]> * Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451) Signed-off-by: Lu Fang <[email protected]> * [CI] Disable failing GGUF model test (vllm-project#19454) Signed-off-by: mgoin <[email protected]> * [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422) 
Signed-off-by: Lukas Geiger <[email protected]> * Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455) Signed-off-by: Junhao Li <[email protected]> * Fix Typo in Documentation and Function Name (vllm-project#19442) * [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Support deep_gemm for linear methods (vllm-project#19085) Signed-off-by: artetaout <[email protected]> * [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix quantization link titles (vllm-project#19478) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Support "important" and "announcement" admonitions (vllm-project#19479) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Reduce warning message introduced in env_override (vllm-project#19476) Signed-off-by: Lu Fang <[email protected]> * Support non-string values in JSON keys from CLI (vllm-project#19471) Signed-off-by: DarkLight1337 <[email protected]> * Add cache to cuda get_device_capability (vllm-project#19436) Signed-off-by: mgoin <[email protected]> * Fix some typo (vllm-project#19475) Signed-off-by: ximing.wxm <[email protected]> Co-authored-by: ximing.wxm <[email protected]> * Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241) Signed-off-by: Tsai, Louie <[email protected]> * [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453) Signed-off-by: Runzhen Wang <[email protected]> * [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297) Signed-off-by: mgoin <[email protected]> * [doc] fix "Other AI accelerators" getting started page (vllm-project#19457) Signed-off-by: David Xia <[email protected]> * [Misc] Fix misleading ROCm warning (vllm-project#19486) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] Remove WIP features in V1 guide (vllm-project#19498) Signed-off-by: Woosuk 
Kwon <[email protected]> * [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168) Signed-off-by: Bill Nell <[email protected]> * [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331) Signed-off-by: Randall Smith <[email protected]> * [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501) Signed-off-by: [email protected] <[email protected]> * [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505) Signed-off-by: Richard Zou <[email protected]> * [CI] change spell checker from codespell to typos (vllm-project#18711) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518) Signed-off-by: Brayden Zhong <[email protected]> * [Frontend] Improve error message in tool_choice validation (vllm-project#19239) Signed-off-by: 22quinn <[email protected]> * [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449) Signed-off-by: Nick Hill <[email protected]> * [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522) Signed-off-by: strutive07 <[email protected]> * [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509) Signed-off-by: Randall Smith <[email protected]> * Fix typo (vllm-project#19525) Signed-off-by: 2niuhe <[email protected]> * [Security] Prevent new imports of (cloud)pickle (vllm-project#18018) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492) Signed-off-by: mgoin <[email protected]> * [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503) 
Signed-off-by: Jon Swenson <[email protected]> * [Quantization] Improve AWQ logic (vllm-project#19431) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add V1 column to supported models list (vllm-project#19523) Signed-off-by: DarkLight1337 <[email protected]> * [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532) Signed-off-by: NickLucche <[email protected]> * [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233) Signed-off-by: yewentao256 <[email protected]> * Fix TorchAOConfig skip layers (vllm-project#19265) Signed-off-by: mobicham <[email protected]> * [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756) Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Sage Moore <[email protected]> * [doc] Make top navigation sticky (vllm-project#19540) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847) * [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506) * [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452) Signed-off-by: mgoin <[email protected]> * [Doc] Unify structured outputs examples (vllm-project#18196) Signed-off-by: Aaron Pham <[email protected]> * [V1] Resolve failed concurrent structured output requests (vllm-project#19565) Signed-off-by: Russell Bryant <[email protected]> * Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378) * [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570) Signed-off-by: qizixi <[email protected]> * [Doc] uses absolute links for structured outputs (vllm-project#19582) 
Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. 
(vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py 
<[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]>
* [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]> * [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]> * [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]> * [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]> * [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]> * [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]> * [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]> * [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]> * Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> * [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang <[email protected]> * [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971) * [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email 
protected]> Co-authored-by: Robert Shaw <[email protected]> * [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]> * feat: add data parallel rank to KVEventBatch (vllm-project#18925) * [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919) * [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CPU] V1 support for the CPU backend (vllm-project#16441) * [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]> * [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437) * [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]> * [NVIDIA] Add Cutlass MLA backend (vllm-project#17625) * [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]> * Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]> * [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]> * Fix ValueError: Missing value for tag key(s): model_name,engine. 
(vllm-project#19113) Signed-off-by: Seiji Eicher <[email protected]> * [Misc] Add packages for benchmark as extra dependency (vllm-project#19089) Signed-off-by: Isotr0py <[email protected]> * Improve the output precision of embedding models (vllm-project#19092) * [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678) Signed-off-by: DarkLight1337 <[email protected]> * Add DeepSeek-R1-0528 function call chat template (vllm-project#18874) Signed-off-by: 许文卿 <[email protected]> * Sm100 blockwise fp8 swap ab (vllm-project#18564) * [Doc] Update V1 Guide for embedding models (vllm-project#19141) Signed-off-by: DarkLight1337 <[email protected]> * Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102) Signed-off-by: Jon Swenson <[email protected]> * [Bugfix][EP+DP] Fix internode check (vllm-project#19112) Signed-off-by: Tyler Michael Smith <[email protected]> * [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778) Signed-off-by: mgoin <[email protected]> * [TPU] Update dynamo dump file name in compilation test (vllm-project#19108) Signed-off-by: Siyuan Liu <[email protected]> * [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121) * [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817) Signed-off-by: googs1025 <[email protected]> * [P/D] Heterogeneous TP (vllm-project#18833) Signed-off-by: nicklucche <[email protected]> * [doc] small fix (vllm-project#19167) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632) Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [Bugfix] Fix port handling in make_zmq_path 
(vllm-project#19117) * [Torch Nightly]add missing dependency (vllm-project#18770) Signed-off-by: Yang Wang <[email protected]> * Handle non-serializable objects when dumping benchmark results (vllm-project#19114) * [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135) Signed-off-by: chaunceyjiang <[email protected]> * [Build] Annotate wheel and container path for release workflow (vllm-project#19162) Signed-off-by: simon-mo <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138) Signed-off-by: vllmellm <[email protected]> * [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105) Signed-off-by: 22quinn <[email protected]> * [Frontend] improve vllm run-batch --help display (vllm-project#19187) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202) Signed-off-by: Guillaume Calmettes <[email protected]> * [mistral_common] Add v11 tokenizer (vllm-project#19193) Signed-off-by: Patrick von Platen <[email protected]> * Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205) * [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110) Signed-off-by: Chiyue Wei <[email protected]> Co-authored-by: Chiyue Wei <[email protected]> * [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226) Signed-off-by: Povilas Kanapickas <[email protected]> * [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090) * [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217) * [V1] Use 
FlashInfer by default on Blackwell GPUs (vllm-project#19118) * [Model] NemotronH support (vllm-project#18863) Signed-off-by: Luis Vega <[email protected]> Co-authored-by: Luis Vega <[email protected]> * Fix AOPerModuleConfig name changes (vllm-project#18869) Signed-off-by: Jerry Zhang <[email protected]> * [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033) Signed-off-by: Benjamin Chislett <[email protected]> * [v1] Hybrid Memory Allocator (vllm-project#17996) Signed-off-by: Chen Zhang <[email protected]> * [TPU] update torch_xla pin (vllm-project#19231) Signed-off-by: Chengji Yao <[email protected]> * Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143) Signed-off-by: Xu Song <[email protected]> * [Chore] update CODEOWNERS (vllm-project#19247) Signed-off-by: Aaron Pham <[email protected]> * [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182) Co-authored-by: jinghui <[email protected]> * [TPU] fix kv cache dtype in model runner (vllm-project#19244) Signed-off-by: Chengji Yao <[email protected]> * [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224) Signed-off-by: Dipika Sikka <[email protected]> * [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172) Signed-off-by: Nick Hill <[email protected]> * Fix CompilationConfig repr (vllm-project#19091) Signed-off-by: rzou <[email protected]> * Unit Test for run_dp_sharded_vision_model (vllm-project#19103) Signed-off-by: Siqi Yan <[email protected]> Co-authored-by: Siqi Yan <[email protected]> * [Model] Optimize nemotron_h implementation (vllm-project#19249) Signed-off-by: Jee Jee Li <[email protected]> * [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227) Signed-off-by: Jon Swenson <[email protected]> * improve logits bias (vllm-project#19041) * Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422) Signed-off-by: Nishidha Panpaliya 
<[email protected]> Signed-off-by: Md. Shafi Hussain <[email protected]> Signed-off-by: npanpaliya <[email protected]> Co-authored-by: Md. Shafi Hussain <[email protected]> * [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291) Signed-off-by: Nick Hill <[email protected]> * [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225) Co-authored-by: Adolfo Victoria <[email protected]> * [Core] Fix abrupt request abort (vllm-project#18485) Signed-off-by: nicklucche <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228) Signed-off-by: Nick Hill <[email protected]> * [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163) Signed-off-by: Chenyaaang <[email protected]> * [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762) Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039) Signed-off-by: Qiliang Cui <[email protected]> * [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253) Signed-off-by: Aaruni Aggarwal <[email protected]> * Add FlexAttention to V1 (vllm-project#16078) Signed-off-by: drisspg <[email protected]> * [Misc] refactor context extension (vllm-project#19246) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287) 
Signed-off-by: Isotr0py <[email protected]> * [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311) Signed-off-by: Lifan Shen <[email protected]> * [AMD] Update compatible packaging version (vllm-project#19309) Signed-off-by: pramkuma <[email protected]> * [BugFix][V1] Fix memory profiling bug (vllm-project#18974) Signed-off-by: luka <[email protected]> * [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299) Signed-off-by: Richard Zou <[email protected]> * [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302) Signed-off-by: rzou <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315) Signed-off-by: Xu Wenqing <[email protected]> * [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082) Signed-off-by: Akash Kaothalkar <[email protected]> Co-authored-by: Akash Kaothalkar <[email protected]> * [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312) * [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242) Signed-off-by: yZhen <[email protected]> Co-authored-by: yZhen <[email protected]> * [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201) * [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Add documentation update reminder to PR template (vllm-project#19289) Signed-off-by: Isotr0py <[email protected]> * [Frontend] Remove unreachable code from llm.py (vllm-project#19288) Signed-off-by: KsuParkhamchuk <[email protected]> * [Misc] Cleanup compilation tests (vllm-project#19343) Signed-off-by: rzou <[email protected]> * [doc] improve ci doc (vllm-project#19307) Signed-off-by: reidliu41 <[email 
protected]> Co-authored-by: reidliu41 <[email protected]> * [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333) Signed-off-by: cr7258 <[email protected]> * [CI/Build] Fix LoRA test (vllm-project#19350) Signed-off-by: Jee Jee Li <[email protected]> * [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328) Signed-off-by: Conroy Cheers <[email protected]> * [CI] Introduce rules for llama auto-label (vllm-project#19323) Signed-off-by: Lu Fang <[email protected]> * [Docs] Fix a bullet list in usage/security.md (vllm-project#19358) Signed-off-by: windsonsea <[email protected]> * [full_graph] Fix query_start_loc padding (vllm-project#19321) Signed-off-by: Yinghai Lu <[email protected]> * [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun <[email protected]> * [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348) Signed-off-by: 22quinn <[email protected]> * [Quantization] Bump compressed-tensors version (vllm-project#19295) Signed-off-by: Kyle Sayers <[email protected]> * [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472) Signed-off-by: liusiqian <[email protected]> * [TPU]Fix KV cache sharing tests (vllm-project#19371) * [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374) Signed-off-by: Pavani Majety <[email protected]> * [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383) Signed-off-by: Siyuan Liu <[email protected]> * [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [Bugfix] Fix benchmark_moe.py 
(vllm-project#19016) Signed-off-by: Tianyu Guo <[email protected]> * Use xla flag to improve the quantized model performance (vllm-project#19303) Signed-off-by: Xiongfei Wei <[email protected]> * Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382) * [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Core] Use tuple for kv cache group block ids (vllm-project#19175) Signed-off-by: Nick Hill <[email protected]> * [Bugfix] Fix modelscope token passed in (vllm-project#19389) Signed-off-by: wangli <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Core] Batch multi modal input using pinned memory (vllm-project#19169) Signed-off-by: Lukas Geiger <[email protected]> * Add security warning to bug report template (vllm-project#19365) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Copilot <[email protected]> * [Misc] refactor neuron_multimodal and profiling (vllm-project#19397) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * Add clear documentation around the impact of debugging flag (vllm-project#19369) Signed-off-by: Anna Pendleton <[email protected]> * Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. 
(vllm-project#17930) Signed-off-by: Tsai, Louie <[email protected]> Co-authored-by: Li, Jiang <[email protected]> * Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404) * [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134) Signed-off-by: Yunqiu Guo <[email protected]> * [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411) Signed-off-by: jiang.li <[email protected]> * Simplify ep kernels installation (vllm-project#19412) Signed-off-by: youkaichao <[email protected]> * [Misc] Slight improvement of the BNB (vllm-project#19418) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docs] Note that alternative structured output backends are supported (vllm-project#19426) Signed-off-by: Russell Bryant <[email protected]> * [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440) Signed-off-by: Gregory Shtrasberg <[email protected]> * [Model] use AutoWeightsLoader for commandr (vllm-project#19399) Signed-off-by: py-andy-c <[email protected]> * Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401) Signed-off-by: 许文卿 <[email protected]> * [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390) Signed-off-by: rzou <[email protected]> * [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260) * [BugFix] Fix docker build cpu-dev image error (vllm-project#19394) Signed-off-by: niu_he <[email protected]> * Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451) Signed-off-by: Lu Fang <[email protected]> * [CI] Disable failing GGUF model test (vllm-project#19454) Signed-off-by: mgoin <[email protected]> * [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422) 
Signed-off-by: Lukas Geiger <[email protected]> * Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455) Signed-off-by: Junhao Li <[email protected]> * Fix Typo in Documentation and Function Name (vllm-project#19442) * [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405) Signed-off-by: Lu Fang <[email protected]> * [Kernel] Support deep_gemm for linear methods (vllm-project#19085) Signed-off-by: artetaout <[email protected]> * [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Fix quantization link titles (vllm-project#19478) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Support "important" and "announcement" admonitions (vllm-project#19479) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Reduce warning message introduced in env_override (vllm-project#19476) Signed-off-by: Lu Fang <[email protected]> * Support non-string values in JSON keys from CLI (vllm-project#19471) Signed-off-by: DarkLight1337 <[email protected]> * Add cache to cuda get_device_capability (vllm-project#19436) Signed-off-by: mgoin <[email protected]> * Fix some typo (vllm-project#19475) Signed-off-by: ximing.wxm <[email protected]> Co-authored-by: ximing.wxm <[email protected]> * Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241) Signed-off-by: Tsai, Louie <[email protected]> * [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453) Signed-off-by: Runzhen Wang <[email protected]> * [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297) Signed-off-by: mgoin <[email protected]> * [doc] fix "Other AI accelerators" getting started page (vllm-project#19457) Signed-off-by: David Xia <[email protected]> * [Misc] Fix misleading ROCm warning (vllm-project#19486) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] Remove WIP features in V1 guide (vllm-project#19498) Signed-off-by: Woosuk 
Kwon <[email protected]> * [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168) Signed-off-by: Bill Nell <[email protected]> * [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331) Signed-off-by: Randall Smith <[email protected]> * [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501) Signed-off-by: [email protected] <[email protected]> * [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505) Signed-off-by: Richard Zou <[email protected]> * [CI] change spell checker from codespell to typos (vllm-project#18711) Signed-off-by: Andy Xie <[email protected]> * [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518) Signed-off-by: Brayden Zhong <[email protected]> * [Frontend] Improve error message in tool_choice validation (vllm-project#19239) Signed-off-by: 22quinn <[email protected]> * [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449) Signed-off-by: Nick Hill <[email protected]> * [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522) Signed-off-by: strutive07 <[email protected]> * [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509) Signed-off-by: Randall Smith <[email protected]> * Fix typo (vllm-project#19525) Signed-off-by: 2niuhe <[email protected]> * [Security] Prevent new imports of (cloud)pickle (vllm-project#18018) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492) Signed-off-by: mgoin <[email protected]> * [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503) 
Signed-off-by: Jon Swenson <[email protected]> * [Quantization] Improve AWQ logic (vllm-project#19431) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add V1 column to supported models list (vllm-project#19523) Signed-off-by: DarkLight1337 <[email protected]> * [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532) Signed-off-by: NickLucche <[email protected]> * [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233) Signed-off-by: yewentao256 <[email protected]> * Fix TorchAOConfig skip layers (vllm-project#19265) Signed-off-by: mobicham <[email protected]> * [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756) Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Sage Moore <[email protected]> * [doc] Make top navigation sticky (vllm-project#19540) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847) * [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506) * [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452) Signed-off-by: mgoin <[email protected]> * [Doc] Unify structured outputs examples (vllm-project#18196) Signed-off-by: Aaron Pham <[email protected]> * [V1] Resolve failed concurrent structured output requests (vllm-project#19565) Signed-off-by: Russell Bryant <[email protected]> * Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378) * [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570) Signed-off-by: qizixi <[email protected]> * [Doc] uses absolute links for structured outputs (vllm-project#19582) 
Signed-off-by: Aaron Pham <[email protected]> * [doc] fix incorrect link (vllm-project#19586) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Misc] Correct broken docs link (vllm-project#19553) Signed-off-by: Zerohertz <[email protected]> * [CPU] Refine default config for the CPU backend (vllm-project#19539) Signed-off-by: jiang1.li <[email protected]> * [Fix] bump mistral common to support magistral (vllm-project#19533) Signed-off-by: 汪志鹏 <[email protected]> * [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549) Signed-off-by: 汪志鹏 <[email protected]> * use base version for version comparison (vllm-project#19587) Signed-off-by: Boyuan Feng <[email protected]> * [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064) Signed-off-by: youkaichao <[email protected]> * [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435) Signed-off-by: Nick Hill <[email protected]> * [Model] Fix minimax model cache & lm_head precision (vllm-project#19592) Signed-off-by: qingjun <[email protected]> * [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573) Signed-off-by: yewentao256 <[email protected]> * [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581) Signed-off-by: luka <[email protected]> * [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377) Signed-off-by: Anna Pendleton <[email protected]> * [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618) * Adding "AMD: Multi-step Tests" to amdproduction. 
(vllm-project#19508) Signed-off-by: Yida Wu <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <[email protected]> * [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624) Signed-off-by: Nick Hill <[email protected]> * [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354) Signed-off-by: Saheli Bhattacharjee <[email protected]> * [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633) * [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Isotr0py <[email protected]> * [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500) * Only build CUTLASS MoE kernels on Hopper (vllm-project#19648) * [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561) * [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262) * [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566) * [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644) * [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339) Signed-off-by: 22quinn <[email protected]> * [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627) Signed-off-by: yewentao256 <[email protected]> * Enable prefix caching with full cuda graphs (vllm-project#19617) Signed-off-by: Woosuk Kwon <[email protected]> * [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589) * [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649) Signed-off-by: Isotr0py 
<[email protected]> * [MISC] Remove unused variableds in C++ (vllm-project#19609) Signed-off-by: Lu Fang <[email protected]> * [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957) Signed-off-by: 刘全 <[email protected]> Co-authored-by: 刘全 <[email protected]> * [Misc][Frontend] passthrough `bad_words` (vllm-project#19564) Signed-off-by: Francesco Bertolotti <[email protected]> Co-authored-by: Francesco Bertolotti <[email protected]> Co-authored-by: Aaron Pham <[email protected]> * [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660) Signed-off-by: Ye (Charlotte) Qi <[email protected]> * [TPU] support attention head dim smaller than 128 (vllm-project#19620) Signed-off-by: Chengji Yao <[email protected]> Co-authored-by: mgoin <[email protected]> * [MISC] typo fix (vllm-project#19672) Signed-off-by: Andy Xie <[email protected]> * [CI] Add mteb testing for rerank models (vllm-project#19344) * [Docs] Move multiproc doc to v1 dir (vllm-project#19651) Signed-off-by: Russell Bryant <[email protected]> * [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754) Signed-off-by: SzymonOzog <[email protected]> * [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626) Signed-off-by: Nick Hill <[email protected]> * [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557) * [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. 
(vllm-project#19652) Signed-off-by: Shawn Tan <[email protected]> * [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657) Signed-off-by: Isotr0py <[email protected]> * [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547) Signed-off-by: Andy Xie <[email protected]> * [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662) Signed-off-by: chaunceyjiang <[email protected]> * [Kernels] Use empty for modular MoE workspaces (vllm-project#19667) Signed-off-by: Bill Nell <[email protected]> * [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677) Signed-off-by: QscQ <[email protected]> * [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446) Signed-off-by: Russell Bryant <[email protected]> * fix Signed-off-by: Amog Kamsetty <[email protected]> * remove logging Signed-off-by: Amog Kamsetty <[email protected]>
* [Bugfix] disable processor cache (vllm-project#19068) Signed-off-by: raushan <[email protected]> * [Doc] Improve the Pull Request template with key components (vllm-project#19086) Signed-off-by: Lu Fang <[email protected]> * [Misc] Add missing `_Backend` enums (vllm-project#19081) Signed-off-by: nicklucche <[email protected]> * [Misc] fix: add miss best_of param validation (vllm-project#18555) Signed-off-by: googs1025 <[email protected]> * [Misc] Add SPDX-FileCopyrightText (vllm-project#19100) Signed-off-by: simon-mo <[email protected]> * [Doc] Readme standardization (vllm-project#18695) Co-authored-by: Soren Dreano <[email protected]> * [doc] update docker version (vllm-project#19074) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]> * [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434) Signed-off-by: Varun <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [V1] Support cross-layer KV sharing (vllm-project#18212) Signed-off-by: Yong Hoon Shin <[email protected]> * [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844) Signed-off-by: mgoin <[email protected]> * Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093) Signed-off-by: Harry Mellor <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> * [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654) Signed-off-by: Chen Zhang <[email protected]> * [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971) * [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075) Signed-off-by: chaunceyjiang <[email protected]> * [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411) Signed-off-by: nicklucche <[email 
protected]> Co-authored-by: Robert Shaw <[email protected]> * [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029) Signed-off-by: Chen Zhang <[email protected]> * feat: add data parallel rank to KVEventBatch (vllm-project#18925) * [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919) * [Docs] Add developer doc about CI failures (vllm-project#18782) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [CPU] V1 support for the CPU backend (vllm-project#16441) * [Core] Cast multimodal input in hf processor (vllm-project#18862) Signed-off-by: Lukas Geiger <[email protected]> * [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437) * [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059) Signed-off-by: calvin chen <[email protected]> * [NVIDIA] Add Cutlass MLA backend (vllm-project#17625) * [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106) Signed-off-by: Woosuk Kwon <[email protected]> * Fix vllm-project#19130 (vllm-project#19132) Signed-off-by: 汪志鹏 <[email protected]> * [TPU] Skip hanging tests (vllm-project#19115) Signed-off-by: Siyuan Liu <[email protected]> * Fix ValueError: Missing value for tag key(s): model_name,engine. 
[Follow-up to #19147 due to DCO rebasing issues]
Purpose
The goal of this small PR is to fix loading torchao models in which not all layers have been quantized.
The current implementation doesn't keep track of the skipped layers defined in config["modules_to_not_convert"]. As a result, loading a quantized VL model whose vision head is not quantized results in a crash.
The PR also includes logic to skip layers defined in module_fqn_to_config. Currently, if a module is skipped in module_fqn_to_config, loading the model in vLLM crashes.
It also includes a quick fix that improves loading speed by avoiding the creation of an nn.Linear with the full tensor shape.
Test Plan
Dependencies
Code
Loading a VL model with unquantized vision modules
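To illustrate the kind of check that the modules_to_not_convert handling implies, here is a minimal pure-Python sketch. It is not the code from this PR: the helper name is_skipped, the module names, and the prefix-matching rule are all assumptions made for illustration.

```python
from typing import Iterable


def is_skipped(layer_prefix: str, modules_to_not_convert: Iterable[str]) -> bool:
    """Return True if the layer, or one of its parent modules, is listed as skipped.

    Hypothetical helper: the matching rule (exact name or dotted prefix) is an
    assumption, not necessarily what vLLM implements.
    """
    for skipped in modules_to_not_convert:
        # Match the module itself or any submodule nested beneath it.
        if layer_prefix == skipped or layer_prefix.startswith(skipped + "."):
            return True
    return False


# Hypothetical quantization config for a VL model with an unquantized vision tower.
config = {"modules_to_not_convert": ["visual", "lm_head"]}

layers = [
    "visual.blocks.0.attn.qkv",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]

# Decide, per layer, whether to use the unquantized path or the torchao path.
plan = {
    name: "unquantized" if is_skipped(name, config["modules_to_not_convert"]) else "torchao"
    for name in layers
}
```

Matching on a dotted prefix rather than a plain substring keeps a module such as visual_extra from being skipped just because its name begins with visual.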
Skip module example
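The following sketch shows one way a per-module lookup with a None entry could be resolved. It is a stand-in written with plain dictionaries and string placeholders, not the PR's code or torchao's API: resolve_config, the "_default" fallback key, and the module names are assumptions for illustration.

```python
from typing import Dict, Optional


def resolve_config(fqn: str, module_fqn_to_config: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the quantization config for a module, or None if it should be skipped.

    Hypothetical helper. An explicit entry wins, even when its value is None;
    otherwise the "_default" entry (an assumed key name) is used.
    """
    if fqn in module_fqn_to_config:
        # An explicit None means "leave this module unquantized".
        return module_fqn_to_config[fqn]
    return module_fqn_to_config.get("_default")


# Placeholder strings stand in for real config objects.
module_fqn_to_config = {
    "_default": "int4_weight_only",
    "model.layers.0.self_attn.q_proj": None,  # explicitly skipped
    "model.layers.0.mlp.down_proj": "int8_weight_only",
}

for fqn in ["model.layers.0.self_attn.q_proj",
            "model.layers.0.mlp.down_proj",
            "model.layers.1.self_attn.q_proj"]:
    cfg = resolve_config(fqn, module_fqn_to_config)
    print(fqn, "->", "skip (unquantized)" if cfg is None else cfg)
```

Checking membership with `in` before falling back matters here: a plain `.get(fqn, default)` would also work, but `.get(fqn) or default` would silently replace an explicit None with the default and quantize a module that was meant to be skipped.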
Test Result
The model should load successfully.
@jerryzh168