[Bugfix] Fix auto dtype casting for BatchFeature #19316


Merged
merged 9 commits into vllm-project:main from fix-dtype-cast on Jun 14, 2025

Conversation

@Isotr0py (Member) commented Jun 7, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Purpose

Test Plan

python examples/offline_inference/vision_language.py -m deepseek_vl_v2

Test Result

@Isotr0py Isotr0py requested a review from DarkLight1337 June 7, 2025 14:52

github-actions bot commented Jun 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment


Hello @Isotr0py, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here with a summary of this pull request. This PR addresses a bug related to automatic dtype casting when processing inputs using Hugging Face processors, specifically when the processor returns a BatchFeature object. Previously, the casting logic might not have correctly applied the desired dtype to the tensors contained within the BatchFeature. This change modifies the casting process to ensure that when a BatchFeature is returned, the dtype casting is applied directly to the internal data (.data) of the BatchFeature, fixing the issue where tensors inside the BatchFeature were not being cast correctly.

Highlights

  • Bugfix: BatchFeature dtype casting: Fixes an issue where the automatic dtype casting for inputs processed by Hugging Face processors did not correctly handle cases where the output was a BatchFeature, leading to tensors within the BatchFeature retaining their original dtype instead of being cast to the model's dtype.
  • Refined input processing logic: Modifies the maybe_cast_dtype function in vllm/inputs/registry.py to specifically target the .data attribute of BatchFeature objects for dtype casting, ensuring that the tensors inside are correctly handled.

Changelog

  • vllm/inputs/registry.py
    • Modified the maybe_cast_dtype function to check if the processor output is a BatchFeature (line 171).
    • If the output is a BatchFeature, the json_map_leaves casting function is now applied to output.data instead of output (line 172).
    • The original casting logic for non-BatchFeature outputs is now handled after the if block (line 175); a sketch of this logic follows below.
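
For reference, here is a minimal sketch of the casting logic described in the changelog (an approximation based on the description above, assuming vLLM's json_map_leaves helper; the function name and surrounding code are illustrative, not the exact diff):

from transformers import BatchFeature

from vllm.jsontree import json_map_leaves

def cast_hf_output(output, maybe_cast_dtype):
    # BatchFeature subclasses UserDict rather than dict, so a dict-based
    # tree traversal would treat it as a single leaf. Cast its underlying
    # .data dict instead so the tensors inside are reached.
    if isinstance(output, BatchFeature):
        output.data = json_map_leaves(maybe_cast_dtype, output.data)
        return output
    # Non-BatchFeature outputs keep the original casting path.
    return json_map_leaves(maybe_cast_dtype, output)
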
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request effectively addresses the bug regarding auto dtype casting for BatchFeature objects. The core issue, as I understand it, was that json_map_leaves did not correctly recurse into BatchFeature instances because isinstance(BatchFeature_instance, dict) evaluates to False (as BatchFeature inherits from UserDict, not from dict). This meant maybe_cast_dtype was applied to the BatchFeature object itself, resulting in no casting of its internal tensors.

The fix correctly targets this by explicitly passing output.data (which is a dict) to json_map_leaves when output is a BatchFeature. This ensures that the tensors within the BatchFeature are properly cast. For other types of output, the existing logic for json_map_leaves is maintained.

The change is clear, well-targeted, and improves the robustness of the dtype casting mechanism. Well done!
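
To make the root cause concrete, a quick illustration (a sketch, assuming only that transformers' BatchFeature subclasses collections.UserDict, as noted above):

from collections import UserDict

from transformers import BatchFeature

bf = BatchFeature({"input_ids": [[1, 2, 3]]})
print(isinstance(bf, UserDict))   # True: BatchFeature is a UserDict subclass
print(isinstance(bf, dict))       # False: a dict-based tree traversal treats
                                  # the whole BatchFeature as one leaf
print(isinstance(bf.data, dict))  # True: casting bf.data reaches the tensors
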

Merge Readiness

The changes in this pull request appear to be correct and effectively address the described bug. The code is clear and the fix is well-targeted. Based on my review, the PR seems ready for merging. However, as an AI, I am not authorized to approve pull requests; please ensure it undergoes further review and approval by the maintainers.

@DarkLight1337 (Member) left a comment


Thanks for fixing!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 7, 2025 15:42
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 7, 2025
@DarkLight1337 (Member)

Hmm, it looks like we are now running into a similar issue to #18862 (comment).

@Isotr0py (Member, Author) commented Jun 9, 2025

Hmm, the stuck test passes locally on my side with this PR:

INFO 06-09 17:15:14 [core.py:455] Waiting for init message from front-end.
INFO 06-09 17:15:14 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev252+gc1c7dbbee) with config: model='Qwen/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2-VL-2B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":false,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 06-09 17:15:14 [utils.py:2723] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f852ec33ec0>
[W609 17:15:25.239072132 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W609 17:15:35.245958761 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 06-09 17:15:35 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
Unused or unrecognized kwargs: return_tensors.
WARNING 06-09 17:15:47 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-09 17:15:47 [gpu_model_runner.py:1589] Starting to load model Qwen/Qwen2-VL-2B-Instruct...
INFO 06-09 17:15:47 [gpu_model_runner.py:1594] Loading model from scratch...
INFO 06-09 17:15:47 [cuda.py:256] Using FlexAttenion backend on V1 engine.
INFO 06-09 17:15:47 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:03<00:03,  3.86s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.37s/it]

INFO 06-09 17:15:52 [default_loader.py:272] Loading weights took 4.78 seconds
INFO 06-09 17:15:53 [gpu_model_runner.py:1618] Model loading took 4.1514 GiB and 5.204752 seconds
INFO 06-09 17:15:56 [gpu_model_runner.py:1943] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
INFO 06-09 17:18:26 [kv_cache_utils.py:715] GPU KV cache size: 224,192 tokens
INFO 06-09 17:18:26 [kv_cache_utils.py:719] Maximum concurrency for 32,768 tokens per request: 6.84x
INFO 06-09 17:18:26 [cuda.py:256] Using FlexAttenion backend on V1 engine.
INFO 06-09 17:18:26 [core.py:171] init engine (profile, create kv cache, warmup model) took 152.97 seconds
INFO 06-09 17:18:29 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 14012
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
PASSED

Let's see if merging from the main branch can fix this...

@mergify mergify bot added the tpu label (Related to Google TPUs) Jun 11, 2025
@Isotr0py (Member, Author) commented Jun 11, 2025

I reproduced the hanging issue on 34a5713 locally, and it seems the deadlock is caused exactly by the tensor dtype conversion:

DEBUG 06-12 01:29:07 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:29:17 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:29:27 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:29:37 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:29:47 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:29:57 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:07 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:17 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:27 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:37 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:47 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 06-12 01:30:57 [utils.py:485] Waiting for 1 local, 0 remote core engine proc(s) to start.
Fatal Python error: Aborted

Current thread 0x00007f452cc8f600 (most recent call first):
  File "/home/mozf/develop-projects/vllm/vllm/inputs/registry.py", line 165 in maybe_cast_dtype
  File "/home/mozf/develop-projects/vllm/vllm/jsontree.py", line 40 in json_map_leaves
  File "/home/mozf/develop-projects/vllm/vllm/jsontree.py", line 34 in json_map_leaves
  File "/home/mozf/develop-projects/vllm/vllm/inputs/registry.py", line 172 in call_hf_processor
  File "/home/mozf/develop-projects/vllm/vllm/model_executor/models/qwen2_vl.py", line 1019 in _call_hf_processor
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/processing.py", line 1291 in _apply_hf_processor_text_mm
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/processing.py", line 1361 in _apply_hf_processor_mm_only
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/processing.py", line 1400 in _apply_hf_processor_main
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/processing.py", line 1553 in _cached_apply_hf_processor
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/processing.py", line 1787 in apply
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/profiling.py", line 169 in _get_dummy_mm_inputs
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/profiling.py", line 256 in get_mm_max_tokens
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/registry.py", line 132 in get_max_tokens_per_item_by_modality
  File "/home/mozf/develop-projects/vllm/vllm/multimodal/registry.py", line 158 in get_max_tokens_per_item_by_nonzero_modality
  File "/home/mozf/develop-projects/vllm/vllm/v1/core/encoder_cache_manager.py", line 125 in _compute_encoder_budget_multimodal
  File "/home/mozf/develop-projects/vllm/vllm/v1/core/encoder_cache_manager.py", line 95 in compute_encoder_budget
  File "/home/mozf/develop-projects/vllm/vllm/v1/worker/gpu_model_runner.py", line 129 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/worker/gpu_worker.py", line 158 in init_device
  File "/home/mozf/develop-projects/vllm/vllm/worker/worker_base.py", line 606 in init_device
  File "/home/mozf/develop-projects/vllm/vllm/utils.py", line 2657 in run_method
  File "/home/mozf/develop-projects/vllm/vllm/executor/uniproc_executor.py", line 57 in collective_rpc
  File "/home/mozf/develop-projects/vllm/vllm/executor/uniproc_executor.py", line 47 in _init_executor
  File "/home/mozf/develop-projects/vllm/vllm/executor/executor_base.py", line 53 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core.py", line 76 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core.py", line 390 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core.py", line 506 in run_engine_core
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/popen_fork.py", line 71 in _launch
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/popen_fork.py", line 19 in __init__
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/context.py", line 282 in _Popen
  File "/home/mozf/miniconda3/lib/python3.12/multiprocessing/process.py", line 121 in start
  File "/home/mozf/develop-projects/vllm/vllm/v1/utils.py", line 265 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core_client.py", line 479 in _init_engines_direct
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core_client.py", line 422 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core_client.py", line 716 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/core_client.py", line 93 in make_async_mp_client
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/async_llm.py", line 124 in __init__
  File "/home/mozf/develop-projects/vllm/vllm/v1/engine/async_llm.py", line 189 in from_engine_args
  File "/home/mozf/develop-projects/vllm/tests/v1/engine/test_async_llm.py", line 110 in test_load
  File "/home/mozf/miniconda3/lib/python3.12/asyncio/events.py", line 88 in _run
  File "/home/mozf/miniconda3/lib/python3.12/asyncio/base_events.py", line 1999 in _run_once
  File "/home/mozf/miniconda3/lib/python3.12/asyncio/base_events.py", line 645 in run_forever
  File "/home/mozf/miniconda3/lib/python3.12/asyncio/base_events.py", line 678 in run_until_complete
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py", line 773 in inner
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/python.py", line 1627 in runtest
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py", line 508 in runtest
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 242 in <lambda>
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/main.py", line 337 in _main
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/home/mozf/develop-projects/vllm/.venv/bin/pytest", line 10 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, PIL._imaging, markupsafe._speedups, sklearn.__check_build._check_build, psutil._psutil_linux, psutil._psutil_posix, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, 
sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, regex._regex, zmq.backend.cython._zmq, PIL._imagingft, msgspec._core, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, PIL._imagingmath, vllm.cumem_allocator, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box (total: 162)
FAILED

def maybe_cast_dtype(x):
    # This mimics the behavior of transformers.BatchFeature
    if isinstance(x, torch.Tensor) and x.is_floating_point():
        return x.to(dtype=self.model_config.dtype)
    return x

@lgeiger @njhill Any ideas about this? 😢

@lgeiger (Contributor) commented Jun 11, 2025

I assume something interacts badly with the forking, but I'm not really sure how to fix it either.
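
For context, a minimal sketch of the fork-plus-OpenMP interaction mentioned above (a hypothetical repro for illustration, not code from this PR or its tests):

import multiprocessing as mp

import torch

def child():
    # A floating-point tensor op in the forked child can hang if the
    # parent's OpenMP thread pool state was inherited mid-flight.
    torch.randn(1024, 1024).to(dtype=torch.float16)

if __name__ == "__main__":
    # Warm up the parent's OpenMP threads with a parallelized tensor op,
    # then fork; the child inherits the thread pool's memory but not its
    # threads, which is the classic setup for this kind of deadlock.
    torch.randn(4096, 4096).to(dtype=torch.float16)
    proc = mp.get_context("fork").Process(target=child)
    proc.start()
    proc.join()
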

Signed-off-by: Isotr0py <[email protected]>
@Isotr0py Isotr0py disabled auto-merge June 12, 2025 03:37
Signed-off-by: Isotr0py <[email protected]>
@Isotr0py Isotr0py removed the tpu label (Related to Google TPUs) Jun 12, 2025
@mergify mergify bot added the v1 label Jun 12, 2025

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Comment on lines +111 to +112
with set_default_torch_num_threads(1):
    engine = AsyncLLM.from_engine_args(engine_args)
@Isotr0py (Member, Author)

It seems that disabling OpenMP by setting torch_num_threads=1 during engine forking fixes the deadlock issue locally. Let's see how the CI goes.
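
For reference, a minimal sketch of what such a thread-limiting context manager can look like (vLLM provides a set_default_torch_num_threads utility used in the test above; the stand-in below is an illustration under that assumption, not the project's exact code):

from contextlib import contextmanager

import torch

@contextmanager
def set_default_torch_num_threads(num_threads: int):
    # Temporarily cap intra-op (OpenMP) threads so that forking the engine
    # process does not happen while a large thread pool is active, which
    # can otherwise deadlock in the child process.
    old_num_threads = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        yield
    finally:
        torch.set_num_threads(old_num_threads)
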

@Isotr0py Isotr0py enabled auto-merge (squash) June 12, 2025 06:51
Signed-off-by: Isotr0py <[email protected]>
@Isotr0py Isotr0py merged commit 2db9044 into vllm-project:main Jun 14, 2025
68 of 69 checks passed
@Isotr0py Isotr0py deleted the fix-dtype-cast branch June 14, 2025 15:18
amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025
* [doc] clarify windows support (vllm-project#19088)

Signed-off-by: youkaichao <[email protected]>

* [CI/Build] Remove V0 LoRA test (vllm-project#19066)

Signed-off-by: Jee Jee Li <[email protected]>

* Fix underscores in dict keys passed via CLI (vllm-project#19030)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] disable processor cache  (vllm-project#19068)

Signed-off-by: raushan <[email protected]>

* [Doc] Improve the Pull Request template with key components (vllm-project#19086)

Signed-off-by: Lu Fang <[email protected]>

* [Misc] Add missing `_Backend` enums (vllm-project#19081)

Signed-off-by: nicklucche <[email protected]>

* [Misc] fix: add miss best_of param validation (vllm-project#18555)

Signed-off-by: googs1025 <[email protected]>

* [Misc] Add SPDX-FileCopyrightText  (vllm-project#19100)

Signed-off-by: simon-mo <[email protected]>

* [Doc] Readme standardization (vllm-project#18695)

Co-authored-by: Soren Dreano <[email protected]>

* [doc] update docker version (vllm-project#19074)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [V1] Support cross-layer KV sharing (vllm-project#18212)

Signed-off-by: Yong Hoon Shin <[email protected]>

* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844)

Signed-off-by: mgoin <[email protected]>

* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093)

Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>

* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654)

Signed-off-by: Chen Zhang <[email protected]>

* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)

* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411)

Signed-off-by: nicklucche <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>

* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029)

Signed-off-by: Chen Zhang <[email protected]>

* feat: add data parallel rank to KVEventBatch (vllm-project#18925)

* [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919)

* [Docs] Add developer doc about CI failures (vllm-project#18782)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [CPU] V1 support for the CPU backend (vllm-project#16441)

* [Core] Cast multimodal input in hf processor (vllm-project#18862)

Signed-off-by: Lukas Geiger <[email protected]>

* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)

* [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059)

Signed-off-by: calvin chen <[email protected]>

* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)

* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106)

Signed-off-by: Woosuk Kwon <[email protected]>

* Fix vllm-project#19130 (vllm-project#19132)

Signed-off-by: 汪志鹏 <[email protected]>

* [TPU] Skip hanging tests (vllm-project#19115)

Signed-off-by: Siyuan Liu <[email protected]>

* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113)

Signed-off-by: Seiji Eicher <[email protected]>

* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089)

Signed-off-by: Isotr0py <[email protected]>

* Improve the output precision of embedding models (vllm-project#19092)

* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678)

Signed-off-by: DarkLight1337 <[email protected]>

* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874)

Signed-off-by: 许文卿 <[email protected]>

* Sm100 blockwise fp8 swap ab (vllm-project#18564)

* [Doc] Update V1 Guide for embedding models (vllm-project#19141)

Signed-off-by: DarkLight1337 <[email protected]>

* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102)

Signed-off-by: Jon Swenson <[email protected]>

* [Bugfix][EP+DP] Fix internode check (vllm-project#19112)

Signed-off-by: Tyler Michael Smith <[email protected]>

* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778)

Signed-off-by: mgoin <[email protected]>

* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108)

Signed-off-by: Siyuan Liu <[email protected]>

* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)

* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817)

Signed-off-by: googs1025 <[email protected]>

* [P/D] Heterogeneous TP (vllm-project#18833)

Signed-off-by: nicklucche <[email protected]>

* [doc] small fix (vllm-project#19167)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632)

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)

* [Torch Nightly]add missing dependency (vllm-project#18770)

Signed-off-by: Yang Wang <[email protected]>

* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)

* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135)

Signed-off-by: chaunceyjiang <[email protected]>

* [Build] Annotate wheel and container path for release workflow (vllm-project#19162)

Signed-off-by: simon-mo <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138)

Signed-off-by: vllmellm <[email protected]>

* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <[email protected]>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <[email protected]>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <[email protected]>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <[email protected]>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <[email protected]>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <[email protected]>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <[email protected]>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <[email protected]>

* [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <[email protected]>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <[email protected]>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <[email protected]>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <[email protected]>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <[email protected]>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <[email protected]>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <[email protected]>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <[email protected]>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

Co-authored-by: Nick Hill <[email protected]>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <[email protected]>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <[email protected]>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <[email protected]>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <[email protected]>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <[email protected]>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <[email protected]>

* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <[email protected]>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <[email protected]>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <[email protected]>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <[email protected]>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <[email protected]>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes  (vllm-project#19242)

Signed-off-by: yZhen <[email protected]>
Co-authored-by: yZhen <[email protected]>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <[email protected]>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <[email protected]>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <[email protected]>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <[email protected]>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <[email protected]>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <[email protected]>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <[email protected]>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <[email protected]>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <[email protected]>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <[email protected]>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <[email protected]>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <[email protected]>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <[email protected]>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <[email protected]>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <[email protected]>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <[email protected]>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <[email protected]>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Copilot <[email protected]>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <[email protected]>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <[email protected]>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <[email protected]>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Slight improvement of the BNB  (vllm-project#19418)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <[email protected]>

* [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <[email protected]>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <[email protected]>

* [New Model]: Support Qwen3 Embedding & Reranker  (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <[email protected]>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <[email protected]>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <[email protected]>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <[email protected]>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <[email protected]>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <[email protected]>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <[email protected]>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <[email protected]>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <[email protected]>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <[email protected]>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <[email protected]>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <[email protected]>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <[email protected]>

* [Misc] Fix  misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <[email protected]>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <[email protected]>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: [email protected] <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <[email protected]>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <[email protected]>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <[email protected]>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <[email protected]>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <[email protected]>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <[email protected]>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <[email protected]>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <[email protected]>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <[email protected]>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <[email protected]>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1][NixlConnector] Drop `num_blocks` check  (vllm-project#19532)

Signed-off-by: NickLucche <[email protected]>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <[email protected]>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <[email protected]>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <[email protected]>
Co-authored-by: Sage Moore <[email protected]>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <[email protected]>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <[email protected]>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <[email protected]>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <[email protected]>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <[email protected]>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <[email protected]>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <[email protected]>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <[email protected]>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <[email protected]>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <[email protected]>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <[email protected]>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <[email protected]>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <[email protected]>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <[email protected]>

* [doc][mkdocs] fix the  duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <[email protected]>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <[email protected]>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <[email protected]>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <[email protected]>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <[email protected]>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

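For context on the vllm-project#19316 entry above: the fix is about making automatic dtype casting reach the tensors stored *inside* a HuggingFace `BatchFeature`, not just the wrapper object. Below is a minimal sketch of that idea, assuming `BatchFeature` keeps its tensors in a `.data` dict (as transformers' implementation does); `cast_batch_feature` and `_maybe_cast` are hypothetical names used for illustration, not vLLM's actual helpers.

```python
# Minimal sketch only -- NOT the actual vllm/inputs/registry.py code.
# Assumes transformers' BatchFeature, which stores its tensors in a `.data` dict.
import torch
from transformers import BatchFeature


def _maybe_cast(value, dtype: torch.dtype):
    # Cast only floating-point tensors; integer tensors such as input_ids
    # must keep their original dtype.
    if isinstance(value, torch.Tensor) and value.is_floating_point():
        return value.to(dtype)
    return value


def cast_batch_feature(features: BatchFeature, dtype: torch.dtype) -> BatchFeature:
    # Rewrite the underlying `.data` mapping so the tensors inside the
    # BatchFeature are converted, rather than casting the wrapper object.
    # (This flat version ignores nested lists of tensors for brevity.)
    features.data = {k: _maybe_cast(v, dtype) for k, v in features.data.items()}
    return features
```

Usage would be along the lines of `cast_batch_feature(processor(images=image, return_tensors="pt"), torch.float16)`.
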
* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <[email protected]>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <[email protected]>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] Remove unused variables in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <[email protected]>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <[email protected]>
Co-authored-by: 刘全 <[email protected]>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <[email protected]>
Co-authored-by: mgoin <[email protected]>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <[email protected]>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <[email protected]>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <[email protected]>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <[email protected]>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <[email protected]>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <[email protected]>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <[email protected]>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <[email protected]>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <[email protected]>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <[email protected]>

---------

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: raushan <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: Varun <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Lukas Geiger <[email protected]>
Signed-off-by: calvin chen <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: 汪志鹏 <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: 许文卿 <[email protected]>
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: 22quinn <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Chiyue Wei <[email protected]>
Signed-off-by: Povilas Kanapickas <[email protected]>
Signed-off-by: Luis Vega <[email protected]>
Signed-off-by: Jerry Zhang <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Xu Song <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Dipika Sikka <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Siqi Yan <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Qiliang Cui <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: drisspg <[email protected]>
Signed-off-by: Lifan Shen <[email protected]>
Signed-off-by: pramkuma <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Richard Zou <[email protected]>
Signed-off-by: Xu Wenqing <[email protected]>
Signed-off-by: Akash Kaothalkar <[email protected]>
Signed-off-by: yZhen <[email protected]>
Signed-off-by: KsuParkhamchuk <[email protected]>
Signed-off-by: cr7258 <[email protected]>
Signed-off-by: Conroy Cheers <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: liusiqian <[email protected]>
Signed-off-by: Pavani Majety <[email protected]>
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Signed-off-by: Tianyu Guo <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: Anna Pendleton <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: Yunqiu Guo <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: py-andy-c <[email protected]>
Signed-off-by: niu_he <[email protected]>
Signed-off-by: Junhao Li <[email protected]>
Signed-off-by: artetaout <[email protected]>
Signed-off-by: ximing.wxm <[email protected]>
Signed-off-by: Runzhen Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: strutive07 <[email protected]>
Signed-off-by: 2niuhe <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: mobicham <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: qizixi <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Saheli Bhattacharjee <[email protected]>
Signed-off-by: 刘全 <[email protected]>
Signed-off-by: Francesco Bertolotti <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Signed-off-by: Shawn Tan <[email protected]>
Signed-off-by: QscQ <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Raushan Turganbay <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: CYJiang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: SorenDreano <[email protected]>
Co-authored-by: Soren Dreano <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Yan Ru Pei <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Lukas Geiger <[email protected]>
Co-authored-by: Vadim Gimpelson <[email protected]>
Co-authored-by: Calvin Chen <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: 汪志鹏 <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Xu Wenqing <[email protected]>
Co-authored-by: Lain <[email protected]>
Co-authored-by: jmswen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: 22quinn <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Povilas Kanapickas <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Xu Song <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Jinghui Zhang <[email protected]>
Co-authored-by: jinghui <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Yu Guo <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Chenyaaang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Lifans <[email protected]>
Co-authored-by: pramenku <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Akash kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>
Co-authored-by: jennyyyyzhen <[email protected]>
Co-authored-by: yZhen <[email protected]>
Co-authored-by: Kseniya Parkhamchuk <[email protected]>
Co-authored-by: Se7en <[email protected]>
Co-authored-by: Conroy Cheers <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: liusiqian-tal <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
Co-authored-by: XiongfeiWei <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anna Pendleton <[email protected]>
Co-authored-by: Louie Tsai <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Rachel Guo <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: py-andy-c <[email protected]>
Co-authored-by: niu_he <[email protected]>
Co-authored-by: Junhao Li <[email protected]>
Co-authored-by: leopardracer <[email protected]>
Co-authored-by: artetaout <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: runzhen <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: wonjun Jang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: mobicham <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: qizixi <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Saheli Bhattacharjee <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: maobaolong <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: quanliu <[email protected]>
Co-authored-by: 刘全 <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Navanit Dubey <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: qscqesze <[email protected]>
amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025
* fix

Signed-off-by: Amog Kamsetty <[email protected]>

* remove logging

Signed-off-by: Amog Kamsetty <[email protected]>

amogkam added a commit to character-tech/vllm that referenced this pull request Jun 16, 2025
* [Bugfix] disable processor cache (vllm-project#19068)

Signed-off-by: raushan <[email protected]>

* [Doc] Improve the Pull Request template with key components (vllm-project#19086)

Signed-off-by: Lu Fang <[email protected]>

* [Misc] Add missing `_Backend` enums (vllm-project#19081)

Signed-off-by: nicklucche <[email protected]>

* [Misc] fix: add missing best_of param validation (vllm-project#18555)

Signed-off-by: googs1025 <[email protected]>

* [Misc] Add SPDX-FileCopyrightText (vllm-project#19100)

Signed-off-by: simon-mo <[email protected]>

* [Doc] Readme standardization (vllm-project#18695)

Co-authored-by: Soren Dreano <[email protected]>

* [doc] update docker version (vllm-project#19074)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [V1] Support cross-layer KV sharing (vllm-project#18212)

Signed-off-by: Yong Hoon Shin <[email protected]>

* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844)

Signed-off-by: mgoin <[email protected]>

* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093)

Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>

* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654)

Signed-off-by: Chen Zhang <[email protected]>

* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)

* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411)

Signed-off-by: nicklucche <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>

* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029)

Signed-off-by: Chen Zhang <[email protected]>

* feat: add data parallel rank to KVEventBatch (vllm-project#18925)

* [Misc] Fix path and python alias errors in disagg_prefill examples (vllm-project#18919)

* [Docs] Add developer doc about CI failures (vllm-project#18782)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [CPU] V1 support for the CPU backend (vllm-project#16441)

* [Core] Cast multimodal input in hf processor (vllm-project#18862)

Signed-off-by: Lukas Geiger <[email protected]>

* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)

* [Cleanup][v1]: remove guided-decoding-backend for example (vllm-project#19059)

Signed-off-by: calvin chen <[email protected]>

* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)

* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106)

Signed-off-by: Woosuk Kwon <[email protected]>

* Fix vllm-project#19130 (vllm-project#19132)

Signed-off-by: 汪志鹏 <[email protected]>

* [TPU] Skip hanging tests (vllm-project#19115)

Signed-off-by: Siyuan Liu <[email protected]>

* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113)

Signed-off-by: Seiji Eicher <[email protected]>

* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089)

Signed-off-by: Isotr0py <[email protected]>

* Improve the output precision of embedding models (vllm-project#19092)

* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678)

Signed-off-by: DarkLight1337 <[email protected]>

* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874)

Signed-off-by: 许文卿 <[email protected]>

* Sm100 blockwise fp8 swap ab (vllm-project#18564)

* [Doc] Update V1 Guide for embedding models (vllm-project#19141)

Signed-off-by: DarkLight1337 <[email protected]>

* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102)

Signed-off-by: Jon Swenson <[email protected]>

* [Bugfix][EP+DP] Fix internode check (vllm-project#19112)

Signed-off-by: Tyler Michael Smith <[email protected]>

* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778)

Signed-off-by: mgoin <[email protected]>

* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108)

Signed-off-by: Siyuan Liu <[email protected]>

* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)

* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817)

Signed-off-by: googs1025 <[email protected]>

* [P/D] Heterogeneous TP (vllm-project#18833)

Signed-off-by: nicklucche <[email protected]>

* [doc] small fix (vllm-project#19167)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632)

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)

* [Torch Nightly] add missing dependency (vllm-project#18770)

Signed-off-by: Yang Wang <[email protected]>

* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)

* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135)

Signed-off-by: chaunceyjiang <[email protected]>

* [Build] Annotate wheel and container path for release workflow (vllm-project#19162)

Signed-off-by: simon-mo <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138)

Signed-off-by: vllmellm <[email protected]>

* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <[email protected]>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <[email protected]>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <[email protected]>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <[email protected]>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <[email protected]>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <[email protected]>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <[email protected]>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <[email protected]>

* [v1][P/D] Fix an edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <[email protected]>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <[email protected]>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <[email protected]>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <[email protected]>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <[email protected]>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <[email protected]>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <[email protected]>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <[email protected]>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

Co-authored-by: Nick Hill <[email protected]>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <[email protected]>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <[email protected]>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <[email protected]>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <[email protected]>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <[email protected]>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <[email protected]>

* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <[email protected]>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <[email protected]>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <[email protected]>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <[email protected]>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <[email protected]>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242)

Signed-off-by: yZhen <[email protected]>
Co-authored-by: yZhen <[email protected]>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <[email protected]>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <[email protected]>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <[email protected]>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <[email protected]>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <[email protected]>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <[email protected]>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <[email protected]>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <[email protected]>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <[email protected]>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <[email protected]>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <[email protected]>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <[email protected]>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <[email protected]>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <[email protected]>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <[email protected]>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <[email protected]>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <[email protected]>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Copilot <[email protected]>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <[email protected]>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <[email protected]>

* [BugFix][CPU] Fix CPU CI by ignoring collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <[email protected]>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Slight improvement of the BNB (vllm-project#19418)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <[email protected]>

* [ROCm][V1] Adding ROCm to the list of platforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <[email protected]>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <[email protected]>

* [New Model]: Support Qwen3 Embedding & Reranker  (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <[email protected]>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <[email protected]>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <[email protected]>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <[email protected]>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <[email protected]>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <[email protected]>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <[email protected]>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <[email protected]>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <[email protected]>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <[email protected]>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <[email protected]>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <[email protected]>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <[email protected]>

* [Misc] Fix misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <[email protected]>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <[email protected]>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: [email protected] <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <[email protected]>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <[email protected]>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <[email protected]>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <[email protected]>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <[email protected]>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <[email protected]>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <[email protected]>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <[email protected]>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <[email protected]>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <[email protected]>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1][NixlConnector] Drop `num_blocks` check  (vllm-project#19532)

Signed-off-by: NickLucche <[email protected]>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <[email protected]>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <[email protected]>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <[email protected]>
Co-authored-by: Sage Moore <[email protected]>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <[email protected]>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <[email protected]>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <[email protected]>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix]: Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <[email protected]>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <[email protected]>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <[email protected]>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <[email protected]>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <[email protected]>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <[email protected]>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <[email protected]>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <[email protected]>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <[email protected]>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <[email protected]>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <[email protected]>

* [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <[email protected]>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <[email protected]>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <[email protected]>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <[email protected]>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <[email protected]>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <[email protected]>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <[email protected]>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] Remove unused variables in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <[email protected]>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <[email protected]>
Co-authored-by: 刘全 <[email protected]>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <[email protected]>
Co-authored-by: mgoin <[email protected]>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <[email protected]>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <[email protected]>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <[email protected]>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <[email protected]>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <[email protected]>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <[email protected]>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <[email protected]>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <[email protected]>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <[email protected]>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <[email protected]>

* fix

Signed-off-by: Amog Kamsetty <[email protected]>

---------

Signed-off-by: raushan <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: Varun <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Lukas Geiger <[email protected]>
Signed-off-by: calvin chen <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: 汪志鹏 <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: 许文卿 <[email protected]>
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: 22quinn <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Chiyue Wei <[email protected]>
Signed-off-by: Povilas Kanapickas <[email protected]>
Signed-off-by: Luis Vega <[email protected]>
Signed-off-by: Jerry Zhang <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Xu Song <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Dipika Sikka <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Siqi Yan <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Qiliang Cui <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: drisspg <[email protected]>
Signed-off-by: Lifan Shen <[email protected]>
Signed-off-by: pramkuma <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Richard Zou <[email protected]>
Signed-off-by: Xu Wenqing <[email protected]>
Signed-off-by: Akash Kaothalkar <[email protected]>
Signed-off-by: yZhen <[email protected]>
Signed-off-by: KsuParkhamchuk <[email protected]>
Signed-off-by: cr7258 <[email protected]>
Signed-off-by: Conroy Cheers <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: liusiqian <[email protected]>
Signed-off-by: Pavani Majety <[email protected]>
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Signed-off-by: Tianyu Guo <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: Anna Pendleton <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: Yunqiu Guo <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: py-andy-c <[email protected]>
Signed-off-by: niu_he <[email protected]>
Signed-off-by: Junhao Li <[email protected]>
Signed-off-by: artetaout <[email protected]>
Signed-off-by: ximing.wxm <[email protected]>
Signed-off-by: Runzhen Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: strutive07 <[email protected]>
Signed-off-by: 2niuhe <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: mobicham <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: qizixi <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Saheli Bhattacharjee <[email protected]>
Signed-off-by: 刘全 <[email protected]>
Signed-off-by: Francesco Bertolotti <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Signed-off-by: Shawn Tan <[email protected]>
Signed-off-by: QscQ <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Co-authored-by: Raushan Turganbay <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: CYJiang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: SorenDreano <[email protected]>
Co-authored-by: Soren Dreano <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Yan Ru Pei <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Lukas Geiger <[email protected]>
Co-authored-by: Vadim Gimpelson <[email protected]>
Co-authored-by: Calvin Chen <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: 汪志鹏 <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Xu Wenqing <[email protected]>
Co-authored-by: Lain <[email protected]>
Co-authored-by: jmswen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: 22quinn <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Povilas Kanapickas <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Xu Song <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Jinghui Zhang <[email protected]>
Co-authored-by: jinghui <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Yu Guo <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Chenyaaang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Lifans <[email protected]>
Co-authored-by: pramenku <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Akash kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>
Co-authored-by: jennyyyyzhen <[email protected]>
Co-authored-by: yZhen <[email protected]>
Co-authored-by: Kseniya Parkhamchuk <[email protected]>
Co-authored-by: Se7en <[email protected]>
Co-authored-by: Conroy Cheers <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: liusiqian-tal <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
Co-authored-by: XiongfeiWei <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anna Pendleton <[email protected]>
Co-authored-by: Louie Tsai <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Rachel Guo <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: py-andy-c <[email protected]>
Co-authored-by: niu_he <[email protected]>
Co-authored-by: Junhao Li <[email protected]>
Co-authored-by: leopardracer <[email protected]>
Co-authored-by: artetaout <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: runzhen <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: wonjun Jang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: mobicham <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: qizixi <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Saheli Bhattacharjee <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: maobaolong <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: quanliu <[email protected]>
Co-authored-by: 刘全 <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Navanit Dubey <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: qscqesze <[email protected]>
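
For readers skimming the squashed commit above, the entry this PR contributes is "[Bugfix] Fix auto dtype casting for BatchFeature". The sketch below is illustrative only, not the exact patch: it assumes a recursive `maybe_cast_dtype` helper like the one described for `vllm/inputs/registry.py`, and shows why a `BatchFeature` needs special handling -- the cast has to recurse into its `.data` dict so that float tensors such as `pixel_values` are actually converted, while integer tensors like `input_ids` keep their dtype.

```python
# Minimal sketch (assumption: the real vLLM helper may differ in detail).
import torch
from transformers.feature_extraction_utils import BatchFeature


def maybe_cast_dtype(value, dtype: torch.dtype):
    if isinstance(value, torch.Tensor):
        # Only floating-point tensors (e.g. pixel_values) are cast;
        # integer tensors (e.g. input_ids) must be left untouched.
        return value.to(dtype) if value.is_floating_point() else value
    if isinstance(value, BatchFeature):
        # The bug being fixed: treating a BatchFeature as an opaque value
        # leaves its wrapped tensors uncast. Rebuild it from its .data dict
        # so the tensors inside are converted too.
        return BatchFeature(
            {k: maybe_cast_dtype(v, dtype) for k, v in value.data.items()})
    if isinstance(value, dict):
        return {k: maybe_cast_dtype(v, dtype) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(maybe_cast_dtype(v, dtype) for v in value)
    return value
```

As a quick sanity check under these assumptions, `maybe_cast_dtype(BatchFeature({"pixel_values": torch.zeros(1, 3)}), torch.float16)["pixel_values"].dtype` should report `torch.float16`.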
yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Jun 24, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
tanujtiwari1998 pushed a commit to character-tech/vllm that referenced this pull request Jul 7, 2025
* [doc] clarify windows support (vllm-project#19088)

Signed-off-by: youkaichao <[email protected]>

* [CI/Build] Remove V0 LoRA test (vllm-project#19066)

Signed-off-by: Jee Jee Li <[email protected]>

* Fix underscores in dict keys passed via CLI (vllm-project#19030)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] disable processor cache  (vllm-project#19068)

Signed-off-by: raushan <[email protected]>

* [Doc] Improve the Pull Request template with key components (vllm-project#19086)

Signed-off-by: Lu Fang <[email protected]>

* [Misc] Add missing `_Backend` enums (vllm-project#19081)

Signed-off-by: nicklucche <[email protected]>

* [Misc] fix: add miss best_of param validation (vllm-project#18555)

Signed-off-by: googs1025 <[email protected]>

* [Misc] Add SPDX-FileCopyrightText  (vllm-project#19100)

Signed-off-by: simon-mo <[email protected]>

* [Doc] Readme standardization (vllm-project#18695)

Co-authored-by: Soren Dreano <[email protected]>

* [doc] update docker version (vllm-project#19074)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Kernel] DeepEP dispatch-combine kernel integration (vllm-project#18434)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [V1] Support cross-layer KV sharing (vllm-project#18212)

Signed-off-by: Yong Hoon Shin <[email protected]>

* [Perf] Tune `scaled_fp8_quant` by increasing vectorization (vllm-project#18844)

Signed-off-by: mgoin <[email protected]>

* Fix interaction between `Optional` and `Annotated` in CLI typing (vllm-project#19093)

Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>

* [v1] Re-init input batch for multiple kv cache groups (vllm-project#18654)

Signed-off-by: Chen Zhang <[email protected]>

* [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix (vllm-project#18971)

* [Bugfix] get_num_blocks_to_allocate with null_block (vllm-project#19031)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled (vllm-project#19075)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix][P/D] Fix Prefix Cache Bug (vllm-project#18411)

Signed-off-by: nicklucche <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>

* [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers (vllm-project#19029)

Signed-off-by: Chen Zhang <[email protected]>

* feat: add data parallel rank to KVEventBatch (vllm-project#18925)

* [Misc] Fix path and python alias errors in disagg_prefill exmaples (vllm-project#18919)

* [Docs] Add developer doc about CI failures (vllm-project#18782)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [CPU] V1 support for the CPU backend (vllm-project#16441)

* [Core] Cast multimodal input in hf processor (vllm-project#18862)

Signed-off-by: Lukas Geiger <[email protected]>

* [KERNEL] Sampler. CUDA kernel for applying repetition penalty (vllm-project#18437)

* [Cleanup][v1]:remote guided-decoding-backend for example (vllm-project#19059)

Signed-off-by: calvin chen <[email protected]>

* [NVIDIA] Add Cutlass MLA backend (vllm-project#17625)

* [Bugfix] Fix FA3 full cuda graph correctness (vllm-project#19106)

Signed-off-by: Woosuk Kwon <[email protected]>

* Fix vllm-project#19130 (vllm-project#19132)

Signed-off-by: 汪志鹏 <[email protected]>

* [TPU] Skip hanging tests (vllm-project#19115)

Signed-off-by: Siyuan Liu <[email protected]>

* Fix ValueError: Missing value for tag key(s): model_name,engine. (vllm-project#19113)

Signed-off-by: Seiji Eicher <[email protected]>

* [Misc] Add packages for benchmark as extra dependency (vllm-project#19089)

Signed-off-by: Isotr0py <[email protected]>

* Improve the output precision of embedding models (vllm-project#19092)

* [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 (vllm-project#18678)

Signed-off-by: DarkLight1337 <[email protected]>

* Add DeepSeek-R1-0528 function call chat template (vllm-project#18874)

Signed-off-by: 许文卿 <[email protected]>

* Sm100 blockwise fp8 swap ab (vllm-project#18564)

* [Doc] Update V1 Guide for embedding models (vllm-project#19141)

Signed-off-by: DarkLight1337 <[email protected]>

* Allow AsyncLLMEngine.generate to target a specific DP rank (vllm-project#19102)

Signed-off-by: Jon Swenson <[email protected]>

* [Bugfix][EP+DP] Fix internode check (vllm-project#19112)

Signed-off-by: Tyler Michael Smith <[email protected]>

* [Perf] Tunings for SM100 FP8 CUTLASS kernel (vllm-project#18778)

Signed-off-by: mgoin <[email protected]>

* [TPU] Update dynamo dump file name in compilation test (vllm-project#19108)

Signed-off-by: Siyuan Liu <[email protected]>

* [Bugfix] fix v1 cpu worker fails on macOS (vllm-project#19121)

* [Kernel] Integrate batched/masked deepgemm kernel (vllm-project#19111)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM (vllm-project#18817)

Signed-off-by: googs1025 <[email protected]>

* [P/D] Heterogeneous TP (vllm-project#18833)

Signed-off-by: nicklucche <[email protected]>

* [doc] small fix (vllm-project#19167)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix][Nixl] Fix full prefix cache hit bug (vllm-project#18632)

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [Bugfix] Fix port handling in make_zmq_path (vllm-project#19117)

* [Torch Nightly]add missing dependency (vllm-project#18770)

Signed-off-by: Yang Wang <[email protected]>

* Handle non-serializable objects when dumping benchmark results (vllm-project#19114)

* [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 (vllm-project#19171)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled (vllm-project#19135)

Signed-off-by: chaunceyjiang <[email protected]>

* [Build] Annotate wheel and container path for release workflow (vllm-project#19162)

Signed-off-by: simon-mo <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Misc] Remove unnecessary fallback to prefill-decode attention (vllm-project#19138)

Signed-off-by: vllmellm <[email protected]>

* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <[email protected]>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <[email protected]>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <[email protected]>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <[email protected]>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <[email protected]>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <[email protected]>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <[email protected]>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <[email protected]>

* [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <[email protected]>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <[email protected]>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <[email protected]>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <[email protected]>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <[email protected]>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <[email protected]>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <[email protected]>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <[email protected]>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

Co-authored-by: Nick Hill <[email protected]>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <[email protected]>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <[email protected]>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <[email protected]>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <[email protected]>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <[email protected]>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <[email protected]>

* [Nit][Benchmark] Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <[email protected]>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <[email protected]>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <[email protected]>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <[email protected]>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <[email protected]>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes (vllm-project#19242)

Signed-off-by: yZhen <[email protected]>
Co-authored-by: yZhen <[email protected]>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <[email protected]>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <[email protected]>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <[email protected]>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <[email protected]>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <[email protected]>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <[email protected]>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <[email protected]>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <[email protected]>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <[email protected]>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <[email protected]>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <[email protected]>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <[email protected]>

* [TPU] Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <[email protected]>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <[email protected]>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <[email protected]>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <[email protected]>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <[email protected]>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Copilot <[email protected]>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <[email protected]>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <[email protected]>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <[email protected]>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Slight improvement of the BNB (vllm-project#19418)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <[email protected]>

* [ROCm][V1] Adding ROCm to the list of platforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <[email protected]>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <[email protected]>

* [New Model]: Support Qwen3 Embedding & Reranker (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <[email protected]>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <[email protected]>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <[email protected]>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <[email protected]>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <[email protected]>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <[email protected]>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <[email protected]>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <[email protected]>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <[email protected]>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <[email protected]>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <[email protected]>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <[email protected]>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <[email protected]>

* [Misc] Fix misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <[email protected]>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <[email protected]>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: [email protected] <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <[email protected]>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <[email protected]>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <[email protected]>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <[email protected]>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <[email protected]>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <[email protected]>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <[email protected]>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <[email protected]>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <[email protected]>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <[email protected]>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1][NixlConnector] Drop `num_blocks` check (vllm-project#19532)

Signed-off-by: NickLucche <[email protected]>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <[email protected]>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <[email protected]>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <[email protected]>
Co-authored-by: Sage Moore <[email protected]>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <[email protected]>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <[email protected]>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <[email protected]>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <[email protected]>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <[email protected]>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <[email protected]>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <[email protected]>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <[email protected]>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <[email protected]>

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <[email protected]>

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <[email protected]>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <[email protected]>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <[email protected]>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <[email protected]>

* [doc][mkdocs] fix the  duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <[email protected]>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <[email protected]>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <[email protected]>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <[email protected]>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <[email protected]>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316) (a short illustrative sketch follows this commit list)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <[email protected]>

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <[email protected]>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] Remove unused variables in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <[email protected]>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <[email protected]>
Co-authored-by: 刘全 <[email protected]>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <[email protected]>
Co-authored-by: mgoin <[email protected]>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <[email protected]>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <[email protected]>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <[email protected]>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <[email protected]>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <[email protected]>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <[email protected]>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <[email protected]>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <[email protected]>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <[email protected]>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <[email protected]>

---------

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: raushan <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: Varun <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Lukas Geiger <[email protected]>
Signed-off-by: calvin chen <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: 汪志鹏 <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: 许文卿 <[email protected]>
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: 22quinn <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Chiyue Wei <[email protected]>
Signed-off-by: Povilas Kanapickas <[email protected]>
Signed-off-by: Luis Vega <[email protected]>
Signed-off-by: Jerry Zhang <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Xu Song <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Dipika Sikka <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Siqi Yan <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Qiliang Cui <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: drisspg <[email protected]>
Signed-off-by: Lifan Shen <[email protected]>
Signed-off-by: pramkuma <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Richard Zou <[email protected]>
Signed-off-by: Xu Wenqing <[email protected]>
Signed-off-by: Akash Kaothalkar <[email protected]>
Signed-off-by: yZhen <[email protected]>
Signed-off-by: KsuParkhamchuk <[email protected]>
Signed-off-by: cr7258 <[email protected]>
Signed-off-by: Conroy Cheers <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: liusiqian <[email protected]>
Signed-off-by: Pavani Majety <[email protected]>
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Signed-off-by: Tianyu Guo <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: Anna Pendleton <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: Yunqiu Guo <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: py-andy-c <[email protected]>
Signed-off-by: niu_he <[email protected]>
Signed-off-by: Junhao Li <[email protected]>
Signed-off-by: artetaout <[email protected]>
Signed-off-by: ximing.wxm <[email protected]>
Signed-off-by: Runzhen Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: strutive07 <[email protected]>
Signed-off-by: 2niuhe <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: mobicham <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: qizixi <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Saheli Bhattacharjee <[email protected]>
Signed-off-by: 刘全 <[email protected]>
Signed-off-by: Francesco Bertolotti <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Signed-off-by: Shawn Tan <[email protected]>
Signed-off-by: QscQ <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Raushan Turganbay <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: CYJiang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: SorenDreano <[email protected]>
Co-authored-by: Soren Dreano <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Yan Ru Pei <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Lukas Geiger <[email protected]>
Co-authored-by: Vadim Gimpelson <[email protected]>
Co-authored-by: Calvin Chen <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: 汪志鹏 <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Xu Wenqing <[email protected]>
Co-authored-by: Lain <[email protected]>
Co-authored-by: jmswen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: 22quinn <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Povilas Kanapickas <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Xu Song <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Jinghui Zhang <[email protected]>
Co-authored-by: jinghui <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Yu Guo <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Chenyaaang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Lifans <[email protected]>
Co-authored-by: pramenku <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Akash kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>
Co-authored-by: jennyyyyzhen <[email protected]>
Co-authored-by: yZhen <[email protected]>
Co-authored-by: Kseniya Parkhamchuk <[email protected]>
Co-authored-by: Se7en <[email protected]>
Co-authored-by: Conroy Cheers <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: liusiqian-tal <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
Co-authored-by: XiongfeiWei <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anna Pendleton <[email protected]>
Co-authored-by: Louie Tsai <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Rachel Guo <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: py-andy-c <[email protected]>
Co-authored-by: niu_he <[email protected]>
Co-authored-by: Junhao Li <[email protected]>
Co-authored-by: leopardracer <[email protected]>
Co-authored-by: artetaout <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: runzhen <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: wonjun Jang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: mobicham <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: qizixi <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Saheli Bhattacharjee <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: maobaolong <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: quanliu <[email protected]>
Co-authored-by: 刘全 <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Navanit Dubey <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: qscqesze <[email protected]>
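
For readers skimming this list, the entry most relevant to this pull request is [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316). Below is a minimal sketch of the idea, assuming a Hugging Face `BatchFeature` whose tensors live in its `.data` dict: only floating-point tensors are cast to the model dtype, while integer tensors such as `input_ids` keep their dtype. The helper name `cast_batch_feature` is illustrative only, not vLLM's actual code.

```python
# Hypothetical sketch, not vLLM's implementation: dtype casting applied to
# the tensors stored inside a BatchFeature rather than the wrapper object.
import torch
from transformers import BatchFeature


def cast_batch_feature(features: BatchFeature, dtype: torch.dtype) -> BatchFeature:
    def _cast(x):
        # Cast only floating-point tensors; integer tensors such as
        # input_ids must keep their original dtype.
        if isinstance(x, torch.Tensor) and x.is_floating_point():
            return x.to(dtype)
        return x

    # BatchFeature (a UserDict subclass) keeps its tensors in the `.data`
    # dict, so the cast walks that dict entry by entry.
    features.data = {key: _cast(value) for key, value in features.data.items()}
    return features


feats = BatchFeature(data={
    "pixel_values": torch.rand(1, 3, 224, 224),    # float32 -> cast
    "input_ids": torch.tensor([[101, 102, 103]]),  # int64 -> untouched
})
feats = cast_batch_feature(feats, torch.bfloat16)
print(feats["pixel_values"].dtype, feats["input_ids"].dtype)
# torch.bfloat16 torch.int64
```

Integer tensors are skipped because token indices must stay integral for embedding lookups; only the floating-point features need to match the model dtype.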
tanujtiwari1998 pushed a commit to character-tech/vllm that referenced this pull request Jul 7, 2025
* [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly (vllm-project#19105)

Signed-off-by: 22quinn <[email protected]>

* [Frontend] improve vllm run-batch --help display (vllm-project#19187)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided (vllm-project#19202)

Signed-off-by: Guillaume Calmettes <[email protected]>

* [mistral_common] Add v11 tokenizer (vllm-project#19193)

Signed-off-by: Patrick von Platen <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (vllm-project#19205)

* [Hardware][NVIDIA] FP4 MoE kernel optimization (vllm-project#19110)

Signed-off-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>

* [MISC][Bugfix] Use less CPU when message queue has been empty for some time (vllm-project#16226)

Signed-off-by: Povilas Kanapickas <[email protected]>

* [P/D][NixlConnector] Enable FlashInfer backend (vllm-project#19090)

* [Quantization] Skip Fp4 Test for `compressed-tensors` (vllm-project#19217)

* [V1] Use FlashInfer by default on Blackwell GPUs (vllm-project#19118)

* [Model] NemotronH support (vllm-project#18863)

Signed-off-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>

* Fix AOPerModuleConfig name changes (vllm-project#18869)

Signed-off-by: Jerry Zhang <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (vllm-project#19033)

Signed-off-by: Benjamin Chislett <[email protected]>

* [v1] Hybrid Memory Allocator (vllm-project#17996)

Signed-off-by: Chen Zhang <[email protected]>

* [TPU] update torch_xla pin (vllm-project#19231)

Signed-off-by: Chengji Yao <[email protected]>

* Support allowed_token_ids in ChatCompletionRequest (vllm-project#19143)

Signed-off-by: Xu Song <[email protected]>

* [Chore] update CODEOWNERS (vllm-project#19247)

Signed-off-by: Aaron Pham <[email protected]>

* [v1][P/D] Fix a edge case in kv cache schedule (vllm-project#19182)

Co-authored-by: jinghui <[email protected]>

* [TPU] fix kv cache dtype in model runner (vllm-project#19244)

Signed-off-by: Chengji Yao <[email protected]>

* [Quantization] Bump compressed-tensors version; update NVFP4A16 test model (vllm-project#19224)

Signed-off-by: Dipika Sikka <[email protected]>

* [Docs] Improve V1 KVConnector interface documentation (vllm-project#19172)

Signed-off-by: Nick Hill <[email protected]>

* Fix CompilationConfig repr (vllm-project#19091)

Signed-off-by: rzou <[email protected]>

* Unit Test for run_dp_sharded_vision_model (vllm-project#19103)

Signed-off-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>

* [Model] Optimize nemotron_h implementation (vllm-project#19249)

Signed-off-by: Jee Jee Li <[email protected]>

* [Core] Raise when non-multi-instance DP clients target a DP rank (vllm-project#19227)

Signed-off-by: Jon Swenson <[email protected]>

* improve logits bias (vllm-project#19041)

* Fixed ppc build when it runs on non-RHEL based linux distros (vllm-project#18422)

Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>

* [BugFix] Fix MultiConnector test after HMA changes (vllm-project#19291)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits (vllm-project#19225)

Co-authored-by: Adolfo Victoria <[email protected]>

* [Core] Fix abrupt request abort (vllm-project#18485)

Signed-off-by: nicklucche <[email protected]>
Signed-off-by: Nick Hill <[email protected]>

Co-authored-by: Nick Hill <[email protected]>

* [BugFix] Fix tpu_model_runner block_id concatenation (vllm-project#19228)

Signed-off-by: Nick Hill <[email protected]>

* [Misc][Tools][Benchmark] Fix and improve auto tune script (vllm-project#19163)

Signed-off-by: Chenyaaang <[email protected]>

* [Build][ROCm] Update Dockerfile.rocm (vllm-project#19296)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [Easy][Test] Simplify test_function_tool_use with multiple parametrizes (vllm-project#19269)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Integrate CUTLASS MoE kernel with PPLX (vllm-project#18762)

Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [TPU][Test] Add script to run benchmark on TPU for buildkite (vllm-project#19039)

Signed-off-by: Qiliang Cui <[email protected]>

* [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py (vllm-project#19253)

Signed-off-by: Aaruni Aggarwal <[email protected]>

* Add FlexAttention to V1 (vllm-project#16078)

Signed-off-by: drisspg <[email protected]>

* [Misc] refactor context extension (vllm-project#19246)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CI/Build] Improve Llama GGUF test robustness (vllm-project#19287)

Signed-off-by: Isotr0py <[email protected]>

* [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py (vllm-project#19311)

Signed-off-by: Lifan Shen <[email protected]>

* [AMD] Update compatible packaging version (vllm-project#19309)

Signed-off-by: pramkuma <[email protected]>

* [BugFix][V1] Fix memory profiling bug (vllm-project#18974)

Signed-off-by: luka <[email protected]>

* [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer (vllm-project#19283)

Signed-off-by: chaunceyjiang <[email protected]>

* [Bugfix] Re-enable use_cudagraph in vLLM v1 (vllm-project#19299)

Signed-off-by: Richard Zou <[email protected]>

* [Misc] Change tests/compile to use VLLM_V1 by default (vllm-project#19302)

Signed-off-by: rzou <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B (vllm-project#19315)

Signed-off-by: Xu Wenqing <[email protected]>

* [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection (vllm-project#19082)

Signed-off-by: Akash Kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>

* [Quantization] Add compressed-tensors NVFP4 support (vllm-project#18312)

* [Multi Modal] Add an env var for message queue max chunk bytes  (vllm-project#19242)

Signed-off-by: yZhen <[email protected]>
Co-authored-by: yZhen <[email protected]>

* [Bugfix] model_max_length should consider max_model_len in tokenizer_config (vllm-project#19201)

* [Deprecation] Remove `inputs` arg fallback in Engine classes (vllm-project#18799)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Add documentation update reminder to PR template (vllm-project#19289)

Signed-off-by: Isotr0py <[email protected]>

* [Frontend] Remove unreachable code from llm.py (vllm-project#19288)

Signed-off-by: KsuParkhamchuk <[email protected]>

* [Misc] Cleanup compilation tests (vllm-project#19343)

Signed-off-by: rzou <[email protected]>

* [doc] improve ci doc (vllm-project#19307)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Doc] Fix description in the Automatic Prefix Caching design doc (vllm-project#19333)

Signed-off-by: cr7258 <[email protected]>

* [CI/Build] Fix LoRA test (vllm-project#19350)

Signed-off-by: Jee Jee Li <[email protected]>

* [Fix] Allow kernel compilation for CUDA capability 8.7 (vllm-project#19328)

Signed-off-by: Conroy Cheers <[email protected]>

* [CI] Introduce rules for llama auto-label (vllm-project#19323)

Signed-off-by: Lu Fang <[email protected]>

* [Docs] Fix a bullet list in usage/security.md (vllm-project#19358)

Signed-off-by: windsonsea <[email protected]>

* [full_graph] Fix query_start_loc padding (vllm-project#19321)

Signed-off-by: Yinghai Lu <[email protected]>

* [v1] Add fp32 support to v1 engine through flex attn (vllm-project#19319)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>

* [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. (vllm-project#19298)

Signed-off-by: Varun <[email protected]>
Co-authored-by: Varun <[email protected]>

* [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 (vllm-project#19348)

Signed-off-by: 22quinn <[email protected]>

* [Quantization] Bump compressed-tensors version (vllm-project#19295)

Signed-off-by: Kyle Sayers <[email protected]>

* [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var (vllm-project#18472)

Signed-off-by: liusiqian <[email protected]>

* [TPU]Fix KV cache sharing tests (vllm-project#19371)

* [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend (vllm-project#19374)

Signed-off-by: Pavani Majety <[email protected]>

* [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration (vllm-project#19383)

Signed-off-by: Siyuan Liu <[email protected]>

* [V1] Reuse V0's memory_profiling util for gpu worker memory profiling (vllm-project#19312)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [Bugfix] Fix benchmark_moe.py (vllm-project#19016)

Signed-off-by: Tianyu Guo <[email protected]>

* Use xla flag to improve the quantized model performance (vllm-project#19303)

Signed-off-by: Xiongfei Wei <[email protected]>

* Fix docs/mkdocs/hooks/remove_announcement.py (vllm-project#19382)

* [Frontend] Add tqdm_leave_pbar to control progress bar visibility (vllm-project#19357)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Core] Use tuple for kv cache group block ids (vllm-project#19175)

Signed-off-by: Nick Hill <[email protected]>

* [Bugfix] Fix modelscope token passed in (vllm-project#19389)

Signed-off-by: wangli <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Core] Batch multi modal input using pinned memory (vllm-project#19169)

Signed-off-by: Lukas Geiger <[email protected]>

* Add security warning to bug report template (vllm-project#19365)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Copilot <[email protected]>

* [Misc] refactor neuron_multimodal and profiling (vllm-project#19397)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* Add clear documentation around the impact of debugging flag (vllm-project#19369)

Signed-off-by: Anna Pendleton <[email protected]>

* Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. (vllm-project#17930)

Signed-off-by: Tsai, Louie <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>

* Revert "[v1] Add fp32 support to v1 engine through flex attn" (vllm-project#19404)

* [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` (vllm-project#19134)

Signed-off-by: Yunqiu Guo <[email protected]>

* [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral (vllm-project#19411)

Signed-off-by: jiang.li <[email protected]>

* Simplify ep kernels installation (vllm-project#19412)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Slight improvement of the BNB  (vllm-project#19418)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Docs] Note that alternative structured output backends are supported (vllm-project#19426)

Signed-off-by: Russell Bryant <[email protected]>

* [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default (vllm-project#19440)

Signed-off-by: Gregory Shtrasberg <[email protected]>

* [Model] use AutoWeightsLoader for commandr (vllm-project#19399)

Signed-off-by: py-andy-c <[email protected]>

* Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 (vllm-project#19401)

Signed-off-by: 许文卿 <[email protected]>

* [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 (vllm-project#19390)

Signed-off-by: rzou <[email protected]>

* [New Model]: Support Qwen3 Embedding & Reranker  (vllm-project#19260)

* [BugFix] Fix docker build cpu-dev image error (vllm-project#19394)

Signed-off-by: niu_he <[email protected]>

* Fix test_max_model_len in tests/entrypoints/llm/test_generate.py (vllm-project#19451)

Signed-off-by: Lu Fang <[email protected]>

* [CI] Disable failing GGUF model test (vllm-project#19454)

Signed-off-by: mgoin <[email protected]>

* [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` (vllm-project#19422)

Signed-off-by: Lukas Geiger <[email protected]>

* Add fused MOE config for Qwen3 30B A3B on B200 (vllm-project#19455)

Signed-off-by: Junhao Li <[email protected]>

* Fix Typo in Documentation and Function Name (vllm-project#19442)

* [ROCm] Add rules to automatically label ROCm related PRs (vllm-project#19405)

Signed-off-by: Lu Fang <[email protected]>

* [Kernel] Support deep_gemm for linear methods (vllm-project#19085)

Signed-off-by: artetaout <[email protected]>

* [Doc] Update V1 User Guide for Hardware and Models (vllm-project#19474)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Fix quantization link titles (vllm-project#19478)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Support "important" and "announcement" admonitions (vllm-project#19479)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Reduce warning message introduced in env_override (vllm-project#19476)

Signed-off-by: Lu Fang <[email protected]>

* Support non-string values in JSON keys from CLI (vllm-project#19471)

Signed-off-by: DarkLight1337 <[email protected]>

* Add cache to cuda get_device_capability (vllm-project#19436)

Signed-off-by: mgoin <[email protected]>

* Fix some typo (vllm-project#19475)

Signed-off-by: ximing.wxm <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>

* Support no privileged mode on CPU for docker and kubernetes deployments (vllm-project#19241)

Signed-off-by: Tsai, Louie <[email protected]>

* [Bugfix] Update the example code, make it work with the latest lmcache (vllm-project#19453)

Signed-off-by: Runzhen Wang <[email protected]>

* [CI] Update FlashInfer to 0.2.6.post1 (vllm-project#19297)

Signed-off-by: mgoin <[email protected]>

* [doc] fix "Other AI accelerators" getting started page (vllm-project#19457)

Signed-off-by: David Xia <[email protected]>

* [Misc] Fix  misleading ROCm warning (vllm-project#19486)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] Remove WIP features in V1 guide (vllm-project#19498)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Kernels] Add activation chunking logic to FusedMoEModularKernel (vllm-project#19168)

Signed-off-by: Bill Nell <[email protected]>

* [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger (vllm-project#17331)

Signed-off-by: Randall Smith <[email protected]>

* [UX] Add Feedback During CUDAGraph Capture (vllm-project#19501)

Signed-off-by: [email protected] <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies (vllm-project#19505)

Signed-off-by: Richard Zou <[email protected]>

* [CI] change spell checker from codespell to typos (vllm-project#18711)

Signed-off-by: Andy Xie <[email protected]>

* [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import (vllm-project#19514)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* Add Triton Fused MoE kernel config for E=16 on B200 (vllm-project#19518)

Signed-off-by: Brayden Zhong <[email protected]>

* [Frontend] Improve error message in tool_choice validation (vllm-project#19239)

Signed-off-by: 22quinn <[email protected]>

* [BugFix] Work-around incremental detokenization edge case error (vllm-project#19449)

Signed-off-by: Nick Hill <[email protected]>

* [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API (vllm-project#19522)

Signed-off-by: strutive07 <[email protected]>

* [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm (vllm-project#19509)

Signed-off-by: Randall Smith <[email protected]>

* Fix typo (vllm-project#19525)

Signed-off-by: 2niuhe <[email protected]>

* [Security] Prevent new imports of (cloud)pickle (vllm-project#18018)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Bugfix][V1] Allow manual FlashAttention for Blackwell (vllm-project#19492)

Signed-off-by: mgoin <[email protected]>

* [Bugfix] Respect num-gpu-blocks-override in v1 (vllm-project#19503)

Signed-off-by: Jon Swenson <[email protected]>

* [Quantization] Improve AWQ logic (vllm-project#19431)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add V1 column to supported models list (vllm-project#19523)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1][NixlConnector] Drop `num_blocks` check  (vllm-project#19532)

Signed-off-by: NickLucche <[email protected]>

* [Perf] Vectorize static / dynamic INT8 quant kernels (vllm-project#19233)

Signed-off-by: yewentao256 <[email protected]>

* Fix TorchAOConfig skip layers (vllm-project#19265)

Signed-off-by: mobicham <[email protected]>

* [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (vllm-project#16756)

Signed-off-by: Luka Govedič <[email protected]>
Co-authored-by: Sage Moore <[email protected]>

* [doc] Make top navigation sticky (vllm-project#19540)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (vllm-project#18847)

* [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (vllm-project#19506)

* [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (vllm-project#19452)

Signed-off-by: mgoin <[email protected]>

* [Doc] Unify structured outputs examples (vllm-project#18196)

Signed-off-by: Aaron Pham <[email protected]>

* [V1] Resolve failed concurrent structured output requests (vllm-project#19565)

Signed-off-by: Russell Bryant <[email protected]>

* Revert "[Build/CI] Add tracing deps to vllm container image (vllm-project#15224)" (vllm-project#19378)

* [BugFix] : Fix Batched DeepGemm Experts (vllm-project#19515)

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix] Fix EAGLE vocab embedding for multimodal target model (vllm-project#19570)

Signed-off-by: qizixi <[email protected]>

* [Doc] uses absolute links for structured outputs (vllm-project#19582)

Signed-off-by: Aaron Pham <[email protected]>

* [doc] fix incorrect link (vllm-project#19586)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Misc] Correct broken docs link (vllm-project#19553)

Signed-off-by: Zerohertz <[email protected]>

* [CPU] Refine default config for the CPU backend (vllm-project#19539)

Signed-off-by: jiang1.li <[email protected]>

* [Fix] bump mistral common to support magistral (vllm-project#19533)

Signed-off-by: 汪志鹏 <[email protected]>

* [Fix] The zip function in Python 3.9 does not have the strict argument (vllm-project#19549)

Signed-off-by: 汪志鹏 <[email protected]>
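
For context, `zip(strict=True)` only exists on Python 3.10+ (PEP 618), so code that must also run on 3.9 needs an explicit length check; a small illustration:

```python
import sys

a, b = [1, 2, 3], ["x", "y", "z"]

if sys.version_info >= (3, 10):
    pairs = list(zip(a, b, strict=True))  # raises ValueError on length mismatch
else:
    # Python 3.9 fallback: zip() has no `strict` argument, so check manually.
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    pairs = list(zip(a, b))
```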

* use base version for version comparison (vllm-project#19587)

Signed-off-by: Boyuan Feng <[email protected]>
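
A short illustration of why comparing against the base version matters: dev/nightly and local-version suffixes otherwise make an installed build compare as older than the release it is built from. A sketch using `packaging.version` (the version strings are made up):

```python
from packaging.version import Version

installed = Version("2.8.0.dev20250610+cu128")  # hypothetical nightly build
required = Version("2.8.0")

print(installed >= required)                        # False: .dev sorts before final
print(Version(installed.base_version) >= required)  # True: compares plain 2.8.0
```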

* [torch.compile] reorganize the cache directory to support compiling multiple models (vllm-project#19064)

Signed-off-by: youkaichao <[email protected]>

* [BugFix] Honor `enable_caching` in connector-delayed kvcache load case (vllm-project#19435)

Signed-off-by: Nick Hill <[email protected]>

* [Model] Fix minimax model cache & lm_head precision (vllm-project#19592)

Signed-off-by: qingjun <[email protected]>

* [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` (vllm-project#19573)

Signed-off-by: yewentao256 <[email protected]>

* [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (vllm-project#19606)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [CUDA] Enable full cudagraph for FlashMLA (vllm-project#18581)

Signed-off-by: luka <[email protected]>

* [Doc] Add troubleshooting section to k8s deployment (vllm-project#19377)

Signed-off-by: Anna Pendleton <[email protected]>

* [torch.compile] Use custom ops when use_inductor=False (vllm-project#19618)

* Adding "AMD: Multi-step Tests" to amdproduction. (vllm-project#19508)

Signed-off-by: Yida Wu <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <[email protected]>

* [BugFix] Fix DP Coordinator incorrect debug log message (vllm-project#19624)

Signed-off-by: Nick Hill <[email protected]>

* [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (vllm-project#18354)

Signed-off-by: Saheli Bhattacharjee <[email protected]>

* [Bugfix] Fix the speculative decoding test by setting the target dtype (vllm-project#19633)

* [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (vllm-project#19593)

Signed-off-by: reidliu41 <[email protected]>
Co-authored-by: reidliu41 <[email protected]>

* [Bugfix] Fix auto dtype casting for BatchFeature (vllm-project#19316)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
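
A minimal sketch of the idea behind this fix (not the exact vLLM code): cast only the floating-point tensors stored in a `transformers.BatchFeature`'s `.data` dict to the model dtype, leaving integer fields such as `input_ids` untouched:

```python
import torch
from transformers import BatchFeature


def cast_batch_feature(features: BatchFeature, dtype: torch.dtype) -> BatchFeature:
    # Operate on .data directly so the cast survives BatchFeature's
    # dict-like wrapping; skip non-floating tensors (e.g. token ids).
    for key, value in features.data.items():
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            features.data[key] = value.to(dtype)
    return features


bf = BatchFeature({"pixel_values": torch.rand(1, 3, 8, 8),
                   "input_ids": torch.tensor([[1, 2, 3]])})
bf = cast_batch_feature(bf, torch.bfloat16)
assert bf["pixel_values"].dtype == torch.bfloat16
assert bf["input_ids"].dtype == torch.int64
```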

* [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (vllm-project#19500)

* Only build CUTLASS MoE kernels on Hopper (vllm-project#19648)

* [Bugfix] Don't attempt to use triton if no driver is active (vllm-project#19561)

* [Fix] Convert kv_transfer_config from dict to KVTransferConfig (vllm-project#19262)

* [Perf] Further tunings for SM100 FP8 CUTLASS kernel (vllm-project#19566)

* [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (vllm-project#19644)

* [Kernel] Raise verbose error and consolidate `num_heads/num_kv_heads` divisibility check (vllm-project#19339)

Signed-off-by: 22quinn <[email protected]>
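
The invariant being checked, for reference: in grouped-query attention each KV head serves a whole number of query heads, so `num_heads` must be divisible by `num_kv_heads`. A hedged sketch of such a check (names illustrative):

```python
def check_num_heads(num_heads: int, num_kv_heads: int) -> None:
    # Each KV head is shared by num_heads // num_kv_heads query heads,
    # which only works out if the division is exact.
    if num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by "
            f"num_kv_heads ({num_kv_heads})")
```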

* [Benchmark] Refactor benchmark script for fp8 & int8 (vllm-project#19627)

Signed-off-by: yewentao256 <[email protected]>

* Enable prefix caching with full cuda graphs (vllm-project#19617)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI/Build] Fix torch nightly CI dependencies part 2 (vllm-project#19589)

* [Misc] Remove duplicate multiproc method setting for CPU platform (vllm-project#19649)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] Remove unused variables in C++ (vllm-project#19609)

Signed-off-by: Lu Fang <[email protected]>

* [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (vllm-project#18957)

Signed-off-by: 刘全 <[email protected]>
Co-authored-by: 刘全 <[email protected]>

* [Misc][Frontend] passthrough `bad_words` (vllm-project#19564)

Signed-off-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>

* [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (vllm-project#19660)

Signed-off-by: Ye (Charlotte) Qi <[email protected]>

* [TPU] support attention head dim smaller than 128 (vllm-project#19620)

Signed-off-by: Chengji Yao <[email protected]>
Co-authored-by: mgoin <[email protected]>

* [MISC] typo fix (vllm-project#19672)

Signed-off-by: Andy Xie <[email protected]>

* [CI] Add mteb testing for rerank models (vllm-project#19344)

* [Docs] Move multiproc doc to v1 dir (vllm-project#19651)

Signed-off-by: Russell Bryant <[email protected]>

* [Kernel] GGUF MMVQ kernel for multiple input vectors (vllm-project#18754)

Signed-off-by: SzymonOzog <[email protected]>

* [BugFix] Don't catch BaseException when dumping execute_model errors (vllm-project#19626)

Signed-off-by: Nick Hill <[email protected]>

* [DOC] Add reasoning capability to vLLM streamlit code (vllm-project#19557)

* [Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts. (vllm-project#19652)

Signed-off-by: Shawn Tan <[email protected]>

* [Bugfix] Fix TP inference for Flex attention backend (vllm-project#19657)

Signed-off-by: Isotr0py <[email protected]>

* [MISC] bump huggingface_hub pkg to 0.33.0 (vllm-project#19547)

Signed-off-by: Andy Xie <[email protected]>

* [Bugfix] fix missing 'finish_reason': null in streaming chat (vllm-project#19662)

Signed-off-by: chaunceyjiang <[email protected]>

* [Kernels] Use empty for modular MoE workspaces (vllm-project#19667)

Signed-off-by: Bill Nell <[email protected]>

* [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (vllm-project#19677)

Signed-off-by: QscQ <[email protected]>

* [V1] Change return type on get_multimodal_embeddings() (vllm-project#19446)

Signed-off-by: Russell Bryant <[email protected]>

* fix

Signed-off-by: Amog Kamsetty <[email protected]>

* remove logging

Signed-off-by: Amog Kamsetty <[email protected]>

---------

Signed-off-by: raushan <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: Varun <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Lukas Geiger <[email protected]>
Signed-off-by: calvin chen <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: 汪志鹏 <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: 许文卿 <[email protected]>
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Yang Wang <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: 22quinn <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: Patrick von Platen <[email protected]>
Signed-off-by: Chiyue Wei <[email protected]>
Signed-off-by: Povilas Kanapickas <[email protected]>
Signed-off-by: Luis Vega <[email protected]>
Signed-off-by: Jerry Zhang <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Xu Song <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Dipika Sikka <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Siqi Yan <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Md. Shafi Hussain <[email protected]>
Signed-off-by: npanpaliya <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Qiliang Cui <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: drisspg <[email protected]>
Signed-off-by: Lifan Shen <[email protected]>
Signed-off-by: pramkuma <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: Richard Zou <[email protected]>
Signed-off-by: Xu Wenqing <[email protected]>
Signed-off-by: Akash Kaothalkar <[email protected]>
Signed-off-by: yZhen <[email protected]>
Signed-off-by: KsuParkhamchuk <[email protected]>
Signed-off-by: cr7258 <[email protected]>
Signed-off-by: Conroy Cheers <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: liusiqian <[email protected]>
Signed-off-by: Pavani Majety <[email protected]>
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Signed-off-by: Tianyu Guo <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: Anna Pendleton <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: Yunqiu Guo <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: py-andy-c <[email protected]>
Signed-off-by: niu_he <[email protected]>
Signed-off-by: Junhao Li <[email protected]>
Signed-off-by: artetaout <[email protected]>
Signed-off-by: ximing.wxm <[email protected]>
Signed-off-by: Runzhen Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: strutive07 <[email protected]>
Signed-off-by: 2niuhe <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: mobicham <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: qizixi <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Saheli Bhattacharjee <[email protected]>
Signed-off-by: 刘全 <[email protected]>
Signed-off-by: Francesco Bertolotti <[email protected]>
Signed-off-by: SzymonOzog <[email protected]>
Signed-off-by: Shawn Tan <[email protected]>
Signed-off-by: QscQ <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Co-authored-by: Raushan Turganbay <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: CYJiang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: SorenDreano <[email protected]>
Co-authored-by: Soren Dreano <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Yan Ru Pei <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Lukas Geiger <[email protected]>
Co-authored-by: Vadim Gimpelson <[email protected]>
Co-authored-by: Calvin Chen <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: 汪志鹏 <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Xu Wenqing <[email protected]>
Co-authored-by: Lain <[email protected]>
Co-authored-by: jmswen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: 22quinn <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Chiyue Wei <[email protected]>
Co-authored-by: Povilas Kanapickas <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Luis Vega <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Xu Song <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Jinghui Zhang <[email protected]>
Co-authored-by: jinghui <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Siqi Yan <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Yu Guo <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Md. Shafi Hussain <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Adolfo Victoria <[email protected]>
Co-authored-by: Chenyaaang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Lifans <[email protected]>
Co-authored-by: pramenku <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Akash kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>
Co-authored-by: jennyyyyzhen <[email protected]>
Co-authored-by: yZhen <[email protected]>
Co-authored-by: Kseniya Parkhamchuk <[email protected]>
Co-authored-by: Se7en <[email protected]>
Co-authored-by: Conroy Cheers <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: liusiqian-tal <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
Co-authored-by: XiongfeiWei <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anna Pendleton <[email protected]>
Co-authored-by: Louie Tsai <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Rachel Guo <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: py-andy-c <[email protected]>
Co-authored-by: niu_he <[email protected]>
Co-authored-by: Junhao Li <[email protected]>
Co-authored-by: leopardracer <[email protected]>
Co-authored-by: artetaout <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: runzhen <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: wonjun Jang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: mobicham <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: qizixi <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Saheli Bhattacharjee <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: maobaolong <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: quanliu <[email protected]>
Co-authored-by: 刘全 <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Navanit Dubey <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: qscqesze <[email protected]>
tanujtiwari1998 pushed a commit to character-tech/vllm that referenced this pull request Jul 7, 2025
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Lifans <[email protected]>
Co-authored-by: pramenku <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Akash kaothalkar <[email protected]>
Co-authored-by: Akash Kaothalkar <[email protected]>
Co-authored-by: jennyyyyzhen <[email protected]>
Co-authored-by: yZhen <[email protected]>
Co-authored-by: Kseniya Parkhamchuk <[email protected]>
Co-authored-by: Se7en <[email protected]>
Co-authored-by: Conroy Cheers <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: liusiqian-tal <[email protected]>
Co-authored-by: Pavani Majety <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
Co-authored-by: XiongfeiWei <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Anna Pendleton <[email protected]>
Co-authored-by: Louie Tsai <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Rachel Guo <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: py-andy-c <[email protected]>
Co-authored-by: niu_he <[email protected]>
Co-authored-by: Junhao Li <[email protected]>
Co-authored-by: leopardracer <[email protected]>
Co-authored-by: artetaout <[email protected]>
Co-authored-by: Ximingwang-09 <[email protected]>
Co-authored-by: ximing.wxm <[email protected]>
Co-authored-by: runzhen <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: wonjun Jang <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: mobicham <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: kourosh hakhamaneshi <[email protected]>
Co-authored-by: qizixi <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Saheli Bhattacharjee <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: maobaolong <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: quanliu <[email protected]>
Co-authored-by: 刘全 <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Francesco Bertolotti <[email protected]>
Co-authored-by: Szymon Ożóg <[email protected]>
Co-authored-by: Navanit Dubey <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: qscqesze <[email protected]>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Labels
ready, v1
Projects
None yet
Development
Successfully merging this pull request may close these issues:
[Bug]: deepseek-vl2 RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
4 participants
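For context on the linked issue above: the RuntimeError occurs when float32 multimodal inputs (e.g. pixel values produced by an HF processor) are fed to a model whose weights run in bfloat16. Below is a minimal sketch of the kind of cast that resolves such a mismatch for a transformers BatchFeature; the cast_batch_feature helper and all names here are illustrative, not the PR's actual code. The sketch casts only floating-point tensors in the BatchFeature's underlying .data dict, leaving integer tensors such as input_ids untouched.

    # Minimal sketch (illustrative, not the PR's exact code): cast only the
    # floating-point tensors inside a BatchFeature to the model dtype.
    import torch
    from transformers.feature_extraction_utils import BatchFeature

    def cast_batch_feature(bf: BatchFeature, dtype: torch.dtype) -> BatchFeature:
        # BatchFeature is a UserDict; mutate its underlying .data dict in place.
        for key, value in bf.data.items():
            if isinstance(value, torch.Tensor) and value.is_floating_point():
                bf.data[key] = value.to(dtype)
        return bf

    # Example: pixel_values arrive as float32 while the model runs in bfloat16.
    bf = BatchFeature({
        "input_ids": torch.tensor([[1, 2, 3]]),          # int64, must stay int
        "pixel_values": torch.rand(1, 3, 336, 336),      # float32 by default
    })
    cast_batch_feature(bf, torch.bfloat16)
    assert bf["pixel_values"].dtype == torch.bfloat16
    assert bf["input_ids"].dtype == torch.long           # integers unchanged

Casting only floating-point entries matters because blindly casting everything would corrupt token-index tensors, which must remain integral for embedding lookups.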