This release:
- 🎉 Supports embedding models on vLLM v1!
- 🔥 Removes all remaining support for vLLM v0
- ⚡ Contains performance and stability fixes for continuous batching
- ⚗️ Support for up to `--max-num-seqs 4 --max-model-len 8192 --tensor-parallel-size 4` has been tested on ibm-granite/granite-3.3-8b-instruct
- 📦 Officially supports vLLM 0.9.2 and 0.10.0
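As a sketch, the tested limits above correspond to a server launch along these lines (flags taken from the notes; assumes a Spyre-enabled vLLM install on appropriate hardware):

```shell
# Serve ibm-granite/granite-3.3-8b-instruct at the tested limits:
# up to 4 concurrent sequences, 8192-token context, 4-way tensor parallelism.
vllm serve ibm-granite/granite-3.3-8b-instruct \
    --max-num-seqs 4 \
    --max-model-len 8192 \
    --tensor-parallel-size 4
```

Configurations beyond these values have not been stated as tested in this release.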
What's Changed
- [SB] relax constraint on min number of new tokens by @yannicks1 in #322
- [CB] bug fix: account for prefill token by @yannicks1 in #320
- Documents a bit CB script and tests by @sducouedic in #300
- 🧪 add long context test by @joerunde in #330
- [docs] Add install from PyPI to docs by @ckadner in #327
- ⬆️ bump base image by @joerunde in #328
- [ppc64le] Introduce ppc64le benchmarking scripts by @Daniel-Schenker in #311
- [CB] Override number of Spyre blocks: replace env var with top level argument by @yannicks1 in #331
- [CB] Add scheduling tests by @sducouedic in #329
- 🎨 add values in test asserts by @prashantgupta24 in #333
- [CB] Refactoring/Cleaning up prepare_prompt/decode by @yannicks1 in #335
- feat: enable FP8 quantized models loading by @rafvasq in #316
- ♻️ Compatibility with vllm main by @prashantgupta24 in #338
- V1 embeddings by @maxdebayser in #277
- feat: detect CPUs and configure threading sensibly by @tjohnson31415 in #291
- [CB] Support pseudo batch size 1 for decode, adjust warmup by @yannicks1 in #287
- fix introduced merge conflict on main by @yannicks1 in #345
- Add CB API tests on the correct use of max_tokens by @gmarinho2 in #339
- ♻️ fix vllm:main by @prashantgupta24 in #341
- [CB] Optimization: Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching by @yannicks1 in #262
- [CI] Tests for graph comparison between vllm and AFTU by @wallashss in #286
- [CB] refactoring warmup for batch size 1 by @yannicks1 in #347
- [CB][Tests] Check output of scheduling tests on Spyre by @sducouedic in #337
- [v1] remove v0 code by @yannicks1 in #344
- ♻️ enable offline mode in GHA tests by @prashantgupta24 in #349
- ⬆️ bump base image with more CB fixes by @joerunde in #351
- Upstream compatibility tests by @maxdebayser in #343
- ⬆️ Bump locked vllm to 0.10.0 by @joerunde in #352
New Contributors
- @Daniel-Schenker made their first contribution in #311
Full Changelog: v0.5.3...v0.6.0