[Frontend] Expose revision arg in OpenAI server #8501

Merged
merged 3 commits into from
Sep 16, 2024

Conversation

lewtun
Contributor

@lewtun lewtun commented Sep 16, 2024

This PR exposes the revision arg in the OpenAI-compatible server so that one can run inference at a desired model revision via:

vllm serve {MODEL_ID} --revision {REVISION}

In particular, it resolves the following error that arises when passing --revision to vllm serve for model repos that do not have a config.json file on the main branch (a common pattern among HF Hub users who use branches for versioning experiments):

Traceback (most recent call last):
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/HuggingFaceH4/gemma-2-2b-gkd/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_file
    resolved_file = hf_hub_download(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1303, in _hf_hub_download_to_cache_dir
    (url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1752, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1674, in get_hf_file_metadata
    r = _request_wrapper(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 376, in _request_wrapper
    response = _request_wrapper(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 400, in _request_wrapper
    hf_raise_for_status(response)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 315, in hf_raise_for_status
    raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-66e7e09a-17e4310c5f1032c2189efc96;f483305a-4822-4bd4-a502-3a867fafecc9)

Entry Not Found for url: https://huggingface.co/HuggingFaceH4/gemma-2-2b-gkd/resolve/main/config.json.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/lewis/miniconda3/envs/mixeval/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/scripts.py", line 37, in serve
    asyncio.run(run_server(args))
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 462, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 108, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 130, in build_async_engine_client_from_engine_args
    if (model_is_embedding(engine_args.model, engine_args.trust_remote_code,
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 71, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 66, in get_config
    config = AutoConfig.from_pretrained(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 976, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/transformers/configuration_utils.py", line 632, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
    resolved_config_file = cached_file(
  File "/fsx/lewis/miniconda3/envs/mixeval/lib/python3.10/site-packages/transformers/utils/hub.py", line 456, in cached_file
    raise EnvironmentError(
OSError: HuggingFaceH4/gemma-2-2b-gkd does not appear to have a file named config.json. Checkout 'https://huggingface.co/HuggingFaceH4/gemma-2-2b-gkd/tree/main' for available files.
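
To make the failure mode in the traceback concrete, here is a minimal sketch, not the actual vLLM diff. It assumes only the URL pattern visible in the error above (`/resolve/main/config.json`). The names `config_url`, `EngineArgs`, `probe_before`, and `probe_after` are hypothetical stand-ins, and `v1.0` is a made-up branch name.

```python
from dataclasses import dataclass
from typing import Optional

HUB_ENDPOINT = "https://huggingface.co"


def config_url(repo_id: str, revision: Optional[str] = None) -> str:
    """Return the URL config.json would be resolved from (hypothetical helper)."""
    # With no revision forwarded, the lookup falls back to the "main" branch.
    branch = revision if revision is not None else "main"
    return f"{HUB_ENDPOINT}/{repo_id}/resolve/{branch}/config.json"


@dataclass
class EngineArgs:
    """Stand-in for the parsed CLI arguments (not vLLM's real class)."""
    model: str
    revision: Optional[str] = None


def probe_before(args: EngineArgs) -> str:
    # Bug pattern: the startup probe drops args.revision, so "main" is queried.
    return config_url(args.model)


def probe_after(args: EngineArgs) -> str:
    # Fix pattern: the revision is threaded through to the config lookup.
    return config_url(args.model, revision=args.revision)


# "v1.0" is an illustrative branch name, not one taken from the PR.
args = EngineArgs(model="HuggingFaceH4/gemma-2-2b-gkd", revision="v1.0")
print(probe_before(args))
print(probe_after(args))
```

The first printed URL matches the one in the 404 above. The second points at the requested branch, which is the behaviour one would expect once `--revision` reaches every place the config is loaded.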

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand the code.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@lewtun lewtun changed the title Expose revision arg in OpenAPI server Expose revision arg in OpenAI server Sep 16, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@lewtun lewtun changed the title Expose revision arg in OpenAI server [Bugfix] Expose revision arg in OpenAI server Sep 16, 2024
@lewtun lewtun changed the title [Bugfix] Expose revision arg in OpenAI server [Frontend] Expose revision arg in OpenAI server Sep 16, 2024
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks for the fix!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 16, 2024 11:33
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 16, 2024
@DarkLight1337 DarkLight1337 merged commit 837c196 into vllm-project:main Sep 16, 2024
61 checks passed
@lewtun lewtun deleted the patch-1 branch September 17, 2024 08:57
Manikandan-Thangaraj-ZS0321 added a commit to Manikandan-Thangaraj-ZS0321/vllm that referenced this pull request Sep 25, 2024
* [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032)

Co-authored-by: Dipika <[email protected]>

* [Frontend] Expose revision arg in OpenAI server (vllm-project#8501)

* [BugFix] Fix clean shutdown issues (vllm-project#8492)

* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)

* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)

* [doc] update doc on testing and debugging (vllm-project#8514)

* [Bugfix] Bind api server port before starting engine (vllm-project#8491)

* [perf bench] set timeout to debug hanging (vllm-project#8516)

* [misc] small qol fixes for release process (vllm-project#8517)

* [Bugfix] Fix 3.12 builds on main (vllm-project#8510)

Signed-off-by: Joe Runde <[email protected]>

* [refactor] remove triton based sampler (vllm-project#8524)

* [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525)

Signed-off-by: Alex-Brooks <[email protected]>

* [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)

* [torch.compile] register allreduce operations as custom ops (vllm-project#8526)

* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509)

Signed-off-by: Rui Qiao <[email protected]>

* [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)

* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)

* [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)

* [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515)

Co-authored-by: Cyrus Leung <[email protected]>

* [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)

* [Bugfix] Fix TP > 1 for new granite (vllm-project#8544)

Signed-off-by: Joe Runde <[email protected]>

* [doc] improve installation doc (vllm-project#8550)

Co-authored-by: Andy Dai <[email protected]>

* [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)

* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)

* [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)

* [Misc] Add argument to disable FastAPI docs (vllm-project#8554)

* [CI/Build] Avoid CUDA initialization (vllm-project#8534)

* [CI/Build] Update Ruff version (vllm-project#8469)

Signed-off-by: Aaron Pham <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157)

Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Simon Mo <[email protected]>

* [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)

* [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543)

Signed-off-by: Russell Bryant <[email protected]>

* [Model] Support Solar Model (vllm-project#8386)

Co-authored-by: Michael Goin <[email protected]>

* [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380)

Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)

* [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)

* [Bugfix] add `dead_error` property to engine client (vllm-project#8574)

Signed-off-by: Joe Runde <[email protected]>

* [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573)

Co-authored-by: [email protected]

* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models.  (vllm-project#8545)

* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)

* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)

* [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)

* [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)

* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)

* [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)

* [Doc] Add documentation for GGUF quantization (vllm-project#8618)

* Create SECURITY.md (vllm-project#8642)

* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)

* [Misc] guard against change in cuda library name (vllm-project#8609)

* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)

* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)

* [Core] Support Lora lineage and base model metadata management (vllm-project#6315)

* [Model] Add OLMoE (vllm-project#7922)

* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)

* [Bugfix] Validate SamplingParam n is an int (vllm-project#8548)

* [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)

* [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)

* [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)

* [Doc] neuron documentation update (vllm-project#8671)

Signed-off-by: omrishiv <[email protected]>

* [Hardware][AWS] update neuron to 2.20 (vllm-project#8676)

Signed-off-by: omrishiv <[email protected]>

* [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)

* [Core] Rename `PromptInputs` and `inputs`(vllm-project#8673)

* [MISC] add support custom_op check (vllm-project#8557)

Co-authored-by: youkaichao <[email protected]>

* [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)

* [beam search] add output for manually checking the correctness (vllm-project#8684)

* [Kernel] Build flash-attn from source (vllm-project#8245)

* [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)

* [Doc] Fix typo in AMD installation guide (vllm-project#8689)

* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)

* [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)

* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)

* [Bugfix] Refactor composite weight loading logic (vllm-project#8656)

* [ci][build] fix vllm-flash-attn (vllm-project#8699)

* [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)

* [Misc] Use NamedTuple in Multi-image example (vllm-project#8705)

Signed-off-by: Alex-Brooks <[email protected]>

* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)

* [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486)

Co-authored-by: litianjian <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)

* [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)

* [misc] upgrade mistral-common (vllm-project#8715)

* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)

* [Bugfix] Fix CPU CMake build (vllm-project#8723)

Co-authored-by: Yuan <[email protected]>

* [Bugfix] fix docker build for xpu (vllm-project#8652)

* [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657)

Signed-off-by: Alex-Brooks <[email protected]>

* [Hardware][CPU] Refactor CPU model runner (vllm-project#8729)

* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)

* [Model] Support pp for qwen2-vl (vllm-project#8696)

* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707)

* [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738)

Co-authored-by: youkaichao <[email protected]>

* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701)

Co-authored-by: mgoin <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>

* [Kernel][LoRA]  Add assertion for punica sgmv kernels (vllm-project#7585)

* [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575)

Signed-off-by: Russell Bryant <[email protected]>

* Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)

* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)

* [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)

* Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)

* re-implement beam search on top of vllm core (vllm-project#8726)

Co-authored-by: Brendan Wong <[email protected]>

* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)

* [MISC] Skip dumping inputs when unpicklable (vllm-project#8744)

* [Core][Model] Support loading weights by ID within models (vllm-project#7931)

* [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658)

Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)

* [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661)

Co-authored-by: mgoin <[email protected]>

* [Frontend] Batch inference for llm.chat() API  (vllm-project#8648)

Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)

* [CI/Build] fix setuptools-scm usage (vllm-project#8771)

* [misc] soft drop beam search (vllm-project#8763)

* [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)

* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047)

Signed-off-by: Travis Johnson <[email protected]>

* [Core] Adding Priority Scheduling (vllm-project#5958)

* [Bugfix] Use heartbeats instead of health checks (vllm-project#8583)

* Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)

* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)

* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)

* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)

* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)

* [Bugfix] load fc bias from config for eagle (vllm-project#8790)

---------

Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: omrishiv <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin Lin <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Alexey Kondratiev(AMD) <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Geun, Lim <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: 盏一 <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Amit Garg <[email protected]>
Co-authored-by: William Lin <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: saumya-saran <[email protected]>
Co-authored-by: Pastel! <[email protected]>
Co-authored-by: omrishiv <[email protected]>
Co-authored-by: zyddnys <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Huazhong Ji <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Jani Monoses <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Peter Salas <[email protected]>
Co-authored-by: Hanzhi Zhou <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Archit Patke <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: sohamparikh <[email protected]>
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed

3 participants