
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU #14129


Merged: 2 commits into vllm-project:main from vllm_int8_AARCH64_Enablement on Jul 10, 2025

Conversation

@nishith-fujitsu (Contributor) commented Mar 3, 2025

Description
This PR enables vLLM support for INT8 quantized models on the AArch64 architecture by enabling the ARM path for CPU inference of INT8 quantized models.

ARM Compatibility:
Modified the build scripts and configuration files to ensure compatibility with ARM processors.

Checklist

Code changes have been tested on ARM devices (Graviton3).

Modifications

  1. In the dnnl_helper file, a memory-tag check has been added for AArch64 CPUs to select the best-performing kernel.
  2. Added a flag to build vLLM with ACL; it is off by default. The flag builds the oneDNN kernels with ACL, which can be utilized by the CPU quantization kernels. The ACL library has to be built beforehand, and its path set in the ACL_ROOT_DIR environment variable.
  3. Added NEON intrinsics in cpu_types_arm.hpp for the structs required by the INT8 quantized kernels, enabling vLLM on ARM (see the sketch after this list).
  4. Added the flags required to enable the INT8 kernels in quant.hpp and torch_binding.cpp.
  5. Updated the oneDNN version to 3.8.1, as the INT8 matmul kernel implementation for AArch64/ARM machines is available in oneDNN 3.8.1 and later.
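
As a rough illustration of item 3, here is a minimal sketch of the NEON int8-to-fp32 widening pattern that such INT8 wrappers build on. This is not the PR's actual cpu_types_arm.hpp code; the function and variable names are hypothetical.

```cpp
// Minimal sketch, assuming AArch64 with NEON (compile with g++/clang++ on
// an ARM machine). NOT the PR's code: it only illustrates the int8 -> fp32
// widening pattern that INT8 dequantization kernels rely on.
#include <arm_neon.h>
#include <cstdint>

// Dequantize 16 int8 values to fp32 using a single per-tensor scale.
inline void dequant16_s8_to_f32(const int8_t* in, float* out, float scale) {
  const int8x16_t q = vld1q_s8(in);                // load 16 int8 lanes
  const int16x8_t lo = vmovl_s8(vget_low_s8(q));   // widen lanes 0..7 to i16
  const int16x8_t hi = vmovl_s8(vget_high_s8(q));  // widen lanes 8..15 to i16
  const float32x4_t s = vdupq_n_f32(scale);        // broadcast the scale
  const int32x4_t w[4] = {
      vmovl_s16(vget_low_s16(lo)), vmovl_s16(vget_high_s16(lo)),
      vmovl_s16(vget_low_s16(hi)), vmovl_s16(vget_high_s16(hi))};
  for (int i = 0; i < 4; ++i)                      // i32 -> f32, apply scale
    vst1q_f32(out + 4 * i, vmulq_f32(vcvtq_f32_s32(w[i]), s));
}
```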

Note: The ACL kernels will not run for INT8 because vLLM defaults to a per-channel quantization strategy, and ACL does not support per-channel quantization.
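
To make the per-channel vs. per-tensor distinction concrete, here is a hedged scalar sketch (hypothetical helper, not vLLM's API): per-tensor quantization shares one scale across the whole weight matrix, while per-channel keeps one scale per output channel, which is the mode ACL lacks.

```cpp
// Scalar sketch of the two INT8 dequantization strategies discussed above.
// Hypothetical helper, not vLLM's API: per-tensor uses one scale for all
// weights; per-channel (vLLM's default) uses one scale per output channel.
#include <cstddef>
#include <cstdint>

// Dequantize a row of n int8 weights. If per_channel is true, scales points
// to n per-channel scales; otherwise scales[0] is the single tensor scale.
void dequant_row(const int8_t* w, float* out, std::size_t n,
                 const float* scales, bool per_channel) {
  for (std::size_t j = 0; j < n; ++j)
    out[j] = static_cast<float>(w[j]) * scales[per_channel ? j : 0];
}
```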


github-actions bot commented Mar 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 3, 2025
@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch 2 times, most recently from fe4714f to 84f660e on March 4, 2025 05:44
@akote123 commented Mar 4, 2025

CC: @mgoin @tlrmchlsmth

@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch 3 times, most recently from 8687635 to b9f210f on March 4, 2025 06:56
@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch from 5012240 to 992cac7 on March 11, 2025 07:59
@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch from 992cac7 to f690372 on March 19, 2025 06:00
@nishith-fujitsu (Contributor, Author)

Hi @mgoin, could you please review my PR?
Thank you.

@nishith-fujitsu nishith-fujitsu changed the title [Feature] Vllm int8 quantization enablement for ARM CPUs [Hardware][CPU] Vllm int8 quantization enablement for ARM CPUs Mar 20, 2025
@abhijain1204fujitsu

Hi @mgoin, @tlrmchlsmth, could you please help review this PR?

@akote123

CC: @mgoin

@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch from 941e161 to 00306fb on June 11, 2025 06:59
@abhijain1204fujitsu

@mgoin, kindly help review the PR.

@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch from 00306fb to 537dd46 on July 8, 2025 05:54
@nishith-fujitsu nishith-fujitsu force-pushed the vllm_int8_AARCH64_Enablement branch from ddac255 to e2802af on July 8, 2025 06:29
@akote123 commented Jul 9, 2025

@mgoin, could you please help review the PR?

@mgoin (Member) left a comment

Thanks for the ping, and apologies for the delay. I'll take your word that you've tested the models and they work. In the future, it would be great if we could set up CI or publish results in the PR for what has been tested.

@mgoin mgoin enabled auto-merge (squash) July 9, 2025 22:13
@github-actions github-actions bot added the ready label Jul 9, 2025
@mgoin mgoin added the quantization label and removed the documentation, frontend, speculative-decoding, ready, ci/build, v1, and multi-modality labels Jul 9, 2025
@mergify mergify bot added the ci/build label Jul 9, 2025
@mgoin mgoin added the cpu label Jul 9, 2025
@mgoin mgoin merged commit c7753a9 into vllm-project:main Jul 10, 2025
107 checks passed
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Labels: ci/build, cpu, quantization
5 participants