support bitsandbytes 8-bit and FP4 quantized models #7445

Merged: 8 commits into vllm-project:main on Aug 29, 2024

Conversation

@chenqianfzh (Contributor) commented Aug 12, 2024:

This PR does the following:

  1. support quantized bitsandbytes 8-bit models, such as meta-llama/Llama-Guard-3-8B-INT8
  2. support quantized bitsandbytes 4-bit FP4 models, such as PrunaAI/Einstein-v6.1-Llama3-8B-bnb-4bit-smashed
  3. Add comments about enforcing eager-mode in bnb quantization, as I identified it is a bug in the underlying dependency package of bitsandbytes.

FIX #6756
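
For context, a minimal, hypothetical usage sketch (not taken from this PR's diff): it loads the pre-quantized 8-bit checkpoint named in item 1 through vLLM's offline LLM API, mirroring the engine arguments used in the test snippet further down; the prompt and sampling settings are made up for illustration.

from vllm import LLM, SamplingParams

# Hypothetical sketch: load a checkpoint that is already quantized to 8-bit
# with bitsandbytes (see item 1 above). Both the quantization backend and the
# load format are set to "bitsandbytes".
llm = LLM(model="meta-llama/Llama-Guard-3-8B-INT8",
          quantization="bitsandbytes",
          load_format="bitsandbytes",
          enforce_eager=True)  # eager mode, per the note in item 3 above

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)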

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin (Member) left a comment:

This is great work and refactoring, I appreciate it! I need to do another pass since it's a bit dense, so it would be helpful if you could document more of the config arguments and _apply_8bit_weight.

@chenqianfzh (Contributor, Author) replied:

> This is great work and refactoring, I appreciate it! I need to do another pass since it's a bit dense, so it would be helpful if you could document more of the config arguments and _apply_8bit_weight.

Thanks. Will do.

Review thread on the test code:

with vllm_runner(model_name,
                 quantization='bitsandbytes',
                 load_format='bitsandbytes',
                 enforce_eager=True) as llm:

@jeejeelee (Collaborator) commented:

bitsandbytes-foundation/bitsandbytes#1330 has been merged. Regarding these tests, can we now use cudagraph?

@chenqianfzh (Contributor, Author) replied:

Hi @jeejeelee, thanks for your fix in bnb.

However, the latest release of the bnb package came out three weeks ago and does not yet include your fix.

I will update the code after the next bnb release is out.

A Member replied:

Agreed, I think it is worth doing the package upgrade in another PR

@jeejeelee (Collaborator) replied:

It seems this PR introduces a new BNB kernel. I'm not sure whether the previous modifications to BNB support this kernel. What I mean is, perhaps we should verify this first (by building BNB from source). If the kernel still isn't supported, we may need to keep refining the relevant BNB code.

If you're not available, I can verify it next week.

@chenqianfzh (Contributor, Author) replied on Aug 24, 2024:

@jeejeelee I tried the new bnb kernel with your fix, running the above tests (and some others) in graph mode, and it worked perfectly! :-)
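
A hedged sketch of what that graph-mode check might look like, assuming a bitsandbytes build from source that already contains bitsandbytes-foundation/bitsandbytes#1330; model_name and example_prompts are placeholders in the style of the test snippet above, and generate_greedy stands in for whatever generation helper the test fixture provides:

# Hypothetical variant of the test above: with a source build of bitsandbytes
# that contains the cudagraph fix, enforce_eager can be dropped so the test
# runs with CUDA graph capture (the default).
with vllm_runner(model_name,
                 quantization='bitsandbytes',
                 load_format='bitsandbytes') as llm:
    outputs = llm.generate_greedy(example_prompts, max_tokens=32)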

@mgoin (Member) left a comment:

Thanks for the improvements, this looks good to me

@mgoin added the ready label on Aug 23, 2024.

@chenqianfzh (Contributor, Author) commented:

@mgoin The test errors seem unrelated to my change. What shall I do?

@mgoin (Member) commented on Aug 26, 2024:

@chenqianfzh could you please merge with the latest main? Recent PRs don't seem to be failing, so I wouldn't expect test errors.

@mgoin (Member) commented on Aug 27, 2024:

I'm not sure what the issue is; I was also manually retrying tests in Buildkite. I will run the tests locally.

@chenqianfzh (Contributor, Author) replied:

> I'm not sure what the issue is; I was also manually retrying tests in Buildkite. I will run the tests locally.

My local tests always pass.

However, I suspect it is related to GPU memory not being released correctly. I am trying out a fix with a fake PR now.

@chenqianfzh (Contributor, Author) commented:

@mgoin I see that all the checks have passed now, could you help merge it? Thanks.

@mgoin merged commit 4664cea into vllm-project:main on Aug 29, 2024 (45 checks passed).
@chenqianfzh deleted the bnb-8bit branch on Aug 30, 2024.

@jvlinsta commented on Sep 5, 2024:

Does this also mean we can use bitsandbytes with tensor-parallel-size > 1?

@chenqianfzh (Contributor, Author) replied:

> Does this also mean we can use bitsandbytes with tensor-parallel-size > 1?

No, not yet.

I am working on TP with bnb now. It will be out in a different PR.

@jvlinsta commented:

Hi @chenqianfzh thanks for that! Where does that PR live, so I can keep following up on it? ^^

@jvlinsta commented:

Is it here? bytedance-iaas@e8d5453

@chenqianfzh (Contributor, Author) replied:

> Is it here? bd-iaas-us@e8d5453

Yep. #8434 is the PR for the bnb TP work.

@molereddy commented:

@chenqianfzh it doesn't seem that the vLLM BNB documentation has been updated to reflect that 8-bit quantization is now available; following the documentation, the default BNB quantization is 4-bit.
The usage of 4-bit vs. 8-bit is unclear to me from this PR. Can you clarify how to use 8-bit BNB quantization?
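
As a hedged illustration of the distinction being asked about (based only on this PR's description, not the official docs): the assumption here is that a checkpoint already quantized to 8-bit is picked up as 8-bit from its own quantization config, while in-flight quantization of an unquantized checkpoint stays on the default 4-bit path; the second model name below is just an example.

from vllm import LLM

# Assumed behavior, per this PR's description: a checkpoint that was already
# quantized to 8-bit with bitsandbytes loads on the 8-bit path.
llm_8bit = LLM(model="meta-llama/Llama-Guard-3-8B-INT8",
               quantization="bitsandbytes",
               load_format="bitsandbytes")

# In-flight quantization of an unquantized checkpoint (example model name)
# stays on the default 4-bit path described in the docs.
llm_4bit = LLM(model="huggyllama/llama-7b",
               quantization="bitsandbytes",
               load_format="bitsandbytes")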

Successfully merging this pull request may close the following issue:

  • [Bug]: Unable to run meta-llama/Llama-Guard-3-8B-INT8 (#6756)