[ Docs ] Update FP8 example to use dynamic per token #75
Conversation
```diff
-git clone https://github.com/vllm-project/llm-compressor.git
-cd llm-compressor
-pip install -e .
+pip install llmcompressor
```
I think we should pin the version so it is clear when this was made/updated
good idea
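For illustration, a pinned install along those lines might look like the following; the version number below is only a placeholder, not the actual release this example was written against:

```bash
# Pin llmcompressor so readers know which release the example was validated on.
# 0.1.0 is a placeholder version for illustration only.
pip install llmcompressor==0.1.0
```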
````diff
 ```python
 from llmcompressor.transformers import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier

 # Configure the quantization algorithm to run.
-recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])
+recipe = QuantizationModifier(targets="Linear",
+    scheme="FP8_Dynamic",
````
Is it fine to not use all caps? I thought the scheme was FP8_DYNAMIC
it seems to be working with `FP8_Dynamic`, but I can adjust it
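For context, a minimal end-to-end sketch of the dynamic per-token recipe is shown below, written with the all-caps `FP8_DYNAMIC` spelling discussed above; the model ID is a placeholder and the snippet illustrates the flow rather than reproducing the exact example text in this PR:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Placeholder model ID for illustration.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC applies static per-channel scales to the weights and computes
# activation scales dynamically per token at runtime, so no calibration
# dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the recipe in one shot.
oneshot(model=model, recipe=recipe)
```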
```python
# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
```
`save_compressed=True` is the default now?
yes
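Given that answer, the save step could be written without the explicit flag; a small sketch, assuming the `model`, `tokenizer`, and `SAVE_DIR` from the snippets above:

```python
# save_compressed defaults to True, so the weights are written in the
# compressed format without passing the flag explicitly.
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```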
Neural Magic's fork of `lm-evaluation-harness` implements the evaluation strategy used by Meta in the Llama3.1 launch. You can install this branch from source below:

```bash
pip install vllm
pip install git+https://github.com/neuralmagic/lm-evaluation-harness.git@a0e54e5f1a0a52abaedced474854ae2ce4e68ded
```
It may be best to use a task that doesn't require a fork of lm-eval to reproduce results. AFAIK it is only ARC-C and GSM8k that require these custom changes. Winogrande is pretty fast, so maybe use that with lm-eval==0.4.3
I like GSM because it's easy to understand + it's a good proof point for users that it's working in a generative task
Do you need the reproduction of the paper results, then? I think we shouldn't push people towards our fork if possible. I also think the most realistic example shows evals for both the unquantized and quantized checkpoints, so matching this specific CoT setup shouldn't matter.
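For comparison, an evaluation against the upstream `lm-eval` release suggested above might look roughly like this; the checkpoint path, task, few-shot count, and batch size are placeholders rather than the settings used in this PR:

```bash
pip install vllm lm-eval==0.4.3

# Run the quantized checkpoint through upstream lm-eval via vLLM (no fork needed);
# gsm8k is used here as the generative task mentioned in the discussion.
lm_eval --model vllm \
  --model_args pretrained=./Meta-Llama-3-8B-Instruct-W8A8-FP8,add_bos_token=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
```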
* reduce appropriate dim
* tests

SUMMARY:
Update the FP8 example to use the `FP8_DYNAMIC` scheme (dynamic per-token activation quantization).

TEST PLAN: