[Examples] Standardize AWQ example #1412

Merged
merged 8 commits on May 14, 2025

65 changes: 65 additions & 0 deletions examples/awq/README.md
@@ -0,0 +1,65 @@
# Quantizing Models with Activation-Aware Quantization (AWQ) #

Activation-Aware Quantization (AWQ) is a state-of-the-art technique for quantizing the weights of large language models using a small calibration dataset. The AWQ algorithm uses the calibration data to derive scaling factors that reduce the dynamic range of the weights while minimizing accuracy loss on the most salient weight values.
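
As a rough illustration of the idea (a toy sketch, not the LLM Compressor implementation), AWQ-style smoothing chooses per-input-channel scales from calibration activations and folds them between a layer's weights and the preceding activations; before quantization this leaves the layer output unchanged, while salient channels are better preserved once the scaled weights are quantized:

```python
import torch

# Toy illustration of AWQ-style scale folding (hypothetical, not llm-compressor code).
torch.manual_seed(0)
x = torch.randn(8, 16)   # calibration activations: [num_tokens, in_features]
w = torch.randn(32, 16)  # linear layer weight:     [out_features, in_features]

# Per-input-channel scales derived from activation magnitudes; channels with large
# activations are scaled up so they are better preserved after quantization.
s = x.abs().mean(dim=0).clamp(min=1e-5).sqrt()

# Scaling the weights up and the activations down by the same factors is a no-op
# before quantization, so the scales can be folded into the preceding operation.
y_ref = x @ w.T
y_scaled = (x / s) @ (w * s).T
assert torch.allclose(y_ref, y_scaled, atol=1e-5)
```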

The AWQ implementation in LLM Compressor is derived from the pioneering work of [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) and was developed with assistance from its original maintainer, [@casper-hansen](https://github.com/casper-hansen).

## AWQ Recipe ##

The AWQ recipe is interfaced as follows, where the `AWQModifier` adjusts the model's scales ahead of efficient weight quantization by the `QuantizationModifier`:

```python
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
AWQModifier(bits=4, symmetric=False),
QuantizationModifier(
ignore=["lm_head"],
config_groups={
"group_0": QuantizationScheme(
targets=["Linear"],
weights=QuantizationArgs(
num_bits=4,
type=QuantizationType.INT,
dynamic=False,
symmetric=False,
strategy=QuantizationStrategy.GROUP,
group_size=128,
),
)
},
),
]
```
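
The recipe is then applied in a single calibrated pass with `oneshot`, and the compressed model is saved to disk. The condensed sketch below mirrors the `llama_example.py` script added in this PR and assumes `model`, `tokenizer`, `ds` (the tokenized calibration set), `MODEL_ID`, `MAX_SEQUENCE_LENGTH`, and `NUM_CALIBRATION_SAMPLES` are defined as in that script:

```python
from llmcompressor import oneshot

# Apply the AWQ + quantization recipe using the calibration samples.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed weights and tokenizer to disk.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```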

## Compressing Your Own Model ##
To use your own model, start with an existing example and change the `model_id` to match your own model stub.
```python
from transformers import AutoModelForCausalLM

model_id = "path/to/your/model"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
torch_dtype="auto",
)
```

## Adding Mappings ##
In order to target the weight and activation scaling locations within the model, the `AWQModifier` must be provided with an AWQ mapping. For example, the AWQ mapping for the Llama family of models looks like this:

```python
[
AWQMapping(
"re:.*input_layernorm",
["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
),
AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
AWQMapping(
"re:.*post_attention_layernorm",
["re:.*gate_proj", "re:.*up_proj"],
),
AWQMapping(
"re:.*up_proj",
["re:.*down_proj"],
),
]
```

To support other model families, you can supply your own mappings via the `mappings` argument when instantiating the `AWQModifier`, or you can add them to the registry [here](/src/llmcompressor/modifiers/awq/mappings.py) (contributions are welcome!).
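
For example, a custom mapping set can be passed directly. The sketch below is illustrative only: it assumes `AWQMapping` is importable from the mappings module linked above, and the regexes are hypothetical placeholders to be adjusted to your architecture's module names.

```python
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.awq.mappings import AWQMapping

# Hypothetical mappings for a custom decoder block; adjust the regexes to the
# module names used by your model family.
custom_mappings = [
    AWQMapping(
        "re:.*input_layernorm",
        ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
    ),
    AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
]

modifier = AWQModifier(bits=4, symmetric=False, mappings=custom_mappings)
```
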
97 changes: 56 additions & 41 deletions examples/awq/awq_one_shot.py → examples/awq/llama_example.py
@@ -5,28 +5,61 @@
    QuantizationStrategy,
    QuantizationType,
)
from datasets import load_dataset
from lm_eval.utils import make_table
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

# This example demonstrates how to:
# 1) Run the `llm-compressor` implementation of AWQ
# 2) Evaluate the compressed model with the lm_eval framework
# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"

#
# 1) Run LLM Compressor AWQ implementation
#
# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(bits=4, symmetric=False),
    QuantizationModifier(
@@ -47,54 +80,36 @@
    ),
]

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


def get_calib_dataset(tokenizer):
    from datasets import load_dataset

    ds = load_dataset(
        DATASET_ID,
        split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES*100}]",
    )

    def preprocess(example):
        return {
            "input_ids": tokenizer.encode(example["text"].strip())[:MAX_SEQUENCE_LENGTH]
        }

    ds = (
        ds.shuffle(seed=42)
        .map(preprocess, remove_columns=ds.column_names)
        .filter(lambda example: len(example["input_ids"]) >= MAX_SEQUENCE_LENGTH)
        .select(range(NUM_CALIBRATION_SAMPLES))
    )

    return ds


# Apply algorithms.
oneshot(
    model=model,
    dataset=get_calib_dataset(tokenizer=tokenizer),
    dataset=ds,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

print("Done! model saved to", OUTPUT_DIR)
# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

#
# 2) Evaluate model on wikitext perplexity
#

results = lm_eval.simple_evaluate(
    model="vllm",
    model="hf",
    model_args={
        "pretrained": OUTPUT_DIR,
        "pretrained": SAVE_DIR,
        "add_bos_token": True,
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.5,