
Conversation

@a-ys (Contributor) commented Jun 7, 2024

Description

This PR introduces:

  1. A "quantize" handler in the djl_serving.huggingface handler that uses AutoAWQ to quantize a model (a sketch of the AutoAWQ flow follows this list).
  2. A code path in the serving partitioning scripts that lets DIY users run quantization.
    • This path is enabled when a user passes the --quantization awq option to the partition script, or sets option.quantize=awq in serving.properties.
  3. A Neo handler script that lets Neo run quantization through Neo's expected interface.
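
For reference, the quantize path boils down to the standard AutoAWQ flow. This is a minimal sketch, not the handler's exact code: the paths are illustrative, and all quant_config values except the group size of 128 (noted in the next section) are assumed AutoAWQ defaults.

# Minimal sketch of the AutoAWQ flow behind the quantize handler (illustrative, not the PR's exact code).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/opt/ml/input/data/training"  # assumed input location
output_path = "/opt/djl/output"             # assumed output location

# q_group_size=128 matches the quant_config described below; the other values are AutoAWQ defaults
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantization, then save the quantized checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)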

Note on serving tensor_parallel_degree

For Llama-2-7b, tp_degree is limited to 1 or 2 by the vLLM AWQ implementation, because tp_degree must satisfy

(intermediate_size / tp_degree) % group_size == 0

where:

  • Llama-2-7b intermediate_size = 11008
  • group_size = 128, as defined in quant_config in djl_serving.huggingface.quantize()

A quick check of this constraint is shown after this list.
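
As a quick sanity check (not part of the PR), this snippet enumerates which tp_degree values satisfy the constraint for Llama-2-7b:

# Quick check of the tp_degree constraint for Llama-2-7b.
intermediate_size = 11008  # Llama-2-7b
group_size = 128           # from quant_config

valid = [tp for tp in (1, 2, 4, 8)
         if (intermediate_size // tp) % group_size == 0]
print(valid)  # [1, 2] -- e.g. 11008 // 4 = 2752, and 2752 % 128 = 64, so tp=4 fails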

Validation

Llama-2-7b (Working)

This feature has been tested with Llama-2-7b:
Quantization command:

docker run -it --rm \
        -v ./llama-2-7b:/opt/ml/input/data/training \
        -v ./logs:/opt/djl/logs \
        -v ./output:/opt/djl/output \
        --runtime=nvidia \
        --shm-size=12gb \
        deepjavalibrary/djl-serving:lmi-nightly partition --save-mp-checkpoint-path /opt/djl/output --skip-copy

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

The resulting quantized model was loaded and served with the LMI container.

Llama-2-70b (Not passing)

Quantization is currently failing with:

    model_service.invoke_handler("quantize", inputs)
  File "/tmp/djlserving/cache/djl_python/service_loader.py", line 29, in invoke_handler
    return getattr(self.module, function_name)(inputs)
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 607, in quantize
    _service.quantize(inputs.get_properties())
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 555, in quantize
    awq_model.quantize(self.tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 186, in quantize
    self.quantizer.quantize()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 156, in quantize
    scales_list = [
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 157, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 277, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 334, in _compute_best_scale
    self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor
    assert torch.isnan(w).sum() == 0
AssertionError

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

@a-ys a-ys requested review from a team, frankfliu and zachgk as code owners June 7, 2024 19:19
@lanking520 (Contributor) left a comment

Please make sure you have some CI testing in place to verify that these functions work.

a-ys added 5 commits June 11, 2024 23:16, including:

  • Fix an issue in partition where model weights will not be loaded if .safetensors files are not present, regardless of whether or not .bin weights are present.
  • Add envvar support to the partition PropertiesManager.
@a-ys (Contributor, Author) commented Jun 12, 2024

Update: these last few commits include:

  • Refactoring quantization out of huggingface.py and into partition.py.
  • Envvar configuration support for partitioning, reusing existing functionality from trt_llm_partition.py (see the sketch after this list).
  • A bugfix for loading .bin files.
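
For context, LMI-style containers conventionally map OPTION_* environment variables onto option.* properties (e.g. OPTION_QUANTIZE=awq for option.quantize=awq). Below is a minimal sketch of that mapping, assuming this convention; the function name and exact behavior are illustrative, not taken from the PR.

# Hypothetical sketch of envvar-to-property mapping (names assumed, not from this PR).
import os

def env_overrides(environ=os.environ):
    """Collect option.* properties from OPTION_* environment variables."""
    props = {}
    for key, value in environ.items():
        if key.startswith("OPTION_"):
            # e.g. OPTION_QUANTIZE=awq -> option.quantize=awq
            props["option." + key[len("OPTION_"):].lower()] = value
    return props

print(env_overrides({"OPTION_QUANTIZE": "awq", "OPTION_TENSOR_PARALLEL_DEGREE": "8"}))
# {'option.quantize': 'awq', 'option.tensor_parallel_degree': '8'}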

Additionally, Llama-2-70b can now be quantized. The earlier error was caused by corrupted model weights from an incomplete download.

@a-ys a-ys force-pushed the awq_integration branch from b86341c to 1984fb4 June 13, 2024 00:44
@sindhuvahinis sindhuvahinis merged commit 9f484df into deepjavalibrary:master Jun 13, 2024
sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 13, 2024
sindhuvahinis added a commit that referenced this pull request Jun 13, 2024