
Conversation

@a-ys (Contributor) commented Jun 7, 2024

Description

This PR introduces:

  1. A "quantize" handler in the djl_serving.huggingface handler that uses AutoAWQ to quantize a model (a sketch of the AutoAWQ flow follows this list).
  2. A code path in the serving partitioning scripts that lets DIY users run quantization.
    • This path is enabled when a user passes the --quantization awq option to the partition script, or sets option.quantize=awq in serving.properties.
  3. A Neo handler script that lets Neo run quantization through Neo's expected interface.
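
For reference, the quantize path boils down to the standard AutoAWQ flow. This is a minimal sketch, not the handler's exact code: the paths are illustrative, and all quant_config values except the group size of 128 (noted in the next section) are assumed AutoAWQ defaults.

# Minimal sketch of the AutoAWQ flow behind the quantize handler (illustrative, not the PR's exact code).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/opt/ml/input/data/training"  # assumed input location
output_path = "/opt/djl/output"             # assumed output location

# q_group_size=128 matches the quant_config described below; the other values are AutoAWQ defaults
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantization, then save the quantized checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_path)
tokenizer.save_pretrained(output_path)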

Note on serving tensor_parallel_degree

For Llama-2-7b, tp_degree is limited to 1 or 2 by the vLLM AWQ implementation, because tp_degree must satisfy

(intermediate_size / tp_degree) % group_size == 0

where:

  • Llama-2-7b intermediate_size = 11008
  • group_size = 128, as defined in quant_config in djl_serving.huggingface.quantize()

A quick check of this constraint is shown after this list.
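
As a quick sanity check (not part of the PR), this snippet enumerates which tp_degree values satisfy the constraint for Llama-2-7b:

# Quick check of the tp_degree constraint for Llama-2-7b.
intermediate_size = 11008  # Llama-2-7b
group_size = 128           # from quant_config

valid = [tp for tp in (1, 2, 4, 8)
         if (intermediate_size // tp) % group_size == 0]
print(valid)  # [1, 2] -- e.g. 11008 // 4 = 2752, and 2752 % 128 = 64, so tp=4 fails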

Validation

Llama-2-7b (Working)

This feature has been tested with Llama-2-7b:
Quantization command:

docker run -it --rm \
        -v ./llama-2-7b:/opt/ml/input/data/training \
        -v ./logs:/opt/djl/logs \
        -v ./output:/opt/djl/output \
        --runtime=nvidia \
        --shm-size=12gb \
        deepjavalibrary/djl-serving:lmi-nightly partition --save-mp-checkpoint-path /opt/djl/output --skip-copy

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

The resulting quantized model was loaded and served with the LMI container.

Llama-2-70b (Not passing)

Quantization is currently failing with:

    model_service.invoke_handler("quantize", inputs)
  File "/tmp/djlserving/cache/djl_python/service_loader.py", line 29, in invoke_handler
    return getattr(self.module, function_name)(inputs)
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 607, in quantize
    _service.quantize(inputs.get_properties())
  File "/tmp/djlserving/cache/djl_python/huggingface.py", line 555, in quantize
    awq_model.quantize(self.tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 186, in quantize
    self.quantizer.quantize()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 156, in quantize
    scales_list = [
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 157, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 277, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 334, in _compute_best_scale
    self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 69, in pseudo_quantize_tensor
    assert torch.isnan(w).sum() == 0
AssertionError

serving.properties:

engine=MPI
option.tensor_parallel_degree=8
option.quantize=awq

@a-ys a-ys requested review from a team, frankfliu and zachgk as code owners June 7, 2024 19:19
@lanking520 (Contributor) left a comment

Please make sure you have some CI testing in place to verify that these functions work.

a-ys added 5 commits June 11, 2024 23:16, including:

  • Fix an issue in partition where model weights will not be loaded if .safetensors files are not present, regardless of whether or not .bin weights are present.
  • Add envvar support to the partition PropertiesManager.
@a-ys (Contributor, Author) commented Jun 12, 2024

Update: these last few commits include:

  • Refactoring quantization out of huggingface.py and into partition.py.
  • Envvar configuration support for partitioning, reusing existing functionality from trt_llm_partition.py (see the sketch after this list).
  • A bugfix for loading .bin files.
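
For context, LMI-style containers conventionally map OPTION_* environment variables onto option.* properties (e.g. OPTION_QUANTIZE=awq for option.quantize=awq). Below is a minimal sketch of that mapping, assuming this convention; the function name and exact behavior are illustrative, not taken from the PR.

# Hypothetical sketch of envvar-to-property mapping (names assumed, not from this PR).
import os

def env_overrides(environ=os.environ):
    """Collect option.* properties from OPTION_* environment variables."""
    props = {}
    for key, value in environ.items():
        if key.startswith("OPTION_"):
            # e.g. OPTION_QUANTIZE=awq -> option.quantize=awq
            props["option." + key[len("OPTION_"):].lower()] = value
    return props

print(env_overrides({"OPTION_QUANTIZE": "awq", "OPTION_TENSOR_PARALLEL_DEGREE": "8"}))
# {'option.quantize': 'awq', 'option.tensor_parallel_degree': '8'}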

Additionally, Llama-2-70b can now be quantized. The earlier error was caused by corrupted model weights from an incomplete download.

@a-ys a-ys force-pushed the awq_integration branch from b86341c to 1984fb4 June 13, 2024 00:44
@sindhuvahinis sindhuvahinis merged commit 9f484df into deepjavalibrary:master Jun 13, 2024
sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 13, 2024
sindhuvahinis added a commit that referenced this pull request Jun 13, 2024