
Commit 3058388

Add doc for client usage (#1914)
Signed-off-by: yiliu30 <[email protected]>
1 parent 29471df commit 3058388

File tree

3 files changed · +57 −0 lines changed

* README.md
* docs/3x/PT_WeightOnlyQuant.md
* docs/3x/client_quant.md

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -26,6 +26,7 @@ In particular, the tool provides the key features, typical examples, and open co
 * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)
 
 ## What's New
+* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
 * [2024/03] A new SOTA approach [AutoRound](https://github.com/intel/auto-round) Weight-Only Quantization on [Intel Gaudi2 AI accelerator](https://habana.ai/products/gaudi2/) is available for LLMs.
 
 ## Installation
```

docs/3x/PT_WeightOnlyQuant.md

Lines changed: 6 additions & 0 deletions
````diff
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
 - [HQQ](#hqq)
 - [Specify Quantization Rules](#specify-quantization-rules)
 - [Saving and Loading](#saving-and-loading)
+- [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
 - [Examples](#examples)
 
 ## Introduction
@@ -276,6 +277,11 @@ loaded_model = load(
 ) # Please note that the original_model parameter passes the original model.
 ```
 
+## Efficient Usage on Client-Side
+
+For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+
+
 ## Examples
 
 Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only) on how to quantize a model with WeightOnlyQuant.
````

docs/3x/client_quant.md

Lines changed: 50 additions & 0 deletions
Quantization on Client
==========================================

1. [Introduction](#introduction)
2. [Get Started](#get-started) \
   2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration) \
   2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)


## Introduction

For the `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.


## Get Started

### Get Default Algorithm Configuration

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.

```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Load an empty model from the saved state dict; the weights are not fully materialized.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# The default RTN configuration is selected based on the detected processor type.
quant_config = get_default_rtn_config()

# Quantize the model with the weight-only RTN algorithm.
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```

> [!TIP]
> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`.
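
For example, a minimal sketch of requesting the client configuration explicitly (the keyword form of the `processor_type` argument described in the tip above is assumed here):

```python
from neural_compressor.torch.quantization import get_default_rtn_config

# Explicitly request the lightweight client configuration instead of relying on auto-detection.
client_config = get_default_rtn_config(processor_type="client")
```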

For Windows machines, run the following command to utilize all available cores automatically:

```bash
python main.py
```

> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
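
For illustration, a minimal sketch of such a Linux setup (the logical CPU IDs and thread count below are assumptions; check the actual P-core layout with `lscpu` and adjust accordingly):

```bash
# Assumption: the P-cores are exposed as logical CPUs 0-15; verify with `lscpu`.
export OMP_NUM_THREADS=16        # match the number of logical CPUs in use
taskset -c 0-15 python main.py   # bind the quantization script to the P-cores only
```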

### Optimal Performance and Peak Memory Usage

Below are approximate performance and memory usage figures measured on a client machine with 24 cores and 32GB of RAM. They provide a rough estimate for quick reference and may vary based on the specific hardware and configuration.

- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)): the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
