
Commit 2fc7255

xin3he and chensuyue authored
implement incbench command for ease-of-use benchmark (#1884)
# Description

- Implement the `incbench` command as the entrypoint for the ease-of-use benchmark.
- Automatically check NUMA/socket info and dump it as a table for ease of understanding.
- Support both the Linux and Windows platforms.
- Add benchmark documentation.
- Dump a benchmark summary.
- Add benchmark UTs.

# General Use Cases

- `incbench main.py`: run 1 instance on NUMA:0.
- `incbench --num_i 2 main.py`: run 2 instances on NUMA:0.
- `incbench --num_c 2 main.py`: run multi-instances with 2 cores per instance on NUMA:0.
- `incbench -C 24-47 main.py`: run 1 instance on COREs:24-47.
- `incbench -C 24-47 --num_c 4 main.py`: run multi-instances with 4 COREs per instance on COREs:24-47.

---------

Signed-off-by: xin3he <[email protected]>
Co-authored-by: chen, suyue <[email protected]>
1 parent de8577e commit 2fc7255

File tree

17 files changed: +914 −238 lines changed


docs/3x/benchmark.md

Lines changed: 61 additions & 0 deletions
Benchmark
---

1. [Introduction](#introduction)
2. [Supported Matrix](#supported-matrix)
3. [Usage](#usage)

## Introduction

Intel Neural Compressor provides the `incbench` command to launch Intel CPU performance benchmarks.

To reach peak performance on Intel Xeon CPUs, each instance should avoid crossing NUMA nodes.
Therefore, by default, `incbench` launches 1 instance on the first NUMA node.

## Supported Matrix

| Platform | Status |
|:---:|:---:|
| Linux | &#10004; |
| Windows | &#10004; |

## Usage

| Parameters | Default | Comments |
|:----------------------:|:------------------------:|:-------------------------------------:|
| num_instances | 1 | Number of instances |
| num_cores_per_instance | None | Number of cores in each instance |
| C, cores | 0-${num_cores_on_NUMA-1} | Decides the visible core range |
| cross_memory | False | Whether to allocate memory across NUMA nodes |

> Note: cross_memory should be set to True only when memory is insufficient.

### General Use Cases

1. `incbench main.py`: run 1 instance on NUMA:0.
2. `incbench --num_i 2 main.py`: run 2 instances on NUMA:0.
3. `incbench --num_c 2 main.py`: run multi-instances with 2 cores per instance on NUMA:0.
4. `incbench -C 24-47 main.py`: run 1 instance on COREs:24-47.
5. `incbench -C 24-47 --num_c 4 main.py`: run multi-instances with 4 COREs per instance on COREs:24-47.

> Note:
> - `num_i` works the same as `num_instances`
> - `num_c` works the same as `num_cores_per_instance`
### Dump Throughput and Latency Summary

To merge benchmark results from multiple instances, `incbench` automatically scans each instance's log file for "throughput" and "latency" messages matching the following patterns.

```python
throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
```
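For illustration, here is a minimal sketch of how such patterns could be applied to per-instance logs. The sample log contents and the sum/average aggregation below are assumptions for demonstration, not the actual `incbench` implementation:

```python
import re

throughput_pattern = r"[T,t]hroughput:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"
latency_pattern = r"[L,l]atency:\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z/]*)"

# Hypothetical per-instance log contents; real logs are produced by the benchmarked script.
logs = [
    "Throughput: 118.254 samples/sec\nLatency: 8.456 ms",
    "Throughput: 120.731 samples/sec\nLatency: 8.283 ms",
]

throughputs, latencies = [], []
for log in logs:
    t = re.search(throughput_pattern, log)
    l = re.search(latency_pattern, log)
    if t:
        throughputs.append(float(t.group(1)))  # group(2) would hold the unit, e.g. "samples/sec"
    if l:
        latencies.append(float(l.group(1)))

# Assumed aggregation: sum throughput across instances, average latency.
print("Total throughput: {:.3f} samples/sec".format(sum(throughputs)))
print("Average latency: {:.3f} ms".format(sum(latencies) / len(latencies)))
```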
#### Demo usage

```python
print("Throughput: {:.3f} samples/sec".format(throughput))
print("Latency: {:.3f} ms".format(latency * 10**3))
```

examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/run_benchmark.sh

Lines changed: 22 additions & 10 deletions
@@ -75,22 +75,34 @@ function run_benchmark {
     if [ "${topology}" = "opt_125m_ipex_sq" ]; then
         model_name_or_path="facebook/opt-125m"
-        extra_cmd=$extra_cmd" --ipex --sq --alpha 0.5"
+        extra_cmd=$extra_cmd" --ipex"
     elif [ "${topology}" = "llama2_7b_ipex_sq" ]; then
         model_name_or_path="meta-llama/Llama-2-7b-hf"
-        extra_cmd=$extra_cmd" --ipex --sq --alpha 0.8"
+        extra_cmd=$extra_cmd" --ipex"
     elif [ "${topology}" = "gpt_j_ipex_sq" ]; then
         model_name_or_path="EleutherAI/gpt-j-6b"
-        extra_cmd=$extra_cmd" --ipex --sq --alpha 1.0"
+        extra_cmd=$extra_cmd" --ipex"
     fi
 
-    python -u run_clm_no_trainer.py \
-        --model ${model_name_or_path} \
-        --approach ${approach} \
-        --output_dir ${tuned_checkpoint} \
-        --task ${task} \
-        --batch_size ${batch_size} \
-        ${extra_cmd} ${mode_cmd}
+    if [[ ${mode} == "accuracy" ]]; then
+        python -u run_clm_no_trainer.py \
+            --model ${model_name_or_path} \
+            --approach ${approach} \
+            --output_dir ${tuned_checkpoint} \
+            --task ${task} \
+            --batch_size ${batch_size} \
+            ${extra_cmd} ${mode_cmd}
+    elif [[ ${mode} == "performance" ]]; then
+        incbench --num_cores_per_instance 4 run_clm_no_trainer.py \
+            --model ${model_name_or_path} \
+            --approach ${approach} \
+            --batch_size ${batch_size} \
+            --output_dir ${tuned_checkpoint} \
+            ${extra_cmd} ${mode_cmd}
+    else
+        echo "Error: No such mode: ${mode}"
+        exit 1
+    fi
 }
 
 main "$@"

examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/run_clm_no_trainer.py

Lines changed: 46 additions & 51 deletions
@@ -2,7 +2,7 @@
 import os
 import sys
 
-sys.path.append('./')
+sys.path.append("./")
 import time
 import re
 import torch
@@ -12,15 +12,11 @@
 from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
 
 parser = argparse.ArgumentParser()
+parser.add_argument("--model", nargs="?", default="EleutherAI/gpt-j-6b")
+parser.add_argument("--trust_remote_code", default=True, help="Transformers parameter: use the external repo")
 parser.add_argument(
-    "--model", nargs="?", default="EleutherAI/gpt-j-6b"
+    "--revision", default=None, help="Transformers parameter: set the model hub commit number"
 )
-parser.add_argument(
-    "--trust_remote_code", default=True,
-    help="Transformers parameter: use the external repo")
-parser.add_argument(
-    "--revision", default=None,
-    help="Transformers parameter: set the model hub commit number")
 parser.add_argument("--dataset", nargs="?", default="NeelNanda/pile-10k", const="NeelNanda/pile-10k")
 parser.add_argument("--output_dir", nargs="?", default="./saved_results")
 parser.add_argument("--quantize", action="store_true")
@@ -29,29 +25,26 @@
     action="store_true",
     help="By default it is int8-fp32 mixed, to enable int8 mixed amp bf16 (work on platforms like SPR)",
 )
+parser.add_argument("--seed", type=int, default=42, help="Seed for sampling the calibration data.")
 parser.add_argument(
-    '--seed',
-    type=int, default=42, help='Seed for sampling the calibration data.'
+    "--approach", type=str, default="static", help="Select from ['dynamic', 'static', 'weight-only']"
 )
-parser.add_argument("--approach", type=str, default='static',
-                    help="Select from ['dynamic', 'static', 'weight-only']")
 parser.add_argument("--int8", action="store_true")
 parser.add_argument("--ipex", action="store_true", help="Use intel extension for pytorch.")
 parser.add_argument("--load", action="store_true", help="Load quantized model.")
 parser.add_argument("--accuracy", action="store_true")
 parser.add_argument("--performance", action="store_true")
-parser.add_argument("--iters", default=100, type=int,
-                    help="For accuracy measurement only.")
-parser.add_argument("--batch_size", default=1, type=int,
-                    help="For accuracy measurement only.")
-parser.add_argument("--save_accuracy_path", default=None,
-                    help="Save accuracy results path.")
-parser.add_argument("--pad_max_length", default=512, type=int,
-                    help="Pad input ids to max length.")
-parser.add_argument("--calib_iters", default=512, type=int,
-                    help="calibration iters.")
-parser.add_argument("--tasks", default="lambada_openai,hellaswag,winogrande,piqa,wikitext",
-                    type=str, help="tasks for accuracy validation")
+parser.add_argument("--iters", default=100, type=int, help="For accuracy measurement only.")
+parser.add_argument("--batch_size", default=1, type=int, help="For accuracy measurement only.")
+parser.add_argument("--save_accuracy_path", default=None, help="Save accuracy results path.")
+parser.add_argument("--pad_max_length", default=512, type=int, help="Pad input ids to max length.")
+parser.add_argument("--calib_iters", default=512, type=int, help="calibration iters.")
+parser.add_argument(
+    "--tasks",
+    default="lambada_openai,hellaswag,winogrande,piqa,wikitext",
+    type=str,
+    help="tasks for accuracy validation",
+)
 parser.add_argument("--peft_model_id", type=str, default=None, help="model_name_or_path of peft model")
 # ============SmoothQuant configs==============
 parser.add_argument("--sq", action="store_true")
@@ -91,7 +84,7 @@ def collate_batch(self, batch):
             pad_len = self.pad_max - input_ids.shape[0]
             last_ind.append(input_ids.shape[0] - 1)
             if self.is_calib:
-                input_ids = input_ids[:self.pad_max] if len(input_ids) > self.pad_max else input_ids
+                input_ids = input_ids[: self.pad_max] if len(input_ids) > self.pad_max else input_ids
             else:
                 input_ids = pad(input_ids, (0, pad_len), value=self.pad_val)
             input_ids_padded.append(input_ids)
@@ -144,6 +137,7 @@ def get_user_model():
 
     if args.peft_model_id is not None:
         from peft import PeftModel
+
        user_model = PeftModel.from_pretrained(user_model, args.peft_model_id)
 
     # to channels last
@@ -158,7 +152,9 @@ def get_user_model():
         calib_dataset = load_dataset(args.dataset, split="train")
         # calib_dataset = datasets.load_from_disk('/your/local/dataset/pile-10k/') # use this if trouble with connecting to HF
         calib_dataset = calib_dataset.shuffle(seed=args.seed)
-        calib_evaluator = Evaluator(calib_dataset, tokenizer, args.batch_size, pad_max=args.pad_max_length, is_calib=True)
+        calib_evaluator = Evaluator(
+            calib_dataset, tokenizer, args.batch_size, pad_max=args.pad_max_length, is_calib=True
+        )
         calib_dataloader = DataLoader(
             calib_evaluator.dataset,
             batch_size=calib_size,
@@ -167,6 +163,7 @@ def get_user_model():
         )
 
         from neural_compressor.torch.quantization import SmoothQuantConfig
+
         args.alpha = eval(args.alpha)
         excluded_precisions = [] if args.int8_bf16_mixed else ["bf16"]
         quant_config = SmoothQuantConfig(alpha=args.alpha, folding=False, excluded_precisions=excluded_precisions)
@@ -176,6 +173,7 @@ def get_user_model():
 
         from neural_compressor.torch.algorithms.smooth_quant import move_input_to_device
         from tqdm import tqdm
+
         def run_fn(model):
             calib_iter = 0
             for batch in tqdm(calib_dataloader, total=args.calib_iters):
@@ -186,16 +184,18 @@ def run_fn(model):
                     model(**batch)
                 else:
                     model(batch)
-
+
                 calib_iter += 1
                 if calib_iter >= args.calib_iters:
                     break
            return
 
         from utils import get_example_inputs
+
         example_inputs = get_example_inputs(user_model, calib_dataloader)
 
         from neural_compressor.torch.quantization import prepare, convert
+
         user_model = prepare(model=user_model, quant_config=quant_config, example_inputs=example_inputs)
         run_fn(user_model)
         user_model = convert(user_model)
@@ -207,6 +207,7 @@ def run_fn(model):
 if args.int8 or args.int8_bf16_mixed:
     print("load int8 model")
     from neural_compressor.torch.quantization import load
+
     tokenizer = AutoTokenizer.from_pretrained(args.model)
     config = AutoConfig.from_pretrained(args.model)
     user_model = load(os.path.abspath(os.path.expanduser(args.output_dir)))
@@ -218,6 +219,7 @@ def run_fn(model):
 if args.accuracy:
     user_model.eval()
     from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
+
     eval_args = LMEvalParser(
         model="hf",
         user_model=user_model,
@@ -233,32 +235,25 @@ def run_fn(model):
         else:
             acc = results["results"][task_name]["acc,none"]
         print("Accuracy: %.5f" % acc)
-    print('Batch size = %d' % args.batch_size)
+    print("Batch size = %d" % args.batch_size)
 
 if args.performance:
     user_model.eval()
-    from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
+    batch_size, input_leng = args.batch_size, 512
+    example_inputs = torch.ones((batch_size, input_leng), dtype=torch.long)
+    print("Batch size = {:d}".format(batch_size))
+    print("The length of input tokens = {:d}".format(input_leng))
     import time
 
-    samples = args.iters * args.batch_size
-    eval_args = LMEvalParser(
-        model="hf",
-        user_model=user_model,
-        tokenizer=tokenizer,
-        batch_size=args.batch_size,
-        tasks=args.tasks,
-        limit=samples,
-        device="cpu",
-    )
-    start = time.time()
-    results = evaluate(eval_args)
-    end = time.time()
-    for task_name in args.tasks.split(","):
-        if task_name == "wikitext":
-            acc = results["results"][task_name]["word_perplexity,none"]
-        else:
-            acc = results["results"][task_name]["acc,none"]
-        print("Accuracy: %.5f" % acc)
-    print('Throughput: %.3f samples/sec' % (samples / (end - start)))
-    print('Latency: %.3f ms' % ((end - start) * 1000 / samples))
-    print('Batch size = %d' % args.batch_size)
+    total_iters = args.iters
+    warmup_iters = 5
+    with torch.no_grad():
+        for i in range(total_iters):
+            if i == warmup_iters:
+                start = time.time()
+            user_model(example_inputs)
+        end = time.time()
+    latency = (end - start) / ((total_iters - warmup_iters) * args.batch_size)
+    throughput = ((total_iters - warmup_iters) * args.batch_size) / (end - start)
+    print("Latency: {:.3f} ms".format(latency * 10**3))
+    print("Throughput: {:.3f} samples/sec".format(throughput))

examples/3.x_api/pytorch/recommendation/dlrm/static_quant/ipex/dlrm_s_pytorch.py

Lines changed: 22 additions & 8 deletions
@@ -394,7 +394,7 @@ def dash_separated_ints(value):
     return value
 
 
-def trace_model(args, dlrm, test_ld, inplace=True):
+def trace_or_load_model(args, dlrm, test_ld, inplace=True):
     dlrm.eval()
     for j, inputBatch in enumerate(test_ld):
         X, lS_o, lS_i, _, _, _ = unpack_batch(inputBatch)
@@ -462,7 +462,7 @@ def inference(
     total_time = 0
     total_iter = 0
     if args.inference_only and trace:
-        dlrm = trace_model(args, dlrm, test_ld)
+        dlrm = trace_or_load_model(args, dlrm, test_ld)
     if args.share_weight_instance != 0:
         run_throughput_benchmark(args, dlrm, test_ld)
     with torch.cpu.amp.autocast(enabled=args.bf16):
@@ -833,11 +833,11 @@ def eval_func(model):
 
     # calibration
     def calib_fn(model):
-        calib_number = 0
+        calib_iter = 0
         for X_test, lS_o_test, lS_i_test, T in train_ld:
-            if calib_number < 100:
+            if calib_iter < 100:
                 model(X_test, lS_o_test, lS_i_test)
-                calib_number += 1
+                calib_iter += 1
             else:
                 break
 
@@ -857,8 +857,22 @@ def calib_fn(model):
         dlrm.save(args.save_model)
         exit(0)
     if args.benchmark:
-        # To do
-        print('Not implemented yet')
+        dlrm = trace_or_load_model(args, dlrm, test_ld, inplace=True)
+        import time
+        X_test, lS_o_test, lS_i_test, T = next(iter(test_ld))
+        total_iters = 100
+        warmup_iters = 5
+        with torch.no_grad():
+            for i in range(total_iters):
+                if i == warmup_iters:
+                    start = time.time()
+                dlrm(X_test, lS_o_test, lS_i_test)
+            end = time.time()
+        latency = (end - start) / ((total_iters - warmup_iters) * args.mini_batch_size)
+        throughput = ((total_iters - warmup_iters) * args.mini_batch_size) / (end - start)
+        print('Batch size = {:d}'.format(args.mini_batch_size))
+        print('Latency: {:.3f} ms'.format(latency * 10**3))
+        print('Throughput: {:.3f} samples/sec'.format(throughput))
         exit(0)
 
     if args.accuracy_only:
@@ -934,7 +948,7 @@ def update_training_performance(time, iters, training_record=training_record):
         training_record[0] += time
         training_record[1] += 1
 
-    def print_training_performance( training_record=training_record):
+    def print_training_performance(training_record=training_record):
         if training_record[0] == 0:
             print("num-batches larger than warm up iters, please increase num-batches or decrease warmup iters")
             exit()

examples/3.x_api/pytorch/recommendation/dlrm/static_quant/ipex/run_benchmark.sh

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ function run_tuning {
         --save-model ${tuned_checkpoint} --test-freq=2048 --print-auc $ARGS \
         --load-model=${input_model} --accuracy_only
     elif [[ ${mode} == "performance" ]]; then
-        python -u $MODEL_SCRIPT \
+        incbench --num_cores_per_instance 4 -u $MODEL_SCRIPT \
         --raw-data-file=${dataset_location}/day --processed-data-file=${dataset_location}/terabyte_processed.npz \
         --data-set=terabyte --benchmark \
         --memory-map --mlperf-bin-loader --round-targets=True --learning-rate=1.0 \
