
Commit 59f8422

brian-dellabetta and shanjiaz authored and committed
fix lm eval test reproducibility issues (#1260)
SUMMARY: The lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the FP8 config extensively across different versions of CT, and it always returned the same result. I also removed the random seed from the configs; after several runs of each of the three configs, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway. Tests take roughly 1h30m to 1h45m to run.

TEST PLAN: No new src code, just fixing tests.

---------

Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: shanjiaz <[email protected]>
1 parent ffa570c commit 59f8422

File tree

5 files changed: +54 −35 lines

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+cadence: weekly
+model: Qwen/Qwen2.5-VL-7B-Instruct
+model_class: TraceableQwen2_5_VLForConditionalGeneration
+scheme: FP8_DYNAMIC
+lmeval:
+  model: "hf-multimodal"
+  model_args:
+    dtype: bfloat16
+    add_bos_token: True
+    convert_img_format: True
+  task: mmmu_val_literature
+  num_fewshot: 0
+  batch_size: 8
+  # dense model achieves accuracy of 0.9 +/ 0.0557
+  metrics:
+    acc,none: 0.8667
+    acc_stderr,none: 0.0557
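As an aside on these config keys: `acc_stderr,none` pairs with `acc,none` by lm-eval's metric-naming convention, so the test in tests/lmeval/test_lmeval.py derives the stderr key from its metric key with a simple string substitution, mirrored here:

```python
# Derive the stderr key from a metric key, as the updated test does.
metric_key = "acc,none"
stderr_key = metric_key.replace(",", "_stderr,")
print(stderr_key)  # acc_stderr,none
```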
@@ -1,19 +1,20 @@
 cadence: "weekly"
-model: llava-hf/llava-1.5-7b-hf
-model_class: TraceableLlavaForConditionalGeneration
+model: Qwen/Qwen2.5-VL-7B-Instruct
+model_class: TraceableQwen2_5_VLForConditionalGeneration
 scheme: INT8_dyn_per_token
 recipe: tests/e2e/vLLM/recipes/INT8/recipe_int8_channel_weight_dynamic_per_token.yaml
 dataset_id: lmms-lab/flickr30k
 dataset_split: "test[:512]"
-seed: 42 #compressed model is sensitive to random seed
 lmeval:
   model: "hf-multimodal"
   model_args:
     dtype: bfloat16
     add_bos_token: True
     convert_img_format: True
-  task: mmmu_val_economics
+  task: mmmu_val_literature
   num_fewshot: 0
+  batch_size: 8
+  # dense model achieves accuracy of 0.9 +/ 0.0557
   metrics:
-    acc,none: 0.233
-    batch_size: 8
+    acc,none: 0.833
+    acc_stderr,none: 0.0557
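These configs now carry an explicit stderr so the test can use it as an absolute tolerance rather than the previous 5% relative tolerance. A minimal sketch of the difference, using the values from the config above (the `actual` score is hypothetical):

```python
import numpy

expected, stderr = 0.833, 0.0557  # values from the config above
actual = 0.88                     # hypothetical measured accuracy

# numpy.isclose checks |a - b| <= atol + rtol * |b|, scaling rtol by the
# second argument, so the relative band here is 0.05 * 0.88 = 0.044.
print(numpy.isclose(expected, actual, rtol=0.05))    # |0.833 - 0.88| = 0.047 > 0.044 -> False
print(numpy.isclose(expected, actual, atol=stderr))  # 0.047 <= 0.0557 -> True
```

With these numbers, the stderr-based band is wider than the 5% relative band, so a score within one stderr of the expected value passes even when the old relative check would not.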
@@ -1,19 +1,20 @@
 cadence: "weekly"
-model: Qwen/Qwen2-VL-2B-Instruct
-model_class: TraceableQwen2VLForConditionalGeneration
+model: Qwen/Qwen2.5-VL-7B-Instruct
+model_class: TraceableQwen2_5_VLForConditionalGeneration
+scheme: W4A16_actorder_weight
 recipe: tests/e2e/vLLM/recipes/actorder/recipe_w4a16_actorder_weight.yaml
 dataset_id: lmms-lab/flickr30k
 dataset_split: "test[:512]"
-scheme: W4A16_actorder_group
-seed: 42 #compressed model is sensitive to random seed
 lmeval:
   model: "hf-multimodal"
   model_args:
     dtype: bfloat16
     add_bos_token: True
     convert_img_format: True
-  task: mmmu_val_economics
+  task: mmmu_val_literature
   num_fewshot: 0
+  batch_size: 8
+  # dense model achieves accuracy of 0.9 +/ 0.0557
   metrics:
-    acc,none: 0.366
-    batch_size: 4
+    acc,none: 0.8333
+    acc_stderr,none: 0.0557

tests/lmeval/skipped_configs/vl_fp8_dynamic_per_token.yaml

Lines changed: 0 additions & 16 deletions
This file was deleted.

tests/lmeval/test_lmeval.py

Lines changed: 22 additions & 6 deletions
@@ -155,12 +155,28 @@ def _run_lm_eval(self):
         )

         metrics = results["results"][self.lmeval.task]
-        for metric, expected_val in self.lmeval.metrics.items():
-            actual_val = metrics.get(metric)
-            logger.info(
-                f"Comparing {metric}: Expected {expected_val}, Got {actual_val}"
-            )
-            assert numpy.isclose(expected_val, actual_val, rtol=0.05)
+        for metric_key, expected_val in self.lmeval.metrics.items():
+            # stderr metrics are only used as absolute tolerance
+            # checks for actual values
+            if "stderr" in metric_key:
+                continue
+            actual_val = metrics.get(metric_key)
+            # If stderr is provided, use it as absolute tolerance
+            # Otherwise, default to a 5% relative tolerance
+            stderr_key = metric_key.replace(",", "_stderr,")
+            std_err = self.lmeval.metrics.get(stderr_key)
+            if std_err is None:
+                logger.info(
+                    f"Comparing {metric_key}: Expected {expected_val} "
+                    f"±5%, Got {actual_val}"
+                )
+                assert numpy.isclose(expected_val, actual_val, rtol=0.05)
+            else:
+                logger.info(
+                    f"Comparing {metric_key}: Expected {expected_val} "
+                    f"±{std_err*100}%, Got {actual_val}"
+                )
+                assert numpy.isclose(expected_val, actual_val, atol=std_err)

     def tear_down(self):
         timer = get_singleton_manager()
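The tolerance logic in the diff above can be exercised on its own; here is a minimal standalone sketch (the helper name `check_metric` is mine, not part of the test suite):

```python
import numpy

def check_metric(metrics_cfg, actual_metrics, metric_key):
    """Compare one expected metric against the measured value.

    If the config supplies a matching ``*_stderr`` entry, use it as an
    absolute tolerance; otherwise fall back to a 5% relative tolerance.
    """
    expected_val = metrics_cfg[metric_key]
    actual_val = actual_metrics[metric_key]
    # "acc,none" -> "acc_stderr,none", matching lm-eval's key naming
    stderr_key = metric_key.replace(",", "_stderr,")
    std_err = metrics_cfg.get(stderr_key)
    if std_err is None:
        return bool(numpy.isclose(expected_val, actual_val, rtol=0.05))
    return bool(numpy.isclose(expected_val, actual_val, atol=std_err))

# Within one stderr of the expected value passes; outside it fails.
cfg = {"acc,none": 0.8667, "acc_stderr,none": 0.0557}
print(check_metric(cfg, {"acc,none": 0.83}, "acc,none"))  # True
print(check_metric(cfg, {"acc,none": 0.75}, "acc,none"))  # False
```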
