
fix lm eval test reproducibility issues #1260


Merged: 18 commits into main, May 6, 2025

Conversation

brian-dellabetta (Collaborator) commented Mar 17, 2025

SUMMARY:
lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the fp8 config extensively across different versions of CT, and it always returned the same result.

I also removed the random seed from the configs. After running each of the three configs several times, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway.

Tests take roughly 1h30m to 1h45m to run.
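
For reference, a minimal sketch of the kind of back-to-back repeatability check described above (model, task, and batch size are illustrative placeholders, not this PR's actual configs), using lm-eval's Python API:

```python
# Hypothetical repeatability check: run the same lm-eval task twice and
# assert the scores match. Model and task are illustrative placeholders.
import lm_eval

def eval_once():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
        tasks=["gsm8k"],
        batch_size=8,
    )
    return out["results"]

first = eval_once()
second = eval_once()
assert first == second, "lm-eval results changed between identical runs"
```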

TEST PLAN:
No new source code, just fixing tests.


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@brian-dellabetta brian-dellabetta added the "ready" label (When a PR is ready for review) Mar 17, 2025
dsikka (Collaborator) left a comment


We should look into the error you're seeing for GPTQ.
Were we seeing 0.233 for llava the entire time we've had this test running?

kylesayrs (Collaborator) commented

@brian-dellabetta The down_proj is the most likely to fail Hessian inversion, since it is the weight with the largest input size.

This is likely a normal Hessian invertibility issue, which can be fixed by shuffling the dataset or using an image dataset.
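
For illustration, a minimal sketch of the shuffling suggestion above (dataset name and sample count are placeholders, not from this PR), using the Hugging Face datasets library:

```python
# Hypothetical calibration-set shuffle. A more varied sample order tends to
# make the accumulated Hessian for large-input layers such as down_proj
# better conditioned, so GPTQ's inversion step is less likely to fail.
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # placeholder value

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```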

@brian-dellabetta brian-dellabetta removed the "ready" label Mar 19, 2025
brian-dellabetta (Collaborator, Author) commented

I'm hitting several issues around reproducibility and slowness of lm-eval (each of the 30 samples takes about a minute to run). I will continue this next week after resolving more urgent tasks.
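
As a reference point, lm-eval's limit option is one way to keep a slow run bounded; a sketch below, assuming illustrative model and task names (the actual tests here are multimodal):

```python
# Hypothetical: cap the eval at 30 samples per task. At roughly one minute
# per sample, this still takes on the order of half an hour.
import time
import lm_eval

start = time.time()
out = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["gsm8k"],
    limit=30,  # evaluate only the first 30 samples
)
print(f"done in {time.time() - start:.0f}s")
```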

dsikka pushed a commit that referenced this pull request Mar 21, 2025
SUMMARY:
Multi-modal lm-eval tests are failing due to a non-reproducibility issue that still needs to be resolved. In the meantime, those tests are moved to a skipped folder until resolution.

Resolution can be tracked in #1260.

TEST PLAN:
No new source code.

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 07e5877 to 596f562 on May 1, 2025 19:56
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 019c6bb to 6884ec6 on May 2, 2025 20:32
@brian-dellabetta brian-dellabetta added the "ready" label May 2, 2025
@brian-dellabetta brian-dellabetta changed the title from "fix lm eval test reproducbility issues" to "fix lm eval test reproducibility issues" May 2, 2025
kylesayrs previously approved these changes May 5, 2025
rahul-tuli (Collaborator) left a comment


LGTM pending one typo!

dsikka (Collaborator) left a comment


LGTM. Can you verify with Domenic that there will not be any conflicts with the existing tests that we're now tracking here? https://fantastic-adventure-plymqoj.pages.github.io/timings/lmeval/

I think it should be fine since we use the config name, but just in case.

brian-dellabetta (Collaborator, Author) commented May 6, 2025

> LGTM. Can you verify with Domenic that there will not be any conflicts with the existing tests that we're now tracking here? https://fantastic-adventure-plymqoj.pages.github.io/timings/lmeval/
>
> I think it should be fine since we use the config name, but just in case.

Confirmed with Domenic here; I will ask Rahul to approve and then merge this in.

rahul-tuli (Collaborator) left a comment


LGTM!

@brian-dellabetta brian-dellabetta enabled auto-merge (squash) May 6, 2025 15:30
brian-dellabetta and others added 17 commits May 6, 2025 10:30
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 56c12fa to 823fdd5 on May 6, 2025 15:30
@brian-dellabetta brian-dellabetta merged commit 80155e8 into main May 6, 2025
7 of 8 checks passed
@brian-dellabetta brian-dellabetta deleted the bdellabe/lmeval-multimodal-test-fixes branch May 6, 2025 15:31
shanjiaz pushed a commit that referenced this pull request May 6, 2025
shanjiaz pushed a commit that referenced this pull request May 6, 2025
shanjiaz pushed a commit that referenced this pull request May 7, 2025