
fix lm eval test reproducibility issues #1260


Merged: 18 commits into main, May 6, 2025

Conversation

brian-dellabetta (Collaborator) commented Mar 17, 2025

SUMMARY:
lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the fp8 config extensively across different versions of CT, and it always returned the same result.

I also removed the random seed from the configs. After running each of the three configs several times, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway.

Tests take roughly 1h30m to 1h45m to run.
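
For reference, a minimal sketch of the kind of back-to-back repeatability check described above (model, task, and batch size are illustrative placeholders, not this PR's actual configs), using lm-eval's Python API:

```python
# Hypothetical repeatability check: run the same lm-eval task twice and
# assert the scores match. Model and task are illustrative placeholders.
import lm_eval

def eval_once():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
        tasks=["gsm8k"],
        batch_size=8,
    )
    return out["results"]

first = eval_once()
second = eval_once()
assert first == second, "lm-eval results changed between identical runs"
```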

TEST PLAN:
No new source code, just fixing tests.


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@brian-dellabetta brian-dellabetta added the "ready" label (When a PR is ready for review) Mar 17, 2025
dsikka (Collaborator) left a comment


We should look into the error you're seeing for GPTQ.
Were we seeing 0.233 for llava the entire time we've had this test running?

kylesayrs (Collaborator) commented

@brian-dellabetta The down_proj is the most likely to fail Hessian inversion, since it is the weight with the largest input size.

This is likely a normal Hessian invertibility issue, which can be fixed by shuffling the dataset or using an image dataset.
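
For illustration, a minimal sketch of the shuffling suggestion above (dataset name and sample count are placeholders, not from this PR), using the Hugging Face datasets library:

```python
# Hypothetical calibration-set shuffle. A more varied sample order tends to
# make the accumulated Hessian for large-input layers such as down_proj
# better conditioned, so GPTQ's inversion step is less likely to fail.
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512  # placeholder value

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```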

@brian-dellabetta brian-dellabetta removed the "ready" label Mar 19, 2025
brian-dellabetta (Collaborator, Author) commented

I'm hitting several issues around reproducibility and slowness of lm-eval (each of the 30 samples takes about a minute to run). I will continue this next week after resolving more urgent tasks.
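
As a reference point, lm-eval's limit option is one way to keep a slow run bounded; a sketch below, assuming illustrative model and task names (the actual tests here are multimodal):

```python
# Hypothetical: cap the eval at 30 samples per task. At roughly one minute
# per sample, this still takes on the order of half an hour.
import time
import lm_eval

start = time.time()
out = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["gsm8k"],
    limit=30,  # evaluate only the first 30 samples
)
print(f"done in {time.time() - start:.0f}s")
```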

dsikka pushed a commit that referenced this pull request Mar 21, 2025
SUMMARY:
Multi-modal lm-eval tests are failing due to a non-reproducibility issue that still needs to be resolved. In the meantime, those tests are moved to a skipped folder until resolution.

Resolution can be tracked in #1260.

TEST PLAN:
No new source code.

Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 07e5877 to 596f562 on May 1, 2025 19:56
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 019c6bb to 6884ec6 on May 2, 2025 20:32
@brian-dellabetta brian-dellabetta added the "ready" label May 2, 2025
@brian-dellabetta brian-dellabetta changed the title from "fix lm eval test reproducbility issues" to "fix lm eval test reproducibility issues" May 2, 2025
kylesayrs previously approved these changes May 5, 2025
rahul-tuli (Collaborator) left a comment


LGTM pending one typo!

dsikka (Collaborator) left a comment


LGTM. Can you verify with Domenic that there will not be any conflicts with the existing tests that we're now tracking here? https://fantastic-adventure-plymqoj.pages.github.io/timings/lmeval/

I think it should be fine since we use the config name, but just in case.

brian-dellabetta (Collaborator, Author) commented May 6, 2025

> LGTM. Can you verify with Domenic that there will not be any conflicts with the existing tests that we're now tracking here? https://fantastic-adventure-plymqoj.pages.github.io/timings/lmeval/
>
> I think it should be fine since we use the config name, but just in case.

Confirmed with Domenic here; I will ask Rahul to approve and then merge this in.

rahul-tuli (Collaborator) left a comment


LGTM!

@brian-dellabetta brian-dellabetta enabled auto-merge (squash) May 6, 2025 15:30
brian-dellabetta and others added 17 commits May 6, 2025 10:30
Signed-off-by: Brian Dellabetta <[email protected]>
@brian-dellabetta brian-dellabetta force-pushed the bdellabe/lmeval-multimodal-test-fixes branch from 56c12fa to 823fdd5 on May 6, 2025 15:30
@brian-dellabetta brian-dellabetta merged commit 80155e8 into main May 6, 2025
7 of 8 checks passed
@brian-dellabetta brian-dellabetta deleted the bdellabe/lmeval-multimodal-test-fixes branch May 6, 2025 15:31
shanjiaz pushed a commit that referenced this pull request May 6, 2025
shanjiaz pushed a commit that referenced this pull request May 6, 2025
shanjiaz pushed a commit that referenced this pull request May 7, 2025