fix lm eval test reproducibility issues #1260
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
We should look into the error you're seeing for GPTQ.
We were seeing 0.233 for llava the entire time we've had this test running?
@brian-dellabetta The down_proj is the most likely to fail Hessian inversion, since it is the weight with the largest input size. This is likely a normal Hessian invertibility issue, which can be fixed by shuffling the dataset or using an image dataset.
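For background on that comment: GPTQ-style solvers invert a Hessian accumulated from each layer's inputs, and a wider input dimension (as with down_proj) makes that matrix more likely to be ill-conditioned; dampening the diagonal is the usual mitigation, and the recipe filename later in this thread (dampfrac1e-1) suggests a dampening fraction of 0.1 was tried. A minimal sketch of the idea, not llm-compressor's actual GPTQ implementation; the dimensions and dampening fraction are illustrative:

```python
import torch

def accumulate_hessian(xs: list[torch.Tensor]) -> torch.Tensor:
    """H = sum of x x^T over calibration activations for a layer with input dim d."""
    d = xs[0].shape[-1]
    H = torch.zeros(d, d, dtype=torch.float64)
    for x in xs:
        x = x.reshape(-1, d).to(torch.float64)
        H += x.T @ x
    return H

def dampened_inverse(H: torch.Tensor, dampening_frac: float = 0.01) -> torch.Tensor:
    """Add a fraction of the mean diagonal before inverting, so near-singular
    Hessians (common for wide inputs like down_proj) stay positive definite."""
    damp = dampening_frac * torch.mean(torch.diag(H))
    H = H + damp * torch.eye(H.shape[0], dtype=H.dtype)
    # Cholesky-based inverse; raises if H is still not positive definite.
    L = torch.linalg.cholesky(H)
    return torch.cholesky_inverse(L)

# Example: 16 random calibration activations for a layer with input dim 512.
xs = [torch.randn(4, 512) for _ in range(16)]
H_inv = dampened_inverse(accumulate_hessian(xs), dampening_frac=0.1)
```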
Hitting several issues around reproducibility and slowness of lm-eval (each of the 30 samples takes about a minute to run). Will continue this next week after resolving more urgent tasks.
SUMMARY:
Multi-modal lm-eval tests are failing due to a non-reproducibility issue that still needs to be resolved. In the meantime, moving those tests to a skipped folder until resolution. Resolution can be tracked in #1260.
TEST PLAN:
No new source code.
Signed-off-by: Brian Dellabetta <[email protected]>
Force-pushed 07e5877 to 596f562
Force-pushed 019c6bb to 6884ec6
tests/e2e/vLLM/recipes/actorder/recipe_w4a16_actorder_weight_dampfrac1e-1.yaml
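For context, that recipe filename suggests a W4A16 weight-actorder scheme with dampening_frac=1e-1. In llm-compressor's Python API, the analogous modifier would look roughly like this; a hypothetical sketch inferred from the filename, not the recipe's actual contents:

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# Values guessed from the filename (W4A16, dampening_frac=1e-1).
# actorder="weight" would be set within the quantization scheme / config
# groups in the YAML recipe; exact field placement varies by version.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)
# The recipe object would then be passed to llm-compressor's oneshot entrypoint.
```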
LGTM pending one typo!
LGTM. Can you verify with Domenic that there will not be any conflicts with the existing tests that we're now tracking here? https://fantastic-adventure-plymqoj.pages.github.io/timings/lmeval/
I think it should be fine since we use the config name, but just in case.
Confirmed with Domenic here; I will ask Rahul to approve and then merge this in.
Signed-off-by: Brian Dellabetta <[email protected]>
Force-pushed 56c12fa to 823fdd5
SUMMARY:
lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the fp8 config extensively across different versions of CT, and it always returned the same result.
I also removed the random seed from the configs. After several runs of each of the three configs, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway.
Tests take roughly 1h30m to 1h45m to run.
TEST PLAN:
No new source code, just fixing tests.
Signed-off-by: Brian Dellabetta <[email protected]>
Signed-off-by: shanjiaz <[email protected]>
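A reproducibility spot-check along these lines could look like the following. This is a minimal sketch using the lm-eval harness API; the task, model path, metric key, and sample limit are placeholders, not the PR's actual test config:

```python
import lm_eval

def eval_once(model_path: str) -> float:
    """Run a small lm-eval job and return the metric of interest."""
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=["gsm8k"],
        limit=30,      # the thread mentions ~30 samples per run
        batch_size=8,
        # simple_evaluate also exposes random_seed / numpy_random_seed /
        # torch_random_seed arguments, the kind of pinning the PR removed
        # from its configs.
    )
    # Metric key may vary by task and lm-eval version.
    return results["results"]["gsm8k"]["exact_match,strict-match"]

# Two back-to-back runs should agree if the eval is deterministic.
first = eval_once("path/to/quantized-model")   # placeholder path
second = eval_once("path/to/quantized-model")
assert first == second, f"non-reproducible: {first} vs {second}"
```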