Gaudi: add CI #3160
Conversation
Force-pushed from dd187d2 to 119bdbd
I’ll wait for the Gaudi integration test CI to pass before merging anything: the previous run was green, which gives me confidence in the current changes. Unfortunately, it can take days to get assigned a Gaudi1 runner 😭, so I figured I could start iterating on your reviews in the meantime rather than wait for the CI to finish before requesting feedback. In any case, I’ll only merge once the Gaudi integration test passes in the CI as well.
LGTM!
We should soon have access to Gaudi2 and Gaudi3 ephemeral runners on demand, which will make things much easier than waiting for a DL1 instance. I suggest we wait for this to be available before updating and merging this PR.
OK, I will wait for the new runners before adding Gaudi to the CI, as the DL1 runners are indeed super unreliable.
LGTM
Force-pushed from 7bfba23 to 29b9c32
The runners for Gaudi are ready! 🙌 Thanks @regisss. Just requesting some new reviews to make sure everything is still okay. Since the last review I rebased on main and switched to the new runners. The integration tests are now passing and the runners are super fast! https://github.com/huggingface/text-generation-inference/actions/runs/15160963395/job/42627380206?pr=3160
.github/workflows/build.yaml (outdated)

```diff
@@ -129,9 +129,9 @@ jobs:
         export label_extension="-gaudi"
         export docker_volume="/mnt/cache"
         export docker_devices=""
-        export runs_on="ubuntu-latest"
+        export runs_on="itac-bm-emr-gaudi3-dell-1gaudi"
```
All tests are going to pass with only 1 device? Big (i.e. 70B+ parameter) models are not tested?
Indeed, I disabled big models for testing and only kept a small model for faster iteration. I just re-enabled a multi-card test and it is broken 😬. There seems to be a regression between the original PR and the latest TGI backend, so I am looking into it 👀. The error also differs depending on the hardware (Gaudi1 vs Gaudi3) 😣.
@baptistecolle A couple of questions:
An additional useful remark: you also need to add the new config with
Force-pushed from c3241f4 to 55cdfbf
Force-pushed from 55cdfbf to 0295bf2
I just left a couple of comments regarding the runner size.
I'll add more tests later anyway for Llama4 and R1 on 8 devices.
Do you know if there is a nightly CI in TGI?
```diff
         export platform=""
-        export extra_pytest=""
+        export extra_pytest="--gaudi"
```
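(For context, a minimal sketch of how a variable like `extra_pytest` is typically consumed, assuming the workflow forwards it as extra arguments to the pytest invocation; the actual call site in build.yaml is not shown in this hunk.)

```bash
# Hypothetical illustration, not the actual build.yaml step: assumes
# extra_pytest is appended to the pytest command that runs the
# integration tests, and that the test directory is "integration-tests".
export extra_pytest="--gaudi"
pytest integration-tests ${extra_pytest}
```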
This will run the models with `"run_by_default": True` in PRs, right? If yes, I think we should change the runner above from `itac-bm-emr-gaudi3-dell-8gaudi` to `itac-bm-emr-gaudi3-dell-2gaudi` so that we test Llama 8B on a single device and on 2 devices.
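(If that suggestion is adopted, the Gaudi case in the workflow's setup step might look roughly like this. This is a sketch assembled from the hunks shown above; the 2-gaudi runner name comes from this comment, not from the actual diff.)

```bash
# Sketch only: Gaudi case of the CI setup step after switching to a
# 2-device runner, enough to cover both the 1-card and the 2-card
# (sharded) Llama 3.1 8B tests.
export label_extension="-gaudi"
export docker_volume="/mnt/cache"
export docker_devices=""
export runs_on="itac-bm-emr-gaudi3-dell-2gaudi"
export platform=""
export extra_pytest="--gaudi"
```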
Yes, that is correct: `"run_by_default": True` makes the test run in the CI. The CI runs `make -C backends/gaudi run-integration-tests`, which executes all the `"run_by_default": True` tests. There is also `make -C backends/gaudi run-integration-tests-with-all-models`, which runs every model config defined in the test cases. This is useful when doing a big refactoring and checking that everything still works as expected.

I updated the CI to a smaller runner so that we test on 1 Gaudi card and on 2 Gaudi cards to exercise the model sharding logic. To answer your question
LGTM!
What does this PR do?
This PR adds CI support for the Gaudi backend. It includes an integration test that starts the model "meta-llama/Llama-3.1-8B-Instruct", performs a few requests, and verifies that the outputs match the expected results.
Additional models are also supported, but running tests for all of them is quite slow, so they are not included in the CI by default. However, instructions on how to run the integration tests for all supported models have been added to the Gaudi backend README.
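For reference, a short sketch of the two test entry points mentioned in the conversation above (the exact Makefile targets live under backends/gaudi; treat this as a usage reminder rather than the README itself):

```bash
# Default CI run: only the test cases marked with "run_by_default": True
# (currently a small model such as meta-llama/Llama-3.1-8B-Instruct).
make -C backends/gaudi run-integration-tests

# Full run: every model config defined in the test cases.
# Much slower; useful when validating a large refactoring.
make -C backends/gaudi run-integration-tests-with-all-models
```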