Gaudi: add CI #3160
Conversation
Force-pushed from dd187d2 to 119bdbd
I’ll wait for the Gaudi integration test CI to pass before merging anything: the previous run was green, which gives me confidence in the current changes. Unfortunately, it can take days to get assigned a Gaudi1 runner 😭, so I figured I could start iterating on your reviews in the meantime rather than wait for the CI to finish before requesting feedback. In any case, I’ll only merge once the Gaudi integration test passes in the CI as well.
LGTM!
We should soon have access to Gaudi2 and Gaudi3 ephemeral runners on demand, which will make things much easier than waiting for a DL1 instance. I suggest we wait for this to be available before updating and merging this PR.
OK, I will wait for the new runners before adding Gaudi to the CI, as the DL1 runners are indeed super unreliable.
LGTM
Force-pushed from 7bfba23 to 29b9c32
The runners for Gaudi are ready! 🙌 Thanks @regisss. Just requesting some new reviews to make sure everything is still okay. Since the last review I rebased on main and switched to the new runners. The integration tests are now passing and the runners are super fast! https://github.com/huggingface/text-generation-inference/actions/runs/15160963395/job/42627380206?pr=3160
.github/workflows/build.yaml (outdated)

```diff
@@ -129,9 +129,9 @@ jobs:
         export label_extension="-gaudi"
         export docker_volume="/mnt/cache"
         export docker_devices=""
-        export runs_on="ubuntu-latest"
+        export runs_on="itac-bm-emr-gaudi3-dell-1gaudi"
```
All tests are going to pass with only 1 device? Big (i.e. 70B+ parameter) models are not tested?
Indeed, I disabled big models for testing and only kept a small model for faster iteration. I just re-enabled a multi-card test and it is broken 😬. There seems to be a regression between the original PR and the latest TGI backend, so I am looking into it 👀. The error also differs depending on the hardware (Gaudi1 vs Gaudi3) 😣.
@baptistecolle A couple of questions:
An additional useful remark: you also need to add the new config with
Force-pushed from c3241f4 to 55cdfbf
Force-pushed from 55cdfbf to 0295bf2
I just left a couple of comments regarding the runner size.
I'll add more tests later anyway for Llama4 and R1 on 8 devices.
Do you know if there is a nightly CI in TGI?
```diff
         export platform=""
-        export extra_pytest=""
+        export extra_pytest="--gaudi"
```
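(For context, a minimal sketch of how a variable like `extra_pytest` is typically consumed, assuming the workflow forwards it as extra arguments to the pytest invocation; the actual call site in build.yaml is not shown in this hunk.)

```bash
# Hypothetical illustration, not the actual build.yaml step: assumes
# extra_pytest is appended to the pytest command that runs the
# integration tests, and that the test directory is "integration-tests".
export extra_pytest="--gaudi"
pytest integration-tests ${extra_pytest}
```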
This will run the models with `"run_by_default": True` in PRs, right? If yes, I think we should change the runner above from `itac-bm-emr-gaudi3-dell-8gaudi` to `itac-bm-emr-gaudi3-dell-2gaudi` so that we test Llama 8B on a single device and on 2 devices.
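(If that suggestion is adopted, the Gaudi case in the workflow's setup step might look roughly like this. This is a sketch assembled from the hunks shown above; the 2-gaudi runner name comes from this comment, not from the actual diff.)

```bash
# Sketch only: Gaudi case of the CI setup step after switching to a
# 2-device runner, enough to cover both the 1-card and the 2-card
# (sharded) Llama 3.1 8B tests.
export label_extension="-gaudi"
export docker_volume="/mnt/cache"
export docker_devices=""
export runs_on="itac-bm-emr-gaudi3-dell-2gaudi"
export platform=""
export extra_pytest="--gaudi"
```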
Yes, that is correct: `"run_by_default": True` makes the test run in the CI. The CI runs `make -C backends/gaudi run-integration-tests`, which executes all the `"run_by_default": True` tests. There is also `make -C backends/gaudi run-integration-tests-with-all-models`, which runs every model config defined in the test cases. This is useful when doing a big refactoring and checking that everything still works as expected.

I updated the CI to a smaller runner so that we test on 1 Gaudi card and on 2 Gaudi cards to exercise the model sharding logic. To answer your question
LGTM!
What does this PR do?
This PR adds CI support for the Gaudi backend. It includes an integration test that starts the model "meta-llama/Llama-3.1-8B-Instruct", performs a few requests, and verifies that the outputs match the expected results.
Additional models are also supported, but running tests for all of them is quite slow, so they are not included in the CI by default. However, instructions on how to run the integration tests for all supported models have been added to the Gaudi backend README.
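For reference, a short sketch of the two test entry points mentioned in the conversation above (the exact Makefile targets live under backends/gaudi; treat this as a usage reminder rather than the README itself):

```bash
# Default CI run: only the test cases marked with "run_by_default": True
# (currently a small model such as meta-llama/Llama-3.1-8B-Instruct).
make -C backends/gaudi run-integration-tests

# Full run: every model config defined in the test cases.
# Much slower; useful when validating a large refactoring.
make -C backends/gaudi run-integration-tests-with-all-models
```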