Skip to content

Conversation

ArangoGutierrez
Copy link
Collaborator

Fixes #489

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes the "ubi9" suffix from container image tags as part of fixing issue #489. The change simplifies the image tagging strategy by moving away from distribution-specific suffixes to a cleaner tag format.

  • Updated Makefile to use version-only tags instead of version-distribution format
  • Changed default target from "ubi9" to "distroless"
  • Modified CI configuration to remove UBI-specific tag handling

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
deployments/container/Makefile Updated image tagging logic and changed default distribution target
demo/clusters/kind/scripts/common.sh Added TODO comment about removing UBI tag suffix
demo/clusters/gke/install-dra-driver-gpu.sh Added TODO comment about removing UBI tag suffix
.nvidia-ci.yml Removed UBI-specific tag suffix handling from CI pipeline

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines -80 to +82
# IN_IMAGE="${IN_IMAGE_NAME}:${IN_IMAGE_TAG}${TAG_SUFFIX}"
# IN_IMAGE="${IN_IMAGE_NAME}:${IN_IMAGE_TAG}"
# to
# OUT_IMAGE="${OUT_IMAGE_NAME}:${OUT_IMAGE_TAG}${TAG_SUFFIX}"
# OUT_IMAGE="${OUT_IMAGE_NAME}:${OUT_IMAGE_TAG}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this at all anymore? IN_IMAGE seems to be the same as OUT_IMAGE

Copy link
Collaborator

@jgehrcke jgehrcke Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the diff: it's fine I think in this case -- the ${TAG_SUFFIX} concept I think was historically added in hindsight to this .copy-images this job. And this diff I believe effectively reverts that change.

More importantly, I think:

I have advocated before to not use the term "image name" to encode the full location including registry, to me that's the full image spec or even image URL (not just the name).

This is a reminder. Because:

IN_IMAGE seems to be the same as OUT_IMAGE

Good point.

AFAIU, the main purpose of this CI job is to copy an image from one registry to another, retaining the (what I would call) name and tag.

Here, we do encode the registry in the name. That is, IN_IMAGE and OUT_IMAGE differ in precisely (only) the registry. Example: copy from ghcr.io/nvidia/k8s-dra-driver-gpu:06db1a78-ubi9 to <nvidia-internal-registry>/k8s-dra-driver-gpu:06db1a78-ubi9

Copy link
Collaborator

@jgehrcke jgehrcke Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More context: I think .copy-images is somewhat a legacy-ish job, that is self-contained (provides an API that is useful). I think it's smart to not change it too much here. This API is consumed in the pull-images job defined elsewhere in this file:

pull-images:
  extends:
    - .copy-images

.nvidia-ci.yml Outdated
Comment on lines 170 to 166
export SCAN_IMAGE=${IMAGE}${TAG_SUFFIX}
export SCAN_IMAGE=${IMAGE}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a new variable if we are not adding the tag suffix

Comment on lines 241 to 242
rm -f ${VERSION_FILE}
echo "${IN_IMAGE_TAG}-ubi9 ${OUT_IMAGE_TAG}-ubi9" >> ${VERSION_FILE}
echo "${IN_IMAGE_TAG}-ubi9 ${OUT_IMAGE_TAG}" >> ${VERSION_FILE}
echo "${IN_IMAGE_TAG} ${OUT_IMAGE_TAG}" >> ${VERSION_FILE}
cat ${VERSION_FILE}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is in this VERSION_FILE exactly and how is it used? It feels unnecessary to copy both ${IN_IMAGE_TAG} and ${OUT_IMAGE_TAG} in here since they are now identical.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a templated default, see https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/.nvidia-ci.yml#L224-L230. We also have IN and OUT variables, even if they are identical. Removing these should be treated as an independent followup PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussed change looks OK-ish to me for now I think.

What is in this VERSION_FILE exactly and how is it used?

Good question (we should add more code comments here I think): this file is input to internal release tooling (internal ref). More specifically, this CI job is exercising our (team-internal) tooling for automatically submitting a merge request against an NGC publishing config repo. The output of this CI job is literally a merge request that needs to be reviewed. We exercise this CI job for every commit to test as much as we can about the actual release pipeline for every commit to main (without going all the way of actually releasing).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to add more detail: the current left-over line

echo "${IN_IMAGE_TAG} ${OUT_IMAGE_TAG}" >> ${VERSION_FILE}

can be thought of as eventually building an instruction to copy an image from one registry to another.

Hence, this is not happening:

unnecessary to copy both ${IN_IMAGE_TAG} and ${OUT_IMAGE_TAG} in here since they are now identical

(I however empathize with much of this being barely readable / understandable :P).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is in this VERSION_FILE exactly and how is it used? It feels unnecessary to copy both ${IN_IMAGE_TAG} and ${OUT_IMAGE_TAG} in here since they are now identical.

These are not identical. The IN_IMAGE_TAG is the short-commit-ref of the current HEAD and OUT_IMAGE_TAG is the human-readable version that we are releasing.

We are not copying anything here, we are adding a line:

aabbccdd v99.88.00

to a file which is processed by release tooling release the image with tag aabbccdd as v99.88.00 externally.

Comment on lines 17 to 18
REGCTL ?= regctl

BUILDX =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
REGCTL ?= regctl
BUILDX =
REGCTL ?= regctl
BUILDX =

Comment on lines -36 to +44
IMAGE_TAG ?= $(IMAGE_VERSION)-$(DIST)
IMAGE_TAG ?= $(IMAGE_VERSION)
IMAGE = $(IMAGE_NAME):$(IMAGE_TAG)

OUT_IMAGE_NAME ?= $(IMAGE_NAME)
OUT_IMAGE_VERSION ?= $(IMAGE_VERSION)
OUT_IMAGE_TAG = $(OUT_IMAGE_VERSION)-$(DIST)
OUT_IMAGE_TAG = $(OUT_IMAGE_VERSION)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels unnecessary to have both an IMAGE_TAG and an OUT_IMAGE_TAG now that they are identical.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but as commented on the .nvidia.yaml file, we need these IN and OUT variables for our templated GitLab CI pipelines. Modifying this is out of scope for this PR, as it would require a commit/PR specifically targeted at modifying our GitLab Pipeline

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am right now confused where this makefile flow is used at all (need to inspect that).

But generally, as pointed out in another comment here, I believe that these IN_ and OUT_ variables might differ in precisely the registry (and not image name/tag).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that before we had to switch to using the new release flow, we would copy images from our CI registries to NGC using regctl. In this case IMAGE_TAG and OUT_IMAGE_TAG did differ ince the former was derrived from the short-commit reference and the latter was the version that we were releasing. The tag also differed when "retagging" an image to drop the distribution-specific suffix, for example.

We did not properly audit the Makefile after the (somewhat rushed) migration to the new release flow.

Comment on lines 45 to 49
DEFAULT_PUSH_TARGET := ubi9
DISTRIBUTIONS = $(DEFAULT_PUSH_TARGET)
DEFAULT_PUSH_TARGET := distroless
DISTRIBUTIONS := $(DEFAULT_PUSH_TARGET)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind a "default" push target now taht we only have one push target at all? It feels like this can be simplified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simplified the Makefile now that we don't have the need for dynamic targets

Comment on lines 53 to 56
.PHONY: $(DISTRIBUTIONS) $(PUSH_TARGETS) $(BUILD_TARGETS) $(TEST_TARGETS) $(BUILD_TARGETS)
.PHONY: $(DISTRIBUTIONS) $(PUSH_TARGETS) $(BUILD_TARGETS) $(TEST_TARGETS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is BUILD_TARGETS being dropped here? I see that it was listed twice previously and now we are reducing it to only be called once. @elezar was this an oversight or on purpose?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we only have 1 target (no dynamic targets) I have simplified this

Comment on lines 99 to 102
$(DEFAULT_PUSH_TARGET): DIST = $(DEFAULT_PUSH_TARGET)
$(DEFAULT_PUSH_TARGET):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of having a make target here with no dependencies and no rules?

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@klueska klueska added this to the v25.8.0 milestone Aug 25, 2025
@klueska klueska added cleanup ci/testing issue/PR related to CI and/or testing build artifacts task/issue related to build artifacts (cont. images) labels Aug 25, 2025
@klueska
Copy link
Collaborator

klueska commented Aug 25, 2025

I'll leave the rest of the review to @elezar as he's more familiar with the inner working of this and what makes sense to change vs. leave as is, even in the presence of all these redundant variables.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jgehrcke
Copy link
Collaborator

jgehrcke commented Sep 3, 2025

For the record: an important question here is "do we want to replace -ubi9?" (with a different tag). Not discussed here I think, so I want to quickly add that. I think the answer is "no", and to me personally the strongest argument is that we have previously made that decision for the GPU Operator: https://github.com/NVIDIA/gpu-operator/pkgs/container/gpu-operator

There, we do something like this:
image

The image without the suffix is a multi-arch image, example: https://github.com/NVIDIA/gpu-operator/pkgs/container/gpu-operator/503315840?tag=c65b8bee.

I believe for multi-arch images to work, we do not need to explicitly push the per-architecture tags.

@jgehrcke
Copy link
Collaborator

jgehrcke commented Sep 8, 2025

@ArangoGutierrez could you review this again today, and after all discussion double-check if this is in a state that is mergeable from your point of view (that might be the current state)?

IMO, it is OK to leave some (rarely or never exercised) local dev workflows behind knowingly broken, we can fix them out-of-band. The focus for this patch should be on what's exercised by CI.

Evan says he can maybe add feedback ~tomorrow, we can incorporate his feedback also for follow-up patches.

We need to land a pragmatic version ASAP. Reason for mild urgency: the earlier we try to land this, the earlier we see what we have forgotten to think about. We want to cut an RC release from the main branch any day now, and this change should be part of it.


ifneq ($(BUILD_MULTI_ARCH_IMAGES),true)
include $(CURDIR)/deployments/container/native-only.mk
else
include $(CURDIR)/deployments/container/multi-arch.mk
endif

# For the default push target we also push a short tag equal to the version.
# For the main push target we also push a short tag equal to the version.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of the next iterations on this, we should reword this comment:

we also push a short tag equal to the version

I read this and wonder: What is a "short tag"? Where do we (want to) push and when?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this should be a followup TODO

Copy link
Member

@elezar elezar Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have a distribution suffix, then we don't have a short tag. I would argue that we could remove all "push" logic (with the exception of push=true when building) from this makefile to simplify it significantly.


# Use a generic build target to build the relevant images
$(IMAGE_TARGETS): image-%:
# Build the container image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: this is also the target we use in CI to push (using push=true in the DOCKER_BUILD_OPTIONS).

None of the other push-related targets in this Makefile are used in CI, and I have personally so far never used them locally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The push targets are a remnant of earlier release processes that have not yet been removed.

Copy link
Collaborator

@jgehrcke jgehrcke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

This patch was sitting too long, and we want to make progress to get the new tag structure in place for the first 25.8.0 release candidate. Hence, I want to merge this now and deal with any potential fallout asap post-merge.

Considerations:

  • To focus on the most important aspects of this patch, we need post-merge signal (does this actually work when it runs on main?).
  • We can take more time in the future to review which Makefile targets are really still needed.
  • We should review & fix periphery scripts in independent work.

@jgehrcke jgehrcke merged commit 9569c5b into NVIDIA:main Sep 9, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build artifacts task/issue related to build artifacts (cont. images) ci/testing issue/PR related to CI and/or testing cleanup
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

images: remove -ubi9 suffix?
4 participants