Skip to content

Conversation

@ethan-gallant
Copy link

@ethan-gallant ethan-gallant commented Jul 16, 2025

What this PR does / why we need it:
It enables removing the ExternallyProvisioned attribute, this can be useful if you want to transition hosts from an externally provisioned source into being directly managed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2465

Signed-off-by: Ethan J. Gallant [email protected]

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign elfosardo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 16, 2025
@metal3-io-bot
Copy link
Contributor

Hi @ethan-gallant. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@metal3-io-bot metal3-io-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 16, 2025
Copy link
Member

@Rozzii Rozzii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this topic would need some community consensus at least a lazy one. I don't remember at the moment why was it decided to block an externally provisioned node to transition to the regular provisioned state.

In principle I don't see why we shouldn't allow this transition but I am not using this feature on a daily basis so I am interested in the opinions of those who do might have more experience with the use-case, they might know what would be the side effects of allowing the proposed state transition.

// if the credentials change and the Host must be re-registered.
if hsm.Host.Spec.ExternallyProvisioned {
hsm.NextState = metal3api.StateExternallyProvisioned
} else if hsm.Host.Status.Provisioning.State == metal3api.StateExternallyProvisioned {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this implementation move all externally provisioned BMH to "regular provisioned" state even if the user would add an unrelated annotation or the reconciliation would be triggered for whatever other reason?

It might be just a theoretical scenario but I would like to have an an additionl condition that checks that the "ExtarnallyProvisioned" filed is really set to false.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this lands in an else block after the check for externally provisioned I don't believe that would be possible here.

Happy to reformat it as well into a switch statement or similar as the conditional branches here are getting a bit wild already before the change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this does anything at all, because in handleRegistering() the hsm.Host.Status.Provisioning.State is, by definition, StateRegistering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this lands in an else block after the check for externally provisioned I don't believe that would be possible here.

Happy to reformat it as well into a switch statement or similar as the conditional branches here are getting a bit wild already before the change.

That is true , silly me :D .

But then there is the other thing zaneb pointed out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original bug is about the incorrect state in Ironic: once we get to Inspecting, BMO does not expect to find the Node in the active state in Ironic, but that's what actually happens. BMO needs to deprovision the host first. Then make sure that inspection still happen.

Comment on lines 84 to 85
Expect(bmh.Status.OperationHistory.Inspect.Start.IsZero()).To(BeTrue())
Expect(bmh.Status.OperationHistory.Provision.Start.IsZero()).To(BeTrue())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure these two conditions mean that the BMH is still in registering as they would never be true in case of an externally provisioned BMH at least that would be my expectation.

I assume you have run this test locally but I wonder how long does the BMH stay in "Registering" state that your implementation relies on.

How do you ensure that the BMH stays in registering state to allow the test code to do the bmh.Spec.ExternallyProvisioned field update? I am not a user of the externally provisioned feature so I might be missing something but I would have suspected the BMH to very quickly transition out of "Registering" and If that is true this test would be very prone to timing issues.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a manual test however I had some issues getting the e2e tests running locally with make test-e2e as the e2e image from quay.io could not be loaded.

I'll try again today as it seems that the tests did not align with what I saw manually testing. If you have any guidance that I might have missed getting the E2E working locally (on Linux / MacOS) that would be greatly appreciated.

@Rozzii
Copy link
Member

Rozzii commented Jul 17, 2025

/ok-to-test
/cc @dtantsur @zaneb

@metal3-io-bot metal3-io-bot requested review from dtantsur and zaneb July 17, 2025 06:35
@metal3-io-bot metal3-io-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 17, 2025
@Rozzii Rozzii changed the title 🐛 Enable removal of ExternallyProvisioned attribute from BareMetalHosts ⚠️ Enable removal of ExternallyProvisioned attribute from BareMetalHosts Jul 17, 2025
@Rozzii Rozzii moved this to BMO WIP in Metal3 - Roadmap Jul 17, 2025
@Rozzii Rozzii added this to the BMO - v0.11 milestone Jul 17, 2025
@Rozzii
Copy link
Member

Rozzii commented Jul 17, 2025

I have changed this from bug to breaking change, and IMO this is a feature not a bug as blocking the transition between externally and regularly provisioned state was intentional and documented.
/kind feature

@metal3-io-bot metal3-io-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 17, 2025
Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be useful if you want to transition hosts from an externally provisioned source into being directly managed.

I was going to say that the question is why you need to do that.
An externally provisioned host is still fully managed in terms of power state.
The other thing you can do with a provisioned host is deprovision it... but that's what happens when you remove the externallyProvisioned flag anyway.

However, I see that somebody has prevented you from removing the flag in the webhook. That is a real bug in my opinion. Otherwise (e.g. with the webhook disabled) it already works correctly and as-documented as far as I know.

// if the credentials change and the Host must be re-registered.
if hsm.Host.Spec.ExternallyProvisioned {
hsm.NextState = metal3api.StateExternallyProvisioned
} else if hsm.Host.Status.Provisioning.State == metal3api.StateExternallyProvisioned {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this does anything at all, because in handleRegistering() the hsm.Host.Status.Provisioning.State is, by definition, StateRegistering.

errs = append(errs, errors.New("bootMACAddress can not be changed once it is set"))
}

if oldObj.Spec.ExternallyProvisioned != newObj.Spec.ExternallyProvisioned {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid bug fix.
According to the state diagram, it is valid to remove the externallyProvisioned flag: https://github.com/metal3-io/baremetal-operator/blob/main/docs/baremetalhost-states.md
I have no idea why the webhook is preventing it.

Copy link
Author

@ethan-gallant ethan-gallant Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linked issue gives a bit more background, the webhook seems to be a bandaid fix to prevent the host from getting stuck in the provisioning state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To echo what I said in the bug: if removing externallyProvisioned unconditionally deprovisions the host, we need to make sure that image and customDeploy are not set.

Or we need to permit them to be set and keep the host Provisioned.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I may have misunderstood the issue and this will need a bit of additional work.

Just to confirm (since I was not at the working group meetings) the desired behavior is:

  • If the Image & customDeploy are set
    • Move the host to the Provisioned state since the host complies with the standard setup, it was just (as expected) externallyProvisioned
  • If the host does not have the image & customDeploy fields set
    • Deprovision the host since the host can not be managed in the standard fashion

Overall I'm wondering if the approach originally for the externallyProvisioned would have been to follow the cert-manager pattern and had a ProvisionRequest CR where some external system could fulfill the provision instead of hacking around the BareMetalHost CR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm wondering if the approach originally for the externallyProvisioned would have been to follow the cert-manager pattern and had a ProvisionRequest CR where some external system could fulfill the provision instead of hacking around the BareMetalHost CR.

Just to reflect on this point, this would be a nice approach but in practice the system that has done the provisioning might not be cloud native or not even automated at all so a ProvisionRequest CR would provide some optional extra functionality, but it would still not cover all the use cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm wondering if the approach originally for the externallyProvisioned would have been to follow the cert-manager pattern and had a ProvisionRequest CR where some external system could fulfill the provision instead of hacking around the BareMetalHost CR.

Well, we should have had a separate CRD for provisioning for sure. The HostClaim work metal3-io/metal3-docs#408 is a step in that direction but it does not address externally provisioned.

Just to confirm (since I was not at the working group meetings) the desired behavior is

I think your summary is correct, except that: "If the Image or customDeploy are set" (one is enough).

@zaneb
Copy link
Member

zaneb commented Jul 18, 2025

On further inspection, I see that the webhook change was added recently to prevent people encountering bug #2465.

hsm.NextState = metal3api.StateExternallyProvisioned
} else if hsm.Host.Status.Provisioning.State == metal3api.StateExternallyProvisioned {
// Removing externallyManaged moves hosts to provisioned state
hsm.NextState = metal3api.StateProvisioned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Zane's comment: this is a wrong place. You need to fix handleExternallyProvisioned instead.

err = clusterProxy.GetClient().Update(ctx, &bmh)
Expect(err).NotTo(HaveOccurred())

Eventually(func(g Gomega) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use WaitForBmhInProvisioningState


Eventually(func(g Gomega) {
err = clusterProxy.GetClient().Get(ctx, kclient.ObjectKeyFromObject(&bmh), &bmh)
g.Expect(bmh.Status.Provisioning.State).To(Equal(metal3api.StateProvisioned))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the test is written, the expected state is not Provisioned. Or even: it make become Provisioned briefly, then BMO detects that there is no image and no customDeploy and deprovisions.

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2025
@metal3-io-bot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Rozzii Rozzii modified the milestones: BMO - v0.11, BMO - v0.12 Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

Status: BMO WIP

Development

Successfully merging this pull request may close these issues.

Confusing behavior on removing externallyProvisioned: true

5 participants