Conversation

hardys
Member

@hardys hardys commented Dec 19, 2023

Draft proposal to explore how we can enable the CAPM3 IPAM flow in a DHCP-less environment, following on from recent discussions on Slack

/cc @Rozzii @dtantsur

@metal3-io-bot metal3-io-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 19, 2023
@hardys hardys changed the title [WIP] DHCP-less proposal [WIP] DHCP-less CAPM3/IPAM proposal Dec 19, 2023

@pierrecregut pierrecregut left a comment

Here are my comments: the main issues I see are that I do not think there is a real need for the intermediate CR Metal3PreprovisioningTemplate, and that we need to choose whether we go for a format-specific solution or a pure template-based one.

@hardys
Member Author

hardys commented Dec 21, 2023

Thanks for the review @pierrecregut - and apologies for not tagging you originally, I wasn't sure of your github ID but I see now I should have figured it out 😂

Member

@Rozzii Rozzii left a comment

I wanted to give feedback sooner but I was busy; at first glance these are my ideas related to the topic. Of course we will discuss this proposal further on the 31st of January at the M3TM.

@pierrecregut

@hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

@hardys
Member Author

hardys commented Jan 17, 2024

@hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

@pierrecregut this was in response to our previous discussion and an attempt to constrain the scope of this proposal such that it's actionable - we are exploring some BMO/BMH API dependencies, but the main aim is to extend the current CAPM3 templating so the existing IPAM integration works in DHCP-less situations?

@tuminoid
Member

Please rebase at a convenient time to get the new markdown linter check.

@hardys hardys force-pushed the dhcp-less branch 2 times, most recently from 9f29eb4 to 3b00020 Compare January 25, 2024 16:38
@hardys
Member Author

hardys commented Jan 25, 2024

@hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

I moved the proposal back to the top level since it now contains changes for both CAPM3 and BMO, and I can adjust the PR title if that makes things clearer.

@hardys
Member Author

hardys commented Jan 25, 2024

/retitle DHCP-less IPAM proposal

@metal3-io-bot metal3-io-bot changed the title [WIP] DHCP-less CAPM3/IPAM proposal DHCP-less IPAM proposal Jan 25, 2024
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2024
@hardys
Member Author

hardys commented Jan 25, 2024

Thanks everyone for your feedback so far - I've reworked the proposal and tried to take into account all the comments; please let me know if there are any further suggestions or things that I missed.

@pierrecregut pierrecregut left a comment

@hardys To summarize my comments: the proposal is great, except maybe the attempt at reclaiming IPs, which looks complex. There are two sections where alternatives are described, and finally we should address how to give feedback when the template cannot be expanded for a host.

the pre-provisioning phase as described above. To reduce the required size of the pool, the IPClaim will be deleted after the
preprovisioning phase is completed, e.g the BMH resource becomes available.

In the provisioning phase another pool, associated with a cluster is used to template networkData as in the existing process.

So if I understand correctly, you want to use two network configurations for the IPA? One during inspection and another during provisioning? This seems complex, and I thought the same configuration would be used. If you really want to delete the claim, I would do it after the node is provisioned, but this also requires erasing the preprovisioningNetworkDataName (and what it points to), so that when we upgrade the node we regenerate the data with a usable IP. Is IP address reuse on the provisioning network worth the complexity?

Member Author

The first use case, "Preprovisioning - Common IPPool", covers the case where the same network configuration is used for IPA and during provisioning; I agree this is the simpler and probably more common case.

My understanding is that there is also a use case where a separate network is used for the pre-provisioning phase, so I'm trying to cover that scenario here.

I agree perhaps reclaiming the IPClaim after inspection introduces undue complexity, but if we don't do that the provisioning network address range potentially becomes a lot larger (number of deployed BMHs vs the concurrency allowed by BMO). Perhaps that's a worthwhile tradeoff, and we can consider the IPClaim deletion in future if it proves necessary?

Member

@zaneb zaneb Apr 29, 2024

The use case where there'd be two is if you have hosts that are dynamically assigned to clusters (and the networks reconfigured at the same time). I'm not sure this exists in the wild, but it would be good to allow it.
In the more common case of the network staying the same, I think you really want the IP address not to change. The one in the inspection data is the one you want to have in the cluster. So I think ideally you only want to delete the preprovisioning IP claim at provisioning time, and only if a separate provisioned IP claim exists.
In fact, I think you have to keep the claim until after provisioning starts, because the host will continue to use its preprovisioning IP to talk to Ironic (and that is the only way the host can get provisioned). The "preprovisioning" phase extends up until the BMH hits the provisioned state (i.e. after Ironic reboots the host after writing the image to disk, which I think is when the node moves to active), not just after inspection.

You can support dynamically assigning hosts to clusters without changing the IPA network configuration. It is true that the network used by Kubernetes needs to be reconfigured and be visible from the CAPM3 controller. On the other hand, the IPA only needs to see Ironic (and the OS image repository). Servers should not see each other. As the requirements are really different, the best solution is to keep two separate network configurations for IPA and CAPM3, and in that case the IPClaims are on different pools. If you have cluster definitions and BareMetalHosts on different clusters (as with #429 or #408) I am not sure there is an easy path to keep the IPClaims synchronized. If you really need to do something like that with a single network (that is what we were doing in old versions of Kanod, and yes, it was a bad idea), I would put the IPAM on the BareMetalHost side and push IPs through BareMetalHost annotations; that is why in our "template for BMH" experiment (BmhData), metadata could also be exported as BMH annotations.

Member

@Rozzii Rozzii Oct 16, 2024

I might be wrong, but this thread seems to have started with a misunderstanding. The question was:
"you want to use two network configurations for the IPA?" The answer is NO.

Then the discussion got mixed up with reclaiming after inspection, and @zaneb has correctly pointed out that pre-provisioning lasts until the BMH is in the "provisioned" state, which by default means the reboot of the physical machine.

IMO the pre-provisioning IP can be reclaimed/released after the BMH is in "provisioned" state.

To summarize:

IPA will use only one IPPool during its runtime (the preprovisioning pool); the second IPPool is used to generate the network data of the provisioned node, and that network data is just written to the config-drive partition on disk as the network data the provisioned machine will use after it restarts.

I would also like to avoid discussion about synchronizing the pools; this proposal is about implementing pre-provisioning network data generation and linking the network data to the BMH. Now that the multi-tenancy topic has been pulled into this proposal we are well out of the original scope of the discussion; IMO let's not involve yet another topic.

I might be saying things that you already know, but to me at least it feels like this thread has been somewhat derailed from the proposal.
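
To illustrate the summary above, here is a minimal sketch of the two pools, assuming the existing ipam.metal3.io IPPool schema - names and address ranges are purely illustrative:

```yaml
# Pool the IPA uses for its whole runtime (pre-provisioning network).
apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: preprov-pool
spec:
  namePrefix: preprov
  pools:
    - start: 192.168.10.10
      end: 192.168.10.50
  prefix: 24
  gateway: 192.168.10.1
---
# Pool used only to template the networkData written to the config drive,
# which the provisioned machine applies after it restarts.
apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: provisioned-pool
spec:
  clusterName: my-cluster
  namePrefix: provisioned
  pools:
    - start: 192.168.20.10
      end: 192.168.20.200
  prefix: 24
  gateway: 192.168.20.1
```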

a customized ISO, it can embed the configuration instead of Ironic to achieve the same result.

### Network configuration failure can't be reflected in status

I agree that this part is out of the scope of the proposal, and it is not worse than when the IPA fails because of a DHCP configuration problem. But we should address the feedback given when the templater fails to generate the preprovisioningNetworkData for a given bare metal host. Feedback at the level of the template would be strange if the reason is specific to a BMH. A condition on the BMH looks weird as this is not performed by the main controller. A solution would be to at least generate an event associated with the BMH.

Member Author

Thanks, this is a good point - there are probably a few error paths to consider:

* The baremetalhost.metal3.io/preprovisioningnetworkdata-reference value does not reference a valid resource (perhaps BMO should validate this, but just check the resource exists?) - see the sketch below
* A problem with the DataTemplate resource - the schema is enforced by the API, but AFAICS for Metal3DataTemplate any other error will be reported via the resource status, so I think we'll do the same for Metal3PreprovisioningDataTemplate given they will likely share the same code
* Some other problem in CAPM3 related to generating and setting the Secret - here we don't have a Machine resource during the pre-provisioning phase, so we will have to do as you suggest and generate an event associated with the BMH
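
For the first error path, a minimal sketch of the annotation usage - the annotation key comes from this proposal, but the value shown and the rest of the BMH spec are illustrative assumptions only:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  annotations:
    # Assumed to name the templating resource used to render the
    # preprovisioningNetworkData for this host; BMO would only check that it
    # resolves to an existing resource before starting inspection.
    baremetalhost.metal3.io/preprovisioningnetworkdata-reference: preprov-data-template
spec:
  online: true
  bootMACAddress: "00:11:22:33:44:55"
  bmc:
    address: redfish-virtualmedia://192.168.111.1/redfish/v1/Systems/1
    credentialsName: node-0-bmc-secret
```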

There are several alternative approaches to solve the problem of preventing BMO from starting inspection immediately on BMH registration:

* Add a new BareMetalHost API `PreprovisioningNetworkDataRequired` which defaults to false, but when set to true will describe that the host cannot move from Registering -> Inspecting until `preprovisioningNetworkDataName` has been set.
* Create the BMH with a preprovisioningNetworkDataName pointing to an empty Secret. BMO refuses to start inspecting until the Secret contains some data.

The principle that necessary data can be specified as a non-existing resource, and that the BMH will just wait and not error until the data is provided, is nice. Even with the current approach, this could be used by a template custom resource that can generate the data but does not have the power to modify the BMH fields. And the principle can be generalized to other data (userData, networkData).
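
A minimal sketch of the second alternative above (the initially empty Secret), assuming the existing preprovisioningNetworkDataName field on the BareMetalHost - resource names are purely illustrative:

```yaml
# Created empty on purpose: BMO would hold the host in Registering until some
# other controller (or the user) populates the Secret with network data.
apiVersion: v1
kind: Secret
metadata:
  name: node-0-preprov-netdata
type: Opaque
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
spec:
  online: true
  preprovisioningNetworkDataName: node-0-preprov-netdata
```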

@hardys
Member Author

hardys commented Jul 3, 2024

Thanks for the additional feedback @zaneb, one question below:

In the meantime, it seems to me that IPAM fits pretty clearly on the 'admin' side of the user-admin divide, and is something that should be configured as part of creating the BMH resources, not as part of provisioning them, in applications where those two roles are separate. I'm not sure that this changes this proposal at all, except that maybe infrastructure.cluster.x-k8s.io is the wrong namespace for the Metal3PreprovisioningDataTemplate. But maybe we should start to look at this as a replacement for, rather than an addition to, CAPM3 templates for the networkData.

If we're in agreement that this should be aligned with BMO/BMH rather than CAPM3, perhaps we should consider a more explicit BMH API addition instead of the annotation (this was previously discussed but rejected because I envisaged extending the current CAPM3 model), and add a new controller to BMO which can handle templating as an alternative to the current CAPM3 flow? E.g. something like:

apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: pool-1
spec:
  clusterName: cluster
---
apiVersion: metal3.io/v1alpha1
kind: NetworkDataTemplate
metadata:
  name: preprov-data-template
spec:
  format: cloud-init  # So we can add e.g nmstate without replacing the resource type
  networkData:
    networks:
      ipv4:
      - id: eth0
        ipAddressFromIPPool: pool-1
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
spec:
  preprovisioningNetworkDataTemplate: preprov-data-template
  # networkDataTemplate could also be added as an alternative to current CAPM3 templating
...

@dtantsur
Member

dtantsur commented Jul 3, 2024

If we're in agreement that this should be aligned with BMO/BMH rather than CAPM3, perhaps we should consider a more explicit BMH API addition instead of the annotation (this was previously discussed in #373 (comment) but rejected because I envisaged extending the current CAPM3 model), and add a new controller to BMO which can handle templating as an alternative to the current CAPM3 flow?

Absolutely yes. I think we only discussed an annotation to avoid the reverse dependency of BMO on CAPM3? The new API solves this issue.

@zaneb
Member

zaneb commented Jul 4, 2024

That seems plausible, although it feels a bit weird having 2 separate APIs in the BMH for the same thing.
We want the template to be resolved for the BMH (because the preprovisioningNetworkData is also the default for networkData), but presumably the PreprovisioningImage will still take a Secret as an input.
So there'd be a bit of a weird thing where you supply the template name to BMH but then BMO resolves the template into a Secret and... writes that back into the BMH?

I think we are in agreement this should be aligned with the BMO/BMH not CAPM3 at least.

@hardys
Member Author

hardys commented Jul 4, 2024

That seems plausible, although it feels a bit weird having 2 separate APIs in the BMH for the same thing. We want the template to be resolved for the BMH (because the preprovisioningNetworkData is also the default for networkData), but presumably the PreprovisioningImage will still take a Secret as an input. So there'd be a bit of a weird thing where you supply the template name to BMH but then BMO resolves the template into a Secret and... writes that back into the BMH?

Yeah, a BMO controller writing back to the spec does seem awkward - another option could be to create the secret and track it in the BMH status, then when we read e.g. preprovisioningNetworkData we check for (and when needed generate) a template-generated secret if no user-provided one is present in the spec?

This is similar to how the Metal3Machine API works, and would also mean that if anyone mixes the new BMH APIs with the existing CAPM3 ones the old/existing approach takes precedence; users would opt in to the new behaviour by not setting any networkData in Metal3DataTemplate or Metal3Machine and instead setting networkDataTemplate on the BMH.
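
Roughly what that could look like on the BMH, assuming the preprovisioningNetworkDataTemplate field sketched earlier in this thread - note the status field name below is purely hypothetical, and a user-provided Secret in the spec would still take precedence:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
spec:
  online: true
  # Proposed new field (from the sketch earlier in this thread): a template
  # the BMO controller expands instead of a user-provided Secret.
  preprovisioningNetworkDataTemplate: preprov-data-template
  # preprovisioningNetworkDataName is deliberately left unset here, so the
  # generated Secret recorded in status would be used.
status:
  # Hypothetical field: where the controller could record the Secret it
  # generated from the template, instead of writing back into the spec.
  preprovisioningNetworkDataSecret: node-0-preprov-netdata-generated
```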

@hardys
Member Author

hardys commented Jul 4, 2024

@lentzi90 any feedback on the alternate direction discussed above? It could help with the multi-tenancy direction so that tenant users no longer have any need to directly interact with *networkData, they would only need to specify metaData/userData at the machine level, with networking handled by the BMH owner.

@dtantsur
Member

dtantsur commented Jul 4, 2024

BMO resolves the template into a Secret and... writes that back into the BMH?

I personally don't find this weird at all but I don't know what is standard in Kubernetes.

another option could be to create the secret and track it in the BMH status, then when we read e.g. preprovisioningNetworkData we check for (and when needed generate) a template-generated secret if no user-provided one is present in the spec?

Works for me too.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024
@Rozzii
Member

Rozzii commented Oct 9, 2024

/remove-lifecycle stale
IMO the community would benefit from this very much, and I personally would like this discussion to continue.

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2024
@kukacz

kukacz commented Oct 9, 2024

/remove-lifecycle stale IMO the community would benefit from this very much, and I personally would like this discussion to continue.

Same for me. I find this feature important to complete the effort that virtualmedia booting began - i.e. to get rid of L2 network constraints and improve the declarative part of the provisioning process.

Member

@Rozzii Rozzii left a comment

I think this is a very good proposal; I would like to move it forward.

@hardys is there a possibility to move this forward in this quarter?

I understand that there is work going on for multi-tenancy at the same time, but after that discussion was pulled in here I feel we have blown the whole proposal a little bit out of its original scope.

I am also fine with this #373 (comment) approach: if CAPM3 is creating the pre-provisioning network data based on Metal3DataTemplate then let it do that, but if you need an independent BMO way then let's allow that too. I also agree with the precedence of actions mentioned here.

Is there any unanswered question, or are we all okay with this approach (#373 (comment))? Is there something else we have to agree on?

Additionally, Ironic may not support injecting `network_data` in every configuration: when the IPA image is
supplied as a kernel and initramfs combo, Ironic can convert it to an ISO and inject the node `network_data`,
but in the case where a pre-built ramdisk ISO is provided the image conversion won't happen, thus there is
no way to inject the network data.
Member

Suggested change
no way to inject the network data.
no way to inject the pre-provisioning network data.

Regular network data will be injected without any issue.

There are several potential workarounds for this situation, but it would be preferable for a common solution
to be found which can benefit both Metal3 and the broader Ironic community. As such, solving this problem is considered
outside the scope of this proposal, but we will follow and engage with the
[Ironic community spec related to this problem](https://review.opendev.org/c/openstack/ironic-specs/+/906324/1/specs/approved/fix-vmedia-boot-config.rst)
Member

FYI: They are all merged.

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2024
@metal3-io-bot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lentzi90
Member

lentzi90 commented Nov 1, 2024

@lentzi90 any feedback on the alternate direction discussed above? It could help with the multi-tenancy direction so that tenant users no longer have any need to directly interact with *networkData, they would only need to specify metaData/userData at the machine level, with networking handled by the BMH owner.

Sorry for the long delay.
This direction does sound interesting and I think we can move forward in this way. If we can keep the possibility for CAPM3 to still handle the networkData I think that would be good (but not a blocker). I guess that would anyway be mostly up to how CAPM3 is configured. In most cases it would probably make sense that the BMH owner takes care of it, so the CAPM3 default could be to not have access to it at all.

On a side note, I'm trying to piece together all of this with @zaneb's draft for API decomposition and the CAPI/CAPM3 multi-tenancy to get an idea of what the end result could look like. I'll probably open up a discussion for it once I have a basic diagram 🙂

@hardys
Member Author

hardys commented Nov 4, 2024

I think this is a very good proposal; I would like to move it forward.

@hardys is there a possibility to move this forward in this quarter?

I understand that there is work going on for multi-tenancy at the same time, but after that discussion was pulled in here I feel we have blown the whole proposal a little bit out of its original scope.

I am also fine with this #373 (comment) approach: if CAPM3 is creating the pre-provisioning network data based on Metal3DataTemplate then let it do that, but if you need an independent BMO way then let's allow that too. I also agree with the precedence of actions mentioned here.

Is there any unanswered question, or are we all okay with this approach (#373 (comment))? Is there something else we have to agree on?

Sorry, I've been struggling to find the time to revisit this, but yes, I can do another pass and try to capture what I think was the consensus in the recent comments (which is not yet fully reflected in the proposal) - e.g. that we want to pursue a new BMO controller.

I can't currently commit to finding time to write the code myself, but I'd definitely like to see this proposal merged; then we can try to find resources to move the implementation forward.

@lentzi90
Member

lentzi90 commented Nov 5, 2024

Thanks @hardys !
I have now created a GitHub discussion as mentioned earlier: https://github.com/orgs/metal3-io/discussions/2029
Feel free to chime in there! 🙂

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2025
@Rozzii
Member

Rozzii commented Feb 21, 2025

/remove-lifecycle stale
Progress has slowed a bit, but this is by no means stale; this topic was discussed at the 2024 Feb 13 MTM.

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2025
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2025
@Rozzii
Member

Rozzii commented May 23, 2025

We still need this.
/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2025
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2025
@tuminoid
Member

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2025

Labels

needs-rebase Indicates that a PR cannot be merged because it has merge conflicts with HEAD.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: IPAM WIP

10 participants