Skip to content

Conversation

qeqar
Copy link
Contributor

@qeqar qeqar commented May 24, 2024

f.e if the creds break, but the cluster was ready, the operator can fix it.

What this PR does / why we need it:
Use transient Error if the Openstackcluster was once ready.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2097

Special notes for your reviewer:

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 24, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @qeqar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 24, 2024
@k8s-ci-robot k8s-ci-robot requested review from EmilienM and mdbooth May 24, 2024 12:17
Copy link

netlify bot commented May 24, 2024

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 9643251
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/6662ba1c58aff000086d02a3
😎 Deploy Preview https://deploy-preview-2099--kubernetes-sigs-cluster-api-openstack.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@qeqar qeqar changed the title :bug if the openstackcluster was ready, we don't want to set a treminalError 🐛 if the openstackcluster was ready, we don't want to set a treminalError May 24, 2024
@lentzi90
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 24, 2024
@qeqar qeqar changed the title 🐛 if the openstackcluster was ready, we don't want to set a treminalError 🐛 if the openstackcluster was ready, we don't want to set a terminalError May 24, 2024
@EmilienM
Copy link
Contributor

What seems wrong to me, is that if a user changes the bastion spec with a new flavor, that new flavor doesn't exist in Nova, the controller fails to create the bastion and wants to update FailureReason, it won't because cluster was Ready.
I'm wondering if we should rather set Readiness to false into this function as well?

@qeqar
Copy link
Contributor Author

qeqar commented May 27, 2024

As this is fixable by fixing the flavor name in the CR, it seems to be the same type of error. So it should not be permanently too.

Setting ready to false is done in no case at the moment, so this would change the behavior, if that is ok, i can put it in the else path, but then i need so find the proper place(s) to set it true again.

Edit: and i am not sure what capi will do if it goes to ready false while provisiond

@EmilienM
Copy link
Contributor

I'll ask @lentzi90 and @mdbooth what they think.

@qeqar
Copy link
Contributor Author

qeqar commented May 27, 2024

After some additional thinking, would setting ready to false, removes the validly of my if clause.

So first: it need to be set on every error, as a permanent error is more critical an should result in an ready false and second if we have two problems, f.e. some error and we activate the bastion host to have look and then do a flavor typo, it will go directly to a permanent error, as ready was already false.

If we start setting ready to false after it was set to true, we need to find a different switch between permanent and transient errors.

@mdbooth
Copy link
Contributor

mdbooth commented May 28, 2024

I've asked about setting Ready to false here: https://kubernetes.slack.com/archives/C8TSNPY4T/p1716892079243969

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 31, 2024
@qeqar qeqar force-pushed the use-transition-error-if-cluster-was-ok branch from 2dd8453 to 9703ff1 Compare May 31, 2024 10:19
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of cases! I will sympathise entirely if you just want to cover the easy ones, including the one that's biting you.

In general, we want to set a terminal error only for:

  • bad user input
  • An unrecoverable error, like something we previously created no longer exists

The first is really only relevant during initial resource creation, and we should not set terminal errors for this.

The second is more complex in the cluster controller. My advice is just to punt on these for now.

If in doubt it shouldn't be terminal.

@lentzi90 @EmilienM I'd appreciate a second set of eyes on this list.

@qeqar
Copy link
Contributor Author

qeqar commented Jun 1, 2024

Will have a look on Monday, if i have some time, then i am out for the rest of the week, so could take some time.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 3, 2024
@qeqar qeqar force-pushed the use-transition-error-if-cluster-was-ok branch 2 times, most recently from fc58220 to ac13660 Compare June 4, 2024 07:52
Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!
/lgtm

We should probably think about adding conditions for some of the non-fatal errors at some point also

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2024
@qeqar
Copy link
Contributor Author

qeqar commented Jun 4, 2024

Ok, i am not fully happy with setting the isFatal inside the function call, specially the true case. But should be ok for this one.

@qeqar qeqar force-pushed the use-transition-error-if-cluster-was-ok branch from ac13660 to 9f1a352 Compare June 4, 2024 12:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2024
@qeqar
Copy link
Contributor Author

qeqar commented Jun 4, 2024

Ok now re-based with main to get the #2109 error class changes and all occurrences updated

Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please squash the commits into one and check my question below

so we have the else path for some future logging/state improvments

Set most errors to transient/non-fatal
@qeqar qeqar force-pushed the use-transition-error-if-cluster-was-ok branch from 2918c08 to 9643251 Compare June 7, 2024 07:43
Copy link
Contributor

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2024
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2024
@EmilienM
Copy link
Contributor

EmilienM commented Jun 7, 2024

Excellent contribution, thank you!
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 35efc1b into kubernetes-sigs:main Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Cluster stays in Failed state after Credentials rotations
5 participants