exponential backoff for taskRun and customRun creation
#8902
The following is the coverage report on the affected files.
f4a09d4 to
c2b4967
/retest
Please help me with the CI. I'm not sure if the issue is related to the changes in the PR:
@pritidesai since it happens systematically on all CI jobs, it probably is related to the PR, unless there was a temporary issue in GitHub runners. You can use
@pritidesai I have the same crashloop locally, the logs
The createTaskRun and createCustomRun functions now use wait.ExponentialBackoff to retry the creation of a taskRun or customRun when certain errors occur, specifically webhook timeouts. The function isWebhookTimeout checks whether an error is a mutating admission webhook timeout by looking for HTTP 500 and the phrase "timeout" in the error message. If a webhook timeout is detected, the backoff loop retries the creation up to a configured number of steps, with increasing delay between attempts. If the error is not a webhook timeout (e.g. HTTP 400 bad request, validation errors), the function does not retry and returns the error immediately, so the taskRun creation fails as expected.

By default, the exponential backoff strategy is disabled. To enable this feature, set `enable-wait-exponential-backoff` to `true` in the feature-flags config map. When enabled, the controller uses an exponential backoff strategy to retry taskRun and customRun creation if it encounters transient errors such as admission webhook timeouts. This improves robustness against temporary webhook issues. If the feature flag is set to false, the controller does not retry and fails immediately on such errors.

Configuration for the backoff parameters (duration, factor, steps, etc.) can be set in the wait-exponential-backoff config map.

Signed-off-by: Priti Desai <[email protected]>
Thank you @afrittoli and @vdemeester. I had a mismatch in the ConfigMap names - it's fixed now 🤞
/retest
Thanks @pritidesai - this looks good to me, limiting the retry to the 500 - timeout case makes this a solid option. Would it make sense to do the same in the TaskRun controller for Pod creations as well?
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: afrittoli
Yes, definitely. I will create a new PR to update the taskRun controller once this one is merged.
/lgtm
Changes
The `createTaskRun` and `createCustomRun` functions now use `wait.ExponentialBackoff` to retry creation when specific transient errors occur, particularly admission webhook timeouts.
The helper function `isWebhookTimeout` determines whether an error is due to a webhook timeout. It checks for:
- an HTTP 500 status
- the phrase "timeout" in the error message

If a webhook timeout is detected, the backoff loop retries the creation up to a configured number of steps, with increasing delay between attempts.
If the error is not a webhook timeout (e.g. HTTP 400, validation errors), the function does not retry and returns the error immediately, so the taskRun creation fails as expected.
By default, the exponential backoff strategy is disabled. To enable this feature, set `enable-wait-exponential-backoff` to `true` in the `feature-flags` config map.
When enabled, the controller uses an exponential backoff strategy to retry `taskRun` and `customRun` creation if it encounters transient errors such as admission webhook timeouts.
This improves robustness against temporary webhook issues by allowing the controller to retry instead of failing immediately.
Configuration for the backoff parameters (duration, factor, steps, etc.) can be set in the `wait-exponential-backoff` config map.
/kind feature
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes