
Conversation

@pritidesai
Member

Changes

The createTaskRun and createCustomRun functions now use wait.ExponentialBackoff to retry creation when specific transient errors occur, in particular admission webhook timeouts.

The helper function isWebhookTimeout determines whether an error is due to a webhook timeout. It checks for:

  • HTTP 500 status codes
  • The presence of the word "timeout" in the error message

If a webhook timeout is detected, the backoff loop will retry the creation up to a configured number of steps, with increasing delay between attempts.

If the error is not a webhook timeout (e.g. HTTP 400 bad request, validation errors), the function does not retry and returns the error immediately, so the taskRun creation fails as expected.

By default, the exponential backoff strategy is disabled. To enable this feature, set `enable-wait-exponential-backoff` to `true` in the `feature-flags` ConfigMap.

When enabled, the controller will use an exponential backoff strategy to retry taskRun and customRun creation if it encounters transient errors such as admission webhook timeouts.

This improves robustness against temporary webhook issues by allowing the controller to gracefully retry instead of failing immediately.

Configuration for the backoff parameters (duration, factor, steps, etc.) can be set in the wait-exponential-backoff ConfigMap.
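Based on the description above, the two ConfigMaps might look roughly like this. The key names `duration`, `factor`, and `steps` are taken from the parameter list in the text but are otherwise illustrative; the ConfigMap name `config-wait-exponential-backoff` appears in a controller log later in this thread.

```yaml
# Assumed shapes, sketched from the PR description; exact keys and values may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  enable-wait-exponential-backoff: "true"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-wait-exponential-backoff
  namespace: tekton-pipelines
data:
  duration: "1s"   # initial delay before the first retry
  factor: "2.0"    # multiplier applied to the delay after each step
  steps: "5"       # maximum number of attempts
```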

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes


- Introduced **exponential backoff retry** mechanism for `createTaskRun` and `createCustomRun` functions.
- Retries are triggered only on **mutating admission webhook timeouts** (HTTP 500 with "timeout" in the error message).
- Non-retryable errors (e.g., HTTP 400, validation failures) continue to fail immediately.
- Feature is **disabled by default**. To enable, set `enable-wait-exponential-backoff: "true"` in the `feature-flags` ConfigMap.
- Backoff parameters (duration, factor, steps) are configurable via the `wait-exponential-backoff` ConfigMap.
- Improves robustness against transient webhook issues in a heavy cluster during resource creation.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 22, 2025
@tekton-robot tekton-robot requested review from afrittoli and jerop July 22, 2025 00:09
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 22, 2025
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/apis/config/feature_flags.go | 94.8% | 94.0% | -0.8 |
| pkg/apis/config/store.go | 93.3% | 93.9% | 0.6 |
| pkg/apis/config/wait_exponential_backoff.go | Do not exist | 87.0% | |
| pkg/reconciler/pipelinerun/pipelinerun.go | 91.6% | 91.8% | 0.2 |
| test/controller.go | 29.5% | 30.1% | 0.6 |

@pritidesai pritidesai force-pushed the backoff branch 2 times, most recently from f4a09d4 to c2b4967 Compare July 22, 2025 01:28



@pritidesai
Member Author

/retest

@pritidesai
Member Author

Please help me with the CI. I'm not sure if the issue is related to the changes in the PR:

```
++ kubectl get pods --no-headers -n tekton-pipelines
+ local 'pods=tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
++ echo 'tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
```

@afrittoli
Member

> Please help me with the CI. I'm not sure if the issue is related to the changes in the PR:
>
> ```
> ++ kubectl get pods --no-headers -n tekton-pipelines
> + local 'pods=tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
> tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
> tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
> ++ echo 'tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
> tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
> tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
> ```

@pritidesai since it happens systematically on all CI jobs, it probably is related to the PR, unless there was a temporary issue in GitHub runners. You can use /retest to re-trigger.
We run the kind-diag action after each CI run, which collects all logs from the cluster, you should be able to get the crashing controller logs from there.

@vdemeester
Member

@pritidesai I have the same crashloop locally, the logs

```
{"severity":"fatal","timestamp":"2025-07-24T08:57:48.855Z","logger":"tekton-pipelines-controller","caller":"sharedmain/main.go:284","message":"Failed to start configuration manager","commit":"cfe8c9c-dirty","error":"configmap \"config-wait-exponential-backoff\" not found","stacktrace":"knative.dev/pkg/injection/sharedmain.MainWithConfig\n\tknative.dev/[email protected]/injection/sharedmain/main.go:284\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/controller/main.go:104\nruntime.main\n\truntime/proc.go:283"}
```

The createTaskRun and createCustomRun functions now use wait.ExponentialBackoff
to retry the creation of a taskRun or customRun when certain errors occur,
specifically webhook timeouts.

The function isWebhookTimeout checks if an error is a mutating admission
webhook timeout, by looking for HTTP 500 and the word "timeout" in the error
message.

If a webhook timeout is detected, the backoff loop will retry the creation
up to a configured number of steps, with increasing delay between attempts.

If the error is not a webhook timeout, the function will not retry and will
return the error immediately. Errors that are not webhook timeouts, e.g.
HTTP 400 bad request, validation errors, etc., are not retried and will cause
the taskRun creation to fail as expected.

By default, the exponential backoff strategy is disabled. To enable this
feature, set the `enable-wait-exponential-backoff` to `true` in
feature-flags config map.

When enabled, the controller will use an exponential backoff strategy to retry
taskRun and customRun creation if it encounters transient errors such as
admission webhook timeouts.

This improves robustness against temporary webhook issues. If the feature flag
is set to false, the controller will not retry and will fail immediately on
such errors.

Configuration for the backoff parameters (duration, factor, steps, etc) can be
set in the wait-exponential-backoff config map.

Signed-off-by: Priti Desai <[email protected]>

@pritidesai
Member Author

pritidesai commented Jul 24, 2025

Thank you @afrittoli and @vdemeester. I had a mismatch in the ConfigMap names; it's fixed now 🤞

@pritidesai
Member Author

/retest

Member

@afrittoli afrittoli left a comment


Thanks @pritidesai - this looks good to me, limiting the retry to the 500 - timeout case makes this a solid option. Would it make sense to do the same in the TaskRun controller for Pod creations as well?
/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 24, 2025
@pritidesai
Member Author

Would it make sense to do the same in the TaskRun controller for Pod creations as well?

Yes, definitely. I will create a new PR to update the taskRun controller once this one is merged.

Member

@vdemeester vdemeester left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 28, 2025
@tekton-robot tekton-robot merged commit a0aa88a into tektoncd:main Jul 28, 2025
49 of 50 checks passed
