
Conversation

@pritidesai
Member

Changes

The createTaskRun and createCustomRun functions now use wait.ExponentialBackoff to retry creation when specific transient errors occur, in particular admission webhook timeouts.

The helper function isWebhookTimeout determines whether an error is due to a webhook timeout. It checks for:

  • HTTP 500 status codes
  • The presence of the word "timeout" in the error message

If a webhook timeout is detected, the backoff loop will retry the creation up to a configured number of steps, with increasing delay between attempts.

If the error is not a webhook timeout (e.g. HTTP 400 bad request, validation errors), the function does not retry and returns the error immediately, so the taskRun creation fails as expected.

By default, the exponential backoff strategy is disabled. To enable this feature, set `enable-wait-exponential-backoff` to `true` in the `feature-flags` ConfigMap.

When enabled, the controller will use an exponential backoff strategy to retry taskRun and customRun creation if it encounters transient errors such as admission webhook timeouts.

This improves robustness against temporary webhook issues by allowing the controller to gracefully retry instead of failing immediately.

Configuration for the backoff parameters (duration, factor, steps, etc.) can be set in the wait-exponential-backoff ConfigMap.
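Based on the description above, the two ConfigMaps might look roughly like this. The key names `duration`, `factor`, and `steps` are taken from the parameter list in the text but are otherwise illustrative; the ConfigMap name `config-wait-exponential-backoff` appears in a controller log later in this thread.

```yaml
# Assumed shapes, sketched from the PR description; exact keys and values may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  enable-wait-exponential-backoff: "true"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-wait-exponential-backoff
  namespace: tekton-pipelines
data:
  duration: "1s"   # initial delay before the first retry
  factor: "2.0"    # multiplier applied to the delay after each step
  steps: "5"       # maximum number of attempts
```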

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes


- Introduced **exponential backoff retry** mechanism for `createTaskRun` and `createCustomRun` functions.
- Retries are triggered only on **mutating admission webhook timeouts** (HTTP 500 with "timeout" in the error message).
- Non-retryable errors (e.g., HTTP 400, validation failures) continue to fail immediately.
- Feature is **disabled by default**. To enable, set `enable-wait-exponential-backoff: "true"` in the `feature-flags` ConfigMap.
- Backoff parameters (duration, factor, steps) are configurable via the `wait-exponential-backoff` ConfigMap.
- Improves robustness against transient webhook issues in a heavy cluster during resource creation.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 22, 2025
@tekton-robot tekton-robot requested review from afrittoli and jerop July 22, 2025 00:09
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 22, 2025
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/apis/config/feature_flags.go | 94.8% | 94.0% | -0.8 |
| pkg/apis/config/store.go | 93.3% | 93.9% | 0.6 |
| pkg/apis/config/wait_exponential_backoff.go | Do not exist | 87.0% | |
| pkg/reconciler/pipelinerun/pipelinerun.go | 91.6% | 91.8% | 0.2 |
| test/controller.go | 29.5% | 30.1% | 0.6 |

@pritidesai pritidesai force-pushed the backoff branch 2 times, most recently from f4a09d4 to c2b4967 Compare July 22, 2025 01:28



@pritidesai
Member Author

/retest

@pritidesai
Member Author

Please help me with the CI. I'm not sure if the issue is related to the changes in the PR:

```
++ kubectl get pods --no-headers -n tekton-pipelines
+ local 'pods=tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
++ echo 'tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
```

@afrittoli
Member

> Please help me with the CI. I'm not sure if the issue is related to the changes in the PR:
>
> ```
> ++ kubectl get pods --no-headers -n tekton-pipelines
> + local 'pods=tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
> tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
> tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
> ++ echo 'tekton-events-controller-6b98c4df47-97tns     0/1   CrashLoopBackOff   1 (2s ago)   7s
> tekton-pipelines-controller-5965b8d5c-drwfm   0/1   CrashLoopBackOff   1 (4s ago)   7s
> tekton-pipelines-webhook-7bd87bfbd6-z87cz     0/1   CrashLoopBackOff   1 (2s ago)   7s'
> ```

@pritidesai since it happens systematically on all CI jobs, it probably is related to the PR, unless there was a temporary issue in GitHub runners. You can use /retest to re-trigger.
We run the kind-diag action after each CI run, which collects all logs from the cluster, you should be able to get the crashing controller logs from there.

@vdemeester
Member

@pritidesai I have the same crashloop locally, the logs

```
{"severity":"fatal","timestamp":"2025-07-24T08:57:48.855Z","logger":"tekton-pipelines-controller","caller":"sharedmain/main.go:284","message":"Failed to start configuration manager","commit":"cfe8c9c-dirty","error":"configmap \"config-wait-exponential-backoff\" not found","stacktrace":"knative.dev/pkg/injection/sharedmain.MainWithConfig\n\tknative.dev/[email protected]/injection/sharedmain/main.go:284\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/controller/main.go:104\nruntime.main\n\truntime/proc.go:283"}
```

The createTaskRun and createCustomRun functions now use wait.ExponentialBackoff
to retry the creation of a taskRun or customRun when certain errors occur,
specifically webhook timeouts.

The function isWebhookTimeout checks if an error is a mutating admission
webhook timeout, by looking for HTTP 500 and the word "timeout" in the error
message.

If a webhook timeout is detected, the backoff loop will retry the creation
up to a configured number of steps, with increasing delay between attempts.

If the error is not a webhook timeout, the function will not retry and will
return the error immediately. Errors that are not webhook timeouts, e.g.
HTTP 400 bad request, validation errors, etc., are not retried and will cause
the taskRun creation to fail as expected.

By default, the exponential backoff strategy is disabled. To enable this
feature, set the `enable-wait-exponential-backoff` to `true` in
feature-flags config map.

When enabled, the controller will use an exponential backoff strategy to retry
taskRun and customRun creation if it encounters transient errors such as
admission webhook timeouts.

This improves robustness against temporary webhook issues. If the feature flag
is set to false, the controller will not retry and will fail immediately on
such errors.

Configuration for the backoff parameters (duration, factor, steps, etc) can be
set in the wait-exponential-backoff config map.

Signed-off-by: Priti Desai <[email protected]>

@pritidesai
Member Author

pritidesai commented Jul 24, 2025

Thank you @afrittoli and @vdemeester. I had a mismatch in the ConfigMap names; it's fixed now 🤞

@pritidesai
Member Author

/retest

Member

@afrittoli afrittoli left a comment


Thanks @pritidesai - this looks good to me, limiting the retry to the 500 - timeout case makes this a solid option. Would it make sense to do the same in the TaskRun controller for Pod creations as well?
/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 24, 2025
@pritidesai
Member Author

Would it make sense to do the same in the TaskRun controller for Pod creations as well?

Yes, definitely. I will create a new PR to update the taskRun controller once this one is merged.

Member

@vdemeester vdemeester left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 28, 2025
@tekton-robot tekton-robot merged commit a0aa88a into tektoncd:main Jul 28, 2025
49 of 50 checks passed
