aThorp96
Member

@aThorp96 aThorp96 commented May 6, 2025

Changes

During resource creation/validation, retryable errors such as a k8s timeout should not result in the resource being marked as failed. To ensure this, these errors must either be bubbled up or wrapped; they cannot be flattened into a new error using retryableErr.Error(), because that discards the original error from the chain.
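
A minimal sketch of the difference, using only the standard library. The names `errTimeout`, `flatten`, `wrap`, and `isRetryable` are hypothetical stand-ins for illustration and are not taken from the Tekton code base.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical stand-in for a retryable error such as a k8s timeout.
var errTimeout = errors.New("timeout")

// flatten copies only the message text, so the original error leaves the chain.
func flatten(err error) error {
	return fmt.Errorf("validation failed: %s", err.Error())
}

// wrap keeps the original error in the chain via the %w verb.
func wrap(err error) error {
	return fmt.Errorf("validation failed: %w", err)
}

// isRetryable is a hypothetical classifier that inspects the error chain.
func isRetryable(err error) bool {
	return errors.Is(err, errTimeout)
}

func main() {
	fmt.Println(isRetryable(flatten(errTimeout))) // false: the chain was lost
	fmt.Println(isRetryable(wrap(errTimeout)))    // true: the chain is intact
}
```

Both errors produce the same message text; only the wrapped one can still be recognized as retryable by a caller further up the stack.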

Additionally, IsTooManyRequests is now considered Transient (retryable).

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Retryable errors during dry-run Task validation will no longer cause a PipelineRun to be marked as failed.

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label May 6, 2025
@tekton-robot tekton-robot requested review from abayer and afrittoli May 6, 2025 14:55
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 6, 2025
@aThorp96 aThorp96 force-pushed the retryable-validation-error branch from e4afe99 to 2e6116e Compare May 6, 2025 14:59
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go | 96.2% | 96.5% | 0.3 |
| pkg/resolution/common/errors.go | 13.0% | 14.3% | 1.2 |

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesnt merit a release note. labels May 6, 2025
@aThorp96
Member Author

aThorp96 commented May 6, 2025

/cc @vdemeester

@tekton-robot tekton-robot requested a review from vdemeester May 6, 2025 19:18
Member

@waveywaves waveywaves left a comment

+1 for this, can we also ensure an e2e test for this where a Run might fail deliberately on the first 2 tries but passes on the third try by design? Also it might be good to document this new transient error.

@aThorp96
Member Author

aThorp96 commented May 7, 2025

> can we also ensure an e2e test for this where a Run might fail deliberately on the first 2 tries but passes on the third try by design?

@waveywaves this is difficult but if there is prior art on it I can take a look. Would that be blocking?

> It might be good to document this new transient error.

Yes, that's a good idea. Do you know where would be appropriate to document this, outside of release notes?

@waveywaves
Member

A test for this would be hard, and the change is a small one, so I don't believe the test is blocking.

I couldn't find any docs related to transient errors during pipeline resolution. There are multiple resolvers documented under docs/, but this behavior is not documented anywhere. Nothing in the remote resolution docs helps users understand which errors are considered transient if their pipeline runs into them; documenting this would also communicate the robustness of remote resolution. We should add this documentation under each resolver or at the end of the resolver reference doc. Something like

# Transient errors during resolution

Runs don't fail in case of transient errors. The following is an exhaustive list of errors that are considered transient.

~ note which transient errors exactly ~ 

~write in brief what exactly happens during these transient errors~

I would prefer putting it towards the end of docs/resolver-reference.md over writing it in every resolver doc.

@aThorp96 aThorp96 force-pushed the retryable-validation-error branch from 2e6116e to 7df2978 Compare May 8, 2025 11:52
@waveywaves
Member

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 8, 2025
@aThorp96 aThorp96 force-pushed the retryable-validation-error branch from 7df2978 to db12db2 Compare May 8, 2025 23:46
@aThorp96 aThorp96 requested a review from waveywaves May 9, 2025 01:46
Member

@vdemeester vdemeester left a comment

Nice 😬

  errType = ErrCouldntValidateObjectRetryable
  }
- return fmt.Errorf("%w %s: %s", errType, objectName, err.Error())
+ return fmt.Errorf("%w %s: %w", errType, objectName, err)
Member

❤️

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2025
@waveywaves
Member

thank you for the changes @aThorp96

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2025
@tekton-robot tekton-robot merged commit 7b3e478 into tektoncd:main May 12, 2025
20 checks passed
@l-qing
Member

l-qing commented Sep 14, 2025

@vdemeester @aThorp96 Can we cherry-pick this bug fix to the release-1.0.x branch? I recently encountered a similar issue on Tekton Pipeline v1.0, and the v1.0 LTS will be maintained for some time to come. 😆

@vdemeester
Member

/cherry-pick release-v1.0.x

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Done

5 participants