Skip to content

Conversation

jerop
Copy link
Member

@jerop jerop commented Jun 9, 2022

TEP-0090: Matrix proposed executing a PipelineTask in
parallel TaskRuns and Runs with substitutions from combinations
of Parameters in a Matrix.

In this change, we propose a fail-fast stategy for handling failures
in TaskRuns and Runs created from a Matrix. This makes it easier
to control the execution of the many TaskRuns or Runs created from
a Matrix - up to 256.

This approach also aligns with the default behavior of Matrix in
other Continuous Delivery systems e.g GitHub Actions.

If needed, we can explore supporting other failure strategies later.

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Jun 9, 2022
@tekton-robot tekton-robot requested review from khrm and vincent-pli June 9, 2022 23:33
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 9, 2022
@jerop
Copy link
Member Author

jerop commented Jun 10, 2022

cc @abayer @dibyom @lbernick @pritidesai please take a look

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2022
created from a `Matrix` (up to 256 for now). Moreover, failing fast is the default behavior of `Matrix`
in other Continuous Delivery systems - see [GitHub Actions - Handling Failures in Matrix][ghm-failfast].

If needed, we can explore supporting other failure strategies later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right call - in Jenkins Pipeline, we inherited certain assumptions around failure behavior for parallel executions (both in a matrix and not) which resulted in us having to support making fail-fast configurable, and I don't think it was worth doing that in the end. When we're starting from scratch, like here, we have the opportunity to say "this is how things work" and adjust if sufficient demand comes up in the future, rather than prematurely optimizing. So, yeah. 👍

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abayer, dibyom, lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


In failure scenarios, the `TaskRuns` or `Runs` created from a `Matrix` will fail fast. That is, when
any `TaskRun` or `Run` from a given fanned-out `PipelineTask` fails or is cancelled then the other
`TaskRuns` or `Runs` from the same `PipelineTask` will be cancelled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jerop, looking for more clarification, how are we failing already running taskRuns or runs from the same pipelineTask? Sending termination signal?

Can we implement something like stopping mode by default, the similar approach we take when a task in a pipeline fails?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason I am asking is I am assuming stopping is easier to implement 🤣 And does not result in partial execution ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritidesai yes, similarly to how we cancel TaskRuns for example - it may be easier to implement stopping mode but wondering if that's the behavior we want for Matrix separate from the implementation - maybe partial execution could be the reason not to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stopping will be consistent to how we address failure of a non-matrix pipelineTask.

Just an example, if building an image of one application fails, I do not want that failure to stop building images of 20 other applications.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense - added this to the API WG agenda on Monday so we can decide on the best way forward

@lbernick
Copy link
Member

/assign @abayer @lbernick @pritidesai @dibyom

[TEP-0090: Matrix][tep-0090] proposed executing a `PipelineTask` in
parallel `TaskRuns` and `Runs` with substitutions from combinations
of `Parameters` in a `Matrix`.

In this change, we propose a fail-fast stategy for handling failures
in `TaskRuns` and `Runs` created from a `Matrix`. This makes it easier
to control the execution of the many `TaskRuns` or `Runs` created from
a `Matrix` - up to 256.

This approach also aligns with the default behavior of `Matrix` in
other Continuous Delivery systems e.g [GitHub Actions][ghm-failfast].

If needed, we can explore supporting other failure strategies later.

[tep-0090]: https://github.com/tektoncd/community/blob/main/teps/0090-matrix.md
[ghm-failfast]: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#handling-failures
@jerop jerop force-pushed the tep-0090-failfast branch from c76d569 to b333fa0 Compare June 13, 2022 18:44
@tekton-robot tekton-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 13, 2022
@jerop
Copy link
Member Author

jerop commented Jun 13, 2022

Discussed at the API WG on 06/13 and we agreed on the following:

  • There are use cases for both terminating early (failing fast) and allowing completing execution of running taskruns (stopping mode)
  • Terminating early upon failure is applies more broadly beyond matrix; this is something we can address separately for all taskruns
  • The initial release of matrix will implement the "stopping mode" that the rest of the pipeline uses; we will revisit failure strategies later if needed

/close

@tekton-robot
Copy link
Contributor

@jerop: Closed this PR.

In response to this:

Discussed at the API WG on 06/13 and we agreed on the following:

  • There are use cases for both terminating early (failing fast) and allowing completing execution of running taskruns (stopping mode)
  • Terminating early upon failure is applies more broadly beyond matrix; this is something we can address separately for all taskruns
  • The initial release of matrix will implement the "stopping mode" that the rest of the pipeline uses; we will revisit failure strategies later if needed

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot pushed a commit to tektoncd/pipeline that referenced this pull request Jun 14, 2022
In #4951, we implemented
`isFailure` for matrixed `TaskRuns` where we applied fail-fast
failure strategy. We discussed failure strategies further in this
PR - tektoncd/community#724 - and API WG
on 13 June 2022. We agreed to leave early termination upon failure
out of scope for the initial release of Matrix. We plan to explore
failure strategies, including fail-fast, in future work. And these
failure strategies may apply more broadly beyond Matrix.

In this change, we update `isFailure` to evaluate to `true` only when
there's a failure and there are no running `TaskRuns` in the `rprt`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

Status: UnAssigned

Development

Successfully merging this pull request may close these issues.

6 participants