Description
Feature request
A way of distinguishing an expected failure (e.g. source code does not compile) from any other kind of error (e.g. some external service was unreachable), as well as integration of this distinction into metrics and retries.
Use case
We use Tekton in our company platform to provide generic CI/CD pipelines as a service for other teams, e.g. a "build .net backend" pipeline, and we are running into the following problem:
When a pipeline run fails, this can be either
1. expected behavior, or
2. some problem with external services / Kubernetes, some error in our pipeline / task definition, or anything else.
As the platform team, we need to monitor failures of type 2 independently of type 1, so that we get alarms and can look into why the pipeline run failed. Failures of type 1 are the responsibility of the user of the pipeline.
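The distinction described above can be sketched as a small data model. This is only an illustration of the idea, not an existing Tekton API: `FailureKind`, `TaskFailure`, `route_notification`, and the recipient names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of a failed task."""

    EXPECTED = "expected"      # type 1: e.g. source code does not compile
    UNEXPECTED = "unexpected"  # type 2: external service, cluster, or pipeline definition


@dataclass
class TaskFailure:
    task_name: str
    kind: FailureKind
    message: str


def route_notification(failure: TaskFailure) -> str:
    """Return the (hypothetical) recipient group responsible for a failure."""
    if failure.kind is FailureKind.EXPECTED:
        # Type 1 is the responsibility of the user of the pipeline.
        return "pipeline-user"
    # Everything else should alert the platform team.
    return "platform-team"
```

With such a field available on a failed run, metrics and alarms could be filtered on `FailureKind.UNEXPECTED` alone.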
So, depending on the type of failure, we want to send different notifications to different people and handle the error differently in the pipeline:
On an expected failure in some task, the pipeline should fail right away, and via events we notify the pipeline user and update, for example, a pull request.
If the failure is of the 2nd kind, the usual solution is to rerun the pipeline, and often enough this rerun is successful. However, this is tiresome and we would like the pipeline to do the retries itself.
When thinking about different solutions to this problem, we realized that adjusting the tasks cannot solve every case: if the pod is evicted for some reason, there is no way to fix this from inside the task, so this error needs to be handled by the pipelines controller. Of course, retries for tasks can help in this case, but defining retries is not really an option for tasks where failure (of type 1) is expected, as retrying them wastes time and resources unnecessarily.
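The retry behavior we have in mind could look roughly like the following sketch. It is not how the Tekton controller works today; `ExpectedFailure` and `run_with_retries` are hypothetical names, and a plain Python callable stands in for a task.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ExpectedFailure(Exception):
    """Hypothetical marker for a type 1 failure (e.g. a compile error)."""


def run_with_retries(step: Callable[[], T], max_retries: int) -> T:
    """Run `step`, retrying only failures that are not marked as expected."""
    attempts = 0
    while True:
        try:
            return step()
        except ExpectedFailure:
            # Type 1: retrying would only waste time and resources.
            raise
        except Exception:
            # Type 2: retry until the retry budget is used up.
            if attempts >= max_retries:
                raise
            attempts += 1
```

An expected failure fails the run on the first attempt, while a sporadic error such as an unreachable service is retried up to `max_retries` times.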
The only actual solution we could think of was to write tasks in an idempotent way, define retries, and make the tasks succeed even when a failure of type 1 occurs. We could then indicate this failure as a task result and prevent further tasks from executing via when-expressions. We could use the finally task or events to act on this outcome correctly.
This makes writing pipelines more cumbersome, because of the when-expressions and the result processing in the finally task. As far as we know, there is no way of querying a certain result for all tasks of the pipeline, so we would need to (and must not forget to) change the finally task whenever new tasks are added.
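To make the maintenance problem concrete, here is a simplified model of the workaround. It is plain Python rather than a Tekton definition, and `run_pipeline`, `finally_step`, `CHECKED_TASKS`, and the outcome strings are hypothetical stand-ins for task results, when-expressions, and the finally task.

```python
from typing import Callable, Dict, List, Tuple

# Each task "succeeds" from the runner's point of view and reports an
# outcome string instead, mirroring the task-result workaround.
Task = Tuple[str, Callable[[], str]]


def run_pipeline(tasks: List[Task]) -> Dict[str, str]:
    outcomes: Dict[str, str] = {}
    failed = False
    for name, task in tasks:
        if failed:
            # Plays the role of a when-expression: skip every task that
            # comes after a reported expected failure.
            outcomes[name] = "skipped"
            continue
        outcomes[name] = task()
        failed = outcomes[name] == "expected-failure"
    return outcomes


# The finally step has to name every task whose result it inspects.
CHECKED_TASKS = ["build", "test"]


def finally_step(outcomes: Dict[str, str]) -> str:
    for name in CHECKED_TASKS:
        if outcomes.get(name) == "expected-failure":
            return f"expected failure in {name}"
    return "ok"
```

If a new task is added to the pipeline but not to `CHECKED_TASKS`, its expected failure goes unreported by `finally_step`, which is exactly the kind of mistake we are worried about.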
This approach also makes the visual experience worse, as the Tekton dashboard no longer gives a visual indication of which task failed.
Summing up, we think that a lot of Tekton users probably also have issues with sporadic failures and would benefit from a generally applied retry mechanism to improve pipeline robustness, even though users would have to put in extra effort to write tasks idempotently and to signal expected failures. Especially for teams providing Tekton pipelines as a service for others, it would be really beneficial to have an easy way of monitoring the different kinds of failures, as different people are responsible for resolving them.