Description
During previous rollouts, we saw that errors towards customers started when the second zone (zone-b) was rolled out by the rollout-operator.
If we had stopped the rollout after the first zone, because of the high error rate we observed on ingesters, we could have avoided the customer impact.
We also need to pause for a minute or so after the first zone to check for errors: we have seen errors start immediately as zone-a comes back online and zone-b is terminated.
Some more thoughts from a meeting where we talked about this issue:
- The problem might not be in the ingesters at all, so they might not even know that something is broken, and the rollout will continue.
- Not sure whether rollout-operator should be able to query and understand Prometheus metrics.
- Something that rollout-operator already does is check whether all pods are ready/healthy, hence another proposal: what if we add another annotation? E.g. `grafana.com/rollout-operator/must-be-healthy: deploy/foo`, which would indicate that `deploy/foo` should have all its pods ready & healthy in order to proceed.
Then we could run a cell-health-check deployment that does the necessary checks (read status from memberlist, run PromQL, etc.) and simply exposes the result through its readiness/healthiness endpoint. We could also use that deployment to export metrics that would unblock the CD process and roll out the next cell. A rough sketch of both pieces follows.
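Below is a minimal sketch, in manifest form, of what the proposal might look like. Everything here is illustrative: the `cell-health-check` name, image, and probe path are assumptions, and the `must-be-healthy` annotation does not exist today.

```yaml
# Ingester StatefulSet gated on the health-check deployment (hypothetical).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester-zone-b
  annotations:
    # Proposed: do not proceed with this zone unless every pod of
    # deploy/cell-health-check is ready & healthy.
    grafana.com/rollout-operator/must-be-healthy: deploy/cell-health-check
# (rest of the StatefulSet spec unchanged)
---
# The deployment doing the actual checks; it only reports readiness
# while the cell looks healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-health-check
spec:
  replicas: 1
  selector:
    matchLabels:
      name: cell-health-check
  template:
    metadata:
      labels:
        name: cell-health-check
    spec:
      containers:
        - name: cell-health-check
          image: cell-health-check:latest  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 15
```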
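And a minimal sketch, in Go (rollout-operator's language), of the readiness handler such a deployment could run, using the Prometheus Go client to evaluate a PromQL check. The query, threshold, and `PROMETHEUS_URL` variable are illustrative assumptions, not existing code; the memberlist status check mentioned above could be added alongside the PromQL check in the same handler.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// cellHealthy evaluates the PromQL expression and returns true only when
// no returned series exceeds the given threshold.
func cellHealthy(ctx context.Context, promAPI promv1.API, query string, threshold float64) (bool, error) {
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		return false, err
	}
	vec, ok := result.(model.Vector)
	if !ok {
		return false, fmt.Errorf("unexpected result type %T", result)
	}
	for _, sample := range vec {
		if float64(sample.Value) > threshold {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// PROMETHEUS_URL is an assumed configuration knob for this sketch.
	client, err := api.NewClient(api.Config{Address: os.Getenv("PROMETHEUS_URL")})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	promAPI := promv1.NewAPI(client)

	// Illustrative check: per-pod 5xx rate on ingesters over the last 5 minutes.
	const query = `sum by (pod) (rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m]))`

	// Expose the check result as the readiness endpoint that the kubelet probe
	// (and, via the proposed annotation, rollout-operator) would observe.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()

		healthy, err := cellHealthy(ctx, promAPI, query, 0.05)
		if err != nil || !healthy {
			http.Error(w, "cell is not healthy", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```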
Original authors of this issue: @krajorama, @bboreham, @colega.