Description
During previous rollouts, we saw that errors towards customers started when the second zone (zone-b) was rolled out by the rollout-operator.
If we had stopped the rollout after the first zone, because of the high error rate we observed on ingesters, we could have avoided the customer impact.
We also need to pause for a minute or so after the first zone to check for errors: we have seen errors start immediately as zone-a comes back online and zone-b is terminated.
Some more thoughts from a meeting where we talked about this issue:
- The problem might not be in the ingesters at all, so they might not even know that something is broken, and the rollout will continue.
- Not sure whether rollout-operator should be able to query and understand Prometheus metrics.
- Something that rollout-operator already does is check whether all pods are ready/healthy, hence another proposal: what if we add another annotation? E.g. `grafana.com/rollout-operator/must-be-healthy: deploy/foo`, which would indicate that `deploy/foo` should have all its pods ready & healthy in order to proceed.
Then we could run a cell-health-check deployment that does the necessary checks (read status from memberlist, run PromQL, etc.) and simply exposes the result through its readiness/healthiness endpoint. We could also use that deployment to export metrics that would unblock the CD process and roll out the next cell. A rough sketch of both pieces follows.
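Below is a minimal sketch, in manifest form, of what the proposal might look like. Everything here is illustrative: the `cell-health-check` name, image, and probe path are assumptions, and the `must-be-healthy` annotation does not exist today.

```yaml
# Ingester StatefulSet gated on the health-check deployment (hypothetical).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester-zone-b
  annotations:
    # Proposed: do not proceed with this zone unless every pod of
    # deploy/cell-health-check is ready & healthy.
    grafana.com/rollout-operator/must-be-healthy: deploy/cell-health-check
# (rest of the StatefulSet spec unchanged)
---
# The deployment doing the actual checks; it only reports readiness
# while the cell looks healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cell-health-check
spec:
  replicas: 1
  selector:
    matchLabels:
      name: cell-health-check
  template:
    metadata:
      labels:
        name: cell-health-check
    spec:
      containers:
        - name: cell-health-check
          image: cell-health-check:latest  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 15
```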
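And a minimal sketch, in Go (rollout-operator's language), of the readiness handler such a deployment could run, using the Prometheus Go client to evaluate a PromQL check. The query, threshold, and `PROMETHEUS_URL` variable are illustrative assumptions, not existing code; the memberlist status check mentioned above could be added alongside the PromQL check in the same handler.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// cellHealthy evaluates the PromQL expression and returns true only when
// no returned series exceeds the given threshold.
func cellHealthy(ctx context.Context, promAPI promv1.API, query string, threshold float64) (bool, error) {
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		return false, err
	}
	vec, ok := result.(model.Vector)
	if !ok {
		return false, fmt.Errorf("unexpected result type %T", result)
	}
	for _, sample := range vec {
		if float64(sample.Value) > threshold {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// PROMETHEUS_URL is an assumed configuration knob for this sketch.
	client, err := api.NewClient(api.Config{Address: os.Getenv("PROMETHEUS_URL")})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	promAPI := promv1.NewAPI(client)

	// Illustrative check: per-pod 5xx rate on ingesters over the last 5 minutes.
	const query = `sum by (pod) (rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m]))`

	// Expose the check result as the readiness endpoint that the kubelet probe
	// (and, via the proposed annotation, rollout-operator) would observe.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 10*time.Second)
		defer cancel()

		healthy, err := cellHealthy(ctx, promAPI, query, 0.05)
		if err != nil || !healthy {
			http.Error(w, "cell is not healthy", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```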
Original authors of this issue: @krajorama, @bboreham, @colega.