Description
Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When you are in the middle of a release with Canary (e.g. 70% stable, 30% canary) but another release overwrites it, argo-rollouts tries to start over with the new version (e.g. 95% stable, 5% canary).
During this flip to the new version we check the health of stable in `UpdateHash`, but there is a specific case where it doesn't return an error yet also doesn't do what we thought it did. This causes us to continue in trafficrouting and to start the scale down with `reconcileOtherReplicaSets()`.
This causes us to end up in a situation where:
- DestinationRule is still pointing to old-canary
- old-canary scale down is triggered
- new-canary is healthy but not getting traffic
- after scale down is done, 503/UH due to no endpoints being left
This eventually gets fixed, e.g. as the stable RS recovers.
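The failure mode above can be sketched as follows. This is a minimal, hypothetical Go sketch, not the actual argo-rollouts code: the real `UpdateHash` and reconcile functions take rollout and ReplicaSet objects, but the essential control flow is that a "delayed" switch still returns a nil error, so the caller proceeds to scale down.

```go
package main

import "fmt"

// updateHash is a heavily simplified, hypothetical sketch of the bug:
// when the stable RS is not fully available it logs a delay but returns
// a nil error, so the caller cannot tell the DestinationRule was NOT
// actually switched to the new canary.
func updateHash(stableFullyAvailable bool) (switched bool, err error) {
	if !stableFullyAvailable {
		fmt.Println("delaying destination rule switch: ReplicaSet not fully available")
		return false, nil // nil error despite having done nothing
	}
	// ... patch DestinationRule subsets to point at the new canary ...
	return true, nil
}

// reconcile sketches the caller: a nil error means reconciliation
// continues into the scale-down path even though traffic is still
// routed to the old canary.
func reconcile(stableFullyAvailable bool) {
	if _, err := updateHash(stableFullyAvailable); err != nil {
		return // would be retried on the next sync
	}
	fmt.Println("reconcileOtherReplicaSets: scaling down old-canary")
}

func main() {
	// Stable RS degraded: the switch is skipped, but scale-down of the
	// old canary still happens, leaving traffic with no endpoints.
	reconcile(false)
}
```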
To Reproduce
- Trigger a deployment with `Istio` and `Canary` mode. Use `subset` for traffic routing
- Let it run to a phase where the rollout is midway (e.g. a 70/30 split)
- Continuously cause the stable RS to be not fully up, e.g. by deleting pods randomly in a loop
- Trigger a new deployment
This should show logs like:
```
New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:<new-canary-rs>,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:70,ServiceName:,PodTemplateHash:<old-canary-rs>,},Stable:WeightDestination{Weight:30,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
delaying destination rule switch: ReplicaSet <stable-rs> not fully available
```
And the key here is that we do not see:
```
DestinationRule <rule> subset updated (canary: <new-rs>, stable: <stable-rs>)
```
Expected behavior
There are two routes we could take:
- Delay the whole process until `stable` is fully healthy, e.g. return an error from `updateHash()`
  - This would essentially revert fix: abort scenario where canary/stable service is not provided #4299
- Ignore `stable` not being fully up and continue in cases where there is already a stable+canary pair in existence
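The two options above can be sketched side by side. Again, this is a hypothetical simplification with made-up signatures, not the real argo-rollouts functions:

```go
package main

import (
	"errors"
	"fmt"
)

// Option 1 (hypothetical sketch): fail fast so the whole sync is
// retried until stable is fully healthy. This effectively reverts the
// behavior introduced by #4299.
func updateHashStrict(stableName string, stableFullyAvailable bool) error {
	if !stableFullyAvailable {
		return fmt.Errorf("delaying destination rule switch: ReplicaSet %s not fully available", stableName)
	}
	// ... patch DestinationRule subsets ...
	return nil
}

// Option 2 (hypothetical sketch): if a stable+canary pair already
// exists, proceed with the subset update even though stable is
// temporarily degraded, so traffic is never left pointing at a
// ReplicaSet that is about to be scaled down.
func updateHashLenient(stableFullyAvailable, pairAlreadyExists bool) error {
	if !stableFullyAvailable && !pairAlreadyExists {
		return errors.New("delaying destination rule switch: no existing stable/canary pair")
	}
	// ... patch DestinationRule subsets ...
	return nil
}

func main() {
	// Option 1 blocks while stable is degraded; option 2 continues
	// because an old stable/canary pair already exists.
	fmt.Println(updateHashStrict("stable-rs", false))
	fmt.Println(updateHashLenient(false, true))
}
```

Either way, the caller can then make a consistent decision: retry later, or switch the DestinationRule before any scale-down is triggered.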
Screenshots


- `6c89fd` is stable
- `777998` is old-canary
- `6595cc` is new-canary
Version
v1.8.3 of argo-rollouts
Logs
I've had to cut out some logs to reduce noise, but this is the key part. Even though `still referenced` is noticed during the original run, on the next sync the Rollout is updated to point to the new canary even though we haven't updated the `DestinationRule` for the subset.
```
2025-07-31T16:47:58Z: New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:6595cc886,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:6c89fd9477,},Additional:[]WeightDestination{},Verified:nil,}
2025-07-31T16:47:58Z: Skip scale down of older RS 'xxx-7779987b4d': still referenced
2025-07-31T16:47:58Z: Rollout step 1/9 completed (setWeight: 5)
2025-07-31T16:47:58Z: Reconciliation completed
2025-07-31T16:47:58Z: Started syncing rollout
2025-07-31T16:47:58Z: Reconciling 1 old ReplicaSets (total pods: n)
2025-07-31T16:47:58Z: scaling down intermediate RS 'xxx-7779987b4d'
2025-07-31T16:47:58Z: delaying destination rule switch: ReplicaSet xxx-6c89fd9477 not fully available
2025-07-31T16:48:32Z: DestinationRule wealthsimple subset updated (canary: 6595cc886, stable: 6c89fd9477)
```
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.