If stable is not ready a new canary can cause traffic to be still on old RS while its being scaled down #4390

@n1koo

Description

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

When you are in the middle of a canary release (e.g. 70% stable, 30% canary) and another release overwrites it, argo-rollouts tries to start over with the new version (e.g. 95% stable, 5% canary).

During this flip to the new version we check the health of the stable RS in UpdateHash, but there is a specific case where it returns no error yet also does not do what we expect it to.

This causes us to continue in traffic routing and start the scale-down via reconcileOtherReplicaSets().

This causes us to end up in a situation where:

  • DestinationRule is still pointing to old-canary
  • old-canary scale down is triggered
  • new-canary is healthy but not getting traffic
  • after scale down is done, 503/UH due to no endpoints being left
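The sequence above can be sketched in a few lines of Go. This is a hypothetical, simplified illustration of the suspected control flow, not the actual argo-rollouts code; the function names mirror UpdateHash and reconcileOtherReplicaSets but the bodies are assumptions:

```go
package main

import "fmt"

// state is a toy stand-in for the rollout/DestinationRule state.
type state struct {
	stableAvailable bool
	destRuleCanary  string   // subset the DestinationRule currently routes to
	scaledDown      []string // RSs whose scale-down has been triggered
}

// updateHash is meant to repoint the DestinationRule to the new canary,
// but when the stable RS is not fully available it delays the switch and
// returns nil -- the caller cannot tell that nothing actually happened.
func updateHash(s *state, newCanary string) error {
	if !s.stableAvailable {
		fmt.Println("delaying destination rule switch: stable RS not fully available")
		return nil // no error, yet the DestinationRule was NOT updated
	}
	s.destRuleCanary = newCanary
	return nil
}

// reconcileOtherReplicaSets scales down RSs believed to be unreferenced.
func reconcileOtherReplicaSets(s *state, oldCanary string) {
	s.scaledDown = append(s.scaledDown, oldCanary)
}

func simulate() *state {
	s := &state{stableAvailable: false, destRuleCanary: "old-canary"}
	if err := updateHash(s, "new-canary"); err == nil {
		// nil error is taken as "hash updated", so scale-down proceeds
		reconcileOtherReplicaSets(s, "old-canary")
	}
	return s
}

func main() {
	s := simulate()
	// DestinationRule still routes to old-canary, which is being scaled
	// down -> 503/UH once its endpoints are gone.
	fmt.Printf("destRule=%s scaledDown=%v\n", s.destRuleCanary, s.scaledDown)
}
```

Returning a sentinel (or an error) from the delayed branch would let the caller skip the scale-down instead of proceeding on a nil error.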

This eventually fixes itself, e.g. once the stable RS recovers.

To Reproduce

  • Trigger a deployment with Istio in Canary mode, using subsets for traffic routing
  • Let it run to a phase where the rollout is midway (e.g. a 70/30 split)
  • Continuously keep the stable RS from being fully up, e.g. by deleting its pods randomly in a loop
  • Trigger a new deployment

This should show logs like:

  • New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:<new-canary-rs>,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
  • Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:70,ServiceName:,PodTemplateHash:<old-canary-rs>,},Stable:WeightDestination{Weight:30,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
  • delaying destination rule switch: ReplicaSet <stable-rs> not fully available

And the key here is that we do not see:

DestinationRule <rule> subset updated (canary: <new-rs>, stable: <stable-rs>)

Expected behavior

There are two routes we could take

Screenshots

  • 6c89fd is stable
  • 777998 is old-canary
  • 6595cc is new-canary

Version

v1.8.3 of argo-rollouts

Logs

I've had to cut out some logs to reduce noise, but this is the key part. Even though "still referenced" is noted during the original sync, on the next sync the Rollout is updated to point to the new canary even though we haven't updated the DestinationRule subset.

2025-07-31T16:47:58Z: New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:6595cc886,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:6c89fd9477,},Additional:[]WeightDestination{},Verified:nil,}
2025-07-31T16:47:58Z: Skip scale down of older RS 'xxx-7779987b4d': still referenced
2025-07-31T16:47:58Z: Rollout step 1/9 completed (setWeight: 5)
2025-07-31T16:47:58Z: Reconciliation completed
2025-07-31T16:47:58Z: Started syncing rollout
2025-07-31T16:47:58Z: Reconciling 1 old ReplicaSets (total pods: n)
2025-07-31T16:47:58Z: scaling down intermediate RS 'xxx-7779987b4d'
2025-07-31T16:47:58Z: delaying destination rule switch: ReplicaSet xxx-6c89fd9477 not fully available
2025-07-31T16:48:32Z: DestinationRule wealthsimple subset updated (canary: 6595cc886, stable: 6c89fd9477)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
