Skip to content

Conversation

pjiang-dev
Copy link
Contributor

@pjiang-dev pjiang-dev commented May 28, 2025

fixes #4128

When introducing ping-pong support for Istio https://github.com/argoproj/argo-rollouts/pull/3371/files this PR removed validation for canary/stable service for rollouts when using DestinationRule. This caused a deadlock scenario when aborting a rollout that has a replica not ready:

  1. The Istio reconciler won't update the DestinationRule because the ReplicaSet isn't available
  2. The ReplicaSet cleanup logic is waiting for the traffic routing to complete: "delaying destination rule switch: ReplicaSet my-srvc-64b664b6f not fully available"
  3. The rollout gets stuck in "Progressing" state and cannot abort

With this fix:

  1. Check only relevant ReplicaSets: Instead of checking all ReplicaSets for availability, only verify ReplicaSets that will actually receive traffic (stable and canary hashes), ignoring irrelevant ReplicaSets that don't impact traffic routing. This will allow aborts to happen. Abort clears the canary hash with the unavailable RS.
  2. Return nil instead of error: When unavailable ReplicaSets are detected, return nil to delay the DestinationRule update rather than returning an error, allowing the reconciliation process to continue and enabling proper abort functionality.
  3. Added tests for these scenarios where no canary/stable service are defined and there is an abort

This will fix both scenarios for manual abort or progressDeadlineAbort for rollouts.

kubectl argo rollouts get rollout my-srvc -n my-namespace 
Name:            my-srvc
Namespace:       my-namespace
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 2
Strategy:        Canary
  Step:          0/3
  SetWeight:     0
  ActualWeight:  0
Images:          nginx:alpine (stable)
Replicas:
  Desired:       3
  Current:       3
  Updated:       0
  Ready:         3
  Available:     3

NAME                                 KIND        STATUS        AGE  INFO
⟳ my-srvc                            Rollout     ✖ Degraded    34m  
├──# revision:2                                                     
│  └──⧉ my-srvc-64b664b6f            ReplicaSet  • ScaledDown  33m  canary,delay:passed
└──# revision:1                                                     
   └──⧉ my-srvc-7bfd4dd8b6           ReplicaSet  ✔ Healthy     34m  stable
      ├──□ my-srvc-7bfd4dd8b6-5tzqg  Pod         ✔ Running     34m  ready:1/1
      ├──□ my-srvc-7bfd4dd8b6-c2rlx  Pod         ✔ Running     34m  ready:1/1
      └──□ my-srvc-7bfd4dd8b6-sltlg  Pod         ✔ Running     34m  ready:1/1

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@pjiang-dev pjiang-dev changed the title fix fix abort scenario where canary/stable service is not provided fix: abort scenario where canary/stable service is not provided May 28, 2025
Copy link
Contributor

github-actions bot commented May 28, 2025

Published E2E Test Results

  4 files    4 suites   3h 16m 18s ⏱️
115 tests 106 ✅  7 💤 2 ❌
462 runs  432 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit 515799e.

♻️ This comment has been updated with latest results.

Copy link
Contributor

github-actions bot commented May 28, 2025

Published Unit Test Results

2 311 tests   2 311 ✅  3m 1s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit 515799e.

♻️ This comment has been updated with latest results.

Signed-off-by: Peter Jiang <[email protected]>
Signed-off-by: Peter Jiang <[email protected]>
Copy link

@zachaller zachaller merged commit ccf0be9 into argoproj:master May 30, 2025
27 checks passed
zachaller pushed a commit that referenced this pull request Jun 2, 2025
* fix: fix abort scenario where canary/stable service is not provided

Signed-off-by: Peter Jiang <[email protected]>

* remove not needed code

Signed-off-by: Peter Jiang <[email protected]>

* Modularize tests

Signed-off-by: Peter Jiang <[email protected]>

* fix tests

Signed-off-by: Peter Jiang <[email protected]>

* Refactor code to handle only checking relavant RS and return nil

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

---------

Signed-off-by: Peter Jiang <[email protected]>
@zachaller zachaller added the cherry-pick-completed Used once we have cherry picked the PR to all requested releases label Jun 2, 2025
heshamelsherif97 pushed a commit to heshamelsherif97/argo-rollouts that referenced this pull request Jul 6, 2025
…proj#4299)

* fix: fix abort scenario where canary/stable service is not provided

Signed-off-by: Peter Jiang <[email protected]>

* remove not needed code

Signed-off-by: Peter Jiang <[email protected]>

* Modularize tests

Signed-off-by: Peter Jiang <[email protected]>

* fix tests

Signed-off-by: Peter Jiang <[email protected]>

* Refactor code to handle only checking relavant RS and return nil

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

---------

Signed-off-by: Peter Jiang <[email protected]>
Signed-off-by: heshamelsherif97 <[email protected]>
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 21, 2025
… canary on istio but do start scaling down old canaries RS
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 21, 2025
… canary on istio but do start scaling down old canaries RS

Signed-off-by: Niko Kurtti <[email protected]>
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 22, 2025
… canary on istio but do start scaling down old canaries RS

Signed-off-by: Niko Kurtti <[email protected]>
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 22, 2025
…nsitions when stable RS not ready

Fixes issue argoproj#4390 where traffic routing rules could get out of sync with
scaling operations during canary deployment transitions, causing 503/UH errors.

When a new canary deployment starts while stable ReplicaSet is not fully
available, the UpdateHash method now properly returns an error to prevent
the DestinationRule update, ensuring old canary ReplicaSets are not scaled
down while still receiving traffic.

Key changes:
- Enhanced UpdateHash logic to distinguish between no-services and services-defined scenarios
- For rollouts with defined services: only block on stable RS during normal transitions
- For rollouts without services: preserve original argoproj#2507 behavior
- For abort scenarios: allow updates to proceed to avoid deadlock (argoproj#4299)
- Added comprehensive test coverage for all scenarios

Fixes argoproj#4390
Preserves fixes for argoproj#2507 and argoproj#4299

Signed-off-by: Niko Kurtti <[email protected]>
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 22, 2025
…nsitions when stable RS not ready

Fixes issue argoproj#4390 where traffic routing rules could get out of sync with
scaling operations during canary deployment transitions, causing 503/UH errors.

When a new canary deployment starts while stable ReplicaSet is not fully
available, the UpdateHash method now properly returns an error to prevent
the DestinationRule update, ensuring old canary ReplicaSets are not scaled
down while still receiving traffic.

Key changes:
- Enhanced UpdateHash logic to distinguish between no-services and services-defined scenarios
- For rollouts with defined services: only block on stable RS during normal transitions
- For rollouts without services: preserve original argoproj#2507 behavior
- For abort scenarios: allow updates to proceed to avoid deadlock (argoproj#4299)
- Added comprehensive test coverage for all scenarios

Fixes argoproj#4390
Preserves fixes for argoproj#2507 and argoproj#4299

Signed-off-by: Niko Kurtti <[email protected]>
n1koo added a commit to n1koo/argo-rollouts that referenced this pull request Aug 27, 2025
… canary on istio but do start scaling down old canaries RS
Hariharasuthan99 pushed a commit to Hariharasuthan99/argo-rollouts that referenced this pull request Sep 3, 2025
…proj#4299)

* fix: fix abort scenario where canary/stable service is not provided

Signed-off-by: Peter Jiang <[email protected]>

* remove not needed code

Signed-off-by: Peter Jiang <[email protected]>

* Modularize tests

Signed-off-by: Peter Jiang <[email protected]>

* fix tests

Signed-off-by: Peter Jiang <[email protected]>

* Refactor code to handle only checking relavant RS and return nil

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

* update test

Signed-off-by: Peter Jiang <[email protected]>

---------

Signed-off-by: Peter Jiang <[email protected]>
Signed-off-by: phraajes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick/release-1.8 cherry-pick-completed Used once we have cherry picked the PR to all requested releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rollout stuck indefinitely when new pods fail to come up – abort and progressDeadline NOT triggering
2 participants