
[FS-280753] - redis lock enabled for decide method #78


Open
wants to merge 30 commits into
base: journey_prestaging
from

Conversation

akashvenkatesan0
Collaborator

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes (Please run ./gradlew generateLock saveLock to refresh dependencies)
  • WHOSUSING.md
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Describe the new behavior from this PR, and why it's needed
Issue #

When executing workflows that use Fork/Join constructs, where multiple sub-workflows run in parallel, a single task (system or custom) is scheduled multiple times with the same attempt number (attempt 0). The issue is easily reproducible in a local environment.
The root cause lies in Conductor's DeciderService (DeciderService#decide(com.netflix.conductor.model.WorkflowModel)), which is responsible for determining and scheduling the next set of tasks by placing them in a queue for task workers to pick up.
The current implementation schedules the next set of tasks based on their statuses retrieved from the workflow entity in the database, which reflects the most recent execution state. Once a task is selected for scheduling, it's marked as executed to prevent duplicate execution.
This decide method is invoked from multiple places: after the workflow is initiated, upon task completion, and by the sweeper service (which handles rescheduling of timed-out tasks). When two tasks within a fork/join structure complete concurrently on separate threads and both invoke the decide method, the same task can be scheduled multiple times with the same attempt number.
The decide method above is called from WorkflowExecutor#decide(java.lang.String), which includes a locking mechanism to prevent such race conditions. We had disabled that lock on the assumption that only sequential workflows would be executed, where this issue does not occur.
After enabling the localOnly lock (already implemented in Conductor for single-instance deployments), the issue no longer reproduces locally. In production environments, we may need to rely on the Redis-based lock (also already implemented, and currently in use for task status updates). The Netflix Conductor community also strongly recommends enabling locking when working with parallel workflows (see link).
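
The race described above can be illustrated with a minimal, self-contained sketch. This is not Conductor code: the class `DecideLockSketch`, its fields, and the workflow id `wf-1` are hypothetical stand-ins, and an in-process `ReentrantLock` keyed by workflow id plays the role that the localOnly or Redis lock would play. A barrier forces both threads to read the "already scheduled" state before either writes it, which reproduces the duplicate scheduling deterministically.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.ReentrantLock;

public class DecideLockSketch {
    // Hypothetical stand-in for the task queue that workers poll.
    private final List<String> queue = new CopyOnWriteArrayList<>();
    // Hypothetical stand-in for the "next task already scheduled" state in the workflow entity.
    private volatile boolean nextTaskScheduled = false;
    // One lock per workflow id (assumption: a single JVM, like a local-only lock).
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Read-then-write decide step; afterRead runs in the window between the read and the write.
    void decide(Runnable afterRead) {
        boolean alreadyScheduled = nextTaskScheduled;
        afterRead.run();
        if (!alreadyScheduled) {
            queue.add("next_task attempt=0");
            nextTaskScheduled = true;
        }
    }

    // Same decide step, serialized per workflow id.
    void decideWithLock(String workflowId) {
        ReentrantLock lock = locks.computeIfAbsent(workflowId, id -> new ReentrantLock());
        lock.lock();
        try {
            decide(() -> {});
        } finally {
            lock.unlock();
        }
    }

    // Runs two concurrent decide calls and returns how many times the task was queued.
    static int scheduledCount(boolean useLock) {
        DecideLockSketch sketch = new DecideLockSketch();
        CyclicBarrier bothHaveRead = new CyclicBarrier(2);
        Runnable call = useLock
                ? () -> sketch.decideWithLock("wf-1")
                : () -> sketch.decide(() -> await(bothHaveRead));
        Thread a = new Thread(call);
        Thread b = new Thread(call);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.queue.size();
    }

    // Waits until both threads have read the state, so neither sees the other's write.
    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without lock: " + scheduledCount(false)); // prints "without lock: 2"
        System.out.println("with lock: " + scheduledCount(true));     // prints "with lock: 1"
    }
}
```

A JVM-local lock like this only serializes callers inside one process, which matches the local reproduction; with several Conductor instances the same guarantee would need a lock shared across processes, which is the reason for the Redis-based lock.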


Fix:
Enabled the Redis lock for the decide method, as suggested by the community.

Alternatives considered

Describe alternative implementation you have considered

narasimhanft
narasimhanft previously approved these changes Jun 11, 2025
ramratanjava
ramratanjava previously approved these changes Jun 23, 2025
Base automatically changed from FS-239161 to journey_prestaging June 25, 2025 06:32
@Ramya-NiranjanKumar Ramya-NiranjanKumar dismissed stale reviews from ramratanjava, logeshkumar-ramar, and narasimhanft June 25, 2025 06:32

The base branch was changed.

5 participants