@sophiatev sophiatev commented Dec 9, 2025

This PR adds the option to use etags when updating the instance table in Azure Storage upon completion of a work item. The behavior is off by default (the reasons are given below). It is meant to detect the following scenario:

  1. Worker A attempts to complete a work item for an orchestration: it commits all outgoing messages, successfully updates the history table, then stalls.
  2. Worker B picks up the next work item, which ends the orchestration, and successfully completes it (updates the history and instance tables to reflect the completed status of the orchestration and deletes all control queue messages for the orchestration).
  3. Worker A then resumes and updates the instance table to have status "Running".

Since the orchestration was completed in step 2 and all of its control queue messages were deleted, there is currently no way to detect this scenario (i.e., no future messages will "retrigger" this orchestration to run), so the instance table is left reporting "Running" for an orchestration that has already completed. The only way to prevent this, as far as I can tell, is to introduce etag usage for the instance table. Then, when worker A attempts to update the instance table in step 3, the update will fail due to an etag mismatch.
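The three steps above can be sketched with a small in-memory model. This is only an illustration of conditional updates with etags, not the code in this PR: `InstanceTable`, `ETagMismatchError`, and `simulate` are hypothetical names, and the integer etag stands in for the opaque etag that Azure Table Storage returns (a real mismatch on an `If-Match` update surfaces as HTTP 412 Precondition Failed).

```python
from dataclasses import dataclass
from typing import Dict, Optional


class ETagMismatchError(Exception):
    """Raised when a conditional update is attempted with a stale etag."""


@dataclass
class Row:
    status: str
    etag: int


class InstanceTable:
    """Hypothetical in-memory stand-in for the instance table."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def insert(self, instance_id: str, status: str) -> None:
        self._rows[instance_id] = Row(status, etag=1)

    def read(self, instance_id: str) -> Row:
        row = self._rows[instance_id]
        return Row(row.status, row.etag)  # return a copy, like a table read

    def update(self, instance_id: str, status: str,
               if_match: Optional[int] = None) -> None:
        row = self._rows[instance_id]
        # With if_match set, the write only succeeds if nobody else has
        # modified the row since the caller read it.
        if if_match is not None and if_match != row.etag:
            raise ETagMismatchError(instance_id)
        row.status = status
        row.etag += 1  # every successful write produces a new etag


def simulate(use_etags: bool) -> str:
    table = InstanceTable()
    table.insert("orch-1", "Running")

    # Step 1: worker A reads the etag with its work item, commits messages
    # and history, then stalls before touching the instance table.
    etag_a = table.read("orch-1").etag

    # Step 2: worker B picks up the next work item and completes the
    # orchestration, which changes the row's etag.
    etag_b = table.read("orch-1").etag
    table.update("orch-1", "Completed", if_match=etag_b if use_etags else None)

    # Step 3: worker A resumes and tries to write "Running" with its stale etag.
    try:
        table.update("orch-1", "Running", if_match=etag_a if use_etags else None)
    except ETagMismatchError:
        pass  # the stale write is rejected instead of overwriting "Completed"

    return table.read("orch-1").status


print(simulate(use_etags=False))  # Running   (stale write wins)
print(simulate(use_etags=True))   # Completed (stale write rejected)
```

Without the etag check the last writer wins, which is exactly the undetectable state described above. With it, worker A's update fails and can be logged or abandoned.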

This new behavior requires a read on the instance table to get the latest instance table etag for every new orchestration work item (assuming extended sessions are not enabled). I ran some performance tests to measure the impact of this additional I/O and found that:

  1. When running 1000 fanout orchestrations in parallel, each of which does a Task.WhenAll on 10 activity calls, the existing code without the instance table etag usage took around 14.5 minutes to complete across 3 trials whereas this new code took around 17.5 minutes.
  2. When running 1000 sequential orchestrations in parallel, each of which sequentially awaits the result of 10 activity calls, the existing code took around 22 minutes to complete across 3 trials and the new code took around 25.
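To put the two results on a common scale, the slowdown can be expressed as a percentage of the baseline. The sketch below uses only the approximate timings quoted above; `relative_overhead` is an illustrative helper, not part of the test harness.

```python
def relative_overhead(baseline_minutes: float, new_minutes: float) -> float:
    """Return the slowdown of the new code as a percentage of the baseline."""
    return (new_minutes - baseline_minutes) / baseline_minutes * 100


# Approximate timings reported in the two tests above (minutes).
fan_out = relative_overhead(14.5, 17.5)
sequential = relative_overhead(22.0, 25.0)

print(f"fan-out:    ~{fan_out:.1f}% slower")     # ~20.7% slower
print(f"sequential: ~{sequential:.1f}% slower")  # ~13.6% slower
```

Both tests add roughly 3 minutes in absolute terms, so the relative cost is larger for the shorter fan-out run.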

Given the negative performance impact of enabling this new etag usage, this PR hides it behind a feature flag in AzureStorageOrchestrationServiceSettings, which is off by default.

Copilot AI review requested due to automatic review settings December 9, 2025 07:15

Copilot AI left a comment

Pull request overview

This PR introduces ETag-based concurrency control for the Azure Storage instance table to prevent race conditions where a stalled worker incorrectly updates instance status after another worker has already completed the orchestration. The implementation uses ETags to ensure that instance table updates fail if the instance has been modified since it was last read.

Key Changes

  • Added OrchestrationETags class to track both instance and history table ETags separately
  • Modified tracking store interfaces and implementations to use ETags when updating the instance table
  • Added comprehensive tests covering both regular orchestration and suborchestration scenarios

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| test/DurableTask.AzureStorage.Tests/AzureStorageScaleTests.cs | Reorganized imports and added two new test methods to verify proper exception handling when stalled workers attempt to update instance table |
| src/DurableTask.Core/OrchestrationState.cs | Added internal Etag property to OrchestrationState class |
| src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs | Updated method signature to use OrchestrationETags and changed return type from Task<ETag?> to Task |
| src/DurableTask.AzureStorage/Tracking/InstanceStoreBackedTrackingStore.cs | Updated UpdateStateAsync to use OrchestrationETags parameter and removed ETag return logic |
| src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs | Updated interface signature for UpdateStateAsync and renamed GetStateAsync to FetchInstanceStatusAsync |
| src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs | Implemented TryUpdateInstanceTableAsync method with ETag-based update logic and split-brain detection logging |
| src/DurableTask.AzureStorage/OrchestrationSessionManager.cs | Updated to use FetchInstanceStatusAsync and propagate instance ETags through message metadata |
| src/DurableTask.AzureStorage/OrchestrationETags.cs | New class to encapsulate both instance and history table ETags |
| src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs | Changed from single ETag property to OrchestrationETags object |
| src/DurableTask.AzureStorage/MessageData.cs | Added MessageMetadata property to store instance ETag information |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Updated to pass OrchestrationETags instead of single ETag to tracking store |

@sophiatev sophiatev merged commit 5eb2643 into main Dec 16, 2025
46 checks passed
@sophiatev sophiatev deleted the stevosyan/add-etag-to-instance-table branch December 16, 2025 02:05