-
Notifications
You must be signed in to change notification settings - Fork 320
Adding ETag Usage to the Instance Table #1280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces ETag-based concurrency control for the Azure Storage instance table to prevent race conditions where a stalled worker incorrectly updates instance status after another worker has already completed the orchestration. The implementation uses ETags to ensure that instance table updates fail if the instance has been modified since it was last read.
Key Changes
- Added
OrchestrationETagsclass to track both instance and history table ETags separately - Modified tracking store interfaces and implementations to use ETags when updating the instance table
- Added comprehensive tests covering both regular orchestrations and suborchestrations scenarios
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
test/DurableTask.AzureStorage.Tests/AzureStorageScaleTests.cs |
Reorganized imports and added two new test methods to verify proper exception handling when stalled workers attempt to update instance table |
src/DurableTask.Core/OrchestrationState.cs |
Added internal Etag property to OrchestrationState class |
src/DurableTask.AzureStorage/Tracking/TrackingStoreBase.cs |
Updated method signature to use OrchestrationETags and changed return type from Task<ETag?> to Task |
src/DurableTask.AzureStorage/Tracking/InstanceStoreBackedTrackingStore.cs |
Updated UpdateStateAsync to use OrchestrationETags parameter and removed ETag return logic |
src/DurableTask.AzureStorage/Tracking/ITrackingStore.cs |
Updated interface signature for UpdateStateAsync and renamed GetStateAsync to FetchInstanceStatusAsync |
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs |
Implemented TryUpdateInstanceTableAsync method with ETag-based update logic and split-brain detection logging |
src/DurableTask.AzureStorage/OrchestrationSessionManager.cs |
Updated to use FetchInstanceStatusAsync and propagate instance ETags through message metadata |
src/DurableTask.AzureStorage/OrchestrationETags.cs |
New class to encapsulate both instance and history table ETags |
src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs |
Changed from single ETag property to OrchestrationETags object |
src/DurableTask.AzureStorage/MessageData.cs |
Added MessageMetadata property to store instance ETag information |
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs |
Updated to pass OrchestrationETags instead of single ETag to tracking store |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This PR introduces the ability to use etags when attempting to update the instance table in Azure Storage upon completion of a work item. This behavior will be "off" by default (elaborated on below). This is to help detect the following scenario.
Since the orchestration was completed in step 2 and all control queue messages for it deleted, there is no way to detect this scenario (i.e., no future messages will "retrigger" this orchestration to run). The only way to prevent this from happening, as far as I can tell, is to introduce etag usage for the instance table. Then, when worker A attempts to update the instance table in step 3, it will fail due to an etag mismatch.
This new behavior would require doing a read on the instance table to get the latest instance table etag for every new orchestration work item (assuming extended sessions are not enabled). After running some performance tests to validate the impact of this new I/O, I discovered that:
Task.WhenAllon 10 activity calls, the existing code without the instance table etag usage took around 14.5 minutes to complete across 3 trials whereas this new code took around 17.5 minutes.Given the negative performance impact of enabling this new etag usage, this PR hides it behind a feature flag in the
AzureStorageOrchestrationServiceSettingswhich is off by default.