Contributor

@sophiatev sophiatev commented Oct 21, 2025

If a process ends right after the history table is updated but before the instance table is updated in a call to AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync for a terminal orchestration, storage is left in an inconsistent state: the instance table shows the orchestration as running, while the history table shows it as completed, with an ExecutionCompletedEvent at the end. For non-terminal orchestrations this is not a problem, because the call to complete the next work item reconciles the two tables. For terminal orchestrations, however, the current implementation of AzureStorageOrchestrationService.LockNextTaskOrchestrationWorkItemAsync recognizes from the history that the orchestration is in a terminal state and simply discards the work item, so the instance table is never updated.

To fix this, when a work item is received for an orchestration in a terminal state, we now check that the instance table has been correctly updated. If it has not, we update it and delete any orphaned blobs.

This PR also adds an integration test for this scenario.
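For readers following along, the reconciliation check described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR: `HistoryEvent`, `InstanceRow`, and `Reconciler` are hypothetical stand-ins for the history and instance tables, only an `ExecutionCompleted` event is treated as terminal, and orphaned blob cleanup is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-ins; none of these types come from DurableTask.AzureStorage.
enum Status { Running, Completed }

record HistoryEvent(string EventType, DateTime Timestamp);

class InstanceRow
{
    public Status RuntimeStatus { get; set; } = Status.Running;
    public DateTime? CompletedTime { get; set; }
}

static class Reconciler
{
    // Simplification: the history is terminal if its last event is ExecutionCompleted.
    public static bool IsTerminal(IReadOnlyList<HistoryEvent> history) =>
        history.Count > 0 && history[history.Count - 1].EventType == "ExecutionCompleted";

    // Called when a work item arrives for an instance whose history is already terminal.
    // Returns true only if the instance row was stale and had to be repaired.
    public static bool ReconcileIfNeeded(IReadOnlyList<HistoryEvent> history, InstanceRow row)
    {
        if (!IsTerminal(history) || row.RuntimeStatus != Status.Running)
        {
            return false; // tables already agree, or the orchestration is still running
        }

        row.RuntimeStatus = Status.Completed;
        row.CompletedTime = history[history.Count - 1].Timestamp;
        return true;
    }
}

static class Program
{
    static void Main()
    {
        var completedAt = new DateTime(2025, 10, 21, 5, 52, 0, DateTimeKind.Utc);
        var history = new List<HistoryEvent>
        {
            new HistoryEvent("ExecutionStarted", completedAt.AddSeconds(-5)),
            new HistoryEvent("ExecutionCompleted", completedAt),
        };

        // Simulates the crash window: history is terminal, instance row still says Running.
        var row = new InstanceRow();

        bool repaired = Reconciler.ReconcileIfNeeded(history, row);
        Console.WriteLine($"repaired={repaired}, status={row.RuntimeStatus}");

        // A second work item for the same instance finds nothing left to repair.
        bool repairedAgain = Reconciler.ReconcileIfNeeded(history, row);
        Console.WriteLine($"repairedAgain={repairedAgain}");
    }
}
```

The point of the sketch is that the check is idempotent: a late work item for an already-reconciled instance is still discarded without touching the instance row.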

Fixes #1252

@sophiatev sophiatev requested a review from cgillum October 21, 2025 05:52
Member

@cgillum cgillum left a comment

I like the test you added! A couple comments:

["ExecutionId"] = executionId,
["LastUpdatedTime"] = runtimeState.Events.Last().Timestamp,
["RuntimeStatus"] = runtimeState.OrchestrationStatus.ToString(),
["CompletedTime"] = runtimeState.Events.Last().Timestamp // do we want to do this as a rough proxy or DateTime.UtcNow?
Member

How about runtimeState.CompletedTime?

One slightly quirky thing about this method is that there's an implicit assumption that the orchestration is complete, but technically somebody could call this for a non-completed orchestration. It might be worth either a) adding some comments indicating that this should only be called for completed orchestrations (maybe even having an explicit check for this) or b) making the implementation defensive so that it works regardless of the orchestration state. In that case, we'd have to check to confirm that the orchestration is complete before attempting to set the "CompletedTime" field.

Contributor Author

Good point. I went with option a): I added comments, renamed the method, and made it do nothing if the orchestration is not in a terminal state.
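As a rough illustration of that choice (not the merged code), a guarded helper might look like the sketch below. `RuntimeStateSketch`, `InstanceTableSketch`, and `TryBuildTerminalUpdate` are hypothetical names, the set of terminal statuses is a simplified assumption, and the property keys mirror the ones quoted in the review above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the orchestration runtime state; not the DurableTask.Core type.
record RuntimeStateSketch(
    string ExecutionId,
    string OrchestrationStatus,
    DateTime LastEventTime,
    DateTime CompletedTime)
{
    // Assumption: these three statuses are treated as terminal for the sketch.
    public bool IsTerminal =>
        OrchestrationStatus is "Completed" or "Failed" or "Terminated";
}

static class InstanceTableSketch
{
    // Intended only for terminal orchestrations. For any other state it
    // returns false and produces an empty update, so callers write nothing.
    public static bool TryBuildTerminalUpdate(
        RuntimeStateSketch state,
        out Dictionary<string, object> update)
    {
        update = new Dictionary<string, object>();
        if (!state.IsTerminal)
        {
            return false;
        }

        update["ExecutionId"] = state.ExecutionId;
        update["LastUpdatedTime"] = state.LastEventTime;
        update["RuntimeStatus"] = state.OrchestrationStatus;
        // CompletedTime is only meaningful once the orchestration has finished.
        update["CompletedTime"] = state.CompletedTime;
        return true;
    }
}

static class GuardDemo
{
    static void Main()
    {
        var t = new DateTime(2025, 10, 24, 21, 38, 0, DateTimeKind.Utc);

        var finished = new RuntimeStateSketch("exec-1", "Completed", t, t);
        bool built = InstanceTableSketch.TryBuildTerminalUpdate(finished, out var update);
        Console.WriteLine($"terminal: built={built}, fields={update.Count}");

        var running = new RuntimeStateSketch("exec-2", "Running", t, default);
        bool builtRunning = InstanceTableSketch.TryBuildTerminalUpdate(running, out var none);
        Console.WriteLine($"running: built={builtRunning}, fields={none.Count}");
    }
}
```

Returning a flag rather than throwing keeps the caller's path simple when a work item turns out to belong to a still-running orchestration.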

@sophiatev sophiatev merged commit d305ba3 into main Oct 24, 2025
44 checks passed
@sophiatev sophiatev deleted the stevosyan/fixing-azure-storage-instance-history-table-inconsistency branch October 24, 2025 21:38
Development

Successfully merging this pull request may close these issues.

Completed orchestrations sometimes get permanently stuck in the Running status