Adding Rewind to the SDK Layer #1250

sophiatev · 2025-10-15T18:26:36Z

This PR adds a new rewind strategy that will be leveraged by the DTS backend. Rather than manually deleting the failed task/suborchestration rows from the history table (as is done by the current rewind implementation), a new history event ExecutionRewoundEvent is introduced that acts in a similar way to other orchestration management actions (for example, termination, suspend, or resume requests). A user's request to rewind the orchestration will send this event to the orchestration. Upon receiving it, the SDK layer (specifically TaskOrchestrationDispatcher) will generate a new history for the orchestration with a new execution ID and the corresponding failed rows removed. It will create ExecutionRewoundEvents for all the failed suborchestrations as well, thereby recursively rewinding to the deepest failed leaves. It is the backend's responsibility to correctly process this new history and handle dispatching the ExecutionRewoundEvents.
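The history rewrite described above can be sketched roughly as follows. This is a simplified illustration, not the code in TaskOrchestrationDispatcher: `RowKind`, `HistoryRow`, and `RewindSketch` are hypothetical stand-ins for the real DurableTask.Core types, and the exact set of rows that gets dropped (here: the failure rows, the scheduling rows of failed tasks, and the terminal completion row) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified row kinds; NOT the DurableTask.Core event types.
public enum RowKind
{
    ExecutionStarted, TaskScheduled, TaskCompleted, TaskFailed,
    SubOrchestrationCreated, SubOrchestrationFailed, ExecutionCompleted,
}

// Id is the scheduling ID: a TaskFailed row with Id 1 refers to the TaskScheduled row with Id 1.
public sealed record HistoryRow(RowKind Kind, int Id = -1, string ExecutionId = "", string SubInstanceId = "");

public static class RewindSketch
{
    // Builds a new history (the old one is left untouched) and returns the IDs of the
    // failed suborchestration instances that should each receive their own rewind event.
    public static (List<HistoryRow> NewHistory, List<string> SubInstancesToRewind) Rewind(
        IReadOnlyList<HistoryRow> oldHistory, string newExecutionId)
    {
        // Scheduling IDs of every task or suborchestration that failed.
        HashSet<int> failedIds = oldHistory
            .Where(r => r.Kind is RowKind.TaskFailed or RowKind.SubOrchestrationFailed)
            .Select(r => r.Id)
            .ToHashSet();

        var newHistory = new List<HistoryRow>();
        var subInstancesToRewind = new List<string>();

        foreach (HistoryRow row in oldHistory)
        {
            switch (row.Kind)
            {
                case RowKind.ExecutionStarted:
                    // The rewound run gets a fresh execution ID.
                    newHistory.Add(row with { ExecutionId = newExecutionId });
                    break;
                case RowKind.TaskFailed:
                case RowKind.SubOrchestrationFailed:
                case RowKind.ExecutionCompleted:
                    // Failure rows and the terminal row are dropped.
                    break;
                case RowKind.TaskScheduled when failedIds.Contains(row.Id):
                    // Assumption: dropped so that replay schedules the task again.
                    break;
                case RowKind.SubOrchestrationCreated when failedIds.Contains(row.Id):
                    // Assumption: kept, because the child instance is rewound in place.
                    newHistory.Add(row);
                    subInstancesToRewind.Add(row.SubInstanceId);
                    break;
                default:
                    newHistory.Add(row);
                    break;
            }
        }

        return (newHistory, subInstancesToRewind);
    }
}
```

In this sketch the caller would send one rewind event per entry in `SubInstancesToRewind`; each child then applies the same rewrite to its own history, which is how the rewind reaches the deepest failed leaves.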

Open questions

In the way that the SDK currently alters the history, we could end up with an OrchestratorStartedEvent immediately followed by an OrchestratorCompletedEvent if all the events in between got deleted. Are we okay with this?
Do we want to update the Timestamp of the new ExecutionStartedEvent in the new history to be the time it is actually created, or leave it as the time the original ExecutionStartedEvent was created?

Tracing

There are two options for how tracing will look for a rewound orchestration. One is to append the rerun Activities/suborchestrations to the existing orchestration span; the other is to create a new span. (The second option came about because, when I enabled distributed tracing without changing any code, I noticed that a new span was being created when the orchestration was rerun. I think this might be because the trace activity for the original span is disposed upon orchestration completion, but I'm not sure.) After discussion with team members, we decided to go with option two. This is an example of what it looks like (note that both spans appear under the original "create_orchestration" request):

The first span is the original run of the orchestration with the failed suborchestrations/Activities. The second span is the rewind, which contains only the rerun failed suborchestrations/Activities (note, for example, that "RunSucceedSubOrchestrator" does not appear in the second span, since this suborchestration was successful and was not rerun). This felt like a better approach because:

It is very easy to see what exactly is rerun when the orchestration is rewound
The duration of the second span reflects the time it takes the rewound orchestration to complete
In some sense, rewinding the orchestration is sort of like "re-executing" it, just with a subset of the original suborchestrations/Activities
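As a rough illustration of option two, the sketch below uses the standard System.Diagnostics APIs to start two separate spans under the same parent request context. It is not the tracing code from this PR: the source name and span names are hypothetical, and the parent context is generated randomly where the real code would take it from the original "create_orchestration" request.

```csharp
using System;
using System.Diagnostics;

public static class RewindTracingSketch
{
    // Hypothetical source name, used only for this example.
    private static readonly ActivitySource Source = new ActivitySource("RewindTracingSketch");

    public static (Activity Original, Activity Rewind) RunTwice()
    {
        // Without a listener that samples the source, StartActivity returns null.
        using var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);

        // Stand-in for the trace context of the original client request.
        var request = new ActivityContext(
            ActivityTraceId.CreateRandom(), ActivitySpanId.CreateRandom(), ActivityTraceFlags.Recorded);

        // First span: the original run, which ends when the orchestration fails.
        Activity original = Source.StartActivity("orchestration", ActivityKind.Server, request)
            ?? throw new InvalidOperationException("span was not sampled");
        original.Stop();

        // Second span: the rewind, started fresh under the same parent request.
        Activity rewind = Source.StartActivity("orchestration (rewind)", ActivityKind.Server, request)
            ?? throw new InvalidOperationException("span was not sampled");
        rewind.Stop();

        return (original, rewind);
    }
}
```

Both spans share the trace ID and parent span ID of the request but have their own span IDs and durations, which is what makes the rerun work easy to tell apart in a trace viewer.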

cgillum

Added some comments. Were you planning on adding some tests for this new behavior?

src/DurableTask.Core/History/ExecutionRewoundEvent.cs

src/DurableTask.Core/OrchestrationStatus.cs

cgillum · 2025-10-23T22:20:10Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

+                isRewinding = true;
+                if (rewindEvent.ParentTraceContext != null)
+                {
+                    startEvent.ParentTraceContext = rewindEvent.ParentTraceContext;


Why do we change the trace context for the rewound orchestration?

sophiatev · 2025-10-24

Yeah, this deserves some more explanation. I updated the PR summary with an example.

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

cgillum · 2025-10-23T22:26:26Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

-                workItem.OrchestrationRuntimeState.AddEvent(message.Event);
+                // In this case, the ExecutionRewoundEvent has already been added to the history and is just sent as a way to trigger the failed deepest suborchestrations to rerun.
+                // We do not redundantly add it to the history in this situation.
+                if (!(message.Event is ExecutionRewoundEvent executionRewoundEvent && workItem.OrchestrationRuntimeState.OrchestrationStatus == OrchestrationStatus.Running))


What is the OrchestrationStatus.Running check for? I'm having a hard time understanding the reason for this check.

sophiatev · 2025-10-24

There are two situations (plus an edge case not worth going into here) in which an orchestration receives a rewind request. The first is a customer-triggered rewind. Here the request is a genuinely new history event that we want to add to the orchestration history. The second is when an orchestration has no failed suborchestrations: it resends itself the rewind request to "jumpstart" re-execution. (Previously, "generic events" were used for this "jumpstarting" purpose.) This second request exists strictly to force a re-execution, so it shouldn't be added to the history. In the first scenario the orchestration will be in a terminal state, but in the second it will be "running", which is why that check is there.

I tried to explain this in the comment but let me know if there's a way to make it more clear.
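To make the two scenarios concrete, here is a minimal sketch of the same decision with the dispatcher stripped away. `SketchStatus`, `SketchEvent`, and its subtypes are hypothetical stand-ins, not the DurableTask.Core types; only the shape of the condition mirrors the diff above.

```csharp
using System;

// Hypothetical stand-ins for the orchestration status and history event types.
public enum SketchStatus { Running, Completed, Failed, Terminated }

public abstract record SketchEvent;
public sealed record SketchRewoundEvent : SketchEvent;
public sealed record SketchOtherEvent(string Name) : SketchEvent;

public static class RewindMessageSketch
{
    // A rewind event that reaches an already-running orchestration is the self-sent
    // "jumpstart" message: it is already in the history, so it is not added again.
    // Every other incoming event, including a customer-triggered rewind of an
    // orchestration in a terminal state, is recorded.
    public static bool ShouldAddToHistory(SketchEvent incoming, SketchStatus currentStatus)
        => !(incoming is SketchRewoundEvent && currentStatus == SketchStatus.Running);
}
```

For example, `ShouldAddToHistory(new SketchRewoundEvent(), SketchStatus.Failed)` is true (customer-triggered rewind), while the same event with `SketchStatus.Running` yields false (jumpstart).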

cgillum · 2025-10-23T22:31:37Z

src/DurableTask.Core/TaskOrchestrationDispatcher.cs

+                        if (runtimeState.ExecutionStartedEvent.TryGetParentTraceContext(out ActivityContext parentTraceContext))
+                        {
+                            // We set a new client span ID here so that the execution of the rewound suborchestration is not tied to the 
+                            // old parent.


This goes back to my other question: why do we want to remove the old trace associations? In some ways, I think of "rewind" as another form of suspend/resume, in which case we don't change the trace ID. I'm wondering why we'd treat rewind any differently.

sophiatev · 2025-10-24

As above, check out the new PR description.

sophiatev · 2025-10-24T21:08:04Z

Were you planning on adding some tests for this new behavior?

These are all in the DTS repo, since that will be the only backend for now that leverages this new rewind strategy. I think they're pretty extensive: they test failed orchestrations with a fan-out activity pattern to make sure all failed activities are rerun, a fan-out pattern with failed suborchestrations, rewinding with a purged failed suborchestration, and so on.

cgillum · 2025-10-30T15:50:51Z

Responding to the open questions:

In the way that the SDK currently alters the history, we could end up with an OrchestratorStartedEvent immediately followed by an OrchestratorCompletedEvent if all the events in between got deleted. Are we okay with this?

This should be fine. I don't think it will materially affect anything.

Do we want to update the Timestamp of the new ExecutionStartedEvent in the new history to be the time it is actually created, or leave it as the time the original ExecutionStartedEvent was created?

I think you'll want to leave it as the time the original ExecutionStartedEvent was created. Otherwise, all the visualization will get messed up since the timestamps will be out of order relative to the rest of the history.

Sophia Tevosyan added 6 commits September 9, 2025 14:34

committing what i have for now

b1f485f

finishing up the initial implementation

86954ac

Merge branch 'main' into stevosyan/adding-new-rewind-strategy

aa2d518

everything is working

f0b18e7

removed unused usings

0de4fe6

Merge branch 'main' into stevosyan/adding-new-rewind-strategy

9857a01

sophiatev marked this pull request as ready for review October 15, 2025 19:12

sophiatev requested a review from cgillum October 15, 2025 19:12

Sophia Tevosyan added 4 commits October 17, 2025 11:34

fixed distributed tracing

3ee0a73

added some comments

43e26c7

added changes to avoid altering the original runtime state

5ab8370

changed a comment

14fe651

cgillum reviewed Oct 23, 2025

Sophia Tevosyan added 2 commits October 24, 2025 13:59

addressing PR comments

8a6f075

fixing the build errors

feb1ac7

fixing another build error

b462b9d

cgillum approved these changes Oct 30, 2025

Sophia Tevosyan added 4 commits October 31, 2025 13:53

updating a comment

4b3660f

Merge branch 'main' into stevosyan/adding-new-rewind-strategy

5ff64c1

updating package versions

6a14145

updated patch version not minor version

9eaa09a

bachuv approved these changes Nov 3, 2025

sophiatev merged commit 8576bc5 into main Nov 3, 2025
44 checks passed

sophiatev deleted the stevosyan/adding-new-rewind-strategy branch November 3, 2025 19:13
