Adding Rewind to the SDK Layer #1250
Conversation
cgillum left a comment
Added some comments. Were you planning on adding some tests for this new behavior?
```csharp
isRewinding = true;
if (rewindEvent.ParentTraceContext != null)
{
    startEvent.ParentTraceContext = rewindEvent.ParentTraceContext;
}
```
Why do we change the trace context for the rewound orchestration?
Yeah, this deserves some more explanation. I updated the PR summary with an example.
```csharp
workItem.OrchestrationRuntimeState.AddEvent(message.Event);
// In this case, the ExecutionRewoundEvent has already been added to the history and is just sent
// as a way to trigger the failed deepest suborchestrations to rerun.
// We do not redundantly add it to the history in this situation.
if (!(message.Event is ExecutionRewoundEvent executionRewoundEvent && workItem.OrchestrationRuntimeState.OrchestrationStatus == OrchestrationStatus.Running))
```
What is the OrchestrationStatus.Running check for? I'm having a hard time understanding the reason for this check.
There are two situations (plus another edge case not worth mentioning here) where an orchestration will receive a rewind request. The first is a customer-triggered rewind; in that case the event is genuinely new and we want to add it to the orchestration history. The second is when an orchestration has no failed suborchestrations: it resends the rewind request to itself to "jumpstart" re-execution. (Previously, "generic events" were used for this "jumpstarting" purpose.) This second request exists strictly to force a re-execution, so it shouldn't be added to the history. In the first scenario the orchestration will be in a terminal state, but in the second it will be "Running", which is why that check is there.
I tried to explain this in the comment but let me know if there's a way to make it more clear.
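For concreteness, here is a rough sketch of that check (variable names follow the quoted diff; the wrapper method is hypothetical and this is not the PR's exact code):

```csharp
using DurableTask.Core;
using DurableTask.Core.History;

// Sketch only: decide whether an incoming event should be appended to the history.
// ExecutionRewoundEvent is the new event type introduced by this PR.
static void AddEventIfNotJumpstartRewind(TaskOrchestrationWorkItem workItem, TaskMessage message)
{
    bool isJumpstartRewind =
        message.Event is ExecutionRewoundEvent
        && workItem.OrchestrationRuntimeState.OrchestrationStatus == OrchestrationStatus.Running;

    if (!isJumpstartRewind)
    {
        // Scenario 1: a customer-triggered rewind (the orchestration is in a terminal,
        // failed state). The event is genuinely new, so record it in the history.
        workItem.OrchestrationRuntimeState.AddEvent(message.Event);
    }
    // Scenario 2: the orchestration re-sent the rewind event to itself purely to
    // jumpstart re-execution; it is already in the history, so it is not added again.
}
```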
```csharp
if (runtimeState.ExecutionStartedEvent.TryGetParentTraceContext(out ActivityContext parentTraceContext))
{
    // We set a new client span ID here so that the execution of the rewound suborchestration
    // is not tied to the old parent.
```
This goes back to my other question: why do we want to remove the old trace associations? In some ways, I think of "rewind" as another form of suspend/resume, in which case we don't change the trace ID. I'm wondering why we'd treat rewind any differently.
As above, check out the new PR description.
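For illustration, giving the rewound run a fresh client span ID while staying in the same end-to-end trace could look roughly like this with the System.Diagnostics types (a sketch under my own assumptions, not the code in this PR):

```csharp
using System.Diagnostics;

// Sketch only: keep the original trace ID but generate a new client span ID, so the
// rewound execution is not attached to the span of the old (failed) parent run.
static ActivityContext CreateRewoundTraceContext(ActivityContext parentTraceContext)
{
    return new ActivityContext(
        parentTraceContext.TraceId,        // stay in the same end-to-end trace
        ActivitySpanId.CreateRandom(),     // fresh client span ID for the rewound run
        parentTraceContext.TraceFlags,
        parentTraceContext.TraceState);
}
```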
These are all in the DTS repo, since that will be the only backend that leverages this new rewind strategy for now. (I think they're pretty extensive: testing failed orchestrations with a fanout activity pattern to make sure all failed activities are rerun, testing a fanout pattern with failed suborchestrations as well, testing rewinding with a purged failed suborchestration, etc.)
Responding to the open questions:
1. This should be fine. I don't think it will materially affect anything.
2. I think you'll want to leave it as the time the original ExecutionStartedEvent was created. Otherwise, all the visualization will get messed up since the timestamps will be out of order relative to the rest of the history.
This PR adds a new rewind strategy that will be leveraged by the DTS backend. Rather than manually deleting the failed task/suborchestration rows from the history table (as is done by the current rewind implementation), a new history event, ExecutionRewoundEvent, is introduced that acts in a similar way to other orchestration management actions (for example, termination, suspend, or resume requests). A user's request to rewind the orchestration will send this event to the orchestration. Upon receiving it, the SDK layer (specifically TaskOrchestrationDispatcher) will generate a new history for the orchestration with a new execution ID and the corresponding failed rows removed. It will create ExecutionRewoundEvents for all the failed suborchestrations as well, thereby recursively rewinding to the deepest failed leaves. It is the backend's responsibility to correctly process this new history and handle dispatching the ExecutionRewoundEvents.
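As a rough illustration of the "failed rows removed" step, a simplified sketch is below; the actual TaskOrchestrationDispatcher change is more involved (new execution ID, ExecutionRewoundEvents for failed suborchestrations, etc.), and the helper name is hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;
using DurableTask.Core.History;

// Simplified sketch: build a rewound history by dropping the failure events and the
// scheduling events they correspond to, so those activities/suborchestrations are
// re-dispatched when the orchestration replays.
static List<HistoryEvent> RemoveFailedRows(IReadOnlyList<HistoryEvent> oldHistory)
{
    // EventIds of the scheduling events whose results were failures.
    var failedScheduleIds = new HashSet<int>();
    foreach (HistoryEvent e in oldHistory)
    {
        if (e is TaskFailedEvent taskFailed)
        {
            failedScheduleIds.Add(taskFailed.TaskScheduledId);
        }
        else if (e is SubOrchestrationInstanceFailedEvent subOrchFailed)
        {
            failedScheduleIds.Add(subOrchFailed.TaskScheduledId);
        }
    }

    return oldHistory
        .Where(e => e switch
        {
            TaskFailedEvent _ => false,
            SubOrchestrationInstanceFailedEvent _ => false,
            TaskScheduledEvent scheduled => !failedScheduleIds.Contains(scheduled.EventId),
            SubOrchestrationInstanceCreatedEvent created => !failedScheduleIds.Contains(created.EventId),
            _ => true,
        })
        .ToList();
}
```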
Open questions
1. The new history can end up with an OrchestratorStartedEvent immediately followed by an OrchestratorCompletedEvent if all the events in between got deleted. Are we okay with this?
2. Should we set the Timestamp of the new ExecutionStartedEvent in the new history to be the time it is actually created, or leave it as the time the original ExecutionStartedEvent was created?
Tracing
There are two options for how tracing will look for a rewound orchestration. One is to simply append the rerun Activities/suborchestrations to the existing orchestration span, and the other is to create a new span. (The second option came about because when I enabled distributed tracing, without changing any code yet, I noticed that a new span was being created when the orchestration was rerun. I think this might be because the trace activity for the original span is disposed upon orchestration completion, but I'm not quite sure.) After discussion with team members, we decided to go with option two. This is an example of what it looks like (note that both spans appear under the original "create_orchestration" request):
The first span is the original run of the orchestration with the failed suborchestrations/Activities. The second span is the rewind, which contains only the rerun failed suborchestrations/Activities (note, for example, that "RunSucceedSubOrchestrator" does not appear in the second span, since this suborchestration was successful and not rerun). This felt like a better approach because
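For reference, option two amounts to something like the following with System.Diagnostics (the names are illustrative, not the SDK's actual tracing code): the rewound run gets its own span, parented on the span context recorded for the original create_orchestration request, so both runs appear under that request.

```csharp
using System.Diagnostics;

static class RewindTracingSketch
{
    // Illustrative source name only.
    static readonly ActivitySource Source = new("Sample.DurableTask");

    // Start a separate span for the rewound execution, parented on the span context that
    // was captured when the orchestration was originally created.
    public static Activity? StartRewoundOrchestrationSpan(ActivityContext createOrchestrationContext)
    {
        return Source.StartActivity(
            name: "orchestration (rewound)",
            kind: ActivityKind.Server,
            parentContext: createOrchestrationContext);
    }
}
```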