-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Enable polling for durable jobs #9802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Change return type from Task to Task<DurableJobRunResult> - Update all mock setups to return appropriate DurableJobRunResult - Failed jobs now return DurableJobRunResult.Failed(exception) instead of throwing - All 8 tests passing
- Add test for job that returns PollAfter and eventually completes - Add test for job that polls then fails with retry
8f191aa to
eb71c66
Compare
| /// Creates a result indicating the job completed successfully. | ||
| /// </summary> | ||
| /// <returns>A completed job result.</returns> | ||
| public static DurableJobRunResult Completed() => new(DurableJobRunStatus.Completed, null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a property, IMO. It can return static readonly instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request introduces a polling mechanism for durable jobs by implementing a result-based pattern that allows jobs to signal completion, request status polling, or indicate failure. The changes enable more flexible job execution patterns where jobs can remain in an inline polling loop rather than completing immediately.
- Introduces
DurableJobRunResultclass andDurableJobRunStatusenum to represent job execution outcomes - Updates job execution flow in
ShardExecutorto handle polling loops and structured failure reporting - Adds comprehensive test coverage for polling scenarios
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Orleans.DurableJobs/DurableJobRunResult.cs | New file defining result types for job execution with Completed, PollAfter, and Failed statuses |
| src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs | Updated interface and implementation to return DurableJobRunResult and added CheckJobStatusAsync method for polling |
| src/Orleans.DurableJobs/ShardExecutor.cs | Refactored job execution logic to handle polling loops, structured failures, and unified retry handling |
| src/Orleans.DurableJobs/ShardExecutor.Log.cs | Added logging methods for polling and failed job results |
| test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs | Added comprehensive test coverage for polling behaviors and updated existing tests to use new result pattern |
| public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken) | ||
| { | ||
| // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed) | ||
| // For now, always return Completed | ||
| return Task.FromResult(DurableJobRunResult.Completed()); |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CheckJobStatusAsync method has a TODO comment indicating incomplete implementation. This implementation always returns Completed, which means polling will never continue beyond the first check. Consider either implementing the proper status checking logic before merging, or documenting this limitation more explicitly in the PR description and method documentation, as this could lead to unexpected behavior when jobs request polling.
| public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken) | |
| { | |
| // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed) | |
| // For now, always return Completed | |
| return Task.FromResult(DurableJobRunResult.Completed()); | |
| public async Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken) | |
| { | |
| if (_grain.GrainInstance is IDurableJobHandler handler) | |
| { | |
| try | |
| { | |
| return await handler.CheckJobStatusAsync(context, cancellationToken); | |
| } | |
| catch (Exception ex) | |
| { | |
| LogErrorCheckingJobStatus(ex, context.Job.Id, _grain.GrainId); | |
| return DurableJobRunResult.Failed(ex); | |
| } | |
| } | |
| else | |
| { | |
| LogGrainDoesNotImplementHandler(_grain.GrainId); | |
| throw new InvalidOperationException($"Grain {_grain.GrainId} does not implement IDurableJobHandler"); | |
| } |
| [LoggerMessage(Level = LogLevel.Error, Message = "Error checking status of durable job {JobId} on grain {GrainId}")] | ||
| private partial void LogErrorCheckingJobStatus(Exception exception, string jobId, GrainId grainId); |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logger method LogErrorCheckingJobStatus is defined but never used in the codebase. The CheckJobStatusAsync method has no try-catch block and doesn't invoke this logger. Consider adding error handling in CheckJobStatusAsync to use this logger, or remove the unused logger method definition.
| LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount); | ||
| // Handle failed result through retry policy | ||
| LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name); | ||
| failureException = result.Exception ?? new Exception("Job failed without exception"); |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a generic Exception with a hardcoded message is not ideal for debugging. Consider using a more specific custom exception type (e.g., DurableJobExecutionException) or including contextual information such as job ID and name in the exception message to aid in troubleshooting.
| /// The job will remain in an inline polling loop without being re-queued. | ||
| /// The polling loop will hold a concurrency slot until the job completes or fails. | ||
| /// TODO: Add validation for minimum/maximum poll delays to prevent abuse. | ||
| /// TODO: Consider concurrency slot management for long-running polls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. We could let the job indicate whether the job is:
- Running
- actively executing
- blocks a slot
- poll after short delay
- Waiting
- not actively executing,
- doesn't block a slot
- should be re-queued, potentially on a different shard)
'Waiting' could be (mis-)used to provide periodic execution, like reminders do today, but it would be the same logical 'run', just delayed/rescheduled.
We could call the methods/statuses:
InProgress(TimeSpan estimatedCompletion)Suspended(DateTimeOffset dueTime)
What do you think? In this PR, we only need the first. We might not need the second method at all since it can already be implemented by a combination of scheduling a new run + completing this run.
| /// <returns>A task that represents the asynchronous operation.</returns> | ||
| Task DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken); | ||
| /// <returns>A task that represents the asynchronous operation and contains the job execution result.</returns> | ||
| Task<DurableJobRunResult> DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should be AlwaysInterleave
|
|
||
| public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken) | ||
| { | ||
| // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this to be implemented before merging, right?
| LogRetryingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, retryTime.Value, jobContext.DequeueCount); | ||
| await shard.RetryJobLaterAsync(jobContext, retryTime.Value, cancellationToken); | ||
| // Enter polling loop | ||
| LogPollingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, result.PollAfterDelay!.Value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to get fancy, you can have an IsPending property with [MemberNotNullWhen(true, nameof(PollAfterDelay)] and then you can remove the ! here. Similar for the other properties.
| LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount); | ||
| // Handle failed result through retry policy | ||
| LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name); | ||
| failureException = result.Exception ?? new Exception("Job failed without exception"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is failing without an exception valid? We should enforce that there is an exception of some sort if the job fails. It would be nice if we had union types already so we could make it a cheap, lightweight string instead of an Exception object...
| failureException = result.Exception ?? new Exception("Job failed without exception"); | ||
| } | ||
| } | ||
| catch (Exception ex) when (ex is not TaskCanceledException) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use OperationCanceledException instead
|
|
||
| await Task.Delay(result.PollAfterDelay.Value, cancellationToken); | ||
|
|
||
| result = await target.CheckJobStatusAsync(jobContext, cancellationToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be DeliverDurableJobAsync instead, since they have the same signatures. The implementation can decide whether it needs to start the job or if it already has the job running.
This pull request introduces a new result-based pattern for handling durable job execution in the Orleans.DurableJobs system, enabling jobs to signal completion, request polling for status, or indicate failure. The changes update interfaces, implementations, and tests to support this richer execution model, and add new logging and test coverage for polling scenarios.
The most important changes are:
Durable Job Execution Result Pattern
DurableJobRunResultclass andDurableJobRunStatusenum to represent the outcome of job execution, supporting statuses for completed, poll-after (with delay), and failed (with optional exception). This enables jobs to request polling for their status or signal failure in a structured way. (src/Orleans.DurableJobs/DurableJobRunResult.cs)Interface and Implementation Updates
Updated the
IDurableJobReceiverExtensioninterface and its implementation so thatDeliverDurableJobAsyncand the newCheckJobStatusAsyncmethod return aDurableJobRunResult, allowing the job runner to react to job completion, polling requests, or failures. (src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs) [1] [2] [3]Refactored
ShardExecutor.RunJobAsyncto handle the new result pattern: it now enters a polling loop if the job requests it, removes the job on completion, or handles retries/failures as appropriate. (src/Orleans.DurableJobs/ShardExecutor.cs) [1] [2]Logging Improvements
src/Orleans.DurableJobs/ShardExecutor.Log.cs)Test Enhancements
test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs) [1] [2] [3] [4] [5]These changes collectively provide a more robust and flexible framework for durable job execution, allowing for asynchronous polling and improved error handling.
Microsoft Reviewers: Open in CodeFlow