Enable polling for durable jobs #9802

benjaminpetit · 2025-11-21T15:09:42Z

This pull request introduces a new result-based pattern for handling durable job execution in the Orleans.DurableJobs system, enabling jobs to signal completion, request polling for status, or indicate failure. The changes update interfaces, implementations, and tests to support this richer execution model, and add new logging and test coverage for polling scenarios.

The most important changes are:

Durable Job Execution Result Pattern

Introduced the DurableJobRunResult class and DurableJobRunStatus enum to represent the outcome of job execution, supporting statuses for completed, poll-after (with delay), and failed (with optional exception). This enables jobs to request polling for their status or signal failure in a structured way. (src/Orleans.DurableJobs/DurableJobRunResult.cs)

Interface and Implementation Updates

Updated the IDurableJobReceiverExtension interface and its implementation so that DeliverDurableJobAsync and the new CheckJobStatusAsync method return a DurableJobRunResult, allowing the job runner to react to job completion, polling requests, or failures. (src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs)
Refactored ShardExecutor.RunJobAsync to handle the new result pattern: it now enters a polling loop if the job requests it, removes the job on completion, or handles retries/failures as appropriate. (src/Orleans.DurableJobs/ShardExecutor.cs)

Logging Improvements

Added new logging for polling and failed job results to assist with debugging and monitoring job execution flows. (src/Orleans.DurableJobs/ShardExecutor.Log.cs)

Test Enhancements

Updated and expanded tests to cover the new polling and failure behaviors, including tests for jobs that require polling before completion or failure, and helper methods to simulate these scenarios. (test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs)

These changes collectively provide a more robust and flexible framework for durable job execution, allowing for asynchronous polling and improved error handling.


- Change return type from Task to Task<DurableJobRunResult>
- Update all mock setups to return appropriate DurableJobRunResult
- Failed jobs now return DurableJobRunResult.Failed(exception) instead of throwing
- All 8 tests passing


ReubenBond · 2025-11-21T15:11:29Z

src/Orleans.DurableJobs/DurableJobRunResult.cs

+    /// Creates a result indicating the job completed successfully.
+    /// </summary>
+    /// <returns>A completed job result.</returns>
+    public static DurableJobRunResult Completed() => new(DurableJobRunStatus.Completed, null, null);


This should be a property, IMO. It can return a static readonly instance.
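One way to read this suggestion is sketched below. It is a minimal stand-in, not the PR's actual class: the constructor shape (status, poll delay, exception) is inferred from the `new(DurableJobRunStatus.Completed, null, null)` call in the diff, and the field name `s_completed` is an assumption.

```csharp
#nullable enable
using System;

// Stand-in for the PR's types; only the shape relevant to the comment is shown.
public enum DurableJobRunStatus { Completed, PollAfter, Failed }

public sealed class DurableJobRunResult
{
    // One shared, immutable instance: a completed result carries no per-job data.
    private static readonly DurableJobRunResult s_completed =
        new(DurableJobRunStatus.Completed, null, null);

    private DurableJobRunResult(DurableJobRunStatus status, TimeSpan? pollAfterDelay, Exception? exception)
    {
        Status = status;
        PollAfterDelay = pollAfterDelay;
        Exception = exception;
    }

    public DurableJobRunStatus Status { get; }
    public TimeSpan? PollAfterDelay { get; }
    public Exception? Exception { get; }

    // A property instead of a factory method: every read returns the cached instance.
    public static DurableJobRunResult Completed => s_completed;
}
```

Callers would then write `DurableJobRunResult.Completed` rather than `DurableJobRunResult.Completed()`, and no object is allocated per completed job.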

Copilot

Pull request overview

This pull request introduces a polling mechanism for durable jobs by implementing a result-based pattern that allows jobs to signal completion, request status polling, or indicate failure. The changes enable more flexible job execution patterns where jobs can remain in an inline polling loop rather than completing immediately.

Introduces DurableJobRunResult class and DurableJobRunStatus enum to represent job execution outcomes
Updates job execution flow in ShardExecutor to handle polling loops and structured failure reporting
Adds comprehensive test coverage for polling scenarios

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| src/Orleans.DurableJobs/DurableJobRunResult.cs | New file defining result types for job execution with Completed, PollAfter, and Failed statuses |
| src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs | Updated interface and implementation to return `DurableJobRunResult` and added `CheckJobStatusAsync` method for polling |
| src/Orleans.DurableJobs/ShardExecutor.cs | Refactored job execution logic to handle polling loops, structured failures, and unified retry handling |
| src/Orleans.DurableJobs/ShardExecutor.Log.cs | Added logging methods for polling and failed job results |
| test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs | Added comprehensive test coverage for polling behaviors and updated existing tests to use new result pattern |

Copilot · 2025-11-21T15:19:45Z

src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs

+    public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
+    {
+        // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)
+        // For now, always return Completed
+        return Task.FromResult(DurableJobRunResult.Completed());


The CheckJobStatusAsync method has a TODO comment indicating incomplete implementation. This implementation always returns Completed, which means polling will never continue beyond the first check. Consider either implementing the proper status checking logic before merging, or documenting this limitation more explicitly in the PR description and method documentation, as this could lead to unexpected behavior when jobs request polling.

Suggested change

-    public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
-    {
-        // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)
-        // For now, always return Completed
-        return Task.FromResult(DurableJobRunResult.Completed());
+    public async Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
+    {
+        if (_grain.GrainInstance is IDurableJobHandler handler)
+        {
+            try
+            {
+                return await handler.CheckJobStatusAsync(context, cancellationToken);
+            }
+            catch (Exception ex)
+            {
+                LogErrorCheckingJobStatus(ex, context.Job.Id, _grain.GrainId);
+                return DurableJobRunResult.Failed(ex);
+            }
+        }
+        else
+        {
+            LogGrainDoesNotImplementHandler(_grain.GrainId);
+            throw new InvalidOperationException($"Grain {_grain.GrainId} does not implement IDurableJobHandler");
+        }

Copilot · 2025-11-21T15:19:46Z

src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs

+    [LoggerMessage(Level = LogLevel.Error, Message = "Error checking status of durable job {JobId} on grain {GrainId}")]
+    private partial void LogErrorCheckingJobStatus(Exception exception, string jobId, GrainId grainId);


The logger method LogErrorCheckingJobStatus is defined but never used in the codebase. The CheckJobStatusAsync method has no try-catch block and doesn't invoke this logger. Consider adding error handling in CheckJobStatusAsync to use this logger, or remove the unused logger method definition.

Copilot · 2025-11-21T15:19:46Z

src/Orleans.DurableJobs/ShardExecutor.cs

-                LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount);
+                // Handle failed result through retry policy
+                LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name);
+                failureException = result.Exception ?? new Exception("Job failed without exception");


Creating a generic Exception with a hardcoded message is not ideal for debugging. Consider using a more specific custom exception type (e.g., DurableJobExecutionException) or including contextual information such as job ID and name in the exception message to aid in troubleshooting.
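A minimal sketch of what such an exception could look like follows. The type name comes from the comment above; the constructor, the message format, and the `JobId`/`JobName` properties are assumptions for illustration and are not part of the PR.

```csharp
#nullable enable
using System;

// Hypothetical exception type: the name is taken from the review comment,
// the members and message format are assumptions.
public sealed class DurableJobExecutionException : Exception
{
    public DurableJobExecutionException(string jobId, string jobName, Exception? innerException = null)
        : base($"Durable job '{jobName}' (ID: {jobId}) reported failure without an exception.", innerException)
    {
        JobId = jobId;
        JobName = jobName;
    }

    // Carried as properties so handlers and log enrichers do not have to parse the message.
    public string JobId { get; }
    public string JobName { get; }
}
```

Under these assumptions the fallback in ShardExecutor could become `result.Exception ?? new DurableJobExecutionException(jobContext.Job.Id, jobContext.Job.Name)`, provided both values are strings.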

ReubenBond · 2025-11-21T15:26:35Z

src/Orleans.DurableJobs/DurableJobRunResult.cs

+    /// The job will remain in an inline polling loop without being re-queued.
+    /// The polling loop will hold a concurrency slot until the job completes or fails.
+    /// TODO: Add validation for minimum/maximum poll delays to prevent abuse.
+    /// TODO: Consider concurrency slot management for long-running polls.


This is a good point. We could let the job indicate whether the job is:

Running
- actively executing
- blocks a slot
- poll after short delay

Waiting
- not actively executing
- doesn't block a slot
- should be re-queued (potentially on a different shard)

'Waiting' could be (mis-)used to provide periodic execution, like reminders do today, but it would be the same logical 'run', just delayed/rescheduled.

We could call the methods/statuses:

InProgress(TimeSpan estimatedCompletion)

Suspended(DateTimeOffset dueTime)

What do you think? In this PR, we only need the first. We might not need the second method at all since it can already be implemented by a combination of scheduling a new run + completing this run.
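The proposed split could be sketched roughly as below. Everything here is hypothetical: the `JobRunStatusSketch` and `JobRunResultSketch` names, the record shape, and the `HoldsConcurrencySlot` helper are illustrations of the Running/Waiting distinction described above, not the PR's API.

```csharp
#nullable enable
using System;

// Hypothetical status set illustrating the Running/Waiting distinction.
public enum JobRunStatusSketch { Completed, InProgress, Suspended, Failed }

public sealed record JobRunResultSketch(
    JobRunStatusSketch Status,
    TimeSpan? PollAfter = null,
    DateTimeOffset? DueTime = null)
{
    // Running: still executing, keeps its concurrency slot, poll again after a short delay.
    public static JobRunResultSketch InProgress(TimeSpan estimatedCompletion) =>
        new(JobRunStatusSketch.InProgress, PollAfter: estimatedCompletion);

    // Waiting: not executing, releases its slot, re-queued for dueTime (possibly on another shard).
    public static JobRunResultSketch Suspended(DateTimeOffset dueTime) =>
        new(JobRunStatusSketch.Suspended, DueTime: dueTime);

    // Only an in-progress job keeps occupying a slot in the executor.
    public bool HoldsConcurrencySlot => Status == JobRunStatusSketch.InProgress;
}
```

As the comment notes, only the `InProgress` half is needed for this PR; `Suspended` is shown only to make the contrast concrete.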

ReubenBond · 2025-11-21T15:27:20Z

src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs

-    /// <returns>A task that represents the asynchronous operation.</returns>
-    Task DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken);
+    /// <returns>A task that represents the asynchronous operation and contains the job execution result.</returns>
+    Task<DurableJobRunResult> DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken);


I think these should be AlwaysInterleave

ReubenBond · 2025-11-21T15:28:06Z

src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs


+    public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
+    {
+        // TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)


We need this to be implemented before merging, right?

ReubenBond · 2025-11-21T15:30:44Z

src/Orleans.DurableJobs/ShardExecutor.cs

-                LogRetryingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, retryTime.Value, jobContext.DequeueCount);
-                await shard.RetryJobLaterAsync(jobContext, retryTime.Value, cancellationToken);
+                // Enter polling loop
+                LogPollingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, result.PollAfterDelay!.Value);


If you want to get fancy, you can have an IsPending property with [MemberNotNullWhen(true, nameof(PollAfterDelay))] and then you can remove the ! here. Similar for the other properties.
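A small sketch of the pattern, assuming `PollAfterDelay` is a `TimeSpan?` as the `.Value` access in the diff suggests. The `PollResultSketch` type and the `DelayOrZero` helper are hypothetical; `MemberNotNullWhenAttribute` is the real attribute from `System.Diagnostics.CodeAnalysis`.

```csharp
#nullable enable
using System;
using System.Diagnostics.CodeAnalysis;

// Hypothetical result type reduced to the one member the comment is about.
public sealed class PollResultSketch
{
    private PollResultSketch(TimeSpan? pollAfterDelay) => PollAfterDelay = pollAfterDelay;

    public TimeSpan? PollAfterDelay { get; }

    // Tells the compiler: when IsPending returns true, PollAfterDelay is not null.
    [MemberNotNullWhen(true, nameof(PollAfterDelay))]
    public bool IsPending => PollAfterDelay is not null;

    public static PollResultSketch Completed { get; } = new(null);

    public static PollResultSketch PollAfter(TimeSpan delay) => new(delay);
}

public static class PollResultSketchExtensions
{
    // After the IsPending check, .Value is read without the null-forgiving operator.
    public static TimeSpan DelayOrZero(this PollResultSketch result) =>
        result.IsPending ? result.PollAfterDelay.Value : TimeSpan.Zero;
}
```

With this in place, the logging call in the polling branch should be able to use `result.PollAfterDelay.Value` directly once it is guarded by `result.IsPending`.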

ReubenBond · 2025-11-21T15:36:49Z

src/Orleans.DurableJobs/ShardExecutor.cs

-                LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount);
+                // Handle failed result through retry policy
+                LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name);
+                failureException = result.Exception ?? new Exception("Job failed without exception");


Is failing without an exception valid? We should enforce that there is an exception of some sort if the job fails. It would be nice if we had union types already so we could make it a cheap, lightweight string instead of an Exception object...
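If the answer is that a failure must always carry an exception, the factory method could enforce it at creation time. The sketch below uses a hypothetical `FailedResultSketch` type and is not the PR's code; `ArgumentNullException.ThrowIfNull` is available from .NET 6.

```csharp
#nullable enable
using System;

// Hypothetical failed-result type: the exception is required, so it is non-nullable.
public sealed class FailedResultSketch
{
    private FailedResultSketch(Exception exception) => Exception = exception;

    public Exception Exception { get; }

    public static FailedResultSketch Failed(Exception exception)
    {
        // Reject a null at the point of creation instead of patching it up in the executor.
        ArgumentNullException.ThrowIfNull(exception);
        return new FailedResultSketch(exception);
    }
}
```

With a guard like this, the `?? new Exception("Job failed without exception")` fallback in ShardExecutor would no longer be reachable.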

ReubenBond · 2025-11-21T15:37:09Z

src/Orleans.DurableJobs/ShardExecutor.cs

+                failureException = result.Exception ?? new Exception("Job failed without exception");
            }
        }
+        catch (Exception ex) when (ex is not TaskCanceledException)


Use OperationCanceledException instead
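The reason this matters: `TaskCanceledException` derives from `OperationCanceledException`, and `CancellationToken.ThrowIfCancellationRequested()` throws the base type. A filter on `TaskCanceledException` alone would therefore treat some cancellations as job failures. The helper below is a hypothetical illustration of the broader filter, not code from the PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationFilterSketch
{
    // True when the exception should go through failure/retry handling,
    // false when it represents cancellation.
    public static bool IsFailure(Exception ex) => ex is not OperationCanceledException;

    // Shows that a cancelled token surfaces the base type, not TaskCanceledException.
    public static Exception CaptureCancellation()
    {
        using var cts = new CancellationTokenSource();
        cts.Cancel();
        try
        {
            cts.Token.ThrowIfCancellationRequested();
        }
        catch (Exception ex)
        {
            return ex;
        }

        throw new InvalidOperationException("Expected cancellation to throw.");
    }
}
```

In the executor, the equivalent change would be `catch (Exception ex) when (ex is not OperationCanceledException)`.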

ReubenBond · 2025-11-21T15:38:59Z

src/Orleans.DurableJobs/ShardExecutor.cs

+
+                await Task.Delay(result.PollAfterDelay.Value, cancellationToken);
+
+                result = await target.CheckJobStatusAsync(jobContext, cancellationToken);


This could be DeliverDurableJobAsync instead, since they have the same signatures. The implementation can decide whether it needs to start the job or if it already has the job running.
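A stripped-down sketch of that idea: one delegate stands in for `DeliverDurableJobAsync` and is used for both the first delivery and every subsequent poll. The `PollStep` type and `PollingLoopSketch` class are hypothetical; the real loop in ShardExecutor also handles failures and retries, which are omitted here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified outcome: either done, or "ask me again after Delay".
public readonly record struct PollStep(bool Done, TimeSpan Delay);

public static class PollingLoopSketch
{
    // Calls the same 'deliver' delegate until it reports Done.
    // The callee decides whether to start the job or report on one already running.
    // Returns the number of calls made.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task<PollStep>> deliver,
        CancellationToken cancellationToken)
    {
        var calls = 1;
        var step = await deliver(cancellationToken);
        while (!step.Done)
        {
            // The loop stays inline (no re-queue) and waits for the requested delay.
            await Task.Delay(step.Delay, cancellationToken);
            step = await deliver(cancellationToken);
            calls++;
        }

        return calls;
    }
}
```

The trade-off is that the handler has to be idempotent with respect to delivery: a second call for the same job must not start the work again.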

Copilot AI review requested due to automatic review settings November 21, 2025 15:09

Copilot started reviewing on behalf of benjaminpetit November 21, 2025 15:10

benjaminpetit added 5 commits November 21, 2025 16:10

dca29cc Add DurableJobRunResult class with Completed, PollAfter, and Failed states
be3acec Update IDurableJobReceiverExtension to return DurableJobRunResult
7ecd903 Implement inline polling loop in ShardExecutor with unified error handling
eb71c66 Add polling behavior tests for ShardExecutor
    - Add test for job that returns PollAfter and eventually completes
    - Add test for job that polls then fails with retry

benjaminpetit force-pushed the dj/pollafter branch from 8f191aa to eb71c66 November 21, 2025 15:10

Copilot finished reviewing on behalf of benjaminpetit November 21, 2025 15:10
