Skip to content

Conversation

@benjaminpetit
Copy link
Member

@benjaminpetit benjaminpetit commented Nov 21, 2025

This pull request introduces a new result-based pattern for handling durable job execution in the Orleans.DurableJobs system, enabling jobs to signal completion, request polling for status, or indicate failure. The changes update interfaces, implementations, and tests to support this richer execution model, and add new logging and test coverage for polling scenarios.

The most important changes are:

Durable Job Execution Result Pattern

  • Introduced the DurableJobRunResult class and DurableJobRunStatus enum to represent the outcome of job execution, supporting statuses for completed, poll-after (with delay), and failed (with optional exception). This enables jobs to request polling for their status or signal failure in a structured way. (src/Orleans.DurableJobs/DurableJobRunResult.cs)

Interface and Implementation Updates

  • Updated the IDurableJobReceiverExtension interface and its implementation so that DeliverDurableJobAsync and the new CheckJobStatusAsync method return a DurableJobRunResult, allowing the job runner to react to job completion, polling requests, or failures. (src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs) [1] [2] [3]

  • Refactored ShardExecutor.RunJobAsync to handle the new result pattern: it now enters a polling loop if the job requests it, removes the job on completion, or handles retries/failures as appropriate. (src/Orleans.DurableJobs/ShardExecutor.cs) [1] [2]

Logging Improvements

  • Added new logging for polling and failed job results to assist with debugging and monitoring job execution flows. (src/Orleans.DurableJobs/ShardExecutor.Log.cs)

Test Enhancements

  • Updated and expanded tests to cover the new polling and failure behaviors, including tests for jobs that require polling before completion or failure, and helper methods to simulate these scenarios. (test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs) [1] [2] [3] [4] [5]

These changes collectively provide a more robust and flexible framework for durable job execution, allowing for asynchronous polling and improved error handling.

Microsoft Reviewers: Open in CodeFlow

Copilot AI review requested due to automatic review settings November 21, 2025 15:09
- Change return type from Task to Task<DurableJobRunResult>
- Update all mock setups to return appropriate DurableJobRunResult
- Failed jobs now return DurableJobRunResult.Failed(exception) instead of throwing
- All 8 tests passing
- Add test for job that returns PollAfter and eventually completes
- Add test for job that polls then fails with retry
Copilot finished reviewing on behalf of benjaminpetit November 21, 2025 15:10
/// Creates a result indicating the job completed successfully.
/// </summary>
/// <returns>A completed job result.</returns>
public static DurableJobRunResult Completed() => new(DurableJobRunStatus.Completed, null, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a property, IMO. It can return static readonly instance.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a polling mechanism for durable jobs by implementing a result-based pattern that allows jobs to signal completion, request status polling, or indicate failure. The changes enable more flexible job execution patterns where jobs can remain in an inline polling loop rather than completing immediately.

  • Introduces DurableJobRunResult class and DurableJobRunStatus enum to represent job execution outcomes
  • Updates job execution flow in ShardExecutor to handle polling loops and structured failure reporting
  • Adds comprehensive test coverage for polling scenarios

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/Orleans.DurableJobs/DurableJobRunResult.cs New file defining result types for job execution with Completed, PollAfter, and Failed statuses
src/Orleans.DurableJobs/IDurableJobReceiverExtension.cs Updated interface and implementation to return DurableJobRunResult and added CheckJobStatusAsync method for polling
src/Orleans.DurableJobs/ShardExecutor.cs Refactored job execution logic to handle polling loops, structured failures, and unified retry handling
src/Orleans.DurableJobs/ShardExecutor.Log.cs Added logging methods for polling and failed job results
test/NonSilo.Tests/DurableJobs/ShardExecutorTests.cs Added comprehensive test coverage for polling behaviors and updated existing tests to use new result pattern

Comment on lines +65 to +69
public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
{
// TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)
// For now, always return Completed
return Task.FromResult(DurableJobRunResult.Completed());
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CheckJobStatusAsync method has a TODO comment indicating incomplete implementation. This implementation always returns Completed, which means polling will never continue beyond the first check. Consider either implementing the proper status checking logic before merging, or documenting this limitation more explicitly in the PR description and method documentation, as this could lead to unexpected behavior when jobs request polling.

Suggested change
public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
{
// TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)
// For now, always return Completed
return Task.FromResult(DurableJobRunResult.Completed());
public async Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
{
if (_grain.GrainInstance is IDurableJobHandler handler)
{
try
{
return await handler.CheckJobStatusAsync(context, cancellationToken);
}
catch (Exception ex)
{
LogErrorCheckingJobStatus(ex, context.Job.Id, _grain.GrainId);
return DurableJobRunResult.Failed(ex);
}
}
else
{
LogGrainDoesNotImplementHandler(_grain.GrainId);
throw new InvalidOperationException($"Grain {_grain.GrainId} does not implement IDurableJobHandler");
}

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +76
[LoggerMessage(Level = LogLevel.Error, Message = "Error checking status of durable job {JobId} on grain {GrainId}")]
private partial void LogErrorCheckingJobStatus(Exception exception, string jobId, GrainId grainId);
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logger method LogErrorCheckingJobStatus is defined but never used in the codebase. The CheckJobStatusAsync method has no try-catch block and doesn't invoke this logger. Consider adding error handling in CheckJobStatusAsync to use this logger, or remove the unused logger method definition.

Copilot uses AI. Check for mistakes.
LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount);
// Handle failed result through retry policy
LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name);
failureException = result.Exception ?? new Exception("Job failed without exception");
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a generic Exception with a hardcoded message is not ideal for debugging. Consider using a more specific custom exception type (e.g., DurableJobExecutionException) or including contextual information such as job ID and name in the exception message to aid in troubleshooting.

Copilot uses AI. Check for mistakes.
/// The job will remain in an inline polling loop without being re-queued.
/// The polling loop will hold a concurrency slot until the job completes or fails.
/// TODO: Add validation for minimum/maximum poll delays to prevent abuse.
/// TODO: Consider concurrency slot management for long-running polls.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. We could let the job indicate whether the job is:

  • Running
    • actively executing
    • blocks a slot
    • poll after short delay
  • Waiting
    • not actively executing,
    • doesn't block a slot
    • should be re-queued, potentially on a different shard)

'Waiting' could be (mis-)used to provide periodic execution, like reminders do today, but it would be the same logical 'run', just delayed/rescheduled.

We could call the methods/statuses:

  • InProgress(TimeSpan estimatedCompletion)
  • Suspended(DateTimeOffset dueTime)

What do you think? In this PR, we only need the first. We might not need the second method at all since it can already be implemented by a combination of scheduling a new run + completing this run.

/// <returns>A task that represents the asynchronous operation.</returns>
Task DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken);
/// <returns>A task that represents the asynchronous operation and contains the job execution result.</returns>
Task<DurableJobRunResult> DeliverDurableJobAsync(IDurableJobContext context, CancellationToken cancellationToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be AlwaysInterleave


public Task<DurableJobRunResult> CheckJobStatusAsync(IDurableJobContext context, CancellationToken cancellationToken)
{
// TODO: Need to properly check job status and return appropriate result (Completed, PollAfter, or Failed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this to be implemented before merging, right?

LogRetryingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, retryTime.Value, jobContext.DequeueCount);
await shard.RetryJobLaterAsync(jobContext, retryTime.Value, cancellationToken);
// Enter polling loop
LogPollingJob(_logger, jobContext.Job.Id, jobContext.Job.Name, result.PollAfterDelay!.Value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to get fancy, you can have an IsPending property with [MemberNotNullWhen(true, nameof(PollAfterDelay)] and then you can remove the ! here. Similar for the other properties.

LogJobFailedNoRetry(_logger, jobContext.Job.Id, jobContext.Job.Name, jobContext.DequeueCount);
// Handle failed result through retry policy
LogJobFailedWithResult(_logger, jobContext.Job.Id, jobContext.Job.Name);
failureException = result.Exception ?? new Exception("Job failed without exception");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is failing without an exception valid? We should enforce that there is an exception of some sort if the job fails. It would be nice if we had union types already so we could make it a cheap, lightweight string instead of an Exception object...

failureException = result.Exception ?? new Exception("Job failed without exception");
}
}
catch (Exception ex) when (ex is not TaskCanceledException)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use OperationCanceledException instead


await Task.Delay(result.PollAfterDelay.Value, cancellationToken);

result = await target.CheckJobStatusAsync(jobContext, cancellationToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be DeliverDurableJobAsync instead, since they have the same signatures. The implementation can decide whether it needs to start the job or if it already has the job running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants