
Conversation

@andystaples
Contributor

This PR addresses a very specific bug: a race condition caused by improper handling of orchestration flow.
The bug can be reproduced in Durable Functions with the dotnet-isolated programming model by a setup like the following, whenever the tags input contains more than one element:

        [Function(nameof(EntityLockNullReferenceOrchestration))]
        public static async Task<string> RunOrchestrator([OrchestrationTrigger] TaskOrchestrationContext context, string[] tags)
        {
            var tagsData = new ConcurrentQueue<Tuple<string, string>>();
            var categoriesReportsTasks = tags.Select(tag => GetOrBuildTagData(context, tag, tagsData)).ToArray();
            await Task.WhenAll(categoriesReportsTasks);
            return string.Join(", ", tagsData.Select(t => $"{t.Item1}: {t.Item2}"));
        }

        private static async Task GetOrBuildTagData(TaskOrchestrationContext context, string tag, ConcurrentQueue<Tuple<string, string>> tagsData)
        {
            var dataSet1Id = new EntityInstanceId(nameof(DummyTagCacheEntity), tag);
            var cachedData = await context.Entities.CallEntityAsync<string>(dataSet1Id, nameof(DummyTagCacheEntity.Get));
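            // With more than one tag, these per-tag tasks interleave: one task can
            // reach LockEntitiesAsync below while another task's lock is still
            // pending, which is the race condition this PR addresses.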
            await using (await context.Entities.LockEntitiesAsync(dataSet1Id))
            {
                await context.Entities.CallEntityAsync(dataSet1Id, nameof(DummyTagCacheEntity.Set), "Hello");
            }
            tagsData.Enqueue(new(tag, "Hello"));
        }

This is unsupported behavior: awaiting Task.WhenAll on tasks that each comprise multiple Durable operations is likely to cause non-determinism exceptions, and it is especially problematic with entity locks, since an orchestration should hold only one entity lock at a time.
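
For illustration, a supported way to get the same result is to process the tags sequentially, so the orchestration holds at most one entity lock at a time. Below is a minimal sketch reusing the entities from the repro above (the function name EntityLockSequentialOrchestration is invented for this example):

        [Function(nameof(EntityLockSequentialOrchestration))]
        public static async Task<string> RunSequentialOrchestrator([OrchestrationTrigger] TaskOrchestrationContext context, string[] tags)
        {
            var tagsData = new List<Tuple<string, string>>();
            foreach (string tag in tags)
            {
                var entityId = new EntityInstanceId(nameof(DummyTagCacheEntity), tag);
                var cachedData = await context.Entities.CallEntityAsync<string>(entityId, nameof(DummyTagCacheEntity.Get));

                // Only one critical section is ever open at a time here, which is the
                // supported usage of entity locks within a single orchestration.
                await using (await context.Entities.LockEntitiesAsync(entityId))
                {
                    await context.Entities.CallEntityAsync(entityId, nameof(DummyTagCacheEntity.Set), "Hello");
                }

                tagsData.Add(new(tag, "Hello"));
            }

            return string.Join(", ", tagsData.Select(t => $"{t.Item1}: {t.Item2}"));
        }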

However, this pattern today throws a NullReferenceException in OrchestrationEntityContext.RecoverLockAfterCall: because of the race condition, RecoverLockAfterCall is called while the lock is still pending, so availableLocks has not yet been initialized. With the fix in this PR, the orchestrator instead throws System.InvalidOperationException: "Must not enter another critical section from within a critical section.", which is more helpful for identifying the race condition in the tasks.
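
Conceptually, the fix validates state up front when a critical section is entered, instead of letting RecoverLockAfterCall dereference the uninitialized availableLocks later. The following is a hypothetical sketch of that kind of guard; the member names are invented for illustration and do not match the actual DurableTask.Core code:

        // Hypothetical illustration only; names do not match DurableTask.Core.
        private Guid? criticalSectionId; // non-null while a critical section is open or pending

        public void BeginCriticalSection(Guid newCriticalSectionId)
        {
            if (this.criticalSectionId != null)
            {
                // Previously this state went undetected until RecoverLockAfterCall,
                // where it surfaced as an unhelpful NullReferenceException.
                throw new InvalidOperationException(
                    "Must not enter another critical section from within a critical section.");
            }

            this.criticalSectionId = newCriticalSectionId;
        }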

- Fix: surface a specific entity locking race condition with the correct error
@mllab-nl

mllab-nl commented Oct 9, 2025

Am I the only one who feels like a test is missing ? 🙃

@andystaples
Contributor Author

Am I the only one who feels like a test is missing ? 🙃

We will want to add testing for this change, but since DurableTask.Core does not expose entity functionality (and even if it did, I don't think the orchestration APIs would even allow the same kind of race condition here), we will add the tests to the E2E test suite in azure-functions-durable-extension.
For now, I have tested the change locally and verified that it has the intended effect.

Member

@cgillum cgillum left a comment

I’m good with this change based on manual testing. Integration testing will need to be done separately in another repo.

@andystaples andystaples merged commit 00f70b7 into main Oct 9, 2025
44 checks passed
@andystaples andystaples deleted the andystaples/bugfix-critical-section-race-condition branch October 9, 2025 17:07