-
Notifications
You must be signed in to change notification settings - Fork 524
Description
Pull Request Template
Description
Problem: Recently, in the encryption test project, I started observing some strange behavior, where few of the tests started failing when the replica validation is turned on. The exception is below:
It appears that a System.InvalidOperationException Collection was modified was thrown while calling LoadBalancingPartition.Dispose(). I analyzed this and figured out that a potential cause could be a thread contention while accessing and modifying the this.openChannelsat the same time:
public void Dispose()
{
this.capacityLock.EnterWriteLock();
try
{
foreach (LbChannelState channelState in this.openChannels)
{
channelState.Dispose();
}
}
finally
{
this.capacityLock.ExitWriteLock();
}
try
{
this.capacityLock.Dispose();
}
catch(SynchronizationLockException)
{
// SynchronizationLockException is thrown if there are inflight requests during the disposal of capacityLock
// suspend this exception to avoid crashing disposing other partitions/channels in hierarchical calls
return;
}
}
My Analysis and Solution: Why this is happening, is because it is possible that during the OpenChannelAsync stage, we exit the write lock too early, before the connection opening task is finished and thus it gives the opportunity to modify the collection by another thread, potentially the disposing thread.
Please take a look at my proposed code changes below, while the left is the existing code and the right one is my new code changes:
In theory, instead of returning the return this.OpenChannelAndIncrementCapacity if we could do an await this.OpenChannelAndIncrementCapacity, that should make sure that we wait on the operation to finish before releasing the write lock. But, this brings in another problem.
The new problem it creates is that, this.capacityLock.EnterWriteLock() or any other operations on ReaderWriterLockSlim is not applicable for asynchronous operations.
Our call to OpenChannelAndIncrementCapacity() during the rntbd connection opening (both replica validation + create and initialize async) is deterministic in nature thus we wait on the background channel opening task await newChannel.OpenChannelAsync(activityId); . I have observed that, a thread which is acquiring the write lock, might get lost because of the background task, and a second thread tried to release the lock which never acquired it at the first place, which ends up creating a dead lock situation.
To better understand this, please take a look at the trace logs below:
You will notice that the Exiting write lock statement was printed once that is because the thread id (i.e. 8) that acquired the lock was same that tried to release it. So, it appears that the ReaderWriterLockSlim is not helpful in this case where we have the async call.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
