Skip to content

Conversation

@KKhurin
Copy link
Contributor

@KKhurin KKhurin commented Mar 11, 2022

This fixes the longstanding problem with sporadically having ack timeouts when adding connections to groups.

  • the culprit behind the timeouts is a race condition when sending service message to several endpoints simultaneously which may lead to overriding AckId of a message that has not been sent yet
  • the stickiness to the service connection (introduced in SDK 1.7.1) makes this race condition happen a lot more often due to increased likelihood of going async (and increased chance of the race condition) because many scenarios would use the same connection with higher concurrency.

Fixes #745
Fixes #1225

@vicancy
Copy link
Member

vicancy commented Mar 14, 2022

Quote @vwxyzh 's proposal below which avoids changing the protocol

Change https://github.com/Azure/azure-signalr/blob/dev/src/Microsoft.Azure.SignalR.Common/ServiceConnections/ServiceConnectionContainerBase.cs#L245 to

RawMessage raw = null;

lock (ackableMessage)
{
  ackableMessage.AckId = id;
  raw = WriteRawMessage(ackableMessage);
}

@KKhurin
Copy link
Contributor Author

KKhurin commented Mar 14, 2022

Quote @vwxyzh 's proposal below which avoids changing the protocol

Change https://github.com/Azure/azure-signalr/blob/dev/src/Microsoft.Azure.SignalR.Common/ServiceConnections/ServiceConnectionContainerBase.cs#L245 to

RawMessage raw = null;

lock (ackableMessage)
{
  ackableMessage.AckId = id;
  raw = WriteRawMessage(ackableMessage);
}

Didn't originally choose this because of:

  1. it allocates bytes for the raw message much earlier than it currently does - e.g. increases memory pressure (although making a copy also adds an extra allocation which by its nature should be smaller than the serialized one).
  2. introduces a lock which is arguably heavier than the shallow member-wise clone.

I like the simplicity of @yzt's approach if we can accept the new method in ServiceMessage.

@KKhurin KKhurin merged commit a452bd1 into dev Mar 15, 2022
@vicancy vicancy deleted the AckTimeoutFix branch November 30, 2023 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ack-able message Microsoft.Azure.SignalR.Protocol.JoinGroupWithAckMessage waiting for ack timed out. Timeout when add to group

4 participants