v2.5 work: BacklogPolicy #1912
Conversation
We got a little heavy with stats over the years and it's cumbersome to add anything here (e.g. another `out` parameter). Instead of adding yet more stats in upcoming bits as-is, I decided to take a stab at simplifying this with `readonly struct` passes. We're not _super_ concerned with efficiency in the exception path but hey, why not. This should simplify maintenance/additions and clarify what each property is. Note that we have some static defaults here because `default` does _not_ run property initializers where a `new()` does.
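For context, a minimal sketch of the `default` vs `new()` difference being described (the struct and property names here are hypothetical, not the actual stats types):

```csharp
using System;

public readonly struct ConnectionStatsSketch // hypothetical type for illustration
{
    // Property initializers only run when a constructor runs, i.e. via new();
    // default(ConnectionStatsSketch) zero-initializes everything and skips them.
    public int PendingCount { get; init; } = -1; // -1 meaning "not available"

    public ConnectionStatsSketch() { } // explicit parameterless ctor so new() runs the initializer

    // Expose a static default that goes through new(), rather than relying on default(...)
    public static ConnectionStatsSketch NotAvailable { get; } = new();
}

public static class Demo
{
    public static void Main()
    {
        Console.WriteLine(default(ConnectionStatsSketch).PendingCount);      // 0 (misleading)
        Console.WriteLine(ConnectionStatsSketch.NotAvailable.PendingCount);  // -1 (intended default)
    }
}
```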
This isn't working as we want yet in several regards, but putting a progress commit in here. Overall:
- Needs offload to a sibling connection (and an option governing this)
- Needs to fail much faster in the initial connection scenario
- Lots of test love around the new functionality
default subtly differs here in that the reader/writer states will default to int/byte 0 which is not the same as NA. Perhaps we should just make NA be the 0 state though, which would simplify all use cases... @mgravell @philon-msft thoughts there? Talking PhysicalConnection Reader/Writer status enums.
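If NA did become the zero value, the idea would look roughly like this (a hypothetical enum for illustration, not the actual PhysicalConnection definitions):

```csharp
// Hypothetical illustration: with NA as the zero value, default(ReadStatusSketch)
// (and any zero-initialized field holding it) automatically means "no information".
public enum ReadStatusSketch : byte
{
    NA = 0,
    Idle = 1,
    ReadingMessage = 2,
}
```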
The failure tests don't need to retry-loop n times. Also:
- Removes the handshake queue, relying on the pipe directly instead.
- Formats TestBase for easier maintenance
This gets us queueing commands while disconnected. We still need handoff to sibling connections, but it's a start.
We observe a race here between the connection state saying we're connected and the connection actually being selectable, this tightens that window tremendously.
```diff
 }
-else
+else if (physical?.HasOutputPipe == true)
 {
```
Do we need to re-queue the message if `!physical.HasOutputPipe`?
Good catch by @TimLovellSmith, this indeed should have a more constrained set of code being checked - we can still safely exit on the null and FireAndForget cases, only needing to wrap around the throws.
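A self-contained sketch of the shape being agreed on here (every type and name below is illustrative, not the library's internals): return early for the null and fire-and-forget cases, and only guard the paths that would otherwise throw, re-queueing when there is no output pipe.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only; these stand-in types do not mirror StackExchange.Redis internals.
enum DeliveryMode { Normal, FireAndForget }
record PendingMessage(string Command, DeliveryMode Mode);

static class WriteFailureSketch
{
    static readonly Queue<PendingMessage> Backlog = new();

    static void HandleWriteFailure(PendingMessage? message, bool hasOutputPipe)
    {
        if (message is null) return;                             // nothing to complete or re-queue
        if (message.Mode == DeliveryMode.FireAndForget) return;  // caller never observes the failure

        if (!hasOutputPipe)
        {
            Backlog.Enqueue(message);                            // re-queue instead of throwing
            return;
        }

        throw new InvalidOperationException($"Write failed for {message.Command}");
    }
}
```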
This is in seconds (I now loathe all non-TimeSpan time APIs!); the whole point was for the keepalive to fire and detect failure.
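For context, the keep-alive knob on `ConfigurationOptions` is an integer number of seconds rather than a `TimeSpan`, e.g.:

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("localhost:6379");
options.KeepAlive = 1; // seconds, not a TimeSpan - keep it small so the ping fires and detects the failure
```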
These could race between test runs - let's not do that.
We're working on pub/sub - breaking it out explicitly from #1912. This relates to several issues and in general handling resubscriptions on reconnect. Issues: #1110, #1586, #1830, #1835

There are a few things in play we're investigating:
- [x] Subscription heartbeat not going over the subscription connection (due to `PING` and `GetBridge`)
- [x] Subscriptions not reconnecting at all (or potentially doing so and unsubscribing, according to some issues)
- [x] Subscriptions always going to a single cluster node (due to `default(RedisKey)`)

Overall this set of changes:
- Completely restructures how RedisSubscriber works
  - No more `PendingSubscriptionState` (`Subscription` has the needed bits to reconnect)
  - Cleaner method topology (in `RedisSubscriber`, rather than `Subscriber`, `RedisSubscriber`, and `ConnectionMultiplexer`)
    - By placing these on `RedisSubscriber`, we can cleanly use the `ExecuteSync/Async` bits, get proper profiling, etc.
  - Proper sync/async split (rather than `Wait()` in sync paths)
- Changes how subscriptions work
  - The `Subscription` object is added to the `ConnectionMultiplexer` tracking immediately, but the command itself actually goes to the server and back (unless FireAndForget) before returning, for proper ordering like other commands.
  - No more `Task.Run()` loop - we now ensure reconnects as part of the handshake
  - Subscriptions are marked as not having a server the moment a disconnect is fired
    - Question: should we have a throttle around this for massive numbers of connections, or async it?
- Changes how connecting works
  - The connection completion handler will now fire when the _second_ bridge/connection completes; this means we won't have `interactive` connected but `subscription` in an unknown state - both are connected before we fire the handler, meaning the moment we come back from connect, subscriptions are in business.
- Moves to a `ConcurrentDictionary`, since we only need limited locking around this and we only have it once per multiplexer.
  - TODO: this needs eyes; we could shift it - the implementation changed along the way, so this isn't a critical detail
- Fixes the `TrackSubscriptionsProcessor` - this was never setting the result, but nobody noticed in 8 years because downstream code never cared.
  - Note: each `Subscription` has a processor instance (with minimal state) because when the subscription command comes back, _then_ we need to decide whether it successfully registered (if it didn't, we need to maintain that it has no successful server)
- `ConnectionMultiplexer` grew a `DefaultSubscriber` for running some commands without lots of method duplication, e.g. ensuring servers are connected.
- Overrides `GetHashSlot` on `CommandChannelBase` with the new `RedisChannel`-based methods so that it operates correctly

Not directly related changes which helped here:
- Better profiler helpers for tests, and profiler logging in them
- Re-enables a few `PubSub` tests that were unreliable before...but correctly so.

TODO: I'd like to add a few more test scenarios here (see the tracking sketch after this list):
- [x] Simple Subscribe/Publish/await Until/check pattern to ensure back-to-back subscribe/publish works well
- [x] Cluster connection failure and subscriptions moving to another node

To consider:
- [x] Subscription await loop from `EnsureSubscriptionsAsync` and connection impact in large reconnect situations
  - In a reconnect case, this is background and only the nodes affected have any latency...but still.
- [ ] TODOs in code around variadic commands, e.g. re-subscribing with far fewer commands by using `SUBSCRIBE <key1> <key2>...`
  - In cluster, we'd have to batch per slot...or just go for the first available node
  - ...but if we go for the first available node, the semantics of `IsConnected` are slightly off in the not-connected (`CurrentServer is null`) case, because we'd say we're connected to wherever it _would_ go, even though that'd be non-deterministic without hashslot batching. I think this is really minor and shouldn't affect our decision.
- [x] `ConcurrentDictionary` vs. returning to locks around a `Dictionary`
  - ...but if we have to lock on firing consumption of handlers anyway, the concurrency overhead is probably a wash.
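A hedged sketch of the tracking pattern described above (illustrative only, not the actual `RedisSubscriber`/`Subscription` code): subscriptions live in a `ConcurrentDictionary` and are re-issued as part of the (re)connect handshake rather than from a `Task.Run()` loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative only; the real Subscription type carries more state (server, processor, handlers).
sealed class SubscriptionTrackerSketch
{
    private readonly ConcurrentDictionary<string, Func<Task>> _subscriptions = new();

    // Track immediately; the SUBSCRIBE command itself still round-trips before the caller returns.
    public void Track(string channel, Func<Task> resubscribe) => _subscriptions[channel] = resubscribe;

    public void Forget(string channel) => _subscriptions.TryRemove(channel, out _);

    // Invoked from the connection handshake so a reconnect restores every subscription.
    public async Task EnsureSubscriptionsAsync()
    {
        foreach (var pair in _subscriptions)
        {
            await pair.Value(); // re-issue SUBSCRIBE for this channel
        }
    }
}
```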
```csharp
/// or it could choose to fail fast and throw ASAP. Different apps desire different behaviors with backpressure and how to handle
/// large amounts of load, so this is configurable to optimize the happy path but avoid spiral-of-death queue scenarios for others.
/// </summary>
public class BacklogPolicy
```
Is this designed for inheritance? If not, recommend `sealed`.
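For illustration, a minimal sketch of the sealed-with-presets shape being suggested (property names here are illustrative and may not match the shipped API exactly):

```csharp
// Sketch only: a sealed policy type exposing preset instances instead of inviting inheritance.
public sealed class BacklogPolicySketch
{
    /// <summary>Queue commands while disconnected and flush them once reconnected.</summary>
    public static BacklogPolicySketch Default { get; } = new() { QueueWhileDisconnected = true };

    /// <summary>Fail immediately rather than queueing anything (the pre-2.5 behavior).</summary>
    public static BacklogPolicySketch FailFast { get; } = new() { QueueWhileDisconnected = false };

    public bool QueueWhileDisconnected { get; init; }
}
```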
Based on #1912 (otherwise the tests those PRs are working on fail, but this is good against `main` too):
- Shortens the names in the check list
- Removes the .NET 5.0 SDK, since we don't need that runtime anymore
- Removes `netcoreapp3.1` from StackExchange.Redis.Tests (running `net472` and `net6.0` now)
This looks great and thank you for incorporating it as a single queue implementation :)
```csharp
// Infer a server automatically
server = SelectServer(message);

// If we didn't find one successfully, and we're allowed, queue for any viable server
```
If I understand it correctly, I think it will be interesting to test this logic with a clustered cache. In case of a failure, the logic will select any viable server. This viable server might return a MOVED exception, and the message will get redirected to the right endpoint. It should be fine if it eventually gets to the right endpoint. Does that sound correct?
Correct! Let's make sure in tests though - I'll add some cluster tests around this to the PR.
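For reference, the redirect being relied on here is the standard cluster `MOVED` reply; a minimal, self-contained parser sketch (not library code) of that error format:

```csharp
// Standalone sketch: a cluster node that doesn't own the key's slot replies
// "MOVED <slot> <host>:<port>", and the client re-issues the command against that endpoint.
static class MovedReplySketch
{
    public static bool TryParseMoved(string error, out int slot, out string endpoint)
    {
        slot = 0;
        endpoint = string.Empty;
        var parts = error.Split(' ');
        if (parts.Length != 3 || parts[0] != "MOVED") return false;
        if (!int.TryParse(parts[1], out slot)) return false;
        endpoint = parts[2];
        return true;
    }
}
```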
Added additional testing in 2c2b0ea!
Ensures cluster is also cooperating as expected.
Note: this is based on #1912, so let's wait until that's in. Changeset overall:
- Add `MakePrimaryAsync` (deprecate `MakeMaster`)
  - This yanks the code and does an evil `.Wait()` - better ideas?
- Add `ReplicaOfAsync` (deprecate `ReplicaOf`)
- Remove the last usages of `CommandFlags.HighPriority`
- Remove the `ServerEndPoint.WriteDirectFireAndForgetSync<T>` path

Co-authored-by: mgravell <[email protected]>
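A hedged sketch of the deprecation pattern in question (signatures simplified, not the actual `IServer` surface): the old sync method delegates to the new async one and blocks, which is the "evil `.Wait()`".

```csharp
using System;
using System.Threading.Tasks;

// Illustrative only; the real signatures take endpoint/flags parameters and live on the server API.
public class PrimaryPromotionSketch
{
    public Task MakePrimaryAsync() => Task.CompletedTask; // stand-in for the real async work

    [Obsolete("Use " + nameof(MakePrimaryAsync) + " instead.")]
    public void MakeMaster() => MakePrimaryAsync().Wait(); // sync-over-async: simple, but risks deadlocks/thread starvation
}
```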
This is a work in progress on the BacklogPolicy bits to help address #1864. The overall change here is to make the message pathway backlog commands when an endpoint is down. It does not hand off to another viable endpoint in this iteration (future plans there). If the policy is set to `FailFast`, we should get the old behavior of not queueing anything.

In the new path: … TODO list:
- … (`QueuesAndFlushesAfterReconnecting`) - this may be boneheaded or a fundamental flaw.
- … (`Ping()`) and that they recover appropriately
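A hedged usage sketch, assuming the policy ends up exposed on `ConfigurationOptions` (the exact property and preset names may differ from this illustration):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

class BacklogPolicyUsageSketch
{
    static async Task Main()
    {
        var options = ConfigurationOptions.Parse("localhost:6379");
        options.BacklogPolicy = BacklogPolicy.FailFast;   // assumed property: opt back into the old fail-fast behavior
        // options.BacklogPolicy = BacklogPolicy.Default; // queue while disconnected, flush on reconnect

        using var muxer = await ConnectionMultiplexer.ConnectAsync(options);
        var db = muxer.GetDatabase();
        await db.PingAsync();
    }
}
```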