
Backlog, Backpressure, and Retry - Overview and Planning #1864

@NickCraver

In #1856 we’ve been trying to assess what’s needed in a retry policy to help handle the interruption cases users see today in a more graceful, and hopefully often more transparent, way. However, in exploring the realities of this at all layers, it’s far more nuanced than has ever been considered in this library before. This issue is to round up discovery and thinking on these fronts.

I am not discussing specific implementations of solutions here; this is more a round-up of “what problems are we trying to solve?” and the details of that question. The most important step of any solution is understanding the problem - and the group working on this thus far thinks we’re about at that point: understanding the intricacies of what’s going on. This issue is to detail all that info and discuss/clarify anything further.

Note: I'm using the term "command" generically here - this is concretely a Message in our library.

Overview/Context

The fundamental architecture of StackExchange.Redis is a pipelined approach. For a single ServerEndPoint, all commands are in a conveyor-belt style pipeline. One command can block others if it takes a long time to send or receive, and when a connection failure happens, where it happened, when it happened, what it affects, and what’s behind it are all interesting combinations of factors.

The overall hierarchy of the structure in StackExchange.Redis is:

- ConnectionMultiplexer
  - ServerEndPoint
    - PhysicalBridge
      - PhysicalConnection
        - IDuplexPipe
        - Socket (used in the pipe as well)

When a command is sent through the multiplexer, it may specify a particular endpoint (rare) or auto-select an endpoint (common) based on preference. This can have some steering by CommandFlags (e.g. CommandFlags.DemandReplica).
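
For illustration, here’s a minimal sketch of how user code steers that selection (the endpoints and key are placeholders):

```csharp
using StackExchange.Redis;

// Connect a multiplexer (placeholder endpoints).
var muxer = await ConnectionMultiplexer.ConnectAsync("primary:6379,replica:6380");
var db = muxer.GetDatabase();

// Common case: auto-selection. Writes go to a primary; reads may go to a
// primary or replica depending on configuration.
await db.StringSetAsync("greeting", "hello");

// Steering via CommandFlags: demand a replica for this read, or merely
// prefer one and fall back if none is available.
var fromReplica = await db.StringGetAsync("greeting", CommandFlags.DemandReplica);
var preferReplica = await db.StringGetAsync("greeting", CommandFlags.PreferReplica);
```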

A server endpoint can be in a not-connected state in a few ways:

  1. It hasn’t connected yet (initial connection)
  2. It has disconnected and we don’t realize it yet
  3. It was disconnected and hasn’t reconnected yet (failure)

It’s worth noting that the first case (not connected yet) is the simplest and generally well handled and seems solid today. It’s simple to reason about everything involved and I think Tim and I solved the remaining issues there. Reconnects though, there be our dragons.

Importantly, we don’t always know when a socket dies. It can take anywhere from milliseconds to minutes to determine this has happened. This could be for a few primary reasons:

  1. Network error, we lost the socket connection for some reason
  2. Unknown failover of the server (not graceful, reactive)
  3. Known failover of the server (graceful, proactive)

The timing of realizing this is unfortunately highly variable. There could be a command in progress taking a long time (e.g. a huge payload coming back from Redis), so while we have a heartbeat, we can’t jump to conclusions about a delay immediately (e.g. the heartbeat could be stuck behind that command). There are literally thousands of network configuration nuances that can affect how long it takes before a server/OS and a client app running on it realize a socket is no longer healthy. If it helps, think of a socket like 2 people shouting between mountains. There is no constant connection, just an open agreement to speak and listen. It’s not apparent that the person on the other side has gone away; eventually you hit thresholds that convince you this has happened and the conversation stops, e.g. “they haven’t answered in 2 minutes”.

The flow for a command going through is:

Command (sync or async) -> ServerEndPoint selection -> PhysicalBridge -> Pipeline -> Socket

The most important factor here is a disconnect can happen at any time. The realization of this disconnect will happen at an indefinitely later time. If we know of a failover for example, this is likely fast and we’re proactively checking. If we don’t know, it may wait for a heartbeat to realize something has gone wrong. We need to realize there has been a connection death and then update our state to reflect that. This does not happen instantly, nor can it. We have likely issued anywhere from a few commands to hundreds of thousands of commands (depending on client volume) before realizing the pipe has broken. This is akin to pumping water from a station before realizing a backhoe has dug up the water pipe 3 miles away - we’ve likely continued to pump a lot of water in before someone realizes it and shuts off the valve.

The above flow is inside StackExchange.Redis, but beyond that, there’s a bigger flow:

User code -> StackExchange.Redis -> Network -> Redis -> Network -> StackExchange.Redis -> User code

At any given time, there are likely commands in all of these steps, flowing along the pipe. A break in the pipe could happen anywhere. It could be in the commands leaving the client, the network to Redis, the network coming back, or Redis could have restarted in the middle...or something else. Importantly: we don’t know. All we know at the command level is: we either didn’t get to send it (good, we know what’s what) or we have sent it and a response never came back (bad, we have no clue). This is akin to giving a stranger $100 to pick up some delicious doughnuts (they damn well better be at that price) and them never coming back. Did they make it to the doughnut store? Did they make it back? Was there an accident? Did they have a sugar coma? No one knows; all we know is they never made the trip back, D.B. Cooper style. For any command that left the library and hit the socket, we’re in the unknown state.

Retry

For commands that haven’t been sent yet, we’ve been referring to this as “retry” because today they’d get a “No connection available” error, but that’s really semantics relative to current behavior: for these commands we’re really talking about making the first try, rather than failing fast.

Because we don’t know where a failed command fell off once it’s gone to the socket, reissuing mutating commands there is dangerous, e.g. incrementing an account balance: if it made it to Redis the first time, we’d be doing it twice. Commands that only read data are safer and could be retried.
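
As a hedged sketch of what a “safe to retry after send” check could look like if all we know is the command name - the helper and the allow-list below are hypothetical, not library API:

```csharp
using System;
using System.Collections.Generic;

static class RetrySafety
{
    // Hypothetical allow-list of commands that cannot change state.
    private static readonly HashSet<string> ReadOnly = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "GET", "MGET", "EXISTS", "TTL", "HGETALL", "LRANGE", "SMEMBERS", "ZRANGE"
    };

    // A write like INCRBY on an account balance may already have been applied
    // before the connection died, so retrying it risks applying it twice.
    // Re-issuing a pure read is merely redundant.
    public static bool IsSafeToRetryAfterSend(string command) => ReadOnly.Contains(command);
}
```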

Order

In our discussions thus far, we’ve generally concluded that order cannot be maintained in the failure state, if for no other reason than that we lost n commands when the pipe failed and we don’t know where. Short of awaiting every command and not sending the next until the previous one is back (a HUGE performance penalty), we can’t guarantee order.

Buffering

This is an interesting area because it has a lot of consequences depending on user load. Today what happens is a fast failure if we don’t have an endpoint in the ServerEndPoint selection process above. For example, if we use CommandFlags.DemandPrimary or issue a mutation (write) command that needs a primary, but one isn’t connected, we’ll fail immediately with a “No connection available” error.
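
Roughly, from the caller’s perspective today (a sketch, reusing the `db` from the earlier example and assuming no primary is currently connected; the message text is paraphrased):

```csharp
using System;
using StackExchange.Redis;

try
{
    // A write requires a primary; with none connected, this fails fast
    // rather than waiting for a reconnect.
    await db.StringSetAsync("key", "value");
}
catch (RedisConnectionException ex)
{
    // e.g. "No connection is available to service this operation: SET key"
    Console.WriteLine(ex.Message);
}
```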

Problems

In digging through this, we have identified several problems with the current behavior. Chiefly, the PhysicalBridge’s processing of the backlog is broken. It was originally intended to queue/order “connecting” commands - the initial set of commands StackExchange.Redis issues to establish and authenticate a connection to Redis. In this respect, it works well. However, when applied to the reconnect case, it has fundamental flaws:

  1. In a reconnect scenario, calls via WriteDirectOrQueueFireAndForgetAsync from HandshakeAsync and AutoConfigureAsync are queued behind other commands, improperly. These need to go first, since we aren’t lit up as active and available until their result handlers complete.
  2. We retry immediately when a command is backlogged. In the reconnect case this means we’re (almost certainly) dealing with a dead socket and immediately failing the commands on that instant retry.
  3. Commands that passed the .IsConnected state check in the ServerSelectionStrategy before a failure was detected flow into the backlog. Since this backlog is also used by the AUTH command for connection completion (used in most cloud scenarios), it means random commands are in an ordered queue, potentially between some connection leaders and AUTH...which will result in a disconnect. This is masked by the instant retry in number 2 above, because that mechanism is rapidly flushing the queue, spamming failures and leaving the next attempt clear to succeed. The net result is not connecting as fast, though.
  4. Since we’re only retrying on this bridge, commands that could be shuffled to another acceptable endpoint aren’t, and just fail instead.

Given the above, the changes in #1779 would subtly make this worse, if not for the instant retry in number 2 as well, because the failed commands are being flushed out. However, practically speaking, we are probably eating a few AUTH commands out of order, hosing the connection, and going back through another configure and handshake to resolve it.

The approach to IRetryPolicy in #1856 has many issues that we didn’t realize until fully going through the exercise of that approach. Most importantly, handling retries at the top level has 3 problems that aren’t solvable due to locality of the solution:

  1. We still need some ordered mechanism at the PhysicalBridge layer for on-connect commands for connections to establish, which means dealing with and reasoning about 2 queues in a cascading fashion overall.
  2. It’s ultimately exception-based handling, though it’s not thrown in the traditional sense we are getting an exception and absorbing most of the cost therein to even get to the “should we retry this?” phase.
  3. An extension of number 1 is that commands that got past .IsConnected checks and into the backlog buffer would retry differently and have a different ServerEndPoint re-resolution for validity (as in, they wouldn’t - they’re not bound to the specific bridge’s queue) vs. those in the policy which would potentially find another avenue to go.

Ordering

There are trade-off choices between order of commands and resiliency. Do clients care more that commands succeeded, or that they did so in the correct order? The nature of a pipelined system aimed at high efficiency means we must choose a point on the spectrum between these two in how we handle things. We can go anywhere from retrying anything at any time for maximum success, to maintaining order completely by awaiting each command before the next is issued (which would be horribly slow - scaling with network latency). There are also multiple layers of order to consider and weigh against guarantees that do or don’t exist today.

Under normal circumstances, order on a particular pipe (ServerEndPoint) is maintained, because we take commands through a semaphore around the pipe writer with a dedicated mutex. However, order across multiple endpoints has no guarantees at all - commands are handed off to their selected endpoints and pipes ASAP and then proceed within that ordered silo.

In the exception case, the pipe has burst. We’re leaking all over Main Street. We don’t know how many commands were just lost on the way to or from Redis - they were lost at some point along the way. They were sent in order, but we don’t know which made it to Redis and which did not. When you consider the timing elements and the far-from-immediate recognition that we died, it’s almost certain we piped more commands to Redis and out of the client machine before the break was recognized. Because of this we can continue in order with n failed along the way (let’s say we failed for 4 seconds - we lost 4 seconds), but in doing so, we can’t retry, because those retries would come after additional commands we already sent and which may have succeeded. That’s the choice between retrying or not - we have to choose order or maximum per-command success. The general consensus is people would prefer success to ordering. If order is that important, it can be controlled in user code by awaiting each result before issuing the next.
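
As a sketch of that user-side choice (assuming an existing IDatabase `db`), compare pipelined issuance with strict await-each-result ordering:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

static async Task OrderingDemo(IDatabase db)
{
    // Pipelined / maximum throughput: all three commands may be on the wire
    // at once; across a connection failure, their fates (and any retry order)
    // are not guaranteed relative to each other.
    var a = db.ListRightPushAsync("events", "a");
    var b = db.ListRightPushAsync("events", "b");
    var c = db.ListRightPushAsync("events", "c");
    await Task.WhenAll(a, b, c);

    // Strictly ordered: await each result before issuing the next. Order is
    // guaranteed from the caller's point of view, at the cost of one network
    // round trip per command.
    await db.ListRightPushAsync("events", "a");
    await db.ListRightPushAsync("events", "b");
    await db.ListRightPushAsync("events", "c");
}
```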

Proposals

There are a few ideas here for how to go about improving things. I’ll try and break these out by category:

Buffering Changes

If, instead of immediately issuing “no connection available” errors, we buffer those commands to issue when we are reconnected, it’s a potentially more graceful experience for the consumer, but at a potentially higher system risk. For example, if we did not immediately error and instead buffered, we’d keep the Tasks for sending alive for the duration of their associated (sync or async) timeout (plus up to approximately an additional second). This can have downstream backpressure consequences in thread pool usage, connections, or any other resources...whereas today they’d fail immediately and free up those resources. However, for brief interruptions, this buffering is likely desirable for most applications (this is admittedly a guess): commands they issue would succeed, taking an extra second or a few, rather than failing.

If a system is “close to redline” though, as in it can barely keep up with the load it’s sending to Redis and back today, this queue can be fatal or debilitating. Commonly referred to as “the spiral of death”, having work, time, or resources aimed at a sunk cost that inhibits system recovery can cause an indefinite or prolonged outage.

Let’s say a busy client has a timeout of 5 seconds. If the system is a high frequency, low latency system: that’s an eternity. We could exhaust upstream resources (threads, connection pools, etc. - this is especially dangerous if sync-over-async is happening). If it’s used to processing, say, 90,000 commands per second but the maximum throughput is 100,000 commands per second (for whatever reason - any bottleneck wins), we don’t have a lot of headroom. In 5 seconds we would buffer 450,000 commands. It’s going to take 4.5 seconds to clear that backlog, which means requests that could have been taking 1ms (for simplicity) are taking 4,501ms. They could have timed out and been discarded by then (upstream in the calling code), so now we’re just doing junk work no one cares about, and we’re relatively slow getting back to realtime and low latency. Compared to today: commands would instantly fail once we realized the disconnect, and when we’re back in action, would immediately resume. This is easier to reason about and has less of a net impact (total failed requests).
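
The arithmetic behind that example, spelled out (illustrative numbers only):

```csharp
// Back-of-the-envelope numbers from the paragraph above.
const int steadyStateOpsPerSec   = 90_000;  // normal client load
const int maxThroughputOpsPerSec = 100_000; // whatever the bottleneck is
const int bufferSeconds          = 5;       // timeout / buffering window

const int buffered = steadyStateOpsPerSec * bufferSeconds; // 450,000 commands

// Draining just that buffer at absolute max throughput:
//   450,000 / 100,000 = 4.5 seconds
// ...and new traffic keeps arriving at 90,000/sec while we drain, so the
// surplus capacity is only 10,000/sec - catching back up to realtime takes
// far longer than the buffer-drain figure alone suggests.
```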

The above is for a high-traffic client; by comparison, a low-traffic client would be happy buffering: less resource exhaustion, fewer errors, and just a burst of slow requests in there.

In either case, if the client <-> Redis outage lasted more than the timeout duration, those commands would still fail.

For all of the above, the client library “admin” commands from HandshakeAsync and AutoConfigureAsync need to not get behind other things in a retry queue. They need to go ASAP, from both an immediacy and an expediency standpoint.

Open questions:

  • Should this be opt-out or opt-in by default?

Buffering Options

Should users have control here? How much? This gets murky really fast when you consider the variants involved:

  1. Given the information earlier in the issue, we already buffer, sometimes, for some duration.
  2. If we stopped buffering, we’d introduce a torrent of errors to users, more than they’ve ever seen before - would this be wanted?
  3. Count-based limits, and any default for them, aren’t going to suit the vast majority of the clientele.

I think that if we allow options here, a duration-based approach (which is also consistent with how timeouts in general work now) is the best unit-of-threshold mechanism we can offer. If we said 1,000 commands or 1,000,000 commands, that could be seconds or years depending on the app - counts and any default there just don’t cover a large swath well. It also makes it harder to debug. Given that most of the upstream resource exhaustion governors (e.g. HTTP timeouts) are also controlled by time, this seems like a clear path...but maybe we’re missing something.

It’s important to realize that any count-based governor is fundamentally harder to reason about and requires more spiral-of-death downstream culling decisions. “Why did this command fail?” goes from something you can ascertain from characteristics about the command (e.g. “How old was it? More than 5 seconds?”) to something that requires a much more holistic and point-in-time view of how the system was behaving and how many total commands were in play, what they were doing, etc.
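
As a sketch of why a duration-based check stays easy to reason about, here’s a hypothetical age-based culling predicate (assuming each backlogged message is stamped with its creation time; the names are placeholders, not library API):

```csharp
using System;

// "Why did this command fail?" remains answerable from the command alone:
// it sat in the backlog longer than the allowed duration.
static bool ShouldFailFromBacklog(DateTime createdUtc, TimeSpan backlogLimit) =>
    DateTime.UtcNow - createdUtc > backlogLimit;
```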

Retry

This term is a bit overloaded, but there are 2 main states for a command in play here:

  1. Not sent
  2. Sent awaiting response

We can only realistically care about the first globally - the second is in an unknown state (did it get to Redis or not? - covered above). However, we could potentially retry read-only commands in the second case, within reason.

The interesting part here is how things happen. Because a failure can happen at any point in the flow above, a command may or may not have gotten all the way to a PhysicalBridge before we realized the endpoint isn’t connected. To clarify, the order of that flow is:

  1. Command created/issued (e.g. ExecuteAsync)
  2. ServerEndPoint selection (it reads as connected at this point)
  3. PhysicalBridge handoff
  4. Socket failure detected
  5. Write to socket fails; it’s now known to be in a bad state

Later commands will have the benefit of knowing the connection is down and won’t get queued to that ServerEndPoint/PhysicalBridge at all (earlier in the pipe).

This becomes interesting when you consider a multiplexer might have more than 1 ServerEndPoint. Today, that “retry” queue is the backlog on the PhysicalBridge, for commands that make it this far (getting past the .IsConnected check before it went false). This means commands are only retrying on that specific endpoint’s bridge. For most commands, this isn’t ideal, because we could retry them on another connection that’s valid.

In the path to choosing an endpoint, the notion that a ServerEndPoint was connected is a determining factor. If we knew then that it wasn’t connected, we could have chosen another endpoint. For example, CommandFlags.Prefer* only prefer a server; they will happily accept another. For these cases, we should re-evaluate and see if we have another path. However, this approach for maximum success has a fundamental ordering trade-off: we’re throwing deterministic order (in either direction) to the wind.

A proposal thus far has been “maybe we don’t care about order - if a consumer does, they should be awaiting each command”. Whether this is a valid assumption, we don’t know - it is a guess and we need some more ecosystem input. Note: this is only within a specific endpoint, order across endpoints is in no way guaranteed - commands are issued to the wire as fast as we can and some are going to be busier than others - a non-congruent system is an unspoken but baseline assumption here when considering multiple endpoints. The endpoints in play may not even be related to each other on the Redis server side at all (e.g. they could be 2 random instances).

Open questions:

  • Order - where do we stand?
  • How do we exclude on-connect commands from any retry mechanism?
  • Do we give users control/pluggability in what is retried?
    • ...and can we do that without the exception generation, perhaps via an enum? (a rough sketch follows below)
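
One hypothetical shape for that enum-based control, purely to make the question concrete - none of these types exist in the library today:

```csharp
public enum RetryDecision
{
    Fail,            // fail fast - roughly current behavior
    RetryIfNotSent,  // safe: the command never reached the socket
    RetryAlways      // caller accepts at-least-once semantics (reads, idempotent writes)
}

public interface ICommandRetryPolicy
{
    // Called with what we know about a command caught in a connection
    // failure: whether it was already sent, and whether it is read-only.
    RetryDecision OnConnectionFailure(bool wasSent, bool isReadOnly);
}
```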

Timeouts

If we do the buffering changes above to allow buffering by default, we need to have an “out” for the high traffic/redline clients. By default this is the sync or async timeout configuration, but given the spiral of death implications, we may need an additional, explicitly lower threshold here. We could have a nullable configuration timeout value that’s used when set and falls back to the above otherwise. High throughput clients needing a low threshold could set that, and potentially we could even treat it being set to 0 as a “don’t queue/retry at all” check, failing immediately rather than queueing. The check wouldn’t be that expensive, so...maybe.
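
A hypothetical sketch of what that additional option could look like (placeholder names, not an actual API):

```csharp
using System;

public class BacklogOptions
{
    // null     -> fall back to the existing sync/async timeouts
    // 0 ms     -> don't queue/retry at all; fail immediately (redline clients)
    // anything -> maximum time a command may wait in the backlog for a reconnect
    public TimeSpan? QueueWhileDisconnectedTimeout { get; set; }
}
```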

Open questions:

  • Do we add an additional config option for timeouts?
  • Do we ever offer a retry for timeouts in any capacity? (thus far considered out of scope)

Failover

The biggest problem we see in the wild with failovers that are effectively instant is coordination. Long failovers, where Redis is down or StackExchange.Redis cannot connect to it, are what they are: we have a gap, and we buffer, retry, or fail in some way across that gap. For instant gaps, we have a lot more play in keeping everything successful. Most of the time, it’s better for a stream of commands that normally take 2ms to suddenly take 500ms and then return to normal. Graphs show a slight elevation for a moment, but users don’t get errors and most people are happy. The alternative - hard failing to an error (which we do often today) - is less desirable.

So the question is how can we coordinate? The most common controllable failover is when patching servers, as most people do on some schedule. When we promoted or replicated a server at Stack Overflow, we actually used this library to do it. A pub/sub event happens immediately following any server topology change so that all clients can re-evaluate their view of server topology (e.g. who’s a primary and who’s a replica) ASAP, minimizing impact. This approach generally works well and minimizes the amount of errors encountered.

In discussing this with the Azure Cache for Redis team, they’re open to adding a pub/sub event at a few stages of their patch cycles as well, notably right before a failover happens (making another server take over the primary role) and immediately after it happens. We can listen for both of these and re-evaluate our server endpoint status the same way, minimizing the delay between when server topology changes and when the client realizes it. There’s no need for configuration here - it’s a net win for everyone, so as soon as Azure defines this pub/sub broadcast (the channel name, really), we can put it in the library and consumers will be better off - opt-in would be extraneous and counter-productive for a global quality of life improvement.
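
A sketch of how a client could react to such a broadcast using the existing pub/sub and reconfigure APIs - the channel name and connection string below are placeholders, since Azure hasn’t defined the real channel yet:

```csharp
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true");
var sub = muxer.GetSubscriber();

sub.Subscribe("__example_failover_channel__", (channel, message) =>
{
    // Fire-and-forget: re-evaluate which endpoints are primary/replica as
    // soon as we're told a failover is about to happen (or just happened).
    _ = muxer.ConfigureAsync();
});
```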

This closer-to-instant notification story helps greatly in how many commands and for what duration we’d need to buffer in the normal patching situation. It doesn’t improve every case, but it drastically improves the most common case.

Testing

To do anything here, we need a better testing path. I discussed this with Marc, and an exciting option to explore is also the simplest: terminate the socket. We have a few ways to go about this; we can:

  • Terminate the Socket directly
  • Terminate the reader pipe (.Input) via .Complete(Exception)
  • Terminate the writer pipe (.Output) via .Complete(Exception)

We have a ServerEndPoint.SimulateConnectionFailure() method today but...it just doesn’t work. It’s terribly flaky because of all the races involved (detailed earlier) - the options above should give us much more control. In short, it was never properly adapted to a v2 pipeline world. Since we already have these references from the IDuplexPipe on the PhysicalConnection, we should be able to expose them to tests with no Pipelines.Sockets.Unofficial changes, but we are open to changes there if needed. Note that OnReaderComplete is not a viable path (discussion in dotnet/runtime#29818).
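
A sketch of those three options, assuming a (hypothetical) test hook that can reach the PhysicalConnection’s Socket and IDuplexPipe:

```csharp
using System.IO;
using System.IO.Pipelines;
using System.Net.Sockets;

enum SimulatedFailureType { Socket, InputPipe, OutputPipe }

static class ConnectionFailureSimulator
{
    public static void Simulate(Socket socket, IDuplexPipe pipe, SimulatedFailureType type)
    {
        switch (type)
        {
            case SimulatedFailureType.Socket:
                socket.Dispose(); // terminate the Socket directly
                break;
            case SimulatedFailureType.InputPipe:
                pipe.Input.Complete(new IOException("simulated read failure")); // reader (.Input)
                break;
            case SimulatedFailureType.OutputPipe:
                pipe.Output.Complete(new IOException("simulated write failure")); // writer (.Output)
                break;
        }
    }
}
```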

Open questions:

  • Does this test what we need?
  • Does it give the fidelity we need?

cc @mgravell @philon-msft @deepakverma @carldc @TimLovellSmith

Overall: we want to use this issue to get the problem set locked down and discuss approaches & solutions. If I've missed something please let me know and I'll update the summary here.
