This issue is meant to serve as a single "design document"-like description of our current understanding of how QQ and stream replicas could be grown in an automated way, without breaking installations where replica management is performed using CLI tools, or by other system parts (such as peer discovery or the stream coordinator).
Original issues
Problem definition
Because quorum queues and streams depend on a majority of replicas being online (a Raft quorum), many
users would like to maintain a certain minimum number of replicas (in most cases, 3 or 5). Going
above that minimum is considered a non-issue, at least in this particular "design document".
Originally, QQ replica management was made available only via CLI tools, because dynamic replica management
is a much more complex problem than it seems and not everyone agrees on how it should be done.
However, three-plus years later, and with at least four RabbitMQ-as-a-Service offerings on the market, it
is time to make this more automatic.
Similar problems already solved in RabbitMQ
There are a number of similar problems for which RabbitMQ already has solutions:
- Cluster formation deals with cluster membership (not QQ or stream replica membership), including automated removal of nodes after a configurable timeout
- Every stream and superstream has a "manager" component that is responsible for replica placement, primarily during the initial declaration of the stream
- The same goes for quorum queues: a newly declared quorum queue has N replicas started across the cluster according to the x-quorum-initial-group-size optional queue argument, which defaults to 3 (see the sketch below)
The set of features outlined in this document ideally should not require significant changes to
the behavior of the features above.
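For reference, this is how the initial group size is typically set today, at declaration time. A minimal sketch using the Pika Python client; the connection parameters and queue name are placeholders:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Quorum queues must be durable; the optional argument below controls how
# many replicas are started when the queue is first declared.
channel.queue_declare(
    queue="qq.1",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        "x-quorum-initial-group-size": 3,
    },
)

connection.close()
```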
How to configure the limit: policy? x-args (optional argument)? rabbitmq.conf?
- Policies are a way to configure things in RabbitMQ dynamically, for a group of objects such as queues, streams, and so on.
- Runtime parameters are a way to configure global things dynamically, for the lifetime of the node process.
- rabbitmq.conf settings are (almost all) static and global.
- Optional queue arguments are a way for clients to configure things statically, for the lifetime of a queue or stream.
While all of these options could work for determining a minimum number of replicas a QQ should have,
optional arguments were selected for initial QQ replica count because this setting seems
naturally static to the core team members who mostly work on QQs.
However, policies match a group of QQs, as would a rabbitmq.conf default, which is a nice
property for RabbitMQ-as-a-Service scenarios.
As a result, both optional arguments and policies can be useful, and like with many other settings, both can be supported, with optional arguments taking precedence.
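To make the precedence idea concrete, below is a rough sketch of what the two mechanisms could look like side by side. The key name target-group-size (and its x- prefixed optional argument form) is an assumption used purely for illustration; this document does not fix the final name.

```python
import pika
import requests

# Option 1: a per-queue optional argument set by the declaring client.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="qq.orders",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        # hypothetical optional argument carrying the minimum replica count
        "x-quorum-target-group-size": 3,
    },
)
connection.close()

# Option 2: a policy applied by an operator via the HTTP API, matching a
# group of quorum queues. The optional argument above, when present, would
# take precedence for individual queues.
requests.put(
    "http://localhost:15672/api/policies/%2F/qq-min-replicas",
    auth=("guest", "guest"),
    json={
        "pattern": "^qq\\.",
        "apply-to": "queues",
        "definition": {"target-group-size": 3},  # hypothetical policy key
        "priority": 0,
    },
)
```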
Continuously maintained minimum replica count (CMMRC)
Quorum queues, streams, and superstreams already have the notion of an initial replica count.
What we'd like to achieve here is a continuously maintained minimum replica count, which
in practice will be the same number as the initial count but is not the same thing semantically.
The continuously maintained minimum replica count (CMMRC) is both a number and a set of internal mechanisms that would add replicas should a cluster node fail and some replicas become unavailable.
Specifically, we seek a mechanism that would add a new replica after a period of time has passed
since a node or QQ/stream replica failure was detected, using a function that suggests the
most suitable nodes for placement (similar to how peer discovery/cluster formation uses a pluggable function).
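To illustrate the kind of pluggable function meant here, below is a minimal Python sketch. The function name, its arguments, and the AZ metadata shape are assumptions for illustration, not an existing RabbitMQ API:

```python
from collections import Counter

def suggest_replica_placement(current_replicas, cluster_nodes, node_az=None):
    """Return a node to host a new replica, or None if no candidate exists.

    current_replicas -- nodes that already host a replica of the queue
    cluster_nodes    -- all nodes currently in the cluster
    node_az          -- optional mapping of node -> availability zone
    """
    candidates = [n for n in cluster_nodes if n not in current_replicas]
    if not candidates:
        return None
    if not node_az:
        # No infrastructure awareness: pick any free node deterministically.
        return sorted(candidates)[0]
    # Prefer a node in the AZ that currently hosts the fewest replicas.
    az_load = Counter(node_az.get(n) for n in current_replicas)
    return min(candidates, key=lambda n: (az_load.get(node_az.get(n), 0), n))

# Example: replicas on A and B (both in az-1); D sits in the less loaded az-2.
azs = {"A": "az-1", "B": "az-1", "C": "az-1", "D": "az-2", "E": "az-2"}
print(suggest_replica_placement({"A", "B"}, {"A", "B", "D", "E"}, azs))  # -> D
```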
Going above the limit
It's fine for QQs and streams to have one or more extra replicas. Those can be manually removed, as long as the total count does not dip below the CMMRC.
Triggering events
RabbitMQ nodes observe peer node failures, including connectivity link failures. Automatic
replica addition therefore can be implemented in reaction to them, or using periodic checks.
What is important is to have a failure recovery interval: the stream coordinator, for example, only kicks in some 30-60 seconds after it detects a replica failure. This interval should be configurable.
Once the interval has passed, a placement node can be selected using a pluggable function,
and a new replica can be added there using the existing mechanisms.
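A rough sketch of that reaction, assuming hypothetical helpers for cluster introspection and replica addition (none of these names are real RabbitMQ APIs), and reusing the placement function sketched above:

```python
import time

FAILURE_RECOVERY_INTERVAL = 60  # seconds; should be configurable

def on_replica_failure(queue, cmmrc, cluster_view):
    # Give the failed node a chance to come back before doing anything.
    time.sleep(FAILURE_RECOVERY_INTERVAL)
    replicas = cluster_view.online_replicas(queue)  # re-evaluate after the wait
    if len(replicas) >= cmmrc:
        return  # the failed replica recovered; nothing to do
    target = suggest_replica_placement(replicas, cluster_view.online_nodes())
    if target is not None:
        cluster_view.add_replica(queue, target)  # reuse the existing grow mechanism
```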
Event-driven or periodic async checks
The CMMRC checks can be performed in response to various cluster events or periodically.
The former would avoid extra periodic load. The latter would be easier to implement.
Both can be combined if necessary.
When a cluster event such as a node failure or permanent node removal occurs, reacting to it directly can trigger
a mass addition of replicas to all QQs and streams, consuming CPU, network link, and potentially disk I/O resources on several nodes for a period of time.
A "periodic async" check performed by queues individually will naturally spread this load over time. This periodic async option seems more appealing to the core team for that reason.
Scenarios and Examples
One
A quorum queue qq.1 with an initial replica count and CMMRC of 3 is declared in a five-node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, and C.
Node C suffered an unrecoverable failure and was deleted from the cluster (say, by an operator). After 60 seconds, an internal QQ monitoring check
decided that a new replica must be added to adhere to the configured CMMRC. Node D was selected for the new replica placement by a function that was aware of the deployment infrastructure's AZs.
A new replica was added, so qq.1 was now hosted on nodes A, B, and D. Future CMMRC checks
discovered that no action was necessary.
Two
A quorum queue qq.2 with an initial replica count and CMMRC of 3 is declared in a five-node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, and C.
Node C suffered a permanent hardware failure. After 60 seconds, an internal QQ monitoring check
decided that a new replica must be added to adhere to the configured CMMRC, and node D was selected.
However, at the same time, an external monitoring tool also detected the failure, and a human
operator decided to manually add a new replica.
One new replica was added to qq.2 manually and another one by the automatic CMMRC check, ending up on nodes D and E. So qq.2
was now hosted on nodes A, B, D, and E. Future CMMRC checks
discovered that no action was necessary.
Three
A quorum queue qq.3 with an initial replica count and CMMRC of 3 is declared in a five-node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, and C.
The CMMRC check interval was configured to 90 seconds.
Node C was rebooted and came back after 60 seconds; within another 10, qq.3's failed replica rejoined its peers and caught up (performed a Raft log delta reinstallation).
After 90 seconds, an internal QQ monitoring check
discovered that no action was necessary.
qq.3 was still hosted on nodes A, B, and C. Future CMMRC checks
discovered that no action was necessary.