You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Internal] Circuit Breaker: Adds Code to Implement Per Partition Circuit Breaker (#5023)
# Pull Request Template
## Description
The idea of having a per partition circuit breaker (aka PPCB) is to
optimize **a)** read availability , in a single master account and
**b)** read + write availability in a multi master account, during a
time when a specific partition in one of the regions is experiencing an
outage/ quorum loss. This feature is independent of the partition level
failover triggered by the backend. The per partition circuit breaker is
developed behind a feature flag `AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED`.
However, when the partition level failover is enabled, we will enable
the PPCB by default so that the reads can benefits from it.
## Scope
- For single master, only the read requests will use the circuit breaker
to add the pk-range to region override mapping, and use this mapping as
a source of truth to route the read requests.
- For multi master, both the read and write requests will use the
circuit breaker to add the pk-range to region override mapping and use
this mapping as a source of truth to route both read and write requests.
## Understanding the Configurations exposed by the environment
variables:
- `AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED`: This environment variable is
used to enable/ disable the partition level circuit breaker feature. The
default value is `false`.
-
`AZURE_COSMOS_PPCB_STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS`:
This environment variable is used to set the background periodic address
refresh task interval. The default value for this interval is `60
seconds`.
-
`AZURE_COSMOS_PPCB_ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS`:
This environment variable is used to set the partition unavailability
time duration in seconds. The unavailability time indicates how long a
partition can remain unhealthy, before it can re-validate it's
connection status. The default value for this property is `5 seconds`.
- `AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_READS`: This
environment variable is used to set the consecutive failure count for
reads, before triggering per partition level circuit breaker flow. The
default value for this flag is `10` consecutive failures within 1 min
window.
- `AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_WRITES`: This
environment variable is used to set the consecutive failure count for
writes, before triggering per partition level circuit breaker flow. The
default value for this flag is `5` consecutive failures within 1 min
window.
## Understanding the Working Principle:
On a high level, there are three parts of the circuit breaker
implementation:
- **Short Circuit and Failover detection:** The failover detection logic
will reside in the SDK ClientRetryPolicy, just like we have for PPAF.
Ideally the detection logic is based on the below two principles:
- **Status Codes:** The status codes that are indicative of partition
level circuit breaker would be the following: a) `503` Service
Unavailable, b) `408` Request Timeout, c) cancellation token expired.
- **Threshold:** Once the failover condition is met, the SDK will look
for some consecutive failures, until it hits a particular threshold.
Once this threshold is met, the SDK will fail over the read requests to
the next preferred region for that offending partition. For example, if
the threshold value for read requests is `10`, then the SDK will look
for `10` consecutive failures. If the threshold is met/ exceeded, the
SDK will add the region failover information for that partition.
- **Failover a faulty partition to the next preferred region:** Once the
failover conditions are met, the `ClientRetryPolicy` will trigger a
partition level override using
`GlobalPartitionEndpointManagerCore.TryMarkEndpointUnavailableForPartitionKeyRange`
to the next region in the preferred region list. This failover
information will help the current, as well as the subsequent requests
(reads in single master and both reads and writes in multi master) to
route the request to the next region.
- **Failback the faulty partition to it's original first preferred
region:** With PPAF enabled, ideally the write requests will rely on
403.3 (Write Forbidden) signal to fail the partition back to the primary
write region. However, this is not true for reads. That means SDK
doesn’t have a definitive signal to identify when to initiate a failback
for read requests.
Hence, the idea is to create a background task during the time of read
failover, which will keep track of the pk-range and region mapping. The
task will periodically fetch the address from the gateway address cache
for those pk ranges in the faulty region, and it will try to initiate
Rntbd connection to all 4 replicas of that partition. The RNTBD open
connection attempt will be made similar to that of the replica
validation flow. The life cycle of the background task will get
initiated during a failover and will remain until the SDK is disposed.
If the attempt to make the connection to all 4 replicas is successful,
then the task will remove/ override the entry with the primary region,
resulting the SDK to failback the read requests.
## Type of change
- [x] New feature (non-breaking change which adds functionality)
## Closing issues
To automatically close an issue: closes#4981
/// Enable partition level circuit breaker (aka PPCB). For compute gateway use case, by default per partition automatic failover will be disabled, so does the PPCB.
736
+
/// If compute gateway chooses to enable PPAF, then the .NET SDK will enable PPCB by default, which will improve the read availability and latency. This would mean
737
+
/// when PPAF is enabled, the SDK will automatically enable PPCB as well.
0 commit comments