Better Prevent Read Outliers during short-term Bookie Slow-Down #1489

@nicmichael

BUG REPORT

Issue #709 introduced a feature that puts bookies which did not respond to a read request within speculativeReadTimeout ms (default: 2000 ms) on a "slowBookie" list, where they remain for bookieFailureHistoryExpirationMSec ms (default: 60000). During this period, bookies on the slowBookie list are placed behind every non-slow bookie in the "writeSet" that determines the read order. As a consequence, a bookie that has been slow just once is effectively not considered for further reads for 60 seconds. This pushes more load (potentially unnecessarily) onto the other bookies in the ensemble. In some cases this may overload those bookies, which might then hit the speculativeReadTimeout as well. I've seen all bookies end up on the slowBookie list even though none of them had any longer-lasting performance issue. At that point, the read order is decided by the number of outstanding requests on those bookies.
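To make the behavior described above concrete, here is a minimal, self-contained Java model of it. It is a sketch under stated assumptions, not BookKeeper's actual code: the class `SlowBookieSketch` and its methods are hypothetical names invented for illustration, the two constants simply mirror the defaults quoted above, and the fallback to ordering by outstanding requests when every bookie is slow is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the slowBookie list; not the BookKeeper implementation.
class SlowBookieSketch {
    // Values mirror the defaults quoted in this issue.
    static final long SPECULATIVE_READ_TIMEOUT_MS = 2000;
    static final long FAILURE_HISTORY_EXPIRATION_MS = 60000;

    // bookie name -> time (ms) at which it was last marked slow
    private final Map<String, Long> slowSince = new HashMap<>();

    // A read that takes at least the speculative timeout marks the bookie slow.
    void recordRead(String bookie, long latencyMs, long nowMs) {
        if (latencyMs >= SPECULATIVE_READ_TIMEOUT_MS) {
            slowSince.put(bookie, nowMs);
        }
    }

    // A bookie stays slow for the full expiration window, regardless of
    // whether it has already recovered.
    boolean isSlow(String bookie, long nowMs) {
        Long since = slowSince.get(bookie);
        return since != null && nowMs - since < FAILURE_HISTORY_EXPIRATION_MS;
    }

    // Non-slow bookies keep their writeSet order; slow ones are moved behind them.
    List<String> readOrder(List<String> writeSet, long nowMs) {
        List<String> fast = new ArrayList<>();
        List<String> slow = new ArrayList<>();
        for (String bookie : writeSet) {
            if (isSlow(bookie, nowMs)) {
                slow.add(bookie);
            } else {
                fast.add(bookie);
            }
        }
        fast.addAll(slow);
        return fast;
    }

    public static void main(String[] args) {
        SlowBookieSketch sketch = new SlowBookieSketch();
        List<String> writeSet = Arrays.asList("b1", "b2", "b3");
        sketch.recordRead("b1", 2500, 0);                      // one slow read at t=0
        System.out.println(sketch.readOrder(writeSet, 1000));  // b1 is moved to the back
        System.out.println(sketch.readOrder(writeSet, 60000)); // original order is restored
    }
}
```

In this model a single slow read at t=0 keeps `b1` at the back of the read order until the expiration window has passed, even if `b1` recovered a few milliseconds later. That is the effect this report is about.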

This might work well if bookies are slow for an extended period of time, but not if they only face a brief hiccup. For example, we're setting speculativeReadTimeout = 30 to improve read latency and prevent outliers caused by slow bookies, for example due to Java GC in bookies (on the order of tens of ms) or any other short hiccups they may face. The current implementation does not address these needs adequately: it only detects slow bookies after speculativeReadTimeout ms (not while a queue is building up on a bookie), and it keeps bookies that have been slow once on the slowBookie list needlessly long, without detecting whether the problem (e.g. Java GC) has been resolved.

We should have a feature that reacts more quickly to short latency spikes or blocking of individual bookies, and quickly resumes directing load to them once the situation is resolved.
