The experimental events-based cache relies on tracking change events for registration entries and attested nodes. The current algorithm relies on the monotonically increasing property of the event ID. Previous assumptions around arrival order of events and event ID increment stepping have contributed to a series of fixes to the current algorithm that have complicated the codebase (e.g. skipped event discovery and tracking) and contributed to pessimistic database query load, when one of the primary goals for the events-based cache is to reduce database load in the steady state. Organizations whose deployments tickle the conditions above are not realizing as much benefit from the events-based cache as should be possible.
I propose the following algorithm change to the events-based cache:
- SPIRE Server retains a list of events named `processed_events`, which is initially empty.
- Every cache poll cycle:
  - all events over the last `N` minutes are retrieved, based on the event CreatedAt column, into a list `recent_events`
  - a diff is performed against `processed_events` and `recent_events`:
    - events in `recent_events` but not in `processed_events` (i.e. new events) are processed and added to `processed_events`
    - events in `processed_events` but not in `recent_events` (i.e. old events) are removed from `processed_events`
    - events in both `processed_events` and `recent_events` (i.e. already processed events) are ignored
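The per-poll diff step could be sketched in Go roughly as follows. This is a minimal illustration, not SPIRE's actual code: the `Event` type and `diffEvents` function are simplified stand-ins, with the processed set keyed by event ID.

```go
package main

import "fmt"

// Event is a simplified stand-in for a row from the events table.
type Event struct {
	ID uint
}

// diffEvents performs the per-poll reconciliation: events in recent but
// not in processed are new and get processed; entries in processed that
// no longer appear in recent have aged out of the N-minute window and
// are forgotten; events in both sets are ignored.
func diffEvents(processed map[uint]struct{}, recent []Event) []Event {
	var newEvents []Event
	recentSet := make(map[uint]struct{}, len(recent))
	for _, e := range recent {
		recentSet[e.ID] = struct{}{}
		if _, ok := processed[e.ID]; !ok {
			// New event: hand it to the caller and remember it.
			newEvents = append(newEvents, e)
			processed[e.ID] = struct{}{}
		}
	}
	// Forget events that fell out of the recent window.
	for id := range processed {
		if _, ok := recentSet[id]; !ok {
			delete(processed, id)
		}
	}
	return newEvents
}

func main() {
	processed := map[uint]struct{}{1: {}, 2: {}}
	newEvents := diffEvents(processed, []Event{{ID: 2}, {ID: 3}})
	fmt.Println(len(newEvents)) // 1: only event 3 is new
	_, stillHas1 := processed[1]
	fmt.Println(stillHas1) // false: event 1 aged out of the window
}
```

Note that event IDs only need to be unique within the window; the set semantics do the rest.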
We can handle failures to talk to the database in a given poll cycle by either:
- polling over a window of max(`N`, time since the last successful poll), or
- rehydrating the entire cache if it has been longer than some threshold since the last poll
This algorithm assumes a few things:
- Number of events in the last `N` minutes is manageably sized: seems reasonable if we keep `N` small, e.g. 3 minutes
- No mutating transaction lives longer than `N` minutes: as far as I know, the only mutating operation that has the potential for a long-lived transaction is the one that prunes registration entries, which executes a single statement to delete all registration entries older than some threshold. When the registration entry set is large, this operation can take some time. We can likely reduce the per-transaction time by doing the deletion in small batches and more frequently.
- Event lookup by timestamp is efficient: the created_at column is not currently indexed, but can be.
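The batched pruning idea could look roughly like this in Go. `pruneInBatches` and its callback are hypothetical names; the actual database delete (e.g. a `DELETE ... LIMIT` on MySQL) is abstracted behind the callback so each batch runs in its own short transaction.

```go
package main

import "fmt"

// pruneInBatches repeatedly invokes deleteBatch, which should delete at
// most batchSize expired rows in its own transaction and report how
// many it removed. Stopping when a batch comes back short keeps each
// transaction brief, so pruning never holds a transaction open longer
// than the N-minute event window.
func pruneInBatches(batchSize int, deleteBatch func(limit int) (int, error)) (total int, err error) {
	for {
		n, err := deleteBatch(batchSize)
		if err != nil {
			return total, err
		}
		total += n
		if n < batchSize {
			return total, nil
		}
	}
}

func main() {
	// Simulate a table with 250 expired rows and a batch size of 100.
	remaining := 250
	total, _ := pruneInBatches(100, func(limit int) (int, error) {
		n := limit
		if remaining < n {
			n = remaining
		}
		remaining -= n
		return n, nil
	})
	fmt.Println(total) // 250, deleted across three short transactions
}
```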
PROS:
- No more coupling with the event ID values. The ID is now a simple lookup key and only has to be unique across the `N` minute window.
- Reduced database load for deployments that currently tickle large amounts of skipped events in the current algorithm.
- Reaches the goal of low-to-no database load when there is no churn over registration entries/agents.
- Code is much simpler.
CONS:
- Database load is not immediately reduced when there is registration entry/agent churn, since we'll still poll for events over the `N` minute window, but it fully reduces after `N` minutes.
I have prototyped the above algorithm and so far have not witnessed split brain in the cache, though more testing is required.