Proposal for an alternative algorithm for events-based cache event tracking #5624

@azdagron

Description

The experimental events-based cache relies on tracking change events for registration entries and attested nodes. The current algorithm relies on the monotonically increasing property of the event ID. Incorrect assumptions about event arrival order and event ID increment stepping have driven a series of fixes to the current algorithm that have complicated the codebase (e.g. skipped-event discovery and tracking) and added pessimistic database query load, even though one of the primary goals of the events-based cache is to reduce database load in the steady state. Organizations whose deployments tickle the conditions above are not realizing as much benefit from the events-based cache as should be possible.

I propose the following algorithm change to the events-based cache:

  • SPIRE Server retains a list of events named processed_events, which is initially empty.
  • Every cache poll cycle:
    1. all events over the last N minutes are retrieved, based on the event CreatedAt column, into a list recent_events
    2. a diff is performed against processed_events and recent_events
      • events in recent_events but not in processed_events (i.e. new events) are processed and added to processed_events
      • events in processed_events but not in recent_events (i.e. old events) are removed from processed_events
      • events in both processed_events and recent_events (i.e. already processed events) are ignored

We can handle failures to talk to the database in a given poll cycle by either:

  1. polling for max(N, time since the last successful poll), or
  2. rehydrating the entire cache if it has been longer than some threshold since the last poll

This algorithm assumes a few things:

  • Number of events in the last N minutes is manageably sized:
    Seems reasonable if we keep N small, e.g. 3 minutes
  • No mutating transaction lives longer than N minutes:
    As far as I know, the only mutating operation that has the potential for a long-lived transaction is that which prunes registration entries, which executes a single statement to delete all registration entries older than some threshold. When the registration entry set is large, this operation can take some time. We can likely reduce the per-transaction time by doing the deletion in small batches and more frequently.
  • Event lookup by timestamp is efficient:
    The created_at column is not currently indexed, but can be.

PROS:

  • No more coupling with the event ID values. The event ID is now a simple lookup key and only has to be unique within the N-minute window.
  • Reduced database load for deployments that currently tickle large numbers of skipped events in the current algorithm.
  • Reaches the goal of low-to-no database load when there is no churn over registration entries/agents.
  • Code is much simpler.

CONS:

  • Database load is not immediately reduced when there is registration entry/agent churn, since we'll still poll for events within the N-minute window, but the reduction takes full effect once the window has elapsed.

I have prototyped the above algorithm and so far have not witnessed split brain in the cache, though more testing is required.

Metadata

Labels

  • priority/backlog — Issue is approved and in the backlog
  • unscoped — The issue needs more design or understanding in order for the work to progress
