
[FEATURE] 🏷️ Tagging #319

@jodydonetti


The Need

Time and time again, there have been requests from the community to support some form of "grouping together cache entries", primarily for multi-invalidation purposes.


On top of this, the upcoming HybridCache from Microsoft, which will tentatively be released in the .NET 9 timeframe and for which FusionCache aims to be an available implementation (and actually the first one!), will seemingly support tagging.

So, it seems the time has come to finally deal with this monumental beast.

[image: Shadow of the Colossus, Copyright © Sony, All Rights Reserved]

Scenario

As we all know, cache invalidation is in general an uber complex beast to tackle, and this is true even with "just" an L1 cache (only the first local level, in memory).

Add to this the fact that when we talk about a hybrid cache like FusionCache, we can have 2 levels (L1 + L2, memory + distributed) and multi-node invalidation (see: horizontal scalability), and it's even worse.

Finally, as a cherry on top, add the fact that FusionCache automatically handles transient errors thanks to features like fail-safe, soft timeouts, auto-recovery and more, and you have a seemingly insurmountable task ahead.

Or is it?

Limitations

Aside from the complexity of the problem itself, which as said is already kind of crazy hard, we need to deal first and foremost with a design decision that has been at the foundation of FusionCache since the beginning (and the same is also true for the upcoming HybridCache from Microsoft): the available abstractions to work with, for both L1 but even more so for L2, are quite limited.
In particular, for L2 that means IDistributedCache, with its very limited set of available functionalities.

The design decision of using IDistributedCache paid a lot of dividends over the years, because any implementation of IDistributedCache is automatically usable with FusionCache: since there are a lot of them readily available, covering very different needs, this is a very powerful characteristic to have.

On the other hand, as said, we basically have at our disposal only 3 methods:

  • Set
  • Get
  • Remove

That's it, and it's not a lot to work with.
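
To make it concrete, this is more or less everything we can do against an IDistributedCache (a quick illustration; async counterparts exist too):

```csharp
using System;
using System.Text;
using Microsoft.Extensions.Caching.Distributed;

static class DistributedCacheDemo
{
    // Roughly the entire toolbox available on IDistributedCache: key-based operations
    // on raw byte[] payloads. No scans, no batches, no "remove by tag".
    public static void Demo(IDistributedCache cache)
    {
        // Set: a key, a serialized payload, and some expiration options
        cache.Set(
            "product:123",
            Encoding.UTF8.GetBytes("some serialized payload"),
            new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5) });

        // Get: by key, or null if not there
        byte[]? payload = cache.Get("product:123");

        // Remove: by key, one at a time
        cache.Remove("product:123");
    }
}
```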

So, what can we do?

R&D Experiments

Since, as we can see, there's no native support for tagging, over the years I've experimented multiple times with an approach I called "Client-Assisted Tag Invalidation", which basically means "do something client-side only, with no server-side support, that for all intents and purposes would get us the same result from the outside".

This, in turn, translates to not actually doing a real "evict by tag" on the underlying caches (eg: on Redis, Memcached, etc) but instead keeping track of "something" on the client-side only, to check before returning data to callers.
This would logically work as a sort of "barrier" or "high-pass filter" to "hide" data that is logically expired because of one or more of its associated tags.

There are different ways to try to achieve this, but in general it would have consisted of something like:

  • locally (in memory), an extra dictionary of tag-related invalidation timestamps
  • remotely, in the L2, an extra "special" cache entry dedicated to containing the data about all expired tags, like the timestamp at which each tag-based expiration occurred

So, "removing by tag" then means getting the special entry, add the new bit of information, and saving it back.

But, as I explained in my comment regarding a similar approach for the upcoming HybridCache from Microsoft, this can have severe limitations and practical problems, like:

  • SIZE: excessive growth of a single cache entry for any real-world usage (eg: a lot of tag-based invalidations would require a ton of data squeezed into a single cache entry). This can hit even harder on some cache backends like Redis with the notorious conveyor-belt problem (eg: one big entry would block a lot of subsequent smaller entries), but even on different cache backends it's generally a bad thing to not be able to handle growth for something we already know will realistically grow over time. As a different scenario but with similar characteristics, think about a hypothetical HTTP API endpoint that returns the list of products (which naturally grows over time) but where it's not possible to do some sort of paging
  • CONCURRENCY: concurrency issues with multiple tag evictions happening at the same time on different nodes (and, maybe, even on the same node at the same time) where, because of the limited methods available on IDistributedCache, it's not possible to concurrently add 2 different pieces of information to the same cache entry at the same time. This typically results in a last-wins effect, basically cancelling the previous ones done in the same "update time-window". Even accounting for some special data structure like hashsets on Redis (on server-side cache backends that support such a thing), concurrency would be theoretically solved but the first point (size) would still remain
  • COLD STARTS: issues with cold starts, where the initial loading of that special cache entry with tag eviction data would either be blocking (and therefore potentially slowing things down, but avoiding dirty reads of expired data) or non-blocking (and therefore not slowing things down, but allowing for dirty reads). This, along with the size issue, would mean loading a lot of data all at once, again the "list endpoint without paging" issue mentioned above

All of this is why, after multiple experimentations along the years, I was basically convinced that the only way to add proper tagging support would've been to go with a "Server-Assisted Tag Invalidation" approach, meaning creating a new abstraction like IFusionCacheLevel or something (either an interface or an abstract class, it's not the point here) to model a generic and more powerful "cache level", with native support for tagging and more.

This would simplify a lot of things but, at the same time, would take away the ability to use any existing IDistributedCache implementation out there. FusionCache though already works with vanilla IDistributedCache and this should not go away, so it means FusionCache must be able to work with both vanilla and extended at the same time: this would not be a problem per se, since I can check at runtime which abstraction the L2 implements and act accordingly, but it also means that for users NOT using an extended L2 implementation, extra features like tagging would NOT be available.

And I don't like this.

And I would really really like to give tagging to all FusionCache users, all of them.

Epiphany

Recently I went to South Korea for my (very late) summer vacations.

In Seoul there's a good jazz scene, with multiple places that deserve a visit, like the really beautiful Libro for some live performances or the nice and cozy Coltrane for some vinyl listening, both highly recommended.

One evening, while drinking a glass of Ardbeg at Coltrane, Land of Make Believe by Chuck Mangione started playing.

And I suddenly had an epiphany.

Why not look at it from a different angle, get to a delicate balance between absolute performance and features available, think about how it would actually be used in the real world from a statistical perspective, and "simply" use the pieces already there to find an overall equilibrium?

By not using a single cache entry to store all tag invalidation info, we would be able to guarantee scalability with any number of tags, virtually without limits.

Solution

I'm proposing a solution I call "Client-Assisted Tag Invalidation", meaning it does not require extra assistance from the server-side.

On one hand it's true that, by looking at an entire system in production, we'll probably have a lot of tag invalidations over time, and this is a given.

On the other hand it's also true that, by their own nature, a lot of tags will be shared between cache entries: this is the whole point of it anyway.

On top of this, we can set some reasonable limits: for example when working with metrics in OTEL systems, it is a known best practice not to have a huge number of tags and not to have tags with a huge number of different values (known as "high cardinality").
So we can say the same here.

By accepting this small fact, by understanding the probabilistic nature of tag usage and sharing and, most importantly, by relying on all the existing plumbing that FusionCache already provides (like L1+L2 support, fail-safe, non-blocking background distributed operations, auto-recovery, etc) we can "simply" say that, for each tag, we'll automatically handle a cache entry with the data needed, basically the timestamp of the last time an expiration has been requested for that tag.

Regarding the probabilistic nature: basically a lot of tags will be shared between multiple cache entries, think of the Birthday Paradox.

So, a RemoveByTag("tag123") would simply set, internally, an entry with a key like "__fc:t:tag123" or something like that, containing the current timestamp. Also note that the concrete cache key will also consider any potential cache-key prefix, so multiple named caches on shared cache backends would automatically be supported, too.
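
Here's a minimal sketch of what the write path could look like (hypothetical names, and deliberately simplified: the real thing would go through the normal FusionCache L1+L2+backplane plumbing, here only the bare L2 part is shown):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

static class ClientAssistedTagging
{
    // One tiny entry per tag, holding the timestamp of the latest expiration request.
    public static Task RemoveByTagAsync(IDistributedCache l2, string cacheKeyPrefix, string tag, TimeSpan tagEntryDuration)
    {
        // The cache-key prefix keeps multiple named caches on a shared backend isolated.
        var tagEntryKey = $"{cacheKeyPrefix}__fc:t:{tag}";

        var timestamp = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

        return l2.SetAsync(
            tagEntryKey,
            Encoding.UTF8.GetBytes(timestamp.ToString()),
            new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = tagEntryDuration });
    }
}
```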

Then, when getting a cache entry, after getting it from L1/L2 but before returning it to the outside world, FusionCache would see if it has tags attached and, in that case and only in that case (so no extra costs when not used), it would get the expiration timestamp for each tag to see if the entry is expired and when.

For each related tag, if an expiration timestamp is present and it is greater than the timestamp at which the cache entry has been created, the entry should be considered expired.
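
And a minimal sketch of the matching read-path check, continuing the hypothetical ClientAssistedTagging sketch above (again simplified: the real thing would read the per-tag entry through the full FusionCache pipeline, so L1+L2, cache stampede protection, fail-safe, etc):

```csharp
// Companion read-path check: an entry is logically expired if ANY of its tags has an
// expiration timestamp newer than the timestamp at which the entry itself was created.
public static async Task<bool> IsExpiredByTagsAsync(IDistributedCache l2, string cacheKeyPrefix, string[]? tags, long entryCreationTimestamp)
{
    if (tags is null || tags.Length == 0)
        return false; // no tags attached: no extra reads, no extra cost

    foreach (var tag in tags)
    {
        var raw = await l2.GetAsync($"{cacheKeyPrefix}__fc:t:{tag}");
        if (raw is null)
            continue; // no expiration ever requested for this tag (or its special entry has expired)

        var tagExpirationTimestamp = long.Parse(Encoding.UTF8.GetString(raw));

        // As soon as one tag marks the entry as expired, we can stop
        if (tagExpirationTimestamp > entryCreationTimestamp)
            return true;
    }

    return false;
}
```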

Regarding the Duration of such special entries with tag expiration data, a value would be configurable via options, but a sensible default (like 24h) would be provided that would cover most cases.

This can be considered a "passive" approach (waiting for each read to see if it's expired) instead of an "active" one (actually go and massively expire data immediately everywhere).

When get-only methods (eg: TryGet, GetOrDefault) are called and a cache entry is found to be expired because of tags, FusionCache will not only hide it from the outside but will effectively expire it which, thanks to FusionCache's normal behaviour, means locally in the L1, on L2 and on each other node's L1 remotely (thanks to the backplane).

When get-set methods (eg: GetOrSet) are called and a cache entry is found to be expired because of tags, FusionCache just skips it internally and calls the factory, since that would produce a new value and resolve the problem anyway, just in a different way: the internal set will again automatically save the new value locally in the L1, on L2 and on each other node's L1 remotely (thanks again to the backplane).

So the system would automatically update itself internally based on actual usage, only if and when needed, without massive updates being required when expiring by a tag.

Nice.

What about app restarts? No big deal: since everything is based on the common plumbing of FusionCache, all will work normally and tag-eviction data will get re-populated automatically, lazily and based only on effective usage.

Performance considerations

But wait, this is probably ringing a bell for a lot of people reading this: isn't this a variation of the dreaded "SELECT N+1 problem"?

No, at least realistically that is not the case, mostly because of probability theory and adaptive loading based on concrete usage.

Let me explain.

A typical SELECT N+1 problem happens when, to get a piece of data, we do a first SELECT that returns N elements and then, for each element, we do an additional SELECT.

Here this does not happen, because:

  • as soon as a tag returns a timestamp that marks the entry as expired, the process stops, reducing the number of SELECTs
  • because of how tags are used (shared between different entries), one load of tag expiration data will be reused for multiple entries, again reducing the number of SELECTs

As an example, if we are loading, either concurrently or one after the other, these cache entries:

  • key "foo", tagged "tag1" and "tag2"
  • key "bar", tagged "tag2" and "tag3"
  • key "baz", tagged "tag1" and "tag3"

The expiration data for "tag1" will be loaded lazily (only when needed) and only once, and automatically shared between the processing of cache entries for both "foo" and "baz".
And since as said tags are frequently shared between different cache entries, this means that the system will automatically load only what's needed, when it's needed, and only once.

Some extra reads would be needed, yes, but definitely not the SELECT N+1 case, which would remain only as a worst case scenario, and not for every single cache read.
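
To make this concrete, a hypothetical usage sketch based on the proposed API surface (the tags parameter, RemoveByTagAsync and the LoadXyzAsync loaders are all illustrative and subject to change; cache is an IFusionCache instance):

```csharp
// Hypothetical usage, based on the proposed API (subject to change)
var foo = await cache.GetOrSetAsync("foo", async (ctx, ct) => await LoadFooAsync(ct), tags: new[] { "tag1", "tag2" });
var bar = await cache.GetOrSetAsync("bar", async (ctx, ct) => await LoadBarAsync(ct), tags: new[] { "tag2", "tag3" });
var baz = await cache.GetOrSetAsync("baz", async (ctx, ct) => await LoadBazAsync(ct), tags: new[] { "tag1", "tag3" });

// Loading "foo" and "baz" needs the expiration data for "tag1" only once: it is
// fetched lazily on first use and then shared, since it's a cache entry like any other.

// Later: logically expire everything tagged "tag1" ("foo" and "baz" here),
// without touching the entries themselves.
await cache.RemoveByTagAsync("tag1");
```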

What about needing the tag expiration data for "tag1" from 2 different cache entries at the same time? Will it be loaded multiple times?
Nope, we are covered, thanks to the Cache Stampede protection.

What about tag expiration data being propagated to other nodes?
We are covered, thanks to the Backplane.

And what if tags are based on the data returned from the factory, so that they are not known upfront?
No worries, Adaptive Caching will be extended to support tagging, too.
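
Something along these lines, purely as a hypothetical sketch (the exact shape is still to be defined; Product, LoadProductAsync and ctx.Tags are illustrative, cache is an IFusionCache):

```csharp
// Hypothetical adaptive tagging: tags derived from the factory result, set via the
// execution context, like adaptive caching already allows for Options (eg: ctx.Options.Duration).
var product = await cache.GetOrSetAsync<Product>(
    $"product:{id}",
    async (ctx, ct) =>
    {
        var loaded = await LoadProductAsync(id, ct);

        // Not known upfront: derived from the loaded data
        ctx.Tags = new[] { $"category:{loaded.CategoryId}", $"brand:{loaded.BrandId}" };

        return loaded;
    });
```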

What about potential transient errors?
We are covered, thanks to Fail-Safe.

What about slow distributed operations?
Again we are covered, thanks to advanced Timeouts and Background Distributed Operations.

What about recovering from distributed errors? Should users need to handle them manually?
Nope, also covered, thanks to Auto-Recovery.

All of this because of the solid foundations that have been built in FusionCache for years 💪

What about Clear()?

If all of this works out, and up until now it seems so, this same approach may also be used to finally implement something else: a proper Clear() method, one that actually supports all scenarios:

  • can work on L1
  • can work on L2
  • can work on multiple nodes
  • can work with multiple named caches and cache-key prefix

But how?

By simply adding support for a special "*" tag (star, meaning "all") we can achieve that.

This tag can also receive special treatment, like being immediately read from L2 when an update notification is received, for performance reasons.
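
Conceptually, a Clear() would then boil down to something like this (again, just an illustrative sketch):

```csharp
// Illustrative only: Clear() layered on the same mechanism, via the special "*" tag
// that every entry is implicitly considered tagged with.
// Since the "__fc:t:*" entry is prefix-aware like any other entry, this clears only
// THIS named cache even on a shared backend, and propagates via the usual plumbing.
await cache.RemoveByTagAsync("*");
```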

Server-Assisted Tag Invalidation?

Does this approach exclude a hypothetical "Server-Assisted Tag Invalidation" with an extended IFusionCacheLevel or similar?

No, actually not! But supporting tagging without that means that the feature can be available right now, for any existing IDistributedCache implementation, without requiring any extra assistance from 3rd party packages, and with maybe a couple of extra reads here and there.

In the future though I think I will also explore the server-assisted route, because it can lead to a good perf boost: the nice thing about doing the client-assisted approach first is that the feature will be available either way, and when using the eventual extended abstraction you'll "just" get an extra perf boost, but in both cases with no limitations at all.

I think this is the best approach overall.

Where are we now?

Right now I have an implementation working on a local branch, which is already something damn awesome to be honest.

I'm currently in the process of fine-tuning it, benchmarking it, testing edge cases, tracing/logging the hell out of it to also see the extra work required while simulating real-world scenarios, and so on.

If all goes well this feature will be included in FusionCache v2.0, which would be released at around the same time as .NET 9, including support for the new HybridCache from Microsoft.

Your help is needed

But, honestly, it still seems too good to be true, and I may be missing something.

So, here we go: can you, dear user, please reason about the approach, about pros/cons, and try to see if you can spot any problem?

It would be a really invaluable thing for me to have, and I thank you for that in advance.

Thanks 🙏
