Expand and clarify consistency/durability docs in store.wit #56


Open · wants to merge 9 commits into main
84 changes: 68 additions & 16 deletions wit/store.wit
@@ -7,22 +7,67 @@
/// ensuring compatibility between different key-value stores. Note: the clients will be expecting
/// serialization/deserialization overhead to be handled by the key-value store. The value could be
/// a serialized object from JSON, HTML or vendor-specific data types like AWS S3 objects.
///
/// ## Consistency
///
/// Data consistency in a key value store refers to the guarantee that once a write operation
/// completes, all subsequent read operations will return the value that was written.
///
/// Any implementation of this interface must have enough consistency to guarantee "reading your
/// writes." In particular, this means that the client should never get a value that is older than
/// the one it wrote, but it MAY get a newer value if one was written around the same time. These
/// guarantees only apply to the same client (which will likely be provided by the host or an
/// external capability of some kind). In this context a "client" is referring to the caller or
/// guest that is consuming this interface. Once a write request is committed by a specific client,
/// all subsequent read requests by the same client will reflect that write or any subsequent
/// writes. Another client running in a different context may or may not immediately see the result
/// due to the replication lag. As an example of all of this, if a value at a given key is A, and
/// the client writes B, then immediately reads, it should get B. If something else writes C in
/// quick succession, then the client may get C. However, a client running in a separate context may
/// still see A or B.
/// Any implementation of this interface MUST have enough consistency to guarantee "reading your
/// writes" for read operations on the same `bucket` resource instance. Reads from `bucket`
/// resources other than the one used to write are _not_ guaranteed to return the written value
/// given that the other resources may be connected to other replicas in a distributed system, even
/// when opened using the same bucket identifier.
///
/// In particular, this means that a `get` call for a given key on a given `bucket`
/// resource MUST never return a value that is older than the last value written to that key
/// on the same resource, but it MAY get a newer value if one was written around the same
/// time. These guarantees only apply to reads and writes on the same resource; they do not hold
/// across multiple resources -- even when those resources were opened using the same string
/// identifier by the same component instance.
///
/// The following pseudocode example illustrates this behavior. Note that we assume there is
/// initially no value set for any key and that no other writes are happening beyond what is shown
/// in the example.
///
/// bucketA = open("foo")
/// bucketB = open("foo")
/// bucketA.set("bar", "a")
/// // The following are guaranteed to succeed:
/// assert bucketA.get("bar").equals("a")
/// assert bucketB.get("bar").equals("a") or bucketB.get("bar") is None
/// // ...whereas this is NOT guaranteed to succeed immediately (but SHOULD eventually):
/// // assert bucketB.get("bar").equals("a")
///
/// Once a value is `set` for a given key on a given `bucket` resource, all subsequent `get`
/// requests on that same resource will reflect that write or any subsequent writes. `get` requests
/// using a different bucket may or may not immediately see the new value due to e.g. cache effects
/// and/or replication lag.
///
/// Continuing the above example:
///
/// bucketB.set("bar", "b")
/// bucketC = open("foo")
/// value = bucketC.get("bar")
/// assert value.equals("a") or value.equals("b") or value is None
///
/// In other words, the `bucketC` resource MAY reflect either the most recent write to the `bucketA`
/// resource, or the one to the `bucketB` resource, or neither, depending on how quickly either of
/// those writes reached the replica from which the `bucketC` resource is reading. However,
/// assuming there are no unrecoverable errors -- such that the state of a replica is irretrievably
I'm confused why we mention "unrecoverable errors". Such errors aren't visible to the guest and thus aren't really of consequence to the guest. I believe the important bit is that the writes on one resource are not guaranteed to be reflected on subsequent reads of a different resource.

As things are written I'm unsure about the following situation. Imagine the guest code:

bucketA = open("foo")
bucketB = open("foo")
bucketA.set("bar", "a")

sleep(1_000_000_years)

assert bucketA.get("bar").equals(bucketB.get("bar"))

The client has left sufficient time (1,000,000 years) for replication to happen. However, the backing implementation uses caching such that once set is called, get on that resource will always reflect the call to set. Unfortunately, the underlying write failed and so the cache does not reflect the state of the backing store. This means bucketA and bucketB will never agree on the value of "bar".

Is that spec compliant?

Author

The scenario I had in mind regarding "unrecoverable errors" was where bucketA is connected to replica X and bucketB is connected to replica Y, but replica X is lost (say the rack caught on fire) before it can send bucketA's write to replica Y. Very unlikely of course, and certainly outside the realm of normal operation, but it still prevents us from making any absolute guarantees. In any case, such an error is of consequence to the guest in that bucketA's write never had a chance to be the one the system eventually settles on. And if both replica X and replica Y were in that same unfortunate rack, then it's possible neither write made it to the rest of the system.

BTW, if the discussion of unusual errors is distracting and/or superfluous, I can omit it or move it to a footnote. I mainly just wanted to point out that failures in a distributed system are non-atomic and can affect the behavior of that system even when it's still (partially) available. That's in contrast to a centralized, ACID database where it either fails completely or not at all.

Regarding caching: I expect assert bucketA.get("bar").equals(bucketB.get("bar")) should eventually be true for a long running process; i.e. values shouldn't be cached indefinitely. Not sure exactly where we draw the line on cache invalidation timing, but certainly less than a million years :). And implementations based on systems which support proactive cache eviction (e.g. by pushing notifications to clients) would presumably make use of that.
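To make that concrete, here is a minimal sketch of the kind of bounded-lifetime read cache described above. This is hypothetical host-side Rust, not part of the spec; `ReadCache` and `fetch` are illustrative names.

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical host-side read cache: entries expire after `ttl`, so a
// stale value is eventually refreshed from the backing store rather
// than served forever.
struct ReadCache {
    ttl: Duration,
    entries: HashMap<String, (Option<Vec<u8>>, Instant)>,
}

impl ReadCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    // Serve from the cache only while the entry is younger than `ttl`;
    // otherwise fall through to `fetch` (the real store read) and refresh.
    fn get(
        &mut self,
        key: &str,
        fetch: impl FnOnce(&str) -> Option<Vec<u8>>,
    ) -> Option<Vec<u8>> {
        if let Some((value, written)) = self.entries.get(key) {
            if written.elapsed() < self.ttl {
                return value.clone();
            }
        }
        let fresh = fetch(key);
        self.entries
            .insert(key.to_string(), (fresh.clone(), Instant::now()));
        fresh
    }
}

A system with proactive cache eviction would additionally drop entries when the backing store pushes an invalidation, rather than relying on the TTL alone.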


I don't think this discussion is superfluous. I think it's extremely important. It's the difference between whether host implementors of this interface need to wait for a guarantee of replication or not. If we settle on the semantics that writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?

Author

> If we settle on the semantics that writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?

Yes, that sounds correct to me. FWIW, I do think supporting two kinds of writes (one that uses write-behind caching to avoid blocking and another that blocks until it has received confirmation from at least one replica) and two kinds of reads (one that uses a cache and one that doesn't) could make sense. Even when using the blocking versions of those operations, though, we still wouldn't be able to make guarantees about if/when the write is visible using a different resource handle (since it might be connected to a different replica).
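For example, the "trust a write only via a fresh handle" pattern described here might look roughly like this in a guest, assuming wit-bindgen-style Rust bindings for this interface (the module path and method names are assumptions based on store.wit, not a confirmed API):

use wasi::keyvalue::store::{self, Error};

// Write through one handle, then check visibility through a *new* handle,
// since reads on a different resource may hit a different replica.
fn set_and_verify(id: &str, key: &str, value: &[u8]) -> Result<bool, Error> {
    let writer = store::open(id)?;
    writer.set(key, value)?; // "read your writes" holds for `writer` only

    let checker = store::open(id)?;
    // May be false until replication catches up; callers could retry
    // with backoff if they need confirmation.
    Ok(checker.get(key)?.as_deref() == Some(value))
}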

Some distributed databases use a single-master replication model, which makes it easier to provide stronger guarantees -- e.g. as long as you get write confirmation from the master and then, when reading, request that the replica syncs with the master before returning a result, you'll get very ACID-style semantics. That's what Turso does to implement transactional writes and BEGIN IMMEDIATE transactional reads. The only way to do that with a highly-available, asynchronous, peer-to-peer database is to request write confirmation from all replicas and then, when reading, request that the replica you're talking to sync with all the other replicas before returning a result.

It might help in this discussion to nail down the minimum feature set (related to consistency, durability, or otherwise) a backing key value store must provide to be compatible with wasi-keyvalue, and then determine which systems (e.g. Redis, Cassandra, Memcached, etc.) actually support them. If all the backing stores we want to use support consistency features with tighter guarantees than the ones I've described here, then we can tighten up this language as well.

/// lost before it can be propagated -- one of the values ("a" or "b") SHOULD eventually be
/// considered the "latest" and replicated across the system, at which point all three resources
/// will return that same value.
///
/// ## Durability
///
/// This interface does not currently make any hard guarantees about the durability of values
Collaborator
I think it's okay to leave the durability wide open. I am wondering about your case 3, the async set calls scenario: we want to emphasize that the implementation should still guarantee "read your writes" data consistency.

Now, there is a question of what happens if an async I/O error occurs right after the set call completes successfully: this is a weak point of the current specification, and I was hoping we could address it.

In a strict interpretation of the spec, once set is Ok, the handle SHOULD behave as if the value is now present. A get on the same handle SHOULD return the new value.

If the store experiences a critical I/O failure that causes data corruption or data loss, there are currently no instructions on how the store should respond. Should it return Err(error::other(...)) on subsequent get calls?

I think there are two possible ways to extend the specification to address the above concerns (a guest-side sketch of the first follows this comment):

1. Handle defunct after errors

We could define that once a bucket handle experiences a critical I/O error, all further operations on that handle must return an error. That is, if a store fails after set, it would no longer provide a consistent view for subsequent get operations. This does not violate the "read your writes" guarantee since the handle is considered defunct.

2. Best-effort guarantee tied to success conditions

The specification could define that "read your writes" holds as long as the store does not fail irrecoverably between operations. A get operation should return Err(error::other("I/O failure")) to reflect the error condition from the store.
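As a rough illustration of how a guest might cope under option 1 (again assuming hypothetical wit-bindgen-style Rust bindings), the natural response to a defunct handle is to drop it, reopen the bucket, and retry:

use wasi::keyvalue::store::{self, Bucket, Error};

// If a handle has gone defunct after a critical I/O error, every further
// call on it fails, so replace it with a fresh connection and retry once.
fn get_with_reopen(
    bucket: &mut Bucket,
    id: &str,
    key: &str,
) -> Result<Option<Vec<u8>>, Error> {
    match bucket.get(key) {
        Ok(value) => Ok(value),
        Err(_) => {
            *bucket = store::open(id)?; // old handle is dropped here
            bucket.get(key)
        }
    }
}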

Member @lukewagner, Jun 20, 2025

@Mossaka Based on the previous discussion above, I think there are performance reasons not to require "read your writes" (even when reads follow writes on the same bucket handle). In particular, if the implementation of write sends the written values out over the network to a primary/writer node, and the implementation of read sends a request over the network to a read replica (distinct from the primary writer node), then you won't have "read your writes" without maintaining extra cached copies or making extra network requests. Thus, I think even when there is not an irrecoverable error, we shouldn't say that "read your writes" holds.

/// stored. A valid implementation might rely on an in-memory hash table, the contents of which are
Collaborator
For in-memory stores, we probably want to emphasize that the data might be lost due to a store crash, and the best-effort guarantee described in my comment above should apply to our specification, stating that the "read your writes" consistency contract only applies to a store operating under normal conditions.

/// lost when the process exits. Alternatively, another implementation might synchronously persist
/// all writes to disk -- or even to a quorum of disk-backed nodes at multiple locations -- before
/// returning a result for a `set` call. Finally, a third implementation might persist values
/// asynchronously on a best-effort basis without blocking `set` calls, in which case an I/O error
/// could occur after the component instance which originally made the call has exited.
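As one hedged illustration of that third strategy, a host might acknowledge `set` as soon as an in-memory copy is updated and persist in the background. This is a minimal host-side Rust sketch, not a prescribed implementation; `persist` stands in for the real disk or replica write, and any error it hits can only be logged, possibly after the calling component instance has exited.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Default, Clone)]
struct WriteBehindStore {
    // The in-memory copy serves "read your writes" for this handle.
    memory: Arc<Mutex<HashMap<String, Vec<u8>>>>,
}

impl WriteBehindStore {
    // Returns as soon as the in-memory copy is updated, without waiting
    // for durable persistence.
    fn set(&self, key: String, value: Vec<u8>) {
        self.memory.lock().unwrap().insert(key.clone(), value.clone());
        thread::spawn(move || {
            if let Err(e) = persist(&key, &value) {
                // Too late to report to the caller; the write may be lost.
                eprintln!("background persist failed for {key}: {e}");
            }
        });
    }
}

// Placeholder for an actual disk or replica write.
fn persist(_key: &str, _value: &[u8]) -> Result<(), std::io::Error> {
    Ok(())
}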
///
/// Future versions of the `wasi-keyvalue` package may provide ways to query and control the
Member

Suggested change:
- /// Future versions of the `wasi-keyvalue` package may provide ways to query and control the
+ /// Future versions of `wasi:keyvalue` may provide ways to query and control the

/// durability and consistency provided by the backing implementation.
interface store {
/// The set of errors which may be raised by functions in this package
variant error {
@@ -67,7 +112,14 @@ interface store {
/// 6. Memcached calls a collection of key-value pairs a slab
/// 7. Azure Cosmos DB calls a collection of key-value pairs a container
///
/// In this interface, we use the term `bucket` to refer to a collection of key-value pairs
Collaborator
I found the wording "connection to a collection of key-value pairs" instead of "a collection of key-value pairs" to be a bit strange - it now implies a networked view instead of a logical container. What does this say to a downstream implementation that does not involve networking, e.g. a filesystem implementation?

Author

I used that wording to emphasize the fact that you can have two bucket resource handles pointing to the same key-value collection but connected to different replicas in an eventually consistent distributed system, in which case they'll see that collection from different points of view such that values may arrive in different orders, etc. In other words, I'm trying to emphasize that each handle represents a potentially unique view of the collection which is not necessarily consistent with another view, despite being opened with the same name.

It might help to use two different terms for these concepts, e.g. "bucket" could refer to the collection while "bucket-view" refers to a specific view of the collection, similar to the distinction between a value and a pointer to a value in a programming language.

In the interest of minimizing further changes to this PR, though, would it help to change "connection to a collection of key-value pairs" to "view of a collection of key-value pairs" (and likewise replace "connection" with "view" anywhere else it appears)?

Collaborator
Thanks for clarifying. I am okay to merge this PR as is because we can always update the spec if other people find this confusing.

/// In this interface, we use the term `bucket` to refer to a connection to a collection of
/// key-value pairs.
///
/// Note that opening two `bucket` resources using the same identifier MAY result in connections
/// to two separate replicas in a distributed database, and that writes to one of those
/// resources are not guaranteed to be readable from the other resource promptly (or ever, in
/// the case of a replica failure). See the `Consistency` section of the `store` interface
/// documentation for details.
resource bucket {
/// Get the value associated with the specified `key`
///