Support shared backing store #99

@jcharum

Description

Bigslice workers currently store their task outputs locally. Other workers then read those outputs as needed over direct machine-to-machine connections.

When machines are especially flaky, e.g. under high spot-market contention in EC2, progress on a computation can grind to a halt, as machines are lost frequently enough that a large portion of time is spent recomputing lost results.

Workers could instead write to a more durable shared backing store. If workers are lost, their results would remain available. This would allow computations to always make forward progress at the cost of extra (read: slow) data transfer.

There is already a nod to this in the code; there's work to be done to plumb it through.
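As a rough sketch of the abstraction this plumbing might converge on (names here are illustrative, not Bigslice's actual internals): local disk and a durable shared store would implement the same interface, so task writes and reads don't care which backend they hit.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskStore is a hypothetical abstraction over where task outputs live;
// it is not Bigslice's actual API. Local disk and a durable shared
// backing store (e.g. S3 or FSx for Lustre) would both implement it.
type TaskStore interface {
	Put(taskID string, data []byte) error
	Get(taskID string) ([]byte, error)
}

// memStore is a toy in-memory implementation standing in for local disk.
type memStore struct {
	mu sync.Mutex
	m  map[string][]byte
}

func newMemStore() *memStore { return &memStore{m: make(map[string][]byte)} }

func (s *memStore) Put(taskID string, data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[taskID] = data
	return nil
}

func (s *memStore) Get(taskID string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, ok := s.m[taskID]
	if !ok {
		return nil, fmt.Errorf("task %s: output not found", taskID)
	}
	return data, nil
}

func main() {
	var store TaskStore = newMemStore()
	_ = store.Put("task-1", []byte("rows"))
	data, err := store.Get("task-1")
	fmt.Println(string(data), err)
}
```

A shared-store implementation of the same interface would trade latency for durability: outputs survive the machine that produced them.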

Amazon FSx for Lustre may be a good option, as it's basically designed for this sort of use case:

The open source Lustre file system is designed for applications that require fast storage – where you want your storage to keep up with your compute. Lustre was built to quickly and cost effectively process the fastest-growing data sets in the world, and it’s the most widely used file system for the 500 fastest computers in the world. It provides sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.

We could also implement something like asynchronous copy to a shared backing store: prefer worker-to-worker transfer first, but fall back to the shared backing store if the source machine is no longer available.
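The fallback read path is simple to sketch. Assuming a hypothetical store interface (again, not Bigslice's real API), a fetch tries the fast worker-to-worker path and only hits the slower durable store when the worker is gone:

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical read interface; names are illustrative.
type store interface {
	Get(taskID string) ([]byte, error)
}

// mapStore is a toy implementation for demonstration.
type mapStore map[string][]byte

func (s mapStore) Get(taskID string) ([]byte, error) {
	if data, ok := s[taskID]; ok {
		return data, nil
	}
	return nil, errors.New("not found")
}

// fetch prefers the fast worker-to-worker path; if the worker is
// unreachable or has lost the output, it falls back to the durable
// shared backing store, where the asynchronous copy may have landed.
func fetch(taskID string, worker, shared store) ([]byte, error) {
	if data, err := worker.Get(taskID); err == nil {
		return data, nil
	}
	return shared.Get(taskID)
}

func main() {
	worker := mapStore{}                         // worker lost: no local copy
	shared := mapStore{"task-1": []byte("rows")} // async copy already landed
	data, err := fetch("task-1", worker, shared)
	fmt.Println(string(data), err)
}
```

The interesting design question is what happens when the worker dies before the asynchronous copy completes; that window still forces recomputation, which is one thing benchmarking should quantify.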

It would be good to benchmark various approaches.


Labels: enhancement (New feature or request)
