-
-
Notifications
You must be signed in to change notification settings - Fork 159
[RFC 0062] Content-addressed paths #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 20 commits
6d001f3
435fc42
7b26144
81099b2
4277386
7af7d2c
5fec861
9edc11f
5717351
1a844cc
26ae77e
bbdca7e
63f3eca
a6d2f38
140e093
288dcb4
60e7da3
1115a0d
13938de
3a25f7f
3a18867
fa16e86
94b65bd
7ed4481
fb4c61d
841fe3f
27bd048
1e8fab7
9772625
02ae2b5
2d74fed
168a149
427abed
f275669
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,310 @@ | ||||||
| --- | ||||||
| feature: Simple content-adressed store paths | ||||||
| start-date: 2019-08-14 | ||||||
| author: Théophane Hufschmitt | ||||||
| co-authors: (find a buddy later to help our with the RFC) | ||||||
| shepherd-team: @layus, @edolstra and @Ericson2314 | ||||||
| shepherd-leader: (name to be appointed by RFC steering committee) | ||||||
| related-issues: (will contain links to implementation PRs) | ||||||
| --- | ||||||
|
|
||||||
| # Summary | ||||||
|
|
||||||
| [summary]: #summary | ||||||
|
|
||||||
| Add some basic but simple support for content-adressed store paths to Nix. | ||||||
edolstra marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
|
|
||||||
| We plan here to give the possibility to mark certain store paths as | ||||||
| content-adressed (ca), while keeping the other input-adressed as they are | ||||||
| now (modulo some mandatory drv rewriting before the build, see below) | ||||||
edolstra marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
|
|
||||||
| By making this opt-in, we can impose arbitrary limitations to the paths that | ||||||
| are allowed to be ca to avoid some tricky issues that can arise with | ||||||
| content-adressability. | ||||||
|
|
||||||
| In particular, we restrict ourselves to paths that are: | ||||||
|
|
||||||
| - without any non-textual self-reference (_i.e_ a self-reference hidden inside a zip file) | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| - known to be deterministic (for caching reasons, see [caching]). | ||||||
|
|
||||||
| That way we don't have to worry about the fact that hash-rewriting is only an | ||||||
| approximation nor by the semantics of the distribution of non-deterministic | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| paths. | ||||||
|
|
||||||
| We also leave the option to lift these restrictions later. | ||||||
|
|
||||||
| This RFC already has a (somewhat working) POC at | ||||||
| <https://github.com/NixOS/nix/pull/3262>. | ||||||
|
|
||||||
| # Motivation | ||||||
|
|
||||||
| [motivation]: #motivation | ||||||
|
|
||||||
| Having a content-adressed store with Nix (aka the "Intensional store") is a | ||||||
| long-time dream of the community − a design for that was already taking a whole | ||||||
| chapter in [Eelco's PHD thesis][nixphd]. | ||||||
|
|
||||||
| This was never done because it represents a quite big change in Nix's model, | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| with some non-totally-solved implications (regarding the trust model in | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| particular). | ||||||
| Even without going all the way down to a fully intensional model, we can | ||||||
| make specific paths content-adressed, which can give some important benefits of | ||||||
| the intensional store at a much lower price. In particular, setting some | ||||||
| critical derivations as content-adressed can lead to some substancial build | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| cutoffs. | ||||||
|
|
||||||
| # Detailed design | ||||||
|
|
||||||
| [design]: #detailed-design | ||||||
|
|
||||||
| The gist of the design is that: | ||||||
|
|
||||||
| - Derivations can be marked as content-adressed (ca), in which case each | ||||||
| one of their output will be moved to content-addressed `ca` store path. | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| This extends the current notion of "fixed-output" derivations. | ||||||
| - We introduce the notion of "resolving" a derivation, which extends to | ||||||
| arbitrary `ca` derivations the current behavior of replacing fixed-outputs | ||||||
| derivations by their output hash. | ||||||
| - We refine the build process so that every derivation is first normalized | ||||||
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| before being realized | ||||||
|
|
||||||
| ## Nix-build process | ||||||
|
|
||||||
| For the sake of clarity, we will refer to the current model (where the | ||||||
| derivations are indexed by their inputs, also sometimes called "extensional") as | ||||||
| the `input-addressed` model | ||||||
|
|
||||||
| ### Output mappings | ||||||
|
|
||||||
| For each output `output` of a derivation `drv`, we define | ||||||
|
|
||||||
| - its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`. | ||||||
|
||||||
| - its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use and currently always set to `"world"`. | |
| - its output id **DrvOutputId(drv, output)** as the tuple `(hash(drv), output, truster)`, where `truster` is a reserved field for future use (see trust model in <link to future work section>) and currently always set to `"world"`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However: is it necessary to be concerned with the trust model at all at this stage? I’d rather just leave this completely out of scope and change the semantics later when we introduce the trust model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for that was to prevent another schema change. That being said we might omit this in the RFC as it's indeed not relevant for the current design
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s leave it out for now.
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - Existing store paths | |
| Existing store paths (including *built* content-addresed resolved derivations' output paths) |
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bigger problem than it might look like, as it means that trivial updates can break the CA marking for reasons not worth mentioning in the upstream changelog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely :)
Maybe that could be clearly stated, but the original scope of this work was to be able to mark very specific derivations that were clearly guaranteed to be deterministic, in which case the problem was less important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the question «why not just propagate CA» shows that writing more is a good idea.
I do think that stressing the limitations in a few key places is also a nice thing to do (people should be able to apply RFC as passed, not what was intended and not what was discussed, after all… we should not treat ourselves worse than we treat computers!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have functionality that allows to build a CA package twice with different apparent output paths, and optionally with different parallelism settings? The build of the package obviously fails if the CA unification doesn't lead to the same result.
Should we mandate that Hydra uses this functionality? Should it be on by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per https://github.com/NixOS/rfcs/pull/62/files#r357243841 I think we can deal with non-deterministic derivation just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, for binary cache transparency it is much better if you can build something locally, then regain connectivity and fetch stuff from a cache, then fetch stuff from a different cache, then build some more locally, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in my mind that's equally risky with and without content addressable derivations. The only difference is one lets you know if something goes wrong, and one doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Build nondeterminism doesn't introduce significant behaviour changes, so as long as the expectations are not broken (yeah, we install you into this output path and your dependencies into those paths, and that is not going to change), it will be mostly usable. There are a few CPU-dependent optimisations from time to time, they are annoying.
With CA things are actually moved around, so even though everything would still work when assembled together, the assembling part will be failing. It is Nix, not the code that is built by Nix, that would fail to do things because of nondeterminism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think trying to keep going despite non-determinism incoherence is a misfeature. You can always evict your own CA mappings (can keep the builds themselves for easy "rollback") and align with cache.reflex-frp.org and keep going.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if things are marked CA, of course it is a good idea to catch failures. But what you propose will not catch much, because a typical derivation is only built once (ever) by Hydra, later Hydra will use the binary cache. Also my proposal includes feeding different «apparent» output paths to the same build with the same dependencies, which has a better chance of discovering compressed self-references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with running --check and trying different termporary output paths. Catching non-determinism I don't think is important, because it's really clashes that we care about. However, catching self-references is important as we have to be able to move the thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, varying the output paths is something that doesn't follow from anything Nix does, so it has to be spelled explicitly.
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can skip this. If we track all the evaluation steps, we have all the information to ensure a binary cache isn't given anything that clashes with ourself. Maybe the first prototype will discover these errrors lazily, but it should discover them
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| might not be a need for the client to also support it (which would require the | |
| be a need for the client to also support it (which would require the |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cas hasn't been defined yet.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we might just store the "generating" mappings from almost-resolved ca input paths (all deps resolved) to output paths, as this will require far less space. OTOH it makes garbage collection tricker as now all mappings in the build closure are needed to recover a maximum-unresolved input path to map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH we can just do that with tracing GC. We just read the table backwards, saying each derivation in the codomain references everything in the domain that maps to it, and then look those up in turn.
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ca paths with Nix, leaving as much room as possible for future extensions. | |
| CA paths with Nix, leaving as much room as possible for future extensions. |
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
thufschmitt marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference the reserved truster field from here
Uh oh!
There was an error while loading. Please reload this page.