REv3: Reduce asymmetry between O(n) output files and O(1) output directories #281

@EdSchouten

I've noticed that people sometimes craft rules (e.g., packaging rules) that have a very large number of output files. This is problematic because it makes the ActionResult very large, sometimes so large that it can't be sent back to the client. ActionResults are not stored in the CAS, so clients can't download them in a streaming manner. Instead, they are returned as part of the gRPC response, which can generally only be up to 4 MB in size. To prevent Buildbarn from generating ActionResults that are this big, I generally tell my users to configure their clusters to only allow processing of Command messages up to 1-2 MB. As the size of an ActionResult is generally proportional to that of its Command (1-2x as big), that tends to work.

A solution I often give to my users when they hit these limits is that they should use output directories instead of plain output files. In that case ActionResult remains small. It will only contain a small number of output_directories containing references to Tree objects. These can be streamed from the CAS. There are also a couple of advantages on top of that:

  • If all paths share long pathname prefixes, Tree objects can become more compact than listing all pathnames explicitly, giving a net reduction in network traffic.
  • If a repeated invocation of an action yields the same output files, you end up with two large and mostly identical ActionResults. When using output directories, both ActionResults will share the same Tree object. This means that if a client is somewhat smart about caching results, incremental builds consume less traffic.
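To make the first point concrete, here is a rough sketch that compares the bytes spent on pathnames in a flat listing against a nested hierarchy in which each path component is stored once per directory. It only counts name bytes. Digests, protobuf framing, and the actual Tree encoding are deliberately ignored, and the example paths and helper names are made up for illustration.

```python
def flat_name_bytes(paths):
    """Bytes needed to spell out every full pathname, as in a flat file list."""
    return sum(len(p.encode("utf-8")) for p in paths)


def build_hierarchy(paths):
    """Turn slash-separated paths into nested dicts keyed by path component."""
    root = {}
    for p in paths:
        node = root
        for part in p.split("/"):
            node = node.setdefault(part, {})
    return root


def hierarchy_name_bytes(node):
    """Bytes needed when each component is stored once in its parent directory."""
    return sum(
        len(name.encode("utf-8")) + hierarchy_name_bytes(child)
        for name, child in node.items()
    )


# Hypothetical outputs of a packaging-style action: 1000 files under one
# long shared prefix.
paths = [f"bazel-out/k8-opt/bin/pkg/archive/file_{i:04d}.o" for i in range(1000)]

flat = flat_name_bytes(paths)
nested = hierarchy_name_bytes(build_hierarchy(paths))
print(f"flat listing: {flat} name bytes, hierarchy: {nested} name bytes")
```

The longer the shared prefix and the more files sit underneath it, the larger the gap becomes. With little prefix sharing, the two numbers converge.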

Though I fully understand where the asymmetry comes from, I do think it's hard to sell to our users. Why is there a difference? From their perspective it's 'tomato tomato'.

I think that as part of REv3 we should investigate whether we can reduce the noticeable differences between using O(n) output files and O(1) output directories. For example, what if every action returns exactly 1 directory hierarchy of outputs, and Command's output_paths merely acts as a filter for what needs to be captured as part of that directory hierarchy?
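As a thought experiment only, the "output_paths as a filter" idea could look roughly like the sketch below. It assumes the worker knows every file the action produced (here a plain dict from path to digest string) and keeps only the entries that equal or live under one of the requested output paths, returning a single nested hierarchy. The function names and the dict-based representation are hypothetical and do not correspond to anything in the current protocol.

```python
from pathlib import PurePosixPath


def is_captured(path, output_paths):
    """True if path equals one of output_paths or lies underneath one."""
    p = PurePosixPath(path)
    for out in output_paths:
        o = PurePosixPath(out)
        if p == o or o in p.parents:
            return True
    return False


def filter_outputs(produced, output_paths):
    """Build one nested hierarchy holding only the captured files.

    produced maps a relative file path to its digest (a string here).
    Directories become nested dicts and files become leaf digests.
    """
    root = {}
    for path, digest in sorted(produced.items()):
        if not is_captured(path, output_paths):
            continue
        parts = PurePosixPath(path).parts
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = digest
    return root


# Hypothetical action: it wrote four files, but only "out" and "pkg.tar"
# were requested.
produced = {
    "out/a.o": "digest-1",
    "out/lib/b.o": "digest-2",
    "tmp/scratch": "digest-3",
    "pkg.tar": "digest-4",
}
hierarchy = filter_outputs(produced, ["out", "pkg.tar"])
print(hierarchy)
```

In this model a single output file and a directory with thousands of files take the same shape in the result, so the response size no longer depends on which of the two the rule author picked.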

(Relatedly, is there anything we can do to reduce the size of Command's output_paths by preventing repetition of leading pathnames?)
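One generic way to avoid repeating leading pathnames is front coding: sort the paths and store, for each one, how many leading characters it shares with its predecessor plus the remaining suffix. The sketch below is only meant to show the idea. It is not something the protocol does today, and the function names are made up.

```python
import os


def front_encode(paths):
    """Encode sorted paths as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for p in sorted(paths):
        # commonprefix compares character by character, which is what we want.
        shared = len(os.path.commonprefix([prev, p]))
        encoded.append((shared, p[shared:]))
        prev = p
    return encoded


def front_decode(encoded):
    """Rebuild the sorted path list from (shared_prefix_length, suffix) pairs."""
    paths = []
    prev = ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths


# Hypothetical output_paths with a long common prefix.
output_paths = [
    "bazel-out/k8-opt/bin/pkg/archive/b.o",
    "bazel-out/k8-opt/bin/pkg/archive/a.o",
    "bazel-out/k8-opt/bin/pkg/manifest.txt",
]
encoded = front_encode(output_paths)
print(encoded)
```

A directory-shaped message would achieve much the same effect structurally, at the cost of a larger change to Command.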
