[Low Priority] React to ds/write-index-with-hashfile-api #389

derrickstolee · 2021-06-22T17:18:44Z

I submitted a change upstream to use the hashfile API when writing the index. This reduced code duplication and unblocks a v5 of the index using the chunk-format API. However, it conflicts with the GVFS_SKIP_SHA_ON_INDEX flag in the core.gvfs config option.

The solution is to modify our approach to allow the hashfile API to skip the hashing (making it be just a buffered stream, which is still helpful).

Without action, this will cause a conflict in the next rebase onto v2.33.0 since the topic was merged into master shortly after the v2.32.0 tag was created.

The flush() logic in csum-file.c was introduced originally by c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26) and a portion of the logic performs similar utility to write_in_full() in wrapper.c. The history of write_in_full() is full of moves and renames, but was originally introduced by 7230e6d (Add write_or_die(), a helper function, 2006-08-21). The point of these sections of code are to flush a write buffer using xwrite() and report errors in the case of disk space issues or other generic input/output errors. The logic in flush() can interpret the output of write_in_full() to provide the correct error messages to users. The logic in the hashfile API has an additional set of logic to augment the progress indicator between calls to xwrite(). This was introduced by 2a128d6 (add throughput display to git-push, 2007-10-30). It seems that since the hashfile's buffer is only 8KB, these additional progress indicators might not be incredibly necessary. Instead, update the progress only when write_in_full() complete. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The hashfile API uses a hard-coded buffer size of 8KB and has ever since it was introduced in c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). It performs a similar function to the hashing buffers in read-cache.c, but that code was updated from 8KB to 128KB in f279894 (read-cache: make the index write buffer size 128K, 2021-02-18). The justification there was that do_write_index() improves from 1.02s to 0.72s. Since our end goal is to have the index writing code use the hashfile API, we need to unify this buffer size to avoid a performance regression. There is a buffer, 'check_buffer', that is used to verify the check_fd file descriptor. When this buffer increases to 128K to fit the data being flushed, it causes the stack to overflow the limits placed in the test suite. To avoid issues with stack size, move both 'buffer' and 'check_buffer' to be heap pointers within 'struct hashfile'. The 'check_buffer' member is left as NULL unless check_fd is set in hashfd_check(). Both buffers are cleared as part of finalize_hashfile() which also frees the full structure. Since these buffers are now on the heap, we can adjust their size based on the needs of the consumer. In particular, callers to hashfd_throughput() are expecting to report progress indicators as the buffer flushes. These callers would prefer the smaller 8k buffer to avoid large delays between updates, especially for users with slower networks. When the progress indicator is not used, the larger buffer is preferrable. By adding a new trace2 region in the chunk-format API, we can see that the writing portion of 'git multi-pack-index write' lowers from ~1.49s to ~1.47s on a Linux machine. These effects may be more pronounced or diminished on other filesystems. The end-to-end timing is too noisy to have a definitive change either way. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The do_write_index() method in read-cache.c has its own hashing logic and buffering mechanism. Specifically, the ce_write() method was introduced by 4990aad (Speed up index file writing by chunking it nicely, 2005-04-20) and similar mechanisms were introduced a few months later in c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). Based on the timing, in the early days of the Git codebase, I figured that these roughly equivalent code paths were never unified only because it got lost in the shuffle. The hashfile API has since been used extensively in other file formats, such as pack-indexes, multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to unify the index writing code to use the same mechanism. I discovered this disparity while trying to create a new index format that uses the chunk-format API. That API uses a hashfile as its base, so it is incompatible with the custom code in read-cache.c. This rewrite is rather straightforward. It replaces all writes to the temporary file with writes to the hashfile struct. This takes care of many of the direct interactions with the_hash_algo. There are still some git_hash_ctx uses remaining: the extension headers are hashed for use in the End of Index Entries (EOIE) extension. This use of the git_hash_ctx is left as-is. There are multiple reasons to not use a hashfile here, including the fact that the data is not actually writing to a file, just a hash computation. These hashes do not block our adoption of the chunk-format API in a future change to the index, so leave it as-is. The internals of the algorithms are mostly identical. Previously, the hashfile API used a smaller 8KB buffer instead of the 128KB buffer from read-cache.c. The previous change already unified these sizes. There is one subtle point: we do not pass the CSUM_FSYNC to the finalize_hashfile() method, which differs from most consumers of the hashfile API. The extra fsync() call indicated by this flag causes a significant peformance degradation that is noticeable for quick commands that write the index, such as "git add". Other consumers can absorb this cost with their more complicated data structure organization, and further writing structures such as pack-files and commit-graphs is rarely in the critical path for common user interactions. Some static methods become orphaned in this diff, so I marked them as MAYBE_UNUSED. The diff is much harder to read if they are deleted during this change. Instead, they will be deleted in the following change. In addition to the test suite passing, I computed indexes using the previous binaries and the binaries compiled after this change, and found the index data to be exactly equal. Finally, I did extensive performance testing of "git update-index --force-write" on repos of various sizes, including one with over 2 million paths at HEAD. These tests demonstrated less than 1% difference in behavior. As expected, the performance should be considered unchanged. The previous changes to increase the hashfile buffer size from 8K to 128K ensured this change would not create a peformance regression. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

These methods were marked as MAYBE_UNUSED in the previous change to avoid a complicated diff. Delete them entirely, since we now use the hashfile API instead of this custom hashing code. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

This topic recently reached 'master' upstream, and makes the index code use the hashfile API in a more uniform way. This is critical to the success of index v5 in the future, but for now it causes a small problem: the GVFS_SKIP_SHA_ON_INDEX flag no longer works! This was critical to early days of VFS for Git, where the GVFS.Mount process would intervene and modify the index in-place. However, the current use is just a performance change: it prevents computing the hash in some cases.

This version requires updating the hashfile struct to allow skipping the hashing. Without this change, we would suffer a performance degradation in VFS for Git.

dscho

Whoops. I did a review as if this had not hit upstream Git yet, but I just noticed (after commenting) that Junio committed those changes...

chunk-format.c

csum-file.c

csum-file.h

derrickstolee · 2021-08-03T16:06:44Z

I am pulling this change into the v2.33.0 rebase, and 059ab53 will get squashed in.

derrickstolee added 6 commits May 18, 2021 06:32

fixup! gvfs: add the feature to skip writing the index' SHA-1

059ab53

This version requires updating the hashfile struct to allow skipping the hashing. Without this change, we would suffer a performance degradation in VFS for Git.

derrickstolee self-assigned this Jun 22, 2021

dscho approved these changes Jun 23, 2021

View reviewed changes

chunk-format.c Show resolved Hide resolved

csum-file.c Show resolved Hide resolved

csum-file.c Show resolved Hide resolved

csum-file.c Show resolved Hide resolved

csum-file.h Show resolved Hide resolved

derrickstolee closed this Aug 3, 2021

derrickstolee mentioned this pull request Aug 3, 2021

[DO NOT MERGE] Tentative vfs-2.33.0 branch #405

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Low Priority] React to ds/write-index-with-hashfile-api #389

[Low Priority] React to ds/write-index-with-hashfile-api #389

Uh oh!

derrickstolee commented Jun 22, 2021

Uh oh!

dscho left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

derrickstolee commented Aug 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Low Priority] React to ds/write-index-with-hashfile-api #389

[Low Priority] React to ds/write-index-with-hashfile-api #389

Uh oh!

Conversation

derrickstolee commented Jun 22, 2021

Uh oh!

dscho left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

derrickstolee commented Aug 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants