From: Taylor Blau <me@ttaylorr•com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org, gitster@pobox•com,
johannes.schindelin@gmx•de, peff@peff•net, ps@pks•im,
johncai86@gmail•com, newren@gmail•com, jonathantanmy@google•com,
karthik nayak <karthik.188@gmail•com>,
Derrick Stolee <stolee@gmail•com>
Subject: Re: [PATCH v3 8/8] pack-objects: add third name hash version
Date: Wed, 22 Jan 2025 17:37:07 -0500 [thread overview]
Message-ID: <Z5FzE1XpBlEyhK2T@nand.local> (raw)
In-Reply-To: <3d63954f318e5133630b1f579a399a123e434cf8.1734715194.git.gitgitgadget@gmail.com>
On Fri, Dec 20, 2024 at 05:19:54PM +0000, Derrick Stolee via GitGitGadget wrote:
> Create a third name hash function and extend the '--name-hash-version'
> option in 'git pack-objects' and 'git repack' to understand it. This
> hash version abandons all efforts for locality and focuses on creating a
> somewhat uniformly-distributed hash function to minimize collisions.
>
> We can observe the effect of this collision avoidance in a large
> internal monorepo that suffered from collisions in the previous
> versions. The updates to p5314-name-hash.sh show these results:
>
> Test this tree
> --------------------------------------------------
> 5314.1: paths at head 227.3K
> 5314.2: distinct hash value: v1 72.3K
> 5314.3: maximum multiplicity: v1 14.4K
> 5314.4: distinct hash value: v2 166.5K
> 5314.5: maximum multiplicity: v2 138
> 5314.6: distinct hash value: v3 227.3K
> 5314.7: maximum multiplicity: v3 2
>
> These results demonstrate that of the 227,000+ paths, nearly all of them
> find distinct hash values. The maximum multiplicity is 2, improved from
> 138 in the v2 hash function. The v2 hash function also had only 166K
> distinct values, so it had a wide spread of collisions.
I had a little trouble reading this section of the commit message. I
think the framing makes sense (v2 has collisions which can impact pack
generation time and/or size), but this section explains v3 I think one
level too deep.
This comparison (and the one below it for v3) shows a reduction in
distinct hash values and the maximum multiplicity (I'm assuming for
colliding hash values, in which case I might suggest renaming it as
"maximum collisions").
But I imagine that many readers will primarily care about the effect of
the new hash function on pack generation time and size. You show that
below, but I think that it should potentially appear earlier in the
commit message.
Alternatively, you could consider leaving the time/size table alone
where it is, and devote an extra sentence or two to explaining the
impact on repacking time/size that the two metrics above (distinct hash
values, multiplicity/collisions) have on the repacking time/size.
> A more modest improvement is available in the open source fluentui repo
> [1] with these results:
>
> Test this tree
> --------------------------------------------------
> 5314.1: paths at head 19.5K
> 5314.2: distinct hash value: v1 8.2K
> 5314.3: maximum multiplicity: v1 279
> 5314.4: distinct hash value: v2 17.8K
> 5314.5: maximum multiplicity: v2 44
> 5314.6: distinct hash value: v3 19.5K
> 5314.7: maximum multiplicity: v3 1
>
> [1] https://github.com/microsoft/fluentui
>
> However, it is important to demonstrate the effectiveness of this
> function in the context of compressing a repository. We can use
> p5313-pack-objects.sh to measure these changes. I will use a simplified
> table summarizing the output of that performance test.
>
> | Test | V1 Time | V2 Time | V3 Time | V1 Size | V2 Size | V3 Size |
> |-----------|---------|---------|---------|---------|---------|---------|
> | Thin Pack | 0.37 s | 0.12 s | 0.07 s | 1.2 M | 22.0 K | 20.4 K |
> | Big Pack | 2.04 s | 2.80 s | 1.40 s | 20.4 M | 25.9 M | 19.2 M |
> | Shallow | 1.41 s | 1.77 s | 1.27 s | 34.4 M | 33.7 M | 34.8 M |
> | Repack | 95.70 s | 33.68 s | 20.88 s | 439.3 M | 160.5 M | 169.1 M |
OK, now we get to the chart that I demonstrates the effects of each hash
function on the most externally visible effects. Are these measurements
taken from the fluentui repo, or somewhere else? In either case, it
may be worth mentioning.
> Here, there are some performance improvements on a time basis, and the
> thin and big packs are somewhat smaller in v3. The shallow and repacked
> packs are somewhat bigger, though, compared to v2.
>
> Two repositories that have very few collisions in the v1 name hash are
> the Git and Linux repositories. Here are their stats for p5313:
>
> Git:
>
> | Test | V1 Time | V2 Time | V3 Time | V1 Size | V2 Size | V3 Size |
> |-----------|---------|---------|---------|---------|---------|---------|
> | Thin Pack | 0.02 s | 0.02 s | 0.02 s | 1.1 K | 1.1 K | 15.3 K |
> | Big Pack | 1.69 s | 1.95 s | 1.67 s | 13.5 M | 14.5 M | 14.9 M |
> | Shallow | 1.26 s | 1.29 s | 1.16 s | 12.0 M | 12.2 M | 12.5 M |
> | Repack | 29.51 s | 29.01 s | 29.08 s | 237.7 M | 238.2 M | 237.7 M |
>
> Linux:
>
> | Test | V1 Time | V2 Time | V3 Time | V1 Size | V2 Size | V3 Size |
> |-----------|----------|----------|----------|---------|---------|---------|
> | Thin Pack | 0.17 s | 0.07 s | 0.07 s | 4.6 K | 4.6 K | 6.8 K |
> | Big Pack | 17.88 s | 12.35 s | 12.14 s | 201.1 M | 149.1 M | 160.4 M |
> | Shallow | 11.05 s | 22.94 s | 22.16 s | 269.2 M | 273.8 M | 271.8 M |
> | Repack | 727.39 s | 566.95 s | 539.33 s | 2.5 G | 2.5 G | 2.6 G |
>
> These repositories make good use of the cross-path deltas that come
> about from the v1 name hash function, so they already had mixed results
> with the v2 function. The v3 function is generally worse for these
> repositories.
I appreciate you sharing some counterexamples as well.
> While the fluentui repo had an increase in size using the v3 name hash,
> the others had modest improvements over the v2 name hash. But those
> modest improvements are dwarfed by the difference from v1 to v2, so it
> is unlikely that the regression seen in the other scenarios (packfiles
> that are not from full repacks) will be worth using v3 over v2. That is,
> unless there are enough collisions even with v2 that the full repack
> scenario has larger improvements than these.
This is the paragraph that I thought most about (both while reading the
above sections, and then again after seeing my internal thoughts written
down here).
It seems like the general conclusion is that v2 is a strict improvement
on v1 in almost all cases. v3 appears to be an improvement on v2 in some
cases, and a regression (as you note) in others. But I think more
importantly (again as you note) is that the improvement from v1 to v2 is
so pronounced that it's unlikely that the regression from v2 to v3 will
matter or even be noticeable in most cases.
Are there easy ways to detect when v3 would be an improvement over v2?
If so, then I think exposing those detection mechanisms to users (either
as an automated tool or through documentation, perhaps in
git-packing(7), which is perfect for this sort of discussion) would be
worthwhile. Then users could make an informed decision about which hash
function to use for their repositories.
But if there isn't such a mechanism, then I wonder what would drive a
user to choose v3 over v2. I suspect the answer is that curious users
would try repacking both ways, and then stick with whichever one has a
bigger impact on the metric(s) they care most about.
If that's the case, I suspect that v2 will be the dominant choice,
especially if we consider changing the default from 1 to 2 at some point
in the future. Given all of that, I share your feeling that it may be
worth dropping this patch entirely. It is true that some cases will be
worse off (at least compared to v2) without this part of the series. But
it gets us out of having to support v3 forever, or go through the
process of deprecating it. I'd like the project to avoid both of those
if possible, especially if we don't anticipate many users will select v3
over v2.
Thanks,
Taylor
next prev parent reply other threads:[~2025-01-22 22:37 UTC|newest]
Thread overview: 93+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-05 3:05 [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee via GitGitGadget
2024-11-05 3:05 ` [PATCH 1/7] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-11-21 20:08 ` Taylor Blau
2024-11-21 21:35 ` Taylor Blau
2024-11-21 23:32 ` Junio C Hamano
2024-11-22 11:46 ` Derrick Stolee
2024-11-22 11:59 ` Derrick Stolee
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 2/7] repack: " Derrick Stolee via GitGitGadget
2024-11-21 20:12 ` Taylor Blau
2024-11-22 12:07 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 3/7] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-11-21 20:15 ` Taylor Blau
2024-11-22 12:09 ` Derrick Stolee
2024-11-22 1:13 ` Jonathan Tan
2024-11-22 3:23 ` Junio C Hamano
2024-11-22 18:01 ` Jonathan Tan
2024-11-25 0:39 ` Junio C Hamano
2024-11-25 19:45 ` Jonathan Tan
2024-11-26 1:29 ` Junio C Hamano
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 4/7] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-11-21 20:17 ` Taylor Blau
2024-11-22 15:26 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-11-21 20:31 ` Taylor Blau
2024-11-22 15:26 ` Derrick Stolee
2024-11-26 8:26 ` Patrick Steinhardt
2024-11-05 3:05 ` [PATCH 6/7] pack-objects: disable --full-name-hash when shallow Derrick Stolee via GitGitGadget
2024-11-21 20:33 ` Taylor Blau
2024-11-22 15:27 ` Derrick Stolee
2024-11-05 3:05 ` [PATCH 7/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-11-21 20:42 ` Taylor Blau
2024-11-22 1:23 ` Jonathan Tan
2024-11-21 23:50 ` [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Jonathan Tan
2024-11-22 3:01 ` Junio C Hamano
2024-11-22 4:22 ` Junio C Hamano
2024-11-22 15:27 ` Derrick Stolee
2024-11-24 23:57 ` Junio C Hamano
2024-11-22 18:05 ` Jonathan Tan
2024-12-02 23:21 ` [PATCH v2 0/8] " Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2024-12-04 20:06 ` karthik nayak
2024-12-04 21:05 ` Junio C Hamano
2024-12-05 9:46 ` karthik nayak
2024-12-09 23:15 ` Jonathan Tan
2024-12-10 0:01 ` Junio C Hamano
2024-12-02 23:21 ` [PATCH v2 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2024-12-04 20:53 ` karthik nayak
2024-12-02 23:21 ` [PATCH v2 3/8] repack: " Derrick Stolee via GitGitGadget
2024-12-04 21:15 ` karthik nayak
2024-12-02 23:21 ` [PATCH v2 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2024-12-04 21:21 ` karthik nayak
2024-12-09 23:12 ` Jonathan Tan
2024-12-20 17:03 ` Derrick Stolee
2024-12-02 23:21 ` [PATCH v2 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2024-12-02 23:21 ` [PATCH v2 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2024-12-03 3:23 ` [PATCH v2 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Junio C Hamano
2024-12-04 4:56 ` Derrick Stolee
2024-12-04 5:02 ` Junio C Hamano
2024-12-20 17:19 ` [PATCH v3 " Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-22 22:08 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-22 22:17 ` Taylor Blau
2025-01-24 17:29 ` Derrick Stolee
2024-12-20 17:19 ` [PATCH v3 3/8] repack: " Derrick Stolee via GitGitGadget
2025-01-22 22:18 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-22 22:20 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-20 17:19 ` [PATCH v3 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-22 22:22 ` Taylor Blau
2024-12-20 17:19 ` [PATCH v3 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2025-01-22 22:37 ` Taylor Blau [this message]
2025-01-24 17:34 ` Derrick Stolee
2025-01-21 20:21 ` [PATCH v3 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee
2025-01-22 23:28 ` Taylor Blau
2025-01-24 17:45 ` Derrick Stolee
2025-01-27 19:02 ` [PATCH v4 0/7] " Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 1/7] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 2/7] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-27 21:18 ` Junio C Hamano
2025-01-29 13:38 ` Derrick Stolee
2025-01-27 19:02 ` [PATCH v4 3/7] repack: " Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 4/7] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 6/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2025-01-27 19:02 ` [PATCH v4 7/7] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-31 21:39 ` [PATCH v4 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z5FzE1XpBlEyhK2T@nand.local \
--to=me@ttaylorr$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitgitgadget@gmail$(echo .)com \
--cc=gitster@pobox$(echo .)com \
--cc=johannes.schindelin@gmx$(echo .)de \
--cc=johncai86@gmail$(echo .)com \
--cc=jonathantanmy@google$(echo .)com \
--cc=karthik.188@gmail$(echo .)com \
--cc=newren@gmail$(echo .)com \
--cc=peff@peff$(echo .)net \
--cc=ps@pks$(echo .)im \
--cc=stolee@gmail$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox