public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Jonathan Tan <jonathantanmy@google•com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail•com>
Cc: Jonathan Tan <jonathantanmy@google•com>,
	git@vger•kernel.org, gitster@pobox•com,
	 johannes.schindelin@gmx•de, peff@peff•net, ps@pks•im,
	me@ttaylorr•com,  johncai86@gmail•com, newren@gmail•com,
	Derrick Stolee <stolee@gmail•com>
Subject: Re: [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated)
Date: Thu, 21 Nov 2024 15:50:14 -0800	[thread overview]
Message-ID: <20241121235014.2554033-1-jonathantanmy@google.com> (raw)
In-Reply-To: <pull.1823.git.1730775907.gitgitgadget@gmail.com>

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail•com> writes:
> This series introduces a new name-hash algorithm, but does not replace the
> existing one. There are cases, such as packing a single snapshot of a
> repository, where the existing algorithm outperforms the new one.

I came up with a hash function that both uses information from a lot
more of the path (not the full name, though) and preserves the sortable
property (diff at the end of this email). It also contains fixes to the
existing algorithm: not wasting the most significant bits of the hash
if files in the repo mostly end in a lowercase alphabetic character, and
the cast from a possibly-signed-possibly-unsigned char to a uint32_t.

The results look quite good. In summary, the pack sizes are comparable
to Stolee's results in the case of fluentui, and better than Stolee's
results in the case of git.

Here's one run on the fluentui repo (git clone https://
github.com/microsoft/fluentui; cd fluentui; git checkout
a637a06df05360ce5ff21420803f64608226a875^ following the instructions
in [1]:

(before my change)

Test                                               this tree         
---------------------------------------------------------------------
5313.2: thin pack                                  0.03(0.01+0.01)   
5313.3: thin pack size                                        1.1K   
5313.4: thin pack with --full-name-hash            0.03(0.00+0.02)   
5313.5: thin pack size with --full-name-hash                  3.0K   
5313.6: big pack                                   1.60(2.87+0.32)   
5313.7: big pack size                                        57.9M   
5313.8: big pack with --full-name-hash             1.41(1.94+0.37)   
5313.9: big pack size with --full-name-hash                  57.8M   
5313.10: shallow fetch pack                        1.69(2.70+0.22)   
5313.11: shallow pack size                                   33.0M   
5313.12: shallow pack with --full-name-hash        1.49(1.84+0.34)   
5313.13: shallow pack size with --full-name-hash             33.6M   
5313.14: repack                                    75.10(537.66+5.47)
5313.15: repack size                                        454.2M   
5313.16: repack with --full-name-hash              18.10(92.50+5.14) 
5313.17: repack size with --full-name-hash                  174.8M                                

(after my change)

Test                                               this tree         
---------------------------------------------------------------------
5313.2: thin pack                                  0.03(0.01+0.02)   
5313.3: thin pack size                                        1.1K   
5313.4: thin pack with --full-name-hash            0.03(0.01+0.02)   
5313.5: thin pack size with --full-name-hash                  1.1K   
5313.6: big pack                                   1.62(2.94+0.28)   
5313.7: big pack size                                        57.9M   
5313.8: big pack with --full-name-hash             1.35(2.07+0.37)   
5313.9: big pack size with --full-name-hash                  57.6M   
5313.10: shallow fetch pack                        1.63(2.52+0.29)   
5313.11: shallow pack size                                   33.0M   
5313.12: shallow pack with --full-name-hash        1.50(2.10+0.23)   
5313.13: shallow pack size with --full-name-hash             33.1M   
5313.14: repack                                    74.86(531.39+5.49)
5313.15: repack size                                        454.7M   
5313.16: repack with --full-name-hash              19.71(111.39+5.12)
5313.17: repack size with --full-name-hash                  165.6M  

The tests were run by:

  GENERATE_COMPILATION_DATABASE=yes make CC=clang && (cd t/perf && env GIT_PERF_LARGE_REPO=~/tmp/fluentui ./run -- p5313*.sh)

The similarity in sizes looked suspicious, so I replaced the contents
of the hash function with "return 0;" and indeed the sizes significantly
increased, so hopefully there is nothing wrong with my setup.

The git repo was called out in [1] as demonstrating "some of the issues
with this approach", but here are the results, run by:

  GENERATE_COMPILATION_DATABASE=yes make CC=clang && (cd t/perf && ./run -- p5313*.sh)

Test                                               this tree        
--------------------------------------------------------------------
5313.2: thin pack                                  0.03(0.00+0.02)  
5313.3: thin pack size                                        2.9K  
5313.4: thin pack with --full-name-hash            0.03(0.00+0.02)   
5313.5: thin pack size with --full-name-hash                  2.9K                                                                                                                                                  
5313.6: big pack                                   1.69(2.80+0.28)                                                                                                                                                  
5313.7: big pack size                                        18.7M  
5313.8: big pack with --full-name-hash             1.68(2.82+0.31)  
5313.9: big pack size with --full-name-hash                  18.8M  
5313.10: shallow fetch pack                        0.96(1.47+0.16)  
5313.11: shallow pack size                                   12.1M  
5313.12: shallow pack with --full-name-hash        1.01(1.51+0.14)  
5313.13: shallow pack size with --full-name-hash             12.1M  
5313.14: repack                                    17.05(69.99+4.33)
5313.15: repack size                                        116.5M  
5313.16: repack with --full-name-hash              15.74(67.03+4.18)
5313.17: repack size with --full-name-hash                  116.1M  

[1] https://lore.kernel.org/git/c14ef6879e451401381ebbdb8f30d33c8f56c25b.1730775908.git.gitgitgadget@gmail.com/

> | Repo     | Standard Repack | With --full-name-hash |
> |----------|-----------------|-----------------------|
> | fluentui |         438 MB  |               168 MB  |
> | Repo B   |       6,255 MB  |               829 MB  |
> | Repo C   |      37,737 MB  |             7,125 MB  |
> | Repo D   |     130,049 MB  |             6,190 MB  |
> | Repo E   |     100,957 MB  |            22,979 MB  |
> | Repo F   |       8,308 MB  |               746 MB  |
> | Repo G   |       4,329 MB  |             3,643 MB  |

If the results are similar for some of the above repos (I do not have
access to them), maybe it's worth considering using my hash function (or
a variation of it).

I'll also take a look at the rest of the patch set.

---
diff --git a/pack-objects.h b/pack-objects.h
index 88360aa3e8..c4f35eafa0 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -209,23 +209,24 @@ static inline uint32_t pack_name_hash(const char *name)
 
 static inline uint32_t pack_full_name_hash(const char *name)
 {
-       const uint32_t bigp = 1234572167U;
-       uint32_t c, hash = bigp;
+       uint32_t hash = 0, base = 0;
+       uint8_t c;
 
        if (!name)
                return 0;
 
-       /*
-        * Do the simplest thing that will resemble pseudo-randomness: add
-        * random multiples of a large prime number with a binary shift.
-        * The goal is not to be cryptographic, but to be generally
-        * uniformly distributed.
-        */
-       while ((c = *name++) != 0) {
-               hash += c * bigp;
-               hash = (hash >> 5) | (hash << 27);
+       while ((c = (uint8_t) *name++) != 0) {
+               if (isspace(c))
+                       continue;
+               if (c == '/') {
+                       base = (base >> 6) ^ hash;
+                       hash = 0;
+               } else {
+                       uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4);
+                       hash = (hash >> 2) + (nybble_swapped << 24);
+               }
        }
-       return hash;
+       return (base >> 6) ^ hash;
 }

  parent reply	other threads:[~2024-11-21 23:50 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-05  3:05 [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee via GitGitGadget
2024-11-05  3:05 ` [PATCH 1/7] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-11-21 20:08   ` Taylor Blau
2024-11-21 21:35     ` Taylor Blau
2024-11-21 23:32       ` Junio C Hamano
2024-11-22 11:46       ` Derrick Stolee
2024-11-22 11:59     ` Derrick Stolee
2024-11-26  8:26   ` Patrick Steinhardt
2024-11-05  3:05 ` [PATCH 2/7] repack: " Derrick Stolee via GitGitGadget
2024-11-21 20:12   ` Taylor Blau
2024-11-22 12:07     ` Derrick Stolee
2024-11-05  3:05 ` [PATCH 3/7] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-11-21 20:15   ` Taylor Blau
2024-11-22 12:09     ` Derrick Stolee
2024-11-22  1:13   ` Jonathan Tan
2024-11-22  3:23     ` Junio C Hamano
2024-11-22 18:01       ` Jonathan Tan
2024-11-25  0:39         ` Junio C Hamano
2024-11-25 19:45           ` Jonathan Tan
2024-11-26  1:29             ` Junio C Hamano
2024-11-26  8:26   ` Patrick Steinhardt
2024-11-05  3:05 ` [PATCH 4/7] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-11-21 20:17   ` Taylor Blau
2024-11-22 15:26     ` Derrick Stolee
2024-11-05  3:05 ` [PATCH 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-11-21 20:31   ` Taylor Blau
2024-11-22 15:26     ` Derrick Stolee
2024-11-26  8:26   ` Patrick Steinhardt
2024-11-05  3:05 ` [PATCH 6/7] pack-objects: disable --full-name-hash when shallow Derrick Stolee via GitGitGadget
2024-11-21 20:33   ` Taylor Blau
2024-11-22 15:27     ` Derrick Stolee
2024-11-05  3:05 ` [PATCH 7/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-11-21 20:42   ` Taylor Blau
2024-11-22  1:23   ` Jonathan Tan
2024-11-21 23:50 ` Jonathan Tan [this message]
2024-11-22  3:01   ` [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Junio C Hamano
2024-11-22  4:22     ` Junio C Hamano
2024-11-22 15:27     ` Derrick Stolee
2024-11-24 23:57       ` Junio C Hamano
2024-11-22 18:05     ` Jonathan Tan
2024-12-02 23:21 ` [PATCH v2 0/8] " Derrick Stolee via GitGitGadget
2024-12-02 23:21   ` [PATCH v2 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2024-12-04 20:06     ` karthik nayak
2024-12-04 21:05       ` Junio C Hamano
2024-12-05  9:46         ` karthik nayak
2024-12-09 23:15     ` Jonathan Tan
2024-12-10  0:01       ` Junio C Hamano
2024-12-02 23:21   ` [PATCH v2 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2024-12-04 20:53     ` karthik nayak
2024-12-02 23:21   ` [PATCH v2 3/8] repack: " Derrick Stolee via GitGitGadget
2024-12-04 21:15     ` karthik nayak
2024-12-02 23:21   ` [PATCH v2 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2024-12-04 21:21     ` karthik nayak
2024-12-09 23:12     ` Jonathan Tan
2024-12-20 17:03       ` Derrick Stolee
2024-12-02 23:21   ` [PATCH v2 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-02 23:21   ` [PATCH v2 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-02 23:21   ` [PATCH v2 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2024-12-02 23:21   ` [PATCH v2 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2024-12-03  3:23   ` [PATCH v2 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Junio C Hamano
2024-12-04  4:56     ` Derrick Stolee
2024-12-04  5:02       ` Junio C Hamano
2024-12-20 17:19   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2024-12-20 17:19     ` [PATCH v3 1/8] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-22 22:08       ` Taylor Blau
2024-12-20 17:19     ` [PATCH v3 2/8] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-22 22:17       ` Taylor Blau
2025-01-24 17:29         ` Derrick Stolee
2024-12-20 17:19     ` [PATCH v3 3/8] repack: " Derrick Stolee via GitGitGadget
2025-01-22 22:18       ` Taylor Blau
2024-12-20 17:19     ` [PATCH v3 4/8] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-22 22:20       ` Taylor Blau
2024-12-20 17:19     ` [PATCH v3 5/8] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-12-20 17:19     ` [PATCH v3 6/8] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-12-20 17:19     ` [PATCH v3 7/8] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-22 22:22       ` Taylor Blau
2024-12-20 17:19     ` [PATCH v3 8/8] pack-objects: add third name hash version Derrick Stolee via GitGitGadget
2025-01-22 22:37       ` Taylor Blau
2025-01-24 17:34         ` Derrick Stolee
2025-01-21 20:21     ` [PATCH v3 0/8] pack-objects: Create an alternative name hash algorithm (recreated) Derrick Stolee
2025-01-22 23:28       ` Taylor Blau
2025-01-24 17:45         ` Derrick Stolee
2025-01-27 19:02     ` [PATCH v4 0/7] " Derrick Stolee via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 1/7] pack-objects: create new name-hash function version Jonathan Tan via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 2/7] pack-objects: add --name-hash-version option Derrick Stolee via GitGitGadget
2025-01-27 21:18         ` Junio C Hamano
2025-01-29 13:38           ` Derrick Stolee
2025-01-27 19:02       ` [PATCH v4 3/7] repack: " Derrick Stolee via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 4/7] pack-objects: add GIT_TEST_NAME_HASH_VERSION Derrick Stolee via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 5/7] p5313: add size comparison test Derrick Stolee via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 6/7] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2025-01-27 19:02       ` [PATCH v4 7/7] pack-objects: prevent name hash version change Derrick Stolee via GitGitGadget
2025-01-31 21:39       ` [PATCH v4 0/7] pack-objects: Create an alternative name hash algorithm (recreated) Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241121235014.2554033-1-jonathantanmy@google.com \
    --to=jonathantanmy@google$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitgitgadget@gmail$(echo .)com \
    --cc=gitster@pobox$(echo .)com \
    --cc=johannes.schindelin@gmx$(echo .)de \
    --cc=johncai86@gmail$(echo .)com \
    --cc=me@ttaylorr$(echo .)com \
    --cc=newren@gmail$(echo .)com \
    --cc=peff@peff$(echo .)net \
    --cc=ps@pks$(echo .)im \
    --cc=stolee@gmail$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox