public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail•com>
To: Junio C Hamano <gitster@pobox•com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org, johannes.schindelin@gmx•de, peff@peff•net,
	ps@pks•im, me@ttaylorr•com, johncai86@gmail•com,
	newren@gmail•com
Subject: Re: [PATCH 0/4] pack-objects: create new name-hash algorithm
Date: Mon, 9 Sep 2024 22:37:30 -0400	[thread overview]
Message-ID: <0e6dde0f-63e2-4db3-9225-9a8ca5e78eb3@gmail.com> (raw)
In-Reply-To: <xmqqjzfkr9b0.fsf@gitster.g>

On 9/9/24 2:06 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail•com> writes:
> 
>> One way to find some improvement in these repositories is to increase the
>> window size, which was an initial indicator that the delta compression could
>> be improved, but was not a clear indicator. After some digging (and
>> prototyping some analysis tools) the main discovery was that the current
>> name-hash algorithm only considers the last 16 characters in the path name
>> and has some naturally-occurring collisions within that scope.
> 
> Yes, as I explained in the other message, this "collision" is an
> integral part of the design to allow us gather candidates together
> that may yield good deltas among them.  In addition, header files
> whose names end with ".h" tend to share a bit comment at the
> beginning of them in many projects, and the proximity (not
> "collision") of the hash value is used to make them delta candidates
> with each other.
> 
> I do agree that considering files at the same path from different
> (but close-by) revisions as the prime candidates is very important,
> but if you spread the "collissions" very thin by using "uniform
> distribution", I am afraid that you'd end up discarding anything but
> the blobs at the same path, which may go too far.  Having name hash
> value that are close by no longer has any meaning in such a system.

You are right that some "nearby" paths are lost in this change, and
this can be measured by trying to use this option in the pack-objects
process underneath a small 'git push'.

The thing that surprised me is just how effective this is for the
creation of large pack-files that include many versions of most
files. The cross-path deltas have less of an effect here, and the
benefits of avoiding name-hash collisions can be overwhelming in
many cases.

> I hope you can find a solution that strikes a good balance at the
> end of the series (I saw only the first step), but I suspect an easy
> way to avoid the downsides you observed is to use both.  Compare
> with a handful of blobs taken from nearby commits (the original
> object order is roughly in traversal order, and you can take
> advantage of that fact) from exactly the same path (using your
> "uniform distribution") before comparing with the blobs with close
> value (of the current function) like the current implementation
> does, may go a long way.

Funny you should say that, since the RFC I finally submitted [1]
actually does just that. The --path-walk option changes the object
walk to consider batches of objects based on their path, computes
deltas among that batch, and then does the normal name-hash pass
later. This seems to really strike the balance that you are
looking for and solves the issues where small pushes need to stay
small. It also fixes some problematic cases even when pushing a
single commit.

[1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

However, the --path-walk option requires significant implementation
of a "path walk API" and my RFC doesn't even do threading right.
The --path-walk version (probably) doesn't work with delta islands
or other features the same way as the drop-in change to the name-
hash heuristic can. For that reason, I think there is likely some
long-term value to the --full-name-hash option even though the
--path-walk option will be better in many cases.

Thanks,
-Stolee


  reply	other threads:[~2024-09-10  2:37 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-09 13:56 [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 1/4] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-09 17:45   ` Junio C Hamano
2024-09-10  2:31     ` Derrick Stolee
2024-09-09 13:56 ` [PATCH 2/4] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-09 17:48   ` Junio C Hamano
2024-09-09 13:56 ` [PATCH 3/4] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-09 13:56 ` [PATCH 4/4] p5314: add a size test for name-hash collisions Derrick Stolee via GitGitGadget
2024-09-09 16:07 ` [PATCH 0/4] pack-objects: create new name-hash algorithm Derrick Stolee
2024-09-09 18:06 ` Junio C Hamano
2024-09-10  2:37   ` Derrick Stolee [this message]
2024-09-10 14:56     ` Taylor Blau
2024-09-10 21:05       ` Derrick Stolee
2024-09-11  6:32         ` Jeff King
2024-09-10 16:07     ` Junio C Hamano
2024-09-10 20:36       ` Junio C Hamano
2024-09-10 21:09         ` Derrick Stolee
2024-09-18 20:46 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 1/6] pack-objects: add --full-name-hash option Derrick Stolee via GitGitGadget
2024-09-19 21:56     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 2/6] repack: test " Derrick Stolee via GitGitGadget
2024-09-19 22:11     ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 3/6] pack-objects: add GIT_TEST_FULL_NAME_HASH Derrick Stolee via GitGitGadget
2024-09-19 22:22     ` Junio C Hamano
2024-09-23  1:39       ` Derrick Stolee
2024-09-18 20:46   ` [PATCH v2 4/6] git-repack: update usage to match docs Derrick Stolee via GitGitGadget
2024-09-18 20:46   ` [PATCH v2 5/6] p5313: add size comparison test Derrick Stolee via GitGitGadget
2024-09-19 21:58     ` Junio C Hamano
2024-09-23  1:50       ` Derrick Stolee
2024-09-23 16:14         ` Junio C Hamano
2024-09-18 20:46   ` [PATCH v2 6/6] test-tool: add helper for name-hash values Derrick Stolee via GitGitGadget
2024-09-24  7:02     ` Patrick Steinhardt
2024-09-25 13:35       ` Derrick Stolee
2024-09-18 23:30   ` [PATCH v2 0/6] pack-objects: create new name-hash algorithm Derrick Stolee
2024-10-04 21:17   ` Junio C Hamano
2024-10-04 22:30     ` Taylor Blau
2024-10-04 22:35     ` Jeff King
2024-10-08 18:32   ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0e6dde0f-63e2-4db3-9225-9a8ca5e78eb3@gmail.com \
    --to=stolee@gmail$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitgitgadget@gmail$(echo .)com \
    --cc=gitster@pobox$(echo .)com \
    --cc=johannes.schindelin@gmx$(echo .)de \
    --cc=johncai86@gmail$(echo .)com \
    --cc=me@ttaylorr$(echo .)com \
    --cc=newren@gmail$(echo .)com \
    --cc=peff@peff$(echo .)net \
    --cc=ps@pks$(echo .)im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox