public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Aaron Plattner <aplattner@nvidia•com>
To: Jeff King <peff@peff•net>
Cc: git@vger•kernel.org
Subject: Re: [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object()
Date: Sat, 6 Dec 2025 11:40:01 -0800	[thread overview]
Message-ID: <51c866cb-9a7a-4c59-834a-2f710f34f3a1@nvidia.com> (raw)
In-Reply-To: <20251206020648.GB1714099@coredump.intra.peff.net>

On 12/5/25 6:06 PM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 04:20:12PM -0800, Aaron Plattner wrote:
> 
>> When is_promisor_object() is called for the first time, it lazily
>> initializes a set of all promisor objects by iterating through all
>> objects in promisor packs. For each object, add_promisor_object() calls
>> parse_object(), which decompresses and hashes the entire object.
>>
>> For repositories with large pack files, this can take an extremely long
>> time. For example, on a production repository with a 176 GB promisor
>> pack:
>>
>>   $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
>>   ________________________________________________________
>>   Executed in   76.10 mins    fish           external
>>      usr time   72.10 mins    1.83 millis   72.10 mins
>>      sys time    3.56 mins    0.17 millis    3.56 mins
>>
>> add_promisor_object() needs the full object for trees, commits, and
>> tags. But blobs contain no references to other objects, so the function
>> can just insert their oids into the set and move on.
>>
>> parse_object_with_flags() has code to skip decompressing blobs, but it
>> unfortunately doesn't work with the objects created by
>> mark_uninteresting() because they have obj->type == OBJ_NONE. Update
>> parse_object_with_flags() to handle blobs and trees that are in this
>> state, and then update add_promisor_object() to use
>> PARSE_OBJECT_SKIP_HASH_CHECK.
> 
> Good catch on the matching tree code. It doesn't trigger for your use
> case (the caller has to pass in the DISCARD_TREE flag), but it's a
> lurking bug nonetheless.
> 
> I'm tempted to say that those changes in parse_object_with_flags()
> should happen as a separate patch, since they really are fixing an
> existing bug. But I can live with it all as one, too.
> 
> One other thing it might be worth thinking about or mentioning in the
> commit message: we are skipping the hash check on all objects now (not
> just blobs). I think this is OK to do along the lines of discussion in
> c868d8e91f (parse_object(): allow skipping hash check, 2022-09-06). I
> dunno. Maybe it is kind of self-evident that not every operation needs
> to do a consistency check of every object.

I was rewriting the commit message for that part to justify why it's 
safe to use PARSE_OBJECT_SKIP_HASH_CHECK, and now I'm questioning it. :)

It definitely seems fine for blobs but if what we're trying to check for 
is on-disk corruption, maybe it's not a good idea to skip it for other 
objects since we're actually using their contents here.

I still think the OBJ_NONE fix is worthwhile and I'll send that out 
separately, but maybe it would be a good idea to split 
PARSE_OBJECT_SKIP_HASH_CHECK into separate flags for blobs vs. all 
objects? Or just go back to v1 of the add_promisor_object() patch? Or do 
you think this version is okay despite that concern?

-- Aaron

>>   object.c   | 4 ++--
>>   packfile.c | 3 ++-
>>   2 files changed, 4 insertions(+), 3 deletions(-)
> 
> The patch itself looks great to me.
> 
> -Peff


  reply	other threads:[~2025-12-06 19:40 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-06  0:20 [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
2025-12-06  2:06 ` Jeff King
2025-12-06 19:40   ` Aaron Plattner [this message]
2025-12-08 20:28     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51c866cb-9a7a-4c59-834a-2f710f34f3a1@nvidia.com \
    --to=aplattner@nvidia$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=peff@peff$(echo .)net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox