public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
* [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object()
@ 2025-12-06  0:20 Aaron Plattner
  2025-12-06  2:06 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Aaron Plattner @ 2025-12-06  0:20 UTC (permalink / raw)
  To: git; +Cc: Aaron Plattner, Jeff King

When is_promisor_object() is called for the first time, it lazily
initializes a set of all promisor objects by iterating through all
objects in promisor packs. For each object, add_promisor_object() calls
parse_object(), which decompresses and hashes the entire object.

For repositories with large pack files, this can take an extremely long
time. For example, on a production repository with a 176 GB promisor
pack:

 $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
 ________________________________________________________
 Executed in   76.10 mins    fish           external
    usr time   72.10 mins    1.83 millis   72.10 mins
    sys time    3.56 mins    0.17 millis    3.56 mins

add_promisor_object() needs the full object for trees, commits, and
tags. But blobs contain no references to other objects, so the function
can just insert their oids into the set and move on.

parse_object_with_flags() has code to skip decompressing blobs, but it
unfortunately doesn't work with the objects created by
mark_uninteresting() because they have obj->type == OBJ_NONE. Update
parse_object_with_flags() to handle blobs and trees that are in this
state, and then update add_promisor_object() to use
PARSE_OBJECT_SKIP_HASH_CHECK.

This improves performance for very large pack files significantly:

 $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
 ________________________________________________________
 Executed in  117.63 secs    fish           external
    usr time   45.56 secs    1.09 millis   45.56 secs
    sys time   37.91 secs    1.05 millis   37.91 secs

Signed-off-by: Aaron Plattner <aplattner@nvidia•com>
---
v2: Fix PARSE_OBJECT_SKIP_HASH_CHECK with UNINTERESTING objects, use it
in parse_object_with_flags.

 object.c   | 4 ++--
 packfile.c | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/object.c b/object.c
index b08fc7a163..4669b8d65e 100644
--- a/object.c
+++ b/object.c
@@ -328,7 +328,7 @@ struct object *parse_object_with_flags(struct repository *r,
 			return &commit->object;
 	}
 
-	if ((!obj || obj->type == OBJ_BLOB) &&
+	if ((!obj || obj->type == OBJ_NONE || obj->type == OBJ_BLOB) &&
 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_BLOB) {
 		if (!skip_hash && stream_object_signature(r, repl) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
@@ -344,7 +344,7 @@ struct object *parse_object_with_flags(struct repository *r,
 	 * have the on-disk object with the correct type.
 	 */
 	if (skip_hash && discard_tree &&
-	    (!obj || obj->type == OBJ_TREE) &&
+	    (!obj || obj->type == OBJ_NONE || obj->type == OBJ_TREE) &&
 	    odb_read_object_info(r->objects, oid, NULL) == OBJ_TREE) {
 		return &lookup_tree(r, oid)->object;
 	}
diff --git a/packfile.c b/packfile.c
index 9cc11b6dc5..01b992a4e1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2310,7 +2310,8 @@ static int add_promisor_object(const struct object_id *oid,
 		we_parsed_object = 0;
 	} else {
 		we_parsed_object = 1;
-		obj = parse_object(pack->repo, oid);
+		obj = parse_object_with_flags(pack->repo, oid,
+					      PARSE_OBJECT_SKIP_HASH_CHECK);
 	}
 
 	if (!obj)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object()
  2025-12-06  0:20 [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
@ 2025-12-06  2:06 ` Jeff King
  2025-12-06 19:40   ` Aaron Plattner
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2025-12-06  2:06 UTC (permalink / raw)
  To: Aaron Plattner; +Cc: git

On Fri, Dec 05, 2025 at 04:20:12PM -0800, Aaron Plattner wrote:

> When is_promisor_object() is called for the first time, it lazily
> initializes a set of all promisor objects by iterating through all
> objects in promisor packs. For each object, add_promisor_object() calls
> parse_object(), which decompresses and hashes the entire object.
> 
> For repositories with large pack files, this can take an extremely long
> time. For example, on a production repository with a 176 GB promisor
> pack:
> 
>  $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
>  ________________________________________________________
>  Executed in   76.10 mins    fish           external
>     usr time   72.10 mins    1.83 millis   72.10 mins
>     sys time    3.56 mins    0.17 millis    3.56 mins
> 
> add_promisor_object() needs the full object for trees, commits, and
> tags. But blobs contain no references to other objects, so the function
> can just insert their oids into the set and move on.
> 
> parse_object_with_flags() has code to skip decompressing blobs, but it
> unfortunately doesn't work with the objects created by
> mark_uninteresting() because they have obj->type == OBJ_NONE. Update
> parse_object_with_flags() to handle blobs and trees that are in this
> state, and then update add_promisor_object() to use
> PARSE_OBJECT_SKIP_HASH_CHECK.

Good catch on the matching tree code. It doesn't trigger for your use
case (the caller has to pass in the DISCARD_TREE flag), but it's a
lurking bug nonetheless.

I'm tempted to say that those changes in parse_object_with_flags()
should happen as a separate patch, since they really are fixing an
existing bug. But I can live with it all as one, too.

One other thing it might be worth thinking about or mentioning in the
commit message: we are skipping the hash check on all objects now (not
just blobs). I think this is OK to do along the lines of discussion in
c868d8e91f (parse_object(): allow skipping hash check, 2022-09-06). I
dunno. Maybe it is kind of self-evident that not every operation needs
to do a consistency check of every object.

>  object.c   | 4 ++--
>  packfile.c | 3 ++-
>  2 files changed, 4 insertions(+), 3 deletions(-)

The patch itself looks great to me.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object()
  2025-12-06  2:06 ` Jeff King
@ 2025-12-06 19:40   ` Aaron Plattner
  2025-12-08 20:28     ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Aaron Plattner @ 2025-12-06 19:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 12/5/25 6:06 PM, Jeff King wrote:
> On Fri, Dec 05, 2025 at 04:20:12PM -0800, Aaron Plattner wrote:
> 
>> When is_promisor_object() is called for the first time, it lazily
>> initializes a set of all promisor objects by iterating through all
>> objects in promisor packs. For each object, add_promisor_object() calls
>> parse_object(), which decompresses and hashes the entire object.
>>
>> For repositories with large pack files, this can take an extremely long
>> time. For example, on a production repository with a 176 GB promisor
>> pack:
>>
>>   $ time ~/git/git/git-rev-list --objects --all --exclude-promisor-objects --quiet
>>   ________________________________________________________
>>   Executed in   76.10 mins    fish           external
>>      usr time   72.10 mins    1.83 millis   72.10 mins
>>      sys time    3.56 mins    0.17 millis    3.56 mins
>>
>> add_promisor_object() needs the full object for trees, commits, and
>> tags. But blobs contain no references to other objects, so the function
>> can just insert their oids into the set and move on.
>>
>> parse_object_with_flags() has code to skip decompressing blobs, but it
>> unfortunately doesn't work with the objects created by
>> mark_uninteresting() because they have obj->type == OBJ_NONE. Update
>> parse_object_with_flags() to handle blobs and trees that are in this
>> state, and then update add_promisor_object() to use
>> PARSE_OBJECT_SKIP_HASH_CHECK.
> 
> Good catch on the matching tree code. It doesn't trigger for your use
> case (the caller has to pass in the DISCARD_TREE flag), but it's a
> lurking bug nonetheless.
> 
> I'm tempted to say that those changes in parse_object_with_flags()
> should happen as a separate patch, since they really are fixing an
> existing bug. But I can live with it all as one, too.
> 
> One other thing it might be worth thinking about or mentioning in the
> commit message: we are skipping the hash check on all objects now (not
> just blobs). I think this is OK to do along the lines of discussion in
> c868d8e91f (parse_object(): allow skipping hash check, 2022-09-06). I
> dunno. Maybe it is kind of self-evident that not every operation needs
> to do a consistency check of every object.

I was rewriting the commit message for that part to justify why it's 
safe to use PARSE_OBJECT_SKIP_HASH_CHECK, and now I'm questioning it. :)

It definitely seems fine for blobs but if what we're trying to check for 
is on-disk corruption, maybe it's not a good idea to skip it for other 
objects since we're actually using their contents here.

I still think the OBJ_NONE fix is worthwhile and I'll send that out 
separately, but maybe it would be a good idea to split 
PARSE_OBJECT_SKIP_HASH_CHECK into separate flags for blobs vs. all 
objects? Or just go back to v1 of the add_promisor_object() patch? Or do 
you think this version is okay despite that concern?

-- Aaron

>>   object.c   | 4 ++--
>>   packfile.c | 3 ++-
>>   2 files changed, 4 insertions(+), 3 deletions(-)
> 
> The patch itself looks great to me.
> 
> -Peff


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object()
  2025-12-06 19:40   ` Aaron Plattner
@ 2025-12-08 20:28     ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2025-12-08 20:28 UTC (permalink / raw)
  To: Aaron Plattner; +Cc: git

On Sat, Dec 06, 2025 at 11:40:01AM -0800, Aaron Plattner wrote:

> I was rewriting the commit message for that part to justify why it's safe to
> use PARSE_OBJECT_SKIP_HASH_CHECK, and now I'm questioning it. :)
> 
> It definitely seems fine for blobs but if what we're trying to check for is
> on-disk corruption, maybe it's not a good idea to skip it for other objects
> since we're actually using their contents here.
> 
> I still think the OBJ_NONE fix is worthwhile and I'll send that out
> separately, but maybe it would be a good idea to split
> PARSE_OBJECT_SKIP_HASH_CHECK into separate flags for blobs vs. all objects?
> Or just go back to v1 of the add_promisor_object() patch? Or do you think
> this version is okay despite that concern?

I think it's OK to skip the hashes here. In some code paths we really
care about checking the consistency of the objects (like "rev-list
--verify", or fsck). But if the caller knows that we are not trying to
do that, then that opens the doors to other optimizations like avoiding
object loads entirely.

You could argue that if we _do_ load an object, we might as well hash it
to check its consistency. IMHO there is not much value in that, as the
cost is not totally trivial, it's not a thorough validation of what's on
disk (because we are skipping some objects), and this sort of
bit-flipping corruption is pretty rare in the first place.

So to my mind, there are really two types of callers that want to parse
an object: ones that are verifying database consistency and want to be
thorough, and ones that want things to be as fast as possible. And I
think the caller here (traversing the promisor objects) is in the latter
camp. E.g., if we ever added a secondary index of "these are all the
promised objects we don't have" we would just use that!

So really, I think all I was suggesting for the commit message is to say
"this is the kind of caller that wants things to be as fast as possible". ;)

The ideal series to me is probably:

  - patch 1 fixes the OBJ_NONE issues. It would be great if there was
    some way to show the impact of this bug on an existing case, but I
    imagine it would be quite hard. We'd never produce the wrong answer
    but just do things slowly. So you'd need a case where we end up with
    OBJ_NONE, and then a SKIP_HASH code path that loads a big blob. It's
    easy to find the latter (just hand rev-list a blob on the command
    line), but the former is harder. I doubt it's worth the effort to
    dig for one, since we know your case in patch 2 will demonstrate it.

  - patch 2 uses SKIP_HASH

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-12-08 20:28 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-06  0:20 [PATCH v2] packfile: skip decompressing and hashing blobs in add_promisor_object() Aaron Plattner
2025-12-06  2:06 ` Jeff King
2025-12-06 19:40   ` Aaron Plattner
2025-12-08 20:28     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox