public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Toon Claes <toon@iotcl•com>
To: Karthik Nayak <karthik.188@gmail•com>, git@vger•kernel.org
Cc: jltobler@gmail•com, ps@pks•im, Karthik Nayak <karthik.188@gmail•com>
Subject: Re: [PATCH 2/2] bundle: fix non-linear performance scaling with refs
Date: Thu, 03 Apr 2025 21:07:31 +0200	[thread overview]
Message-ID: <874iz5f4i4.fsf@iotcl.com> (raw)
In-Reply-To: <20250401-488-generating-bundles-with-many-references-has-non-linear-performance-v1-2-6d23b2d96557@gmail.com>

Karthik Nayak <karthik.188@gmail•com> writes:

> The 'git bundle create' command has non-linear performance with the
> number of refs in the repository. Benchmarking the command shows that
> a large portion of the time (~75%) is spent in the
> `object_array_remove_duplicates()` function.
>
> The `object_array_remove_duplicates()` function was added in
> b2a6d1c686 (bundle: allow the same ref to be given more than once,
> 2009-01-17) to skip duplicate refs provided by the user from being
> written to the bundle. Since this is an O(N^2) algorithm, in repos with
> large number of references, this can take up a large amount of time.
>
> Let's instead use a 'strset' to skip duplicates inside
> `write_bundle_refs()`. This improves the performance by around 6 times
> when tested against in repository with 100000 refs:
>
> Benchmark 1: bundle (refcount = 100000, revision = master)
>   Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
>   Range (min … max):   14.237 s … 14.920 s    10 runs
>
> Benchmark 2: bundle (refcount = 100000, revision = HEAD)
>   Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
>   Range (min … max):    2.364 s …  2.425 s    10 runs
>
> Summary
>   bundle (refcount = 100000, revision = HEAD) ran
>     6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)

That's a good find!

> Previously, `object_array_remove_duplicates()` ensured that both the
> refname and the object it pointed to were checked for duplicates. The
> new approach, implemented within `write_bundle_refs()`, eliminates
> duplicate refnames without comparing the objects they reference. This
> works because, for bundle creation, we only need to prevent duplicate
> refs from being written to the bundle header. The `revs->pending` array
> can contain duplicates of multiple types.

Makes sense to me.

> First, references which resolve to the same refname. For e.g. "git
> bundle create out.bdl master master" or "git bundle create out.bdl
> refs/heads/master refs/heads/master" or "git bundle create out.bdl
> master refs/heads/master". In these scenarios we want to prevent writing
> "refs/heads/master" twice to the bundle header. Since both the refnames
> here would point to the same object (unless there is a race), we do not
> need to check equality of the object.

Yeah, we can never be sure about the changes that happen while the
bundle is being created. I fixed another race[1] recently which also was
comparing equality of the object, that causes the ref to be omitted. We
can only act by "best effort" and having the ref point to /some/ object
is the best we can do.

[1]: https://lore.kernel.org/git/20241211-fix-bundle-create-race-v3-1-0587f6f9db1b@iotcl.com/

> Second, refnames which are duplicates but do not point to the same
> object. This can happen when we use an exclusion criteria. For e.g. "git
> bundle create out.bdl master master^!", Here `revs->pending` would
> contain two elements, both with refname set to "master". However, each
> of them would be pointing to an INTERESTING and UNINTERESTING object
> respectively. Since we only write refnames with INTERESTING objects to
> the bundle header, we perform our duplicate checks only on such
> objects.

Thanks for that context, I didn't consider that.

> Signed-off-by: Karthik Nayak <karthik.188@gmail•com>
> ---
>  bundle.c               | 10 +++++++++-
>  object.c               | 33 ---------------------------------
>  object.h               |  6 ------
>  t/t6020-bundle-misc.sh |  4 ----
>  4 files changed, 9 insertions(+), 44 deletions(-)
>
> diff --git a/bundle.c b/bundle.c
> index d7ad690843..30cfba0be2 100644
> --- a/bundle.c
> +++ b/bundle.c
> @@ -384,6 +384,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>  {
>  	int i;
>  	int ref_count = 0;
> +	struct strset objects;
> +
> +	strset_init(&objects);

Any reason why you're not using the `STRMAP_INIT` macro?

>  
>  	for (i = 0; i < revs->pending.nr; i++) {
>  		struct object_array_entry *e = revs->pending.objects + i;
> @@ -401,6 +404,9 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>  			flag = 0;
>  		display_ref = (flag & REF_ISSYMREF) ? e->name : ref;
>  
> +		if (strset_contains(&objects, display_ref))
> +			goto skip_write_ref;
> +
>  		if (e->item->type == OBJ_TAG &&
>  				!is_tag_in_date_range(e->item, revs)) {
>  			e->item->flags |= UNINTERESTING;
> @@ -423,6 +429,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>  		}
>  
>  		ref_count++;
> +		strset_add(&objects, display_ref);
>  		write_or_die(bundle_fd, oid_to_hex(&e->item->oid), the_hash_algo->hexsz);
>  		write_or_die(bundle_fd, " ", 1);
>  		write_or_die(bundle_fd, display_ref, strlen(display_ref));
> @@ -431,6 +438,8 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
>  		free(ref);
>  	}
>  
> +	strset_clear(&objects);
> +
>  	/* end header */
>  	write_or_die(bundle_fd, "\n", 1);
>  	return ref_count;
> @@ -566,7 +575,6 @@ int create_bundle(struct repository *r, const char *path,
>  	 */
>  	revs.blob_objects = revs.tree_objects = 0;
>  	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
> -	object_array_remove_duplicates(&revs_copy.pending);
>  
>  	/* write bundle refs */
>  	ref_count = write_bundle_refs(bundle_fd, &revs_copy);
> diff --git a/object.c b/object.c
> index 100bf9b8d1..a2c5986178 100644
> --- a/object.c
> +++ b/object.c
> @@ -491,39 +491,6 @@ void object_array_clear(struct object_array *array)
>  	array->nr = array->alloc = 0;
>  }
>  
> -/*
> - * Return true if array already contains an entry.
> - */
> -static int contains_object(struct object_array *array,
> -			   const struct object *item, const char *name)
> -{
> -	unsigned nr = array->nr, i;
> -	struct object_array_entry *object = array->objects;
> -
> -	for (i = 0; i < nr; i++, object++)
> -		if (item == object->item && !strcmp(object->name, name))
> -			return 1;
> -	return 0;
> -}
> -
> -void object_array_remove_duplicates(struct object_array *array)
> -{
> -	unsigned nr = array->nr, src;
> -	struct object_array_entry *objects = array->objects;
> -
> -	array->nr = 0;
> -	for (src = 0; src < nr; src++) {
> -		if (!contains_object(array, objects[src].item,
> -				     objects[src].name)) {
> -			if (src != array->nr)
> -				objects[array->nr] = objects[src];
> -			array->nr++;
> -		} else {
> -			object_array_release_entry(&objects[src]);
> -		}
> -	}
> -}
> -
>  void clear_object_flags(unsigned flags)
>  {
>  	int i;
> diff --git a/object.h b/object.h
> index 17f32f1103..0e12c75922 100644
> --- a/object.h
> +++ b/object.h
> @@ -324,12 +324,6 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
>  void object_array_filter(struct object_array *array,
>  			 object_array_each_func_t want, void *cb_data);
>  
> -/*
> - * Remove from array all but the first entry with a given name.
> - * Warning: this function uses an O(N^2) algorithm.

Funny this has been here for more than 10 years. Thanks for this cleanup.

> - */
> -void object_array_remove_duplicates(struct object_array *array);
> -
>  /*
>   * Remove any objects from the array, freeing all used memory; afterwards
>   * the array is ready to store more objects with add_object_array().
> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
> index dd09df1287..500c81b8a1 100755
> --- a/t/t6020-bundle-misc.sh
> +++ b/t/t6020-bundle-misc.sh
> @@ -684,7 +684,6 @@ test_expect_success 'create bundle with duplicate refnames' '
>  	test_cmp expect actual
>  '
>  
> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>  test_expect_success 'create bundle with duplicate refnames and --all' '
>  	git bundle create out.bdl --all "main" "main" &&
>  
> @@ -701,7 +700,6 @@ test_expect_success 'create bundle with duplicate refnames and --all' '
>  	<TAG-2> refs/tags/v2
>  	<TAG-3> refs/tags/v3
>  	<COMMIT-P> HEAD
> -	<COMMIT-P> refs/heads/main
>  	EOF
>  	test_cmp expect actual
>  '
> @@ -717,7 +715,6 @@ test_expect_success 'create bundle with duplicate exlusion refnames' '
>  	test_cmp expect actual
>  '
>  
> -# This exhibits a bug, since the same refname is now added to the bundle twice.
>  test_expect_success 'create bundle with duplicate refname short-form' '
>  	git bundle create out.bdl "main" "main" "refs/heads/main" "refs/heads/main" &&
>  
> @@ -725,7 +722,6 @@ test_expect_success 'create bundle with duplicate refname short-form' '
>  		make_user_friendly_and_stable_output >actual &&
>  	cat >expect <<-\EOF &&
>  	<COMMIT-P> refs/heads/main
> -	<COMMIT-P> refs/heads/main
>  	EOF
>  	test_cmp expect actual
>  '

Great work on the alternative implmentation. And thanks for adding these
tests and actually fixing them. I've been manually testing a few more
edge cases, I couldn't find any other scenario that's not covered by the
current implementation.

I approve.

--
Toon

  reply	other threads:[~2025-04-03 19:07 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-01 17:00 [PATCH 0/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-01 17:00 ` [PATCH 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-01 17:00 ` [PATCH 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-03 19:07   ` Toon Claes [this message]
2025-04-06 20:48     ` Karthik Nayak
2025-04-08  9:00 ` [PATCH v2 0/2] " Karthik Nayak
2025-04-08  9:00   ` [PATCH v2 1/2] t6020: test for duplicate refnames in bundle creation Karthik Nayak
2025-04-08  9:00   ` [PATCH v2 2/2] bundle: fix non-linear performance scaling with refs Karthik Nayak
2025-04-10  8:57     ` Toon Claes
2025-04-10  9:04       ` Karthik Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874iz5f4i4.fsf@iotcl.com \
    --to=toon@iotcl$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=jltobler@gmail$(echo .)com \
    --cc=karthik.188@gmail$(echo .)com \
    --cc=ps@pks$(echo .)im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox