From: Kristofer Karlsson <krka@spotify•com>
To: git@vger•kernel.org
Cc: Jeff King <peff@peff•net>, Derrick Stolee <derrickstolee@github•com>
Subject: Re: limiting git branch --contains
Date: Wed, 27 May 2026 09:05:09 +0200 [thread overview]
Message-ID: <20260527070510.3510836-1-krka@spotify.com> (raw)
In-Reply-To: <20230324191009.GA536967@coredump.intra.peff.net>
Hi!
I'm reviving this old thread because I believe the discussion here
is still relevant and the patch Peff included is good. I think I
accidentally created the exact same patch before I found this thread.
I work on a large repo (millions of commits, ~8000 remote tracking
branches) where "git for-each-ref --contains <commit> refs/remotes/"
was taking 4+ seconds even with a commit-graph, and much worse for
deeper commits. I started investigating and building a more complex
batched solution before I realized that the cached DFS algorithm
(contains_tag_algo) already exists and does exactly what's needed --
it just wasn't enabled for branches.
With generation numbers (commit-graph), the cached algorithm is
strictly at least as good as the uncached one. Both have the same
walk floor -- the generation cutoff in contains_tag_algo is equivalent
to the STALE boundary in paint_down_to_common. The cache then provides
a pure win: O(total unique commits) instead of O(N * commits per ref).
In my benchmarks, the improvement is significant and scales with the
number of refs and the depth of the target commit:
git.git (69K commits, 5826 tags) with commit-graph:
Scenario Master Cached Speedup
Deep (v2.30.0) 1.08s 292ms 3.7x
Recent (HEAD~100) 338ms 240ms 1.4x
Orphan (unreachable) 271ms 241ms 1.1x
Large monorepo with commit-graph:
Scenario Master Cached Speedup
HEAD~10000 511ms 239ms 2.1x
HEAD~50000 4.14s 272ms 15x
Orphan (unreachable) 4.13s 252ms 16x
I also tested with varying ref counts on git.git. The cached
algorithm is faster or tied in every scenario I tested -- the
crossover point is around 3-6 refs, below which both complete in
single-digit milliseconds.
Without commit-graph, the difference is even more dramatic (41-54x
on git.git with 5826 tags), though the theoretical argument is less
clean: without generation numbers, contains_tag_algo has no walk
floor and may traverse to root commits for unreachable targets.
In practice the cache still wins for N > 1, but I don't have a
proof that covers all possible DAG shapes. I think we should start
by optimizing it for the case where we have generation numbers,
and maybe keep exploring if there are any meaningful scenarios
without generation numbers where it would actually be a regression.
To reproduce the benchmarks:
# Setup
ORPHAN=$(git commit-tree HEAD^{tree} -m "unreachable")
git commit-graph write --reachable
# git.git with 5826 tags
git for-each-ref --contains v2.30.0 refs/tags/
# Large repo with remote refs
git for-each-ref --contains HEAD~50000 refs/remotes/
-- Kristofer
next prev parent reply other threads:[~2026-05-27 7:05 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-23 18:54 limiting git branch --contains Oswald Buddenhagen
2023-03-23 19:42 ` Junio C Hamano
2023-03-23 20:44 ` Oswald Buddenhagen
2023-03-24 17:23 ` Derrick Stolee
2023-03-24 18:15 ` Oswald Buddenhagen
2023-03-24 18:20 ` Derrick Stolee
2023-03-24 19:02 ` Oswald Buddenhagen
2023-03-24 19:13 ` Jeff King
2023-03-24 19:58 ` Oswald Buddenhagen
2023-03-24 20:45 ` Jeff King
2023-03-24 22:06 ` Oswald Buddenhagen
2023-03-25 6:30 ` Jeff King
2023-03-25 8:05 ` Oswald Buddenhagen
2023-03-24 19:10 ` Jeff King
2026-05-27 7:05 ` Kristofer Karlsson [this message]
2023-03-23 20:56 ` Felipe Contreras
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260527070510.3510836-1-krka@spotify.com \
--to=krka@spotify$(echo .)com \
--cc=derrickstolee@github$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=peff@peff$(echo .)net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox