public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Kristofer Karlsson <krka@spotify•com>
To: git@vger•kernel.org
Cc: Jeff King <peff@peff•net>, Derrick Stolee <derrickstolee@github•com>
Subject: Re: limiting git branch --contains
Date: Wed, 27 May 2026 09:05:09 +0200	[thread overview]
Message-ID: <20260527070510.3510836-1-krka@spotify.com> (raw)
In-Reply-To: <20230324191009.GA536967@coredump.intra.peff.net>

Hi!

I'm reviving this old thread because I believe the discussion here
is still relevant and the patch Peff included is good. I think I
accidentally created the exact same patch before I found this thread.

I work on a large repo (millions of commits, ~8000 remote tracking
branches) where "git for-each-ref --contains <commit> refs/remotes/"
was taking 4+ seconds even with a commit-graph, and much worse for
deeper commits. I started investigating and building a more complex
batched solution before I realized that the cached DFS algorithm
(contains_tag_algo) already exists and does exactly what's needed --
it just wasn't enabled for branches.

With generation numbers (commit-graph), the cached algorithm is
strictly at least as good as the uncached one. Both have the same
walk floor -- the generation cutoff in contains_tag_algo is equivalent
to the STALE boundary in paint_down_to_common. The cache then provides
a pure win: O(total unique commits) instead of O(N * commits per ref).

In my benchmarks, the improvement is significant and scales with the
number of refs and the depth of the target commit:

git.git (69K commits, 5826 tags) with commit-graph:

  Scenario              Master    Cached    Speedup
  Deep (v2.30.0)        1.08s     292ms     3.7x
  Recent (HEAD~100)     338ms     240ms     1.4x
  Orphan (unreachable)  271ms     241ms     1.1x

Large monorepo with commit-graph:

  Scenario              Master    Cached    Speedup
  HEAD~10000            511ms     239ms     2.1x
  HEAD~50000            4.14s     272ms     15x
  Orphan (unreachable)  4.13s     252ms     16x

I also tested with varying ref counts on git.git. The cached
algorithm is faster or tied in every scenario I tested -- the
crossover point is around 3-6 refs, below which both complete in
single-digit milliseconds.

Without commit-graph, the difference is even more dramatic (41-54x
on git.git with 5826 tags), though the theoretical argument is less
clean: without generation numbers, contains_tag_algo has no walk
floor and may traverse to root commits for unreachable targets.
In practice the cache still wins for N > 1, but I don't have a
proof that covers all possible DAG shapes. I think we should start
by optimizing it for the case where we have generation numbers,
and maybe keep exploring if there are any meaningful scenarios
without generation numbers where it would actually be a regression.

To reproduce the benchmarks:

  # Setup
  ORPHAN=$(git commit-tree HEAD^{tree} -m "unreachable")
  git commit-graph write --reachable

  # git.git with 5826 tags
  git for-each-ref --contains v2.30.0 refs/tags/

  # Large repo with remote refs
  git for-each-ref --contains HEAD~50000 refs/remotes/

-- Kristofer


  reply	other threads:[~2026-05-27  7:05 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-23 18:54 limiting git branch --contains Oswald Buddenhagen
2023-03-23 19:42 ` Junio C Hamano
2023-03-23 20:44   ` Oswald Buddenhagen
2023-03-24 17:23     ` Derrick Stolee
2023-03-24 18:15       ` Oswald Buddenhagen
2023-03-24 18:20         ` Derrick Stolee
2023-03-24 19:02           ` Oswald Buddenhagen
2023-03-24 19:13             ` Jeff King
2023-03-24 19:58               ` Oswald Buddenhagen
2023-03-24 20:45                 ` Jeff King
2023-03-24 22:06                   ` Oswald Buddenhagen
2023-03-25  6:30                     ` Jeff King
2023-03-25  8:05                       ` Oswald Buddenhagen
2023-03-24 19:10       ` Jeff King
2026-05-27  7:05         ` Kristofer Karlsson [this message]
2023-03-23 20:56 ` Felipe Contreras

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260527070510.3510836-1-krka@spotify.com \
    --to=krka@spotify$(echo .)com \
    --cc=derrickstolee@github$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=peff@peff$(echo .)net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox