From: Derrick Stolee <stolee@gmail•com>
To: "SZEDER Gábor" <szeder.dev@gmail•com>, git@vger•kernel.org
Subject: Re: [PATCH 1/4] line-log: avoid unnecessary tree diffs when processing merge commits
Date: Mon, 25 Aug 2025 10:13:43 -0400 [thread overview]
Message-ID: <930745d3-85a6-467b-a87b-b57e9623a604@gmail.com> (raw)
In-Reply-To: <20250824190644.2573279-2-szeder.dev@gmail.com>
On 8/24/2025 3:06 PM, SZEDER Gábor wrote:
> In process_ranges_merge_commit(), the line-level log first creates an
> array of diff queues by iterating over all parents of a merge commit
> and computing a tree diff for each. Then in a second loop it iterates
> over those diff queues, and if it finds that none of the interesting
> paths were modified in one of them, then it will return early. This
> means that when none of the interesting paths were modified between a
> merge and its first parent, then the tree diff between the merge and
> its second (Nth...) parent was computed in vain.
Great find! This goes all the way back to the original implementation
in 12da1d1f6f (Implement line-history search (git log -L), 2013-03-28)
where this detail could easily be missed in the rest of the scaffolding
to implement the feature.
This is an understandable mistake to make as it can take some time to
understand Git's simplified history computation and how it short-
circuits these merge diffs in most cases.
> Summary
> './git -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0' ran
> 2.25 ± 0.03 times faster than './git_v2.51.0 -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0'
> Summary
> './git -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16' ran
> 1.32 ± 0.01 times faster than './git_v2.51.0 -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16'
Great stats!
> And since now each iteration computes a tree diff and processes its
> result, there is no reason to store the diff queues for each merge
> parent anymore, so replace that diff queue array with a loop-local
> diff queue variable. With this change the static free_diffqueues()
> helper function in 'line-log.c' has no more callers left, remove it.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail•com>
> ---
> line-log.c | 20 +++++---------------
> 1 file changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/line-log.c b/line-log.c
> index 07f2154e84..cf30915c94 100644
> --- a/line-log.c
> +++ b/line-log.c
> @@ -1087,13 +1087,6 @@ static struct diff_filepair *diff_filepair_dup(struct diff_filepair *pair)
> return new_filepair;
> }
>
> -static void free_diffqueues(int n, struct diff_queue_struct *dq)
> -{
> - for (int i = 0; i < n; i++)
> - diff_queue_clear(&dq[i]);
> - free(dq);
> -}
> -
> static int process_all_files(struct line_log_data **range_out,
> struct rev_info *rev,
> struct diff_queue_struct *queue,
> @@ -1209,7 +1202,6 @@ static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *c
> static int process_ranges_merge_commit(struct rev_info *rev, struct commit *commit,
> struct line_log_data *range)
> {
> - struct diff_queue_struct *diffqueues;
> struct line_log_data **cand;
> struct commit **parents;
> struct commit_list *p;
> @@ -1220,20 +1212,19 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm
> if (nparents > 1 && rev->first_parent_only)
> nparents = 1;
>
> - ALLOC_ARRAY(diffqueues, nparents);
> CALLOC_ARRAY(cand, nparents);
> ALLOC_ARRAY(parents, nparents);
>
> p = commit->parents;
> for (i = 0; i < nparents; i++) {
> + struct diff_queue_struct diffqueue = DIFF_QUEUE_INIT;
> + int changed;
> parents[i] = p->item;
> p = p->next;
> - queue_diffs(range, &rev->diffopt, &diffqueues[i], commit, parents[i]);
> - }
> + queue_diffs(range, &rev->diffopt, &diffqueue, commit, parents[i]);
>
> - for (i = 0; i < nparents; i++) {
> - int changed;
> - changed = process_all_files(&cand[i], rev, &diffqueues[i], range);
> + changed = process_all_files(&cand[i], rev, &diffqueue, range);
> + diff_queue_clear(&diffqueue);
> if (!changed) {
> /*
> * This parent can take all the blame, so we
> @@ -1267,7 +1258,6 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm
> free(cand[i]);
> }
> free(cand);
> - free_diffqueues(nparents, diffqueues);
> return ret;
>
> /* NEEDSWORK evil merge detection stuff */
This diff is a lot cleaner than I expected it to be. Excellent!
I applied this patch locally and tested it on a few repos I have
to give extra confidence to your patch.
For an internal monorepo, I was able to measure these results:
Benchmark 1: old
Time (mean ± σ): 19.709 s ± 0.014 s [User: 18.846 s, System: 0.862 s]
Range (min … max): 19.681 s … 19.725 s 10 runs
Benchmark 2: new
Time (mean ± σ): 9.061 s ± 0.015 s [User: 8.487 s, System: 0.574 s]
Range (min … max): 9.042 s … 9.089 s 10 runs
Summary
'new' ran
2.18 ± 0.00 times faster than 'old'
I did also want to check to see the impact of f32dde8c12 (line-log:
integrate with changed-path Bloom filters, 2020-05-11), and having
computed filters diminished the size of your impact somewhat:
Your Git example:
Benchmark 1: old
Time (mean ± σ): 279.2 ms ± 2.2 ms [User: 231.0 ms, System: 47.9 ms]
Range (min … max): 275.5 ms … 282.6 ms 10 runs
Benchmark 2: new
Time (mean ± σ): 242.4 ms ± 3.6 ms [User: 191.8 ms, System: 50.4 ms]
Range (min … max): 237.3 ms … 249.9 ms 12 runs
Summary
'new ' ran
1.15 ± 0.02 times faster than 'old'
Your Linux example:
Benchmark 1: old
Time (mean ± σ): 1.694 s ± 0.008 s [User: 1.524 s, System: 0.169 s]
Range (min … max): 1.688 s … 1.714 s 10 runs
Benchmark 2: new
Time (mean ± σ): 1.644 s ± 0.008 s [User: 1.482 s, System: 0.161 s]
Range (min … max): 1.636 s … 1.663 s 10 runs
Summary
'new ' ran
1.03 ± 0.01 times faster than 'old'
My internal monorepo example:
Benchmark 1: old
Time (mean ± σ): 3.749 s ± 0.007 s [User: 3.188 s, System: 0.559 s]
Range (min … max): 3.736 s … 3.759 s 10 runs
Benchmark 2: new
Time (mean ± σ): 2.713 s ± 0.005 s [User: 2.318 s, System: 0.394 s]
Range (min … max): 2.706 s … 2.723 s 10 runs
Summary
'new' ran
1.38 ± 0.00 times faster than 'old'
Rerunning with "-c commitGraph.readChangedPaths=false" resulted
in numbers closer to your examples (940ms->420ms for Git and
2.6->2.1s for Linux). I expect most users to be in the situation
where there are no changed-path Bloom filters, so this is very
good to deliver that value.
At first, I found this to be concerning: we only store filters
for the diff between a commit and its first parent, so this cost
of visiting the later parents should be _much worse_ in those
cases. However, it turns out that the way that the filters are
handled in line_log_process_ranges_arbitrary_commit() avoids a
call to process_ranges_merge_commit() if the filter says that the
first parent is TREESAME on the given path.
This means that a good chunk of the performance benefits in
f32dde8c12 (line-log: integrate with changed-path Bloom filters,
2020-05-11) are _actually_ due to avoiding this extra work for
multiple parents. Thanks for digging in and bringing this benefit
to all users!
Thanks,
-Stolee
next prev parent reply other threads:[~2025-08-25 14:14 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-24 19:06 [PATCH 0/4] line-log: optimize merge commit processing SZEDER Gábor
2025-08-24 19:06 ` [PATCH 1/4] line-log: avoid unnecessary tree diffs when processing merge commits SZEDER Gábor
2025-08-25 14:13 ` Derrick Stolee [this message]
2025-08-25 15:35 ` Junio C Hamano
2025-08-28 20:27 ` SZEDER Gábor
2025-08-24 19:06 ` [PATCH 2/4] line-log: get rid of the parents array in process_ranges_merge_commit() SZEDER Gábor
2025-08-24 19:06 ` [PATCH 3/4] line-log: initialize diff queue in process_ranges_ordinary_commit() SZEDER Gábor
2025-08-24 19:06 ` [PATCH 4/4] line-log: simplify condition checking for merge commits SZEDER Gábor
2025-08-25 20:57 ` Junio C Hamano
2025-08-25 21:43 ` Derrick Stolee
2025-08-25 21:57 ` Junio C Hamano
2025-08-25 14:16 ` [PATCH 0/4] line-log: optimize merge commit processing Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=930745d3-85a6-467b-a87b-b57e9623a604@gmail.com \
--to=stolee@gmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=szeder.dev@gmail$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox