From: "René Scharfe" <l.s.r@web•de>
To: Junio C Hamano <gitster@pobox•com>
Cc: Git List <git@vger•kernel.org>,
"D. Ben Knoble" <ben.knoble@gmail•com>, Jeff King <peff@peff•net>,
Phillip Wood <phillip.wood@dunelm•org.uk>
Subject: Re: [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change()
Date: Tue, 2 Dec 2025 23:07:17 +0100 [thread overview]
Message-ID: <9a253514-376f-49fd-99fe-f076ecb180b6@web.de> (raw)
In-Reply-To: <xmqq5xarcsb8.fsf@gitster.g>
On 11/30/25 7:02 PM, Junio C Hamano wrote:
> René Scharfe <l.s.r@web•de> writes:
>
>> Add a new streamlined function for queuing unchanged filepairs and
>> use it in show_modified(), which is called by diff_cache() via
>> oneway_diff() and do_oneway_diff(). It allocates only a single filespec
>> for each filepair and uses it twice with reference counting. This has a
>> measurable effect if there are a lot of them, like in the Linux repo:
>>
>> Benchmark 1: ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>> Time (mean ± σ): 31.8 ms ± 0.2 ms [User: 24.2 ms, System: 6.3 ms]
>> Range (min … max): 31.5 ms … 32.3 ms 85 runs
>>
>> Benchmark 2: ./git -C ../linux diff --cached --find-copies-harder
>> Time (mean ± σ): 23.9 ms ± 0.2 ms [User: 18.1 ms, System: 4.6 ms]
>> Range (min … max): 23.5 ms … 24.4 ms 111 runs
>>
>> Summary
>> ./git -C ../linux diff --cached --find-copies-harder ran
>> 1.33 ± 0.01 times faster than ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>
> Nice. Is this technique only applicable to diff-index among the
> three diff plumbing siblings?
> [...] it would apply to diff-tree, wouldn't it?
Yes, but its diff_change() call is behind two layers of callbacks, which
complicates things.
And I don't know how to avoid adding an object ID comparison. Do we
perhaps have that bit somewhere in tree-diff.c already and can pass it
along the pathchange call?
And I wonder now if diff_same() is the right name. Shouldn't it be
diff_keep() or a similar verb to match the siblings diff_change(),
diff_addremove() and diff_unmerge()?
Performance looks mixed. E.g. memory usage is reduced slightly here:
$ for git in ./git_v2.52.0 ./git
do
for i in $(seq 5)
do
/usr/bin/time -l $git diff-tree --find-copies-harder -r v2.51.0 v2.51.1 2>&1 >/dev/null
done
done | grep peak
36111744 peak memory footprint
36291968 peak memory footprint
36177280 peak memory footprint
36177280 peak memory footprint
36291968 peak memory footprint
35456384 peak memory footprint
35489152 peak memory footprint
35505536 peak memory footprint
35636608 peak memory footprint
35505536 peak memory footprint
But diff-tree needs 1% more time with the patch:
Benchmark 1: ./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1
Time (mean ± σ): 78.3 ms ± 0.2 ms [User: 57.4 ms, System: 19.8 ms]
Range (min … max): 77.9 ms … 78.7 ms 36 runs
Benchmark 2: ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1
Time (mean ± σ): 78.8 ms ± 0.2 ms [User: 57.9 ms, System: 19.8 ms]
Range (min … max): 78.4 ms … 79.2 ms 36 runs
Summary
./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1 ran
1.01 ± 0.00 times faster than ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1
Other examples look better:
Benchmark 1: ./git_v2.52.0 -C ../linux diff-tree --find-copies-harder -r v6.8 v6.9
Time (mean ± σ): 110.7 ms ± 0.2 ms [User: 83.1 ms, System: 26.8 ms]
Range (min … max): 110.2 ms … 111.4 ms 26 runs
Benchmark 2: ./git -C ../linux diff-tree --find-copies-harder -r v6.8 v6.9
Time (mean ± σ): 105.3 ms ± 0.4 ms [User: 78.9 ms, System: 24.0 ms]
Range (min … max): 104.7 ms … 106.0 ms 27 runs
Summary
./git -C ../linux diff-tree --find-copies-harder -r v6.8 v6.9 ran
1.05 ± 0.00 times faster than ./git_v2.52.0 -C ../linux diff-tree --find-copies-harder -r v6.8 v6.9
But overall I'm not impressed. :-|
René
---
builtin/reset.c | 1 +
diff.c | 1 +
diff.h | 7 ++++++-
diffcore.h | 4 ++--
tree-diff.c | 8 ++++++--
5 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/builtin/reset.c b/builtin/reset.c
index ed35802af1..ec674694dd 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -210,6 +210,7 @@ static int read_from_tree(const struct pathspec *pathspec,
opt.repo = the_repository;
opt.change = diff_change;
opt.add_remove = diff_addremove;
+ opt.keep = diff_same;
if (pathspec->nr && pathspec_needs_expanded_index(the_repository->index, pathspec))
ensure_full_index(the_repository->index);
diff --git a/diff.c b/diff.c
index 436da250eb..9671524d2b 100644
--- a/diff.c
+++ b/diff.c
@@ -4847,6 +4847,7 @@ void repo_diff_setup(struct repository *r, struct diff_options *options)
/* pathchange left =NULL by default */
options->change = diff_change;
options->add_remove = diff_addremove;
+ options->keep = diff_same;
options->use_color = diff_use_color_default;
options->detect_rename = diff_detect_rename_default;
options->xdl_opts |= diff_algorithm;
diff --git a/diff.h b/diff.h
index 7eb84aadf4..6dfec55039 100644
--- a/diff.h
+++ b/diff.h
@@ -43,7 +43,8 @@ struct oidset;
* set_default in diff_options can be used to tweak this more.
*
* - As you find different pairs of files, call `diff_change()` to feed
- * modified files, `diff_addremove()` to feed created or deleted files, or
+ * modified files, `diff_addremove()` to feed created or deleted files,
+ * `diff_same()` to feed unmodified files if needed for copy detection, or
* `diff_unmerge()` to feed a file whose state is 'unmerged' to the API.
* These are thin wrappers to a lower-level `diff_queue()` function that is
* flexible enough to record any of these kinds of changes.
@@ -92,6 +93,9 @@ typedef void (*add_remove_fn_t)(struct diff_options *options,
int oid_valid,
const char *fullpath, unsigned dirty_submodule);
+typedef void (*keep_fn_t)(struct diff_options *options, unsigned mode,
+ const struct object_id *oid, const char *fullpath);
+
typedef void (*diff_format_fn_t)(struct diff_queue_struct *q,
struct diff_options *options, void *data);
@@ -384,6 +388,7 @@ struct diff_options {
pathchange_fn_t pathchange;
change_fn_t change;
add_remove_fn_t add_remove;
+ keep_fn_t keep;
void *change_fn_data;
diff_format_fn_t format_callback;
void *format_callback_data;
diff --git a/diffcore.h b/diffcore.h
index 9c0a0e7aaf..64b419b33f 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -35,8 +35,8 @@ struct userdiff_driver;
/**
* the internal representation for a single file (blob). It records the blob
* object name (if known -- for a work tree file it typically is a NUL SHA-1),
- * filemode and pathname. This is what the `diff_addremove()`, `diff_change()`
- * and `diff_unmerge()` synthesize and feed `diff_queue()` function with.
+ * filemode and pathname. This is what `diff_addremove()`, `diff_change()`,
+ * `diff_same()` and `diff_unmerge()` synthesize and feed `diff_queue()`.
*/
struct diff_filespec {
struct object_id oid;
diff --git a/tree-diff.c b/tree-diff.c
index 5988148b60..c5e2cf6a69 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -167,8 +167,12 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
{
struct combine_diff_parent *p0 = &p->parent[0];
if (p->mode && p0->mode) {
- opt->change(opt, p0->mode, p->mode, &p0->oid, &p->oid,
- 1, 1, p->path, 0, 0);
+ if (opt->flags.find_copies_harder && opt->keep &&
+ p0->mode == p->mode && oideq(&p0->oid, &p->oid))
+ opt->keep(opt, p->mode, &p->oid, p->path);
+ else
+ opt->change(opt, p0->mode, p->mode, &p0->oid, &p->oid,
+ 1, 1, p->path, 0, 0);
}
else {
const struct object_id *oid;
--
2.52.0
next prev parent reply other threads:[~2025-12-02 22:12 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-30 11:47 [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change() René Scharfe
2025-11-30 18:02 ` Junio C Hamano
2025-12-02 21:16 ` René Scharfe
2025-12-02 22:07 ` René Scharfe [this message]
2025-12-03 15:06 ` René Scharfe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9a253514-376f-49fd-99fe-f076ecb180b6@web.de \
--to=l.s.r@web$(echo .)de \
--cc=ben.knoble@gmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitster@pobox$(echo .)com \
--cc=peff@peff$(echo .)net \
--cc=phillip.wood@dunelm$(echo .)org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox