From: "René Scharfe" <l.s.r@web•de>
To: Junio C Hamano <gitster@pobox•com>
Cc: Git List <git@vger•kernel.org>,
"D. Ben Knoble" <ben.knoble@gmail•com>, Jeff King <peff@peff•net>,
Phillip Wood <phillip.wood@dunelm•org.uk>
Subject: Re: [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change()
Date: Tue, 2 Dec 2025 22:16:27 +0100 [thread overview]
Message-ID: <f2e187bb-c765-4cc3-a0a0-1fbaec9a14e2@web.de> (raw)
In-Reply-To: <xmqq5xarcsb8.fsf@gitster.g>
On 11/30/25 7:02 PM, Junio C Hamano wrote:
> René Scharfe <l.s.r@web•de> writes:
>
>> Add a new streamlined function for queuing unchanged filepairs and
>> use it in show_modified(), which is called by diff_cache() via
>> oneway_diff() and do_oneway_diff(). It allocates only a single filespec
>> for each filepair and uses it twice with reference counting. This has a
>> measurable effect if there are a lot of them, like in the Linux repo:
>>
>> Benchmark 1: ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>> Time (mean ± σ): 31.8 ms ± 0.2 ms [User: 24.2 ms, System: 6.3 ms]
>> Range (min … max): 31.5 ms … 32.3 ms 85 runs
>>
>> Benchmark 2: ./git -C ../linux diff --cached --find-copies-harder
>> Time (mean ± σ): 23.9 ms ± 0.2 ms [User: 18.1 ms, System: 4.6 ms]
>> Range (min … max): 23.5 ms … 24.4 ms 111 runs
>>
>> Summary
>> ./git -C ../linux diff --cached --find-copies-harder ran
>> 1.33 ± 0.01 times faster than ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>
> Nice. Is this technique only applicable to diff-index among the
> three diff plumbing siblings? I suspect diff-files is an oddball
> in that on the working tree side we do not necessarily have the
> blob object names
Indeed:
- git diff-files compares index and working tree,
- a copy is a new file with contents from an old file,
- git ignores new files in the working tree.
So in theory git diff-files can only detect copies in the other
direction. Or is there a way I'm missing? In practice, however, it
doesn't do that reliably because it simply skips up-to-date index
entries. Oops.
--- >8 ---
Subject: [PATCH v2 2/1] diff-files: fix copy detection
Copy detection cannot work when comparing the index to the working tree
because Git ignores files that it is not explicitly told to track. It
should work in the other direction, though, i.e. for a reverse diff of
the deletion of a copy from the index.
d1f2d7e8ca (Make run_diff_index() use unpack_trees(), not read_tree(),
2008-01-19) broke it with a seemingly stray change to run_diff_files().
We didn't notice because there's no test for that. But even if we had
one, it might have gone unnoticed because the breakage only happens
with index preloading, which requires at least 1000 entries (more than
most test repos have) and is racy because it runs in parallel with the
actual command.
Fix copy detection by queuing up-to-date and skip-worktree entries using
diff_same().
While at it, use diff_same() also for queuing unchanged files not
flagged as up-to-date, i.e. clean submodules and entries where
preloading was not done at all or not quickly enough. It uses less
memory than diff_change() and doesn't unnecessarily set the diff flag
has_changes.
Add two tests to cover running both without and with preloading. The
first one passes reliably with the original code. The second one
enables preloading and thus is racy. It has a good chance to pass even
without the fix, but fails within seconds when running the test script
with --stress. With the fix it runs fine for several minutes, until
my patience runs out.
Signed-off-by: René Scharfe <l.s.r@web•de>
---
Patch formatted with -U9 for easier review of the second hunk.
diff-lib.c | 12 +++++++++---
t/t4007-rename-3.sh | 23 ++++++++++++++++++++++-
2 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/diff-lib.c b/diff-lib.c
index 8e624f38c6..5307390ff3 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -220,20 +220,24 @@ void run_diff_files(struct rev_info *revs, unsigned int option)
* from the desired stage.
*/
pair = diff_unmerge(&revs->diffopt, ce->name);
if (wt_mode)
pair->two->mode = wt_mode;
if (ce_stage(ce) != diff_unmerged_stage)
continue;
}
- if (ce_uptodate(ce) || ce_skip_worktree(ce))
+ if (ce_uptodate(ce) || ce_skip_worktree(ce)) {
+ if (revs->diffopt.flags.find_copies_harder)
+ diff_same(&revs->diffopt, ce->ce_mode,
+ &ce->oid, ce->name);
continue;
+ }
/*
* When CE_VALID is set (via "update-index --assume-unchanged"
* or via adding paths while core.ignorestat is set to true),
* the user has promised that the working tree file for that
* path will not be modified. When CE_FSMONITOR_VALID is true,
* the fsmonitor knows that the path hasn't been modified since
* we refreshed the cached stat information. In either case,
* we do not have to stat to see if the path has been removed
@@ -266,20 +270,22 @@ void run_diff_files(struct rev_info *revs, unsigned int option)
changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
ce_option, &dirty_submodule);
newmode = ce_mode_from_stat(ce, st.st_mode);
}
if (!changed && !dirty_submodule) {
ce_mark_uptodate(ce);
mark_fsmonitor_valid(istate, ce);
- if (!revs->diffopt.flags.find_copies_harder)
- continue;
+ if (revs->diffopt.flags.find_copies_harder)
+ diff_same(&revs->diffopt, newmode,
+ &ce->oid, ce->name);
+ continue;
}
oldmode = ce->ce_mode;
old_oid = &ce->oid;
new_oid = changed ? null_oid(the_hash_algo) : &ce->oid;
diff_change(&revs->diffopt, oldmode, newmode,
old_oid, new_oid,
!is_null_oid(old_oid),
!is_null_oid(new_oid),
ce->name, 0, dirty_submodule);
diff --git a/t/t4007-rename-3.sh b/t/t4007-rename-3.sh
index 3fc81bcd76..1012a370dd 100755
--- a/t/t4007-rename-3.sh
+++ b/t/t4007-rename-3.sh
@@ -61,19 +61,40 @@ cat >expected <<EOF
:000000 100644 $ZERO_OID $blob A path1/COPYING
EOF
test_expect_success 'copy, limited to a subtree' '
git diff-index -C --find-copies-harder $tree path1 >current &&
compare_diff_raw current expected
'
test_expect_success 'tweak work tree' '
- rm -f path0/COPYING &&
+ rm -f path0/COPYING
+'
+
+cat >expected <<EOF
+:100644 100644 $blob $blob C100 path1/COPYING path0/COPYING
+EOF
+
+# The cache has path0/COPYING and path1/COPYING, the working tree only
+# path1/COPYING. This is a deletion -- we don't treat deduplication
+# specially. In reverse it should be detected as a copy, though.
+test_expect_success 'copy detection, files to index' '
+ git diff-files -C --find-copies-harder -R >current &&
+ compare_diff_raw current expected
+'
+
+test_expect_success 'copy detection, files to preloaded index' '
+ GIT_TEST_PRELOAD_INDEX=1 \
+ git diff-files -C --find-copies-harder -R >current &&
+ compare_diff_raw current expected
+'
+
+test_expect_success 'tweak index' '
git update-index --remove path0/COPYING
'
# In the tree, there is only path0/COPYING. In the cache, path0 does
# not have COPYING anymore and path1 has COPYING which is a copy of
# path0/COPYING. Showing the full tree with cache should tell us about
# the rename.
cat >expected <<EOF
:100644 100644 $blob $blob R100 path0/COPYING path1/COPYING
--
2.52.0
next prev parent reply other threads:[~2025-12-02 21:16 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-30 11:47 [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change() René Scharfe
2025-11-30 18:02 ` Junio C Hamano
2025-12-02 21:16 ` René Scharfe [this message]
2025-12-02 22:07 ` René Scharfe
2025-12-03 15:06 ` René Scharfe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f2e187bb-c765-4cc3-a0a0-1fbaec9a14e2@web.de \
--to=l.s.r@web$(echo .)de \
--cc=ben.knoble@gmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitster@pobox$(echo .)com \
--cc=peff@peff$(echo .)net \
--cc=phillip.wood@dunelm$(echo .)org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox