public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: "René Scharfe" <l.s.r@web•de>
To: Junio C Hamano <gitster@pobox•com>
Cc: Git List <git@vger•kernel.org>,
	"D. Ben Knoble" <ben.knoble@gmail•com>, Jeff King <peff@peff•net>,
	Phillip Wood <phillip.wood@dunelm•org.uk>
Subject: Re: [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change()
Date: Wed, 3 Dec 2025 16:06:46 +0100	[thread overview]
Message-ID: <5581a8d8-afe3-4097-8a27-7e52c7e915ce@web.de> (raw)
In-Reply-To: <9a253514-376f-49fd-99fe-f076ecb180b6@web.de>

On 12/2/25 11:07 PM, René Scharfe wrote:
> On 11/30/25 7:02 PM, Junio C Hamano wrote:
>> René Scharfe <l.s.r@web•de> writes:
>>
>>> Add a new streamlined function for queuing unchanged filepairs and
>>> use it in show_modified(), which is called by diff_cache() via
>>> oneway_diff() and do_oneway_diff().  It allocates only a single filespec
>>> for each filepair and uses it twice with reference counting.  This has a
>>> measurable effect if there are a lot of them, like in the Linux repo:
>>>
>>> Benchmark 1: ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>>>   Time (mean ± σ):      31.8 ms ±   0.2 ms    [User: 24.2 ms, System: 6.3 ms]
>>>   Range (min … max):    31.5 ms …  32.3 ms    85 runs
>>>
>>> Benchmark 2: ./git -C ../linux diff --cached --find-copies-harder
>>>   Time (mean ± σ):      23.9 ms ±   0.2 ms    [User: 18.1 ms, System: 4.6 ms]
>>>   Range (min … max):    23.5 ms …  24.4 ms    111 runs
>>>
>>> Summary
>>>   ./git -C ../linux diff --cached --find-copies-harder ran
>>>     1.33 ± 0.01 times faster than ./git_v2.52.0 -C ../linux diff --cached --find-copies-harder
>>
>> Nice.  Is this technique only applicable to diff-index among the
>> three diff plumbing siblings?
> 
>> [...] it would apply to diff-tree, wouldn't it?
> Yes, but its diff_change() call is behind two layers of callbacks, which
> complicates things.
> 
> And I don't know how to avoid adding an object ID comparison.  Do we
> perhaps have that bit somewhere in tree-diff.c already and can pass it
> along the pathchange call?

Yes, new patch below.  Not sure if it's better, though.

> Benchmark 1: ./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1
>   Time (mean ± σ):      78.3 ms ±   0.2 ms    [User: 57.4 ms, System: 19.8 ms]
>   Range (min … max):    77.9 ms …  78.7 ms    36 runs
> 
> Benchmark 2: ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1
>   Time (mean ± σ):      78.8 ms ±   0.2 ms    [User: 57.9 ms, System: 19.8 ms]
>   Range (min … max):    78.4 ms …  79.2 ms    36 runs
> 
> Summary
>   ./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1 ran
>     1.01 ± 0.00 times faster than ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1
Benchmark 1: ./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1
  Time (mean ± σ):      75.6 ms ±   0.6 ms    [User: 57.4 ms, System: 17.2 ms]
  Range (min … max):    75.0 ms …  78.0 ms    37 runs

Benchmark 2: ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1
  Time (mean ± σ):      76.0 ms ±   0.2 ms    [User: 57.9 ms, System: 17.1 ms]
  Range (min … max):    75.6 ms …  76.4 ms    37 runs

Summary
  ./git_v2.52.0 diff-tree --find-copies-harder -r v2.51.0 v2.51.1 ran
    1.00 ± 0.01 times faster than ./git diff-tree --find-copies-harder -r v2.51.0 v2.51.1

Hmm, I probably ran some background task when I measured yesterday and got
worse results for the git_v2.52.0 baseline than today.

René

---
 builtin/reset.c |  1 +
 diff.c          |  1 +
 diff.h          |  9 +++++++--
 diffcore.h      |  4 ++--
 tree-diff.c     | 48 +++++++++++++++++++++++++++---------------------
 5 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/builtin/reset.c b/builtin/reset.c
index ed35802af1..ec674694dd 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -210,6 +210,7 @@ static int read_from_tree(const struct pathspec *pathspec,
 	opt.repo = the_repository;
 	opt.change = diff_change;
 	opt.add_remove = diff_addremove;
+	opt.keep = diff_same;
 
 	if (pathspec->nr && pathspec_needs_expanded_index(the_repository->index, pathspec))
 		ensure_full_index(the_repository->index);
diff --git a/diff.c b/diff.c
index 436da250eb..9671524d2b 100644
--- a/diff.c
+++ b/diff.c
@@ -4847,6 +4847,7 @@ void repo_diff_setup(struct repository *r, struct diff_options *options)
 	/* pathchange left =NULL by default */
 	options->change = diff_change;
 	options->add_remove = diff_addremove;
+	options->keep = diff_same;
 	options->use_color = diff_use_color_default;
 	options->detect_rename = diff_detect_rename_default;
 	options->xdl_opts |= diff_algorithm;
diff --git a/diff.h b/diff.h
index 7eb84aadf4..2e3a5ac04a 100644
--- a/diff.h
+++ b/diff.h
@@ -43,7 +43,8 @@ struct oidset;
  * set_default in diff_options can be used to tweak this more.
  *
  * - As you find different pairs of files, call `diff_change()` to feed
- * modified files, `diff_addremove()` to feed created or deleted files, or
+ * modified files, `diff_addremove()` to feed created or deleted files,
+ * `diff_same()` to feed unmodified files if needed for copy detection, or
  * `diff_unmerge()` to feed a file whose state is 'unmerged' to the API.
  * These are thin wrappers to a lower-level `diff_queue()` function that is
  * flexible enough to record any of these kinds of changes.
@@ -76,7 +77,7 @@ struct rev_info;
 struct userdiff_driver;
 
 typedef int (*pathchange_fn_t)(struct diff_options *options,
-		 struct combine_diff_path *path);
+		 struct combine_diff_path *path, bool is_change);
 
 typedef void (*change_fn_t)(struct diff_options *options,
 		 unsigned old_mode, unsigned new_mode,
@@ -92,6 +93,9 @@ typedef void (*add_remove_fn_t)(struct diff_options *options,
 		    int oid_valid,
 		    const char *fullpath, unsigned dirty_submodule);
 
+typedef void (*keep_fn_t)(struct diff_options *options, unsigned mode,
+			  const struct object_id *oid, const char *fullpath);
+
 typedef void (*diff_format_fn_t)(struct diff_queue_struct *q,
 		struct diff_options *options, void *data);
 
@@ -384,6 +388,7 @@ struct diff_options {
 	pathchange_fn_t pathchange;
 	change_fn_t change;
 	add_remove_fn_t add_remove;
+	keep_fn_t keep;
 	void *change_fn_data;
 	diff_format_fn_t format_callback;
 	void *format_callback_data;
diff --git a/diffcore.h b/diffcore.h
index 9c0a0e7aaf..64b419b33f 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -35,8 +35,8 @@ struct userdiff_driver;
 /**
  * the internal representation for a single file (blob).  It records the blob
  * object name (if known -- for a work tree file it typically is a NUL SHA-1),
- * filemode and pathname.  This is what the `diff_addremove()`, `diff_change()`
- * and `diff_unmerge()` synthesize and feed `diff_queue()` function with.
+ * filemode and pathname.  This is what `diff_addremove()`, `diff_change()`,
+ * `diff_same()` and `diff_unmerge()` synthesize and feed `diff_queue()`.
  */
 struct diff_filespec {
 	struct object_id oid;
diff --git a/tree-diff.c b/tree-diff.c
index 5988148b60..e711456766 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -163,12 +163,17 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
  * emits diff to first parent only, and tells diff tree-walker that we are done
  * with p and it can be freed.
  */
-static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_diff_path *p)
+static int emit_diff_first_parent_only(struct diff_options *opt,
+				       struct combine_diff_path *p,
+				       bool is_change)
 {
 	struct combine_diff_parent *p0 = &p->parent[0];
 	if (p->mode && p0->mode) {
-		opt->change(opt, p0->mode, p->mode, &p0->oid, &p->oid,
-			1, 1, p->path, 0, 0);
+		if (is_change)
+			opt->change(opt, p0->mode, p->mode, &p0->oid, &p->oid,
+				    1, 1, p->path, 0, 0);
+		else if (opt->keep)
+			opt->keep(opt, p->mode, &p->oid, p->path);
 	}
 	else {
 		const struct object_id *oid;
@@ -205,7 +210,7 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
 static void emit_path(struct combine_diff_path ***tail,
 		      struct strbuf *base, struct diff_options *opt,
 		      int nparent, struct tree_desc *t, struct tree_desc *tp,
-		      int imin, int depth)
+		      int imin, int depth, bool is_change)
 {
 	unsigned short mode;
 	const char *path;
@@ -288,7 +293,7 @@ static void emit_path(struct combine_diff_path ***tail,
 
 		keep = 1;
 		if (opt->pathchange)
-			keep = opt->pathchange(opt, p);
+			keep = opt->pathchange(opt, p, is_change);
 
 		if (keep) {
 			**tail = p;
@@ -518,26 +523,27 @@ static void ll_diff_tree_paths(
 		/* t = p[imin] */
 		if (cmp == 0) {
 			/* are either pi > p[imin] or diff(t,pi) != ø ? */
-			if (!opt->flags.find_copies_harder) {
-				for (i = 0; i < nparent; ++i) {
-					/* p[i] > p[imin] */
-					if (tp[i].entry.mode & S_IFXMIN_NEQ)
-						continue;
+			bool is_change = true;
 
-					/* diff(t,pi) != ø */
-					if (!oideq(&t.entry.oid, &tp[i].entry.oid) ||
-					    (t.entry.mode != tp[i].entry.mode))
-						continue;
+			for (i = 0; i < nparent; ++i) {
+				/* p[i] > p[imin] */
+				if (tp[i].entry.mode & S_IFXMIN_NEQ)
+					continue;
 
-					goto skip_emit_t_tp;
-				}
+				/* diff(t,pi) != ø */
+				if (!oideq(&t.entry.oid, &tp[i].entry.oid) ||
+				    (t.entry.mode != tp[i].entry.mode))
+					continue;
+
+				is_change = false;
+				break;
 			}
 
 			/* D += {δ(t,pi) if pi=p[imin];  "+a" if pi > p[imin]} */
-			emit_path(tail, base, opt, nparent,
-				  &t, tp, imin, depth);
+			if (is_change || opt->flags.find_copies_harder)
+				emit_path(tail, base, opt, nparent,
+					  &t, tp, imin, depth, is_change);
 
-		skip_emit_t_tp:
 			/* t↓,  ∀ pi=p[imin]  pi↓ */
 			update_tree_entry(&t);
 			update_tp_entries(tp, nparent);
@@ -547,7 +553,7 @@ static void ll_diff_tree_paths(
 		else if (cmp < 0) {
 			/* D += "+t" */
 			emit_path(tail, base, opt, nparent,
-				  &t, /*tp=*/NULL, -1, depth);
+				  &t, /*tp=*/NULL, -1, depth, true);
 
 			/* t↓ */
 			update_tree_entry(&t);
@@ -563,7 +569,7 @@ static void ll_diff_tree_paths(
 			}
 
 			emit_path(tail, base, opt, nparent,
-				  /*t=*/NULL, tp, imin, depth);
+				  /*t=*/NULL, tp, imin, depth, true);
 
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
-- 
2.52.0


      reply	other threads:[~2025-12-03 15:06 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-30 11:47 [PATCH v2] diff-index: don't queue unchanged filepairs with diff_change() René Scharfe
2025-11-30 18:02 ` Junio C Hamano
2025-12-02 21:16   ` René Scharfe
2025-12-02 22:07   ` René Scharfe
2025-12-03 15:06     ` René Scharfe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5581a8d8-afe3-4097-8a27-7e52c7e915ce@web.de \
    --to=l.s.r@web$(echo .)de \
    --cc=ben.knoble@gmail$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=peff@peff$(echo .)net \
    --cc=phillip.wood@dunelm$(echo .)org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox