public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Phillip Wood <phillip.wood123@gmail•com>
To: Niels Glodny <n.glodny@campus•lmu.de>, git@vger•kernel.org
Cc: johannes.schindelin@gmx•de, peff@peff•net, phillip.wood@dunelm•org.uk
Subject: Re: [PATCH] xdiff: disable cleanup_records heuristic with --minimal
Date: Sun, 27 Apr 2025 16:04:10 +0100	[thread overview]
Message-ID: <23078e29-8f1f-4eb3-be71-7ef419252bab@gmail.com> (raw)
In-Reply-To: <20250425155951.1227700-1-n.glodny@campus.lmu.de>

Hi Niels

On 25/04/2025 16:59, Niels Glodny wrote:
> The cleanup_records function marks some lines as changed
> before running the actual diff algorithm. For most lines,
> this is a good performance optimization, but it also marks
> lines that are surrounded by many changed lines as changed
> as well. This can cause redundant changes and longer-than-
> necessary diffs.

Nicely explained. We normally wrap commit messages at 72 characters

> Whether this results in better-looking diffs is subjective.
> However, the --minimal flag explicitly requests the shortest
> possible diff.

Looking at the diff for 2ff58dec493 (refs: introduce function to batch 
refname availability checks, 2025-03-12) I'd say it is definitely less 
readable with this change but some of the changes in 320f2061b63 (t0602: 
use subshell to ensure working directory unchanged, 2025-02-28) are more 
readable. Anyway as you say if we promise to find the minimal diff then 
that is what we should do.
> The performance impact of this change is negligible, and it
> results in shorter diffs in about 1.3% of diffs in Git's
> history.

Have you got any numbers for the performance change?

I think the premise of this patch is sound, I've left a couple of 
comments below as I think we can simplify the code changes.

> diff --git a/t/t4071-diff-minimal.sh b/t/t4071-diff-minimal.sh
> new file mode 100755
> index 0000000000..3ad759dab4
> --- /dev/null
> +++ b/t/t4071-diff-minimal.sh
> @@ -0,0 +1,16 @@
> +#!/bin/sh
> +
> +test_description='minimal diff algorithm'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'minimal diff should not mark changes between changed lines' '
> +	printf "x\nx\nx\nx\n" >pre &&
> +	printf "x\nx\nx\nA\nB\nC\nD\nx\nE\nF\nG\n" >post &&
> +	test_must_fail git diff --no-index \
> +		--minimal pre post >diff &&
> +	! grep "+x" diff &&
> +	! grep "-x" diff
> +'

Thanks for taking the trouble to add a new test. We have a few test 
helper functions which we could use here.

	test_write_lines x x x x >pre &&
	test_write_lines x x x A B C D x E F G >post &&
	test_expect_code 1 git diff --no-index --minimal pre post >diff &&
	test_grep ! ^[-+]x diff

test_expect_code ensures that the non-zero exit code is from the files 
not matching rather than another error like an invalid option or missing 
file. test_grep will display the file if there are matches which helps 
when debugging test failures.

 > [...]> -static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t 
*xdf1, xdfile_t *xdf2) {
> +static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1,
> +			       xdfile_t *xdf2, int need_min) {
>   	long i, nm, nreff, mlim;
>   	xrecord_t **recs;
>   	xdlclass_t *rcrec;
> @@ -379,7 +383,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
xdlclassifier_t carries a copy of the flag that were interested in so I 
think at the start of xdl_cleanup_records we can add

	int need_min = !!(cf->flags & XDF_NEED_MINIMAL);

and then we don't need to change the signature.

>   	for (i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart]; i <= xdf1->dend; i++, recs++) {
>   		rcrec = cf->rcrecs[(*recs)->ha];
>   		nm = rcrec ? rcrec->len2 : 0;
> -		dis1[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
> +		dis1[i] = (nm == 0) ? 0: (nm >= mlim && !need_min) ? 2: 1;

If we want a minimal diff then we force dis1[i] to be 1 so that this 
line will never be marked as changed before searching for the longest 
common sequence. Well spotted.

Best Wishes

Phillip

>   	}
>   
>   	if ((mlim = xdl_bogosqrt(xdf2->nrec)) > XDL_MAX_EQLIMIT)
> @@ -387,7 +391,7 @@ static int xdl_cleanup_records(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xd
>   	for (i = xdf2->dstart, recs = &xdf2->recs[xdf2->dstart]; i <= xdf2->dend; i++, recs++) {
>   		rcrec = cf->rcrecs[(*recs)->ha];
>   		nm = rcrec ? rcrec->len1 : 0;
> -		dis2[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
> +		dis2[i] = (nm == 0) ? 0: (nm >= mlim && !need_min) ? 2: 1;
>   	}
>   
>   	for (nreff = 0, i = xdf1->dstart, recs = &xdf1->recs[xdf1->dstart];
> @@ -449,10 +453,10 @@ static int xdl_trim_ends(xdfile_t *xdf1, xdfile_t *xdf2) {
>   }
>   
>   
> -static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1, xdfile_t *xdf2) {
> -
> +static int xdl_optimize_ctxs(xdlclassifier_t *cf, xdfile_t *xdf1,
> +	xdfile_t *xdf2, int need_min) {
>   	if (xdl_trim_ends(xdf1, xdf2) < 0 ||
> -	    xdl_cleanup_records(cf, xdf1, xdf2) < 0) {
> +	    xdl_cleanup_records(cf, xdf1, xdf2, need_min) < 0) {
>   
>   		return -1;
>   	}
> 
> base-commit: f65182a99e545d2f2bc22e6c1c2da192133b16a3


  reply	other threads:[~2025-04-27 15:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-25 15:59 [PATCH] xdiff: disable cleanup_records heuristic with --minimal Niels Glodny
2025-04-27 15:04 ` Phillip Wood [this message]
2025-04-27 21:44   ` Niels Glodny
2025-04-28 17:05     ` Junio C Hamano
2025-04-27 22:06 ` [PATCH v2] " Niels Glodny
2025-04-29  9:00   ` Phillip Wood
2025-04-29 14:09 ` [PATCH v3] " Niels Glodny
2025-05-06 13:21   ` Phillip Wood
2025-05-06 22:50     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=23078e29-8f1f-4eb3-be71-7ef419252bab@gmail.com \
    --to=phillip.wood123@gmail$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=johannes.schindelin@gmx$(echo .)de \
    --cc=n.glodny@campus$(echo .)lmu.de \
    --cc=peff@peff$(echo .)net \
    --cc=phillip.wood@dunelm$(echo .)org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox