From: Keita Oda <ainsophyao@gmail•com>
To: git@vger•kernel.org
Cc: Keita ODA <ainsophyao@gmail•com>
Subject: [RFC PATCH 0/3] diff: pair edited lines inside moved blocks
Date: Wed, 27 May 2026 13:23:59 +0900
Message-ID: <20260527042402.13607-1-ainsophyao@gmail.com>

From: Keita ODA <ainsophyao@gmail•com>

This is an RFC for a review aid, not a proposed final UI or option name.

The motivation is the gap between --word-diff and --color-moved.
--word-diff works well when the line-level diff has already found good
old/new line pairs.  --color-moved works well when moved lines are exact
matches.  But when a block is moved and one line inside the block is edited,
the small edit can be buried in a large delete/add region.

That case matters for review.  A one-line move is usually easy to inspect by
eye.  A ten-line moved block with a one-character change inside it is harder
to audit.  A small synthetic permission-table example in patch 3 uses this
shape:

  -#define PERM_RESOURCE_EXPORT       0x0008
  +#define PERM_RESOURCE_EXPORT       0x0001

That particular toy example is not meant to show something that
--color-moved cannot see.  It is meant to make the review question small:
can Git expose "this moved line was also edited" in a lightweight way?
The real-world cases below are less about proving that existing modes are
blind, and more about making the row-to-row correspondence explicit enough
that the small edits are easy to check.

This series adds an opt-in prototype, --word-diff-align, that post-processes
the emitted diff symbols and tries to pair similar deleted and inserted lines.
It does not change the underlying diff algorithm, patch semantics, apply, or
merge behavior.

The prototype is deliberately language-agnostic.  It does not parse source
code or build an AST; it only tokenizes diff lines into small text tokens and
scores local token overlap.  This keeps the experiment applicable to code,
tests, generated tables, documentation, and other text files.
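
To make "small text tokens" and "token overlap" concrete, here is a minimal
Python sketch.  It is not the code in the series: the token pattern and the
names tokenize() and token_overlap() are assumptions chosen for illustration,
and the real tokenizer in diff.c may split lines differently.

```python
import re
from collections import Counter

# Hypothetical token rule: a run of word characters is one token, and
# every other non-space character is a token of its own.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokenize(line):
    """Split one diff line (without its +/- prefix) into text tokens."""
    return TOKEN_RE.findall(line)


def token_overlap(old_line, new_line):
    """Count tokens shared by both lines, respecting multiplicity."""
    old_counts = Counter(tokenize(old_line))
    new_counts = Counter(tokenize(new_line))
    return sum((old_counts & new_counts).values())


old = "#define PERM_RESOURCE_EXPORT       0x0008"
new = "#define PERM_RESOURCE_EXPORT       0x0001"
print(tokenize(old))            # ['#', 'define', 'PERM_RESOURCE_EXPORT', '0x0008']
print(token_overlap(old, new))  # 3
```

Nothing here depends on the file's language, which is the property the
prototype relies on.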

The prototype is intentionally split into three pieces:

  * patch 1 adds the candidate retrieval and line-pair scoring, and exposes
    selected pairs with an RFC/debug comment;
  * patch 2 adds a small RFC-only renderer that inserts word-diff-like
    markers on the selected pairs, so that the recovered pairs are easier to
    inspect;
  * patch 3 adds a focused test case.

The current prototype is still larger than I would like, but the split keeps
the experimental pieces visible.  The full series is about 1000 inserted
lines; roughly 800 lines are option plumbing, tokenization, candidate
retrieval, scoring, pair selection, and debug comments, while about 200 lines
are temporary rendering code for review.

The scoring model is:

  S = W + aL

where W is a 5-line-window token overlap score, L is a center-line token
LCS (longest common subsequence) score, and a is a weight on the
center-line term.  A small 64-bit window fingerprint is used only as a candidate
retrieval index; candidate pairs are scored again before they are selected.
Tokens repeated in the surrounding small window carry less weight for the
center-line score, which is a local-IDF-like approximation.  This keeps tokens
such as "import" or "#define" from overwhelming the line-specific identifier.
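
The scoring model can be sketched in Python as follows.  This is an
illustration under stated assumptions, not the implementation in patch 1:
the normalization to [0, 1], the 1/count weighting used as the
"local-IDF-like" factor, the default a = 1.0, and all function names are
made up for the example.  The fingerprint-based candidate retrieval is left
out; the sketch only scores a pair that has already been proposed.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokenize(line):
    return TOKEN_RE.findall(line)


def window_tokens(lines, center, radius=2):
    """Token multiset of the 5-line window (radius 2) around `center`."""
    lo, hi = max(0, center - radius), min(len(lines), center + radius + 1)
    counts = Counter()
    for line in lines[lo:hi]:
        counts.update(tokenize(line))
    return counts


def window_overlap(old_lines, i, new_lines, j):
    """W: tokens shared by the two windows, normalized to [0, 1]."""
    a, b = window_tokens(old_lines, i), window_tokens(new_lines, j)
    total = max(sum(a.values()), sum(b.values()))
    return sum((a & b).values()) / total if total else 0.0


def weighted_lcs(old_toks, new_toks, weight):
    """LCS where each matched token contributes weight(token)."""
    prev = [0.0] * (len(new_toks) + 1)
    for x in old_toks:
        cur = [0.0]
        for k, y in enumerate(new_toks, 1):
            best = max(prev[k], cur[k - 1])
            if x == y:
                best = max(best, prev[k - 1] + weight(x))
            cur.append(best)
        prev = cur
    return prev[-1]


def center_score(old_lines, i, new_lines, j):
    """L: weighted LCS of the two center lines, normalized to [0, 1]."""
    local = window_tokens(old_lines, i) + window_tokens(new_lines, j)

    def weight(tok):
        # Assumed local-IDF-like factor: tokens repeated nearby count less.
        return 1.0 / local[tok]

    old_toks, new_toks = tokenize(old_lines[i]), tokenize(new_lines[j])
    best = max(sum(weight(t) for t in old_toks),
               sum(weight(t) for t in new_toks))
    return weighted_lcs(old_toks, new_toks, weight) / best if best else 0.0


def pair_score(old_lines, i, new_lines, j, a=1.0):
    """S = W + aL for old line i paired with new line j."""
    return (window_overlap(old_lines, i, new_lines, j)
            + a * center_score(old_lines, i, new_lines, j))


# Made-up rows in the shape of the permission-table example.
old_lines = [
    "#define PERM_RESOURCE_READ         0x0001",
    "#define PERM_RESOURCE_WRITE        0x0002",
    "#define PERM_RESOURCE_EXPORT       0x0008",
]
new_lines = [
    "#define PERM_RESOURCE_EXPORT       0x0001",
    "#define PERM_RESOURCE_READ         0x0002",
    "#define PERM_RESOURCE_WRITE        0x0004",
]
print(pair_score(old_lines, 2, new_lines, 0))  # EXPORT vs EXPORT
print(pair_score(old_lines, 2, new_lines, 1))  # EXPORT vs READ
```

In this toy input every window covers the whole table, so W is the same for
both candidates and the center-line term decides: "#" and "define" appear on
every row and contribute little, while the shared PERM_RESOURCE_EXPORT
identifier pushes the first pair ahead.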

Some real-world examples that motivated the prototype:

  * CPython opcode/metadata renumbering, where many table rows stay logically
    paired but their numeric values shift;
  * CPython test parameterization rewrites such as tuple rows becoming
    dict(input=..., expected=...) rows;
  * Git's own expected-output tables, where a column width change adds spaces
    across many rows and a row insertion shifts the surrounding context;
  * Git's own remote.c refactoring, where extracted helper code has small
    identifier changes.

As a rough trigger-rate sanity check, I ran the prototype over 5734 changed
file pairs sampled from recent Git, CPython, and Rust history.  The stricter
"crossing and edited" signal, which ignores the many adjacent row pairs and
looks for pairs that cross another recovered pair, appeared in 739 file pairs
(about 13%).  This is not a gold-label quality number, but it suggests that
the mode is not only triggering on the synthetic test.
A small manual review found both clear wins and loose matches.
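
One way to read the "crossing and edited" signal is sketched below in
Python.  The definition is an assumed interpretation rather than the exact
check used for the numbers above: a pair is taken to be (old line index,
new line index), two pairs cross when their order differs between the old
and new sides, and a pair is edited when its two lines are not identical.
The function names are hypothetical.

```python
def crosses(p, q):
    """True when pairs p and q appear in opposite order on the two sides."""
    return (p[0] - q[0]) * (p[1] - q[1]) < 0


def crossing_and_edited(pairs, old_lines, new_lines):
    """Return the pairs whose text changed and that cross another pair."""
    hits = []
    for p in pairs:
        edited = old_lines[p[0]] != new_lines[p[1]]
        if edited and any(crosses(p, q) for q in pairs if q != p):
            hits.append(p)
    return hits


old_lines = ["a = 1", "b = 2", "c = 3"]
new_lines = ["c = 4", "a = 1", "b = 2"]
pairs = [(0, 1), (1, 2), (2, 0)]   # (old index, new index)
print(crossing_and_edited(pairs, old_lines, new_lines))  # [(2, 0)]
```

Adjacent row pairs that merely shift together never cross each other, so
they do not count toward this signal even when they are edited.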

I found the problem easiest to inspect with four-way comparisons:

  * git diff --histogram
  * git diff --histogram --word-diff=plain
  * git diff --histogram --color-moved=blocks
  * git diff --histogram --word-diff-align

I put a small set of rendered four-way examples here:

  https://oda.github.io/git-diff-rfc-examples/rfc-word-diff-align/

These links are supplemental; the patch series is intended to be readable
without them.

Known limitations:

  * the UI/debug output is not final;
  * generated or boilerplate-heavy hunks, especially Rust generated test
    updates, can still produce loose matches;
  * one-line long-distance pairs are often less useful than block-level pairs;
  * the prototype intentionally gives local and remote pairs similar treatment
    for now, to make the recovered pairings visible for discussion;
  * thresholds and tie-breaking are still experimental.

The question for this RFC is whether this kind of language-agnostic line-pair
annotation is worth pursuing in core, and if so whether it should be shaped as
word-diff plumbing, a color-moved extension, or a separate opt-in mode.

Keita ODA (3):
  diff: add word-diff-align line pairing
  diff: render word-diff-align pairs for RFC review
  t4034: cover moved-and-edited word diff alignment

 diff.c                | 996 +++++++++++++++++++++++++++++++++++++++++-
 diff.h                |   1 +
 t/t4034-diff-words.sh |  46 ++
 3 files changed, 1035 insertions(+), 8 deletions(-)

-- 
2.39.3 (Apple Git-146)

