public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Jeff King <peff@peff•net>
To: Junio C Hamano <gitster@pobox•com>
Cc: Sruteesh Kumar <sruteesh.oss@protonmail•com>,
	"git@vger•kernel.org" <git@vger•kernel.org>
Subject: Re: [PATCH] match_pathname(): give fnmatch one char of prefix context
Date: Sun, 26 Oct 2025 11:18:36 -0400	[thread overview]
Message-ID: <20251026151836.GA2095501@coredump.intra.peff.net> (raw)
In-Reply-To: <xmqq7bwltlb8.fsf@gitster.g>

On Thu, Oct 23, 2025 at 01:28:11PM -0700, Junio C Hamano wrote:

> > I wonder how much this prefix-matching buys us in practice. There are
> > two cases that are helped:
> >
> >   1. When there is no wildcard in the pattern at all, we skip fnmatch
> >      entirely.
> >
> >   2. We do a raw match of the prefix chars, shrinking the size of what
> >      is passed to fnmatch.
> >
> > My suspicion is that most of the improvement comes from (1), and it
> > would be very easy to retain that case and get rid of (2). But I haven't
> > done any measuring.
> 
> The above matches my intuition as well.

I wanted some actual numbers, so here's more than anybody probably cares
to read on the subject. This is the test script I came up with:

-- >8 --
#!/bin/sh

pattern=$1; shift
path=$1; shift
versions=
for i in "$@"; do
	versions="${versions:+$versions,}$i"
done

rm -rf repo

git init repo
cd repo

echo "/$pattern" >.gitignore
for i in $(seq 1 1000000); do
	eval "echo \"$path\""
done >input

hyperfine -L git "$versions" -n "{git}" '../{git} check-ignore -q --no-index --stdin <input || true'
-- 8< --

The idea is to just match $path a million times against $path, with as
little other junk as possible. And we test it against a series of
builds, which are:

  - git.orig; the current tip of 'master'

  - git.none; ripping out the prefix-match entirely, leaving it to
    wildmatch to match those characters.

  - git.fullonly; leaving the early return when the prefix consumes the
    whole name, but not advancing name/pattern by the prefix amount
    before handing it to fnmatch

  - git.minusone; shrinking the prefix by one to give fnmatch one char
    of context (i.e., the patch under discussion)

Here are results for a long name that matches exactly:

  $ long='this is an extremely long pattern that has no globs in it'
  $ ./test "$long" "$long" git.*
  Benchmark 1: git.fullonly
    Time (mean ± σ):     311.0 ms ±   5.9 ms    [User: 302.5 ms, System: 8.4 ms]
    Range (min … max):   303.4 ms … 319.3 ms    10 runs
  
  Benchmark 2: git.minusone
    Time (mean ± σ):     316.4 ms ±   4.8 ms    [User: 303.5 ms, System: 12.9 ms]
    Range (min … max):   306.3 ms … 323.5 ms    10 runs
  
  Benchmark 3: git.none
    Time (mean ± σ):     455.3 ms ±  12.2 ms    [User: 444.4 ms, System: 10.8 ms]
    Range (min … max):   442.1 ms … 475.1 ms    10 runs
  
  Benchmark 4: git.orig
    Time (mean ± σ):     320.8 ms ±   6.8 ms    [User: 307.8 ms, System: 12.9 ms]
    Range (min … max):   311.7 ms … 331.0 ms    10 runs
  
  Summary
    git.fullonly ran
      1.02 ± 0.02 times faster than git.minusone
      1.03 ± 0.03 times faster than git.orig
      1.46 ± 0.05 times faster than git.none

So we can see that "git.none" performs way worse than the rest of them.
I.e., this optimization really is doing something. That wasn't
immediately obvious to me, since it's all O(n) in the end. But
presumably wildmatch's slow char-by-char walk is much less fast than
what amounts to a strcmp().

Interestingly fullonly is a little faster, even though we should never
be running the removed code in this case. I guess the function fits in
cache a little better? :)

Now let's try the same thing with paths that only prefix match:

  $ ./test "$long" "$long-\$i" git.*

  Benchmark 1: git.fullonly
    Time (mean ± σ):     472.1 ms ±   6.1 ms    [User: 458.0 ms, System: 14.0 ms]
    Range (min … max):   463.1 ms … 484.9 ms    10 runs
  
  Benchmark 2: git.minusone
    Time (mean ± σ):     334.4 ms ±   5.5 ms    [User: 324.3 ms, System: 10.0 ms]
    Range (min … max):   325.7 ms … 340.6 ms    10 runs
  
  Benchmark 3: git.none
    Time (mean ± σ):     461.7 ms ±   6.0 ms    [User: 448.8 ms, System: 12.8 ms]
    Range (min … max):   449.2 ms … 473.6 ms    10 runs
  
  Benchmark 4: git.orig
    Time (mean ± σ):     334.4 ms ±   3.4 ms    [User: 325.5 ms, System: 8.8 ms]
    Range (min … max):   326.0 ms … 338.1 ms    10 runs
  
  Summary
    git.minusone ran
      1.00 ± 0.02 times faster than git.orig
      1.38 ± 0.03 times faster than git.none
      1.41 ± 0.03 times faster than git.fullonly

This matches what I'd expect. git.fullonly performs even worse than
git.none now (because we're matching that long prefix only to hand the
entire thing off to wildmatch to do it again). There is a possible
optimization missing in the code for this case, where we know we've
eaten all of "patternlen" but not all "namelen" (so we know we cannot
match), but it's probably rare enough not to worry about (and certainly
fnmatch should be able to figure that out quickly).

The other interesting thing here is that git.minusone performs about the
same as git.orig. So adding one character back for fnmatch to process is
not a big deal. It's a small constant amount of extra work.

Note that neither pattern has a wildcard here! We could test that with:

  $ ./test "$long-*" "$long-\$i" git.*

but the timings are roughly the same as above. In either case, fnmatch
will accept or reject pretty quickly once it hits the part after
"$long". So all we are really measuring is how long fnmatch takes to
walk over that first prefix part without wildcards.


So the obvious takeaway is that yes, fnmatch is much slower than strmcp.
That doesn't tell us how _often_ these cases kick in. That is, the
intuition we were talking about above is that matching "foo" with "foo"
is a lot more common than "foo*" matching "foobar". And no amount of
playing with made-up inputs will tell us that. ;)

But my inclination is to leave the optimization in place, assuming we do
not find any more corner cases that foil the one-char-of-context hack.
It's not much code, and it definitely _can_ help.

-Peff

  parent reply	other threads:[~2025-10-26 15:18 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-10 14:57 Probable issue with code/documentation Sruteesh Kumar
2025-10-14  0:34 ` [PATCH] match_pathname(): give fnmatch one char of prefix context Jeff King
2025-10-14  3:09   ` Jeff King
2025-10-22 16:19   ` Sruteesh Kumar
2025-10-23 20:28   ` Junio C Hamano
2025-10-24 18:28     ` Sruteesh Kumar
2025-10-26 15:18     ` Jeff King [this message]
2025-10-26 15:26     ` Jeff King
2025-10-26 15:40       ` [PATCH v2 0/2] fix "foo**/bar" matching "foobar" Jeff King
2025-10-26 15:41         ` [PATCH v2 1/2] match_pathname(): reorder prefix-match check Jeff King
2025-10-26 15:42         ` [PATCH v2 2/2] match_pathname(): give fnmatch one char of prefix context Jeff King
2025-10-26 23:29       ` [PATCH] " Junio C Hamano
2025-10-27 14:29         ` Jeff King
2025-10-27 15:35           ` Junio C Hamano
2025-10-28 23:19             ` Jeff King
2025-10-29 15:32               ` [PATCH] doc: document backslash in gitignore patterns Jeff King
2025-10-29 15:55                 ` Jeff King
2025-10-30 13:40                   ` D. Ben Knoble
2025-10-30 15:08                     ` Jeff King
2025-10-30 16:05                       ` Ben Knoble

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251026151836.GA2095501@coredump.intra.peff.net \
    --to=peff@peff$(echo .)net \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=sruteesh.oss@protonmail$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox