From: Hamza Mahfooz <someguy@effective-light•com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail•com>
Cc: "Carlo Marcelo Arenas Belón" <carenas@gmail•com>,
git@vger•kernel.org, gitster@pobox•com,
"Andreas Schwab" <schwab@linux-m68k•org>
Subject: Re: [PATCH] grep: avoid setting UTF mode when not needed
Date: Tue, 16 Nov 2021 08:35:51 -0500 [thread overview]
Message-ID: <R33O2R.WYXS4AQP7W05@effective-light.com> (raw)
In-Reply-To: <211116.86tugcp8mg.gmgdl@evledraar.gmail.com>
Does anyone know if René's patch causes older PCRE versions to break?
If it doesn't I think René's patch plus Ævar's fix is the way to go.
On Tue, Nov 16 2021 at 01:32:17 PM +0100, Ævar Arnfjörð Bjarmason
<avarab@gmail•com> wrote:
>
> On Tue, Nov 16 2021, Carlo Marcelo Arenas Belón wrote:
>
>> Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii
>> patterns
>> and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases
>> where it
>> will fail because of UTF-8 validation, which is needed for versions
>> of
>> PCRE2 older than 10.34.
>>
>> Revert the change on logic to avoid failures that were reported
>> from the
>> test cases, but that should also reflect in normal use when JIT is
>> enabled
>> and could result in crashes (or worse), as UTF-8 validation is
>> skipped.
>>
>> Keeping the tests, as they pass even without the fix as replicated
>> locally
>> in Debian 10 and the CI.
>>
>> Reported-by: Andreas Schwab <schwab@linux-m68k•org>
>> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail•com>
>> ---
>> grep.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index f6e113e9f0..fe847a0111 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct
>> grep_pat *p, const struct grep_opt *opt
>> }
>> options |= PCRE2_CASELESS;
>> }
>> - if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>> - (!opt->ignore_locale && is_utf8_locale() &&
>> - has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>> - (p->fixed || p->is_fixed))))
>> + if (!opt->ignore_locale && is_utf8_locale() &&
>> has_non_ascii(p->pattern) &&
>> + !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>
>> #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>
> Hrm.
>
> A few things:
>
> First, if we've got a post-PCREv2 version whatever fix let's guard
> that
> with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of
> grep.h.
>
> It really helps to have those, both to know to test on those older
> versions, and also so that we can at some point in the future bump the
> required version and drop these workarounds entirely (as we do with
> git-curl-compat.h).
>
> Secondly, whatever do here let's first fix the test added in
> ae39ba431a,
> so we're not groping around in the dark even more.
>
> I didn't spot this at the time but the test that Hamza added in that
> based on my initial report[1] is broken & doesn't test anything
> meaningful. It needs to have this applied:
>
> diff --git a/t/t7812-grep-icase-non-ascii.sh
> b/t/t7812-grep-icase-non-ascii.sh
> index 22487d90fdc..1da6b07a579 100755
> --- a/t/t7812-grep-icase-non-ascii.sh
> +++ b/t/t7812-grep-icase-non-ascii.sh
> @@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log
> --author with an ascii pattern on U
> test_write_lines "forth" >file4 &&
> git add file4 &&
> git commit --author="À Ú Thor <author@example•com>" -m
> sécond &&
> - git log -1 --color=always --perl-regexp --author=".*Thor"
> >log &&
> + git log -1 --color=always --perl-regexp --author=". . Thor"
> >log &&
> grep Author log >actual.raw &&
> test_decode_color <actual.raw >actual &&
> test_cmp expected actual
>
> I.e. the whole point of using the color output to test this is to
> discover where PCRE2 is going to consider a character boundary to be,
> using .* means that it won't be tested at add, since .* will happily
> eat
> up whatever arbitrary data it finds with or without UTF-8 mode.
>
> Other tests added in that & adjacent (if any?) commits may have the
> same
> issue, I haven't dug into it.
>
> If we lead with that patch we'll get the test passing on master as
> before, but with your patch above it'll break. I.e. the "when not
> needed' in the $subject isn't true, it's just that the test is
> completely broken.
>
> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
>
> A. Let's back out this new log grep color thing entirely while we
> reconsider this. The gitster/hm/paint-hits-in-log-grep topic
> currently reverts cleanly.
>
> B. Don't break the new log grep color thing, and also fix the
> 'grepping
> binary' regression (which is much more important than having A)
>
> But let's not go for some in-between where we break the new feature to
> the point of it being worse than the state of not having it at all in
> v2.33.0.
>
> I.e. without the that log grep color feature we wouldn't screw up the
> display of non-ASCII characters in log output (yay), in v2.34.0 we
> don't, but also color the match (yay), but we broke grepping binary
> *files* (boo!).
>
> I think the approach I suggested in [2] is a much more viable way
> forward, i.e. let's stop fiddling with this giant nested if statement
> that's mainly meant for the grep-a-file-case, revert to the
> pre-log-grep-color state, and have the log-grep-color mode pass in a
> "yes, I'd like the UTF-8 mode, please".
>
> Having said that that's probably also broken, just in ways we're less
> likely to spot (or maybe some of the log encoding settings/reencoding
> saves us there?), we may need to have that ifdef'd to 10.34 and
> higher,
> or have some opt-in setting for this.
>
> 1. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
next prev parent reply other threads:[~2021-11-16 13:36 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03 ` Junio C Hamano
2021-10-16 16:25 ` René Scharfe
2021-10-16 17:12 ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44 ` René Scharfe
2021-10-17 6:00 ` Junio C Hamano
2021-10-17 6:55 ` René Scharfe
2021-10-17 9:44 ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43 ` Andreas Schwab
2021-11-15 22:41 ` Ævar Arnfjörð Bjarmason
2021-11-16 2:12 ` Carlo Arenas
2021-11-16 8:41 ` Andreas Schwab
2021-11-16 9:06 ` Carlo Arenas
2021-11-16 9:18 ` Andreas Schwab
2021-11-16 9:29 ` Andreas Schwab
2021-11-16 9:38 ` Carlo Arenas
2021-11-16 9:55 ` Andreas Schwab
2021-11-16 11:00 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32 ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35 ` Hamza Mahfooz [this message]
2021-11-17 14:31 ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22 ` Carlo Arenas
2021-11-16 18:48 ` Junio C Hamano
2021-11-17 10:23 ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18 7:29 ` Junio C Hamano
2021-11-18 10:15 ` Carlo Arenas
2021-11-18 12:49 ` Hamza Mahfooz
2021-11-19 6:58 ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21 ` Hamza Mahfooz
2021-11-17 18:46 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56 ` Carlo Arenas
2021-11-17 20:59 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53 ` Carlo Arenas
2021-11-18 0:00 ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17 ` Junio C Hamano
2021-11-18 20:57 ` René Scharfe
2021-11-19 7:00 ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08 ` René Scharfe
2021-11-19 17:33 ` Carlo Arenas
2021-11-19 17:11 ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24 ` Hamza Mahfooz
2021-10-15 19:28 ` Junio C Hamano
2021-10-15 19:40 ` Hamza Mahfooz
2021-10-15 19:49 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=R33O2R.WYXS4AQP7W05@effective-light.com \
--to=someguy@effective-light$(echo .)com \
--cc=avarab@gmail$(echo .)com \
--cc=carenas@gmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitster@pobox$(echo .)com \
--cc=schwab@linux-m68k$(echo .)org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox