public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Hamza Mahfooz <someguy@effective-light•com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail•com>
Cc: "Carlo Marcelo Arenas Belón" <carenas@gmail•com>,
	git@vger•kernel.org, gitster@pobox•com,
	"Andreas Schwab" <schwab@linux-m68k•org>
Subject: Re: [PATCH] grep: avoid setting UTF mode when not needed
Date: Tue, 16 Nov 2021 08:35:51 -0500	[thread overview]
Message-ID: <R33O2R.WYXS4AQP7W05@effective-light.com> (raw)
In-Reply-To: <211116.86tugcp8mg.gmgdl@evledraar.gmail.com>

Does anyone know if René's patch causes older PCRE versions to break? 
If it doesn't I think René's patch plus Ævar's fix is the way to go.

On Tue, Nov 16 2021 at 01:32:17 PM +0100, Ævar Arnfjörð Bjarmason 
<avarab@gmail•com> wrote:
> 
> On Tue, Nov 16 2021, Carlo Marcelo Arenas Belón wrote:
> 
>>  Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii 
>> patterns
>>  and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases 
>> where it
>>  will fail because of UTF-8 validation, which is needed for versions 
>> of
>>  PCRE2 older than 10.34.
>> 
>>  Revert the change on logic to avoid failures that were reported 
>> from the
>>  test cases, but that should also reflect in normal use when JIT is 
>> enabled
>>  and could result in crashes (or worse), as UTF-8 validation is 
>> skipped.
>> 
>>  Keeping the tests, as they pass even without the fix as replicated 
>> locally
>>  in Debian 10 and the CI.
>> 
>>  Reported-by: Andreas Schwab <schwab@linux-m68k•org>
>>  Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail•com>
>>  ---
>>   grep.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>> 
>>  diff --git a/grep.c b/grep.c
>>  index f6e113e9f0..fe847a0111 100644
>>  --- a/grep.c
>>  +++ b/grep.c
>>  @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct 
>> grep_pat *p, const struct grep_opt *opt
>>   		}
>>   		options |= PCRE2_CASELESS;
>>   	}
>>  -	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>  -	    (!opt->ignore_locale && is_utf8_locale() &&
>>  -	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>  -					    (p->fixed || p->is_fixed))))
>>  +	if (!opt->ignore_locale && is_utf8_locale() && 
>> has_non_ascii(p->pattern) &&
>>  +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>   		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 
>>   #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
> 
> Hrm.
> 
> A few things:
> 
> First, if we've got a post-PCREv2 version whatever fix let's guard 
> that
> with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of 
> grep.h.
> 
> It really helps to have those, both to know to test on those older
> versions, and also so that we can at some point in the future bump the
> required version and drop these workarounds entirely (as we do with
> git-curl-compat.h).
> 
> Secondly, whatever do here let's first fix the test added in 
> ae39ba431a,
> so we're not groping around in the dark even more.
> 
> I didn't spot this at the time but the test that Hamza added in that
> based on my initial report[1] is broken & doesn't test anything
> meaningful. It needs to have this applied:
> 
> 	diff --git a/t/t7812-grep-icase-non-ascii.sh 
> b/t/t7812-grep-icase-non-ascii.sh
> 	index 22487d90fdc..1da6b07a579 100755
> 	--- a/t/t7812-grep-icase-non-ascii.sh
> 	+++ b/t/t7812-grep-icase-non-ascii.sh
> 	@@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log 
> --author with an ascii pattern on U
> 	        test_write_lines "forth" >file4 &&
> 	        git add file4 &&
> 	        git commit --author="À Ú Thor <author@example•com>" -m 
> sécond &&
> 	-       git log -1 --color=always --perl-regexp --author=".*Thor" 
> >log &&
> 	+       git log -1 --color=always --perl-regexp --author=". . Thor" 
> >log &&
> 	        grep Author log >actual.raw &&
> 	        test_decode_color <actual.raw >actual &&
> 	        test_cmp expected actual
> 
> I.e. the whole point of using the color output to test this is to
> discover where PCRE2 is going to consider a character boundary to be,
> using .* means that it won't be tested at add, since .* will happily 
> eat
> up whatever arbitrary data it finds with or without UTF-8 mode.
> 
> Other tests added in that & adjacent (if any?) commits may have the 
> same
> issue, I haven't dug into it.
> 
> If we lead with that patch we'll get the test passing on master as
> before, but with your patch above it'll break. I.e. the "when not
> needed' in the $subject isn't true, it's just that the test is
> completely broken.
> 
> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
> 
>  A. Let's back out this new log grep color thing entirely while we
>     reconsider this. The gitster/hm/paint-hits-in-log-grep topic
>     currently reverts cleanly.
> 
>  B. Don't break the new log grep color thing, and also fix the 
> 'grepping
>     binary' regression (which is much more important than having A)
> 
> But let's not go for some in-between where we break the new feature to
> the point of it being worse than the state of not having it at all in
> v2.33.0.
> 
> I.e. without the that log grep color feature we wouldn't screw up the
> display of non-ASCII characters in log output (yay), in v2.34.0 we
> don't, but also color the match (yay), but we broke grepping binary
> *files* (boo!).
> 
> I think the approach I suggested in [2] is a much more viable way
> forward, i.e. let's stop fiddling with this giant nested if statement
> that's mainly meant for the grep-a-file-case, revert to the
> pre-log-grep-color state, and have the log-grep-color mode pass in a
> "yes, I'd like the UTF-8 mode, please".
> 
> Having said that that's probably also broken, just in ways we're less
> likely to spot (or maybe some of the log encoding settings/reencoding
> saves us there?), we may need to have that ifdef'd to 10.34 and 
> higher,
> or have some opt-in setting for this.
> 
> 1. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/



  reply	other threads:[~2021-11-16 13:36 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz [this message]
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=R33O2R.WYXS4AQP7W05@effective-light.com \
    --to=someguy@effective-light$(echo .)com \
    --cc=avarab@gmail$(echo .)com \
    --cc=carenas@gmail$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=schwab@linux-m68k$(echo .)org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox