public inbox for git@vger.kernel.org 
From: "Torsten Bögershausen" <tboegi@web•de>
To: Matthieu Beauchamp-Boulay via GitGitGadget <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org, Matheus Tavares <matheus.tavb@gmail•com>,
	Johannes Schindelin <johannes.schindelin@gmx•de>,
	Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail•com>
Subject: Re: [PATCH] ignores: handle non UTF-8 exclude files
Date: Sun, 4 Jan 2026 18:35:24 +0100
Message-ID: <20260104173524.GA29867@tb-raspi4>
In-Reply-To: <pull.2157.git.git.1767478617198.gitgitgadget@gmail.com>

On Sat, Jan 03, 2026 at 10:16:57PM +0000, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail•com>
Thanks for contributing - some comments inline.
> 
> When reading exclude files, git assumes it is encoded in UTF-8 and will
Question: The report cited below talks about ignore files.

> fail to apply patterns if it isn't. This is a silent failure as no warning
> or errors are shown to the users. This is a problem that can take a while
> to diagnose as many users will not think of checking the encoding of their
> file and may believe their patterns are wrong instead. Users may also
> accidentally commit undesired files.
Note:
git status is your friend.
Blindly committing without checking what is staged may
lead to unwanted results.

> 
> On Windows, this happens if a user uses Windows PowerShell to create the
> file, which results in a UTF-16LE file with a BOM.
> This issue was discussed
> here https://github.com/git-for-windows/git/issues/3329. An example of
> where a user was confused that his exclude file was not working is cited
> https://github.com/git-for-windows/git/issues/3227.
A quick search indicates that PowerShell can be configured
to use UTF-8. I am not a PowerShell user; please correct me if I am wrong.

> 
> A minimal fix should at least warn the user if git cannot properly decode
> the exclude file.
I think that reading an ignore file that contains a '\0' could/should
cause Git to complain. If you ask me, most users are tempted to ignore
warnings for various reasons. Bailing out may feel less polite,
but it makes it clearer that something is wrong.
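
A minimal sketch of such a check, assuming the file contents are already
in memory. The helper name check_no_nul() and its return convention are
hypothetical and not part of Git; inside Git the error path would more
likely be a die() call.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: return 0 if buf contains no
 * NUL byte, or -1 after printing an error if it does. ASCII characters
 * encoded as UTF-16 carry a NUL byte each, while an ignore file written
 * in UTF-8 is not expected to contain any.
 */
int check_no_nul(const char *fname, const char *buf, size_t size)
{
    const char *nul = memchr(buf, '\0', size);

    if (!nul)
        return 0;
    fprintf(stderr, "error: '%s' contains a NUL byte at offset %zu; "
            "is it encoded in UTF-16?\n", fname, (size_t)(nul - buf));
    return -1;
}
```

A caller would skip the file or abort when the function returns -1.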

> Ideally, git would handle any given Unicode file.
That is debatable.

> 
> First, check if a BOM is present. If it is, decode the file to UTF-8.
> If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> attempt to decode the file using the working tree encoding of the file,
> if any. If that fails, print a warning to tell the user that the exclude
> file could not be decoded and skip the file.
> 
> This raises the issue that if the entire tree is encoded in, for example
> UTF-16BE (no BOM), then even if the encoding is given in .gitattributes,
> git would not be able to decode it.
"Able to decode": yes. But willing to do so? Not with the patch, right?
> I believe that this is still
> acceptable since a warning will be emitted for the file (since it has no
> BOM, is not valid UTF-8 and no working tree encoding could be found).
> 
> One case that isn't handled is if a wrong encoding is given in the
> attributes and the exclude file has no BOM and is not UTF-8. Using
> iconv to convert an UTF16BE file to UTF-8 while specifying UTF-16LE
> yields gibberish without an error and so this case is a silent failure
> where no patterns will match.
One question is whether we should look at working_tree_encoding at all.
The other is how much UTF-16 handling of ignore files or
other files we should have in Git.
It seems that this fix covers a very special case only?

From
https://github.com/git-for-windows/git/issues/3329
we read:
/******/
if (size > 1 && buf[0] == 0xff && buf[1] == 0xfe) {
    char *reencoded = reencode_string_len(buf, size, "UTF-8", "UTF16-LE-BOM", &size);
    if (!reencoded)
        die(_("could not convert contents of '%s' from UTF-16"), fname);
    free(buf);
    buf = reencoded;
}
/******/
(Which seems a simpler suggestion)
However, there is no UTF-16-LE-BOM in iconv
(at least in the majority of implementations),
so a better approach, totally untested, may be:

if (size >= 2 && (unsigned char)buf[0] == 0xff && (unsigned char)buf[1] == 0xfe) {
    /* Skip the 2-byte BOM and name the byte order explicitly. */
    char *reencoded = reencode_string_len(buf + 2, size - 2, "UTF-8", "UTF-16LE", &size);
    if (!reencoded)
        die(_("could not convert contents of '%s' from UTF-16"), fname);
    free(buf);
    buf = reencoded;
}
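
For experimenting outside of Git, the same idea can be sketched with
plain iconv(3). This is only an illustration: the function name
utf16le_bom_to_utf8() is hypothetical, and it assumes an iconv that
knows the encoding names "UTF-8" and "UTF-16LE". The bytes are compared
as unsigned char because plain char may be signed, and the byte order is
named explicitly because the interpretation of BOM-less "UTF-16" input
may differ between iconv implementations.

```c
#include <iconv.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of Git. If buf starts with a UTF-16LE
 * BOM (0xFF 0xFE), return a newly allocated, NUL-terminated UTF-8 copy
 * of the text after the BOM and store its length in *out_size.
 * Return NULL if there is no such BOM or the conversion fails.
 */
char *utf16le_bom_to_utf8(const char *buf, size_t size, size_t *out_size)
{
    const unsigned char *u = (const unsigned char *)buf;
    char *in, *out, *result;
    size_t in_left, out_left, cap;
    iconv_t cd;

    if (size < 2 || u[0] != 0xff || u[1] != 0xfe)
        return NULL;

    cd = iconv_open("UTF-8", "UTF-16LE"); /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return NULL;

    /* One 2-byte UTF-16 unit needs at most 3 bytes of UTF-8. */
    cap = (size - 2) / 2 * 3 + 1;
    result = malloc(cap);
    if (!result) {
        iconv_close(cd);
        return NULL;
    }

    in = (char *)buf + 2; /* iconv() wants char **, the input is only read */
    in_left = size - 2;
    out = result;
    out_left = cap - 1;   /* keep one byte for the terminating NUL */

    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1) {
        free(result);
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);

    *out_size = (size_t)(out - result);
    result[*out_size] = '\0';
    return result;
}
```

The caller owns the returned buffer and must free() it. A file with an
odd number of bytes after the BOM makes iconv() fail, so the function
returns NULL in that case as well.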

This leads to some free thinking, especially when we look at
other implementations of Git:
Would it be better to simply bail out on UTF-16 files?
Technically, on all files containing a '\0'.
[snip] 
