public inbox for git@vger.kernel.org 
From: "brian m. carlson" <sandals@crustytoothpaste•net>
To: Matthieu Beauchamp-Boulay via GitGitGadget <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org, Matheus Tavares <matheus.tavb@gmail•com>,
	Johannes Schindelin <johannes.schindelin@gmx•de>,
	Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail•com>
Subject: Re: [PATCH] ignores: handle non UTF-8 exclude files
Date: Sun, 4 Jan 2026 19:40:14 +0000
Message-ID: <aVrCHr_NRDqNjPn0@fruit.crustytoothpaste.net>
In-Reply-To: <pull.2157.git.git.1767478617198.gitgitgadget@gmail.com>


On 2026-01-03 at 22:16:57, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> When reading exclude files, git assumes it is encoded in UTF-8 and will
> fail to apply patterns if it isn't. This is a silent failure as no warning
> or errors are shown to the users. This is a problem that can take a while
> to diagnose as many users will not think of checking the encoding of their
> file and may believe their patterns are wrong instead. Users may also
> accidentally commit undesired files.

This isn't actually true.  Git allows arbitrary byte sequences in the
file because Git allows filenames to have arbitrary byte sequences, just
like Unix.

> On Windows, this happens if a user uses Windows PowerShell to create the
> file, which results in a UTF-16LE file with a BOM. This issue was discussed
> here https://github.com/git-for-windows/git/issues/3329. An example of
> where a user was confused that his exclude file was not working is cited
> https://github.com/git-for-windows/git/issues/3227.

Ah, yes, here's the problem.  UTF-16LE is used on Windows, and on
Windows, Git stores pathnames as if they were converted into UTF-8, so
you do need to write the filenames in UTF-8 in the ignore file.
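
As a rough sketch of what that conversion could look like from a POSIX
shell (Git Bash, for instance), assuming `iconv` is available and the
file really is UTF-16 with a BOM.  The file names and the `*.log`
pattern are made up for illustration:

```shell
#!/bin/sh
# Simulate the kind of file the patch description says Windows
# PowerShell creates: UTF-16LE with a BOM (FF FE), holding "*.log\n".
printf '\377\376*\000.\000l\000o\000g\000\n\000' >ignore-utf16.txt

# Convert to UTF-8.  With "-f UTF-16", iconv uses the BOM to pick the
# byte order and does not carry it over into the output.
iconv -f UTF-16 -t UTF-8 ignore-utf16.txt >ignore-utf8.txt

# Dump the result in hex; the bytes should be 2a 2e 6c 6f 67 0a.
od -An -tx1 ignore-utf8.txt
```

The converted file can then be moved into place as `.gitignore`.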

> A minimal fix should at least warn the user if git cannot properly decode
> the exclude file. Ideally, git would handle any given Unicode file.

As I mentioned, the file isn't necessarily in UTF-8 or Unicode.  Here's
an example shell script to demonstrate (requires a non-macOS Unix):

----
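#!/bin/sh

# Start from a fresh repository.
rm -fr test-repo
git init --object-format=sha256 test-repo
cd test-repo
touch abc.txt
# Create a file whose name is the single byte 0x90 (octal 220),
# which is not valid UTF-8 on its own.
touch "$(printf '\220')"
# Ignore that file by writing the same raw byte as a pattern.
printf '\220\n' >.gitignore
git add .
git status
# List ignored untracked files; the 0x90 file should appear here.
git ls-files -io --exclude-standard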
----

I'll point out that all of this is also true for things like config
files (whose format is also used by `.gitmodules`) and `.gitattributes`
files.  If we wanted to make a change, we would be wise to make it
everywhere.

However, if we wanted to force `.gitignore` to UTF-8, we'd need to have
an escape mechanism to write non-UTF-8 sequences, and as far as I know,
we don't.

> First, check if a BOM is present. If it is, decode the file to UTF-8.
> If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> attempt to decode the file using the working tree encoding of the file,
> if any. If that fails, print a warning to tell the user that the exclude
> file could not be decoded and skip the file.

We do not accept and strip BOMs in UTF-8 files elsewhere (including in
things like `git diff` output), so we should not do so here, either.
For Unicode files without a BOM, the standard is to assume UTF-8, so a
BOM is superfluous and not recommended.
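
For anyone who wants to check whether a given ignore file starts with a
BOM, here is a minimal sketch.  The `has_bom` helper is a hypothetical
name, not something in Git or its test suite, and it assumes `head -c`
and `od` are available.  It looks at only the first three bytes, so it
cannot tell a UTF-32LE BOM apart from a UTF-16LE one:

```shell
#!/bin/sh
# has_bom FILE: hypothetical helper.  Prints the kind of BOM found and
# returns 0, or returns 1 if the file starts with none of these BOMs.
has_bom () {
	# First three bytes as lowercase hex with no separators.
	bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
	case "$bytes" in
	efbbbf) echo "UTF-8 BOM" ;;
	fffe*) echo "UTF-16LE BOM" ;;
	feff*) echo "UTF-16BE BOM" ;;
	*) return 1 ;;
	esac
}

# Two sample files: one with a UTF-8 BOM in front of the pattern,
# one without.
printf '\357\273\277*.log\n' >with-bom.txt
printf '*.log\n' >without-bom.txt

has_bom with-bom.txt
has_bom without-bom.txt || echo "no BOM"
```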

> diff --git a/t/lib-encoding.sh b/t/lib-encoding.sh
> index 2dabc8c73e..1b1cc357ba 100644
> --- a/t/lib-encoding.sh
> +++ b/t/lib-encoding.sh
> @@ -23,3 +23,11 @@ write_utf32 () {
>  	fi &&
>  	iconv -f UTF-8 -t UTF-32
>  }
> +
> +write_encoded () {
> +  iconv -f UTF-8 -t "$1"
> +}
> +
> +write_bom () {
> +  echo "$@" | perl -pe 's/\s+//g; $_=pack("H*", $_)'
> +}
> \ No newline at end of file

We place newlines at the end of our text files unless there's a good
reason not to.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
