From: "Shawn O. Pearce" <spearce@spearce•org>
To: Johannes Sixt <j.sixt@viscovery•net>
Cc: Thomas Singer <thomas.singer@syntevo•com>, git@vger•kernel.org
Subject: Re: non-US-ASCII file names (e.g. Hiragana) on Windows
Date: Tue, 1 Dec 2009 08:26:27 -0800
Message-ID: <20091201162627.GE21299@spearce.org>
In-Reply-To: <4B14EB2E.9020906@viscovery.net>

Johannes Sixt <j.sixt@viscovery•net> wrote:
> Thomas Singer schrieb:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
> 
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
> 
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and it's something that is proving
to be really painful within JGit.

The Java IO APIs use UTF-16 for file names.  However, we are reading
a stream of bytes in an unknown encoding from the index file and
tree objects.  Thus JGit must convert a stream of bytes into UTF-16
just to get to the OS.
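
To illustrate why that first conversion hurts, here is a minimal sketch,
assuming JGit were to guess UTF-8 for every path (an assumption made for
this example only; the class and method names are hypothetical and not
taken from JGit).  Bytes that are not valid UTF-8 do not survive the trip
through a Java String:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PathBytesRoundTrip {
    // Decode the raw path bytes as UTF-8, encode them again, and report
    // whether the original bytes came back unchanged.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        byte[] encoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, encoded);
    }

    public static void main(String[] args) {
        // Hiragana "a" followed by ".txt", stored as UTF-8.
        byte[] utf8Name = "\u3042.txt".getBytes(StandardCharsets.UTF_8);
        // "e-acute.txt" as an ISO-8859-1 tool would have stored it.
        byte[] latin1Name = {(byte) 0xE9, '.', 't', 'x', 't'};

        System.out.println(survivesUtf8RoundTrip(utf8Name));   // true
        System.out.println(survivesUtf8RoundTrip(latin1Name)); // false
    }
}
```

In the second case the decoder substitutes U+FFFD for the malformed byte,
so the name handed to the OS no longer matches what is in the tree.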

The JVM then turns around and converts from UTF-16 to some other
encoding for the filesystem.

On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
this translation is lossless.

On POSIX, I suspect the JVM uses $LANG or some other related
environment variable to guess the user's preferred encoding, and
then converts from UTF-16 to bytes in that encoding.  And I have
no idea how they handle normalization of composed code points.
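
As a rough sketch of why the second conversion can be lossy: the code
below does not show what the JVM actually does internally (I am only
guessing at that above); it just encodes the same String with two
charsets that a locale might select.  The class and method names are
made up for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeForFilesystem {
    // True when every character of the name can be represented in cs.
    static boolean isRepresentable(String name, Charset cs) {
        return cs.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String name = "\u3042.txt"; // Hiragana "a" followed by ".txt"
        Charset[] candidates = {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1};
        for (Charset cs : candidates) {
            // String.getBytes replaces unmappable characters with the
            // charset's replacement byte ('?' for ISO-8859-1).
            byte[] encoded = name.getBytes(cs);
            System.out.println(cs + ": representable=" + isRepresentable(name, cs)
                    + ", bytes=" + encoded.length);
        }
    }
}
```

With UTF-8 the name encodes to 7 bytes and is representable; with
ISO-8859-1 it is not, and the Hiragana character silently becomes '?'.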

All of these layers make for a *very* confusing situation for us
within JGit:

  git tree
  +---------+
  | bytes   | -+
  +---------+   \
                 \             +--------+            +---------+
                  +-- JGit --> | UTF-16 | -- JVM --> | OS call |
  .git/index     /             +--------+            +---------+
  +---------+   /
  | bytes   | -+
  +---------+

It's impossible for us to do what C git does, which is just use the
bytes used by the OS call within the git data structures.  Which of
course also isn't always portable, e.g. the Mac OS X HFS+ mess.

:-)
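
For anyone who hasn't hit the HFS+ problem: HFS+ stores names in a
decomposed form, so the name you read back can differ byte-for-byte
from the one you wrote.  A small sketch of the mismatch, using only
java.text.Normalizer (the class and helper names are hypothetical, and
NFC is used here merely as one possible comparison form):

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

final class NormalizationMismatch {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Compare two names after bringing both to NFC.
    static boolean sameNameIgnoringNormalization(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9.txt";    // e-acute as one code point
        String decomposed = "cafe\u0301.txt"; // 'e' + combining acute accent

        // Byte-wise, as a byte-oriented comparison would see them.
        System.out.println(Arrays.equals(utf8(composed), utf8(decomposed))); // false
        // After normalizing both sides.
        System.out.println(sameNameIgnoringNormalization(composed, decomposed)); // true
    }
}
```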

-- 
Shawn.

