public inbox for git@vger.kernel.org 
From: "Brian O'Mahoney" <omb@khandalf•com>
To: unlisted-recipients:; (no To-header on input)
Cc: Git Mailing List <git@vger•kernel.org>
Subject: The git repo format
Date: Wed, 27 Apr 2005 21:47:04 +0200
Message-ID: <426FEC38.1060507@khandalf.com>
In-Reply-To: <Pine.LNX.4.58.0504271154470.18901@ppc970.osdl.org>

In understanding how to work with 'git' I had a number of initial
difficulties which are mostly covered by the e-mail from Linus below.

Most of these are already covered in the README:

for objects, i.e. blob, commit, tag, tree: inflate, then
<type>\s<size>\0<data>

(\s is a single space and <size> is the decimal length of <data>),
where <data> is in the form described by Linus below.

When you look at them closely, all the formats are simple,
unambiguous, and very easy to parse.
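
A minimal Python sketch of that object layout, assuming the size field
is ASCII decimal and counts only the bytes after the NUL. The names
parse_object and make_object are made up for illustration; they are
not part of git.

```python
import zlib


def parse_object(raw: bytes):
    """Inflate an object and split it into (type, data).

    Expects '<type> <size>\\0<data>' and checks the size field
    against the actual length of <data>.
    """
    inflated = zlib.decompress(raw)
    header, nul, body = inflated.partition(b"\0")
    if not nul:
        raise ValueError("missing NUL after header")
    obj_type, space, size = header.partition(b" ")
    if not space or not size.isdigit():
        raise ValueError("malformed header")
    if int(size) != len(body):
        raise ValueError("size field does not match data length")
    return obj_type.decode("ascii"), body


def make_object(obj_type: str, body: bytes) -> bytes:
    """Build a deflated object in the same layout (for testing only)."""
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii")
    return zlib.compress(header + b"\0" + body)
```

Nothing in the header needs searching or guessing: the first space and
the first NUL are the only delimiters.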

The index is also easy to parse, but there is one detail:
after the 3-int header, each record is padded to a multiple
of 8 bytes. The details are in cache.h.
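
A rough sketch of the padding rule in Python. It assumes the record
size is computed as (fixed part + name length + 8) rounded down to a
multiple of 8, which also leaves at least one NUL after the name. The
value 62 for the fixed part is an assumption taken from the later
documented index layout; check cache.h for the tree you are reading.

```python
# Assumption: number of bytes in a record before the path name
# (stat fields, 20-byte sha1, flags). Verify against cache.h.
FIXED_FIELDS = 62


def cache_entry_size(name_len: int) -> int:
    """Size of one index record, padded to a multiple of 8 bytes."""
    return (FIXED_FIELDS + name_len + 8) & ~7


def padding_after_name(name_len: int) -> int:
    """Number of NUL bytes that follow the name (always 1 to 8)."""
    return cache_entry_size(name_len) - (FIXED_FIELDS + name_len)
```

To walk the index you read one record, take the name length, and step
forward by cache_entry_size(name_len) to reach the next record.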

Maybe the README needs to reinforce this.

Brian

> I repeat: git does not do any free-form parsing AT ALL.
========================================================

 The links are in well-defined places, and you do not ever search for them.

And that's really very very important.


> For a "commit", the format is
> 
>  - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex 
>    sha1, and one byte of "\n".
> 
>    NOTHING ELSE. Not extra spaces at the end, not extra spaces at the 
>    beginning or the middle. It's ASCII, but it's not free-format ASCII.
> 
>  - the next <n> (where 'n' can be 0 or more) lines are _exactly_ 48 bytes
>    each:  seven bytes of "parent ", 40 bytes of hex sha1, and one byte of 
>    "\n".
> 
>    NOTHING ELSE.
> 
>  - the next lines are "author " and "committer ". They have well-defined 
>    delimiters for their fields, and no sha1's. The fields cannot contain 
>    '<', '>' or newlines, since those are the field/line delimiters.
> 
> There is no free-format text _anywhere_ that git parses. No room for 
> guesses, no room for mistakes, no room for anything half-way questionable.
> 
> And fsck actually enforces this. We do _not_ just use "gets()" to read one 
> line at a time. We literally verify that the lines are 46/48 bytes long, 
> and have the delimiters in the expected places.
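
The fixed-length check described above can be sketched in Python as
follows. This is not fsck's code; check_commit_header is a made-up
name, and accepting only lower-case hex is an assumption.

```python
import re

# Assumption: sha1 names are written as 40 lower-case hex digits.
HEX40 = re.compile(rb"[0-9a-f]{40}")


def check_commit_header(buf: bytes) -> int:
    """Verify the tree/parent lines of a commit by fixed offsets.

    Returns the number of parent lines, or raises ValueError.
    """
    # First line: "tree " (5) + sha1 (40) + "\n" (1) = 46 bytes.
    if (buf[:5] != b"tree "
            or not HEX40.fullmatch(buf[5:45])
            or buf[45:46] != b"\n"):
        raise ValueError("bad tree line")
    pos, parents = 46, 0
    # Each parent line: "parent " (7) + sha1 (40) + "\n" (1) = 48 bytes.
    while buf[pos:pos + 7] == b"parent ":
        if (not HEX40.fullmatch(buf[pos + 7:pos + 47])
                or buf[pos + 47:pos + 48] != b"\n"):
            raise ValueError("bad parent line")
        pos += 48
        parents += 1
    if buf[pos:pos + 7] != b"author ":
        raise ValueError("expected author line")
    return parents
```

Note that no line is ever searched for: every field is read at a known
offset, so a stray space anywhere shifts the bytes and fails the check.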
> 
> Same goes for "tree" and "tag" objects. They all have fixed-format stuff. 
> A "tree" entry is always
> 
> 	"%o <space> %s" \0 [ 20 bytes of sha1 ]
> 
> with "%o" being "mode", and "%s" being "path". We don't guess. 
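
A small Python sketch of reading tree entries in that layout, assuming
the mode is octal ASCII with no padding and the sha1 is stored as 20
raw bytes. parse_tree is a hypothetical helper, not git code.

```python
def parse_tree(data: bytes):
    """Split tree contents into (mode, path, sha1_hex) tuples.

    Each entry is '<octal mode> <path>\\0' followed by 20 raw sha1 bytes.
    """
    entries = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)        # ValueError if missing
        nul = data.index(b"\0", space + 1)   # ValueError if missing
        mode_bytes = data[pos:space]
        if not mode_bytes or any(c not in b"01234567" for c in mode_bytes):
            raise ValueError("mode is not octal")
        sha1 = data[nul + 1:nul + 21]
        if len(sha1) != 20:
            raise ValueError("truncated sha1")
        entries.append((int(mode_bytes, 8), data[space + 1:nul], sha1.hex()))
        pos = nul + 21
    return entries
```

The path may itself contain spaces; only the first space after the
start of the entry is a delimiter, and the path ends at the NUL.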
> 
> And this really is _important_. Exactly because we name things by the SHA1
> hash of the contents, we MUST NOT have flexible formats. Having a format
> which allows non-canonical representations (extra spaces etc) would mean
> that two trees that were identical would depend on how you happened to
> format them.
> 
> So there's really two issues:
>  - we don't guess or parse contents. We have strict rules, and that makes 
>    git more reliable. There are no gray areas. There's "right" and there 
>    is "wrong", and the right one works, and the wrong one gets flagged as 
>    being wrong and the tools refuse to touch it.
>  - there is only _one_ right way to do things, and that means that 
>    the content is well-defined, and thus the SHA1 of the content is 
>    well-defined.
> 
> For example, another rule is that a "tree" object is always sorted by 
> the bytes in the filename (not by entry, btw: a directory called "foo" 
> will sort as "foo/", even though the _entry_ only shows "foo"). That rule 
> not only makes a lot of operations faster, but again, it means that there 
> is only _one_ way to represent a tree validly.
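
The sorting rule can be illustrated in Python like this. The key
function is a sketch of the rule as stated (directories compare as if
their name ended in "/"); tree_sort_key is a made-up name.

```python
import stat


def tree_sort_key(entry):
    """Sort key for a (mode, name) pair: directories sort as name + '/'."""
    mode, name = entry
    return name + b"/" if stat.S_ISDIR(mode) else name


entries = [
    (0o40000, b"foo"),       # a directory
    (0o100644, b"foo.c"),
    (0o100644, b"foo-bar"),
]
ordered = [name for _, name in sorted(entries, key=tree_sort_key)]
plain = sorted(name for _, name in entries)
```

Here ordered is foo-bar, foo.c, foo, because "-" (0x2d) and "."
(0x2e) both sort before "/" (0x2f). A plain sort on the entry names
gives foo, foo-bar, foo.c instead, which is why the distinction
matters.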
> 
> IOW, you _cannot_ represent a tree any other way (and I've been too lazy
> to check this in fsck, but it's always been my plan), and that is exactly 
> why we can just compare the hashes of the results - because there is no 
> random component of "layout" in the contents.
> 
> This really is important. It means that if you get to the same two tree
> contents in totally unrelated ways (you unpack a tar-file and encode it in
> git, or you have 5 years of git history and check it out), the "tree" will
> match _exactly_. There's no history. There's no "optional" stuff. Since
> the contents of the trees are the same, the SHA1 of the two trees will be
> the same. Exactly because git refuses to touch any free-format stuff.
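
To make the last point concrete, here is a Python sketch that encodes
a tree canonically and names it by SHA1. It assumes the name is the
SHA1 of "tree <size>\0" followed by the sorted entries, which is how
current git computes it; encode_tree and tree_id are hypothetical
helpers.

```python
import hashlib
import stat


def encode_tree(entries) -> bytes:
    """Serialize (mode, name, sha1_20_bytes) entries in canonical order."""
    def key(e):
        return e[1] + b"/" if stat.S_ISDIR(e[0]) else e[1]
    return b"".join(
        b"%o %s\0" % (mode, name) + sha1
        for mode, name, sha1 in sorted(entries, key=key)
    )


def tree_id(entries) -> str:
    """SHA1 over the object header plus the canonical tree contents."""
    body = encode_tree(entries)
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


a = [(0o100644, b"b.txt", b"\x01" * 20), (0o40000, b"a", b"\x02" * 20)]
same = tree_id(a) == tree_id(list(reversed(a)))
```

Whatever order the entries arrive in, the encoding is the same bytes,
so same is True. Two trees built by unrelated routes can be compared by
hash alone.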


