From: "Brian O'Mahoney" <omb@khandalf•com>
To: unlisted-recipients:; (no To-header on input)
Cc: Git Mailing List <git@vger•kernel.org>
Subject: The git repo format
Date: Wed, 27 Apr 2005 21:47:04 +0200
Message-ID: <426FEC38.1060507@khandalf.com>
In-Reply-To: <Pine.LNX.4.58.0504271154470.18901@ppc970.osdl.org>
In understanding how to work with 'git' I had a number of initial
difficulties, most of which are answered by the e-mail from Linus below.
Most of these points are already covered in the README:

for objects, i.e. blob, commit, tag, tree: inflate, then

<type>\s<size>\0<data>

(\s being a single space), where <data> is in the form described by
Linus below.

When you look at them closely, all the formats are simple,
unambiguous, and very easy to parse.
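
As a rough illustration of the object layout above, here is a minimal Python sketch that inflates a stored object and splits it into its type and data. The function name parse_object and the strictness of the checks are assumptions made for the example, not something taken from the git sources.

```python
import zlib

KNOWN_TYPES = (b"blob", b"commit", b"tag", b"tree")

def parse_object(raw: bytes):
    """Inflate a stored object and split it into (type, data)."""
    body = zlib.decompress(raw)
    # The header ends at the first NUL byte; everything after it is <data>.
    header, nul, data = body.partition(b"\0")
    if not nul:
        raise ValueError("missing NUL after header")
    # The header is "<type> <size>" with exactly one space.
    obj_type, space, size = header.partition(b" ")
    if not space or not size.isdigit():
        raise ValueError("malformed header")
    if obj_type not in KNOWN_TYPES:
        raise ValueError("unknown object type")
    if int(size) != len(data):
        raise ValueError("size does not match data length")
    return obj_type.decode("ascii"), data

# Build a stored object by hand and parse it back.
stored = zlib.compress(b"blob 6\0hello\n")
print(parse_object(stored))  # ('blob', b'hello\n')
```

The sketch refuses anything that does not match the layout instead of trying to recover, which is the spirit of the message quoted below.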
The index is also easy to parse, but there is one detail to watch:
after the 3-int header, the records are padded to a multiple
of 8 bytes. The details are in cache.h.
Maybe the README needs to reinforce this.
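
To make the padding rule concrete, here is a small Python sketch of reading the three header integers and rounding a record size up to a multiple of 8. Two things in it are assumptions, not facts from this message: that the integers are stored big-endian, and that the fixed part of a record before the path name is 62 bytes (ENTRY_FIXED). Both follow the index layout as it was documented later; cache.h remains the authority.

```python
import struct

# Assumption: size of the fixed fields that precede the path name in a record.
ENTRY_FIXED = 62

def parse_index_header(buf: bytes):
    """Return the three 32-bit header integers (assumed big-endian)."""
    if len(buf) < 12:
        raise ValueError("index too short for its header")
    return struct.unpack(">III", buf[:12])

def entry_size(name_len: int) -> int:
    """Record size for a path of name_len bytes, padded to a multiple of 8.

    Adding 8 before masking guarantees at least one padding byte after the name.
    """
    return (ENTRY_FIXED + name_len + 8) & ~7

for n in (1, 2, 9, 10):
    print(n, entry_size(n))
```

A reader walking the records would advance by entry_size(len(path)) after each one, not by the unpadded length.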
Brian
> I repeat: git does not do any free-form parsing AT ALL.
========================================================
> The links are in well-defined places, and you do not ever search for them.
> And that's really very very important.
> For a "commit", the format is
>
> - first line is exactly 46 bytes: five bytes of "tree ", 40 bytes of hex
> sha1, and one byte of "\n".
>
> NOTHING ELSE. Not extra spaces at the end, not extra spaces at the
> beginning or the middle. It's ASCII, but it's not free-format ASCII.
>
> - the next <n> (where 'n' can be 0 or more) lines are _exactly_ 48 bytes
> each: seven bytes of "parent ", 40 bytes of hex sha1, and one byte of
> "\n".
>
> NOTHING ELSE.
>
> - the next lines are "author " and "committer ". They have well-defined
> delimiters for their fields, and no sha1's. The fields cannot contain
> '<', '>' or newlines, since those are the field/line delimiters.
>
> There is no free-format text _anywhere_ that git parses. No room for
> guesses, no room for mistakes, no room for anything half-way questionable.
>
> And fsck actually enforces this. We do _not_ just use "gets()" to read one
> line at a time. We literally verify that the lines are 46/48 bytes long,
> and have the delimiters in the expected places.
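
The length checks described above might look roughly like this in Python. This is only a sketch of the idea and not fsck's actual code: the names check_commit_header and _fixed_line are made up for the example, and the sketch assumes the hex digits are lower-case.

```python
import re

HEX40 = re.compile(rb"[0-9a-f]{40}")  # assumption: lower-case hex only

def _fixed_line(data: bytes, pos: int, prefix: bytes, total: int) -> str:
    """Check one fixed-width line at pos and return its 40-char hex sha1."""
    line = data[pos:pos + total]
    sha1 = line[len(prefix):-1]
    if (len(line) != total or not line.startswith(prefix)
            or not line.endswith(b"\n") or not HEX40.fullmatch(sha1)):
        raise ValueError("bad %r line" % prefix.decode("ascii").strip())
    return sha1.decode("ascii")

def check_commit_header(data: bytes):
    """Return (tree, parents, offset of the author line) or raise ValueError."""
    tree = _fixed_line(data, 0, b"tree ", 46)        # 5 + 40 + 1 bytes
    pos = 46
    parents = []
    while data[pos:pos + 7] == b"parent ":
        parents.append(_fixed_line(data, pos, b"parent ", 48))  # 7 + 40 + 1
        pos += 48
    if data[pos:pos + 7] != b"author ":
        raise ValueError("expected author line")
    return tree, parents, pos
```

Because every line has a known width, the checker never scans for a delimiter: it slices at fixed offsets and rejects anything that does not fit.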
>
> Same goes for "tree" and "tag" objects. They all have fixed-format stuff.
> A "tree" entry is always
>
> "%o <space> %s" \0 [ 20 bytes of sha1 ]
>
> with "%o" being "mode", and "%s" being "path". We don't guess.
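
For illustration, a tree object laid out as described can be walked with a few lines of Python. This is a sketch under the stated layout only; parse_tree is a hypothetical helper name, and the octal-digit check is an extra strictness chosen for the example.

```python
def parse_tree(data: bytes):
    """Split tree data into a list of (mode, path, sha1_hex) entries."""
    entries = []
    pos = 0
    while pos < len(data):
        # bytes.index raises ValueError if the delimiter is missing.
        space = data.index(b" ", pos)
        nul = data.index(b"\0", space + 1)
        mode_txt = data[pos:space]
        if not mode_txt or any(c not in b"01234567" for c in mode_txt):
            raise ValueError("mode is not octal")
        path = data[space + 1:nul]
        sha1 = data[nul + 1:nul + 21]   # 20 raw bytes, not hex
        if len(sha1) != 20:
            raise ValueError("truncated sha1")
        entries.append((int(mode_txt, 8), path, sha1.hex()))
        pos = nul + 21
    return entries

sample = b"100644 a.txt\0" + bytes(range(20)) + b"40000 dir\0" + bytes(20)
for mode, path, sha1 in parse_tree(sample):
    print("%o" % mode, path, sha1)
```

Note that the sha1 in a tree entry is 20 raw bytes, while commit objects carry it as 40 hex characters.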
>
> And this really is _important_. Exactly because we name things by the SHA1
> hash of the contents, we MUST NOT have flexible formats. Having a format
> which allows non-canonical representations (extra spaces etc) would mean
> that two trees that were identical would depend on how you happened to
> format them.
>
> So there's really two issues:
> - we don't guess or parse contents. We have strict rules, and that makes
> git more reliable. There are no gray areas. There's "right" and there
> is "wrong", and the right one works, and the wrong one gets flagged as
> being wrong and the tools refuse to touch it.
> - there is only _one_ right way to do things, and that means that
> the content is well-defined, and thus the SHA1 of the content is
> well-defined.
>
> For example, another rule is that a "tree" object is always sorted by
> the bytes in the filename (not by entry, btw: a directory called "foo"
> will sort as "foo/", even though the _entry_ only shows "foo"). That rule
> not only makes a lot of operations faster, but again, it means that there
> is only _one_ way to represent a tree validly.
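
The sorting rule can be sketched as a sort key in Python: compare raw path bytes, but treat a directory as if its name ended in "/". The name tree_sort_key and the sample entries are made up for the example.

```python
import stat

def tree_sort_key(entry):
    """Sort key for a (mode, path) pair: directories compare as path + b'/'."""
    mode, path = entry
    return path + b"/" if stat.S_ISDIR(mode) else path

entries = [
    (0o40000, b"foo"),       # directory
    (0o100644, b"foo.c"),    # regular file
    (0o100644, b"foo-bar"),  # regular file
]

ordered = sorted(entries, key=tree_sort_key)
print([path for _, path in ordered])
# [b'foo-bar', b'foo.c', b'foo']
```

Sorting by the bare entry names would put "foo" first. With the implied trailing slash the directory sorts last here, since "/" (0x2f) is greater than both "-" (0x2d) and "." (0x2e).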
>
> IOW, you _cannot_ represent a tree any other way (and I've been too lazy
> to check this in fsck, but it's always been my plan), and that is exactly
> why we can just compare the hashes of the results - because there is no
> random component of "layout" in the contents.
>
> This really is important. It means that if you get to the same two tree
> contents in totally unrelated ways (you unpack a tar-file and encode it in
> git, or you have 5 years of git history and check it out), the "tree" will
> match _exactly_. There's no history. There's no "optional" stuff. Since
> the contents of the trees are the same, the SHA1 of the two trees will be
> the same. Exactly because git refuses to touch any free-format stuff.
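
A short Python sketch of why this works: if a tree is always serialized in one canonical order, the same set of entries yields the same bytes and therefore the same SHA1, however the entries were collected. The sketch assumes the object name is the SHA1 of the uncompressed "<type> <size>\0<data>" form, as in current git; object_name and serialize_tree are hypothetical helper names.

```python
import hashlib

def object_name(obj_type: str, data: bytes) -> str:
    """SHA1 over '<type> <size>\\0<data>' (assumed naming rule)."""
    header = ("%s %d" % (obj_type, len(data))).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

def serialize_tree(entries) -> bytes:
    """Serialize (mode, path, sha1_hex) entries in canonical sorted order."""
    def key(entry):
        mode, path, _ = entry
        is_dir = (mode & 0o170000) == 0o040000
        return path + b"/" if is_dir else path
    out = b""
    for mode, path, sha1_hex in sorted(entries, key=key):
        out += b"%o %s\0" % (mode, path) + bytes.fromhex(sha1_hex)
    return out

blob = object_name("blob", b"")
a = [(0o100644, b"README", blob), (0o100644, b"Makefile", blob)]
b = list(reversed(a))  # same entries, gathered in a different order

print(blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_name("tree", serialize_tree(a)) == object_name("tree", serialize_tree(b)))  # True
```

The two lists differ only in the order the entries were gathered, and the canonical serialization removes that difference before hashing.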