From: "Julia Evans" <julia@jvns•ca>
To: "Junio C Hamano" <gitster@pobox•com>,
"Julia Evans" <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org,
"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail•com>,
"D. Ben Knoble" <ben.knoble@gmail•com>,
"Patrick Steinhardt" <ps@pks•im>
Subject: Re: [PATCH v3] doc: add a explanation of Git's data model
Date: Thu, 16 Oct 2025 11:19:46 -0400
Message-ID: <0eb276ef-7b1a-4e79-93da-13a83226aa01@app.fastmail.com>
In-Reply-To: <xmqqv7kgszr1.fsf@gitster.g>
On Wed, Oct 15, 2025, at 3:58 PM, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail•com> writes:
>
>> +[[commit]]
>> +commits::
>> + A commit contains these required fields
>> + (though there are other optional fields):
>> ++
>> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
>> + the commit's base directory.
>
> "all the files' exact contents at the time of the commit" is what we
> mean here, and once readers know what a tree is, the above sentence
> would be understood as such, but "All the files" felt somewhat
> fuzzy. I wonder if presenting objects in bottom-up fashion makes it
> easier to see? Learn that a blob records exact content of a file,
> then learn that a tree records the set of paths with exact contents
> stored at these paths, and after that, learn that a commit records a
> tree, hence a snapshot of the whole set of contents. I dunno...
Will try "The contents of all the *files* in the commit..." to make it a little
more explicit that it's a snapshot.
>> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2 or more parents
>> +3. An *author* and the time the commit was authored
>> +4. A *committer* and the time the commit was committed.
>> + If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
>> + then they will be the author and you'll be the committer.
>
> It felt a bit odd to single-out cherry-pick here.
>
> I think the important thing to become aware of for the readers at
> this point is that the author and committer can be different people,
> and it does not matter how one commits somebody else's patch at the
> mechanical level.
>
> Perhaps replace "If you cherry-pick..." with something like "note: a
> change authored by a person at some point in time can be committed
> by another person at a different time, and these fields are to
> record both persons' contributions separately", perhaps, if we
> really want to say more.
I'll just delete the comment about cherry-pick.
I think it's already obvious (from the fact that there are two different fields)
that the author and committer can be different (and happen at
different times), and if we don't want to explain why that might
happen there's no need to say more.
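
For illustration only, here is a rough sketch of how the two fields show up
as separate header lines in the text of a commit object. The sample commit
and the `parse_commit` helper are made up for this example, and optional
multi-line headers (such as signatures) are not handled:

```python
# Toy parser for the text of a commit object (the form that
# `git cat-file -p <commit>` prints). The sample below is invented.

SAMPLE_COMMIT = """\
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
author Maya <maya@example.com> 1759173408 -0400
committer Sam <sam@example.com> 1759173425 -0400

Add README
"""

def parse_commit(text):
    """Split a commit into its header fields and its message."""
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)  # 0, 1, or several parents
        else:
            fields[key] = value
    fields["message"] = message
    return fields

commit = parse_commit(SAMPLE_COMMIT)
print(commit["author"])     # who wrote the change, and when
print(commit["committer"])  # who recorded it in this repository, and when
```

Nothing in the format ties the two lines together, so they can name
different people and carry different timestamps.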
>> +Git does not store the diff for a commit: when you ask Git for a
>> +diff it calculates it on the fly.
>
> I think this is an attempt to demystify "are we really storing
> snapshot for each commit?" thing, but then "when you ask Git to show
> the commit, it calculates the diff from its parent on the fly" might
> achieve that better, perhaps?
Sure, can change it to that.
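
A toy sketch of what "calculates the diff from its parent on the fly" means,
assuming each snapshot is modeled as a plain mapping from path to content ID.
The `diff_snapshots` helper and the short IDs are hypothetical, and this only
reports which paths changed, not line-level differences:

```python
# Toy model: a snapshot maps each path to the ID of its contents.
# The short IDs are made-up placeholders, not real object IDs.

def diff_snapshots(parent, child):
    """Compare two snapshots and report what changed between them."""
    changes = {}
    for path in sorted(set(parent) | set(child)):
        if path not in parent:
            changes[path] = "added"
        elif path not in child:
            changes[path] = "deleted"
        elif parent[path] != child[path]:
            changes[path] = "modified"
    return changes

parent = {"README": "aaa111", "main.py": "bbb222", "old.txt": "ccc333"}
child = {"README": "aaa111", "main.py": "ddd444", "new.txt": "eee555"}

# Nothing about the change is stored; it is derived from the two snapshots.
print(diff_snapshots(parent, child))
```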
>> +[[tree]]
>> +trees::
>> + A tree is how Git represents a directory. It lists, for each item in
>> + the tree:
>> ++
>> +[[file-mode]]
>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>> + permissions, but Git's modes are much more limited. Git only supports these file modes:
>> ++
>> + - `100644`: regular file (with type `blob`)
>> + - `100755`: executable file (with type `blob`)
>> + - `120000`: symbolic link (with type `blob`)
>> + - `040000`: directory (with type `tree`)
>> + - `160000`: gitlink, for use with submodules (with type `commit`)
>
> It is not really "supporting" file modes. Rather, Git only records
> 5 kinds of entities associated with each path in a tree object, and
> uses numbers that remotely resemble POSIX file modes to represent
> these 5 kinds.
>
> Perhaps "supports" -> "uses"?
"Uses" sounds good to me.
>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> + or <<commit,`commit`>> (a Git submodule, which is a
>> + commit from a different Git repository)
>> +3. The <<object-id,*object ID*>>
>> +4. The *filename*
>
> Here it may be worth noting that this "filename" is a single
> pathname component (roughly, what you would see in non-recursive
> "ls"). In other words, it may be a directory name.
>
> I wonder if we need to say "<blob> (a file, or a symbolic link)"?
I'm inclined to leave this alone because arguably a symbolic link is
a file but I don't feel strongly about this.
>> +[[blob]]
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>
> "represents a file" hints as if the thing may know its name, but
> that is not the case (its name is given only by surrounding tree).
>
> "A blob is how Git represents uninterpreted series of bytes, and
> most commonly used to store file's contents." or something, perhaps?
I'll say "A blob is how Git represents a file's contents", unless Git has
another use for blobs that I don't know about (I think it's not
that much of a stretch to say that a symbolic link is a special kind
of file where the "contents" are the link destination).
I think it's always clearer to be more specific when possible. If there's only
one purpose for blobs, it's unnecessary (and IMO a bit misleading, because
it makes the reader wonder if there are other purposes that they should
know about) to say that blobs can be used to store any arbitrary bytes for
any purpose.
If there is another purpose I think we should give an example.
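
As a small sketch of why a blob does not know its name: in a SHA-1
repository, a blob's ID is the hash of a short header plus the contents, and
the filename never enters the calculation. The `blob_id` helper is a
hypothetical name for this example:

```python
import hashlib

def blob_id(contents: bytes) -> str:
    """Compute the SHA-1 object ID of a blob with these contents."""
    header = b"blob %d\x00" % len(contents)
    return hashlib.sha1(header + contents).hexdigest()

# No path is passed in, so two files with identical contents
# (whatever they are called) map to the same blob.
print(blob_id(b""))
print(blob_id(b"hello world\n"))
```

For a symbolic link, the bytes being hashed would be the link destination.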
>> +When you make a new commit, Git only needs to store new versions of
>> +files which were changed in that commit. This means that commits
>> +can use relatively little disk space even in a very large repository.
>
> That invites the "aren't we storing a delta after all, then?"
> confusion.
>
> "Git only needs to newly store new versions of files and
> directories. Files and directories that were not modified by the
> commit are shared with its parent commit".
I agree it makes it sound a little bit like we're storing a delta.
Will think about how to phrase this differently.
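
One way to picture the sharing is a toy content-addressed store, sketched
below. This is a simplified model rather than Git's storage code: `store`,
`put`, and `snapshot` are hypothetical names, and the "tree" here is a flat
text listing instead of Git's real binary format:

```python
import hashlib

store = {}  # object ID -> bytes; a stand-in for the object database

def put(kind: str, body: bytes) -> str:
    """Store an object under the hash of its type, size, and body."""
    data = kind.encode() + b" %d\x00" % len(body) + body
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # storing identical contents again changes nothing
    return oid

def snapshot(files: dict) -> str:
    """Store every file as a blob, then a simplified one-level 'tree'."""
    lines = [f"{put('blob', contents)} {name}"
             for name, contents in sorted(files.items())]
    # Simplification: a real tree is binary and can nest other trees.
    return put("tree", "\n".join(lines).encode())

files = {"README": b"hello\n", "a.txt": b"aaa\n", "b.txt": b"bbb\n"}
first = snapshot(files)
count_after_first = len(store)

files["README"] = b"hello, world\n"  # change one file only
second = snapshot(files)
new_objects = len(store) - count_after_first

# Both snapshots are complete, yet the second one only added
# one new blob and one new tree to the store.
print(count_after_first, new_objects)
```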
>> +NOTE: All of the examples in this section were generated with
>> +`git cat-file -p <object-id>`, which shows the contents of a Git object.
>
> Was this necessary to say this? Blobs, Commits, and Tags are
> textual, so "-p" does very minimum thing, but Trees are binary
> garbage, so "-p" output is heavily massaged version of the contents.
Ah, I didn't know how trees were stored, thanks.
I can remove "which shows the contents of a Git object", people
can read the man page for `git cat-file` if they want details.
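
To illustrate the point about trees, here is a rough sketch of the raw entry
layout in a SHA-1 repository (mode, space, name, NUL byte, then 20 raw ID
bytes) and of the kind of massaging needed to get readable lines. The helper
names and the sample entries are made up, and the output only approximates
what `git cat-file -p` prints:

```python
# Which object type each mode implies. The raw tree stores only the
# mode, so the type shown in pretty-printed output is derived from it.
TYPE_FOR_MODE = {"100644": "blob", "100755": "blob", "120000": "blob",
                 "40000": "tree", "160000": "commit"}

def encode_entry(mode: str, name: str, oid_hex: str) -> bytes:
    """One raw tree entry: '<mode> <name>', a NUL byte, then the binary ID."""
    return mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid_hex)

def pretty_print(raw: bytes) -> list:
    """Turn raw tree bytes into human-readable lines."""
    lines, pos = [], 0
    while pos < len(raw):
        space = raw.index(b" ", pos)
        nul = raw.index(b"\x00", space)
        mode = raw[pos:space].decode()
        name = raw[space + 1:nul].decode()
        oid = raw[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex digits
        lines.append(f"{mode.zfill(6)} {TYPE_FOR_MODE[mode]} {oid}\t{name}")
        pos = nul + 21
    return lines

raw_tree = (
    encode_entry("100644", "README", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
    + encode_entry("40000", "src", "4b825dc642cb6eb9a060e54bf8d69288fbee4904")
)
for line in pretty_print(raw_tree):
    print(line)
```

Note that the directory mode is stored without its leading zero and only
padded to `040000` for display.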
>> +[[branch]]
>> +branches: `refs/heads/<name>`::
>> + A branch is a name for a commit ID.
>
> Well a commit ID is an alternative way to refer to a commit object
> *name*, so it is a bit strange to say "a name for a commit ID".
>
> Perhaps "A branch ref stores a commit ID." is better?
I think I'll leave this alone, none of the many test readers reported
being confused by it.
>> +[[tag]]
>> +tags: `refs/tags/<name>`::
>> + A tag is a name for a commit ID, tag object ID, or other object ID.
>
> Likewise. "A tag ref stores any kind of object ID, but commonly
> they are commit objects or tag objects"
>
>> + Tags that reference a tag object ID are called "annotated tags",
>> + because the tag object contains a tag message.
>> + Tags that reference a commit, blob, or tree ID are
>> + called "lightweight tags".
>> ++
>> +Even though branches and tags are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to change over time: when you make a commit, Git
>> +will update your <<HEAD,current branch>> to reference the new changes.
>
> This sentence talks about branch moving because it advances with
> more commits. Did we want to say "HEAD" here before we explain what
> it is? "HEAD" can move for another reason (i.e. branch switching)
> and using "HEAD" in the context of talking about growing history
> might invite confusion. I dunno.
The text says "current branch", it just cross-references the "HEAD" section in the
HTML version if someone wants to read about what is meant by "current branch".
>> +Tags are usually not changed after they're created.
>
>> +[[HEAD]]
>> +HEAD: `HEAD`::
>> + `HEAD` is where Git stores your current <<branch,branch>>.
>
> Hmm...
>
>> + `HEAD` can either be:
>> + 1. A symbolic reference to your current branch, for example `ref:
>> + refs/heads/main` if your current branch is `main`.
>> + 2. A direct reference to a commit ID. This is called "detached HEAD
>> + state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.
>
> These two are very reasonable. But "your current <<branch>>" refers
> only to #1.
>
> `HEAD` refers to the commit your current work is based on, and
> it is the commit that will become the first parent of the commit
> once your current work is concluded. It can either be ...
>
> perhaps.
I like the idea of mentioning that HEAD will be the parent commit
of any commit that you make. Will think about how to incorporate
that, and about how to resolve " `HEAD` is where Git stores your
current <<branch,branch>>." being not exactly true.
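
As a sketch of the two cases, assuming the files backend (where `HEAD` is a
small text file), something like the following would tell them apart. The
`describe_head` helper is a hypothetical name, and the check accepts 40 hex
digits (SHA-1) or 64 (SHA-256):

```python
import re

def describe_head(head_contents: str) -> str:
    """Classify the contents of a HEAD file as one of the two cases."""
    value = head_contents.strip()
    if value.startswith("ref: "):
        # Case 1: symbolic reference to the current branch.
        return "on branch " + value[len("ref: "):]
    if re.fullmatch(r"[0-9a-f]{40}([0-9a-f]{24})?", value):
        # Case 2: direct reference to a commit ID ("detached HEAD").
        return "detached at " + value
    raise ValueError("unrecognized HEAD contents: " + repr(value))

print(describe_head("ref: refs/heads/main\n"))
print(describe_head("750b4ead9c87ceb3ddb7a390e6c7074521797fb3\n"))
```

In both cases a new commit's parent is whatever commit `HEAD` resolves to;
only the first case has a "current branch" to advance.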
>> +[[remote-tracking-branch]]
>> +remote tracking branches: `refs/remotes/<remote>/<branch>`::
>
> Please always write "remote-tracking" with a hyphen (see glossary).
Will fix.
>> + A remote-tracking branch is a name for a commit ID.
>
> Either "A remote-tracking branch stores a commit object name" or "A
> remote-tracking branch points at a commit object", followed by "in
> order to keep track of the last-known state of ..." in a single
> sentence.
I see that you don't like the "name for a commit ID" phrasing :)
Maybe there's another way to say it, though again none of the test
readers said they were confused by this or disagreed with the phrasing.
>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains a list of every
>> +file in the repository and its contents. When you commit, the files in
>> +the index are used as the files in the next commit.
>
> It is hard to define what "every file in the repository" really is.
> Files that you removed last week do not count. Files added in your
> wip branch elsewhere are obviously not yet in the index when you are
> working on your primary branch.
Agreed, I'm not so happy with "every file in the repository" either.
My intent was to make it clear that it's not "just the files you `git add`ed".
I'll think about a different phrasing that communicates the same thing.
Perhaps mentioning how it relates to the HEAD commit would help.
>> +You can add files to the index or update the version in the index with
>> +linkgit:git-add[1]. Adding a file to the index or updating its version
>> +is called "staging" the file for commit.
>
> It may be worth to clarify by saying "staging the contents of the
> file" (you can edit the file further after you "git add") that you
> are taking a snapshot at the time you ran "git add", instead of
> giving a general instruction to "keep an eye on this file" to Git
> (if it were, then the next "git commit" would behave more like "git
> add -u && git commit").
Maybe, will think about this too.
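
A toy sketch of the "snapshot at the time you ran git add" behavior, with the
index modeled as a plain mapping from path to the ID of the staged contents.
`working_tree`, `index`, and `add` are hypothetical stand-ins, not Git's
actual index format:

```python
import hashlib

def blob_id(contents: bytes) -> str:
    """SHA-1 blob ID for the given contents."""
    return hashlib.sha1(b"blob %d\x00" % len(contents) + contents).hexdigest()

working_tree = {"notes.txt": b"draft 1\n"}
index = {}  # path -> ID of the staged contents

def add(path: str) -> None:
    """Toy `git add`: record the file's contents as they are right now."""
    index[path] = blob_id(working_tree[path])

add("notes.txt")
staged = index["notes.txt"]

working_tree["notes.txt"] = b"draft 2\n"  # keep editing after staging

# The index still holds the first version until `add` runs again.
print(index["notes.txt"] == staged)
print(index["notes.txt"] == blob_id(working_tree["notes.txt"]))
```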
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores a history called a "reflog" for every branch, remote-tracking
>> +branch, and HEAD. This means that if you make a mistake and "lose" a
>> +commit, you can generally recover the commit ID by running
>> +`git reflog <reference>`.
>> +
>> +Each reflog entry has:
>> +
>> +1. Before/after *commit IDs*
>> +2. *User* who made the change, for example `Maya <maya@example•com>`
>> +3. *Timestamp* when the change was made
>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> Technically it is correct that before/after are recorded, but there
> is no way for the end-user to interact with them. "git reflog"
> walking these entries will only give you a single commit object.
> The username is also recorded, but I do not think of a way to view
> the information, let alone using it for querying.
You can view the username with `git reflog --format="%gn <%ge>"`
(according to `man git-log`). I don't see a way to view the old commit ID.
Perhaps we should include the username but not the old commit ID then.
I'm not sure.
> Especially when the reftable backend is in use, you cannot even read
> the raw representation like you can do with files backend (where
> something like "cat .git/logs/HEAD" would let you peek into the
> details). I am not sure if we want to go into this detail.
>
> Perhaps drop everything after "Each reflog entry has:"?
Perhaps we could give a stripped down list, like

1. The new *commit ID* the reference points to
2. *Timestamp* when the change was made
3. *Log message*, for example `pull: Fast-forward`

And then instead of giving the contents of `.git/logs/HEAD`
(which as you say includes some fields that there's no way
for the user to interact with), we could just show the
output of `git reflog main`, like this:

  You can view the reflog with `git reflog`, for example here's the reflog
  for a `main` branch which has changed twice:

    $ git reflog main --date=iso --no-decorate
    750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README
    4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit
I added `--no-decorate` there because the decorations are a distraction
when talking about the data model.
This version omits the username which is a little weird (it is possible to
access the username) but mentioning the username is a little weird
too because it raises some questions that are hard to answer about
what that field is for, and you have to pass an obscure format string
to view it. Not sure what's best here.
>> +For example, here's how the reflog for `HEAD` in a repository with 2
>> +commits is stored:
>> +
>> +----
>> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example•com> 1759173408 -0400 commit (initial): Initial commit
>> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example•com> 1759173425 -0400 commit: Add README
>> +----
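
For reference, a rough sketch of how the fields in those two raw lines could
be pulled apart, assuming the files backend layout quoted above. The regular
expression and the `parse_reflog_line` helper are made up for this example
and are not part of Git:

```python
import re
from datetime import datetime, timedelta, timezone

# old ID, new ID, user name, email, Unix timestamp, UTC offset, message
LINE = re.compile(
    r"(?P<old>[0-9a-f]+) (?P<new>[0-9a-f]+) "
    r"(?P<name>.*) <(?P<email>[^>]*)> "
    r"(?P<ts>\d+) (?P<tz>[+-]\d{4})\s(?P<msg>.*)"
)

def parse_reflog_line(line: str) -> dict:
    """Split one raw reflog line into its fields."""
    m = LINE.fullmatch(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a reflog line")
    tz = m["tz"]
    offset = timedelta(hours=int(tz[1:3]), minutes=int(tz[3:5]))
    if tz[0] == "-":
        offset = -offset
    when = datetime.fromtimestamp(int(m["ts"]), tz=timezone(offset))
    return {"old": m["old"], "new": m["new"], "user": m["name"],
            "when": when.isoformat(sep=" "), "message": m["msg"]}

raw = ("0" * 40 + " 4ccb6d7b8869a86aae2e84c56523f8705b50c647 "
       "Maya <maya@example.com> 1759173408 -0400\tcommit (initial): Initial commit")
entry = parse_reflog_line(raw)
print(entry["new"][:7], entry["when"], entry["message"])
```

The all-zero "before" ID on the first line marks a reference that did not
exist yet.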
Thanks for the review.
- Julia