public inbox for git@vger.kernel.org 
From: "Julia Evans" <julia@jvns•ca>
To: "Junio C Hamano" <gitster@pobox•com>,
	"Julia Evans" <gitgitgadget@gmail•com>
Cc: git@vger•kernel.org,
	"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail•com>,
	"D. Ben Knoble" <ben.knoble@gmail•com>,
	"Patrick Steinhardt" <ps@pks•im>
Subject: Re: [PATCH v4] doc: add an explanation of Git's data model
Date: Tue, 28 Oct 2025 16:10:52 -0400
Message-ID: <5b078fae-6fe9-4fde-ba84-1070761c168b@app.fastmail.com>
In-Reply-To: <xmqqikg0f1tk.fsf@gitster.g>

>> +
>> +It's not necessary to understand Git's data model to use Git, but it's
>> +very helpful when reading Git's documentation so that you know what it
>> +means when the documentation says "object", "reference" or "index".
>
> "While it is not necessary ..., it is helpful ..." may flow better
> than "It is not necessary ..., but it is very helpful".
>
>> +This means that if you have an object's ID, you can always recover its
>> +exact contents as long as the object hasn't been deleted.
>
> Somewhere in a distant footnote, we may want to mention that objects
> that are in use are never deleted, and when unused objects get removed
> (i.e., garbage collection).  As part of the data model, "everything is
> retained by default, until we can prove it is no longer reachable"
> probably belongs somewhere.

Agreed, I really like this idea. Came up with the following, which I'll put at
the bottom of the "References" section if I don't come up with a better idea.
(I don't feel strongly about where exactly it should go):

NOTE: Objects will only be deleted if they aren't "reachable" from any reference.
An object is "reachable" if we can find it by following tags to whatever
they tag, commits to their parents or trees, and trees to the trees or
blobs that they contain.
For example, if you amend a commit with `git commit --amend`, the old
commit will usually no longer be reachable, so it may eventually be deleted.
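
To make the reachability rule in that note concrete, here is a minimal
sketch in Python. It is only an illustration, not how Git implements
garbage collection: the object IDs and the toy graph are made up, and it
ignores things like the reflog, which in real Git also keeps objects
alive for a while.

```python
# Toy object graph. Every ID here is invented for illustration.
# Each object maps to the IDs it points at:
#   tag -> what it tags, commit -> parents + tree, tree -> subtrees + blobs.
OBJECTS = {
    "tag-v1": ["commit-2"],
    "commit-2": ["commit-1", "tree-2"],
    "commit-2-old": ["commit-1", "tree-2-old"],  # the commit before --amend
    "commit-1": ["tree-1"],
    "tree-2": ["blob-a", "blob-b"],
    "tree-2-old": ["blob-a", "blob-c"],
    "tree-1": ["blob-a"],
    "blob-a": [],
    "blob-b": [],
    "blob-c": [],
}

# References point at objects; nothing points at "commit-2-old" any more.
REFS = {"refs/heads/main": "commit-2", "refs/tags/v1": "tag-v1"}


def reachable(objects, refs):
    """Return the set of object IDs reachable from any reference."""
    seen = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


live = reachable(OBJECTS, REFS)
unreachable = sorted(set(OBJECTS) - live)
print(unreachable)  # ['blob-c', 'commit-2-old', 'tree-2-old']
```

Note that `blob-a` survives even though the old commit's tree used it,
because the current commits still reach it.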

>> +Here's how each type of object is structured:
>> +
>> +[[commit]]
>> +commit::
>> +    A commit contains the full directory structure of every file
>> +    in that version of the repository and each file's contents.
>
> What you are describing here is more of the property of a tree; a
> commit is a bit richer.
>
>     A commit records a snapshot of every file in the project at
>     one point in time, records who contributed to create such a
>     snapshot and why, and how that particular snapshot relates to
>     other snapshots in the history.

I don't understand the goal of explaining a commit in detail in
paragraph form when we already explain everything in a commit right
below this.

My goal with this intro sentence is just to emphasize what I think is the
least obvious point in that list, which is that commits contain every file.

Happy to change it to something shorter like
"A commit records a snapshot of every file in the project" if you
prefer that wording.

>> +    It has these these required fields
>
> "these these".

Oops, will fix

>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new
>> +commit with the same parent.
>
> "same parent." -> "same parent, without modifying the original
> commit object at all"?  Maybe redundant?  I dunno.
>
>> +[[tree]]
>> +tree::
>> +    A tree is how Git represents a directory.
>
> "a directory" -> "contents in a directory"?  I dunno.
>
>> +    It can contain files or other trees (which are subdirectories).
>> +    It lists, for each item in the tree:
>> ++
>> +1. The *filename*, for example `hello.py`
>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> +  or <<commit,`commit`>> (a Git submodule, which is a
>> +  commit from a different Git repository)
>
> This is a bit of white lie.  A tree object entry never stores the
> type of the object.  It records <mode, object name, path component>.
>
> The second field you see in git ls-tree output is computed from the
> object name (when the object is available) or inferred from the mode
> bits.

Thanks, I didn't realize how tree object entries were stored.
Will remove "type".
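
For anyone else who didn't know this, here is a rough Python sketch of
the point above: a raw tree entry holds only a mode, a path component
and an object ID, and a "type" column can be inferred from the mode.
It assumes the SHA-1 object format (20-byte binary IDs) and uses
made-up object IDs; `parse_tree` and `kind_from_mode` are hypothetical
helper names, not anything from Git itself.

```python
import hashlib


def parse_tree(body: bytes):
    """Yield (mode, path, object_id_hex) from a raw SHA-1 tree object body.

    Each entry is laid out as: <mode in ASCII octal> SP <path> NUL <20-byte ID>.
    """
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        path = body[space + 1:nul].decode()
        oid = body[nul + 1:nul + 21].hex()
        yield mode, path, oid
        pos = nul + 21


def kind_from_mode(mode: str) -> str:
    """Infer what an entry is from its mode bits alone."""
    return {"40000": "tree", "160000": "commit"}.get(mode, "blob")


def entry(mode: str, path: str, oid: bytes) -> bytes:
    """Build one raw tree entry (used here only to create sample data)."""
    return mode.encode() + b" " + path.encode() + b"\0" + oid


# Made-up 20-byte IDs; they do not name real objects.
fake_blob = hashlib.sha1(b"made-up blob").digest()
fake_tree = hashlib.sha1(b"made-up tree").digest()
body = entry("100644", "hello.py", fake_blob) + entry("40000", "src", fake_tree)

entries = list(parse_tree(body))
for mode, path, oid in entries:
    print(mode, kind_from_mode(mode), oid, path)
```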

>> +3. The *file mode*. Git has these file modes, which are only
>> +   spiritually related to Unix permissions:
>
> In the cover letter part of the message I am responding to, I saw
> repeated mentions that "permissions" should be "file mode"; let's be
> consistent.
>
> "Git has these file modes, which are ..." -> 

Makes sense. Will change to "Unix file modes" from "Unix permissions".
I don't think this needs a more dramatic rewrite though.

>     Git uses the following file modes to represent what each tree
>     entry is (because an object of the same type, e.g. "blob", is
>     used to represent more than one kind of thing).  The file modes
>     are assigned to resemble Unix file modes.
>
>     Note that Git does not _store_ permissions, and there are only
>     two kinds of regular files; non-executable (100644) or
>     executable (100755).  To Git, there are no files that are
>     "readable only by the owner" etc., so file mode bits like
>     100600, 100400, etc., are never used.
>
>> +[[tag-object]]
>> +tag object::
>> +    Tag objects contain these required fields
>> +    (though there are other optional fields):
>> ++
>> +1. The *ID* and *type* of the object (often a commit) that they reference
>
> Not wrong per-se, but it is a bit curious to lump these two into a
> single enumerated item here, unlike "author" and "committer" were
> enumerated separately for commit objects.  If you are going to show
> "cat-file -p" output for illustration, it may help readers
> understand them if you list them separately here.

Agreed, I'll split them into two items.
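
As a quick illustration of why listing them separately lines up with
what readers see, here is a small Python sketch that splits the text of
a tag object into its header fields. The sample tag is invented (the
object ID, tagger and message are placeholders), and `parse_tag` is a
hypothetical helper name.

```python
# Invented example of a tag object's text; the ID and tagger are placeholders.
SAMPLE_TAG = """\
object 0123456789abcdef0123456789abcdef01234567
type commit
tag v1.0
tagger A U Thor <author@example.com> 1761682252 -0400

Example release
"""


def parse_tag(text: str):
    """Split a tag object's text into (header fields, message)."""
    header, _, message = text.partition("\n\n")
    # Each header line is "<field name> <value>".
    fields = dict(line.split(" ", 1) for line in header.splitlines())
    return fields, message


fields, message = parse_tag(SAMPLE_TAG)
print(fields["object"])  # the ID of the tagged object
print(fields["type"])    # the type of the tagged object
```

The `object` and `type` lines are separate header fields, which matches
listing the ID and the type as two items.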

>> +2. The *tagger* and tag date
>> +3. A *tag message*, similar to a commit message
>
>> +[[index]]
>> +THE INDEX
>> +---------
>> +The index, also known as the "staging area", is a list of files and
>> +the contents of each file, stored as a <<blob,blob>>.
>> +You can add files to the index or update the contents of a file in the
>> +index with linkgit:git-add[1]. This is called "staging" the file for commit.
>> +
>> +Unlike a <<tree,tree>>, the index is a flat list of files.
>
> This is a bit of white lie, as modern versions of Git could be
> collapsing uninteresting parts of the directory structure as a
> single tree in an index entry (this is called "sparse index"), and
> can expand such collapsed "tree" in the index on-demand into its
> constituent files and directories.  But I do not mind presenting the
> traditional world model for conceptual simplicity.

I didn't know that, thanks. I guess I'll leave it the way it is for now.
It could be good to add a footnote, but I don't actually know how
to add footnotes in this document format.

>> +When you commit, Git converts the list of files in the index to a
>> +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.
>> +
>> +Each index entry has 4 fields:
>> +
>> +1. The *<<tree,file mode>>*
>> +2. The *<<blob,blob>> ID* of the file
>
> If you were to collapse descriptions like you did for tag objects
> where ID and TYPE were treated as a unit, here is the place to do
> so.  With the mode bits and object ID, we can represent regular
> files that are non-executable, regular files that are executable,  
> symbolic links, and submodules (if a sparse-index is in use, an
> index entry could be a subdirectory, but I suggested above that we
> can ignore them for simplicity).
>
> But <<blob,blob>> is highly misleading.  Even if we ignore
> sparse-index, we may see a commit object there.

Thanks, I didn't realize that. Will change to say that it can be a blob
or commit ID. I don't think collapsing will help; IMO it's important
to keep a consistent format.

>     Each index entry records
>
>     1. The object that occupies the path, as (file mode, object
>        name) tuple.  Most often, it is a regular file whose contents
>        are stored in a blob object, that is either non-executable
>        (100644), executable (100755), or a symbolic link (120000),
>        but the object can be a commit in another repository if it
>        represents a submodule.
>
>     2. The stage number, which is normally 0, but entries with
>        higher stages for the same path are used during a conflicted
>        merge.
>
>     3. The path name for the index entry.
>
>> +3. The *file path*, for example `src/hello.py`
>> +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
>> +   there's a merge conflict there can be multiple versions of the same
>> +   filename in the index.
>
> If you are going by "ls-files -s" output, it may be better to swap 3
> and 4 above for ease of understanding.

Good point, will do.
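
With that order, the field list reads left to right like the
`git ls-files --stage` output. A minimal Python sketch of that mapping,
using the two sample lines from the patch; `IndexEntry` and
`parse_stage_line` are hypothetical names used only for this
illustration:

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str
    object_id: str
    stage: int
    path: str


def parse_stage_line(line: str) -> IndexEntry:
    """Parse one line shaped like `git ls-files --stage` output."""
    mode, object_id, rest = line.split(" ", 2)
    # Real output puts a tab between the stage and the path; splitting on
    # any whitespace also accepts the space-separated sample lines below.
    stage, path = rest.split(None, 1)
    return IndexEntry(mode, object_id, int(stage), path)


LINES = [
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md",
    "100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py",
]

index_entries = [parse_stage_line(line) for line in LINES]
for e in index_entries:
    print(e.mode, e.object_id, e.stage, e.path)
```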

>> +It's extremely uncommon to look at the index directly: normally you'd
>> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
>> +But you can use `git ls-files --stage` to see the index.
>> +Here's the output of `git ls-files --stage` in a repository with 2 files:
>> +
>> +----
>> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
>> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
>> +----
>> +
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Every time a branch, remote-tracking branch, or HEAD is updated, Git
>> +updates a log called a "reflog" for that <<references,reference>>.
>
> If we want to avoid using word X while explaining X, then we can
> rephrase it as "Git updates a record in the reflog for that
> reference".

I think the current phrasing is okay. I also skipped some of the
phrasing suggestions above where I didn't understand their goal.
Hope that's okay.
