From: "Julia Evans" <julia@jvns•ca>
To: "Junio C Hamano" <gitster@pobox•com>
Cc: "Julia Evans" <gitgitgadget@gmail•com>,
git@vger•kernel.org,
"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail•com>,
"D. Ben Knoble" <ben.knoble@gmail•com>,
"Patrick Steinhardt" <ps@pks•im>
Subject: Re: [PATCH v3] doc: add a explanation of Git's data model
Date: Thu, 16 Oct 2025 14:59:01 -0400
Message-ID: <03db91a6-148b-436f-8afa-0273a1f5d508@app.fastmail.com>
In-Reply-To: <xmqq347i948a.fsf@gitster.g>

On Thu, Oct 16, 2025, at 12:54 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns•ca> writes:
>
>>>> +[[tree]]
>>>> +trees::
>>>> + A tree is how Git represents a directory. It lists, for each item in
>>>> + the tree:
>>>> ++
>>>> +[[file-mode]]
>>>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>>>> + permissions, but Git's modes are much more limited. Git only supports these file modes:
>>>> ++
>>>> + - `100644`: regular file (with type `blob`)
>>>> + - `100755`: executable file (with type `blob`)
>>>> + - `120000`: symbolic link (with type `blob`)
>>>> + - `040000`: directory (with type `tree`)
>>>> + - `160000`: gitlink, for use with submodules (with type `commit`)
>>>
>>> It is not really "supporting" file modes. Rather, Git only records
>>> 5 kinds of entities associated with each path in a tree object, and
>>> uses numbers that remotely resemble POSIX file modes to represent
>>> these 5 kinds.
>>>
>>> Perhaps "supports" -> "uses"?
>>
>> "Uses" sounds good to me.
>
> Also "much more limited" is misleading. We only represent 5 kinds
> of things, so we use only 5 mode-bits-looking numbers.

What does it mislead the reader to think? My goal is to communicate that
if you want to tell Git to remember that a file's Unix permissions were
700, that's not possible.
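
As a rough illustration of the "700" case, here is a small Python sketch
of how a working-tree file's permissions collapse into one of the modes
listed above. It is a simplification for illustration, not Git's actual
code: `git_mode_for` is a made-up name, and the sketch assumes that only
the owner-execute bit matters for regular files and ignores submodules
and settings such as core.fileMode.

```python
import stat


def git_mode_for(st_mode: int) -> str:
    """Return the tree-entry mode for a working-tree st_mode (simplified)."""
    if stat.S_ISLNK(st_mode):
        return "120000"
    if stat.S_ISDIR(st_mode):
        return "040000"
    if stat.S_ISREG(st_mode):
        # Only the owner-execute bit is kept; all other permission
        # bits (group, other, read, write) are discarded.
        return "100755" if st_mode & 0o100 else "100644"
    raise ValueError("not a kind of entry recorded in a tree")


print(git_mode_for(0o100700))  # 100755
print(git_mode_for(0o100600))  # 100644
print(git_mode_for(0o100664))  # 100644
```

Under these assumptions, a file with permissions 700 and one with 755
end up with the same recorded mode, so the original permissions cannot
be recovered from the tree.
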
>>>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>>>> + or <<commit,`commit`>> (a Git submodule, which is a
>>>> + commit from a different Git repository)
>>>> +3. The <<object-id,*object ID*>>
>>>> +4. The *filename*
>>>
>>> Here it may be worth noting that this "filename" is a single
>>> pathname component (roughly, what you would see in non-recursive
>>> "ls"). In other words, it may be a directory name.
>
> Comments?

Oops, missed this in my first pass.

I looked at the man pages for a couple of commands ("mv", "cp"),
and it looks like it's normal to refer to files and directories jointly
as "files", or to refer to them as having a "file name". So I think it's okay
to call it a "file name" even if the "file" may be a directory.
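
For reference, the "single pathname component" point can be seen in the
raw bytes of a tree object. The Python sketch below parses a hand-built
tree body, assuming the SHA-1 object format (each entry is an ASCII mode,
a space, the name, a NUL byte, then a 20-byte binary object ID).
`parse_tree` and `sample_tree` are illustrative names, and the two object
IDs are the well-known IDs of the empty blob and the empty tree.

```python
def parse_tree(body: bytes):
    """Yield (mode, name, object_id_hex) for each entry in a raw tree body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", "surrogateescape")
        oid = body[nul + 1:nul + 21].hex()  # 20 raw bytes for SHA-1
        yield mode, name, oid
        pos = nul + 21


EMPTY_BLOB = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
EMPTY_TREE = bytes.fromhex("4b825dc642cb6eb9a060e54bf8d69288fbee4904")

# One file entry and one directory entry; directory modes are stored
# without the leading zero ("40000") in the raw format.
sample_tree = (
    b"100644 README.md\0" + EMPTY_BLOB
    + b"40000 src\0" + EMPTY_TREE
)

for mode, name, oid in parse_tree(sample_tree):
    print(mode, name, oid)
```

Each name here is one component ("README.md", "src"), never a path like
"src/main.c"; anything below "src" would live in the tree that the "src"
entry points to.
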
>>>> +[[blob]]
>>>> +blobs::
>>>> + A blob is how Git represents a file. A blob object contains the
>>>> + file's contents.
>>>
>>> "represents a file" hints as if the thing may know its name, but
>>> that is not the case (its name is given only by surrounding tree).
>>>
>>> "A blob is how Git represents uninterpreted series of bytes, and
>>> most commonly used to store file's contents." or something, perhaps?
>>
>> I'll say "A blob is how Git represents a file's contents", unless Git has
>> another use for blobs that I don't know about (I think it's not
>> that much of a stretch to say that a symbolic link is a special kind
>> of file where the "contents" are the link destination).
>
> A few configuration variables like mailmap.blob name a blob object,
> for which _only_ its contents, i.e., the sequence of bytes, matter
> and where they originally were stored does not matter.
>
> But we are falling into the area of tautology, as any sequence of
> bytes can be stored in a file so they can be called "contents of a
> file". But the point is that these bytes do not have to be stored
> to become a blob (think: "git hash-object -t blob -w --stdin").

I'm trying to think through what the goal of explaining the nature of
a "blob" is.

To me, describing blobs primarily as "bytes" makes it sound a bit like
"Git will treat this as opaque binary data; Git will not attempt to
interpret the contents of a blob in any way" (which is certainly true
for many blob storage systems!).

But it's not true that Git treats blobs as opaque binary data. Unlike
other blob storage systems, Git has diff and merge algorithms that
interpret the contents of the file to some extent and try to do useful
things with them.

Another goal we could have is to be clear that there are no limits on
what kind of files you can store in Git: you can store text files and
binary files equally well.
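
To make the "only the contents" point concrete, here is a minimal Python
sketch of how a blob's object ID is derived, assuming a SHA-1 repository.
`blob_id` is an illustrative name; the sketch mirrors what
`git hash-object --stdin` computes, but it does not write anything to an
object store.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to these bytes as a blob."""
    # The hashed input is a short header ("blob", the length, a NUL byte)
    # followed by the raw contents. No file name or mode is involved.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The input is just a byte string, text or binary, and it never has to
exist as a file on disk. Two files with identical contents map to the
same blob regardless of their names or modes.
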
>> I think it's always clearer to be more specific when possible, if there's only
>> one purpose for blobs it's unnecessary (and IMO a bit misleading, because
>> it makes the reader wonder if there are other purposes that they should
>> know about) to say that blobs can be used to store any arbitrary bytes for
>> any purpose.
>
> I do not think describing other use cases is unnecessary. Even if
> we limit ourselves to discuss a single purpose for blob, i.e. to
> represent the contents of a file, we should stress that blob is to
> store _only_ contents, and not other aspects of the file (e.g., in
> what paths with what mode), and that is where my reaction to "how
> Git represents a file" comes from.
I think it does make sense to say the blob stores only the contents,
though IMO that's fairly clear already since we've already explained
where the other parts of the file are stored by the time we get to
explaining "blob".
>>>> +[[branch]]
>>>> +branches: `refs/heads/<name>`::
>>>> + A branch is a name for a commit ID.
>>>
>>> Well a commit ID is an alternative way to refer to a commit object
>>> *name*, so it is a bit strange to say "a name for a commit ID".
>>>
>>> Perhaps "A branch ref stores a commit ID." is better?
>>
>> I think I'll leave this alone, none of the many test readers reported
>> being confused by it.
>
> Would a confused person report that they are confused? ;-)
Everyone leaving feedback gets a prompt, something like the one below,
asking them to categorize their feedback, and "I'm confused" is one of
the options:

https://jvns.ca/images/feedback-categories.png

I definitely got many "I'm confused" and "I have a question"
comments about other things that were confusing to readers.
>> I see that you don't like the "name for a commit ID" phrasing :)
>> Maybe there's another way to say it, though again none of the test
>> readers said they were confused by this or disagreed with the phrasing.
>
> Yes, I get that given "refs/heads/main", you want to say "main" is
> one of the ways to have repo_get_oid() to yield the commit object,
> and you are using "name" in that sense, but it is more like a ref
> can be used to name an object. It is *not* the name of the object,
> because the object can have other names, and more importantly, it
> (i.e., to give a name for an object) is not the only thing that a
> ref can do.
That's interesting. What else can a ref do other than give a name to
an object?
> And that is why I do not like that phrasing, combined
> with the target of giving that name is spelled "a commit ID". The
> commit ID is already another way to name the thing the refname can
> be also used to name: a commit object. A commit object and a commit
> object name are different things. The latter is a name that can
> refer to the former.
I'm curious about why it's important to you to make this distinction
between a commit ID and a commit object. To me the commit ID and the
commit object come as a package, since the commit ID is calculated from
the commit object.
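
The three things under discussion (a ref, a commit ID, and a commit
object) can be laid out side by side in a toy Python model. This is only
a sketch under stated assumptions: SHA-1 object IDs, an in-memory dict
standing in for the object database and for the ref store, and a made-up
commit (the author, timestamps, and message are invented; the tree ID is
the well-known empty tree). `object_id`, `objects`, and `refs` are
illustrative names, not Git APIs.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Compute a SHA-1 object ID from an object's type and contents."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# The commit object: the actual content.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1760000000 +0000\n"
    b"committer A U Thor <author@example.com> 1760000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)

# The commit ID: a name calculated from that content.
commit_id = object_id("commit", commit_body)
objects = {commit_id: commit_body}

# Refs: other names that store the commit ID.
refs = {"refs/heads/main": commit_id, "refs/heads/topic": commit_id}

for ref_name, oid in refs.items():
    print(ref_name, "->", oid[:12], "->", len(objects[oid]), "bytes")
```

In this model the commit ID is fixed by the object's bytes, while any
number of refs can store that same ID, and a ref can later be updated to
store a different one.
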
> And a ref can be used just like the latter to
> refer to the former (i.e. "commit object").
> By the way, I do like the way many of your responses are "will think
> about it more", not "I'll take your version".
>
> Very much appreciated.
I'm glad to hear that! It's a fun puzzle to figure out how to express
things clearly and accurately and concisely.
- Julia