public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: "brian m. carlson" <sandals@crustytoothpaste•net>
To: <git@vger•kernel.org>
Cc: Junio C Hamano <gitster@pobox•com>,
	Patrick Steinhardt <ps@pks•im>, Derrick Stolee <stolee@gmail•com>
Subject: [PATCH 4/9] docs: improve ambiguous areas of pack format documentation
Date: Fri, 19 Sep 2025 01:09:06 +0000	[thread overview]
Message-ID: <20250919010911.649831-5-sandals@crustytoothpaste.net> (raw)
In-Reply-To: <20250919010911.649831-1-sandals@crustytoothpaste.net>

It is fair to say that our pack and indexing code is quite complex.
Contributors who wish to work on this code or implementors of other
implementations would benefit from clear, unambiguous documentation
about how our data formats are structured and encoded and what data is
used in the computation of certain values.  Unfortunately, some of this
data is missing, which leads to confusion and frustration.

Let's document some of this data to help clarify things.  Specify over
what data CRC32 values are computed and also note which CRC32 algorithm
is used, since Wikipedia mentions at least four 32-bit CRC algorithms
and notes that it's possible to use different bit orderings.

In addition, note how we encode objects in the pack.  One might be led
to believe that packed objects are always stored with the "<type>
<size>\0" prefix of loose objects, but that is not the case, although
for obvious reasons this data is included in the computation of the
object ID.  Explain why this is for the curious reader.

Finally, indicate what the size field of the packed object represents.
Otherwise, a reader might think that the size of a delta is the size of
the full object or that it might contain the offset or object ID,
neither of which are the case.  Explain clearly, however, that the
values represent uncompressed sizes to avoid confusion.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste•net>
---
 Documentation/gitformat-pack.adoc | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/gitformat-pack.adoc b/Documentation/gitformat-pack.adoc
index d6ae229be5..9b7af5c184 100644
--- a/Documentation/gitformat-pack.adoc
+++ b/Documentation/gitformat-pack.adoc
@@ -32,6 +32,10 @@ In a repository using the traditional SHA-1, pack checksums, index checksums,
 and object IDs (object names) mentioned below are all computed using SHA-1.
 Similarly, in SHA-256 repositories, these values are computed using SHA-256.
 
+CRC32 checksums are always computed over the entire packed object, including
+the header (n-byte type and length); the base object name or offset, if any;
+and the entire compressed object.  The CRC32 algorithm used is that of zlib.
+
 == pack-*.pack files have the following format:
 
    - A header appears at the beginning and consists of the following:
@@ -80,6 +84,15 @@ Valid object types are:
 
 Type 5 is reserved for future expansion. Type 0 is invalid.
 
+=== Object encoding
+
+Unlike loose objects, packed objects do not have a prefix containing the type,
+size, and a NUL byte. These are not necessary because they can be determined by
+the n-byte type and length that prefixes the data and so they are omitted from
+the compressed and deltified data.
+
+The computation of the object ID still uses this prefix, however.
+
 === Size encoding
 
 This document uses the following "size encoding" of non-negative
@@ -92,6 +105,11 @@ values are more significant.
 This size encoding should not be confused with the "offset encoding",
 which is also used in this document.
 
+When encoding the size of an undeltified object in a pack, the size is that of
+the uncompressed raw object. For deltified objects, it is the size of the
+uncompressed delta.  The base object name or offset is not included in the size
+computation.
+
 === Deltified representation
 
 Conceptually there are only four object types: commit, tree, tag and

  parent reply	other threads:[~2025-09-19  1:09 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-19  1:09 [PATCH 0/9] SHA-1/SHA-256 interoperability, part 1 brian m. carlson
2025-09-19  1:09 ` [PATCH 1/9] docs: update pack index v3 format brian m. carlson
2025-09-19 22:08   ` Junio C Hamano
2025-09-20 15:23     ` brian m. carlson
2025-09-20 17:01       ` Junio C Hamano
2025-09-24  7:55   ` Patrick Steinhardt
2025-09-25 21:39     ` brian m. carlson
2025-09-19  1:09 ` [PATCH 2/9] docs: update offset order for pack index v3 brian m. carlson
2025-09-19  1:09 ` [PATCH 3/9] docs: reflect actual double signature for tags brian m. carlson
2025-09-19 22:34   ` Junio C Hamano
2025-09-20 15:29     ` brian m. carlson
2025-09-20 17:04       ` Junio C Hamano
2025-09-24  7:55       ` Patrick Steinhardt
2025-09-25 21:46         ` brian m. carlson
2025-09-19  1:09 ` brian m. carlson [this message]
2025-09-19 23:04   ` [PATCH 4/9] docs: improve ambiguous areas of pack format documentation Junio C Hamano
2025-09-19  1:09 ` [PATCH 5/9] docs: add documentation for loose objects brian m. carlson
2025-09-19 19:10   ` Junio C Hamano
2025-09-19 19:13     ` Junio C Hamano
2025-09-19 19:15       ` brian m. carlson
2025-09-19 20:18       ` Junio C Hamano
2025-09-24  7:55       ` Patrick Steinhardt
2025-09-25 21:40         ` brian m. carlson
2025-09-19 23:16   ` Junio C Hamano
2025-09-24  7:55   ` Patrick Steinhardt
2025-09-30 16:39     ` brian m. carlson
2025-09-19  1:09 ` [PATCH 6/9] rev-parse: allow printing compatibility hash brian m. carlson
2025-09-19 23:24   ` Junio C Hamano
2025-09-24  7:55   ` Patrick Steinhardt
2025-09-25 21:48     ` brian m. carlson
2025-09-19  1:09 ` [PATCH 7/9] fsck: consider gpgsig headers expected in tags brian m. carlson
2025-09-19 23:31   ` Junio C Hamano
2025-09-22 21:38     ` brian m. carlson
2025-09-19  1:09 ` [PATCH 8/9] Allow specifying compatibility hash brian m. carlson
2025-09-24  7:56   ` Patrick Steinhardt
2025-09-30 16:44     ` brian m. carlson
2025-09-19  1:09 ` [PATCH 9/9] t: add a prerequisite for a " brian m. carlson
2025-09-24  7:56   ` Patrick Steinhardt
2025-10-02 22:38 ` [PATCH v2 0/9] SHA-1/SHA-256 interoperability, part 1 brian m. carlson
2025-10-02 22:38   ` [PATCH v2 1/9] docs: update pack index v3 format brian m. carlson
2025-10-03 17:00     ` Junio C Hamano
2025-10-02 22:38   ` [PATCH v2 2/9] docs: update offset order for pack index v3 brian m. carlson
2025-10-02 22:38   ` [PATCH v2 3/9] docs: reflect actual double signature for tags brian m. carlson
2025-10-02 22:38   ` [PATCH v2 4/9] docs: improve ambiguous areas of pack format documentation brian m. carlson
2025-10-03 17:07     ` Junio C Hamano
2025-10-03 21:06       ` brian m. carlson
2025-10-02 22:38   ` [PATCH v2 5/9] docs: add documentation for loose objects brian m. carlson
2025-10-03 17:05     ` Junio C Hamano
2025-10-02 22:38   ` [PATCH v2 6/9] rev-parse: allow printing compatibility hash brian m. carlson
2025-10-02 22:38   ` [PATCH v2 7/9] fsck: consider gpgsig headers expected in tags brian m. carlson
2025-10-02 22:38   ` [PATCH v2 8/9] t: allow specifying compatibility hash brian m. carlson
2025-10-03 17:14     ` Junio C Hamano
2025-10-03 20:45       ` brian m. carlson
2025-10-02 22:38   ` [PATCH v2 9/9] t1010: use BROKEN_OBJECTS prerequisite brian m. carlson
2025-10-09 21:56 ` [PATCH v3 0/9] SHA-1/SHA-256 interoperability, part 1 brian m. carlson
2025-10-09 21:56   ` [PATCH v3 1/9] docs: update pack index v3 format brian m. carlson
2025-10-09 21:56   ` [PATCH v3 2/9] docs: update offset order for pack index v3 brian m. carlson
2025-10-09 21:56   ` [PATCH v3 3/9] docs: reflect actual double signature for tags brian m. carlson
2025-10-09 21:56   ` [PATCH v3 4/9] docs: improve ambiguous areas of pack format documentation brian m. carlson
2025-10-09 21:56   ` [PATCH v3 5/9] docs: add documentation for loose objects brian m. carlson
2025-10-09 21:56   ` [PATCH v3 6/9] rev-parse: allow printing compatibility hash brian m. carlson
2025-10-09 21:56   ` [PATCH v3 7/9] fsck: consider gpgsig headers expected in tags brian m. carlson
2025-10-09 21:56   ` [PATCH v3 8/9] t: allow specifying compatibility hash brian m. carlson
2025-10-09 21:56   ` [PATCH v3 9/9] t1010: use BROKEN_OBJECTS prerequisite brian m. carlson
2025-10-13 15:24   ` [PATCH v3 0/9] SHA-1/SHA-256 interoperability, part 1 Junio C Hamano
2025-10-13 16:34     ` brian m. carlson
2025-10-14  5:53       ` Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250919010911.649831-5-sandals@crustytoothpaste.net \
    --to=sandals@crustytoothpaste$(echo .)net \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=ps@pks$(echo .)im \
    --cc=stolee@gmail$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox