public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail•com>
To: "brian m. carlson" <sandals@crustytoothpaste•net>,
	git@vger•kernel.org, Jeff King <peff@peff•net>,
	Taylor Blau <me@ttaylorr•com>, Patrick Steinhardt <ps@pks•im>,
	Jonathan Nieder <jrnieder@gmail•com>
Subject: Re: Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode
Date: Fri, 15 Aug 2025 11:27:45 -0400	[thread overview]
Message-ID: <f61c4070-3d29-465e-8dbf-55c4562c0342@gmail.com> (raw)
In-Reply-To: <aJ03RTHaE_JvHA1t@fruit.crustytoothpaste.net>

On 8/13/2025 9:09 PM, brian m. carlson wrote:
> TL;DR: We need a different datastore than a flat file for storing
> mappings between SHA-1 and SHA-256 in compatibility mode.  Advice and
> opinions sought.
...> Our approach for mapping object IDs between algorithms uses data in pack
> index v3 (outlined in the transition document), plus a flat file called
> `loose-object-idx` for loose objects.  However, we didn't anticipate
> that we'd need to handle mappings long-term for data that is neither a
> loose object nor a packed object.

I'm generally not a fan of this approach to (ab)use the pack index format
for this, especially when the translation needs to expand beyond "objects
in the local repo".

The requirements, as I see them, are:

1. Given an OID in Hash1, load its mapped OID in Hash2 in O(log N) time.
2. Given an OID in Hash2, load its mapped OID in Hash1 in O(log N) time.
3. As OID pairs are discovered, add them to the data structure in ~O(new)
   time.

> Some rough ideas of what this could look like:
> 
> * We could repurpose the top-bit of the pack order value in pack index
>   v3 to indicate an object that's not in the pack (this would limit us
>   to 2^31 items per pack).
> * We could put this in new entries in multi-pack index and require that
>   (although I'm not sure that I love the idea of requiring multi-pack
>   index in all repositories and I have yet to implement compatibility
>   mode there).
> * We could write some sort of quadratic rollup format like reftable.

My thought is that the last option is going to be best. It does require
starting a new file format from scratch, but it doesn't need to be
complicated:

* Header information includes:
	- file version info.
	- hash versions in mapping.
	- the number of OIDs in the format
	- the previous mapping file(s) in the chain
	- offsets to the Hash1 and Hash2 tables.
	- room for expansion to other data being added to the format,
	  as necessary in the future.
* Hash1 table has a lex-ordered list of Hash1 OIDs and int IDs to do
  lookups of the mapped Hash2 OIDs from the second table (by position).
* Hash2 table has a lex-ordered list of Hash2 OIDs and int IDs to do
  lookups of the mapped Hash1 OIDs from the first table (by position).

Lookup time would be O(L * log N) where L is the number of layers in
the collection of files. Writing time could be as low as the size of
a new layer on top, with squashing of layers handled in the background
or in the foreground (opportunistically for small layers or as needed
if background maintenance is not available).

I'm sure that things are more complicated than I'm making it out to
be in this email. I haven't looked at your branch to see the subtle
details around this. Hopefully this just gives you ideas that you
can use as you compare options.

Thanks,
-Stolee


  parent reply	other threads:[~2025-08-15 15:27 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-14  1:09 Efficiently storing SHA-1 ↔ SHA-256 mappings in compatibility mode brian m. carlson
2025-08-14 14:22 ` Junio C Hamano
2025-08-14 22:06   ` brian m. carlson
2025-08-14 22:51     ` Junio C Hamano
2025-08-15 15:27 ` Derrick Stolee [this message]
2025-09-03  6:43   ` Patrick Steinhardt
2025-08-27 19:08 ` Eric Wong
2025-08-28 14:53   ` Junio C Hamano
2025-08-28 21:43   ` brian m. carlson
2025-08-29 19:51     ` Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f61c4070-3d29-465e-8dbf-55c4562c0342@gmail.com \
    --to=stolee@gmail$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=jrnieder@gmail$(echo .)com \
    --cc=me@ttaylorr$(echo .)com \
    --cc=peff@peff$(echo .)net \
    --cc=ps@pks$(echo .)im \
    --cc=sandals@crustytoothpaste$(echo .)net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox