* proposal: delta based git archival
@ 2005-04-22 9:03 Michel Lespinasse
2005-04-22 9:12 ` Jeffrey E. Hundstad
2005-04-22 9:49 ` Jaime Medrano
0 siblings, 2 replies; 3+ messages in thread
From: Michel Lespinasse @ 2005-04-22 9:03 UTC (permalink / raw)
To: git
I noticed people on this mailing list start talking about using blob deltas
for compression, and the basic issue that the resulting files are too small
for efficient filesystem storage. I thought about this a little and decided
I should send out my ideas for discussion.
In my proposal, the current git object storage model (one compressed object
per file) remains the primary storage mechanism; however, there would be
some kind of backup mechanism based on multiple deltas grouped in one file.
For example, suppose you're looking for an object with a hash of
eab75ce51622aa312bb0b03572d43769f420c347
First you'd look at .git/objects/ea/b75ce51622aa312bb0b03572d43769f420c347 -
if the file exists, that's your object.
If the file does not exist, you'd then look for .git/deltas/ea/b,
.git/deltas/ea/b7, .git/deltas/ea/b75, .git/deltas/ea/b75c, ...
up to some maximum search path length. You stop at the first file you can
find.
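A minimal sketch of the lookup order described above, in Python. The function names (candidate_delta_paths, locate) and the max_prefix_len default are illustrative assumptions, not part of git or of the proposal; only the directory layout comes from the text.

```python
import os

def candidate_delta_paths(git_dir, sha1, max_prefix_len=8):
    """Yield delta-file paths from the shortest prefix to the longest.

    max_prefix_len is a hypothetical bound on the search path length.
    """
    fan, rest = sha1[:2], sha1[2:]
    for n in range(1, max_prefix_len + 1):
        yield os.path.join(git_dir, "deltas", fan, rest[:n])

def locate(git_dir, sha1, max_prefix_len=8):
    """Return ("object", path), ("delta", path) or (None, None)."""
    # Primary storage: one file per object.
    obj = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    if os.path.isfile(obj):
        return ("object", obj)
    # Backup storage: stop at the first delta file that exists.
    for path in candidate_delta_paths(git_dir, sha1, max_prefix_len):
        if os.path.isfile(path):
            return ("delta", path)
    return (None, None)
```

For the hash in the example, the search would try deltas/ea/b, then deltas/ea/b7, and so on, only after objects/ea/b75ce5... is found to be missing.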
Supposing that file is .git/deltas/ea/b7, it would contain a diff
(let's assume unified format for now, though ideally it'd be better to
have something that allows binary file deltas too) of many archived
objects with hashes starting with eab7, compared to a different object
(presumably some direct or indirect ancestor):
diff -u 8f5ba0203e31204c5c052d995a5b4449226bcfb5 eab75ce51622aa312bb0b03572d43769f420c347
--- 8f5ba0203e31204c5c052d995a5b4449226bcfb5
+++ eab75ce51622aa312bb0b03572d43769f420c347
@@ -522,7 +522,7 @@
....
diff -u 77dc2cb94930017f62b55b9706cbadda8c90f650 eab71c51dbc62797d6c903203de44cc6a734c05c
--- 77dc2cb94930017f62b55b9706cbadda8c90f650
+++ eab71c51dbc62797d6c903203de44cc6a734c05c
@@ -560,13 +563,17 @@
...
Based on this delta file, we'd then look for the object
8f5ba0203e31204c5c052d995a5b4449226bcfb5 (this process could require
recursively rebuilding that object) and try to build
eab75ce51622aa312bb0b03572d43769f420c347 by applying the delta and then
double checking the hash.
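The rebuild step could be sketched as follows. This is only an illustration under loud assumptions: the stores are in-memory dicts, the object id is taken to be a plain SHA-1 of the content (git's actual hashing rules may differ), and the delta is a toy line-based copy/insert list built with difflib rather than a unified diff. All function names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

def object_id(data):
    # Assumption: plain SHA-1 of the raw content.
    return hashlib.sha1(data).hexdigest()

def make_delta(base, target):
    """Toy line-based delta: copy ranges from base, or insert new lines."""
    a = base.splitlines(keepends=True)
    b = target.splitlines(keepends=True)
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        else:
            # replace/insert/delete: emit the target's lines (possibly none)
            ops.append(("insert", b[j1:j2]))
    return ops

def apply_delta(base, ops):
    a = base.splitlines(keepends=True)
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(a[op[1]:op[2]])
        else:
            out.extend(op[1])
    return b"".join(out)

def rebuild(sha, objects, deltas, depth=0, max_depth=50):
    """objects: sha -> bytes; deltas: sha -> (base_sha, ops)."""
    if sha in objects:
        return objects[sha]
    if sha not in deltas or depth >= max_depth:
        raise KeyError(sha)
    base_sha, ops = deltas[sha]
    # The base may itself be stored only as a delta, hence the recursion.
    base = rebuild(base_sha, objects, deltas, depth + 1, max_depth)
    data = apply_delta(base, ops)
    # Double check the hash before trusting the rebuilt object.
    if object_id(data) != sha:
        raise ValueError("hash mismatch for " + sha)
    return data
```

The max_depth bound is an added guard against very long or cyclic delta chains; the proposal itself does not specify one.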
To me the strengths of this proposal would be:
* It does not muddy the git object model - it just acts independently of it,
as a way to rebuild git objects from deltas
* Old objects can be compressed by creating a delta with a close ancestor,
then erasing the original file storage for that object. The object delta
can be appended to an existing delta file (which avoids the small-file
storage issue), or if the delta file gets too big, it can be split off
into 16 smaller files based on the hashes of the objects this file stores
deltas for.
* The system is flexible enough to explore different delta
strategies. For example one could decide to keep one object every 10
in the database and store the other 9 as deltas based on the immediate
object ancestor, or any other tradeoff - and the system would still
work the same (with different performance tradeoffs though).
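Two of the points above can be illustrated with a short Python sketch: splitting an oversized delta file into up to 16 files keyed by the next hex digit, and the "keep one object in 10" policy. The names split_bucket and plan_storage, and the dict-based representation of a delta file, are assumptions made for the example only.

```python
def split_bucket(prefix, entries):
    """Split one delta file into smaller ones, one per next hex digit.

    prefix:  full hex prefix the file covers, e.g. "eab7"
    entries: target sha -> delta record (opaque here)
    Returns: longer prefix -> entries for that prefix.
    """
    out = {}
    for sha, record in entries.items():
        if not sha.startswith(prefix) or len(sha) <= len(prefix):
            raise ValueError("entry does not belong in this file: " + sha)
        child = sha[:len(prefix) + 1]
        out.setdefault(child, {})[sha] = record
    return out

def plan_storage(history, keep_every=10):
    """history: object ids, oldest first.

    Returns id -> None (kept as a whole object) or the id of the
    immediate ancestor it would be stored as a delta against.
    """
    plan = {}
    for i, sha in enumerate(history):
        plan[sha] = None if i % keep_every == 0 else history[i - 1]
    return plan
```

With the two hashes from the example delta file, split_bucket("eab7", ...) would place one entry under eab75 and the other under eab71, which correspond to deltas/ea/b75 and deltas/ea/b71 in the layout above.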
Does this sound insane ? Too complicated maybe ?
Is there any kind of semi-standard binary-capable multiple-file diff format
that could be used for this application instead of unified diffs ?
--
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller
* Re: proposal: delta based git archival
From: Jeffrey E. Hundstad @ 2005-04-22 9:12 UTC (permalink / raw)
To: Michel Lespinasse; +Cc: git
Michel Lespinasse wrote:
>Does this sound insane ? Too complicated maybe ?
>
>
My vote is YES on both counts.
Simplicity and flexibility are what make git a good thing; and imho this
works against that quite aggressively.
--
Jeffrey Hundstad
* Re: proposal: delta based git archival
From: Jaime Medrano @ 2005-04-22 9:49 UTC (permalink / raw)
To: Michel Lespinasse; +Cc: git
On 4/22/05, Michel Lespinasse <walken@zoy•org> wrote:
> I noticed people on this mailing list start talking about using blob deltas
> for compression, and the basic issue that the resulting files are too small
> for efficient filesystem storage. I thought about this a little and decided
> I should send out my ideas for discussion.
>
I've been thinking about another, simpler approach.
The main benefit of using deltas is reducing the bandwidth used in
pull/push. My idea is to leave the blob storage as it is now and
add a new kind of object (remote) that acts as a link to an object
in another repository.
So that, when you rsync, you don't have to get all the blobs (which
can be a lot of data), but only the sha1 of the new objects created.
Then a remote object is created for each new object in the local
repository pointing to its location in the external repository.
Once the rsync is done, when git has to access any of the new objects
they can be fetched from the original location, so that only necessary
objects are transferred.
This way, the cost of a sync in terms of bandwidth is nearly zero.
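A rough Python sketch of the remote-object idea, as an illustration only. The LazyStore class, its method names, and the fetch callable are all hypothetical; the unfinished patch mentioned in this message may work quite differently.

```python
class LazyStore:
    """Local blobs plus 'remote' stubs that are resolved on first access."""

    def __init__(self, fetch):
        self.blobs = {}      # sha -> content held locally
        self.remotes = {}    # sha -> location of the external repository
        self.fetch = fetch   # hypothetical callable: (location, sha) -> bytes

    def sync(self, new_shas, location):
        # Only the ids travel during a sync; each becomes a remote stub.
        for sha in new_shas:
            if sha not in self.blobs:
                self.remotes[sha] = location

    def get(self, sha):
        if sha in self.blobs:
            return self.blobs[sha]
        # Raises KeyError if the object is neither local nor a known remote.
        location = self.remotes[sha]
        data = self.fetch(location, sha)
        # Keep the content locally and drop the stub.
        self.blobs[sha] = data
        del self.remotes[sha]
        return data
```

In this model the transfer cost is deferred rather than removed: each object is still downloaded in full the first time it is read.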
I've been working on this, so if you think it to be a good idea, I can
send a patch when I get it fully working.
Regards,
Jaime Medrano.
http://jmedrano.sl-form.com