Re: Resolving deltas dominates clone time - Ævar Arnfjörð Bjarmason

public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail•com>
To: Jeff King <peff@peff•net>
Cc: Martin Fick <mfick@codeaurora•org>,
	Git Mailing List <git@vger•kernel.org>, "Jansen\,
	Geert" <gerardu@amazon•com>,
	Jonathan Tan <jonathantanmy@google•com>
Subject: Re: Resolving deltas dominates clone time
Date: Mon, 22 Apr 2019 20:01:15 +0200	[thread overview]
Message-ID: <874l6pudg4.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20190422155716.GA9680@sigill.intra.peff.net>


On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
>> *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do anyway, as reported by Geert at @amazon.com preceding
that patch of mine. But it is 99.99% irrelevant to *performance* in this
case after the loose object cache you added (but before that could make
all the difference depending on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoias/validation the user might be
willing to optionally drop just at "clone" time.

> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.

> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone-time, but the clone-to-repo-mostly-usable time.

>   1. The server won't use reachability bitmaps when serving the
>      follow-up fetch (because shallowness invalidates the reachability
>      data they're caching), so it will spend much more time in the
>      "Counting objects" phase.
>
>   2. The server has to throw away some deltas. Imagine version X of a
>      file in the tip commit is stored as a delta against version Y in
>      that commit's parent. The initial clone has to throw away the
>      on-disk delta of X and send you the whole object (because you are
>      not requesting Y at all). And then in the follow-up fetch, it must
>      either send you Y as a base object (wasting bandwidth), or it must
>      on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
around for such a purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.

> This could even be stuffed into the pack format and stripped out by
> the receiving index-pack (i.e., each entry is prefixed with "and by
> the way, here is my sha1...").

That would be really interesting. I.e. just having room for that (or
anything else) in the pack format.

I wonder if it could be added to the delta-chain in the current format
as a nasty hack :)

>> We could also defer that validation instead of skipping it. E.g. wget
>> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
>> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
>> periodically to detect repository bitflips.
>>
>> Or, do some "narrow" validation of such an *.idx file right
>> away. E.g. for all the trees/blobs required for the current checkout,
>> and background the rest.
>
> The "do we have all of the objects we need" is already separate from
> "figure out the sha1 of each object", so I think you'd get that
> naturally if you just took in an untrusted .idx (it also demonstrates
> that any .idx cost is really focusing on blobs, because the "do we have
> all objects" check is going to decompress every commit and tree in the
> repo anyway).
>
> -Peff

next prev parent reply	other threads:[~2019-04-22 18:01 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason [this message]
2019-04-22 18:43         ` Jeff King
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874l6pudg4.fsf@evledraar.gmail.com \
    --to=avarab@gmail$(echo .)com \
    --cc=gerardu@amazon$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=jonathantanmy@google$(echo .)com \
    --cc=mfick@codeaurora$(echo .)org \
    --cc=peff@peff$(echo .)net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox