From: Jeff King <peff@peff•net>
To: Junio C Hamano <gitster@pobox•com>
Cc: Patrick Steinhardt <ps@pks•im>,
git@vger•kernel.org, Taylor Blau <me@ttaylorr•com>
Subject: Re: [PATCH] packfile: avoid access(3p) calls for missing packs
Date: Mon, 19 May 2025 02:52:21 -0400 [thread overview]
Message-ID: <20250519065221.GC102701@coredump.intra.peff.net> (raw)
In-Reply-To: <xmqq7c2gv1zx.fsf@gitster.g>
On Fri, May 16, 2025 at 11:34:10AM -0700, Junio C Hamano wrote:
> > The reason why we do this is likely because we reuse `p->pack_name` to
> > derive the other files' names, as well, so that we only have to allocate
> > this buffer once. As such, we have to compute the packfile name last so
> > that it doesn't get overwritten by the other names.
>
> I vaguely recall that the order of checks between .idx and .pack
> were deliberately chosen, as we do not want to add and attempt to
> use .pack file before its associated .idx file is ready since
> without the latter the former is unusable at runtime.
It is definitely intentional and important that we discover packs by
their .idx file first. The writing side uses the creation of the .idx
file (via atomic rename) as the final step in making a pack accessible.
Before that, a reader could see a .pack without its various accessory
files (and so not realize it has a .keep file, for example).
> Side note: It may have been more important back when the name of
> the packfile was a hash of names of the objects in the pack,
> which meant that when you repack a quiescent and fully repacked
> repository twice with different parameters, you could have ended
> up with a packfile whose name is the same but the contents as a
> bytestream (hence the object offset) are different, making a
> stale .idx not pointing into the correct position in the new
> .pack file. These days the name of the packfile is based on the
> contents of the pack as a bytestream, so we no longer suffer
> in such a scenario.
I don't think reading the .idx file first was sufficient to protect us
there, since the .pack could be replaced racily after the reader opened
the .idx. I don't recall if we were subject to that race back then (and
it just didn't come up enough to worry about), or if repack would throw
away an identically-named file. Anyway, as you note, it's no longer an
issue.
But I think that is mostly orthogonal to how the auxiliary files are
handled. I think Patrick's guess is correct that the order we have is
simply because it was most convenient to end up with ".pack" in the
name.
> > All of this likely doesn't matter in the general case: Git shouldn't end
> > up looking too many nonexistent packfiles anyway.
>
> In any case, we shouldn't attempt to use .pack when its associated
> files are missing. It is not about nonexistent but about incomplete
> (including "in the process of being written").
This shouldn't be a race for files being written. The .idx is written
last, so the .pack should always be there. And the midx is written after
that, so again, the pack should always be there. At least for newly
written packs.
We might see the opposite: packs going away because they were repacked
(and perhaps even a new midx generated). In that case you might see
a race like:
- process R opens the existing midx, which mentions pack P (but does
not yet open P)
- process W repacks, deleting P and generating a new midx with P'
- process R wants object X; the midx claims it is in pack P, so it
tries to open that but fails
What should happen here is that R will fail to find the object at all,
should hit reprepare_packed_git(), and will then pick up the new pack.
I don't recall offhand whether reprepare_packed_git() will open the new
midx. If it doesn't, we'd still find the object via the actual pack
.idx, but we may end up consulting the now-stale midx over and over (so
we'll get the right answer, but there may be some room for
optimization).
> > while the issue was entirely self-made because the multi-pack index
> > should have been regenerated, we can still reduce the number of syscalls
> > by 75% in the case of nonexistent packfiles by reordering these calls.
>
> That sounds more like a band-aid, if we still do the remaining 25%
> that we somehow know would be unnecessary.
Yeah, that was my gut reaction, too. add_packed_git() is not
optimized[1] at all. But it shouldn't have to be, because it's meant to
be called once per process. Even with the re-ordering, we'd still make a
bunch of pointless stat() calls for the .pack file we know is not there.
The code in prepare_midx_pack() converts a numeric pack id (referenced
inside the midx) into a "struct packed_git" pointer, caching the results
in multi_pack_index->packs. That field holds NULL for "we have not
looked it up yet" or a valid pointer to a packed_git. It probably needs
to hold a third state: "we tried and failed".
Something like this (large untested) patch:
diff --git a/midx.c b/midx.c
index 3d0015f782..354b1f886c 100644
--- a/midx.c
+++ b/midx.c
@@ -405,7 +405,7 @@ void close_midx(struct multi_pack_index *m)
munmap((unsigned char *)m->data, m->data_len);
for (i = 0; i < m->num_packs; i++) {
- if (m->packs[i])
+ if (m->packs[i] && m->packs[i] != (void *)(intptr_t)-1)
m->packs[i]->multi_pack_index = 0;
}
FREE_AND_NULL(m->packs);
@@ -458,6 +458,8 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
pack_int_id = midx_for_pack(&m, pack_int_id);
+ if (m->packs[pack_int_id] == (void *)(intptr_t)-1)
+ return 1;
if (m->packs[pack_int_id])
return 0;
@@ -482,8 +484,10 @@ int prepare_midx_pack(struct repository *r, struct multi_pack_index *m,
strbuf_release(&pack_name);
strbuf_release(&key);
- if (!p)
+ if (!p) {
+ m->packs[pack_int_id] = (void *)(intptr_t)-1;
return 1;
+ }
p->multi_pack_index = 1;
m->packs[pack_int_id] = p;
-Peff
[1] There probably are optimization opportunities in add_packed_git(). I
don't think re-ordering will help much in the common case that we
actually do have the pack. But really, most callers do not care
about these auxiliary files at all! We could simply skip them during
the initial setup and lazy-load them via accessor functions.
I _think_ that should be OK with respect to races. For a newly added
pack, we know they will always be in place before the matching .idx
file (per the logic I outlined above). For a pack that goes away, we
might racily fail to see its auxiliary file. But that is mostly true
now (we might see its .idx, and then the .promisor file is deleted
before we call access()). It does increase the size of that window,
though (and in particular lets it happen even if the pack has
actually been opened).
I'm not sure how much that would matter in practice. OTOH, I'm not
sure that saving a few access() calls would be all that meaningful,
unless you have a zillion packs. And if you have a zillion packs,
you're likely to get much bigger speedups in other areas by
repacking them anyway. So it hasn't really been an area I've
pursued before.
next prev parent reply other threads:[~2025-05-19 6:52 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-16 8:55 [PATCH] packfile: avoid access(3p) calls for missing packs Patrick Steinhardt
2025-05-16 18:34 ` Junio C Hamano
2025-05-19 6:52 ` Jeff King [this message]
2025-05-19 15:46 ` Junio C Hamano
2025-05-20 6:45 ` Patrick Steinhardt
2025-05-22 5:28 ` Jeff King
2025-05-23 1:02 ` Taylor Blau
2025-05-23 2:03 ` Jeff King
2025-05-20 9:53 ` [PATCH v2 0/2] " Patrick Steinhardt
2025-05-20 9:53 ` [PATCH v2 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-23 1:03 ` Taylor Blau
2025-05-20 9:53 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-22 5:32 ` Jeff King
2025-05-22 15:47 ` Junio C Hamano
2025-05-22 16:59 ` Jeff King
2025-05-22 18:44 ` Junio C Hamano
2025-05-23 1:22 ` Taylor Blau
2025-05-23 2:08 ` Jeff King
2025-05-23 17:46 ` Taylor Blau
2025-05-25 18:41 ` [PATCH 0/5] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-25 18:41 ` [PATCH 1/5] pack-bitmap.c: fix broken warning() when missing MIDX'd pack Taylor Blau
2025-05-26 7:23 ` Patrick Steinhardt
2025-05-28 2:00 ` Taylor Blau
2025-05-25 18:41 ` [PATCH 2/5] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-26 7:23 ` Patrick Steinhardt
2025-05-28 2:08 ` Taylor Blau
2025-05-25 18:41 ` [PATCH 3/5] midx-write.c: simplify fill_packs_from_midx() Taylor Blau
2025-05-26 7:23 ` Patrick Steinhardt
2025-05-28 2:15 ` Taylor Blau
2025-05-25 18:42 ` [PATCH 4/5] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-25 18:42 ` [PATCH 5/5] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-26 7:24 ` Patrick Steinhardt
2025-05-28 2:18 ` Taylor Blau
2025-05-28 11:53 ` Patrick Steinhardt
2025-05-28 22:58 ` [PATCH v2 0/4] midx: improve prepare_midx_pack() ergonomics Taylor Blau
2025-05-28 22:59 ` [PATCH v2 1/4] midx: access pack names through `nth_midxed_pack_name()` Taylor Blau
2025-05-29 20:47 ` Junio C Hamano
2025-06-03 22:22 ` Taylor Blau
2025-05-29 20:51 ` Junio C Hamano
2025-06-03 22:23 ` Taylor Blau
2025-05-28 22:59 ` [PATCH v2 2/4] midx-write.c: guard against incremental MIDXs in want_included_pack() Taylor Blau
2025-05-28 22:59 ` [PATCH v2 3/4] midx-write.c: extract inner loop from fill_packs_from_midx() Taylor Blau
2025-05-28 22:59 ` [PATCH v2 4/4] midx: return a `packed_git` pointer from `prepare_midx_pack()` Taylor Blau
2025-05-30 6:50 ` Jeff King
2025-06-03 22:27 ` Taylor Blau
2025-08-28 23:25 ` Junio C Hamano
2025-05-23 1:31 ` [PATCH v2 2/2] midx: stop repeatedly looking up nonexistent packfiles Taylor Blau
2025-05-23 2:18 ` Jeff King
2025-05-21 13:24 ` [PATCH v2 0/2] packfile: avoid access(3p) calls for missing packs Junio C Hamano
2025-05-28 12:24 ` [PATCH v3 " Patrick Steinhardt
2025-05-28 12:24 ` [PATCH v3 1/2] packfile: explain ordering of how we look up auxiliary pack files Patrick Steinhardt
2025-05-28 12:24 ` [PATCH v3 2/2] midx: stop repeatedly looking up nonexistent packfiles Patrick Steinhardt
2025-05-30 6:27 ` [PATCH v3 0/2] packfile: avoid access(3p) calls for missing packs Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250519065221.GC102701@coredump.intra.peff.net \
--to=peff@peff$(echo .)net \
--cc=git@vger$(echo .)kernel.org \
--cc=gitster@pobox$(echo .)com \
--cc=me@ttaylorr$(echo .)com \
--cc=ps@pks$(echo .)im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox