From: Taylor Blau <me@ttaylorr•com>
To: Jeff King <peff@peff•net>
Cc: git@vger•kernel.org, Junio C Hamano <gitster@pobox•com>,
Elijah Newren <newren@gmail•com>, Patrick Steinhardt <ps@pks•im>
Subject: Re: [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init()
Date: Fri, 17 Jan 2025 16:30:05 -0500 [thread overview]
Message-ID: <Z4rL3TbEeR8EiUOi@nand.local> (raw)
In-Reply-To: <Z4GWIZkJOUa278VA@nand.local>
On Fri, Jan 10, 2025 at 04:50:25PM -0500, Taylor Blau wrote:
> On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> > So in the new constructor:
> >
> > > +void hashfile_checkpoint_init(struct hashfile *f,
> > > + struct hashfile_checkpoint *checkpoint)
> > > +{
> > > + memset(checkpoint, 0, sizeof(*checkpoint));
> > > + f->algop->init_fn(&checkpoint->ctx);
> > > +}
> >
> > ...should we actually record "f" itself? And then in the existing
> > functions:
> >
> > > void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
> >
> > ...they'd no longer need to take the extra parameter.
> >
> > It creates a lifetime dependency of the checkpoint struct on the "f" it
> > is checkpointing, but I think that is naturally modeling the domain.
>
> Thanks, I really like these suggestions. I adjusted the series
> accordingly to do this cleanup in two patches (one for
> hashfile_checkpoint(), another for hashfile_truncate()) after the patch
> introducing hashfile_checkpoint_init().
Hmm. I'm not sure that I like this as much as I thought I did.
I agree with you that ultimately the hashfile_checkpoint is (or should
be) tied to the lifetime of the hashfile that it is checkpointing
underneath. But in practice things are a little funky.
Let's suppose I did something like the following:
--- 8< ---
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef7..47b8317a1f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,15 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
return hashfd_internal(fd, name, tp, 8 * 1024);
}
+void hashfile_checkpoint_init(struct hashfile *f,
+ struct hashfile_checkpoint *checkpoint)
+{
+ memset(checkpoint, 0, sizeof(*checkpoint));
+
+ checkpoint->f = f;
+ checkpoint->f->algop->init_fn(&checkpoint->ctx);
+}
+
void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
{
hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a..8016509c71 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,8 +34,10 @@ struct hashfile {
struct hashfile_checkpoint {
off_t offset;
git_hash_ctx ctx;
+ struct hashfile *f;
};
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
--- >8 ---
, where I'm eliding the trivial changes necessary to make this work at
the two callers. Let's look a little closer at the bulk-checkin caller.
If I do this on top:
--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
git_hash_ctx ctx;
unsigned char obuf[16384];
unsigned header_len;
- struct hashfile_checkpoint checkpoint = {0};
+ struct hashfile_checkpoint checkpoint;
struct pack_idx_entry *idx = NULL;
seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
OBJ_BLOB, size);
the_hash_algo->init_fn(&ctx);
the_hash_algo->update_fn(&ctx, obuf, header_len);
- the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
/* Note: idx is non-NULL when we are writing */
- if ((flags & HASH_WRITE_OBJECT) != 0)
+ if ((flags & HASH_WRITE_OBJECT) != 0) {
CALLOC_ARRAY(idx, 1);
+ prepare_to_stream(state, flags);
+ hashfile_checkpoint_init(state->f, &checkpoint);
+ }
+
already_hashed_to = 0;
while (1) {
--- >8 ---
then test 14 in t1050-large.sh fails because of a segfault in 'git add'.
If we compile with SANITIZE=address, we can see that there's a
use-after-free in hashflush(), which is called by hashfile_checkpoint().
That is a result of the max pack-size code. So we could try something
like:
--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7b8a6eb2df..9dc114d132 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
git_hash_ctx ctx;
unsigned char obuf[16384];
unsigned header_len;
- struct hashfile_checkpoint checkpoint;
+ struct hashfile_checkpoint checkpoint = { 0 };
struct pack_idx_entry *idx = NULL;
seekback = lseek(fd, 0, SEEK_CUR);
@@ -274,17 +274,14 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
the_hash_algo->update_fn(&ctx, obuf, header_len);
/* Note: idx is non-NULL when we are writing */
- if ((flags & HASH_WRITE_OBJECT) != 0) {
+ if ((flags & HASH_WRITE_OBJECT) != 0)
CALLOC_ARRAY(idx, 1);
-
- prepare_to_stream(state, flags);
- hashfile_checkpoint_init(state->f, &checkpoint);
- }
-
already_hashed_to = 0;
while (1) {
prepare_to_stream(state, flags);
+ if (checkpoint.f != state->f)
+ hashfile_checkpoint_init(state->f, &checkpoint);
if (idx) {
hashfile_checkpoint(&checkpoint);
idx->offset = state->offset;
--- >8 ---
which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.
It would be nice if we could make the checkpoint only declared within
the loop body itself, but we can't because we need to call
hashfile_truncate() outside of the loop.
Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.
I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.
Thanks,
Taylor
next prev parent reply other threads:[~2025-01-17 21:30 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-20 19:13 [PATCH 0/6] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
2024-11-20 19:13 ` [PATCH 1/6] csum-file: store the hash algorithm as a struct field Taylor Blau
2024-11-21 9:18 ` Jeff King
2024-11-20 19:13 ` [PATCH 2/6] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2024-11-20 19:13 ` [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2024-11-21 9:37 ` Jeff King
2024-11-22 0:39 ` brian m. carlson
2024-11-22 8:25 ` Jeff King
2024-11-22 20:37 ` brian m. carlson
2025-01-10 21:38 ` Taylor Blau
2025-01-11 2:45 ` Jeff King
2024-11-20 19:13 ` [PATCH 4/6] csum-file.c: use unsafe_hash_algo() Taylor Blau
2024-11-20 19:13 ` [PATCH 5/6] t/helper/test-hash.c: " Taylor Blau
2024-11-20 19:13 ` [PATCH 6/6] hash.h: drop unsafe_ function variants Taylor Blau
2024-11-21 9:41 ` Jeff King
2025-01-08 19:14 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Taylor Blau
2025-01-08 19:14 ` [PATCH v2 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-08 19:14 ` [PATCH v2 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-16 11:48 ` Patrick Steinhardt
2025-01-17 21:17 ` Taylor Blau
2025-01-08 19:14 ` [PATCH v2 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-08 19:14 ` [PATCH v2 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-16 11:49 ` Patrick Steinhardt
2025-01-17 21:18 ` Taylor Blau
2025-01-08 19:14 ` [PATCH v2 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-08 19:14 ` [PATCH v2 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-08 19:14 ` [PATCH v2 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-10 10:37 ` Jeff King
2025-01-10 21:50 ` Taylor Blau
2025-01-17 21:30 ` Taylor Blau [this message]
2025-01-18 12:15 ` Jeff King
2025-01-08 19:14 ` [PATCH v2 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-10 10:41 ` [PATCH v2 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
2025-01-10 21:29 ` Taylor Blau
2025-01-11 2:42 ` Jeff King
2025-01-11 0:14 ` Junio C Hamano
2025-01-11 17:14 ` Taylor Blau
2025-01-17 22:03 ` [PATCH v3 " Taylor Blau
2025-01-17 22:03 ` [PATCH v3 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-17 22:03 ` [PATCH v3 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-17 22:03 ` [PATCH v3 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-17 22:03 ` [PATCH v3 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-17 22:03 ` [PATCH v3 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-17 22:03 ` [PATCH v3 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-17 22:03 ` [PATCH v3 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-17 22:03 ` [PATCH v3 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-18 12:28 ` [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Jeff King
2025-01-18 12:43 ` Jeff King
2025-01-22 21:31 ` Junio C Hamano
2025-01-23 17:34 ` [PATCH v4 " Taylor Blau
2025-01-23 17:34 ` [PATCH v4 1/8] t/helper/test-tool: implement sha1-unsafe helper Taylor Blau
2025-01-23 17:34 ` [PATCH v4 2/8] csum-file: store the hash algorithm as a struct field Taylor Blau
2025-01-23 17:34 ` [PATCH v4 3/8] csum-file.c: extract algop from hashfile_checksum_valid() Taylor Blau
2025-01-23 17:34 ` [PATCH v4 4/8] hash.h: introduce `unsafe_hash_algo()` Taylor Blau
2025-01-23 17:34 ` [PATCH v4 5/8] csum-file.c: use unsafe_hash_algo() Taylor Blau
2025-01-23 17:34 ` [PATCH v4 6/8] t/helper/test-hash.c: " Taylor Blau
2025-01-23 17:34 ` [PATCH v4 7/8] csum-file: introduce hashfile_checkpoint_init() Taylor Blau
2025-01-23 17:34 ` [PATCH v4 8/8] hash.h: drop unsafe_ function variants Taylor Blau
2025-01-23 18:30 ` [PATCH v4 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants Junio C Hamano
2025-01-23 18:50 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z4rL3TbEeR8EiUOi@nand.local \
--to=me@ttaylorr$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
--cc=gitster@pobox$(echo .)com \
--cc=newren@gmail$(echo .)com \
--cc=peff@peff$(echo .)net \
--cc=ps@pks$(echo .)im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox