From: Patrick Steinhardt <ps@pks•im>
To: Tsahi Elkayam <Tsahi.Elkayam@protonmail•com>
Cc: "git@vger•kernel.org" <git@vger•kernel.org>
Subject: Re: [PATCH v2] reftable/iter: fix undefined behavior in indexed_table_ref_iter_next
Date: Mon, 5 Jan 2026 15:59:53 +0100 [thread overview]
Message-ID: <aVvR6U6EJ9wfKk8l@pks.im> (raw)
In-Reply-To: <iaPdageDbUKEIQVlnOugIRhoojxnFo3j-WJFWY0eC5el1Epu3sxEnto6Lrd3bhAYL0Ry8T3czP5UPhLHX_gfWCDiCoLuMofdRkqfOSYP-Jk=@protonmail.com>
On Sun, Jan 04, 2026 at 10:46:40AM +0000, Tsahi Elkayam wrote:
> The indexed_table_ref_iter_next() function accesses ref->value.val2
> without first checking the ref's value_type. This is undefined behavior
> when the ref is not of type REFTABLE_REF_VAL2.
>
> The correct pattern is already used in filtering_ref_iterator_next()
> which checks value_type before accessing the appropriate union member.
> Apply the same pattern here:
>
> - Check for REFTABLE_REF_VAL2 before accessing val2 members
> - Add missing check for REFTABLE_REF_VAL1 to handle single-value refs
One missing bit is to explain what this is actually supposed to do. That
is, why do we even compare the data?
> This was marked with a "/* BUG */" comment indicating the issue was
> known but not yet fixed.
That's an indicator that something was wrong, true. But it doesn't
really say what the bug was. What puzzles me is that if the bug was so
easy to fix, then why didn't the original author already do it?
Unfortunately, blaming the line points to 46bc0e731a (reftable: read
reftable files, 2021-10-07), and that commit doesn't really provide much
context either.
> diff --git a/reftable/iter.c b/reftable/iter.c
> index 2ecc52b336..2eee65bb1e 100644
> --- a/reftable/iter.c
> +++ b/reftable/iter.c
> @@ -171,12 +171,15 @@ static int indexed_table_ref_iter_next(void *p, struct reftable_record *rec)
> }
> continue;
> }
> - /* BUG */
> - if (!memcmp(it->oid.buf, ref->value.val2.target_value,
> - it->oid.len) ||
> - !memcmp(it->oid.buf, ref->value.val2.value, it->oid.len)) {
> + if (ref->value_type == REFTABLE_REF_VAL2 &&
> + (!memcmp(it->oid.buf, ref->value.val2.target_value,
> + it->oid.len) ||
> + !memcmp(it->oid.buf, ref->value.val2.value, it->oid.len)))
> + return 0;
> +
> + if (ref->value_type == REFTABLE_REF_VAL1 &&
> + !memcmp(it->oid.buf, ref->value.val1, it->oid.len))
> return 0;
> - }
> }
> }
So let's take a step back -- what are we even trying to do here?
The indexed table is basically a table that provides reverse mappings.
Given an object ID, it allows us to quickly look up any reference that
points to this object ID. We don't make any use of that feature in Git
right now, but historically it was designed to speed up
"uploadpack.allowTipSHA1InWant". This setting is a lot less relevant
nowadays, as most forges set "uploadpack.allowAnySHA1InWant" to support
partial clones.
So this interface is somewhat confusingly named, as the term "index" is
overloaded: we have the "ref" and "log" indices that enable fast lookup
of those record types. But what this here refers to is the "obj" table.
Oh, well.
The iterator for those objects takes as input the object ID we're
searching for. Given that object ID, it is expected to yield only those
ref records that reference this object ID, either peeled or unpeeled in
case it is a tag.
In the `next()` function we essentially have a nested loop:
- The outer loop iterates through the "obj" blocks. This gives us the
offsets of the ref records that we need to look up and that contain.
- The inner loop iterates through the records in the "ref" block that
was referenced by the "obj" block.
Honestly, the whole logic doesn't really make any sense though, as we
never filter by the caller-provided object ID at all! So we still end up
churning through all references, which kind of destroys the purpose of
this whole "obj" reverse index. And that's also why we have the check:
we verify that the object ID of the reference we have looked up matches
the caller's query.
So the fix you have here is correct: we may end up with a "ref" record
that is unpeeled, and in that case it's wrong to treat it as a peeled
one. And if we fix that, the result should at least be correct and void
of any kind of undefined behaviour. But the whole infrastructure is
still quite broken. What we should be doing is to:
1. Seek to the obj record that has the desired object ID prefix.
2. For each obj record starting with the desired object ID prefix:
1. Look up the respective "ref" block indicated by the offset.
2. Iterate through all "ref" records and yield all those whose value
or peeled value match.
3. Abort once there are no more obj records matching the given prefix.
All of this is naturally outside the scope of this patch series. I'd
argue though that we shouldn't just remove the BUG comment, but instead
add some TODO comment that explains why the current logic is still very
suboptimal.
Thanks!
Patrick
next prev parent reply other threads:[~2026-01-05 15:00 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-04 10:46 [PATCH v2] reftable/iter: fix undefined behavior in indexed_table_ref_iter_next Tsahi Elkayam
2026-01-05 14:59 ` Patrick Steinhardt [this message]
2026-01-08 16:52 ` [PATCH v2] reftable/iter: fix UB " Tsahi Elkayam
2026-01-09 14:19 ` Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aVvR6U6EJ9wfKk8l@pks.im \
--to=ps@pks$(echo .)im \
--cc=Tsahi.Elkayam@protonmail$(echo .)com \
--cc=git@vger$(echo .)kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox