public inbox for git@vger.kernel.org 
 help / color / mirror / Atom feed
From: Phillip Wood <phillip.wood123@gmail•com>
To: Jeff King <peff@peff•net>, Junio C Hamano <gitster@pobox•com>
Cc: git@vger•kernel.org, Patrick Steinhardt <ps@pks•im>,
	correctmost <cmlists@sent•com>, Taylor Blau <me@ttaylorr•com>
Subject: Re: [PATCH 2/4] parse: add functions for parsing from non-string buffers
Date: Fri, 5 Dec 2025 16:11:15 +0000	[thread overview]
Message-ID: <4d83375b-76e2-4420-80dd-6a04d3201532@gmail.com> (raw)
In-Reply-To: <20251130131537.GB199335@coredump.intra.peff.net>

On 30/11/2025 13:15, Jeff King wrote:
> If you have a buffer that is not NUL-terminated but want to parse an
> integer, there aren't many good options. If you use strtol() and
> friends, you risk running off the end of the buffer if there is no
> non-digit terminating character. And even if you carefully make sure
> that there is such a character, ASan's strict-string-check mode will
> still complain.
> 
> You can copy bytes into a temporary buffer, terminate it, and then call
> strtol(), but doing so adds some pitfalls (like making sure you soak up
> whitespace and leading +/- signs, and reporting overflow for overly long
> input). Or you can hand-parse the digits, but then you need to take some
> care to handle overflow (and again, whitespace and +/- signs).
> 
> These things aren't impossible to do right, but it's error-prone to have
> to do them in every spot that wants to do such parsing. So let's add
> some functions which can be used across the code base.
> 
> There are a few choices regarding the interface and the implementation.
> 
> First, the implementation:
> 
>    - I went with with parsing the digits (rather than buffering and
>      passing to libc functions). It ends up being a similar amount of
>      code because we have to do some parsing either way. And likewise
>      overflow detection depends on the exact type the caller wants, so we
>      either have to do it by hand or write a separate wrapper for
>      strtol(), strtoumax(), and so on.
> 
>    - Unsigned overflow detection is done using the same techniques as in
>      unsigned_add_overflows(), etc. We can't use those macros directly
>      because our core function is type-agnostic (so the caller passes in
>      the max value, rather than us deriving it on the fly). This is
>      similar to how git_parse_int(), etc, work.
> 
>    - Signed overflow detection assumes that we can express a negative
>      value with magnitude one larger than our maximum positive value
>      (e.g., -128..127 for a signed 8-bit value). I doubt this is
>      guaranteed by the standard, but it should hold in practice, and we
>      make the same assumption in git_parse_int(), etc. The nice thing
>      about this is that we can derive the range from the number of bits
>      in the type. For ints, you obviously could use INT_MIN..INT_MAX, but
>      for an arbitrary type, we can use maximum_signed_value_of_type().
> 
>    - I didn't bother with handling bases other than 10. It would
>      complicate the code, and I suspect it won't be needed. We could
>      probably retro-fit it later without too much work, if need be.

This all sounds sensible to me and an does the interface description.

> +bool parse_unsigned_from_buf(const char *buf, size_t len, const char **ep,
> +			     uintmax_t *ret, uintmax_t max)
> +{
> +	return parse_from_buf_internal(buf, len, ep, NULL, ret, max);
> +}
> +
> +bool parse_signed_from_buf(const char *buf, size_t len, const char **ep,
> +			   intmax_t *ret, intmax_t max)
> +{
> +	uintmax_t u_ret;
> +	bool negate;
> +
> +	if (!parse_from_buf_internal(buf, len, ep, &negate, &u_ret, max))
> +		return false;
> +	/*
> +	 * Range already checked internally, but we must apply negation
> +	 * ourselves since only we have the signed integer type.
> +	 */
> +	if (negate) {
> +		*ret = u_ret;
> +		*ret = -*ret;

If we're parsing INTMAX_MIN then this negation tries to calculate 
-INTMAX_MIN which is undefined (I've added some tests for parsing 
INTMAX_MAX and INTMAX_MIN at [1] and verified that UBSAN is triggered 
when parsing INTMAX_MIN). We could do

		*ret = u_ret;
		if (*ret != INTMAX_MIN)
			*ret = -*ret;

but I think it might be easier to alter parse_from_buf_internal() to 
make "negate" a local variable, change the function argument to "bool 
allow_negative" and do

		*ret = negate ? 0u - val : val;

Then parse_signed_from_buf() can do "*ret = *u_ret;" to convert the 
output of parse_from_buf_internal() to a signed value.

> diff --git a/t/unit-tests/u-parse-int.c b/t/unit-tests/u-parse-int.c
> new file mode 100644
> index 0000000000..a1601bb16b
> --- /dev/null
> +++ b/t/unit-tests/u-parse-int.c
> @@ -0,0 +1,98 @@
> +#include "unit-test.h"
> +#include "parse.h"
> +
> +static void check_int(const char *buf, size_t len,
> +		      size_t expect_ep_ofs, int expect_errno,
> +		      int expect_result)
> +{
> +	const char *ep;
> +	int result;

Do we want to set errno=0 here so that we can be sure it has been set by 
parse_int_from_buf() when we check it below?

Thanks

Phillip

[1] 
https://github.com/phillipwood/git/commit/e061e3e640db01d4fcf54d265d33352235151973

> +	bool ok = parse_int_from_buf(buf, len, &ep, &result);
> +
> +	if (expect_errno) {
> +		cl_assert(!ok);
> +		cl_assert_equal_i(expect_errno, errno);
> +		return;
> +	}
> +
> +	cl_assert(ok);
> +	cl_assert_equal_i(expect_result, result);
> +	cl_assert_equal_i(expect_ep_ofs, ep - buf);
> +}
> +
> +static void check_int_str(const char *buf, size_t ofs, int err, int res)
> +{
> +	check_int(buf, strlen(buf), ofs, err, res);
> +}
> +
> +static void check_int_full(const char *buf, int res)
> +{
> +	check_int_str(buf, strlen(buf), 0, res);
> +}
> +
> +static void check_int_err(const char *buf, int err)
> +{
> +	check_int(buf, strlen(buf), 0, err, 0);
> +}
> +
> +void test_parse_int__basic(void)
> +{
> +	cl_invoke(check_int_full("0", 0));
> +	cl_invoke(check_int_full("11", 11));
> +	cl_invoke(check_int_full("-23", -23));
> +	cl_invoke(check_int_full("+23", 23));
> +
> +	cl_invoke(check_int_str("  31337  ", 7, 0, 31337));
> +
> +	cl_invoke(check_int_err("  garbage", EINVAL));
> +	cl_invoke(check_int_err("", EINVAL));
> +	cl_invoke(check_int_err("-", EINVAL));
> +
> +	cl_invoke(check_int("123", 2, 2, 0, 12));
> +}
> +
> +void test_parse_int__range(void)
> +{
> +	/*
> +	 * These assume a 32-bit int. We could avoid that with some
> +	 * conditionals, but it's probably better for the test to
> +	 * fail noisily and we can decide how to handle it then.
> +	 */
> +	cl_invoke(check_int_full("2147483647", 2147483647));
> +	cl_invoke(check_int_err("2147483648", ERANGE));
> +	cl_invoke(check_int_full("-2147483647", -2147483647));
> +	cl_invoke(check_int_full("-2147483648", -2147483648));
> +	cl_invoke(check_int_err("-2147483649", ERANGE));
> +}
> +
> +static void check_unsigned(const char *buf, uintmax_t max,
> +			   int expect_errno, uintmax_t expect_result)
> +{
> +	const char *ep;
> +	uintmax_t result;
> +	bool ok = parse_unsigned_from_buf(buf, strlen(buf), &ep, &result, max);
> +
> +	if (expect_errno) {
> +		cl_assert(!ok);
> +		cl_assert_equal_i(expect_errno, errno);
> +		return;
> +	}
> +
> +	cl_assert(ok);
> +	cl_assert_equal_s(ep, "");
> +	/*
> +	 * Do not use cl_assert_equal_i_fmt(..., PRIuMAX) here. The macro
> +	 * casts to int under the hood, corrupting the values.
> +	 */
> +	clar__assert_equal(CLAR_CURRENT_FILE, CLAR_CURRENT_FUNC,
> +			   CLAR_CURRENT_LINE,
> +			   "expect_result != result", 1,
> +			   "%"PRIuMAX, expect_result, result);
> +}
> +
> +void test_parse_int__unsigned(void)
> +{
> +	cl_invoke(check_unsigned("4294967295", UINT_MAX, 0, 4294967295U));
> +	cl_invoke(check_unsigned("1053", 1000, ERANGE, 0));
> +	cl_invoke(check_unsigned("-17", UINT_MAX, EINVAL, 0));
> +}


  parent reply	other threads:[~2025-12-05 16:11 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12  7:55 [PATCH 0/9] asan bonanza Jeff King
2025-11-12  7:56 ` [PATCH 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
2025-11-12  8:01 ` [PATCH 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
2025-11-12 11:25   ` Patrick Steinhardt
2025-11-13  2:55   ` Taylor Blau
2025-11-18  8:59     ` Jeff King
2025-11-12  8:02 ` [PATCH 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
2025-11-12  8:17   ` Collin Funk
2025-11-12 10:31     ` Jeff King
2025-11-12 20:06       ` Collin Funk
2025-11-12 11:26   ` Patrick Steinhardt
2025-11-13  3:12     ` Taylor Blau
2025-11-13  6:34       ` Patrick Steinhardt
2025-11-18  8:49       ` Jeff King
2025-11-13 16:30     ` Junio C Hamano
2025-11-14  7:00       ` Patrick Steinhardt
2025-11-15  2:13         ` Jeff King
2025-11-12  8:05 ` [PATCH 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
2025-11-12 11:26   ` Patrick Steinhardt
2025-11-13  3:09     ` Taylor Blau
2025-11-18  8:40       ` Jeff King
2025-11-18  8:38     ` Jeff King
2025-11-12  8:06 ` [PATCH 5/9] fsck: assert newline presence in fsck_ident() Jeff King
2025-11-12  8:06 ` [PATCH 6/9] fsck: avoid strcspn() " Jeff King
2025-11-12  8:06 ` [PATCH 7/9] fsck: remove redundant date timestamp check Jeff King
2025-11-12  8:10 ` [PATCH 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
2025-11-12 11:25   ` Patrick Steinhardt
2025-11-12 19:36     ` Junio C Hamano
2025-11-15  2:12     ` Jeff King
2025-11-12  8:10 ` [PATCH 9/9] t: enable ASan's strict_string_checks option Jeff King
2025-11-13  3:17 ` [PATCH 0/9] asan bonanza Taylor Blau
2025-11-18  9:11 ` [PATCH v2 " Jeff King
2025-11-18  9:11   ` [PATCH v2 1/9] compat/mmap: mark unused argument in git_munmap() Jeff King
2025-11-18  9:12   ` [PATCH v2 2/9] pack-bitmap: handle name-hash lookups in incremental bitmaps Jeff King
2025-11-18  9:12   ` [PATCH v2 3/9] Makefile: turn on NO_MMAP when building with ASan Jeff King
2025-11-18  9:12   ` [PATCH v2 4/9] cache-tree: avoid strtol() on non-string buffer Jeff King
2025-11-18 14:30     ` Phillip Wood
2025-11-23  6:19       ` Junio C Hamano
2025-11-23 15:51         ` Phillip Wood
2025-11-23 18:06           ` Junio C Hamano
2025-11-24 22:30         ` Jeff King
2025-11-24 23:09           ` Junio C Hamano
2025-11-26 15:09             ` Jeff King
2025-11-26 17:22               ` Junio C Hamano
2025-11-30 13:13                 ` [PATCH 0/4] more robust functions for parsing int from buf Jeff King
2025-11-30 13:14                   ` [PATCH 1/4] parse: prefer bool to int for boolean returns Jeff King
2025-12-04 11:23                     ` Patrick Steinhardt
2025-11-30 13:15                   ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Jeff King
2025-11-30 13:46                     ` my complaints with clar Jeff King
2025-12-01 14:16                       ` Phillip Wood
2025-12-04 11:09                         ` Patrick Steinhardt
2025-12-05 18:30                           ` Jeff King
2025-12-04 11:23                     ` [PATCH 2/4] parse: add functions for parsing from non-string buffers Patrick Steinhardt
2025-12-05 16:11                     ` Phillip Wood [this message]
2026-01-20 20:54                       ` Junio C Hamano
2026-01-21  5:27                         ` Jeff King
2025-11-30 13:15                   ` [PATCH 3/4] cache-tree: use parse_int_from_buf() Jeff King
2025-11-30 13:16                   ` [PATCH 4/4] fsck: use parse_unsigned_from_buf() for parsing timestamp Jeff King
2025-11-18  9:12   ` [PATCH v2 5/9] fsck: assert newline presence in fsck_ident() Jeff King
2025-11-18  9:12   ` [PATCH v2 6/9] fsck: avoid strcspn() " Jeff King
2025-11-18  9:12   ` [PATCH v2 7/9] fsck: remove redundant date timestamp check Jeff King
2025-11-18  9:12   ` [PATCH v2 8/9] fsck: avoid parse_timestamp() on buffer that isn't NUL-terminated Jeff King
2025-11-18  9:12   ` [PATCH v2 9/9] t: enable ASan's strict_string_checks option Jeff King
2025-11-23  5:49   ` [PATCH v2 0/9] asan bonanza Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4d83375b-76e2-4420-80dd-6a04d3201532@gmail.com \
    --to=phillip.wood123@gmail$(echo .)com \
    --cc=cmlists@sent$(echo .)com \
    --cc=git@vger$(echo .)kernel.org \
    --cc=gitster@pobox$(echo .)com \
    --cc=me@ttaylorr$(echo .)com \
    --cc=peff@peff$(echo .)net \
    --cc=phillip.wood@dunelm$(echo .)org.uk \
    --cc=ps@pks$(echo .)im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox