* [PATCH 1/6] test-sha1: add a binary output mode
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
@ 2013-08-22 23:12 ` Jeff King
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
` (5 subsequent siblings)
6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:12 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.
Signed-off-by: Jeff King <peff@peff•net>
---
test-sha1.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
git_SHA_CTX ctx;
unsigned char sha1[20];
unsigned bufsz = 8192;
+ int binary = 0;
char *buffer;
- if (ac == 2)
- bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+ if (ac == 2) {
+ if (!strcmp(av[1], "-b"))
+ binary = 1;
+ else
+ bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+ }
if (!bufsz)
bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
git_SHA1_Update(&ctx, buffer, this_sz);
}
git_SHA1_Final(sha1, &ctx);
- puts(sha1_to_hex(sha1));
+
+ if (binary)
+ fwrite(sha1, 1, 20, stdout);
+ else
+ puts(sha1_to_hex(sha1));
exit(0);
}
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
@ 2013-08-22 23:14 ` Jeff King
2013-08-23 16:41 ` Junio C Hamano
2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
` (4 subsequent siblings)
6 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
The sha1_entry_pos function tries to be smart about
selecting the middle of a range for its binary search by
looking at the value differences between the "lo" and "hi"
constraints. However, it is unable to cope with entries with
duplicate keys in the sorted list.
We may hit a point in the search where both our "lo" and
"hi" point to the same key. In this case, the range of
values between our endpoints is 0, and trying to scale the
difference between our key and the endpoints over that range
is undefined (i.e., divide by zero). The current code
catches this with an "assert(lov < hiv)".
Moreover, after seeing that the first 20 byte of the key are
the same, we will try to establish a value from the 21st
byte. Which is nonsensical.
Instead, we can detect the case that we are in a run of
duplicates, and simply do a final comparison against any one
of them (since they are all the same, it does not matter
which). If the keys match, we have found our entry (or one
of them, anyway). If not, then we know that we do not need
to look further, as we must be in a run of the duplicate
key.
Furthermore, we know that one of our endpoints must be
the edge of the run of duplicates. For example, given this
sequence:
idx 0 1 2 3 4 5
key A C C C C D
If we are searching for "B", we might hit the duplicate run
at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
never have lo > 1, because B < C. That is, if our key is
less than the run, we know that "lo" is the edge, but we can
say nothing of "hi". Similarly, if our key is greater than
the run, we know that "hi" is the edge, but we can say
nothing of "lo". But that is enough for us to return not
only "not found", but show the position at which we would
insert the new item.
Signed-off-by: Jeff King <peff@peff•net>
---
sha1-lookup.c | 23 ++++++++++++
t/lib-pack.sh | 78 +++++++++++++++++++++++++++++++++++++++
t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++
3 files changed, 174 insertions(+)
create mode 100644 t/lib-pack.sh
create mode 100755 t/t5308-pack-detect-duplicates.sh
diff --git a/sha1-lookup.c b/sha1-lookup.c
index c4dc55d..614cbb6 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table,
* byte 0 thru (ofs-1) are the same between
* lo and hi; ofs is the first byte that is
* different.
+ *
+ * If ofs==20, then no bytes are different,
+ * meaning we have entries with duplicate
+ * keys. We know that we are in a solid run
+ * of this entry (because the entries are
+ * sorted, and our lo and hi are the same,
+ * there can be nothing but this single key
+ * in between). So we can stop the search.
+ * Either one of these entries is it (and
+ * we do not care which), or we do not have
+ * it.
*/
+ if (ofs == 20) {
+ mi = lo;
+ mi_key = base + elem_size * mi + key_offset;
+ cmp = memcmp(mi_key, key, 20);
+ if (!cmp)
+ return mi;
+ if (cmp < 0)
+ return -1 - hi;
+ else
+ return -1 - lo;
+ }
+
hiv = hi_key[ofs_0];
if (ofs_0 < 19)
hiv = (hiv << 8) | hi_key[ofs_0+1];
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
new file mode 100644
index 0000000..61c5d19
--- /dev/null
+++ b/t/lib-pack.sh
@@ -0,0 +1,78 @@
+#!/bin/sh
+#
+# Support routines for hand-crafting weird or malicious packs.
+#
+# You can make a complete pack like:
+#
+# pack_header 2 >foo.pack &&
+# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
+# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
+# pack_trailer foo.pack
+
+# Print the big-endian 4-byte octal representation of $1
+uint32_octal() {
+ n=$1
+ printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
+ printf '\%o' $(($n / 65536)); n=$((n % 65536))
+ printf '\%o' $(($n / 256)); n=$((n % 256))
+ printf '\%o' $(($n ));
+}
+
+# Print the big-endian 4-byte binary representation of $1
+uint32_binary() {
+ printf "$(uint32_octal "$1")"
+}
+
+# Print a pack header, version 2, for a pack with $1 objects
+pack_header() {
+ printf 'PACK' &&
+ printf '\0\0\0\2' &&
+ uint32_binary "$1"
+}
+
+# Print the pack data for object $1, as a delta against object $2 (or as a full
+# object if $2 is missing or empty). The output is suitable for including
+# directly in the packfile, and represents the entirety of the object entry.
+# Doing this on the fly (especially picking your deltas) is quite tricky, so we
+# have hardcoded some well-known objects. See the case statements below for the
+# complete list.
+pack_obj() {
+ case "$1" in
+ # empty blob
+ e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
+ case "$2" in
+ '')
+ printf '\060\170\234\003\0\0\0\0\1'
+ return
+ ;;
+ esac
+ ;;
+
+ # blob containing "\7\76"
+ e68fe8129b546b101aee9510c5328e7f21ca1d18)
+ case "$2" in
+ '')
+ printf '\062\170\234\143\267\3\0\0\116\0\106'
+ return
+ ;;
+ esac
+ ;;
+ esac
+
+ echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}"
+ return 1
+}
+
+# Compute and append pack trailer to "$1"
+pack_trailer() {
+ test-sha1 -b <"$1" >trailer.tmp &&
+ cat trailer.tmp >>"$1" &&
+ rm -f trailer.tmp
+}
+
+# Remove any existing packs to make sure that
+# whatever we index next will be the pack that we
+# actually use.
+clear_packs() {
+ rm -f .git/objects/pack/*
+}
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
new file mode 100755
index 0000000..719d48f
--- /dev/null
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+. ../lib-pack.sh
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the name of the packfile to create
+#
+# $2 is the number of times to duplicate each object
+create_pack() {
+ pack_header "$((2 * $2))" >"$1" &&
+ for i in $(test_seq 1 "$2"); do
+ pack_obj $LO_SHA1 &&
+ pack_obj $HI_SHA1
+ done >>"$1" &&
+ pack_trailer "$1"
+}
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+ create_pack no-dups.pack 1 &&
+ git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+ clear_packs &&
+ create_pack dups.pack 100 &&
+ git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create batch-check test vectors' '
+ cat >input <<-EOF &&
+ $LO_SHA1
+ $HI_SHA1
+ $MISSING_SHA1
+ EOF
+ cat >expect <<-EOF
+ $LO_SHA1 blob 2
+ $HI_SHA1 blob 0
+ $MISSING_SHA1 missing
+ EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+ git cat-file --batch-check <input >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+ (
+ GIT_USE_LOOKUP=1 &&
+ export GIT_USE_LOOKUP &&
+ git cat-file --batch-check <input >actual
+ ) &&
+ test_cmp expect actual
+'
+
+test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-23 16:41 ` Junio C Hamano
2013-08-23 18:24 ` Jeff King
2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2013-08-23 16:41 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
Jeff King <peff@peff•net> writes:
> Furthermore, we know that one of our endpoints must be
> the edge of the run of duplicates. For example, given this
> sequence:
>
> idx 0 1 2 3 4 5
> key A C C C C D
>
> If we are searching for "B", we might hit the duplicate run
> at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
> never have lo > 1, because B < C. That is, if our key is
> less than the run, we know that "lo" is the edge, but we can
> say nothing of "hi". Similarly, if our key is greater than
> the run, we know that "hi" is the edge, but we can say
> nothing of "lo". But that is enough for us to return not
> only "not found", but show the position at which we would
> insert the new item.
This is somewhat tricky and may deserve an in-code comment.
> diff --git a/sha1-lookup.c b/sha1-lookup.c
> index c4dc55d..614cbb6 100644
> --- a/sha1-lookup.c
> +++ b/sha1-lookup.c
> @@ -204,7 +204,30 @@ int sha1_entry_pos(const void *table,
> * byte 0 thru (ofs-1) are the same between
> * lo and hi; ofs is the first byte that is
> * different.
> + *
> + * If ofs==20, then no bytes are different,
> + * meaning we have entries with duplicate
> + * keys. We know that we are in a solid run
> + * of this entry (because the entries are
> + * sorted, and our lo and hi are the same,
> + * there can be nothing but this single key
> + * in between). So we can stop the search.
> + * Either one of these entries is it (and
> + * we do not care which), or we do not have
> + * it.
> */
> + if (ofs == 20) {
> + mi = lo;
> + mi_key = base + elem_size * mi + key_offset;
> + cmp = memcmp(mi_key, key, 20);
It think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are
the same at this point and we do not have to compare full 20 bytes
again, but this is done only once and a better readablity of the
above trumps micro-optimization possibility, I think.
> + if (!cmp)
> + return mi;
> + if (cmp < 0)
> + return -1 - hi;
> + else
> + return -1 - lo;
> + }
> +
> hiv = hi_key[ofs_0];
> if (ofs_0 < 19)
> hiv = (hiv << 8) | hi_key[ofs_0+1];
> diff --git a/t/lib-pack.sh b/t/lib-pack.sh
> new file mode 100644
> index 0000000..61c5d19
> --- /dev/null
> +++ b/t/lib-pack.sh
> @@ -0,0 +1,78 @@
> +#!/bin/sh
> +#
> +# Support routines for hand-crafting weird or malicious packs.
> +#
> +# You can make a complete pack like:
> +#
> +# pack_header 2 >foo.pack &&
> +# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
> +# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
> +# pack_trailer foo.pack
> +
> +# Print the big-endian 4-byte octal representation of $1
> +uint32_octal() {
micronit (style):
uint32_octal () {
> + n=$1
> + printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
> + printf '\%o' $(($n / 65536)); n=$((n % 65536))
> + printf '\%o' $(($n / 256)); n=$((n % 256))
> + printf '\%o' $(($n ));
> +}
> +
> +# Print the big-endian 4-byte binary representation of $1
> +uint32_binary() {
> + printf "$(uint32_octal "$1")"
> +}
> +
> +# Print a pack header, version 2, for a pack with $1 objects
> +pack_header() {
> + printf 'PACK' &&
> + printf '\0\0\0\2' &&
> + uint32_binary "$1"
> +}
> +
> +# Print the pack data for object $1, as a delta against object $2 (or as a full
> +# object if $2 is missing or empty). The output is suitable for including
> +# directly in the packfile, and represents the entirety of the object entry.
> +# Doing this on the fly (especially picking your deltas) is quite tricky, so we
> +# have hardcoded some well-known objects. See the case statements below for the
> +# complete list.
Cute ;-) I like the idea of having this function with a right API in
place, and cheating by limiting its implementation to what is
necessary.
Thanks.
^ permalink raw reply [flat|nested] 37+ messages in thread* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-23 16:41 ` Junio C Hamano
@ 2013-08-23 18:24 ` Jeff King
2013-08-23 18:54 ` Nicolas Pitre
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
0 siblings, 2 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 18:24 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
On Fri, Aug 23, 2013 at 09:41:57AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff•net> writes:
>
> > Furthermore, we know that one of our endpoints must be
> > the edge of the run of duplicates. For example, given this
> > sequence:
> >
> > idx 0 1 2 3 4 5
> > key A C C C C D
> >
> > If we are searching for "B", we might hit the duplicate run
> > at lo=1, hi=3 (e.g., by first mi=3, then mi=0). But we can
> > never have lo > 1, because B < C. That is, if our key is
> > less than the run, we know that "lo" is the edge, but we can
> > say nothing of "hi". Similarly, if our key is greater than
> > the run, we know that "hi" is the edge, but we can say
> > nothing of "lo". But that is enough for us to return not
> > only "not found", but show the position at which we would
> > insert the new item.
>
> This is somewhat tricky and may deserve an in-code comment.
Do you want me to re-roll, pushing it down into the comment, or do you
want to mark it up yourself? I think there might be some value in the
latter as your re-writing of it as a comment may cross-check that my
logic is sound.
> > + if (ofs == 20) {
> > + mi = lo;
> > + mi_key = base + elem_size * mi + key_offset;
> > + cmp = memcmp(mi_key, key, 20);
>
> It think we already know that mi_key[0:ofs_0] and key[0:ofs_0] are
> the same at this point and we do not have to compare full 20 bytes
> again, but this is done only once and a better readablity of the
> above trumps micro-optimization possibility, I think.
Yes, I had the same idea, and came to the same conclusion. Though if
anybody did want to try it, note that we have just overwritten the old
ofs_0, so you would want to bump the new code up above that line).
> > +uint32_octal() {
>
> micronit (style):
>
> uint32_octal () {
Hmph. I always forget which one we prefer, and we seem to have equal
numbers of both already. Again, want a re-roll or to mark it up
yourself?
> > +# Print the pack data for object $1, as a delta against object $2 (or as a full
> > +# object if $2 is missing or empty). The output is suitable for including
> > +# directly in the packfile, and represents the entirety of the object entry.
> > +# Doing this on the fly (especially picking your deltas) is quite tricky, so we
> > +# have hardcoded some well-known objects. See the case statements below for the
> > +# complete list.
>
> Cute ;-) I like the idea of having this function with a right API in
> place, and cheating by limiting its implementation to what is
> necessary.
Just for reference, the procedure I used to generate the "base" data is
reasonably straight forward:
sha1=$(printf %s "$content" | git hash-object -w --stdin)
echo $sha1 | git pack-objects --stdout >tmp.pack
tail -c +13 tmp.pack >no-header.pack
head -c -20 no-header.pack >no-trailer.pack
od -b no-trailer.pack | grep ' ' | cut -d' ' -f2- | tr ' ' '\\'
Since we want binary, we can skip the "od" call at the end (I needed it
to convert to something readable to hand "printf"). But "head -c" is not
portable, nor is head with a negative count.
To find items in the same fanout, I just used for-loops to calculate the
sha1s of all 2-byte blobs. And that is why we have the odd magic "\7\76"
blob.
Making the deltas was considerably less elegant, since we cannot provoke
pack-objects to pick arbitrary deltas (and it will not even try to delta
tiny objects, anyway, which would bloat our samples). I ended up with
the horrible patch below. We _could_ clean it up (error-checking? Who
needs it?) and make it a debug-and-testing-only option for pack-objects,
but I just didn't think the grossness was worth it. Still, it's probably
worth documenting here on the list in case somebody else ever needs to
add new samples to lib-pack.sh.
---
builtin/pack-objects.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8da2a66..e8937f5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2442,6 +2442,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
const char *rp_av[6];
int rp_ac = 0;
int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+ int magic = 0;
struct option pack_objects_options[] = {
OPT_SET_INT('q', "quiet", &progress,
N_("do not show progress meter"), 0),
@@ -2505,6 +2506,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
N_("pack compression level")),
OPT_SET_INT(0, "keep-true-parents", &grafts_replace_parents,
N_("do not hide commits by grafts"), 0),
+ OPT_BOOL(0, "magic", &magic, "make deltas"),
OPT_END(),
};
@@ -2520,6 +2522,34 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
argc = parse_options(argc, argv, prefix, pack_objects_options,
pack_usage, 0);
+ if (magic) {
+ unsigned char sha1[20];
+ struct delta_index *index;
+ void *src, *trg, *delta;
+ enum object_type src_type, trg_type;
+ unsigned long src_size, trg_size, delta_size, z_delta_size;
+ unsigned char header[10];
+ unsigned long header_len;
+
+ get_sha1(argv[0], sha1);
+ trg = read_sha1_file(sha1, &trg_type, &trg_size);
+
+ get_sha1(argv[1], sha1);
+ src = read_sha1_file(sha1, &src_type, &src_size);
+
+ index = create_delta_index(src, src_size);
+ delta = create_delta(index, trg, trg_size, &delta_size, 8192);
+
+ z_delta_size = do_compress(&delta, delta_size);
+ header_len = encode_in_pack_object_header(OBJ_REF_DELTA, delta_size,
+ header);
+ fwrite(header, 1, header_len, stdout);
+ fwrite(sha1, 1, 20, stdout);
+ fwrite(delta, 1, z_delta_size, stdout);
+ return 0;
+ }
+
+
if (argc) {
base_name = argv[0];
argc--;
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-23 18:24 ` Jeff King
@ 2013-08-23 18:54 ` Nicolas Pitre
2013-08-23 18:56 ` Jeff King
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
1 sibling, 1 reply; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-23 18:54 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce
On Fri, 23 Aug 2013, Jeff King wrote:
> Making the deltas was considerably less elegant, since we cannot provoke
> pack-objects to pick arbitrary deltas (and it will not even try to delta
> tiny objects, anyway, which would bloat our samples). I ended up with
> the horrible patch below. We _could_ clean it up (error-checking? Who
> needs it?) and make it a debug-and-testing-only option for pack-objects,
> but I just didn't think the grossness was worth it. Still, it's probably
> worth documenting here on the list in case somebody else ever needs to
> add new samples to lib-pack.sh.
Maybe using test-delta (from test-delta.c) would have helped here?
In any case, if something needs to be permanently added into the code to
help in the creation of test objects, I think test-delta.c is a far
better place than pack-objects.c.
Nicolas
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-23 18:54 ` Nicolas Pitre
@ 2013-08-23 18:56 ` Jeff King
0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 18:56 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Junio C Hamano, Git Mailing List, Duy Nguyen, Shawn O. Pearce
On Fri, Aug 23, 2013 at 02:54:19PM -0400, Nicolas Pitre wrote:
> On Fri, 23 Aug 2013, Jeff King wrote:
>
> > Making the deltas was considerably less elegant, since we cannot provoke
> > pack-objects to pick arbitrary deltas (and it will not even try to delta
> > tiny objects, anyway, which would bloat our samples). I ended up with
> > the horrible patch below. We _could_ clean it up (error-checking? Who
> > needs it?) and make it a debug-and-testing-only option for pack-objects,
> > but I just didn't think the grossness was worth it. Still, it's probably
> > worth documenting here on the list in case somebody else ever needs to
> > add new samples to lib-pack.sh.
>
> Maybe using test-delta (from test-delta.c) would have helped here?
>
> In any case, if something needs to be permanently added into the code to
> help in the creation of test objects, I think test-delta.c is a far
> better place than pack-objects.c.
*forehead palm*
I didn't even know we had test-delta. Yes, that is obviously a way
better place (I initially looked at pack-objects because it has the
helpers to do compression and the type/size header properly).
-Peff
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCHv3 0/6] duplicate objects and delta cycles
2013-08-23 18:24 ` Jeff King
2013-08-23 18:54 ` Nicolas Pitre
@ 2013-08-24 0:01 ` Jeff King
2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
` (5 more replies)
1 sibling, 6 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:01 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
On Fri, Aug 23, 2013 at 02:24:10PM -0400, Jeff King wrote:
> > This is somewhat tricky and may deserve an in-code comment.
>
> Do you want me to re-roll[...]
Since there were a few things to fix, I went ahead and re-rolled. Here's
v3, which changes:
1. Move "edge of run" logic description into the in-code comment
rather than the commit message.
2. Include space between shell function names and parentheses.
3. Use $TEST_DIRECTORY to find lib-pack so that "--root" works.
I added in Nicolas's acks, too, as this version makes no changes of
substance from the previous ack'd version. Interdiff is below.
sha1-lookup.c | 24 ++++++++++++++++++++++++
t/lib-pack.sh | 12 ++++++------
t/t5308-pack-detect-duplicates.sh | 4 ++--
t/t5309-pack-delta-cycles.sh | 2 +-
4 files changed, 33 insertions(+), 9 deletions(-)
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 614cbb6..2dd8515 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -215,6 +215,30 @@ int sha1_entry_pos(const void *table,
* Either one of these entries is it (and
* we do not care which), or we do not have
* it.
+ *
+ * Furthermore, we know that one of our
+ * endpoints must be the edge of the run of
+ * duplicates. For example, given this
+ * sequence:
+ *
+ * idx 0 1 2 3 4 5
+ * key A C C C C D
+ *
+ * If we are searching for "B", we might
+ * hit the duplicate run at lo=1, hi=3
+ * (e.g., by first mi=3, then mi=0). But we
+ * can never have lo > 1, because B < C.
+ * That is, if our key is less than the
+ * run, we know that "lo" is the edge, but
+ * we can say nothing of "hi". Similarly,
+ * if our key is greater than the run, we
+ * know that "hi" is the edge, but we can
+ * say nothing of "lo".
+ *
+ * Therefore if we do not find it, we also
+ * know where it would go if it did exist:
+ * just on the far side of the edge that we
+ * know about.
*/
if (ofs == 20) {
mi = lo;
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index e028c40..7e8685b 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -10,7 +10,7 @@
# pack_trailer foo.pack
# Print the big-endian 4-byte octal representation of $1
-uint32_octal() {
+uint32_octal () {
n=$1
printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
printf '\%o' $(($n / 65536)); n=$((n % 65536))
@@ -19,12 +19,12 @@ uint32_binary() {
}
# Print the big-endian 4-byte binary representation of $1
-uint32_binary() {
+uint32_binary () {
printf "$(uint32_octal "$1")"
}
# Print a pack header, version 2, for a pack with $1 objects
-pack_header() {
+pack_header () {
printf 'PACK' &&
printf '\0\0\0\2' &&
uint32_binary "$1"
@@ -36,7 +36,7 @@ pack_header() {
# Doing this on the fly (especially picking your deltas) is quite tricky, so we
# have hardcoded some well-known objects. See the case statements below for the
# complete list.
-pack_obj() {
+pack_obj () {
case "$1" in
# empty blob
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
@@ -86,7 +86,7 @@ pack_obj() {
}
# Compute and append pack trailer to "$1"
-pack_trailer() {
+pack_trailer () {
test-sha1 -b <"$1" >trailer.tmp &&
cat trailer.tmp >>"$1" &&
rm -f trailer.tmp
@@ -95,6 +95,6 @@ pack_trailer() {
# Remove any existing packs to make sure that
# whatever we index next will be the pack that we
# actually use.
-clear_packs() {
+clear_packs () {
rm -f .git/objects/pack/*
}
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index e982095..e0ba5e1 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -2,7 +2,7 @@ test_description='handling of duplicate objects in incoming packfiles'
test_description='handling of duplicate objects in incoming packfiles'
. ./test-lib.sh
-. ../lib-pack.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
# The sha1s we have in our pack. It's important that these have the same
# starting byte, so that they end up in the same fanout section of the index.
@@ -22,7 +22,7 @@ MISSING_SHA1='e69d000000000000000000000000000000000000'
# $1 is the name of the packfile to create
#
# $2 is the number of times to duplicate each object
-create_pack() {
+create_pack () {
pack_header "$((2 * $2))" >"$1" &&
for i in $(test_seq 1 "$2"); do
pack_obj $LO_SHA1 &&
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 38e1809..3e7861b 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -2,7 +2,7 @@ test_description='test index-pack handling of delta cycles in packfiles'
test_description='test index-pack handling of delta cycles in packfiles'
. ./test-lib.sh
-. ../lib-pack.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
# Two similar-ish objects that we have computed deltas between.
A=01d7713666f4de822776c7622c10f1b07de280dc
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 1/6] test-sha1: add a binary output mode
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
@ 2013-08-24 0:01 ` Jeff King
2013-08-24 0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
` (4 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:01 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
The test-sha1 helper program will run our internal sha1
routines over its input and output the 40-byte hex sha1.
Sometimes, however, it is useful to have the binary 20-byte
sha1 (and it's a pain to convert back in the shell). Let's
add a "-b" option to output the binary version.
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
test-sha1.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/test-sha1.c b/test-sha1.c
index 80daba9..e57eae1 100644
--- a/test-sha1.c
+++ b/test-sha1.c
@@ -5,10 +5,15 @@ int main(int ac, char **av)
git_SHA_CTX ctx;
unsigned char sha1[20];
unsigned bufsz = 8192;
+ int binary = 0;
char *buffer;
- if (ac == 2)
- bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+ if (ac == 2) {
+ if (!strcmp(av[1], "-b"))
+ binary = 1;
+ else
+ bufsz = strtoul(av[1], NULL, 10) * 1024 * 1024;
+ }
if (!bufsz)
bufsz = 8192;
@@ -42,6 +47,10 @@ int main(int ac, char **av)
git_SHA1_Update(&ctx, buffer, this_sz);
}
git_SHA1_Final(sha1, &ctx);
- puts(sha1_to_hex(sha1));
+
+ if (binary)
+ fwrite(sha1, 1, 20, stdout);
+ else
+ puts(sha1_to_hex(sha1));
exit(0);
}
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
@ 2013-08-24 0:02 ` Jeff King
2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
` (3 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
The sha1_entry_pos function tries to be smart about
selecting the middle of a range for its binary search by
looking at the value differences between the "lo" and "hi"
constraints. However, it is unable to cope with entries with
duplicate keys in the sorted list.
We may hit a point in the search where both our "lo" and
"hi" point to the same key. In this case, the range of
values between our endpoints is 0, and trying to scale the
difference between our key and the endpoints over that range
is undefined (i.e., divide by zero). The current code
catches this with an "assert(lov < hiv)".
Moreover, after seeing that the first 20 byte of the key are
the same, we will try to establish a value from the 21st
byte. Which is nonsensical.
Instead, we can detect the case that we are in a run of
duplicates, and simply do a final comparison against any one
of them (since they are all the same, it does not matter
which). If the keys match, we have found our entry (or one
of them, anyway). If not, then we know that we do not need
to look further, as we must be in a run of the duplicate
key.
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
sha1-lookup.c | 47 +++++++++++++++++++++++
t/lib-pack.sh | 78 +++++++++++++++++++++++++++++++++++++++
t/t5308-pack-detect-duplicates.sh | 73 ++++++++++++++++++++++++++++++++++++
3 files changed, 198 insertions(+)
create mode 100644 t/lib-pack.sh
create mode 100755 t/t5308-pack-detect-duplicates.sh
diff --git a/sha1-lookup.c b/sha1-lookup.c
index c4dc55d..2dd8515 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -204,7 +204,54 @@ int sha1_entry_pos(const void *table,
* byte 0 thru (ofs-1) are the same between
* lo and hi; ofs is the first byte that is
* different.
+ *
+ * If ofs==20, then no bytes are different,
+ * meaning we have entries with duplicate
+ * keys. We know that we are in a solid run
+ * of this entry (because the entries are
+ * sorted, and our lo and hi are the same,
+ * there can be nothing but this single key
+ * in between). So we can stop the search.
+ * Either one of these entries is it (and
+ * we do not care which), or we do not have
+ * it.
+ *
+ * Furthermore, we know that one of our
+ * endpoints must be the edge of the run of
+ * duplicates. For example, given this
+ * sequence:
+ *
+ * idx 0 1 2 3 4 5
+ * key A C C C C D
+ *
+ * If we are searching for "B", we might
+ * hit the duplicate run at lo=1, hi=3
+ * (e.g., by first mi=3, then mi=0). But we
+ * can never have lo > 1, because B < C.
+ * That is, if our key is less than the
+ * run, we know that "lo" is the edge, but
+ * we can say nothing of "hi". Similarly,
+ * if our key is greater than the run, we
+ * know that "hi" is the edge, but we can
+ * say nothing of "lo".
+ *
+ * Therefore if we do not find it, we also
+ * know where it would go if it did exist:
+ * just on the far side of the edge that we
+ * know about.
*/
+ if (ofs == 20) {
+ mi = lo;
+ mi_key = base + elem_size * mi + key_offset;
+ cmp = memcmp(mi_key, key, 20);
+ if (!cmp)
+ return mi;
+ if (cmp < 0)
+ return -1 - hi;
+ else
+ return -1 - lo;
+ }
+
hiv = hi_key[ofs_0];
if (ofs_0 < 19)
hiv = (hiv << 8) | hi_key[ofs_0+1];
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
new file mode 100644
index 0000000..fecd5a0
--- /dev/null
+++ b/t/lib-pack.sh
@@ -0,0 +1,78 @@
+#!/bin/sh
+#
+# Support routines for hand-crafting weird or malicious packs.
+#
+# You can make a complete pack like:
+#
+# pack_header 2 >foo.pack &&
+# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
+# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
+# pack_trailer foo.pack
+
+# Print the big-endian 4-byte octal representation of $1
+uint32_octal () {
+ n=$1
+ printf '\%o' $(($n / 16777216)); n=$((n % 16777216))
+ printf '\%o' $(($n / 65536)); n=$((n % 65536))
+ printf '\%o' $(($n / 256)); n=$((n % 256))
+ printf '\%o' $(($n ));
+}
+
+# Print the big-endian 4-byte binary representation of $1
+uint32_binary () {
+ printf "$(uint32_octal "$1")"
+}
+
+# Print a pack header, version 2, for a pack with $1 objects
+pack_header () {
+ printf 'PACK' &&
+ printf '\0\0\0\2' &&
+ uint32_binary "$1"
+}
+
+# Print the pack data for object $1, as a delta against object $2 (or as a full
+# object if $2 is missing or empty). The output is suitable for including
+# directly in the packfile, and represents the entirety of the object entry.
+# Doing this on the fly (especially picking your deltas) is quite tricky, so we
+# have hardcoded some well-known objects. See the case statements below for the
+# complete list.
+pack_obj () {
+ case "$1" in
+ # empty blob
+ e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
+ case "$2" in
+ '')
+ printf '\060\170\234\003\0\0\0\0\1'
+ return
+ ;;
+ esac
+ ;;
+
+ # blob containing "\7\76"
+ e68fe8129b546b101aee9510c5328e7f21ca1d18)
+ case "$2" in
+ '')
+ printf '\062\170\234\143\267\3\0\0\116\0\106'
+ return
+ ;;
+ esac
+ ;;
+ esac
+
+ echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}"
+ return 1
+}
+
+# Compute and append pack trailer to "$1"
+pack_trailer () {
+ test-sha1 -b <"$1" >trailer.tmp &&
+ cat trailer.tmp >>"$1" &&
+ rm -f trailer.tmp
+}
+
+# Remove any existing packs to make sure that
+# whatever we index next will be the pack that we
+# actually use.
+clear_packs () {
+ rm -f .git/objects/pack/*
+}
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
new file mode 100755
index 0000000..04fe242
--- /dev/null
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the name of the packfile to create
+#
+# $2 is the number of times to duplicate each object
+create_pack () {
+ pack_header "$((2 * $2))" >"$1" &&
+ for i in $(test_seq 1 "$2"); do
+ pack_obj $LO_SHA1 &&
+ pack_obj $HI_SHA1
+ done >>"$1" &&
+ pack_trailer "$1"
+}
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+ create_pack no-dups.pack 1 &&
+ git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+ clear_packs &&
+ create_pack dups.pack 100 &&
+ git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create batch-check test vectors' '
+ cat >input <<-EOF &&
+ $LO_SHA1
+ $HI_SHA1
+ $MISSING_SHA1
+ EOF
+ cat >expect <<-EOF
+ $LO_SHA1 blob 2
+ $HI_SHA1 blob 0
+ $MISSING_SHA1 missing
+ EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+ git cat-file --batch-check <input >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+ (
+ GIT_USE_LOOKUP=1 &&
+ export GIT_USE_LOOKUP &&
+ git cat-file --batch-check <input >actual
+ ) &&
+ test_cmp expect actual
+'
+
+test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 3/6] add tests for indexing packs with delta cycles
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24 0:01 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24 0:02 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-24 0:02 ` Jeff King
2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
` (2 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
If we receive a broken or malicious pack from a remote, we
will feed it to index-pack. As index-pack processes the
objects as a stream, reconstructing and hashing each object
to get its name, it is not very susceptible to doing the
wrong with bad data (it simply notices that the data is
bogus and aborts).
However, one question raised on the list is whether it could
be susceptible to problems during the delta-resolution
phase. In particular, can a cycle in the packfile deltas
cause us to go into an infinite loop or cause any other
problem?
The answer is no.
We cannot have a cycle of delta-base offsets, because they
go only in one direction (the OFS_DELTA object mentions its
base by an offset towards the beginning of the file, and we
explicitly reject negative offsets).
We can have a cycle of REF_DELTA objects, which refer to
base objects by sha1 name. However, index-pack does not know
these sha1 names ahead of time; it has to reconstruct the
objects to get their names, and it cannot do so if there is
a delta cycle (in other words, it does not even realize
there is a cycle, but only that there are items that cannot
be resolved).
Even though we can reason out that index-pack should handle
this fine, let's add a few tests to make sure it behaves
correctly.
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
t/lib-pack.sh | 22 +++++++++++++++++
t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)
create mode 100755 t/t5309-pack-delta-cycles.sh
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index fecd5a0..7e8685b 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -55,6 +55,28 @@ pack_obj () {
printf '\062\170\234\143\267\3\0\0\116\0\106'
return
;;
+ 01d7713666f4de822776c7622c10f1b07de280dc)
+ printf '\165\1\327\161\66\146\364\336\202\47\166' &&
+ printf '\307\142\54\20\361\260\175\342\200\334\170' &&
+ printf '\234\143\142\142\142\267\003\0\0\151\0\114'
+ return
+ ;;
+ esac
+ ;;
+
+ # blob containing "\7\0"
+ 01d7713666f4de822776c7622c10f1b07de280dc)
+ case "$2" in
+ '')
+ printf '\062\170\234\143\147\0\0\0\20\0\10'
+ return
+ ;;
+ e68fe8129b546b101aee9510c5328e7f21ca1d18)
+ printf '\165\346\217\350\22\233\124\153\20\32\356' &&
+ printf '\225\20\305\62\216\177\41\312\35\30\170\234' &&
+ printf '\143\142\142\142\147\0\0\0\53\0\16'
+ return
+ ;;
esac
;;
esac
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
new file mode 100755
index 0000000..1640bf7
--- /dev/null
+++ b/t/t5309-pack-delta-cycles.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='test index-pack handling of delta cycles in packfiles'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc
+B=e68fe8129b546b101aee9510c5328e7f21ca1d18
+
+# double-check our hand-constucted packs
+test_expect_success 'index-pack works with a single delta (A->B)' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A $B &&
+ pack_obj $B
+ } >ab.pack &&
+ pack_trailer ab.pack &&
+ git index-pack --stdin <ab.pack &&
+ git cat-file -t $A &&
+ git cat-file -t $B
+'
+
+test_expect_success 'index-pack works with a single delta (B->A)' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A &&
+ pack_obj $B $A
+ } >ba.pack &&
+ pack_trailer ba.pack &&
+ git index-pack --stdin <ba.pack &&
+ git cat-file -t $A &&
+ git cat-file -t $B
+'
+
+test_expect_success 'index-pack detects missing base objects' '
+ clear_packs &&
+ {
+ pack_header 1 &&
+ pack_obj $A $B
+ } >missing.pack &&
+ pack_trailer missing.pack &&
+ test_must_fail git index-pack --fix-thin --stdin <missing.pack
+'
+
+test_expect_success 'index-pack detects REF_DELTA cycles' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A $B &&
+ pack_obj $B $A
+ } >cycle.pack &&
+ pack_trailer cycle.pack &&
+ test_must_fail git index-pack --fix-thin --stdin <cycle.pack
+'
+
+test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 4/6] test index-pack on packs with recoverable delta cycles
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
` (2 preceding siblings ...)
2013-08-24 0:02 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
@ 2013-08-24 0:02 ` Jeff King
2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
The previous commit added tests to show that index-pack
correctly bails in unrecoverable situations. There are some
situations where the data could be recovered, but it is not
currently:
1. If we can break the cycle using an object from another
pack via --fix-thin.
2. If we can break the cycle using a duplicate of one of
the objects found in the same pack.
Note that neither of these is particularly high priority; a
delta cycle within a pack should never occur, and we have no
record of even a buggy git implementation creating such a
pack.
However, it's worth adding these tests for two reasons. One,
to document that we do not currently handle the situation,
even though it is possible. And two, to exercise the code
that runs in this situation; even though it fails, by
running it we can confirm that index-pack detects the
situation and aborts, and does not misbehave (e.g., by
following the cycle in an infinite loop).
In both cases, we hit an assert that aborts index-pack.
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 1640bf7..3e7861b 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' '
test_must_fail git index-pack --fix-thin --stdin <cycle.pack
'
+test_expect_failure 'failover to an object in another pack' '
+ clear_packs &&
+ git index-pack --stdin <ab.pack &&
+ git index-pack --stdin --fix-thin <cycle.pack
+'
+
+test_expect_failure 'failover to a duplicate object in the same pack' '
+ clear_packs &&
+ {
+ pack_header 3 &&
+ pack_obj $A $B &&
+ pack_obj $B $A &&
+ pack_obj $A
+ } >recoverable.pack &&
+ pack_trailer recoverable.pack &&
+ git index-pack --fix-thin --stdin <recoverable.pack
+'
+
test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
` (3 preceding siblings ...)
2013-08-24 0:02 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
@ 2013-08-24 0:02 ` Jeff King
2013-08-24 0:02 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
Git should never generate packs with duplicate objects.
However, we may see such packs due to bugs in Git or other
implementations (e.g., JGit had such a bug a few years ago).
In theory, such packs should not be a problem for us (we
will simply find one of the instances of the object when
looking in the pack). However, the JGit bug report mentioned
possible infinite loops during repacking due to cycles in
the delta chain. Though this problem hasn't specifically
been reproduced on modern git, there is no reason not to be
careful with incoming packs, given that only buggy
implementations should be producing such packs, anyway.
This patch introduces the pack.indexDuplicates option to
allow or reject such packs from index-pack. The default
remains to allow it.
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
builtin/index-pack.c | 4 ++++
pack-write.c | 24 ++++++++++++++++++++++++
pack.h | 2 ++
t/t5308-pack-detect-duplicates.sh | 8 ++++++++
4 files changed, 38 insertions(+)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 79dfe47..72e19a0 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1364,6 +1364,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
#endif
return 0;
}
+ if (!strcmp(k, "pack.indexduplicates")) {
+ opts->allow_duplicates = git_config_bool(k, v);
+ return 0;
+ }
return git_default_config(k, v, cb);
}
diff --git a/pack-write.c b/pack-write.c
index ca9e63b..da946a7 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
memset(opts, 0, sizeof(*opts));
opts->version = 2;
opts->off32_limit = 0x7fffffff;
+ opts->allow_duplicates = 1;
}
static int sha1_compare(const void *_a, const void *_b)
@@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
sizeof(ofsval), cmp_uint32);
}
+static void *find_duplicate(void *vbase, size_t n, size_t size,
+ int (*cmp)(const void *, const void *))
+{
+ unsigned char *base = vbase;
+ while (n > 1) {
+ if (!cmp(base, base + size))
+ return base;
+ base += size;
+ n--;
+ }
+ return NULL;
+}
+
/*
* On entry *sha1 contains the pack content SHA1 hash, on exit it is
* the SHA1 hash of sorted object names. The objects array passed in
@@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
else
sorted_by_sha = list = last = NULL;
+ if (!opts->allow_duplicates) {
+ struct pack_idx_entry **dup;
+
+ dup = find_duplicate(sorted_by_sha, nr_objects,
+ sizeof(*sorted_by_sha), sha1_compare);
+ if (dup)
+ die("pack has duplicate entries for %s",
+ sha1_to_hex((*dup)->sha1));
+ }
+
if (opts->flags & WRITE_IDX_VERIFY) {
assert(index_name);
f = sha1fd_check(index_name);
diff --git a/pack.h b/pack.h
index aa6ee7d..45217b6 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,8 @@ struct pack_idx_option {
uint32_t version;
uint32_t off32_limit;
+ int allow_duplicates;
+
/*
* List of offsets that would fit within off32_limit but
* need to be written out as 64-bit entity for byte-for-byte
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 04fe242..97ce2e0 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
test_cmp expect actual
'
+test_expect_success 'index-pack can reject packs with duplicates' '
+ clear_packs &&
+ create_pack dups.pack 2 &&
+ test_must_fail \
+ git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+ test_expect_code 1 git cat-file -e $LO_SHA1
+'
+
test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 6/6] default pack.indexDuplicates to false
2013-08-24 0:01 ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
` (4 preceding siblings ...)
2013-08-24 0:02 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-24 0:02 ` Jeff King
5 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-24 0:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: Git Mailing List, Duy Nguyen, Shawn O. Pearce, Nicolas Pitre
We should never see duplicate objects in packs, and it is
unknown what kind of complications such packs could create
during the repacking process. The previous commit introduced
a safety valve for checking packs coming into the repository
and being indexed by index-pack.
Let's turn the safety valve on by default. This helps
protect sites receiving packfiles from potentially malicious
strangers, and shouldn't affect normal use (and if it does,
it will have helped us identify a buggy packfile writer).
In a pinch, users can recover by turning off the option, or
by resorting to unpack-objects to explode the pack.
Note that this also turns the option on for packs we write
ourselves (e.g., via pack-objects, fast-import, or streaming
large blobs). We do not expect maliciously constructed
packfiles in these code paths, of course, but it can serve
as an extra check that we have no accidentally created a
buggy pack (and the check itself is cheap to perform).
Signed-off-by: Jeff King <peff@peff•net>
Acked-by: Nicolas Pitre <nico@fluxnic•net>
---
pack-write.c | 1 -
t/t5308-pack-detect-duplicates.sh | 9 ++++-----
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/pack-write.c b/pack-write.c
index da946a7..1e3c459 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
memset(opts, 0, sizeof(*opts));
opts->version = 2;
opts->off32_limit = 0x7fffffff;
- opts->allow_duplicates = 1;
}
static int sha1_compare(const void *_a, const void *_b)
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 97ce2e0..e0ba5e1 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -37,10 +37,10 @@ test_expect_success 'index-pack will allow duplicate objects by default' '
git index-pack --stdin <no-dups.pack
'
-test_expect_success 'index-pack will allow duplicate objects by default' '
+test_expect_success 'index-pack will allow duplicate objects if asked' '
clear_packs &&
create_pack dups.pack 100 &&
- git index-pack --stdin <dups.pack
+ git -c pack.indexDuplicates=true index-pack --stdin <dups.pack
'
test_expect_success 'create batch-check test vectors' '
@@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' '
test_cmp expect actual
'
-test_expect_success 'index-pack can reject packs with duplicates' '
+test_expect_success 'index-pack rejects packs with duplicates by default' '
clear_packs &&
create_pack dups.pack 2 &&
- test_must_fail \
- git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+ test_must_fail git index-pack --stdin <dups.pack &&
test_expect_code 1 git cat-file -e $LO_SHA1
'
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41 ` Junio C Hamano
@ 2013-08-23 19:41 ` Johannes Sixt
2013-08-23 23:37 ` Jeff King
1 sibling, 1 reply; 37+ messages in thread
From: Johannes Sixt @ 2013-08-23 19:41 UTC (permalink / raw)
To: Jeff King
Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce,
Nicolas Pitre
Am 23.08.2013 01:14, schrieb Jeff King:
> +++ b/t/t5308-pack-detect-duplicates.sh
> @@ -0,0 +1,73 @@
> +#!/bin/sh
> +
> +test_description='handling of duplicate objects in incoming packfiles'
> +. ./test-lib.sh
> +. ../lib-pack.sh
This should be
. "$TEST_DIRECTORY"/lib-pack.sh
to support running tests with --root (also in patch 3/6).
-- Hannes
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
2013-08-23 19:41 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
@ 2013-08-23 23:37 ` Jeff King
0 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-23 23:37 UTC (permalink / raw)
To: Johannes Sixt
Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce,
Nicolas Pitre
On Fri, Aug 23, 2013 at 09:41:39PM +0200, Johannes Sixt wrote:
> Am 23.08.2013 01:14, schrieb Jeff King:
> >+++ b/t/t5308-pack-detect-duplicates.sh
> >@@ -0,0 +1,73 @@
> >+#!/bin/sh
> >+
> >+test_description='handling of duplicate objects in incoming packfiles'
> >+. ./test-lib.sh
> >+. ../lib-pack.sh
>
> This should be
>
> . "$TEST_DIRECTORY"/lib-pack.sh
>
> to support running tests with --root (also in patch 3/6).
Doh, you would think that I would remember that, as the one who
introduced "--root" in the first place.
Will fix. Thanks for noticing.
-Peff
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 3/6] add tests for indexing packs with delta cycles
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
@ 2013-08-22 23:14 ` Jeff King
2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
` (3 subsequent siblings)
6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:14 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
If we receive a broken or malicious pack from a remote, we
will feed it to index-pack. As index-pack processes the
objects as a stream, reconstructing and hashing each object
to get its name, it is not very susceptible to doing the
wrong with bad data (it simply notices that the data is
bogus and aborts).
However, one question raised on the list is whether it could
be susceptible to problems during the delta-resolution
phase. In particular, can a cycle in the packfile deltas
cause us to go into an infinite loop or cause any other
problem?
The answer is no.
We cannot have a cycle of delta-base offsets, because they
go only in one direction (the OFS_DELTA object mentions its
base by an offset towards the beginning of the file, and we
explicitly reject negative offsets).
We can have a cycle of REF_DELTA objects, which refer to
base objects by sha1 name. However, index-pack does not know
these sha1 names ahead of time; it has to reconstruct the
objects to get their names, and it cannot do so if there is
a delta cycle (in other words, it does not even realize
there is a cycle, but only that there are items that cannot
be resolved).
Even though we can reason out that index-pack should handle
this fine, let's add a few tests to make sure it behaves
correctly.
Signed-off-by: Jeff King <peff@peff•net>
---
t/lib-pack.sh | 22 +++++++++++++++++
t/t5309-pack-delta-cycles.sh | 59 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)
create mode 100755 t/t5309-pack-delta-cycles.sh
diff --git a/t/lib-pack.sh b/t/lib-pack.sh
index 61c5d19..e028c40 100644
--- a/t/lib-pack.sh
+++ b/t/lib-pack.sh
@@ -55,6 +55,28 @@ pack_obj() {
printf '\062\170\234\143\267\3\0\0\116\0\106'
return
;;
+ 01d7713666f4de822776c7622c10f1b07de280dc)
+ printf '\165\1\327\161\66\146\364\336\202\47\166' &&
+ printf '\307\142\54\20\361\260\175\342\200\334\170' &&
+ printf '\234\143\142\142\142\267\003\0\0\151\0\114'
+ return
+ ;;
+ esac
+ ;;
+
+ # blob containing "\7\0"
+ 01d7713666f4de822776c7622c10f1b07de280dc)
+ case "$2" in
+ '')
+ printf '\062\170\234\143\147\0\0\0\20\0\10'
+ return
+ ;;
+ e68fe8129b546b101aee9510c5328e7f21ca1d18)
+ printf '\165\346\217\350\22\233\124\153\20\32\356' &&
+ printf '\225\20\305\62\216\177\41\312\35\30\170\234' &&
+ printf '\143\142\142\142\147\0\0\0\53\0\16'
+ return
+ ;;
esac
;;
esac
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
new file mode 100755
index 0000000..9b3f2c3
--- /dev/null
+++ b/t/t5309-pack-delta-cycles.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='test index-pack handling of delta cycles in packfiles'
+. ./test-lib.sh
+. ../lib-pack.sh
+
+# Two similar-ish objects that we have computed deltas between.
+A=01d7713666f4de822776c7622c10f1b07de280dc
+B=e68fe8129b546b101aee9510c5328e7f21ca1d18
+
+# double-check our hand-constucted packs
+test_expect_success 'index-pack works with a single delta (A->B)' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A $B &&
+ pack_obj $B
+ } >ab.pack &&
+ pack_trailer ab.pack &&
+ git index-pack --stdin <ab.pack &&
+ git cat-file -t $A &&
+ git cat-file -t $B
+'
+
+test_expect_success 'index-pack works with a single delta (B->A)' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A &&
+ pack_obj $B $A
+ } >ba.pack &&
+ pack_trailer ba.pack &&
+ git index-pack --stdin <ba.pack &&
+ git cat-file -t $A &&
+ git cat-file -t $B
+'
+
+test_expect_success 'index-pack detects missing base objects' '
+ clear_packs &&
+ {
+ pack_header 1 &&
+ pack_obj $A $B
+ } >missing.pack &&
+ pack_trailer missing.pack &&
+ test_must_fail git index-pack --fix-thin --stdin <missing.pack
+'
+
+test_expect_success 'index-pack detects REF_DELTA cycles' '
+ clear_packs &&
+ {
+ pack_header 2 &&
+ pack_obj $A $B &&
+ pack_obj $B $A
+ } >cycle.pack &&
+ pack_trailer cycle.pack &&
+ test_must_fail git index-pack --fix-thin --stdin <cycle.pack
+'
+
+test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 4/6] test index-pack on packs with recoverable delta cycles
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
` (2 preceding siblings ...)
2013-08-22 23:14 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
@ 2013-08-22 23:15 ` Jeff King
2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
` (2 subsequent siblings)
6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
The previous commit added tests to show that index-pack
correctly bails in unrecoverable situations. There are some
situations where the data could be recovered, but it is not
currently:
1. If we can break the cycle using an object from another
pack via --fix-thin.
2. If we can break the cycle using a duplicate of one of
the objects found in the same pack.
Note that neither of these is particularly high priority; a
delta cycle within a pack should never occur, and we have no
record of even a buggy git implementation creating such a
pack.
However, it's worth adding these tests for two reasons. One,
to document that we do not currently handle the situation,
even though it is possible. And two, to exercise the code
that runs in this situation; even though it fails, by
running it we can confirm that index-pack detects the
situation and aborts, and does not misbehave (e.g., by
following the cycle in an infinite loop).
In both cases, we hit an assert that aborts index-pack.
Signed-off-by: Jeff King <peff@peff•net>
---
t/t5309-pack-delta-cycles.sh | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/t/t5309-pack-delta-cycles.sh b/t/t5309-pack-delta-cycles.sh
index 9b3f2c3..38e1809 100755
--- a/t/t5309-pack-delta-cycles.sh
+++ b/t/t5309-pack-delta-cycles.sh
@@ -56,4 +56,22 @@ test_expect_success 'index-pack detects REF_DELTA cycles' '
test_must_fail git index-pack --fix-thin --stdin <cycle.pack
'
+test_expect_failure 'failover to an object in another pack' '
+ clear_packs &&
+ git index-pack --stdin <ab.pack &&
+ git index-pack --stdin --fix-thin <cycle.pack
+'
+
+test_expect_failure 'failover to a duplicate object in the same pack' '
+ clear_packs &&
+ {
+ pack_header 3 &&
+ pack_obj $A $B &&
+ pack_obj $B $A &&
+ pack_obj $A
+ } >recoverable.pack &&
+ pack_trailer recoverable.pack &&
+ git index-pack --fix-thin --stdin <recoverable.pack
+'
+
test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 5/6] index-pack: optionally reject packs with duplicate objects
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
` (3 preceding siblings ...)
2013-08-22 23:15 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
@ 2013-08-22 23:15 ` Jeff King
2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:15 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
Git should never generate packs with duplicate objects.
However, we may see such packs due to bugs in Git or other
implementations (e.g., JGit had such a bug a few years ago).
In theory, such packs should not be a problem for us (we
will simply find one of the instances of the object when
looking in the pack). However, the JGit bug report mentioned
possible infinite loops during repacking due to cycles in
the delta chain. Though this problem hasn't specifically
been reproduced on modern git, there is no reason not to be
careful with incoming packs, given that only buggy
implementations should be producing such packs, anyway.
This patch introduces the pack.indexDuplicates option to
allow or reject such packs from index-pack. The default
remains to allow it.
Signed-off-by: Jeff King <peff@peff•net>
---
builtin/index-pack.c | 4 ++++
pack-write.c | 24 ++++++++++++++++++++++++
pack.h | 2 ++
t/t5308-pack-detect-duplicates.sh | 8 ++++++++
4 files changed, 38 insertions(+)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 9c1cfac..c27556f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1369,6 +1369,10 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
#endif
return 0;
}
+ if (!strcmp(k, "pack.indexduplicates")) {
+ opts->allow_duplicates = git_config_bool(k, v);
+ return 0;
+ }
return git_default_config(k, v, cb);
}
diff --git a/pack-write.c b/pack-write.c
index ca9e63b..da946a7 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,6 +7,7 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
memset(opts, 0, sizeof(*opts));
opts->version = 2;
opts->off32_limit = 0x7fffffff;
+ opts->allow_duplicates = 1;
}
static int sha1_compare(const void *_a, const void *_b)
@@ -37,6 +38,19 @@ static int need_large_offset(off_t offset, const struct pack_idx_option *opts)
sizeof(ofsval), cmp_uint32);
}
+static void *find_duplicate(void *vbase, size_t n, size_t size,
+ int (*cmp)(const void *, const void *))
+{
+ unsigned char *base = vbase;
+ while (n > 1) {
+ if (!cmp(base, base + size))
+ return base;
+ base += size;
+ n--;
+ }
+ return NULL;
+}
+
/*
* On entry *sha1 contains the pack content SHA1 hash, on exit it is
* the SHA1 hash of sorted object names. The objects array passed in
@@ -68,6 +82,16 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
else
sorted_by_sha = list = last = NULL;
+ if (!opts->allow_duplicates) {
+ struct pack_idx_entry **dup;
+
+ dup = find_duplicate(sorted_by_sha, nr_objects,
+ sizeof(*sorted_by_sha), sha1_compare);
+ if (dup)
+ die("pack has duplicate entries for %s",
+ sha1_to_hex((*dup)->sha1));
+ }
+
if (opts->flags & WRITE_IDX_VERIFY) {
assert(index_name);
f = sha1fd_check(index_name);
diff --git a/pack.h b/pack.h
index aa6ee7d..45217b6 100644
--- a/pack.h
+++ b/pack.h
@@ -44,6 +44,8 @@ struct pack_idx_option {
uint32_t version;
uint32_t off32_limit;
+ int allow_duplicates;
+
/*
* List of offsets that would fit within off32_limit but
* need to be written out as 64-bit entity for byte-for-byte
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index 719d48f..cb9b44e 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -70,4 +70,12 @@ test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
test_cmp expect actual
'
+test_expect_success 'index-pack can reject packs with duplicates' '
+ clear_packs &&
+ create_pack dups.pack 2 &&
+ test_must_fail \
+ git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+ test_expect_code 1 git cat-file -e $LO_SHA1
+'
+
test_done
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* [PATCH 6/6] default pack.indexDuplicates to false
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
` (4 preceding siblings ...)
2013-08-22 23:15 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
@ 2013-08-22 23:16 ` Jeff King
2013-08-22 23:35 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
6 siblings, 0 replies; 37+ messages in thread
From: Jeff King @ 2013-08-22 23:16 UTC (permalink / raw)
To: Git Mailing List
Cc: Duy Nguyen, Junio C Hamano, Shawn O. Pearce, Nicolas Pitre
We should never see duplicate objects in packs, and it is
unknown what kind of complications such packs could create
during the repacking process. The previous commit introduced
a safety valve for checking packs coming into the repository
and being indexed by index-pack.
Let's turn the safety valve on by default. This helps
protect sites receiving packfiles from potentially malicious
strangers, and shouldn't affect normal use (and if it does,
it will have helped us identify a buggy packfile writer).
In a pinch, users can recover by turning off the option, or
by resorting to unpack-objects to explode the pack.
Note that this also turns the option on for packs we write
ourselves (e.g., via pack-objects, fast-import, or streaming
large blobs). We do not expect maliciously constructed
packfiles in these code paths, of course, but it can serve
as an extra check that we have no accidentally created a
buggy pack (and the check itself is cheap to perform).
Signed-off-by: Jeff King <peff@peff•net>
---
pack-write.c | 1 -
t/t5308-pack-detect-duplicates.sh | 9 ++++-----
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/pack-write.c b/pack-write.c
index da946a7..1e3c459 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -7,7 +7,6 @@ void reset_pack_idx_option(struct pack_idx_option *opts)
memset(opts, 0, sizeof(*opts));
opts->version = 2;
opts->off32_limit = 0x7fffffff;
- opts->allow_duplicates = 1;
}
static int sha1_compare(const void *_a, const void *_b)
diff --git a/t/t5308-pack-detect-duplicates.sh b/t/t5308-pack-detect-duplicates.sh
index cb9b44e..e982095 100755
--- a/t/t5308-pack-detect-duplicates.sh
+++ b/t/t5308-pack-detect-duplicates.sh
@@ -37,10 +37,10 @@ test_expect_success 'index-pack will allow duplicate objects by default' '
git index-pack --stdin <no-dups.pack
'
-test_expect_success 'index-pack will allow duplicate objects by default' '
+test_expect_success 'index-pack will allow duplicate objects if asked' '
clear_packs &&
create_pack dups.pack 100 &&
- git index-pack --stdin <dups.pack
+ git -c pack.indexDuplicates=true index-pack --stdin <dups.pack
'
test_expect_success 'create batch-check test vectors' '
@@ -70,11 +70,10 @@ test_expect_success 'index-pack can reject packs with duplicates' '
test_cmp expect actual
'
-test_expect_success 'index-pack can reject packs with duplicates' '
+test_expect_success 'index-pack rejects packs with duplicates by default' '
clear_packs &&
create_pack dups.pack 2 &&
- test_must_fail \
- git -c pack.indexDuplicates=0 index-pack --stdin <dups.pack &&
+ test_must_fail git index-pack --stdin <dups.pack &&
test_expect_code 1 git cat-file -e $LO_SHA1
'
--
1.8.4.rc2.28.g6bb5f3f
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCHv2 0/6] duplicate objects and delta cycles, oh my!
2013-08-22 23:12 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
` (5 preceding siblings ...)
2013-08-22 23:16 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
@ 2013-08-22 23:35 ` Nicolas Pitre
6 siblings, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-08-22 23:35 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List, Duy Nguyen, Junio C Hamano, Shawn O. Pearce
On Thu, 22 Aug 2013, Jeff King wrote:
> On Thu, Aug 22, 2013 at 10:43:05AM -0400, Jeff King wrote:
>
> > > write_idx_file() is called after index-pack processes all delta
> > > objects. Could resolve_deltas() go cyclic with certain duplicate
> > > object setup?
> >
> > Good question. I'm not sure. I'll check it out.
>
> I think the answer is "no", based on both reasoning and testing (both of
> which are included in patches 3-4 of the series below).
>
> So here's my re-roll of the series.
This looks all good to me.
Acked-by: Nicolas Pitre <nico@fluxnic•net>
Nicolas
^ permalink raw reply [flat|nested] 37+ messages in thread