* block-sha1: improve code on large-register-set machines @ 2009-08-10 23:52 Linus Torvalds 2009-08-11 6:15 ` Nicolas Pitre 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2009-08-10 23:52 UTC (permalink / raw) To: Git Mailing List, Junio C Hamano For x86 performance (especially in 32-bit mode) I added that hack to write the SHA1 internal temporary hash using a volatile pointer, in order to get gcc to not try to cache the array contents. Because gcc will do all the wrong things, and then spill things in insane random ways. But on architectures like PPC, where you have 32 registers, it's actually perfectly reasonable to put the whole temporary array[] into the register set, and gcc can do so. So make the 'volatile unsigned int *' cast be dependent on a SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just x86 and x86-64. With that, the routine is fairly reasonable even when compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on a G5: * Paulus asm version: about 3.67s * Yours with no change: about 5.74s * Yours without "volatile": about 3.78s so with this the C version is within about 3% of the asm one. And add a lot of commentary on what the heck is going on. Signed-off-by: Linus Torvalds <torvalds@linux-foundation•org> --- I also asked David Miller to test the non-volatile version on Sparc, but I suspect it will have the same pattern. ia64 likewise (but I have not asked anybody to test). Of the other architectures, ARM probably would wants SMALL_REGISTER_SET, but I suspect the problem there is the htonl() (on little-endian), and possibly the unaligned loads - at least on older ARM. The latter is something gcc could be taught about, though (the SHA_SRC macro would just need to use a pointer that goes through a packed struct member or something). block-sha1/sha1.c | 25 ++++++++++++++++++++++++- 1 files changed, 24 insertions(+), 1 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 36da763..9bc8b8a 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -82,6 +82,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) #define SHA_ASM(op, x, n) ({ unsigned int __res; asm(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; }) #define SHA_ROL(x,n) SHA_ASM("rol", x, n) #define SHA_ROR(x,n) SHA_ASM("ror", x, n) +#define SMALL_REGISTER_SET #else @@ -93,7 +94,29 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx) /* This "rolls" over the 512-bit array */ #define W(x) (array[(x)&15]) -#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val)) + +/* + * If you have 32 registers or more, the compiler can (and should) + * try to change the array[] accesses into registers. However, on + * machines with less than ~25 registers, that won't really work, + * and at least gcc will make an unholy mess of it. + * + * So to avoid that mess which just slows things down, we force + * the stores to memory to actually happen (we might be better off + * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as + * suggested by Artur Skawina - that will also make gcc unable to + * try to do the silly "optimize away loads" part because it won't + * see what the value will be). + * + * Ben Herrenschmidt reports that on PPC, the C version comes close + * to the optimized asm with this (ie on PPC you don't want that + * 'volatile', since there are lots of registers). + */ +#ifdef SMALL_REGISTER_SET + #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val)) +#else + #define setW(x, val) (W(x) = (val)) +#endif /* * Where do we get the source from? The first 16 iterations get it from ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds @ 2009-08-11 6:15 ` Nicolas Pitre 2009-08-11 15:04 ` Linus Torvalds 2009-08-11 15:43 ` Linus Torvalds 0 siblings, 2 replies; 16+ messages in thread From: Nicolas Pitre @ 2009-08-11 6:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Mon, 10 Aug 2009, Linus Torvalds wrote: > > For x86 performance (especially in 32-bit mode) I added that hack to write > the SHA1 internal temporary hash using a volatile pointer, in order to get > gcc to not try to cache the array contents. Because gcc will do all the > wrong things, and then spill things in insane random ways. > > But on architectures like PPC, where you have 32 registers, it's actually > perfectly reasonable to put the whole temporary array[] into the register > set, and gcc can do so. > > So make the 'volatile unsigned int *' cast be dependent on a > SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just > x86 and x86-64. With that, the routine is fairly reasonable even when > compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on > a G5: > > * Paulus asm version: about 3.67s > * Yours with no change: about 5.74s > * Yours without "volatile": about 3.78s > > so with this the C version is within about 3% of the asm one. > > And add a lot of commentary on what the heck is going on. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation•org> > --- > > I also asked David Miller to test the non-volatile version on Sparc, but I > suspect it will have the same pattern. ia64 likewise (but I have not asked > anybody to test). > > Of the other architectures, ARM probably would wants SMALL_REGISTER_SET, > but I suspect the problem there is the htonl() (on little-endian), and > possibly the unaligned loads - at least on older ARM. The latter is > something gcc could be taught about, though (the SHA_SRC macro would just > need to use a pointer that goes through a packed struct member or > something). The "older" ARM (those that don't perform unaligned accesses in hardware) are still the majority by far in the field. Here some numbers on ARM for 203247018 bytes. MOZILLA_SHA1: 14.520s ARM_SHA1: 5.600s OPENSSL: 5.530s BLK_SHA1: 5.280s [original] BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined] BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))'] BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")'] At this point the generated assembly is pretty slick. I bet the full memory barrier might help on x86 as well. However the above BLK_SHA1 works only for aligned source buffers. So let's define our own SHA_SRC to replace the htonl() (which should probably be ntohl() by the way) like this: #define SHA_SRC(t) \ ({ unsigned char *__d = (unsigned char *)&data[t]; \ (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) And this provides the exact same performance as the ntohl() based version (4.980s) except that this now cope with unaligned buffers too. Of course the BLK_SHA1 version is a pig since it is totally unrolled text data bss dec hex filename 1220 0 0 1220 4c4 mozilla-sha1/sha1.o 852 0 0 852 354 arm/sha1_arm.o 6292 0 0 6292 1894 block-sha1/sha1.o so the speed advantage has a significant (but relative) code size cost. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 6:15 ` Nicolas Pitre @ 2009-08-11 15:04 ` Linus Torvalds 2009-08-11 18:00 ` Nicolas Pitre 2009-08-11 15:43 ` Linus Torvalds 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2009-08-11 15:04 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > #define SHA_SRC(t) \ > ({ unsigned char *__d = (unsigned char *)&data[t]; \ > (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) > > And this provides the exact same performance as the ntohl() based > version (4.980s) except that this now cope with unaligned buffers too. Is it better to do a (conditional) memcpy up front? Or is the byte-based one better just because you always end up doing the shifting anyway due to most ARM situations being little-endian? I _suspect_ that most large SHA1 calls from git are pre-aligned. The big SHA1 calls are for pack-file verification in fsck, which should all be aligned. Same goes for index file integrity checking. The actual object SHA1 calculations are likely not aligned (we do that object header thing), and if you can't do the htonl() any better way I guess the byte-based thing is the way to go.. Linus --- block-sha1/sha1.c | 13 ++++++++++++- 1 files changed, 12 insertions(+), 1 deletions(-) diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c index 9bc8b8a..df27e66 100644 --- a/block-sha1/sha1.c +++ b/block-sha1/sha1.c @@ -25,6 +25,12 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx) ctx->H[4] = 0xc3d2e1f0; } +#ifdef REALLY_SLOW_UNALIGNED + #define is_unaligned(ptr) (3 & (unsigned long)(ptr)) +#else + #define is_unaligned(ptr) 0 +#endif + void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) { @@ -47,7 +53,12 @@ void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len) blk_SHA1Block(ctx, ctx->W); } while (len >= 64) { - blk_SHA1Block(ctx, data); + const unsigned int *block = data; + if (is_unaligned(data)) { + memcpy(ctx->W, data, 64); + block = ctx->W; + } + blk_SHA1Block(ctx, block); data += 64; len -= 64; } ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 15:04 ` Linus Torvalds @ 2009-08-11 18:00 ` Nicolas Pitre 2009-08-11 19:31 ` Nicolas Pitre 0 siblings, 1 reply; 16+ messages in thread From: Nicolas Pitre @ 2009-08-11 18:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Linus Torvalds wrote: > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > > #define SHA_SRC(t) \ > > ({ unsigned char *__d = (unsigned char *)&data[t]; \ > > (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) > > > > And this provides the exact same performance as the ntohl() based > > version (4.980s) except that this now cope with unaligned buffers too. > > Is it better to do a (conditional) memcpy up front? Or is the byte-based > one better just because you always end up doing the shifting anyway due to > most ARM situations being little-endian? The vast majority of ARM processors where git might be run are using a LE environment. > I _suspect_ that most large SHA1 calls from git are pre-aligned. The big > SHA1 calls are for pack-file verification in fsck, which should all be > aligned. Same goes for index file integrity checking. > > The actual object SHA1 calculations are likely not aligned (we do that > object header thing), and if you can't do the htonl() any better way I > guess the byte-based thing is the way to go.. Let's see. The default ntohl() provided by glibc generates this code: ldr r3, [r0, #0] mov r0, r3, lsr #24 and r2, r3, #0x00ff0000 orr r0, r0, r3, asl #24 orr r0, r0, r2, lsr #8 and r3, r3, #0x0000ff00 orr r0, r0, r3, asl #8 Ignoring the load result delay that gcc should properly schedule anyway, that makes for 7 cycles. Using the smarter ARM swab32 implementation from Linux we get: ldr r3, [r0, #0] eor r0, r3, r3, ror #16 bic r0, r0, #0x00ff0000 mov r0, r0, lsr #8 eor r0, r0, r3, ror #8 So we're down to 5 cycles. And the SHA1 test is a bit faster too: 4.570s down from 4.980s. However this is still using purely aligned buffers. Adding your patch using memcpy() to align the data in the unaligned case gives me wild results. Sometimes it is 4.930s, sometimes it is 5.560s. I suspect the icache starts to get tight and sometimes the SHA1 code and/or the special unaligned memcpy path get evicted sometimes. Using the byte access version we get: ldrb r3, [r0, #3] ldrb r2, [r0, #0] ldrb r1, [r0, #1] orr r3, r3, r2, asl #24 ldrb r0, [r0, #2] orr r3, r3, r1, asl #16 orr r0, r3, r0, asl #8 Again 7 cycles, like the ntohl() based version, which is coherent with the fact that they both make for 4.980s.. However this time any buffer alignment is supported. And in fact the extra 2 cycles over the swab32() version should actually be less overhead per word than the unaligned memcpy which is around 4 cycles per word. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 18:00 ` Nicolas Pitre @ 2009-08-11 19:31 ` Nicolas Pitre 2009-08-11 21:20 ` Brandon Casey 0 siblings, 1 reply; 16+ messages in thread From: Nicolas Pitre @ 2009-08-11 19:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Nicolas Pitre wrote: > On Tue, 11 Aug 2009, Linus Torvalds wrote: > > > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > > > > #define SHA_SRC(t) \ > > > ({ unsigned char *__d = (unsigned char *)&data[t]; \ > > > (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) > > > > > > And this provides the exact same performance as the ntohl() based > > > version (4.980s) except that this now cope with unaligned buffers too. > > > > The actual object SHA1 calculations are likely not aligned (we do that > > object header thing), and if you can't do the htonl() any better way I > > guess the byte-based thing is the way to go.. OK, even better: 4.400s. This is with this instead of the above: #define SHA_SRC(t) \ ({ unsigned char *__d = (unsigned char *)data; \ (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) The previous version would allocate a new register for __d and then index it with an offset of 0, 1, 2 or 3. This version always uses the register containing the data pointer with absolute offsets. The binary is a bit smaller too. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 19:31 ` Nicolas Pitre @ 2009-08-11 21:20 ` Brandon Casey 2009-08-11 21:36 ` Nicolas Pitre 2009-08-11 22:57 ` Linus Torvalds 0 siblings, 2 replies; 16+ messages in thread From: Brandon Casey @ 2009-08-11 21:20 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano Nicolas Pitre wrote: > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > >> On Tue, 11 Aug 2009, Linus Torvalds wrote: >> >>> On Tue, 11 Aug 2009, Nicolas Pitre wrote: >>>> #define SHA_SRC(t) \ >>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \ >>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) >>>> >>>> And this provides the exact same performance as the ntohl() based >>>> version (4.980s) except that this now cope with unaligned buffers too. >>> The actual object SHA1 calculations are likely not aligned (we do that >>> object header thing), and if you can't do the htonl() any better way I >>> guess the byte-based thing is the way to go.. > > OK, even better: 4.400s. > > This is with this instead of the above: > > #define SHA_SRC(t) \ > ({ unsigned char *__d = (unsigned char *)data; \ > (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ > (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) > > The previous version would allocate a new register for __d and then > index it with an offset of 0, 1, 2 or 3. This version always uses the > register containing the data pointer with absolute offsets. The binary > is a bit smaller too. In that case, why not change the interface of blk_SHA1Block() so that its second argument is const unsigned char* and get rid of __d and the { } ? Then it will look like this: static void blk_SHA1Block(blk_SHA_CTX *ctx, const unsigned char *data); ... #define SHA_SRC(t) \ ( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \ (data[(t)*4 + 2] << 8) | (data[(t)*4 + 3] << 0) ) Plus, we need something like the following to handle storing the hash to an unaligned buffer (warning copy/pasted): @@ -73,8 +74,12 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *c /* Output hash */ - for (i = 0; i < 5; i++) - ((unsigned int *)hashout)[i] = htonl(ctx->H[i]); + for (i = 0; i < 5; i++) { + *hashout++ = (unsigned char) (ctx->H[i] >> 24); + *hashout++ = (unsigned char) (ctx->H[i] >> 16); + *hashout++ = (unsigned char) (ctx->H[i] >> 8); + *hashout++ = (unsigned char) (ctx->H[i] >> 0); + } } #if defined(__i386__) || defined(__x86_64__) With these two changes plus a few other minor tweaks, the block-sha1 code compiles and passes the test suite on sparc (solaris 7) and mips (irix 6.5). -brandon ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 21:20 ` Brandon Casey @ 2009-08-11 21:36 ` Nicolas Pitre 2009-08-11 21:49 ` Brandon Casey 2009-08-11 22:57 ` Linus Torvalds 1 sibling, 1 reply; 16+ messages in thread From: Nicolas Pitre @ 2009-08-11 21:36 UTC (permalink / raw) To: Brandon Casey; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Brandon Casey wrote: > Nicolas Pitre wrote: > > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > >> On Tue, 11 Aug 2009, Linus Torvalds wrote: > >> > >>> On Tue, 11 Aug 2009, Nicolas Pitre wrote: > >>>> #define SHA_SRC(t) \ > >>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \ > >>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) > >>>> > >>>> And this provides the exact same performance as the ntohl() based > >>>> version (4.980s) except that this now cope with unaligned buffers too. > >>> The actual object SHA1 calculations are likely not aligned (we do that > >>> object header thing), and if you can't do the htonl() any better way I > >>> guess the byte-based thing is the way to go.. > > > > OK, even better: 4.400s. > > > > This is with this instead of the above: > > > > #define SHA_SRC(t) \ > > ({ unsigned char *__d = (unsigned char *)data; \ > > (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ > > (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) > > > > The previous version would allocate a new register for __d and then > > index it with an offset of 0, 1, 2 or 3. This version always uses the > > register containing the data pointer with absolute offsets. The binary > > is a bit smaller too. > > In that case, why not change the interface of blk_SHA1Block() so that its > second argument is const unsigned char* and get rid of __d and the { } ? Because not all architectures care to access the data bytewise. Please see the original SHA_SRC definition. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 21:36 ` Nicolas Pitre @ 2009-08-11 21:49 ` Brandon Casey 0 siblings, 0 replies; 16+ messages in thread From: Brandon Casey @ 2009-08-11 21:49 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano Nicolas Pitre wrote: > On Tue, 11 Aug 2009, Brandon Casey wrote: > >> Nicolas Pitre wrote: >>> On Tue, 11 Aug 2009, Nicolas Pitre wrote: >>> >>>> On Tue, 11 Aug 2009, Linus Torvalds wrote: >>>> >>>>> On Tue, 11 Aug 2009, Nicolas Pitre wrote: >>>>>> #define SHA_SRC(t) \ >>>>>> ({ unsigned char *__d = (unsigned char *)&data[t]; \ >>>>>> (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); }) >>>>>> >>>>>> And this provides the exact same performance as the ntohl() based >>>>>> version (4.980s) except that this now cope with unaligned buffers too. >>>>> The actual object SHA1 calculations are likely not aligned (we do that >>>>> object header thing), and if you can't do the htonl() any better way I >>>>> guess the byte-based thing is the way to go.. >>> OK, even better: 4.400s. >>> >>> This is with this instead of the above: >>> >>> #define SHA_SRC(t) \ >>> ({ unsigned char *__d = (unsigned char *)data; \ >>> (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ >>> (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) >>> >>> The previous version would allocate a new register for __d and then >>> index it with an offset of 0, 1, 2 or 3. This version always uses the >>> register containing the data pointer with absolute offsets. The binary >>> is a bit smaller too. >> In that case, why not change the interface of blk_SHA1Block() so that its >> second argument is const unsigned char* and get rid of __d and the { } ? > > Because not all architectures care to access the data bytewise. Please > see the original SHA_SRC definition. You mean this: #define SHA_SRC(t) htonl(data[t]) ? Or was there a definition before this one? I don't follow what you are saying. Are you saying that the following two examples are different? unsigned int *data; #define SHA_SRC(t) \ ({ unsigned char *__d = (unsigned char *)data; \ (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \ (__d[(t)*4 + 2] << 8) | (__d[(t)*4 + 3] << 0); }) and unsigned char *data; #define SHA_SRC(t) \ ( (data[(t)*4 + 0] << 24) | (data[(t)*4 + 1] << 16) | \ (data[(t)*4 + 2] << 8) | (data[(t)*4 + 3] << 0) ) -brandon ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 21:20 ` Brandon Casey 2009-08-11 21:36 ` Nicolas Pitre @ 2009-08-11 22:57 ` Linus Torvalds 2009-08-11 23:13 ` Brandon Casey 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2009-08-11 22:57 UTC (permalink / raw) To: Brandon Casey; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Brandon Casey wrote: > > In that case, why not change the interface of blk_SHA1Block() so that its > second argument is const unsigned char* and get rid of __d and the { } ? Because on big-endian, or on architectures like x86 that have an efficient byte swap, that would be horrible. You absoluetoy MUST NOT do things a byte at a time in those cases. The memory operations and the shifting just kills you. The reason you want to do things a byte at a time on ARM is that ARM cannot do unaligned accesses well (very modern cores are better, but rare), and that ARM has no bswap instruction and has fairly cheap shifts. On no sane architecture is that true. Unaligned loads are fast (and quite frankly, hardware where unaliged loads aren't fast is just crazy sh*t), and doing 'bswap' is way faster than doing many shifts and masks. So everything should be fundamentally word-oriented. Then, broken architectures that can't handle it should split up the words, not the other way around. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 22:57 ` Linus Torvalds @ 2009-08-11 23:13 ` Brandon Casey 0 siblings, 0 replies; 16+ messages in thread From: Brandon Casey @ 2009-08-11 23:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano Linus Torvalds wrote: > > On Tue, 11 Aug 2009, Brandon Casey wrote: >> In that case, why not change the interface of blk_SHA1Block() so that its >> second argument is const unsigned char* and get rid of __d and the { } ? > > Because on big-endian, or on architectures like x86 that have an efficient > byte swap, that would be horrible. Sorry, I missed Nicolas's first message where he said his SHA_SRC macro was for arm only. I started at your reply to him which only has the snippet which says "...this provides the exact same performance as the ntohl() based version except that this now cope with unaligned buffers too". My mistake. -brandon ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 6:15 ` Nicolas Pitre 2009-08-11 15:04 ` Linus Torvalds @ 2009-08-11 15:43 ` Linus Torvalds 2009-08-11 20:03 ` Nicolas Pitre 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2009-08-11 15:43 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > BLK_SHA1: 5.280s [original] > BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined] > BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))'] > BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")'] > > At this point the generated assembly is pretty slick. I bet the full > memory barrier might help on x86 as well. No, I had tested that earlier - single-word memory barrier for some reason gets _much_ better numbers at least on x86-64. We're talking linus 1.46 418.2 vs linus 2.004 304.6 kind of differences. With the "+m" it outperforms openssl (375-380MB/s). The "volatile unsigned int *" cast looks pretty much like the "+m" version to me, but Arthur got a speedup from whatever gcc code generation differences on his P4. The really fundamental and basic problem with gcc on this code is that gcc does not see _any_ difference what-so-ever between the five variables declared with unsigned int A, B, C, D, E; and the sixteen variables declared with unsigned int array[16]; and considers those all to be 21 local variables. It really seems to think that they are all 100% equivalent, and gcc totally ignores me doing things like adding "register" to the A-E ones etc. And if you are a compiler, and think that the routine has 21 equal register variables, you're going to do crazy reload sh*t when you have only 7 (or 15) GP registers. So doing that full memory barrier seems to just take that random situation, and force some random variable to be spilled (this is all from looking at the generated code, not from looking at gcc). In contrast, with the _targeted_ thing ("you'd better write back into array[]") we force gcc to spill the array[16] values, and not the A-E ones, and that's why it seems to make such a big difference. And no, I'm not sure why ARM apparently doesn't show the same behavior. Or maybe it does, but with an in-order core it doesn't matter as much which registers you keep reloading - you'll be serialized all the time _anyway_. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 15:43 ` Linus Torvalds @ 2009-08-11 20:03 ` Nicolas Pitre 2009-08-11 22:53 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Nicolas Pitre @ 2009-08-11 20:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Linus Torvalds wrote: > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > > BLK_SHA1: 5.280s [original] > > BLK_SHA1: 7.410s [with SMALL_REGISTER_SET defined] > > BLK_SHA1: 7.480s [with 'W(x)=(val);asm("":"+m" (W(x)))'] > > BLK_SHA1: 4.980s [with 'W(x)=(val);asm("":::"memory")'] > > > > At this point the generated assembly is pretty slick. I bet the full > > memory barrier might help on x86 as well. > > No, I had tested that earlier - single-word memory barrier for some reason > gets _much_ better numbers at least on x86-64. We're talking > > linus 1.46 418.2 > vs > linus 2.004 304.6 > > kind of differences. With the "+m" it outperforms openssl (375-380MB/s). > > The "volatile unsigned int *" cast looks pretty much like the "+m" version > to me, but Arthur got a speedup from whatever gcc code generation > differences on his P4. The volatile pointer forces a write to memory but the cached value in the processor's register remains valid, whereas the "+m" forces gcc to assume the register copy is not valid anymore. That certainly gives the compiler a different clue about register availability, etc. > The really fundamental and basic problem with gcc on this code is that gcc > does not see _any_ difference what-so-ever between the five variables > declared with > > unsigned int A, B, C, D, E; > > and the sixteen variables declared with > > unsigned int array[16]; > > and considers those all to be 21 local variables. It really seems to think > that they are all 100% equivalent, and gcc totally ignores me doing things > like adding "register" to the A-E ones etc. > > And if you are a compiler, and think that the routine has 21 equal > register variables, you're going to do crazy reload sh*t when you have > only 7 (or 15) GP registers. So doing that full memory barrier seems to > just take that random situation, and force some random variable to be > spilled (this is all from looking at the generated code, not from looking > at gcc). > > In contrast, with the _targeted_ thing ("you'd better write back into > array[]") we force gcc to spill the array[16] values, and not the A-E > ones, and that's why it seems to make such a big difference. > > And no, I'm not sure why ARM apparently doesn't show the same behavior. Or > maybe it does, but with an in-order core it doesn't matter as much which > registers you keep reloading - you'll be serialized all the time _anyway_. Well... gcc is really strange in this case (and similar other ones) with ARM compilation. A good indicator of the quality of the code is the size of the stack frame. When using the "+m" then gcc creates a 816 byte stack frame, the generated binary grows by approx 3000 bytes, and performances is almost halved (7.600s). Looking at the assembly result I just can't figure out all the crazy moves taking place. Even the version with no barrier what so ever produces better assembly with a stack frame of 560 bytes. The volatile version is second to worst with a 808 byte stack frame with similar bad performances. With the full "memory" the stack frame shrinks to 280 bytes and best performances so far is obtained. And none of the A B C D E or data variables are ever spilled onto the stack either, only the array[16] gets allocated stack slots, and the TEMP variable. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 20:03 ` Nicolas Pitre @ 2009-08-11 22:53 ` Linus Torvalds 2009-08-11 23:14 ` Linus Torvalds 2009-08-11 23:45 ` Artur Skawina 0 siblings, 2 replies; 16+ messages in thread From: Linus Torvalds @ 2009-08-11 22:53 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > Well... gcc is really strange in this case (and similar other ones) with > ARM compilation. A good indicator of the quality of the code is the > size of the stack frame. When using the "+m" then gcc creates a 816 > byte stack frame, the generated binary grows by approx 3000 bytes, and > performances is almost halved (7.600s). Looking at the assembly result > I just can't figure out all the crazy moves taking place. Even the > version with no barrier what so ever produces better assembly with a > stack frame of 560 bytes. Ok, that's just crazy. That function has a required stack size of exactly 64 bytes, and anything more than that is just spilling. And if you end up with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ spilling. One thing that strikes me is that I've been just testing with gcc-4.4, and BenH (who did some tests on PPC where SHA1 is just _trivial_ because it all fits in the normal register space) noticed that older versions of gcc that he tested did much worse on this. I think Artur also posted (x86) numbers with older gcc versions doing worse. Maybe you're seeing some of that? Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 22:53 ` Linus Torvalds @ 2009-08-11 23:14 ` Linus Torvalds 2009-08-12 2:26 ` Nicolas Pitre 2009-08-11 23:45 ` Artur Skawina 1 sibling, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2009-08-11 23:14 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Linus Torvalds wrote: > > > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > > Well... gcc is really strange in this case (and similar other ones) with > > ARM compilation. A good indicator of the quality of the code is the > > size of the stack frame. When using the "+m" then gcc creates a 816 > > byte stack frame, the generated binary grows by approx 3000 bytes, and > > performances is almost halved (7.600s). Looking at the assembly result > > I just can't figure out all the crazy moves taking place. Even the > > version with no barrier what so ever produces better assembly with a > > stack frame of 560 bytes. > > Ok, that's just crazy. That function has a required stack size of exactly > 64 bytes, and anything more than that is just spilling. And if you end up > with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ > spilling. Btw, what I think happens is: - gcc turns all those array accesses into pseudo's So now the 'array[16]' is seen as just another 16 variables rather than an array. - gcc then turns it into SSA, where each assignment basically creates a new variable. So the 16 array variables (and 5 regular variables) are now expanded to 80 SSA asignments (one array assignment per SHA1 round) plus an additional 2 assignments to the "regular" variables per round (B and E are changed each round). So in SSA form, you actually end up having 240 pseudo's associated with the actual variables. Plus all the temporaries of course. - the thing then goes crazy and tries to generate great code from that internal SSA model. And since there are never more than ~25 things _live_ at any particular point, it works fine with lots of registers, but on small-register machines gcc just goes crazy and has to spill. And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it has created - hundreds of them. - End result: if the spill code doesn't share slots, it's going to create a totally unholy mess of crap. That's what the whole 'volatile unsigned int *' game tried to avoid. But it really sounds like it's not working too well for you. And the _big_ memory barrier ends up helping just because with that in place, you end up being almost entirely unable to schedule _anything_ between the different SHA rounds, so you end up with only six or seven variables "live" in between those barriers, and the stupid register allocator/spill logic doesn't break down too badly. The thing is, if you do full memory barriers, then you're probably better off making both the loads and the stores be "volatile". That should have similar effects. The downside with that is that it really limits the loads. So (like the full memory barrier) it's a big hammer approach. But it probably generates better code for you, because it avoids the mental breakdown of gcc spilling its pseudo's. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 23:14 ` Linus Torvalds @ 2009-08-12 2:26 ` Nicolas Pitre 0 siblings, 0 replies; 16+ messages in thread From: Nicolas Pitre @ 2009-08-12 2:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano On Tue, 11 Aug 2009, Linus Torvalds wrote: > > > On Tue, 11 Aug 2009, Linus Torvalds wrote: > > > > > > > On Tue, 11 Aug 2009, Nicolas Pitre wrote: > > > > > > Well... gcc is really strange in this case (and similar other ones) with > > > ARM compilation. A good indicator of the quality of the code is the > > > size of the stack frame. When using the "+m" then gcc creates a 816 > > > byte stack frame, the generated binary grows by approx 3000 bytes, and > > > performances is almost halved (7.600s). Looking at the assembly result > > > I just can't figure out all the crazy moves taking place. Even the > > > version with no barrier what so ever produces better assembly with a > > > stack frame of 560 bytes. > > > > Ok, that's just crazy. That function has a required stack size of exactly > > 64 bytes, and anything more than that is just spilling. And if you end up > > with a stack frame of 560 bytes, that means that gcc is doing some _crazy_ > > spilling. > > Btw, what I think happens is: > > - gcc turns all those array accesses into pseudo's > > So now the 'array[16]' is seen as just another 16 variables rather than > an array. > > - gcc then turns it into SSA, where each assignment basically creates a > new variable. So the 16 array variables (and 5 regular variables) are > now expanded to 80 SSA asignments (one array assignment per SHA1 round) > plus an additional 2 assignments to the "regular" variables per round > (B and E are changed each round). So in SSA form, you actually end up > having 240 pseudo's associated with the actual variables. Plus all > the temporaries of course. > > - the thing then goes crazy and tries to generate great code from that > internal SSA model. And since there are never more than ~25 things > _live_ at any particular point, it works fine with lots of registers, > but on small-register machines gcc just goes crazy and has to spill. > And it doesn't spill 'array[x]' entries - it spills the _pseudo's_ it > has created - hundreds of them. > > - End result: if the spill code doesn't share slots, it's going to create > a totally unholy mess of crap. > > That's what the whole 'volatile unsigned int *' game tried to avoid. But > it really sounds like it's not working too well for you. And the _big_ > memory barrier ends up helping just because with that in place, you end up > being almost entirely unable to schedule _anything_ between the different > SHA rounds, so you end up with only six or seven variables "live" in > between those barriers, and the stupid register allocator/spill logic > doesn't break down too badly. > > The thing is, if you do full memory barriers, then you're probably better > off making both the loads and the stores be "volatile". That should have > similar effects. If the loads are volatile then gcc has less flexibility when scheduling them. > The downside with that is that it really limits the loads. So (like the > full memory barrier) it's a big hammer approach. But it probably generates > better code for you, because it avoids the mental breakdown of gcc > spilling its pseudo's. Actually, all my previous tests were done with gcc-4.3.2. I now have installed Fedora 11 which has gcc-4.4.0. And now the stack frame is a nice 64 bytes. ;-) That's with the "memory" though. With the volatile, stack frame goes up to 224 bytes and performance, although not as bad as before, is like 5.160s instead of 4.410s. The "+m" version is not much better: 208 byte stack frame and similar performance. The version with no barrier what so ever runs in 4.580s and uses a 88 byte stack frame. The generated assembly contains stupid things, but this is still the second best version, even better than the "+m" and volatile ptr ones. Conclusion: the full "memory" barrier remains the best choice on ARM. Nicolas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: block-sha1: improve code on large-register-set machines 2009-08-11 22:53 ` Linus Torvalds 2009-08-11 23:14 ` Linus Torvalds @ 2009-08-11 23:45 ` Artur Skawina 1 sibling, 0 replies; 16+ messages in thread From: Artur Skawina @ 2009-08-11 23:45 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List, Junio C Hamano Linus Torvalds wrote: > > One thing that strikes me is that I've been just testing with gcc-4.4, and > BenH (who did some tests on PPC where SHA1 is just _trivial_ because it > all fits in the normal register space) noticed that older versions of gcc > that he tested did much worse on this. > > I think Artur also posted (x86) numbers with older gcc versions doing > worse. Maybe you're seeing some of that? FWIW, this is how it looks here. On 32-bit x86 gcc4.4 makes a large difference[1], but your code does fairly well w/ most gccs, relatively to all the other C implementations. artur P4: [linusv is the recent one w/ the volatile stores] ### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease) rfc3174 0.9438 64.67 linus 0.9081 67.21 linusv 0.4155 146.9 linusp4 0.8761 69.66 linusas 0.9619 63.45 linusas2 1.025 59.52 mozilla 1.314 46.46 mozillaas 1.132 53.92 ### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease) rfc3174 0.8582 71.12 linus 0.7943 76.84 linusv 0.5667 107.7 linusp4 0.7224 84.48 linusas 0.7127 85.64 linusas2 0.5109 119.5 mozilla 1.251 48.79 mozillaas 1.239 49.27 ### sha1bench-gcc32: GCCVER 3.2.3 rfc3174 0.9062 67.35 linus 0.5555 109.9 linusv 0.3647 167.4 linusp4 0.5337 114.4 linusas 0.7126 85.66 linusas2 0.5089 119.9 mozilla 1.138 53.64 mozillaas 1.075 56.78 ### sha1bench-gcc33: GCCVER 3.3.6 rfc3174 0.9029 67.6 linus 0.6059 100.7 linusv 0.3734 163.4 linusp4 0.6695 91.16 linusas 0.7832 77.93 linusas2 0.571 106.9 mozilla 1.083 56.36 mozillaas 1.078 56.62 ### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease) rfc3174 0.9277 65.79 linus 0.6583 92.71 linusv 0.6096 100.1 linusp4 0.7326 83.31 linusas 0.7383 82.67 linusas2 0.6264 97.44 mozilla 1.398 43.67 mozillaas 1.392 43.84 ### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease) rfc3174 0.9889 61.72 linus 0.7508 81.29 linusv 0.7752 78.73 linusp4 0.6548 93.21 linusas 0.4904 124.5 linusas2 0.6378 95.7 mozilla 1.528 39.93 mozillaas 1.596 38.24 ### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease) rfc3174 0.9798 62.29 linus 0.6993 87.28 linusv 0.767 79.57 linusp4 0.6785 89.95 linusas 0.6555 93.11 linusas2 0.691 88.32 mozilla 1.594 38.3 mozillaas 1.566 38.98 ### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease) rfc3174 1.138 53.63 linus 0.7772 78.53 linusv 0.6138 99.43 linusp4 0.7018 86.97 linusas 0.8164 74.76 linusas2 0.7038 86.73 mozilla 1.697 35.97 mozillaas 1.491 40.94 ### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease) rfc3174 1.148 53.15 linus 0.7085 86.14 linusv 0.5474 111.5 linusp4 0.5399 113 linusas 0.7522 81.14 linusas2 0.5341 114.3 mozilla 1.723 35.43 mozillaas 1.502 40.64 ### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease) rfc3174 1.451 42.06 linus 0.5871 104 linusv 0.3713 164.4 linusp4 0.4367 139.8 linusas 0.4083 149.5 linusas2 0.4372 139.6 mozilla 1.104 55.27 mozillaas 1.314 46.44 And on Atom: ### sha1bench-gcc295: GCCVER 2.95.4 20030502 (prerelease) rfc3174 1.905 32.04 linus 1.089 56.06 linusv 0.8134 75.04 linusp4 1.086 56.19 linusas 1.009 60.52 linusas2 1.255 48.63 mozilla 2.663 22.92 ### sha1bench-gcc31: GCCVER 3.2 2002-07-26 (prerelease) rfc3174 2.141 28.51 linus 1.022 59.75 linusv 0.8323 73.34 linusp4 1.061 57.54 linusas 0.9889 61.72 linusas2 0.9204 66.32 mozilla 2.573 23.72 ### sha1bench-gcc32: GCCVER 3.2.3 rfc3174 2.155 28.32 linus 0.9031 67.58 linusv 0.7849 77.76 linusp4 0.847 72.06 linusas 0.9888 61.73 linusas2 0.912 66.93 mozilla 2.485 24.56 ### sha1bench-gcc33: GCCVER 3.3.6 rfc3174 2.178 28.02 linus 0.9489 64.32 linusv 0.8642 70.63 linusp4 0.8784 69.48 linusas 1.017 60.03 linusas2 0.906 67.37 mozilla 2.541 24.02 ### sha1bench-gcc34: GCCVER 3.4.6 20060121 (prerelease) rfc3174 2.157 28.3 linus 0.9481 64.37 linusv 0.8383 72.8 linusp4 0.965 63.25 linusas 0.9852 61.95 linusas2 0.9809 62.22 mozilla 3.143 19.42 ### sha1bench-gcc40: GCCVER 4.0.4 20061113 (prerelease) rfc3174 2.088 29.24 linus 0.9706 62.89 linusv 0.928 65.77 linusp4 1.003 60.85 linusas 0.9478 64.4 linusas2 0.9475 64.42 mozilla 2.742 22.26 ### sha1bench-gcc41: GCCVER 4.1.3 20080704 (prerelease) rfc3174 2.047 29.81 linus 0.9778 62.42 linusv 1.051 58.06 linusp4 1.062 57.46 linusas 1.052 58.01 linusas2 1.069 57.12 mozilla 2.6 23.47 ### sha1bench-gcc42: GCCVER 4.2.5 20090330 (prerelease) rfc3174 2.025 30.14 linus 0.9622 63.43 linusv 0.7984 76.44 linusp4 0.8967 68.07 linusas 1.018 59.94 linusas2 0.9048 67.46 mozilla 2.748 22.21 ### sha1bench-gcc43: GCCVER 4.3.5 20090810 (prerelease) rfc3174 2.043 29.88 linus 0.9436 64.69 linusv 0.8532 71.54 linusp4 0.8531 71.54 linusas 1.04 58.71 linusas2 0.8495 71.85 mozilla 2.678 22.79 ### sha1bench-gcc44: GCCVER 4.4.2 20090809 (prerelease) rfc3174 2.119 28.8 linus 0.9132 66.84 linusv 0.8632 70.71 linusp4 0.9842 62.02 linusas 1.027 59.45 linusas2 0.9844 62 mozilla 2.214 27.57 ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-08-12 2:27 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-08-10 23:52 block-sha1: improve code on large-register-set machines Linus Torvalds 2009-08-11 6:15 ` Nicolas Pitre 2009-08-11 15:04 ` Linus Torvalds 2009-08-11 18:00 ` Nicolas Pitre 2009-08-11 19:31 ` Nicolas Pitre 2009-08-11 21:20 ` Brandon Casey 2009-08-11 21:36 ` Nicolas Pitre 2009-08-11 21:49 ` Brandon Casey 2009-08-11 22:57 ` Linus Torvalds 2009-08-11 23:13 ` Brandon Casey 2009-08-11 15:43 ` Linus Torvalds 2009-08-11 20:03 ` Nicolas Pitre 2009-08-11 22:53 ` Linus Torvalds 2009-08-11 23:14 ` Linus Torvalds 2009-08-12 2:26 ` Nicolas Pitre 2009-08-11 23:45 ` Artur Skawina
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox