* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro [not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com> @ 2022-05-26 12:14 ` Michael Ellerman 2022-05-26 12:42 ` Mark Rutland 2022-05-26 16:52 ` Linus Torvalds 0 siblings, 2 replies; 4+ messages in thread From: Michael Ellerman @ 2022-05-26 12:14 UTC (permalink / raw) To: Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon, Russell King, Thomas Bogendoerfer, Heiko Carstens Cc: Waiman.Long, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Thomas Gleixner, Paul McKenney, linuxppc-dev Linus Torvalds <torvalds@linux-foundation•org> writes: > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote: >> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. >> x86 CMPXCHG instruction returns success in ZF flag, so this >> change saves a compare after cmpxchg (and related move instruction >> in front of cmpxchg). The main loop of lockref_get improves from: > > Ack on this one regardless of the 32-bit x86 question. > > HOWEVER. > > I'd like other architectures to pipe up too, because I think right now > x86 is the only one that implements that "arch_try_cmpxchg()" family > of operations natively, and I think the generic fallback for when it > is missing might be kind of nasty. > > Maybe it ends up generating ok code, but it's also possible that it > just didn't matter when it was only used in one place in the > scheduler. This patch seems to generate slightly *better* code on powerpc. I see one register-to-register move that gets shifted slightly later, so that it's skipped on the path that returns directly via the SUCCESS case. So LGTM. > The lockref_get() case can be quite hot under some loads, it would be > sad if this made other architectures worse. Do you know of a benchmark that shows it up? I tried a few things but couldn't get lockref_get() to count for more than 1-2%. cheers ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman @ 2022-05-26 12:42 ` Mark Rutland 2022-05-27 9:36 ` Heiko Carstens 2022-05-26 16:52 ` Linus Torvalds 1 sibling, 1 reply; 4+ messages in thread From: Mark Rutland @ 2022-05-26 12:42 UTC (permalink / raw) To: Michael Ellerman, Linus Torvalds Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas, Heiko Carstens, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, Paul McKenney, Will Deacon On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > Linus Torvalds <torvalds@linux-foundation•org> writes: > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote: > >> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > >> x86 CMPXCHG instruction returns success in ZF flag, so this > >> change saves a compare after cmpxchg (and related move instruction > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > Ack on this one regardless of the 32-bit x86 question. > > > > HOWEVER. > > > > I'd like other architectures to pipe up too, because I think right now > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > of operations natively, and I think the generic fallback for when it > > is missing might be kind of nasty. > > > > Maybe it ends up generating ok code, but it's also possible that it > > just didn't matter when it was only used in one place in the > > scheduler. > > This patch seems to generate slightly *better* code on powerpc. > > I see one register-to-register move that gets shifted slightly later, so > that it's skipped on the path that returns directly via the SUCCESS > case. FWIW, I see the same on arm64; a register-to-register move gets moved out of the success path. That changes the register allocation, and resulting in one fewer move, but otherwise the code generation is the same. Thanks, Mark. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:42 ` Mark Rutland @ 2022-05-27 9:36 ` Heiko Carstens 0 siblings, 0 replies; 4+ messages in thread From: Heiko Carstens @ 2022-05-27 9:36 UTC (permalink / raw) To: Mark Rutland Cc: Waiman.Long, Thomas Bogendoerfer, Will Deacon, Peter Zijlstra, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Catalin Marinas, Thomas Gleixner, Paul McKenney, Linus Torvalds On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote: > On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > > Linus Torvalds <torvalds@linux-foundation•org> writes: > > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote: > > >> > > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > > >> x86 CMPXCHG instruction returns success in ZF flag, so this > > >> change saves a compare after cmpxchg (and related move instruction > > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > > > Ack on this one regardless of the 32-bit x86 question. > > > > > > HOWEVER. > > > > > > I'd like other architectures to pipe up too, because I think right now > > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > > of operations natively, and I think the generic fallback for when it > > > is missing might be kind of nasty. > > > > > > Maybe it ends up generating ok code, but it's also possible that it > > > just didn't matter when it was only used in one place in the > > > scheduler. > > > > This patch seems to generate slightly *better* code on powerpc. > > > > I see one register-to-register move that gets shifted slightly later, so > > that it's skipped on the path that returns directly via the SUCCESS > > case. > > FWIW, I see the same on arm64; a register-to-register move gets moved out of > the success path. That changes the register allocation, and resulting in one > fewer move, but otherwise the code generation is the same. Just for the records: s390 code generation changes the same like on powerpc; so looks good. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro 2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman 2022-05-26 12:42 ` Mark Rutland @ 2022-05-26 16:52 ` Linus Torvalds 1 sibling, 0 replies; 4+ messages in thread From: Linus Torvalds @ 2022-05-26 16:52 UTC (permalink / raw) To: Michael Ellerman Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas, Heiko Carstens, the arch/x86 maintainers, Uros Bizjak, Russell King, Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, Paul McKenney, Will Deacon On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <mpe@ellerman•id.au> wrote: > > Do you know of a benchmark that shows it up? I tried a few things but > couldn't get lockref_get() to count for more than 1-2%. Heh. 1% for a small instruction sequence that is only handful of instructions and is used in just a couple of places counts as "very hot" for me. I assume you did the natural thing: threaded pathname lookup (with paths being of the longer variety to not be dominated by system call etc costs). Linus ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2022-05-27 9:37 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>
2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
2022-05-26 12:42 ` Mark Rutland
2022-05-27 9:36 ` Heiko Carstens
2022-05-26 16:52 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox