* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
[not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>
@ 2022-05-26 12:14 ` Michael Ellerman
2022-05-26 12:42 ` Mark Rutland
2022-05-26 16:52 ` Linus Torvalds
0 siblings, 2 replies; 4+ messages in thread
From: Michael Ellerman @ 2022-05-26 12:14 UTC (permalink / raw)
To: Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon,
Russell King, Thomas Bogendoerfer, Heiko Carstens
Cc: Waiman.Long, Peter Zijlstra, the arch/x86 maintainers,
Linux Kernel Mailing List, Thomas Gleixner, Paul McKenney,
linuxppc-dev
Linus Torvalds <torvalds@linux-foundation•org> writes:
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
>>
>> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
>> x86 CMPXCHG instruction returns success in ZF flag, so this
>> change saves a compare after cmpxchg (and related move instruction
>> in front of cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.
This patch seems to generate slightly *better* code on powerpc.
I see one register-to-register move that gets shifted slightly later, so
that it's skipped on the path that returns directly via the SUCCESS
case.
So LGTM.
> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.
Do you know of a benchmark that shows it up? I tried a few things but
couldn't get lockref_get() to count for more than 1-2%.
cheers
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
@ 2022-05-26 12:42 ` Mark Rutland
2022-05-27 9:36 ` Heiko Carstens
2022-05-26 16:52 ` Linus Torvalds
1 sibling, 1 reply; 4+ messages in thread
From: Mark Rutland @ 2022-05-26 12:42 UTC (permalink / raw)
To: Michael Ellerman, Linus Torvalds
Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas,
Heiko Carstens, the arch/x86 maintainers, Uros Bizjak,
Russell King, Linux Kernel Mailing List, linuxppc-dev,
Thomas Gleixner, Paul McKenney, Will Deacon
On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation•org> writes:
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
> >>
> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> >> x86 CMPXCHG instruction returns success in ZF flag, so this
> >> change saves a compare after cmpxchg (and related move instruction
> >> in front of cmpxchg). The main loop of lockref_get improves from:
> >
> > Ack on this one regardless of the 32-bit x86 question.
> >
> > HOWEVER.
> >
> > I'd like other architectures to pipe up too, because I think right now
> > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > of operations natively, and I think the generic fallback for when it
> > is missing might be kind of nasty.
> >
> > Maybe it ends up generating ok code, but it's also possible that it
> > just didn't matter when it was only used in one place in the
> > scheduler.
>
> This patch seems to generate slightly *better* code on powerpc.
>
> I see one register-to-register move that gets shifted slightly later, so
> that it's skipped on the path that returns directly via the SUCCESS
> case.
FWIW, I see the same on arm64; a register-to-register move gets moved out of
the success path. That changes the register allocation, and resulting in one
fewer move, but otherwise the code generation is the same.
Thanks,
Mark.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
2022-05-26 12:42 ` Mark Rutland
@ 2022-05-26 16:52 ` Linus Torvalds
1 sibling, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2022-05-26 16:52 UTC (permalink / raw)
To: Michael Ellerman
Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas,
Heiko Carstens, the arch/x86 maintainers, Uros Bizjak,
Russell King, Linux Kernel Mailing List, linuxppc-dev,
Thomas Gleixner, Paul McKenney, Will Deacon
On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <mpe@ellerman•id.au> wrote:
>
> Do you know of a benchmark that shows it up? I tried a few things but
> couldn't get lockref_get() to count for more than 1-2%.
Heh. 1% for a small instruction sequence that is only handful of
instructions and is used in just a couple of places counts as "very
hot" for me.
I assume you did the natural thing: threaded pathname lookup (with
paths being of the longer variety to not be dominated by system call
etc costs).
Linus
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
2022-05-26 12:42 ` Mark Rutland
@ 2022-05-27 9:36 ` Heiko Carstens
0 siblings, 0 replies; 4+ messages in thread
From: Heiko Carstens @ 2022-05-27 9:36 UTC (permalink / raw)
To: Mark Rutland
Cc: Waiman.Long, Thomas Bogendoerfer, Will Deacon, Peter Zijlstra,
the arch/x86 maintainers, Uros Bizjak, Russell King,
Linux Kernel Mailing List, linuxppc-dev, Catalin Marinas,
Thomas Gleixner, Paul McKenney, Linus Torvalds
On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote:
> On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> > Linus Torvalds <torvalds@linux-foundation•org> writes:
> > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
> > >>
> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> > >> x86 CMPXCHG instruction returns success in ZF flag, so this
> > >> change saves a compare after cmpxchg (and related move instruction
> > >> in front of cmpxchg). The main loop of lockref_get improves from:
> > >
> > > Ack on this one regardless of the 32-bit x86 question.
> > >
> > > HOWEVER.
> > >
> > > I'd like other architectures to pipe up too, because I think right now
> > > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > > of operations natively, and I think the generic fallback for when it
> > > is missing might be kind of nasty.
> > >
> > > Maybe it ends up generating ok code, but it's also possible that it
> > > just didn't matter when it was only used in one place in the
> > > scheduler.
> >
> > This patch seems to generate slightly *better* code on powerpc.
> >
> > I see one register-to-register move that gets shifted slightly later, so
> > that it's skipped on the path that returns directly via the SUCCESS
> > case.
>
> FWIW, I see the same on arm64; a register-to-register move gets moved out of
> the success path. That changes the register allocation, and resulting in one
> fewer move, but otherwise the code generation is the same.
Just for the records: s390 code generation changes the same like on
powerpc; so looks good.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2022-05-27 9:37 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>
2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
2022-05-26 12:42 ` Mark Rutland
2022-05-27 9:36 ` Heiko Carstens
2022-05-26 16:52 ` Linus Torvalds
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox