Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG

public inbox for linuxppc-dev@ozlabs.org 
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
       [not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>
@ 2022-05-26 12:14 ` Michael Ellerman
  2022-05-26 12:42   ` Mark Rutland
  2022-05-26 16:52   ` Linus Torvalds
  0 siblings, 2 replies; 4+ messages in thread
From: Michael Ellerman @ 2022-05-26 12:14 UTC (permalink / raw)
  To: Linus Torvalds, Uros Bizjak, Catalin Marinas, Will Deacon,
	Russell King, Thomas Bogendoerfer, Heiko Carstens
  Cc: Waiman.Long, Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, Thomas Gleixner, Paul McKenney,
	linuxppc-dev

Linus Torvalds <torvalds@linux-foundation•org> writes:
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
>>
>> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
>> x86 CMPXCHG instruction returns success in ZF flag, so this
>> change saves a compare after cmpxchg (and related move instruction
>> in front of cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.

This patch seems to generate slightly *better* code on powerpc.

I see one register-to-register move that gets shifted slightly later, so
that it's skipped on the path that returns directly via the SUCCESS
case.

So LGTM.

> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.

Do you know of a benchmark that shows it up? I tried a few things but
couldn't get lockref_get() to count for more than 1-2%.

cheers

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
  2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
@ 2022-05-26 12:42   ` Mark Rutland
  2022-05-27  9:36     ` Heiko Carstens
  2022-05-26 16:52   ` Linus Torvalds
  1 sibling, 1 reply; 4+ messages in thread
From: Mark Rutland @ 2022-05-26 12:42 UTC (permalink / raw)
  To: Michael Ellerman, Linus Torvalds
  Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas,
	Heiko Carstens, the arch/x86 maintainers, Uros Bizjak,
	Russell King, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, Paul McKenney, Will Deacon

On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation•org> writes:
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
> >>
> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> >> x86 CMPXCHG instruction returns success in ZF flag, so this
> >> change saves a compare after cmpxchg (and related move instruction
> >> in front of cmpxchg). The main loop of lockref_get improves from:
> >
> > Ack on this one regardless of the 32-bit x86 question.
> >
> > HOWEVER.
> >
> > I'd like other architectures to pipe up too, because I think right now
> > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > of operations natively, and I think the generic fallback for when it
> > is missing might be kind of nasty.
> >
> > Maybe it ends up generating ok code, but it's also possible that it
> > just didn't matter when it was only used in one place in the
> > scheduler.
> 
> This patch seems to generate slightly *better* code on powerpc.
> 
> I see one register-to-register move that gets shifted slightly later, so
> that it's skipped on the path that returns directly via the SUCCESS
> case.

FWIW, I see the same on arm64; a register-to-register move gets moved out of
the success path. That changes the register allocation, and resulting in one
fewer move, but otherwise the code generation is the same.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
  2022-05-26 12:42   ` Mark Rutland
@ 2022-05-27  9:36     ` Heiko Carstens
  0 siblings, 0 replies; 4+ messages in thread
From: Heiko Carstens @ 2022-05-27  9:36 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Waiman.Long, Thomas Bogendoerfer, Will Deacon, Peter Zijlstra,
	the arch/x86 maintainers, Uros Bizjak, Russell King,
	Linux Kernel Mailing List, linuxppc-dev, Catalin Marinas,
	Thomas Gleixner, Paul McKenney, Linus Torvalds

On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote:
> On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> > Linus Torvalds <torvalds@linux-foundation•org> writes:
> > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail•com> wrote:
> > >>
> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> > >> x86 CMPXCHG instruction returns success in ZF flag, so this
> > >> change saves a compare after cmpxchg (and related move instruction
> > >> in front of cmpxchg). The main loop of lockref_get improves from:
> > >
> > > Ack on this one regardless of the 32-bit x86 question.
> > >
> > > HOWEVER.
> > >
> > > I'd like other architectures to pipe up too, because I think right now
> > > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > > of operations natively, and I think the generic fallback for when it
> > > is missing might be kind of nasty.
> > >
> > > Maybe it ends up generating ok code, but it's also possible that it
> > > just didn't matter when it was only used in one place in the
> > > scheduler.
> > 
> > This patch seems to generate slightly *better* code on powerpc.
> > 
> > I see one register-to-register move that gets shifted slightly later, so
> > that it's skipped on the path that returns directly via the SUCCESS
> > case.
> 
> FWIW, I see the same on arm64; a register-to-register move gets moved out of
> the success path. That changes the register allocation, and resulting in one
> fewer move, but otherwise the code generation is the same.

Just for the records: s390 code generation changes the same like on
powerpc; so looks good.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
  2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
  2022-05-26 12:42   ` Mark Rutland
@ 2022-05-26 16:52   ` Linus Torvalds
  1 sibling, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2022-05-26 16:52 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Waiman.Long, Thomas Bogendoerfer, Peter Zijlstra, Catalin Marinas,
	Heiko Carstens, the arch/x86 maintainers, Uros Bizjak,
	Russell King, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, Paul McKenney, Will Deacon

On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <mpe@ellerman•id.au> wrote:
>
> Do you know of a benchmark that shows it up? I tried a few things but
> couldn't get lockref_get() to count for more than 1-2%.

Heh. 1% for a small instruction sequence that is only handful of
instructions and is used in just a couple of places counts as "very
hot" for me.

I assume you did the natural thing: threaded pathname lookup (with
paths being of the longer variety to not be dominated by system call
etc costs).

               Linus

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-05-27  9:37 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>
2022-05-26 12:14 ` [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro Michael Ellerman
2022-05-26 12:42   ` Mark Rutland
2022-05-27  9:36     ` Heiko Carstens
2022-05-26 16:52   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox