From: Michael Ellerman <mpe@ellerman•id.au>
To: paulmck@kernel•org
Cc: rcu <rcu@vger•kernel.org>, Zhouyi Zhou <zhouzhouyi@gmail•com>,
linuxppc-dev <linuxppc-dev@lists•ozlabs.org>,
Nicholas Piggin <npiggin@gmail•com>,
Miguel Ojeda <miguel.ojeda.sandonis@gmail•com>
Subject: Re: rcu_sched self-detected stall on CPU
Date: Tue, 12 Apr 2022 16:53:06 +1000
Message-ID: <87mtgq6be5.fsf@mpe.ellerman.id.au>
In-Reply-To: <20220411030553.GW4285@paulmck-ThinkPad-P17-Gen-1>

"Paul E. McKenney" <paulmck@kernel•org> writes:
> On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
>> Zhouyi Zhou <zhouzhouyi@gmail•com> writes:
>> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paulmck@kernel•org> wrote:
>> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
>> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <mpe@ellerman•id.au> wrote:
>> ...
>> >> > > I haven't seen it in my testing. But using Miguel's config I can
>> >> > > reproduce it seemingly on every boot.
>> >> > >
>> >> > > For me it bisects to:
>> >> > >
>> >> > > 35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>> >> > >
>> >> > > Which seems plausible.
>> >> > I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
>> >> > clockevent processing")
>> ...
>> >>
>> >> > > Reverting that on mainline makes the bug go away.
>>
>> >> > I also revert that on the mainline, and am currently doing a pressure
>> >> > test (by repeatedly invoking qemu and checking the console.log) on PPC
>> >> > VM in Oregon State University.
>>
>> > After 306 rounds of stress test on mainline without triggering the bug
>> > (last for 4 hours and 27 minutes), I think the bug is indeed caused by
>> > 35de589cb879 ("powerpc/time: improve decrementer clockevent
>> > processing") and stop the test for now.
>>
>> Thanks for testing, that's pretty conclusive.
>>
>> I'm not inclined to actually revert it yet.
>>
>> We need to understand if there's actually a bug in the patch, or if it's
>> just exposing some existing bug/bad behavior we have. The fact that it
>> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
>>
>> Do we have some code that inadvertently relies on something enabled by
>> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y ?
>
> For whatever it is worth, moderate rcutorture runs to completion without
> errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.

Thanks for testing that, I don't have any big x86 machines to test on :)

> Also for whatever it is worth, I don't know of anything other than
> microcontrollers or the larger IoT devices that would want their kernels
> built with CONFIG_HIGH_RES_TIMERS=n. Which might be a failure of
> imagination on my part, but so it goes.

Yeah I agree, like I said before I wasn't even aware you could turn it
off. So I think we'll definitely add a select HIGH_RES_TIMERS in future,
but first I need to work out why we are seeing stalls with it disabled.

cheers
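
Since the stall discussed above only shows up with CONFIG_HIGH_RES_TIMERS=n, it can help to confirm how a given kernel config sets that option before trying to reproduce. The following is a minimal sketch, not something taken from the thread: the function name kconfig_symbol_state is hypothetical, and it only assumes the usual .config line forms ("CONFIG_FOO=y" and "# CONFIG_FOO is not set").

```python
import re


def kconfig_symbol_state(config_text: str, symbol: str = "HIGH_RES_TIMERS") -> str:
    """Report how a symbol is set in kernel .config text.

    Returns the assigned value (e.g. "y" or "m"), "n" if the symbol is
    explicitly marked as not set, or "absent" if it is not mentioned.
    Hypothetical helper for illustration only.
    """
    # Matches exactly "CONFIG_<symbol>=<value>", not longer symbol names.
    set_re = re.compile(rf"^CONFIG_{re.escape(symbol)}=(.*)$")
    unset_line = f"# CONFIG_{symbol} is not set"

    for raw in config_text.splitlines():
        line = raw.strip()
        if line == unset_line:
            return "n"
        match = set_re.match(line)
        if match:
            return match.group(1)
    return "absent"


if __name__ == "__main__":
    sample = "CONFIG_NO_HZ_COMMON=y\n# CONFIG_HIGH_RES_TIMERS is not set\n"
    print(kconfig_symbol_state(sample))
```

To check a real build, read the .config file from the build directory and pass its contents as config_text; a result of "n" or "absent" corresponds to the configuration under which the stalls were reported.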
Thread overview: 30+ messages
2022-04-05 21:41 rcu_sched self-detected stall on CPU Miguel Ojeda
2022-04-06 9:31 ` Zhouyi Zhou
2022-04-06 17:00 ` Paul E. McKenney
2022-04-06 18:25 ` Zhouyi Zhou
2022-04-06 19:50 ` Paul E. McKenney
2022-04-07 2:26 ` Zhouyi Zhou
2022-04-07 10:07 ` Miguel Ojeda
2022-04-07 15:15 ` Paul E. McKenney
2022-04-07 17:05 ` Miguel Ojeda
2022-04-07 17:55 ` Paul E. McKenney
2022-04-07 23:14 ` Zhouyi Zhou
2022-04-08 1:43 ` Paul E. McKenney
2022-04-08 7:23 ` Michael Ellerman
2022-04-08 10:02 ` Zhouyi Zhou
2022-04-08 14:07 ` Paul E. McKenney
2022-04-08 14:25 ` Zhouyi Zhou
2022-04-10 11:33 ` Michael Ellerman
2022-04-11 3:05 ` Paul E. McKenney
2022-04-12 6:53 ` Michael Ellerman [this message]
2022-04-12 13:36 ` Paul E. McKenney
2022-04-08 13:52 ` Miguel Ojeda
2022-04-08 14:06 ` Paul E. McKenney
2022-04-08 14:42 ` Michael Ellerman
2022-04-08 15:52 ` Paul E. McKenney
2022-04-08 17:02 ` Miguel Ojeda
2022-04-13 5:11 ` Nicholas Piggin
2022-04-13 6:10 ` Low-res tick handler device not going to ONESHOT_STOPPED when tick is stopped (was: rcu_sched self-detected stall on CPU) Nicholas Piggin
2022-04-14 17:15 ` Paul E. McKenney
2022-04-22 15:53 ` Thomas Gleixner
2022-04-23 2:29 ` Re: Nicholas Piggin