public inbox for linuxppc-dev@ozlabs.org 
From: Nicholas Piggin <npiggin@gmail•com>
To: Nathan Chancellor <natechancellor@gmail•com>
Cc: clang-built-linux@googlegroups•com, linuxppc-dev@lists•ozlabs.org
Subject: Re: Boot flakiness with QEMU 3.1.0 and Clang built kernels
Date: Sat, 11 Apr 2020 19:32:08 +1000
Message-ID: <1586597161.xyshvdbjo6.astroid@bobo.none>
In-Reply-To: <20200411005354.GA24145@ubuntu-s3-xlarge-x86>

Nathan Chancellor wrote on April 11, 2020 10:53 am:
> Hi Nicholas,
> 
> On Sat, Apr 11, 2020 at 10:29:45AM +1000, Nicholas Piggin wrote:
>> Nathan Chancellor wrote on April 11, 2020 6:59 am:
>> > Hi all,
>> > 
>> > Recently, our CI started running into several hangs when running the
>> > spinlock torture tests during a boot with QEMU 3.1.0 on
>> > powernv_defconfig and pseries_defconfig when compiled with Clang.
>> > 
>> > I initially bisected Linux and came down to commit 3282a3da25bd
>> > ("powerpc/64: Implement soft interrupt replay in C") [1], which seems to
>> > make sense. However, I realized I could not reproduce this in my local
>> > environment no matter how hard I tried, only in our Docker image. I then
>> > realized my environment's QEMU version was 4.2.0; I compiled 3.1.0 and
>> > was able to reproduce it then.
>> > 
>> > I bisected QEMU down to two commits: powernv_defconfig was fixed by [2]
>> > and pseries_defconfig was fixed by [3].
>> 
>> Looks like it might have previously been testing power8, now power9?
>> -cpu power8 might get it reproducing again.
> 
> Yes, that is what it looks like. I can reproduce the hang with both
> pseries-3.1 and powernv8 on QEMU 4.2.0.
> 
>> > I ran 100 boots with our boot-qemu.sh script [4] and QEMU 3.1.0 failed
>> > approximately 80% of the time but 4.2.0 and 5.0.0-rc1 only failed 1% of
>> > the time [5]. GCC 9.3.0 built kernels failed approximately 3% of the
>> > time [6].
>> 
>> Do they fail in the same way? Was the fail rate at 0% before upgrading
>> kernels?
> 
> Yes, it just hangs after I see the print out that the torture tests are
> running.
> 
> [    2.277125] spin_lock-torture: Creating torture_shuffle task
> [    2.279058] spin_lock-torture: Creating torture_stutter task
> [    2.280285] spin_lock-torture: torture_shuffle task started
> [    2.281326] spin_lock-torture: Creating lock_torture_writer task
> [    2.282509] spin_lock-torture: torture_stutter task started
> [    2.283511] spin_lock-torture: Creating lock_torture_writer task
> [    2.285155] spin_lock-torture: lock_torture_writer task started
> [    2.286586] spin_lock-torture: Creating lock_torture_stats task
> [    2.287772] spin_lock-torture: lock_torture_writer task started
> [    2.290578] spin_lock-torture: lock_torture_stats task started
> 
> Yes, we never had any failures in our CI before that upgrade happened. I
> will try to run a set of boot tests with a kernel built at the commit
> right before 3282a3da25bd and at 3282a3da25bd to make triple sure I did
> fall on the right commit.
> 
>> > Without access to real hardware, I cannot really say if there is a
>> > problem here. We are going to upgrade to QEMU 4.2.0 to fix it. This is
>> > more of an FYI so that there is some record of it outside of our issue
>> > tracker and so people can be aware of it in case it comes up somewhere
>> > else.
>> 
>> Thanks for this, I'll try to reproduce. You're not running an SMP guest?
> 
> No, not as far as I am aware at least. You can see our QEMU line in our
> CI and the boot-qemu.sh script I have listed below:
> 
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/318260635
> 
>> Is anything in particular needed to run the lock torture test? Is this
>> just powernv_defconfig + CONFIG_LOCK_TORTURE_TEST=y?
> 
> We do enable some other configs, you can see those here:
> 
> https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/common.config
> https://github.com/ClangBuiltLinux/continuous-integration/blob/c02d2f008a64d44e62518bc03beb1126db7619ce/configs/tt.config
> 
> The tt.config values are needed to reproduce but I did not verify that
> ONLY tt.config was needed. Other than that, no, we are just building
> either pseries_defconfig or powernv_defconfig with those configs and
> letting it boot up with a simple initramfs, which prints the version
> string then shuts the machine down.
> 
> Let me know if you need any more information, cheers!

Okay, I can reproduce it. Sometimes it eventually recovers after a long
pause, and some keyboard input often helps it along. So it looks like
it might be a lost interrupt.

POWER8 vs POWER9 might just be a timing thing if P9 is still hanging
sometimes. I wasn't able to reproduce it with defconfig+tt.config, I
needed your other config with various other debug options.
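
A failure rate like the ones quoted above could be measured with a small
harness along these lines. This is only a sketch, not the boot-qemu.sh
script from the report: the function names, the QEMU_CMD contents (machine
type, memory size, kernel and initramfs paths) and the timeout are all
assumptions for illustration. Only "-cpu power8" is taken from this thread.

```python
import subprocess
import sys

# Hypothetical command line: the paths and the machine type are placeholders,
# not the CI's actual invocation. "-cpu power8" is the suggestion from the thread.
QEMU_CMD = [
    "qemu-system-ppc64",
    "-machine", "pseries",
    "-cpu", "power8",
    "-m", "2G",
    "-nographic",
    "-kernel", "vmlinux",        # assumed kernel image path
    "-initrd", "rootfs.cpio",    # assumed initramfs path
]


def boot_once(cmd, timeout_s):
    """Run one boot attempt and classify it as 'ok', 'fail', or 'hang'."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; a boot that would only recover
        # after a very long pause is counted as a hang here as well.
        return "hang"
    return "ok" if proc.returncode == 0 else "fail"


def hang_rate(cmd, runs, timeout_s):
    """Boot `runs` times; return (fraction of hangs, list of per-run results)."""
    results = [boot_once(cmd, timeout_s) for _ in range(runs)]
    return results.count("hang") / runs, results
```

Calling hang_rate(QEMU_CMD, runs=100, timeout_s=120) would then give a
number comparable to the percentages above, provided the initramfs shuts
the machine down on a successful boot so that QEMU exits on its own.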

Thanks for the very good report. I'll let you know what I find.

Thanks,
Nick

