From: Joakim Tjernlund <Joakim.Tjernlund@infinera•com>
To: "linuxppc-dev@lists•ozlabs.org" <linuxppc-dev@lists•ozlabs.org>,
"leoyang.li@nxp•com" <leoyang.li@nxp•com>,
"york.sun@nxp•com" <york.sun@nxp•com>
Subject: Re: Machine Check in P2010(e500v2)
Date: Wed, 6 Sep 2017 20:17:01 +0000
Message-ID: <1504729020.27247.120.camel@infinera.com>
In-Reply-To: <AM4PR0401MB16992DFADBD59ADEF8E3ACC08F970@AM4PR0401MB1699.eurprd04.prod.outlook.com>
On Wed, 2017-09-06 at 19:31 +0000, Leo Li wrote:
> > -----Original Message-----
> > From: York Sun
> > Sent: Wednesday, September 06, 2017 10:38 AM
> > To: Joakim Tjernlund <Joakim.Tjernlund@infinera•com>;
> > linuxppc-dev@lists•ozlabs.org; Leo Li <leoyang.li@nxp•com>
> > Subject: Re: Machine Check in P2010(e500v2)
> >
> > Scott is no longer with Freescale/NXP. Adding Leo.
> >
> > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote:
> > > So after some debugging I found this bug:
> > > @@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
> > >  	if (is_in_pci_mem_space(addr)) {
> > >  		if (user_mode(regs)) {
> > >  			pagefault_disable();
> > > -			ret = get_user(regs->nip, &inst);
> > > +			ret = get_user(inst, (__u32 __user *)regs->nip);
> > >  			pagefault_enable();
> > >  		} else {
> > >  			ret = probe_kernel_address(regs->nip, inst);
> > >
> > > However, the kernel still locked up after fixing that.
> > > Now I wonder why this fixup is there in the first place? The routine
> > > will not really fixup the insn, just return 0xffffffff for the failing
> > > read and then advance the process NIP.
>
> You are right. The code here only gives 0xffffffff to the load instruction
> and continues with the next instruction when the load instruction is causing
> the machine check. This will prevent a system lockup when reading from a
> PCI/RapidIO device whose link is down.
>
> I don't know what the actual problem is in your case. Maybe it is a write
> instruction instead of a read? Or the code is in an infinite loop waiting for
> a valid read result? Are you able to do some further debugging with the NIP
> correctly printed?
>
According to the MC it is a read, and the NIP also leads to a read in the
program. ATM, I have disabled the fixup but I will enable it again.
Question: is it safe to add a small printk when this MC happens (after fixing
up)? I need to see that it has happened, as the error is somewhat random.
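
To make the question concrete, here is a minimal user-space sketch of the
fixup as described above (fake 0xffffffff into the load target, advance NIP
by 4), with a marker where the message would go. This is not the kernel code:
struct fake_regs, fake_mcheck_fixup() and mc_fixup_count are hypothetical
names, only lwz is modelled, and fprintf() merely stands in for a printk.
Whether printing is actually safe in machine check context is exactly what I
am asking.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the few pt_regs fields the fixup touches. */
struct fake_regs {
	uint32_t gpr[32];
	uint32_t nip;
};

/* Primary opcode (top 6 bits) of the PowerPC D-form "lwz" instruction. */
#define OP_LWZ 32u

/* Illustrative counter so a rare event can be noticed afterwards. */
static unsigned long mc_fixup_count;

/*
 * Model of the fixup: if inst is an lwz, put all ones in the target
 * register and step over the instruction. Other load forms are left
 * out of this sketch. Returns 1 if the fixup was applied, 0 otherwise.
 */
static int fake_mcheck_fixup(struct fake_regs *regs, uint32_t inst)
{
	unsigned int op = inst >> 26;          /* primary opcode */
	unsigned int rt = (inst >> 21) & 0x1f; /* target register */

	assert(regs != NULL);

	if (op != OP_LWZ)
		return 0;                      /* not handled: leave regs alone */

	regs->gpr[rt] = 0xffffffffu;           /* pretend the read returned all ones */
	regs->nip += 4;                        /* resume at the next instruction */

	/* This is where the small message would go. */
	mc_fixup_count++;
	fprintf(stderr, "MC fixup #%lu: r%u <- 0xffffffff, nip now %08x\n",
		mc_fixup_count, rt, (unsigned int)regs->nip);
	return 1;
}
```

With nip = 0x10a4e2f4 and inst = 0x812a0000 (lwz r9,0(r10)), the sketch sets
r9 to 0xffffffff and moves nip to 0x10a4e2f8; a store such as 0x912a0000
(stw r9,0(r10)) is left untouched.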
Jocke
> Regards,
> Leo
>
> > >
> > > Removing the fixup does not help either, kernel still locks up:
> > > [ 28.170532] Machine check in kernel mode.
> > > [ 28.174538] Caused by (from MCSR=10008):
> > > [ 28.182804] Bus - Read Data Bus Error: DAR:b7013000
> > > [ 28.197079] Oops: Machine check, sig: 7 [#1]
> > > [ 28.201343] P1010 RDB
> > > [ 28.203608] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) linux_kernel_bde(PO)
> > > [ 28.211796] CPU: 0 PID: 470 Comm: emxp2_hw_bl Tainted: P O 4.1.38+ #201
> > > [ 28.219540] task: db16ed10 ti: df122000 task.ti: df122000
> > > [ 28.224935] NIP: 10a4e2f4 LR: 10a4e404 CTR: 10046c38
> > > [ 28.229896] REGS: df123f10 TRAP: 0204 Tainted: P O (4.1.38+)
> > > [ 28.236942] MSR: 0002d000 <CE,EE,PR,ME> CR: 44002428 XER: 00000000
> > > [ 28.243306] DEAR: b7013000 ESR: 00000000
> > > GPR00: 10a4e404 bfab2730 b7b354a0 132f9fa8 07006000 07000000 00000000 132f9fd8
> > > GPR08: b6fd5000 b6fe5000 0003e000 bfab2720 24004424 11d6cf7c 00000000 00000000
> > > GPR16: 10f6e29c 10f6c872 10f6db01 0000b541 0000b541 11d92fcc 00000011 00000001
> > > GPR24: 01a5bd3e 132ffbf0 11d60000 00000000 07006000 00000000 132f9fa8 00000000
> > > [ 28.275547] NIP [10a4e2f4] 0x10a4e2f4
> > > [ 28.279204] LR [10a4e404] 0x10a4e404
> > > [ 28.282772] Call Trace:
> > > [ 28.285213] ---[ end trace 9f8b64ab1e83f449 ]---
> > > [ 28.289825]
> > >
> > > Jocke
> > >
> > > On Fri, 2017-09-01 at 13:32 +0200, Joakim Tjernlund wrote:
> > > > I am trying to debug a Machine Check for a P2010 (e500v2) CPU:
> > > >
> > > > [ 28.111816] Caused by (from MCSR=10008): Bus - Read Data Bus Error
> > > > [ 28.117998] Oops: Machine check, sig: 7 [#1]
> > > > [ 28.122263] P1010 RDB
> > > > [ 28.124529] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) linux_kernel_bde(PO)
> > > > [ 28.132718] CPU: 0 PID: 470 Comm: emxp2_hw_bl Tainted: P O 4.1.38+ #49
> > > > [ 28.140376] task: db16cd10 ti: df128000 task.ti: df128000
> > > > [ 28.145770] NIP: 00000000 LR: 10a4e404 CTR: 10046c38
> > > > [ 28.150730] REGS: df129f10 TRAP: 0204 Tainted: P O (4.1.38+)
> > > > [ 28.157776] MSR: 0002d000 <CE,EE,PR,ME> CR: 44002428 XER: 00000000
> > > > [ 28.164140] DEAR: b7187000 ESR: 00000000
> > > > GPR00: 10a4e404 bf86ea30 b7ca94a0 132f9fa8 07006000 07000000 00000000 132f9fd8
> > > > GPR08: b7149000 b7159000 0003e000 bf86ea20 24004424 11d6cf7c 00000000 00000000
> > > > GPR16: 10f6e29c 10f6c872 10f6db01 0000b541 0000b541 11d92fcc 00000011 00000001
> > > > GPR24: 01a4d12d 132ffbf0 11d60000 00000000 07006000 00000000 132f9fa8 00000000
> > > > [ 28.196375] NIP [00000000] (null)
> > > > [ 28.199859] LR [10a4e404] 0x10a4e404
> > > > [ 28.203426] Call Trace:
> > > > [ 28.205866] ---[ end trace f456255ddf9bee83 ]---
> > > >
> > > > I cannot figure out why NIP is NULL. It looks like NIP is set to
> > > > MCSRR0 early on, but maybe it is lost somehow?
> > > >
> > > > Anyhow, looking at entry_32.S:
> > > > 	.globl	mcheck_transfer_to_handler
> > > > mcheck_transfer_to_handler:
> > > > 	mfspr	r0,SPRN_DSRR0
> > > > 	stw	r0,_DSRR0(r11)
> > > > 	mfspr	r0,SPRN_DSRR1
> > > > 	stw	r0,_DSRR1(r11)
> > > > 	/* fall through */
> > > >
> > > > 	.globl	debug_transfer_to_handler
> > > > debug_transfer_to_handler:
> > > > 	mfspr	r0,SPRN_CSRR0
> > > > 	stw	r0,_CSRR0(r11)
> > > > 	mfspr	r0,SPRN_CSRR1
> > > > 	stw	r0,_CSRR1(r11)
> > > > 	/* fall through */
> > > >
> > > > 	.globl	crit_transfer_to_handler
> > > > crit_transfer_to_handler:
> > > >
> > > > It looks odd that DSRRx is saved in mcheck and CSRRx in debug, while
> > > > crit has none. Should this assignment not be shifted down one level?
> > > >
> > > > Jocke