From: Yazen Ghannam <yazen.ghannam@amd•com>
To: Nikolay Borisov <nik.borisov@suse•com>
Cc: Bert Karwatzki <spasswolf@web•de>, Borislav Petkov <bp@alien8•de>,
Tony Luck <tony.luck@intel•com>,
linux-kernel@vger•kernel.org, linux-next@vger•kernel.org,
linux-edac@vger•kernel.org, linux-acpi@vger•kernel.org,
x86@kernel•org, rafael@kernel•org, qiuxu.zhuo@intel•com,
Smita.KoralahalliChannabasappa@amd•com
Subject: Re: spurious mce Hardware Error messages in next-20250912
Date: Thu, 18 Sep 2025 17:00:05 -0400
Message-ID: <20250918210005.GA2150610@yaz-khff2.amd.com>
In-Reply-To: <5ba955fe-2b96-429e-b2e8-5e1bf19d8e8e@suse.com>
On Thu, Sep 18, 2025 at 01:20:58PM +0300, Nikolay Borisov wrote:
>
>
> On 9/17/25 22:26, Yazen Ghannam wrote:
> <snip>
>
>
> > Right, so it seems we have bogus data logged in these registers. And
> > this is unrelated to the recent patches.
> >
> > We have some combination of bits set in MCA_DESTAT registers. The
> > deferred error interrupt hasn't fired (at least from the latest
> > example).
> >
> > There does seem to be some combination of bits that are always set and
> > others flip between examples.
> >
> > I'll highlight this to our hardware folks. But I don't think there's
> > much we can do other than filter these out somehow.
> >
> > I can add two checks to the patch to make it more like the current
> > behavior.
> >
> > 1) Check for 'Deferred' status bit when logging from the MCA_DESTAT.
> > This was in the debug patch I shared.
>
> According to AMD APM 9.3.3.4:
>
> "If the error being logged is a deferred error, then the error will be
> logged to MCA_DESTAT."
>
> So this means that when Valid is set in DESTAT then the error MUST BE
> deferred. I.e., I think it's invalid to have valid && !deferred in DESTAT,
> no?
Yes, correct. That is why this issue is perplexing.
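To make the proposed filter concrete, here is a minimal sketch of check (1), assuming the usual MCA_STATUS layout also applies to MCA_DESTAT (Valid in bit 63, Deferred in bit 44). The macro and function names are invented for illustration and are not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the MCA_STATUS layout; the names are illustrative only. */
#define MCA_ST_VALID    (1ULL << 63)
#define MCA_ST_DEFERRED (1ULL << 44)

/*
 * Hypothetical helper: accept an MCA_DESTAT value only when both Valid and
 * Deferred are set. Valid without Deferred is treated as bogus data.
 */
static bool destat_is_loggable(uint64_t destat)
{
	return (destat & MCA_ST_VALID) && (destat & MCA_ST_DEFERRED);
}
```

With this check, a register value that has Valid set but Deferred clear, like the ones reported in this thread, would be skipped rather than logged.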
>
> Additionally, nowhere in the APM is it mentioned what the default value of
> MCA_CONFIG.LogDeferredEn is, so as it stands you are now working with the
> assumption that it's 1 and DESTAT is always a redundant copy of STATUS.
>
The value is determined by the platform. The Linux code is structured so
the data is gathered from any possible source. That's why there are a
few checks to determine which register to look at.
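As a rough illustration of one such check, the sketch below tests the LogDeferredEn bit, taking bit 34 from the APM text quoted further down. The macro and function names are hypothetical, and this is not how the kernel code is actually written.

```c
#include <stdbool.h>
#include <stdint.h>

/* LogDeferredEn is bit 34 of MCA_CONFIG per the APM; the macro name is made up. */
#define MCA_CFG_LOG_DEFERRED_EN (1ULL << 34)

/*
 * Hypothetical helper: true if deferred errors are also logged in
 * MCA_STATUS/MCA_ADDR, false if they appear only in MCA_DESTAT/MCA_DEADDR.
 */
static bool deferred_also_in_status(uint64_t mca_config)
{
	return (mca_config & MCA_CFG_LOG_DEFERRED_EN) != 0;
}
```

When this returns false, MCA_DESTAT is the only place a deferred error can show up, so it cannot be treated as a redundant copy.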
> Btw looking at the output that Bert has provided it seems that indeed
> MCA_CONFIG.LogDeferredEn is 0 by default:
Banks 9 to 14 seem to have bogus values. And this seems to be the cause
of our mishandling here.

You can see the difference compared to the other banks. Banks 7 and 8
are good comparisons as they are of the same "type" (L3 cache).
>
> "
> LogDeferredEn—Bit 34. Enable logging of deferred errors in MCA_STATUS. 0=Log
> deferred errors only in MCA_DESTAT and MCA_DEADDR. 1=Log deferred errors in
> MCA_STATUS and MCA_ADDR in addition to MCA_DESTAT and MCA_DEADDR. This bit
> does not affect logging of deferred errors in MCA_SYND or MCA_MISCx.
> "
>
>
> I think the polling code is slightly broken now for AMD. The order of
> operation per poll cycle should be:
>
> 1. Check MCA_STATUS -> report if there is anything, then clear the bank
> 2. (In the same cycle) -> Check DEFERRED and report if there is anything,
> then clear the deferred error.
>
It is unlikely to have two independent errors in MCA_STATUS and
MCA_DESTAT due to how errors can be overwritten by more severe errors.

By default, our reference platform implementation has
MCA_CONFIG[LogDeferredInMcaStat] enabled. So a deferred error in
MCA_STATUS will only be overwritten by an uncorrectable (#MC) error. In
this case, MCA_STATUS will be cleared by the #MC handler. And so
MCA_DESTAT acts as a backup.

But you're right, there is a gap here that we can try to fill if a
platform ever changes this config bit.
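One way the two-step poll order suggested above could look is sketched here. It runs against an in-memory stand-in for a bank rather than real MSRs. The struct, the names, and the simple "same value means redundant copy" test are all assumptions made for illustration, not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCA_ST_VALID    (1ULL << 63)
#define MCA_ST_DEFERRED (1ULL << 44)

/* Hypothetical in-memory stand-in for one bank's registers. */
struct fake_bank {
	uint64_t status;	/* models MCA_STATUS */
	uint64_t destat;	/* models MCA_DESTAT */
};

/* Poll one bank once; returns how many records would be logged. */
static int poll_bank_once(struct fake_bank *b)
{
	uint64_t seen = 0;
	int logged = 0;

	/* Step 1: report and clear MCA_STATUS if it holds anything. */
	if (b->status & MCA_ST_VALID) {
		seen = b->status;
		logged++;
		b->status = 0;
	}

	/* Step 2, same cycle: only a Valid and Deferred MCA_DESTAT counts. */
	if ((b->destat & MCA_ST_VALID) && (b->destat & MCA_ST_DEFERRED)) {
		/* Simplification: an identical value is assumed to be the mirrored copy. */
		if (b->destat != seen)
			logged++;
		b->destat = 0;
	}

	/* A Valid but non-Deferred MCA_DESTAT is left untouched and not logged. */
	return logged;
}
```

In this sketch an independent deferred error is reported in the same cycle as the MCA_STATUS error, a mirrored copy is reported once, and junk values are ignored.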
For the current issue, it does seem that the registers contain junk
values. And we are only now seeing this with the recent rework.

Bert, can you please provide two more register dumps from the script?
Our hardware team is interested to see if the values remain consistent
or change between reads.

Thanks,
Yazen