* [PATCH v2] ACPI: APEI: Handle repeated SEA error storms
From: Junhao He @ 2026-05-27 8:27 UTC
To: rafael, tony.luck, guohanjun, mchehab, xueshuai, jarkko,
yazen.ghannam, jane.chu, lenb, linmiaohe
Cc: bp, linux-acpi, linux-arm-kernel, linux-kernel, linux-edac,
tanxiaofei, linuxarm, liuyonglong, mawupeng1, hejunhao3
When hardware memory corruption occurs and a user process accesses the
corrupted page, the CPU triggers a Synchronous External Abort (SEA).
The kernel invokes do_sea() to handle the exception, which calls
memory_failure() to handle the faulty page.
Scenario 1: memory error interrupt first, then SEA
The page is already poisoned by the memory error interrupt path. The
subsequent SEA handler sends a SIGBUS to the task that accessed the
poisoned page. This flow is correct.
Scenario 2: SEA first, then memory error interrupt (problematic scenario)
If a user task directly accesses corrupted memory through a PFNMAP-style
mapping (e.g., devmem), the page may still be in the free-buddy state when
SEA is handled. In this case, memory_failure() poisons the page without
invoking kill_accessing_process() and then takes the free-buddy recovery
path.
After the CPU returns to the task context, the task re-enters the SEA
handler due to the same access. However, ghes_estatus_cached() suppresses
all subsequent entries during the 10-second window, preventing
ghes_do_proc() from being called. This suppression blocks the
MF_ACTION_REQUIRED-based SIGBUS delivery, causing the kernel to fail to
kill the task immediately. Consequently, the process keeps re-entering
the SEA handler, leading to an SEA storm. Later, the memory error
interrupt path also cannot kill the task, leaving the system stuck in
this repeated loop.
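
The suppression described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct error_record`, `struct dedup_cache` and `record_is_cached()` are hypothetical names invented for illustration, not the kernel's ghes_estatus_cached() implementation. The only detail taken from the description is the 10-second window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's ghes_estatus_cached(). */
#define CACHE_WINDOW_SEC 10 /* the 10-second window described above */

struct error_record {
	uint64_t phys_addr; /* faulting physical address */
	int severity;       /* opaque severity code */
};

struct dedup_cache {
	struct error_record last; /* most recently stored record */
	uint64_t stored_at_sec;   /* time the record was stored */
	bool valid;               /* false until the first record arrives */
};

/*
 * Return true when an identical record was stored less than
 * CACHE_WINDOW_SEC seconds ago. Otherwise store the record and return
 * false. now_sec is assumed to be monotonic (never goes backwards).
 */
static bool record_is_cached(struct dedup_cache *cache,
			     const struct error_record *rec,
			     uint64_t now_sec)
{
	assert(cache != NULL && rec != NULL);

	if (cache->valid &&
	    cache->last.phys_addr == rec->phys_addr &&
	    cache->last.severity == rec->severity &&
	    now_sec - cache->stored_at_sec < CACHE_WINDOW_SEC)
		return true; /* duplicate: the caller skips processing */

	cache->last = *rec;
	cache->stored_at_sec = now_sec;
	cache->valid = true;
	return false;
}
```

In this model, a task that keeps faulting on the same address gets a "cached" answer on every retry inside the window, so the step that would deliver the signal is never reached. That is the loop the commit message describes.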
The following error logs show the problem, using the devmem process as an
example:
NOTICE: SEA Handle
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[Hardware Error]: event severity: recoverable
[Hardware Error]: section_type: ARM processor error
[Hardware Error]: physical fault address: 0x0000001000093c00
[T54990] Memory failure: 0x1000093: recovery action for free buddy page: Recovered
[ T9955] EDAC MC0: 1 UE Multi-bit ECC on unknown memory
(page:0x1000093 offset:0xc00 grain:1 - APEI location: ...)
NOTICE: SEA Handle
NOTICE: SEA Handle
...
... ---> SEA storm
...
NOTICE: SEA Handle
[ T9955] Memory failure: 0x1000093: already hardware poisoned
ghes_print_estatus: 1 callbacks suppressed
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[Hardware Error]: event severity: recoverable
[Hardware Error]: section_type: ARM processor error
[Hardware Error]: physical fault address: 0x0000001000093c00
[T54990] Memory failure: 0x1000093: already hardware poisoned
[T54990] 0x1000093: Sending SIGBUS to devmem:54990 due to hardware memory corruption
To resolve this, return an error when encountering the same SEA again.
The subsequent SEA handler invocation uses arm64_notify_die() to send a
SIGBUS signal to the task, which terminates the process and prevents it
from re-entering the handler loop.
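
The intended control flow can be sketched in userspace as follows. This is a simplified model, not the kernel code: `struct sea_entry`, `queue_one_entry()`, `handle_sea()` and `enum sea_outcome` are hypothetical names, and the assumption that a zero return means "claimed" while a non-zero return leads to SIGBUS delivery is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one incoming error entry. */
struct sea_entry {
	bool cached;      /* same record already seen inside the window */
	bool sync_notify; /* entry arrived through a synchronous (SEA) path */
};

enum sea_outcome {
	SEA_CLAIMED,     /* the handler returns to the task without a signal */
	SEA_SIGBUS_SENT, /* the handler falls back to signalling the task */
};

/* Model of the queueing step: 0 on success, negative errno on failure. */
static int queue_one_entry(const struct sea_entry *entry)
{
	int rc = 0;

	assert(entry != NULL);

	if (entry->cached) {
		/* Duplicate: fail only for synchronous notifications. */
		if (entry->sync_notify)
			rc = -ECANCELED;
		return rc; /* nothing is queued for a duplicate */
	}

	/* A new record would be queued for later processing here. */
	return rc;
}

/* Model of the SEA handler: a failed claim results in a SIGBUS. */
static enum sea_outcome handle_sea(const struct sea_entry *entry)
{
	if (queue_one_entry(entry) == 0)
		return SEA_CLAIMED;
	return SEA_SIGBUS_SENT;
}
```

Under these assumptions, a duplicate entry on a non-synchronous path still returns 0 as before, which matches the stated goal of leaving other NMI-type GHES handlers unaffected.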
Signed-off-by: Junhao He <hejunhao3@h-partners.com>
---
drivers/acpi/apei/ghes.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
Changes in V2:
1. Update the commit message per suggestion from Xueshuai.
2. Add a check to only return failure on the ghes_notify_sea() path,
   avoiding impact on other NMI-type GHES handlers.
Link to V1: https://lore.kernel.org/all/20251030071321.2763224-1-hejunhao3@h-partners.com/
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3236a3ce79d6..787664740150 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -1383,8 +1383,16 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
 	ghes_clear_estatus(ghes, &tmp_header, buf_paddr, fixmap_idx);

 	/* This error has been reported before, don't process it again. */
-	if (ghes_estatus_cached(estatus))
+	if (ghes_estatus_cached(estatus)) {
+		/*
+		 * Return failure on duplicate SEA entries so that the
+		 * subsequent SEA handler invocation sends a SIGBUS signal to
+		 * the task to prevent it from re-entering the handler loop.
+		 */
+		if (is_hest_sync_notify(ghes))
+			rc = -ECANCELED;
 		goto no_work;
+	}

 	llist_add(&estatus_node->llnode, &ghes_estatus_llist);

--
2.33.0
* Re: [PATCH v2] ACPI: APEI: Handle repeated SEA error storms
From: mawupeng @ 2026-05-28 1:48 UTC
To: hejunhao3, rafael, tony.luck, guohanjun, mchehab, xueshuai,
jarkko, yazen.ghannam, jane.chu, lenb, linmiaohe
Cc: mawupeng1, bp, linux-acpi, linux-arm-kernel, linux-kernel,
linux-edac, tanxiaofei, linuxarm, liuyonglong
On Wed 2026-5-27 16:27, Junhao He wrote:
> When hardware memory corruption occurs and a user process accesses the
> corrupted page, the CPU triggers a Synchronous External Abort (SEA).
> The kernel invokes do_sea() to handle the exception, which calls
> memory_failure() to handle the faulty page.
>
> Scenario 1: memory error interrupt first, then SEA
> The page is already poisoned by the memory error interrupt path. The
> subsequent SEA handler sends a SIGBUS to the task that accessed the
> poisoned page. This flow is correct.
>
> Scenario 2: SEA first, then memory error interrupt (problematic scenario)
> If a user task directly accesses corrupted memory through a PFNMAP-style
> mapping (e.g., devmem), the page may still be in the free-buddy state when
> SEA is handled. In this case, memory_failure() poisons the page without
> invoking kill_accessing_process() and then takes the free-buddy recovery
> path.
>
> After the CPU returns to the task context, the task re-enters the SEA
> handler due to the same access. However, ghes_estatus_cached() suppresses
> all subsequent entries during the 10-second window, preventing
> ghes_do_proc() from being called. This suppression blocks the
> MF_ACTION_REQUIRED-based SIGBUS delivery, causing the kernel to fail to
> kill the task immediately. Consequently, the process keeps re-entering
> the SEA handler, leading to an SEA storm. Later, the memory error
> interrupt path also cannot kill the task, leaving the system stuck in
> this repeated loop.
>
> The following error logs show the problem, using the devmem process as an
> example:
> NOTICE: SEA Handle
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
> [Hardware Error]: event severity: recoverable
> [Hardware Error]: section_type: ARM processor error
> [Hardware Error]: physical fault address: 0x0000001000093c00
> [T54990] Memory failure: 0x1000093: recovery action for free buddy page: Recovered
> [ T9955] EDAC MC0: 1 UE Multi-bit ECC on unknown memory
> (page:0x1000093 offset:0xc00 grain:1 - APEI location: ...)
> NOTICE: SEA Handle
> NOTICE: SEA Handle
> ...
> ... ---> SEA storm
> ...
> NOTICE: SEA Handle
> [ T9955] Memory failure: 0x1000093: already hardware poisoned
> ghes_print_estatus: 1 callbacks suppressed
> [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
> [Hardware Error]: event severity: recoverable
> [Hardware Error]: section_type: ARM processor error
> [Hardware Error]: physical fault address: 0x0000001000093c00
> [T54990] Memory failure: 0x1000093: already hardware poisoned
> [T54990] 0x1000093: Sending SIGBUS to devmem:54990 due to hardware memory corruption
>
> To resolve this, return an error when encountering the same SEA again.
> The subsequent SEA handler invocation uses arm64_notify_die() to send a
> SIGBUS signal to the task, which terminates the process and prevents it
> from re-entering the handler loop.
>
> Signed-off-by: Junhao He <hejunhao3@h-partners.com>
Reviewed-by: Wupeng Ma <mawupeng1@huawei.com>