* [PATCH v4 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events
2026-05-27 12:11 [PATCH v4 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-05-27 12:11 ` Puranjay Mohan
2026-05-27 12:11 ` [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields Puranjay Mohan
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Puranjay Mohan @ 2026-05-27 12:11 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
Breno Leitao, Ravi Bangoria, Stephane Eranian,
Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
linux-perf-users, linux-kselftest, linux-kernel, kernel-team
perf_pmu_sched_task() returns early when cpuctx->task_ctx is non-NULL,
deferring to perf_ctx_sched_task_cb() in the context sched_in/out
paths. But perf_ctx_sched_task_cb() only walks the task context's
pmu_ctx_list -- PMUs that have only CPU-wide events are not on that
list and their sched_task callback is silently skipped.

On ARM64 with CPU-wide branch recording:

  perf record -b -e cycles -a -- ls

armv8pmu_sched_task() is skipped whenever the scheduled task has an
unrelated perf event (e.g. a software event), and branch records leak
across task boundaries.

A second problem exists in __perf_pmu_sched_task(): it passes
cpc->task_epc directly to pmu->sched_task(), but task_epc is NULL for
PMUs with only CPU-wide events. When perf_pmu_sched_task() does reach
the loop (because cpuctx->task_ctx is NULL), this causes a NULL
pointer dereference:

  Unable to handle kernel NULL pointer dereference at virtual address 00[.]
  PC is at armv8pmu_sched_task+0x14/0x50
  Call trace:
   armv8pmu_sched_task+0x14/0x50 (P)
   perf_pmu_sched_task+0xac/0x108
   __perf_event_task_sched_out+0x6c/0xe0

Fix both:

- Remove the blanket early return in perf_pmu_sched_task() when
  cpuctx->task_ctx is set. Instead, skip individual CPCs that have a
  task_epc (those are handled by perf_ctx_sched_task_cb()). CPCs
  without a task_epc are CPU-only and must be handled here.

- Fall back to &cpc->epc in __perf_pmu_sched_task() when task_epc is
  NULL, so the callback always gets a valid pmu_ctx.
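
The two rules above can be sketched in user space as follows. This is only an illustrative model, not kernel code: the mock_* types and helpers are hypothetical stand-ins for struct perf_cpu_pmu_context and its fields, and only the decision logic is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the perf structures. */
struct mock_epc { int id; };

struct mock_cpc {
	struct mock_epc epc;        /* CPU-wide pmu_ctx, always present */
	struct mock_epc *task_epc;  /* NULL when the PMU has only CPU-wide events */
};

/*
 * Should the CPU-side walk invoke sched_task for this cpc?
 * It is skipped only when a task context is active and the cpc has a
 * task_epc, because the task-context path covers that case.
 */
static bool mock_cpu_walk_handles(bool task_ctx_active,
				  const struct mock_cpc *cpc)
{
	assert(cpc != NULL);
	return !(task_ctx_active && cpc->task_epc);
}

/* Pick the pmu_ctx for the callback, falling back to the CPU-wide one. */
static struct mock_epc *mock_sched_task_arg(struct mock_cpc *cpc)
{
	assert(cpc != NULL);
	return cpc->task_epc ? cpc->task_epc : &cpc->epc;
}
```

With a CPU-only cpc (task_epc == NULL), the walk handles it whether or not a task context is active, and the callback argument is &cpc->epc rather than NULL.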
Fixes: bd2756811766 ("perf: Rewrite core context handling")
Signed-off-by: Puranjay Mohan <puranjay@kernel•org>
---
kernel/events/core.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..6604f6e8f352 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3906,7 +3906,8 @@ static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc,
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(pmu);
- pmu->sched_task(cpc->task_epc, task, sched_in);
+ pmu->sched_task(cpc->task_epc ? cpc->task_epc : &cpc->epc,
+ task, sched_in);
perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3919,12 +3920,20 @@ static void perf_pmu_sched_task(struct task_struct *prev,
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
struct perf_cpu_pmu_context *cpc;
- /* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
- if (prev == next || cpuctx->task_ctx)
+ if (prev == next)
return;
- list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+ list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
+ /*
+ * PMUs with per-task events are handled by
+ * perf_ctx_sched_task_cb() via perf_event_context_sched_in/out
+ * when a task context is active.
+ */
+ if (cpuctx->task_ctx && cpc->task_epc)
+ continue;
+
__perf_pmu_sched_task(cpc, sched_in ? next : prev, sched_in);
+ }
}
static void perf_event_switch(struct task_struct *task,
--
2.53.0-Meta
* [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields
2026-05-27 12:11 [PATCH v4 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-05-27 12:11 ` [PATCH v4 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
@ 2026-05-27 12:11 ` Puranjay Mohan
2026-05-27 13:00 ` bot+bpf-ci
2026-05-27 12:11 ` [PATCH v4 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-05-27 12:12 ` [PATCH v4 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
3 siblings, 1 reply; 7+ messages in thread
From: Puranjay Mohan @ 2026-05-27 12:11 UTC (permalink / raw)
To: bpf
perf_clear_branch_entry_bitfields() zeroes individual bitfields of struct
perf_branch_entry but has repeatedly fallen out of sync when new fields
were added (new_type and priv were missed).

Wrap the bitfields in an anonymous struct inside a union with a u64
bitfields member, and clear them all with a single assignment. This
avoids having to update the clearing function every time a new bitfield
is added.
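
A small user-space sketch of the idea, assuming a compiler that packs the ten bitfields into one 64-bit word (as GCC and Clang do). struct mock_branch_entry and mock_clear_bitfields() are hypothetical names that mirror the layout described here; they are not the UAPI definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the branch entry layout; not the real UAPI struct. */
struct mock_branch_entry {
	uint64_t from;
	uint64_t to;
	union {
		struct {
			uint64_t mispred   : 1,
				 predicted : 1,
				 in_tx     : 1,
				 abort     : 1,
				 cycles    : 16,
				 type      : 4,
				 spec      : 2,
				 new_type  : 4,
				 priv      : 3,
				 reserved  : 31;
		};
		uint64_t bitfields;	/* overlays every flag above */
	};
};

/* The union must not change the size of the entry. */
_Static_assert(sizeof(struct mock_branch_entry) == 3 * sizeof(uint64_t),
	       "bitfields must occupy exactly one 64-bit word");

/* One store clears every flag, including any added later. */
static void mock_clear_bitfields(struct mock_branch_entry *br)
{
	assert(br);
	br->bitfields = 0;
}
```

Because the clear is a single store to the overlaying word, a field added to the anonymous struct later is cleared without touching the helper.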
Fixes: bfe4daf850f4 ("perf/core: Add perf_clear_branch_entry_bitfields() helper")
Signed-off-by: Puranjay Mohan <puranjay@kernel•org>
---
include/linux/perf_event.h | 9 +--------
include/uapi/linux/perf_event.h | 25 +++++++++++++++----------
tools/include/uapi/linux/perf_event.h | 25 +++++++++++++++----------
3 files changed, 31 insertions(+), 28 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 48d851fbd8ea..f7360c43f902 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1474,14 +1474,7 @@ static inline u32 perf_sample_data_size(struct perf_sample_data *data,
*/
static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
{
- br->mispred = 0;
- br->predicted = 0;
- br->in_tx = 0;
- br->abort = 0;
- br->cycles = 0;
- br->type = 0;
- br->spec = PERF_BR_SPEC_NA;
- br->reserved = 0;
+ br->bitfields = 0;
}
extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index fd10aa8d697f..c2e7b1b1c4fa 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1491,16 +1491,21 @@ union perf_mem_data_src {
struct perf_branch_entry {
__u64 from;
__u64 to;
- __u64 mispred : 1, /* target mispredicted */
- predicted : 1, /* target predicted */
- in_tx : 1, /* in transaction */
- abort : 1, /* transaction abort */
- cycles : 16, /* cycle count to last branch */
- type : 4, /* branch type */
- spec : 2, /* branch speculation info */
- new_type : 4, /* additional branch type */
- priv : 3, /* privilege level */
- reserved : 31;
+ union {
+ struct {
+ __u64 mispred : 1, /* target mispredicted */
+ predicted : 1, /* target predicted */
+ in_tx : 1, /* in transaction */
+ abort : 1, /* transaction abort */
+ cycles : 16, /* cycle count to last branch */
+ type : 4, /* branch type */
+ spec : 2, /* branch speculation info */
+ new_type : 4, /* additional branch type */
+ priv : 3, /* privilege level */
+ reserved : 31;
+ };
+ __u64 bitfields;
+ };
};
/* Size of used info bits in struct perf_branch_entry */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index fd10aa8d697f..c2e7b1b1c4fa 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1491,16 +1491,21 @@ union perf_mem_data_src {
struct perf_branch_entry {
__u64 from;
__u64 to;
- __u64 mispred : 1, /* target mispredicted */
- predicted : 1, /* target predicted */
- in_tx : 1, /* in transaction */
- abort : 1, /* transaction abort */
- cycles : 16, /* cycle count to last branch */
- type : 4, /* branch type */
- spec : 2, /* branch speculation info */
- new_type : 4, /* additional branch type */
- priv : 3, /* privilege level */
- reserved : 31;
+ union {
+ struct {
+ __u64 mispred : 1, /* target mispredicted */
+ predicted : 1, /* target predicted */
+ in_tx : 1, /* in transaction */
+ abort : 1, /* transaction abort */
+ cycles : 16, /* cycle count to last branch */
+ type : 4, /* branch type */
+ spec : 2, /* branch speculation info */
+ new_type : 4, /* additional branch type */
+ priv : 3, /* privilege level */
+ reserved : 31;
+ };
+ __u64 bitfields;
+ };
};
/* Size of used info bits in struct perf_branch_entry */
--
2.53.0-Meta
* Re: [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields
2026-05-27 12:11 ` [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields Puranjay Mohan
@ 2026-05-27 13:00 ` bot+bpf-ci
0 siblings, 0 replies; 7+ messages in thread
From: bot+bpf-ci @ 2026-05-27 13:00 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, daniel, john.fastabend, andrii,
martin.lau, eddyz87, song, yonghong.song, will, mark.rutland,
catalin.marinas, leo.yan, robh, peterz, mingo, acme, namhyung,
james.clark, irogers, adrian.hunter, shuah, leitao, ravi.bangoria,
eranian, memxor, usama.arif, linux-arm-kernel, linux-perf-users,
linux-kselftest, linux-kernel, kernel-team, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> perf: Use a union to clear branch entry bitfields
>
> perf_clear_branch_entry_bitfields() zeroes individual bitfields of struct
> perf_branch_entry but has repeatedly fallen out of sync when new fields
> were added (new_type and priv were missed).
>
> Wrap the bitfields in an anonymous struct inside a union with a u64
> bitfields member, and clear them all with a single assignment. This
> avoids having to update the clearing function every time a new bitfield
> is added.
>
> Fixes: bfe4daf850f4 ("perf/core: Add perf_clear_branch_entry_bitfields() helper")
> Signed-off-by: Puranjay Mohan <puranjay@kernel•org>
The Fixes tag points to commit bfe4daf850f4 which introduced the
perf_clear_branch_entry_bitfields() helper, but that commit was correct
at the time.

The actual bug was introduced by commit b190bc4ac9e6 which added the
'new_type' field without updating the clearing function. Commit
5402d25aa571 also added the 'priv' field with the same issue.

Should the Fixes tag be:

  Fixes: b190bc4ac9e6 ("perf: Extend branch type classification")

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 48d851fbd8ea..f7360c43f902 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1474,14 +1474,7 @@ static inline u32 perf_sample_data_size(struct perf_sample_data *data,
> */
> static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
> {
> - br->mispred = 0;
> - br->predicted = 0;
> - br->in_tx = 0;
> - br->abort = 0;
> - br->cycles = 0;
> - br->type = 0;
> - br->spec = PERF_BR_SPEC_NA;
> - br->reserved = 0;
> + br->bitfields = 0;
> }
>
> extern void perf_output_sample(struct perf_output_handle *handle,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index fd10aa8d697f..c2e7b1b1c4fa 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1491,16 +1491,21 @@ union perf_mem_data_src {
> struct perf_branch_entry {
> __u64 from;
> __u64 to;
> - __u64 mispred : 1, /* target mispredicted */
> - predicted : 1, /* target predicted */
> - in_tx : 1, /* in transaction */
> - abort : 1, /* transaction abort */
> - cycles : 16, /* cycle count to last branch */
> - type : 4, /* branch type */
> - spec : 2, /* branch speculation info */
> - new_type : 4, /* additional branch type */
> - priv : 3, /* privilege level */
> - reserved : 31;
> + union {
> + struct {
> + __u64 mispred : 1, /* target mispredicted */
> + predicted : 1, /* target predicted */
> + in_tx : 1, /* in transaction */
> + abort : 1, /* transaction abort */
> + cycles : 16, /* cycle count to last branch */
> + type : 4, /* branch type */
> + spec : 2, /* branch speculation info */
> + new_type : 4, /* additional branch type */
> + priv : 3, /* privilege level */
> + reserved : 31;
> + };
> + __u64 bitfields;
> + };
> };
>
> /* Size of used info bits in struct perf_branch_entry */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index fd10aa8d697f..c2e7b1b1c4fa 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1491,16 +1491,21 @@ union perf_mem_data_src {
> struct perf_branch_entry {
> __u64 from;
> __u64 to;
> - __u64 mispred : 1, /* target mispredicted */
> - predicted : 1, /* target predicted */
> - in_tx : 1, /* in transaction */
> - abort : 1, /* transaction abort */
> - cycles : 16, /* cycle count to last branch */
> - type : 4, /* branch type */
> - spec : 2, /* branch speculation info */
> - new_type : 4, /* additional branch type */
> - priv : 3, /* privilege level */
> - reserved : 31;
> + union {
> + struct {
> + __u64 mispred : 1, /* target mispredicted */
> + predicted : 1, /* target predicted */
> + in_tx : 1, /* in transaction */
> + abort : 1, /* transaction abort */
> + cycles : 16, /* cycle count to last branch */
> + type : 4, /* branch type */
> + spec : 2, /* branch speculation info */
> + new_type : 4, /* additional branch type */
> + priv : 3, /* privilege level */
> + reserved : 31;
> + };
> + __u64 bitfields;
> + };
> };
>
> /* Size of used info bits in struct perf_branch_entry */
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26510917353
* [PATCH v4 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-05-27 12:11 [PATCH v4 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-05-27 12:11 ` [PATCH v4 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
2026-05-27 12:11 ` [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields Puranjay Mohan
@ 2026-05-27 12:11 ` Puranjay Mohan
2026-05-27 13:01 ` bot+bpf-ci
2026-05-27 12:12 ` [PATCH v4 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
3 siblings, 1 reply; 7+ messages in thread
From: Puranjay Mohan @ 2026-05-27 12:11 UTC (permalink / raw)
To: bpf
Enable bpf_get_branch_snapshot() on ARM64 by implementing the
perf_snapshot_branch_stack static call for BRBE.

BRBE is paused before masking exceptions to avoid branch buffer
pollution from trace_hardirqs_off(). Exceptions are then masked with
local_daif_save() to prevent PMU overflow pseudo-NMIs from interfering.
If an overflow between pause and DAIF save re-enables BRBE, the snapshot
detects this via BRBFCR_EL1.PAUSED and bails out.

Branch records are read using perf_entry_from_brbe_regset() with a NULL
event pointer to bypass event-specific filtering. The buffer is
invalidated after reading.

Introduce a for_each_brbe_entry() iterator to deduplicate bank
iteration between brbe_read_filtered_entries() and the snapshot.
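
The bank-walking behaviour of such an iterator can be modelled in user space as below. This is a sketch under assumptions: BANK_MAX_ENTRIES is assumed to be 32 for illustration, and the mock_* helpers are hypothetical replacements for the driver functions (the mock bank select only counts calls instead of writing a system register).

```c
#include <assert.h>

#define BANK_MAX_ENTRIES 32	/* assumed bank size for this sketch */

static int bank_selects;	/* how many times a bank was selected */

/* Stand-in for the driver's bank select; here it only counts calls. */
static void mock_select_bank(int bank)
{
	(void)bank;
	bank_selects++;
}

/* Step to the next entry, switching banks only if entries remain. */
static void mock_advance(int *bank, int *idx, int nr_hw)
{
	if (++(*idx) >= BANK_MAX_ENTRIES &&
	    *bank * BANK_MAX_ENTRIES + *idx < nr_hw) {
		*idx = 0;
		mock_select_bank(++(*bank));
	}
}

/* Visit every record once; returns the number of records visited. */
static int mock_walk_entries(int nr_hw)
{
	int visited = 0;

	assert(nr_hw >= 0);
	bank_selects = 0;
	mock_select_bank(0);
	for (int bank = 0, idx = 0;
	     bank * BANK_MAX_ENTRIES + idx < nr_hw;
	     mock_advance(&bank, &idx, nr_hw))
		visited++;

	return visited;
}
```

In this model the loop body sees a per-bank index, a bank switch happens only when more records remain, and a `break` in the body leaves the walk early without needing a `goto`.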
Signed-off-by: Puranjay Mohan <puranjay@kernel•org>
---
drivers/perf/arm_brbe.c | 127 ++++++++++++++++++++++++++++++++-------
drivers/perf/arm_brbe.h | 9 +++
drivers/perf/arm_pmuv3.c | 5 +-
3 files changed, 119 insertions(+), 22 deletions(-)
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index ba554e0c846c..aede95e27784 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/bitmap.h>
#include <linux/perf/arm_pmu.h>
+#include <asm/daifflags.h>
#include "arm_brbe.h"
#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT | \
@@ -256,6 +257,14 @@ static bool valid_brbe_version(int brbe_version)
brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
}
+static __always_inline bool cpu_has_brbe(void)
+{
+ u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+ int brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+
+ return valid_brbe_version(brbe);
+}
+
static void select_brbe_bank(int bank)
{
u64 brbfcr;
@@ -271,6 +280,20 @@ static void select_brbe_bank(int bank)
isb();
}
+static inline void __brbe_advance(int *bank, int *idx, int nr_hw)
+{
+ if (++(*idx) >= BRBE_BANK_MAX_ENTRIES &&
+ *bank * BRBE_BANK_MAX_ENTRIES + *idx < nr_hw) {
+ *idx = 0;
+ select_brbe_bank(++(*bank));
+ }
+}
+
+#define for_each_brbe_entry(idx, nr_hw) \
+ for (int __bank = (select_brbe_bank(0), 0), idx = 0; \
+ __bank * BRBE_BANK_MAX_ENTRIES + idx < (nr_hw); \
+ __brbe_advance(&__bank, &idx, (nr_hw)))
+
static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
{
entry->brbinf = get_brbinf_reg(idx);
@@ -474,11 +497,9 @@ unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
void brbe_probe(struct arm_pmu *armpmu)
{
- u64 brbidr, aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
- u32 brbe;
+ u64 brbidr;
- brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
- if (!valid_brbe_version(brbe))
+ if (!cpu_has_brbe())
return;
brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
@@ -618,10 +639,10 @@ static bool perf_entry_from_brbe_regset(int index, struct perf_branch_entry *ent
brbe_set_perf_entry_type(entry, brbinf);
- if (!branch_sample_no_cycles(event))
+ if (!event || !branch_sample_no_cycles(event))
entry->cycles = brbinf_get_cycles(brbinf);
- if (!branch_sample_no_flags(event)) {
+ if (!event || !branch_sample_no_flags(event)) {
/* Mispredict info is available for source only and complete branch records. */
if (!brbe_record_is_target_only(brbinf)) {
entry->mispred = brbinf_get_mispredict(brbinf);
@@ -774,32 +795,96 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
{
struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
int nr_hw = brbe_num_branch_records(cpu_pmu);
- int nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
int nr_filtered = 0;
u64 branch_sample_type = event->attr.branch_sample_type;
DECLARE_BITMAP(event_type_mask, PERF_BR_ARM64_MAX);
prepare_event_branch_type_mask(branch_sample_type, event_type_mask);
- for (int bank = 0; bank < nr_banks; bank++) {
- int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
- int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
+ for_each_brbe_entry(i, nr_hw) {
+ struct perf_branch_entry *pbe = &branch_stack->entries[nr_filtered];
- select_brbe_bank(bank);
+ if (!perf_entry_from_brbe_regset(i, pbe, event))
+ break;
- for (int i = 0; i < nr_this_bank; i++) {
- struct perf_branch_entry *pbe = &branch_stack->entries[nr_filtered];
+ if (!filter_branch_record(pbe, branch_sample_type, event_type_mask))
+ continue;
- if (!perf_entry_from_brbe_regset(i, pbe, event))
- goto done;
+ nr_filtered++;
+ }
- if (!filter_branch_record(pbe, branch_sample_type, event_type_mask))
- continue;
+ branch_stack->nr = nr_filtered;
+}
- nr_filtered++;
- }
+/*
+ * Best-effort BRBE snapshot for BPF tracing. Pause BRBE to avoid
+ * self-recording and return 0 if the snapshot state appears disturbed.
+ */
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
+{
+ unsigned long flags;
+ int nr_hw, nr_copied = 0;
+ u64 brbfcr, brbcr;
+
+ if (!cnt)
+ return 0;
+
+ /* Guard against running on a CPU without BRBE (e.g. big.LITTLE). */
+ if (!cpu_has_brbe())
+ return 0;
+
+ /*
+ * Pause BRBE first to avoid recording our own branches. The
+ * sysreg read/write and ISB are branchless, so pausing before
+ * checking BRBCR avoids polluting the buffer with our own
+ * conditional branches.
+ */
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ /* Bail out if BRBE is not enabled (BRBCR_EL1 == 0). */
+ if (!brbcr) {
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ return 0;
}
-done:
- branch_stack->nr = nr_filtered;
+ /* Block local exception delivery while reading the buffer. */
+ flags = local_daif_save();
+
+ /*
+ * A PMU overflow before local_daif_save() could have re-enabled
+ * BRBE, clearing the PAUSED bit. The overflow handler already
+ * restored BRBE to its correct state, so just bail out.
+ */
+ if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
+ local_daif_restore(flags);
+ return 0;
+ }
+
+ nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK,
+ read_sysreg_s(SYS_BRBIDR0_EL1));
+
+ for_each_brbe_entry(i, nr_hw) {
+ if (nr_copied >= cnt)
+ break;
+
+ if (!perf_entry_from_brbe_regset(i, &entries[nr_copied], NULL))
+ break;
+
+ nr_copied++;
+ }
+
+ brbe_invalidate();
+
+ /* Restore BRBCR before unpausing via BRBFCR, matching brbe_enable(). */
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ isb();
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ /* Ensure BRBE is unpaused before returning to the caller. */
+ isb();
+ local_daif_restore(flags);
+
+ return nr_copied;
}
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
index b7c7d8796c86..c2a1824437fb 100644
--- a/drivers/perf/arm_brbe.h
+++ b/drivers/perf/arm_brbe.h
@@ -10,6 +10,7 @@
struct arm_pmu;
struct perf_branch_stack;
struct perf_event;
+struct perf_branch_entry;
#ifdef CONFIG_ARM64_BRBE
void brbe_probe(struct arm_pmu *arm_pmu);
@@ -22,6 +23,8 @@ void brbe_disable(void);
bool brbe_branch_attr_valid(struct perf_event *event);
void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
const struct perf_event *event);
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+ unsigned int cnt);
#else
static inline void brbe_probe(struct arm_pmu *arm_pmu) { }
static inline unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
@@ -44,4 +47,10 @@ static void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
const struct perf_event *event)
{
}
+
+static inline int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+ unsigned int cnt)
+{
+ return 0;
+}
#endif
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 8014ff766cff..1a9f129a0f94 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1449,8 +1449,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
- if (brbe_num_branch_records(cpu_pmu))
+ if (brbe_num_branch_records(cpu_pmu)) {
cpu_pmu->pmu.sched_task = armv8pmu_sched_task;
+ static_call_update(perf_snapshot_branch_stack,
+ arm_brbe_snapshot_branch_stack);
+ }
cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
--
2.53.0-Meta
* Re: [PATCH v4 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-05-27 12:11 ` [PATCH v4 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-05-27 13:01 ` bot+bpf-ci
0 siblings, 0 replies; 7+ messages in thread
From: bot+bpf-ci @ 2026-05-27 13:01 UTC (permalink / raw)
To: puranjay, bpf
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index ba554e0c846c..aede95e27784 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
[ ... ]
> @@ -774,32 +795,96 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
[ ... ]
> +/*
> + * Best-effort BRBE snapshot for BPF tracing. Pause BRBE to avoid
> + * self-recording and return 0 if the snapshot state appears disturbed.
> + */
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> +{
> + unsigned long flags;
> + int nr_hw, nr_copied = 0;
> + u64 brbfcr, brbcr;
> +
> + if (!cnt)
> + return 0;
> +
> + /* Guard against running on a CPU without BRBE (e.g. big.LITTLE). */
> + if (!cpu_has_brbe())
> + return 0;
> +
> + /*
> + * Pause BRBE first to avoid recording our own branches. The
> + * sysreg read/write and ISB are branchless, so pausing before
> + * checking BRBCR avoids polluting the buffer with our own
> + * conditional branches.
> + */
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> + isb();
> +
> + /* Bail out if BRBE is not enabled (BRBCR_EL1 == 0). */
> + if (!brbcr) {
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + return 0;
> + }
A subsystem pattern flags this as potentially concerning: when
BRBCR_EL1 is zero (BRBE not enabled), the function restores BRBFCR_EL1
and returns immediately. Should the write_sysreg_s() that restores
BRBFCR_EL1 be followed by an isb() before the return?

Looking at the other exit paths in this function, the success path
restores BRBFCR_EL1 and follows the write with an isb() before
returning, while the PMU-overflow bail-out path does not write the
register at all. The early-exit path here is the only one that writes
BRBFCR_EL1 without a barrier, so there is a function return (RET
branch) right after the control register write without an intervening
isb().

ARM64 guidelines typically require an isb() after writes to control
system registers to ensure the change takes effect before subsequent
operations. While BRBE is disabled in this path, and eventual exception
entry/exit will provide synchronization, the inconsistency with the
success path is worth checking.

> +
> + /* Block local exception delivery while reading the buffer. */
> + flags = local_daif_save();
> +
> + /*
> + * A PMU overflow before local_daif_save() could have re-enabled
> + * BRBE, clearing the PAUSED bit. The overflow handler already
> + * restored BRBE to its correct state, so just bail out.
> + */
> + if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
> + local_daif_restore(flags);
> + return 0;
> + }
[ ... ]
> + brbe_invalidate();
> +
> + /* Restore BRBCR before unpausing via BRBFCR, matching brbe_enable(). */
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + isb();
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + /* Ensure BRBE is unpaused before returning to the caller. */
> + isb();
> + local_daif_restore(flags);
> +
> + return nr_copied;
> +}
[ ... ]
* [PATCH v4 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE
2026-05-27 12:11 [PATCH v4 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
` (2 preceding siblings ...)
2026-05-27 12:11 ` [PATCH v4 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-05-27 12:12 ` Puranjay Mohan
3 siblings, 0 replies; 7+ messages in thread
From: Puranjay Mohan @ 2026-05-27 12:12 UTC (permalink / raw)
To: bpf
The get_branch_snapshot test checks that bpf_get_branch_snapshot()
doesn't waste too many branch entries on infrastructure overhead. The
threshold of < 10 was calibrated for x86 where about 7 entries are
wasted.

On ARM64, the BPF trampoline generates more branches than x86,
resulting in about 13 wasted entries. The overhead comes from the BPF
trampoline calling __bpf_prog_enter_recur which on ARM64 makes
out-of-line calls to __rcu_read_lock and generates more conditional
branches than x86:

  [#12] bpf_testmod_loop_test+0x40 -> bpf_trampoline_...+0x48
  [#11] bpf_trampoline_...+0x68 -> __bpf_prog_enter_recur+0x0
  [#10] __bpf_prog_enter_recur+0x20 -> __bpf_prog_enter_recur+0x118
  [#09] __bpf_prog_enter_recur+0x154 -> __bpf_prog_enter_recur+0x160
  [#08] __bpf_prog_enter_recur+0x164 -> __bpf_prog_enter_recur+0x2c
  [#07] __bpf_prog_enter_recur+0x2c -> __rcu_read_lock+0x0
  [#06] __rcu_read_lock+0x18 -> __bpf_prog_enter_recur+0x30
  [#05] __bpf_prog_enter_recur+0x9c -> __bpf_prog_enter_recur+0xf0
  [#04] __bpf_prog_enter_recur+0xf4 -> __bpf_prog_enter_recur+0xa8
  [#03] __bpf_prog_enter_recur+0xb8 -> __bpf_prog_enter_recur+0x100
  [#02] __bpf_prog_enter_recur+0x114 -> bpf_trampoline_...+0x6c
  [#01] bpf_trampoline_...+0x78 -> bpf_prog_...test1+0x0
  [#00] bpf_prog_...test1+0x58 -> arm_brbe_snapshot_branch_stack+0x0

Use an architecture-specific threshold of < 14 for ARM64 to accommodate
this overhead while still detecting regressions.

Signed-off-by: Puranjay Mohan <puranjay@kernel•org>
---
.../selftests/bpf/prog_tests/get_branch_snapshot.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
index 0394a1156d99..8d1a3480767f 100644
--- a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
@@ -116,13 +116,18 @@ void serial_test_get_branch_snapshot(void)
ASSERT_GT(skel->bss->test1_hits, 6, "find_looptest_in_lbr");
- /* Given we stop LBR in software, we will waste a few entries.
+ /* Given we stop LBR/BRBE in software, we will waste a few entries.
* But we should try to waste as few as possible entries. We are at
- * about 7 on x86_64 systems.
- * Add a check for < 10 so that we get heads-up when something
- * changes and wastes too many entries.
+ * about 7 on x86_64 and about 13 on arm64 systems (the arm64 BPF
+ * trampoline generates more branches than x86_64).
+ * Add a check so that we get heads-up when something changes and
+ * wastes too many entries.
*/
+#if defined(__aarch64__)
+ ASSERT_LT(skel->bss->wasted_entries, 14, "check_wasted_entries");
+#else
ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
+#endif
cleanup:
get_branch_snapshot__destroy(skel);
--
2.53.0-Meta