public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: Alexei Starovoitov <ast@fb•com>
To: Peter Zijlstra <peterz@infradead•org>
Cc: Steven Rostedt <rostedt@goodmis•org>,
	"David S . Miller" <davem@davemloft•net>,
	Ingo Molnar <mingo@kernel•org>,
	Daniel Borkmann <daniel@iogearbox•net>,
	Arnaldo Carvalho de Melo <acme@infradead•org>,
	Wang Nan <wangnan0@huawei•com>, Josef Bacik <jbacik@fb•com>,
	Brendan Gregg <brendan.d.gregg@gmail•com>,
	<netdev@vger•kernel.org>, <linux-kernel@vger•kernel.org>,
	<kernel-team@fb•com>
Subject: Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
Date: Tue, 5 Apr 2016 11:09:30 -0700	[thread overview]
Message-ID: <5703FF5A.1040707@fb.com> (raw)
In-Reply-To: <20160405141857.GN3448@twins.programming.kicks-ass.net>

On 4/5/16 7:18 AM, Peter Zijlstra wrote:
> On Mon, Apr 04, 2016 at 09:52:48PM -0700, Alexei Starovoitov wrote:
>> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
>> attached to tracepoints.
>
> More specifically the perf tracepoint handler, not tracepoints directly.

yes. perf tracepoint handler only. There is no attempt here to attach
to ftrace tracepoint handler.

>> The tracepoint will copy the arguments in the per-cpu buffer and pass
>> it to the bpf program as its first argument.
>
>> The layout of the fields can be discovered by doing
>> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
>> prior to the compilation of the program with exception that first 8 bytes
>> are reserved and not accessible to the program. This area is used to store
>> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
>> +---------+
>> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
>> +---------+
>> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
>> +---------+
>> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
>> +---------+
>>
>> Not that all of the fields are already dumped to user space via perf ring buffer
>> and some application access it directly without consulting tracepoint/format.
>
> We call those apps broken..

yes.
uapi/linux/perf_event.h lines 742-749 are pretty clear about it:
"In other words, PERF_SAMPLE_RAW contents are not an ABI"

>> Same rule applies here: static tracepoint fields should only be accessed
>> in a format defined in tracepoint/format. The order of fields and
>> field sizes are not an ABI.
>
>
>> @@ -56,8 +57,9 @@ perf_trace_##call(void *__data, proto)					\
>>   			     sizeof(u64));				\
>>   	__entry_size -= sizeof(u32);					\
>>   									\
>> -	entry = perf_trace_buf_prepare(__entry_size,			\
>> -			event_call->event.type, &__regs, &rctx);	\
>> +	event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
>> +	entry = perf_trace_buf_prepare(__entry_size, event_type,	\
>> +				       &__regs, &rctx);			\
>>   	if (!entry)							\
>>   		return;							\
>>   									\
>> @@ -67,6 +69,14 @@ perf_trace_##call(void *__data, proto)					\
>>   									\
>>   	{ assign; }							\
>>   									\
>> +	if (prog) {							\
>> +		*(struct pt_regs **)entry = __regs;			\
>> +		if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>> +			perf_swevent_put_recursion_context(rctx);	\
>> +			return;						\
>
> So if the prog 'fails' you consume the entry,

I wouldn't call it 'fails' ;)
The interpretation of return code from bpf program is defined
in kernel/trace/bpf_trace.c as:
  * 0 - return from kprobe (event is filtered out)
  * 1 - store kprobe event into ring buffer
  * Other values are reserved and currently alias to 1

so the above !trace_call_bpf() check matches existing bpf+kprobe
behavior.

>
>> +		}							\
>> +		memset(&entry->ent, 0, sizeof(entry->ent));		\
>
> But if not, you destroy it and then feed it to perf?

yes. If bpf prog returns 1 the buffer goes into normal ring-buffer
with all perf_event attributes and so on.
So far there wasn't a single real use case where we went this path.
Programs always do aggregation inside and pass stuff to user space
either via bpf maps or via bpf_perf_event_output() helper.
I wanted to keep perf_trace_xx() calls to be minimal in .text size
so memset above is one x86 instruction, but I don't mind
replacing this memset with a call to a helper function that will do:
    local_save_flags(flags);
    tracing_generic_entry_update(entry, flags, preempt_count());
    entry->type = type;
Then whether bpf attached or not the ring buffer will see the same
raw tracepoint entry. You think it's cleaner?

>
>> +	}								\
>>   	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
>>   		__count, __regs, head, __task);				\
>>   }
>
>
>> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
>> index 7a68afca8249..7ada829029d3 100644
>> --- a/kernel/trace/trace_event_perf.c
>> +++ b/kernel/trace/trace_event_perf.c
>> @@ -284,6 +284,9 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
>>   		*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
>>   	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
>>
>> +	if (type == TRACE_EVENT_TYPE_MAX)
>> +		return raw_data;
>> +
>>   	/* zero the dead bytes from align to not leak stack to user */
>>   	memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
>
> What's this hunk do? Why can you skip this stuff for BPF attached
> events?

that hunk is skipping init of first 8 bytes, which are occupied
by 'struct trace_entry' normally and are inited as:
    local_save_flags(flags);
    tracing_generic_entry_update(entry, flags, preempt_count());
    entry->type = type;
that adds extra overhead that bpf progs don't need.
If bpf needs current pid, it calls bpf_get_current_pid_tgid() helper.
local_save_flags() is also quite slow x86 insn that is not needed.
If programs would need to read flags, we can introduce new helper
to read it.
These 8 bytes are instead used to store hidden 'struct pt_regs'
pointer which is invisible to bpf programs directly and used
by two bpf helpers: bpf_get_stackid() and bpf_perf_event_output()
which need pt_regs. See patch 4/8

  reply	other threads:[~2016-04-05 18:09 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-05  4:52 [PATCH net-next 0/8] allow bpf attach to tracepoints Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 1/8] perf: optimize perf_fetch_caller_regs Alexei Starovoitov
2016-04-05 12:06   ` Peter Zijlstra
2016-04-05 17:41     ` Alexei Starovoitov
2016-04-08 22:12     ` Steven Rostedt
2016-04-05  4:52 ` [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints Alexei Starovoitov
2016-04-05 14:18   ` Peter Zijlstra
2016-04-05 18:09     ` Alexei Starovoitov [this message]
2016-04-05 18:16       ` Peter Zijlstra
2016-04-05 18:21         ` Alexei Starovoitov
2016-04-18 20:29   ` Steven Rostedt
2016-04-18 21:43     ` Alexei Starovoitov
2016-04-18 22:16       ` Steven Rostedt
2016-04-19  1:15         ` Alexei Starovoitov
2016-04-19  2:58           ` Steven Rostedt
2016-04-05  4:52 ` [PATCH net-next 3/8] bpf: register BPF_PROG_TYPE_TRACEPOINT program type Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 4/8] bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 5/8] bpf: sanitize bpf tracepoint access Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 6/8] samples/bpf: add tracepoint support to bpf loader Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 7/8] samples/bpf: tracepoint example Alexei Starovoitov
2016-04-05  4:52 ` [PATCH net-next 8/8] samples/bpf: add tracepoint vs kprobe performance tests Alexei Starovoitov
2016-04-18 16:13 ` [PATCH net-next 0/8] allow bpf attach to tracepoints Steven Rostedt
2016-04-18 19:51   ` Alexei Starovoitov
2016-04-18 20:47     ` Steven Rostedt
2016-04-18 21:25       ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5703FF5A.1040707@fb.com \
    --to=ast@fb$(echo .)com \
    --cc=acme@infradead$(echo .)org \
    --cc=brendan.d.gregg@gmail$(echo .)com \
    --cc=daniel@iogearbox$(echo .)net \
    --cc=davem@davemloft$(echo .)net \
    --cc=jbacik@fb$(echo .)com \
    --cc=kernel-team@fb$(echo .)com \
    --cc=linux-kernel@vger$(echo .)kernel.org \
    --cc=mingo@kernel$(echo .)org \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=peterz@infradead$(echo .)org \
    --cc=rostedt@goodmis$(echo .)org \
    --cc=wangnan0@huawei$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox