From: "Toke Høiland-Jørgensen" <toke@redhat•com>
To: Alexei Starovoitov <alexei.starovoitov@gmail•com>,
Daniel Borkmann <daniel@iogearbox•net>
Cc: Stanislav Fomichev <sdf@google•com>, bpf <bpf@vger•kernel.org>,
Nikolay Aleksandrov <razor@blackwall•org>,
Alexei Starovoitov <ast@kernel•org>,
Andrii Nakryiko <andrii@kernel•org>,
Martin KaFai Lau <martin.lau@linux•dev>,
John Fastabend <john.fastabend@gmail•com>,
Joanne Koong <joannelkoong@gmail•com>,
Kumar Kartikeya Dwivedi <memxor@gmail•com>,
Joe Stringer <joe@cilium•io>,
Network Development <netdev@vger•kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
Date: Sat, 08 Oct 2022 13:38:54 +0200 [thread overview]
Message-ID: <87sfjysfxt.fsf@toke.dk> (raw)
In-Reply-To: <CAADnVQLpcLWrL-URhRgqCQa6XRZzib4BorZ2QKpPC+Uw_JNW=Q@mail.gmail.com>
Alexei Starovoitov <alexei.starovoitov@gmail•com> writes:
> On Fri, Oct 7, 2022 at 12:37 PM Daniel Borkmann <daniel@iogearbox•net> wrote:
>>
>> On 10/7/22 8:59 PM, Alexei Starovoitov wrote:
>> > On Fri, Oct 7, 2022 at 10:20 AM Toke Høiland-Jørgensen <toke@redhat•com> wrote:
>> [...]
>> >>>> I was thinking a little about how this might work; i.e., how can the
>> >>>> kernel expose the required knobs to allow a system policy to be
>> >>>> implemented without program loading having to talk to anything other
>> >>>> than the syscall API?
>> >>>
>> >>>> How about we only expose prepend/append in the prog attach UAPI, and
>> >>>> then have a kernel function that does the sorting like:
>> >>>
>> >>>> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct
>> >>>> bpf_prog *new_prog, bool append)
>> >>>
>> >>>> where the default implementation just appends/prepends to the array in
>> >>>> progs depending on the value of 'appen'.
>> >>>
>> >>>> And then use the __weak linking trick (or maybe struct_ops with a member
>> >>>> for TXC, another for XDP, etc?) to allow BPF to override the function
>> >>>> wholesale and implement whatever ordering it wants? I.e., allow it can
>> >>>> to just shift around the order of progs in the 'progs' array whenever a
>> >>>> program is loaded/unloaded?
>> >>>
>> >>>> This way, a userspace daemon can implement any policy it wants by just
>> >>>> attaching to that hook, and keeping things like how to express
>> >>>> dependencies as a userspace concern?
>> >>>
>> >>> What if we do the above, but instead of simple global 'attach first/last',
>> >>> the default api would be:
>> >>>
>> >>> - attach before <target_fd>
>> >>> - attach after <target_fd>
>> >>> - attach before target_fd=-1 == first
>> >>> - attach after target_fd=-1 == last
>> >>>
>> >>> ?
>> >>
>> >> Hmm, the problem with that is that applications don't generally have an
>> >> fd to another application's BPF programs; and obtaining them from an ID
>> >> is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach
>> >> before target *ID*" instead, which could work I guess? But then the
>> >> problem becomes that it's racy: the ID you're targeting could get
>> >> detached before you attach, so you'll need to be prepared to check that
>> >> and retry; and I'm almost certain that applications won't test for this,
>> >> so it'll just lead to hard-to-debug heisenbugs. Or am I being too
>> >> pessimistic here?
>> >
>> > I like Stan's proposal and don't see any issue with FD.
>> > It's good to gate specific sequencing with cap_sys_admin.
>> > Also for consistency the FD is better than ID.
>> >
>> > I also like systemd analogy with Before=, After=.
>> > systemd has a ton more ways to specify deps between Units,
>> > but none of them have absolute numbers (which is what priority is).
>> > The only bit I'd tweak in Stan's proposal is:
>> > - attach before <target_fd>
>> > - attach after <target_fd>
>> > - attach before target_fd=0 == first
>> > - attach after target_fd=0 == last
>>
>> I think the before(), after() could work, but the target_fd I have my doubts
>> that it will be practical. Maybe lets walk through a concrete real example. app_a
>> and app_b shipped via container_a resp container_b. Both want to install tc BPF
>> and we (operator/user) want to say that prog from app_b should only be inserted
>> after the one from app_a, never run before; if no prog_a is installed, we ofc just
>> run prog_b, but if prog_a is inserted, it must be before prog_b given the latter
>> can only run after the former. How would we get to one anothers target fd? One
>> could use the 0, but not if more programs sit before/after.
>
> I read your desired use case several times and probably still didn't get it.
> Sounds like prog_b can just do after(fd=0) to become last.
> And prog_a can do before(fd=0).
> Whichever the order of attaching (a or b) these two will always
> be in a->b order.
I agree that it's probably not feasible to have programs themselves
coordinate between themselves except for "install me last/first" type
semantics.
I.e., the "before/after target_fd" is useful for a single application
that wants to install two programs in a certain order. Or for bpftool
for manual/debugging work.
System-wide policy (which includes "two containers both using BPF") is
going to need some kind of policy agent/daemon anyway. And the in-kernel
function override is the only feasible way to do that.
> Since the first and any prog returning !TC_NEXT will abort
> the chain we'd need __weak nop orchestrator prog to interpret
> retval for anything to be useful.
If we also want the orchestrator to interpret return codes, that
probably implies generating a BPF program that does the dispatching,
right? (since the attachment is per-interface we can't reuse the same
one). So maybe we do need to go the route of the (overridable) usermode
helper that gets all the program FDs and generates a BPF dispatcher
program? Or can we do this with a __weak function that emits bytecode
inside the kernel without being unsafe?
Anyway, I'm OK with deferring the orchestrator mechanism and going with
Stanislav's proposal as an initial API.
-Toke
next prev parent reply other threads:[~2022-10-08 11:39 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
2022-10-05 0:55 ` sdf
2022-10-05 10:50 ` Toke Høiland-Jørgensen
2022-10-05 14:48 ` Daniel Borkmann
2022-10-05 12:35 ` Daniel Borkmann
2022-10-05 17:56 ` sdf
2022-10-05 18:21 ` Daniel Borkmann
2022-10-05 10:33 ` Toke Høiland-Jørgensen
2022-10-05 12:47 ` Daniel Borkmann
2022-10-05 14:32 ` Toke Høiland-Jørgensen
2022-10-05 14:53 ` Daniel Borkmann
2022-10-05 19:04 ` Jamal Hadi Salim
2022-10-06 20:49 ` Daniel Borkmann
2022-10-07 15:36 ` Jamal Hadi Salim
2022-10-06 0:22 ` Andrii Nakryiko
2022-10-06 5:00 ` Alexei Starovoitov
2022-10-06 14:40 ` Jamal Hadi Salim
2022-10-06 23:29 ` Alexei Starovoitov
2022-10-07 15:43 ` Jamal Hadi Salim
2022-10-06 21:29 ` Daniel Borkmann
2022-10-06 23:28 ` Alexei Starovoitov
2022-10-07 13:26 ` Daniel Borkmann
2022-10-07 14:32 ` Toke Høiland-Jørgensen
2022-10-07 16:55 ` sdf
2022-10-07 17:20 ` Toke Høiland-Jørgensen
2022-10-07 18:11 ` sdf
2022-10-07 19:06 ` Daniel Borkmann
2022-10-07 18:59 ` Alexei Starovoitov
2022-10-07 19:37 ` Daniel Borkmann
2022-10-07 22:45 ` sdf
2022-10-07 23:41 ` Alexei Starovoitov
2022-10-07 23:34 ` Alexei Starovoitov
2022-10-08 11:38 ` Toke Høiland-Jørgensen [this message]
2022-10-08 20:38 ` Alexei Starovoitov
2022-10-13 18:30 ` Andrii Nakryiko
2022-10-14 15:38 ` Alexei Starovoitov
2022-10-27 9:01 ` Daniel Xu
2022-10-06 20:15 ` Martin KaFai Lau
2022-10-06 20:54 ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-06 20:54 ` Daniel Borkmann
2022-10-06 17:56 ` Martin KaFai Lau
2022-10-06 20:10 ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-06 23:14 ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-06 23:24 ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
2022-10-06 3:19 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87sfjysfxt.fsf@toke.dk \
--to=toke@redhat$(echo .)com \
--cc=alexei.starovoitov@gmail$(echo .)com \
--cc=andrii@kernel$(echo .)org \
--cc=ast@kernel$(echo .)org \
--cc=bpf@vger$(echo .)kernel.org \
--cc=daniel@iogearbox$(echo .)net \
--cc=joannelkoong@gmail$(echo .)com \
--cc=joe@cilium$(echo .)io \
--cc=john.fastabend@gmail$(echo .)com \
--cc=martin.lau@linux$(echo .)dev \
--cc=memxor@gmail$(echo .)com \
--cc=netdev@vger$(echo .)kernel.org \
--cc=razor@blackwall$(echo .)org \
--cc=sdf@google$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox