Re: Netchannles: first stage has been completed. Further ideas.


From: Rusty Russell <rusty@rustcorp•com.au>
To: David Miller <davem@davemloft•net>
Cc: kuznet@ms2•inr.ac.ru, johnpol@2ka•mipt.ru, netdev@vger•kernel.org
Subject: Re: Netchannles: first stage has been completed. Further ideas.
Date: Tue, 01 Aug 2006 16:36:24 +1000
Message-ID: <1154414184.31152.99.camel@localhost.localdomain>
In-Reply-To: <20060731.214729.35358981.davem@davemloft.net>

On Mon, 2006-07-31 at 21:47 -0700, David Miller wrote:
> From: Rusty Russell <rusty@rustcorp•com.au>
> Date: Fri, 28 Jul 2006 15:54:04 +1000
> 
> > (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
> > holds (some subset of?) flows.  A successful lookup immediately after
> > packet comes off NIC gives destiny for packet: what route, (optionally)
> > what socket, what filtering, what connection tracking (& what NAT), etc?
> > I don't know if this should be a general array of fn & data ptrs, or
> > specialized fields for each one, or a mix.  Maybe there's a "too hard,
> > do slow path" bit, or maybe hard cases just never get put in the cache.
> > Perhaps we need a separate one for locally-generated packets, a-la
> > ip_route_output().  Anyway, we trade slightly more expensive flow setup
> > for faster packet processing within flows.
> 
> So, specifically, one of the methods you are thinking about might
> be implemented by adding:
> 
> 	void (*input)(struct sk_buff *, void *);
> 	void *input_data;
> 
> to "struct flow_cache_entry" or whatever replaces it?

Probably needs a return value to indicate stop packet processing, and to
be completely general I think we'd want more than one, eg:

	#define MAX_GUFC_INPUTS 5
	unsigned int num_inputs;
	int (*input[MAX_GUFC_INPUTS])(struct sk_buff *, void *);
	void *input_data[MAX_GUFC_INPUTS];
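
A minimal userspace sketch of how such an entry might be walked on a cache hit, assuming a nonzero return means "stop packet processing". The struct sk_buff stand-in, the GUFC_CONTINUE/GUFC_STOP codes, gufc_run() and the two example handlers are hypothetical names for illustration only, not anything that exists in the tree:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct sk_buff (illustration only). */
struct sk_buff {
	unsigned int len;
};

#define MAX_GUFC_INPUTS 5

/* Hypothetical return codes; the mail only says a return value is needed. */
enum { GUFC_CONTINUE = 0, GUFC_STOP = 1 };

struct gufc {
	unsigned int num_inputs;
	int (*input[MAX_GUFC_INPUTS])(struct sk_buff *, void *);
	void *input_data[MAX_GUFC_INPUTS];
};

/*
 * Call each cached input in order, stopping early if one asks to.
 * Returns the number of inputs that were actually called.
 */
unsigned int gufc_run(const struct gufc *g, struct sk_buff *skb)
{
	unsigned int i;

	assert(g->num_inputs <= MAX_GUFC_INPUTS);
	for (i = 0; i < g->num_inputs; i++) {
		if (g->input[i](skb, g->input_data[i]) == GUFC_STOP)
			return i + 1;
	}
	return i;
}

/* Example input: bump a counter passed via the data pointer. */
int count_packet(struct sk_buff *skb, void *data)
{
	(void)skb;
	++*(unsigned long *)data;
	return GUFC_CONTINUE;
}

/* Example input: pretend to consume the packet and stop processing. */
int drop_skb(struct sk_buff *skb, void *data)
{
	(void)data;
	skb->len = 0;
	return GUFC_STOP;
}
```

With count_packet, drop_skb, count_packet installed in that order, only the first two would run for a given packet.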

> This way we don't need some kind of "type" information in
> the flow cache entry, since the input handler knows the type.

Some things may want to jam more than a pointer into the cache entry, so
we might do something clever later, but as a first cut this would seem
to work.

> > One way to do this is to add a "have_interest" callback into the
> > hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
> > destinies this hook cares about.  In the case of packet filtering this
> > would do a traversal and append a fn/data ptr to the entry for each rule
> > which could affect it.
> 
> Can you give a concrete example of how the GUFC might make use
> of this?  Just some small abstract code snippets will do.

OK, I take it back.  I was thinking that on a miss, the GUFC called into
each subsystem to populate the new GUFC entry.  That would be a radical
departure from the current code, so forget it.

So, on a GUFC miss, we could create a new GUFC entry (on stack?), hang
it off the skb, then each subsystem adds to it as we go through.  At
some point (handwave?) we collect the skb->gufc and insert it into the
trie.
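
A rough userspace sketch of that miss path, under loud assumptions: the trie is replaced by a one-slot toy cache, the handwaved collection point is simply "after the last hook has run", and gufc_add_input(), gufc_miss(), struct gufc_cache, pass_input() and example_hook() are all hypothetical names made up for the illustration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GUFC_INPUTS 5

struct sk_buff;
typedef int (*gufc_input_fn)(struct sk_buff *, void *);
typedef void (*gufc_hook_fn)(struct sk_buff *);

struct gufc {
	unsigned int num_inputs;
	gufc_input_fn input[MAX_GUFC_INPUTS];
	void *input_data[MAX_GUFC_INPUTS];
};

/* Stand-in for struct sk_buff carrying only the proposed pointer. */
struct sk_buff {
	struct gufc *gufc;
};

/* Toy one-slot cache standing in for the trie. */
struct gufc_cache {
	struct gufc slot;
	int valid;
};

/* What each subsystem does as the packet passes through it. */
void gufc_add_input(struct sk_buff *skb, gufc_input_fn fn, void *data)
{
	struct gufc *gufc = skb->gufc;

	if (!gufc)
		return;			/* someone already gave up on caching */
	if (gufc->num_inputs == MAX_GUFC_INPUTS) {
		skb->gufc = NULL;	/* no room: this flow stays uncached */
		return;
	}
	gufc->input[gufc->num_inputs] = fn;
	gufc->input_data[gufc->num_inputs++] = data;
}

/* Miss path: build the entry on the stack, run the hooks, then insert. */
int gufc_miss(struct sk_buff *skb, struct gufc_cache *cache,
	      gufc_hook_fn *hooks, size_t nhooks)
{
	struct gufc pending = { 0 };
	size_t i;

	skb->gufc = &pending;
	for (i = 0; i < nhooks; i++)
		hooks[i](skb);

	if (!skb->gufc)
		return 0;		/* a hook vetoed caching */

	assert(pending.num_inputs <= MAX_GUFC_INPUTS);
	cache->slot = pending;		/* copy out before the stack frame dies */
	cache->valid = 1;
	skb->gufc = NULL;
	return 1;
}

/* Trivial input and hook used to exercise the path. */
int pass_input(struct sk_buff *skb, void *data)
{
	(void)skb;
	(void)data;
	return 0;
}

void example_hook(struct sk_buff *skb)
{
	gufc_add_input(skb, pass_input, NULL);
}
```

Copying the on-stack entry into the cache at the end is one possible answer to the handwave; a real implementation would have to decide where that point is and who owns the memory.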

For iptables, as a first step we'd simply do (open-coded for now):

	/* FIXME: Do acceleration properly */
	struct gufc *gufc = skb->gufc;
	if (!gufc || gufc->num_inputs == MAX_GUFC_INPUTS) {
		skb->gufc = NULL;
	} else {
		gufc->input[gufc->num_inputs] = traverse_entire_table;
		gufc->input_data[gufc->num_inputs++] = this_table;
	}

Later we'd get funky:

	/* Filtering code here */
	...

	if (num_rules_applied > 1 || !only_needed_flow_info) {
		gufc->input[gufc->num_inputs] = traverse_entire_table;
		gufc->input_data[gufc->num_inputs++] = this_table;
	} else if (num_rules_applied == 1) {
		gufc->input[gufc->num_inputs] = traverse_one_rule;
		gufc->input_data[gufc->num_inputs++] = last_rule;
	}

Note that this could be cleverer, too:

	if (result == NF_DROP && only_needed_flow_info) {
		/* Who cares about other inputs, we're going to drop */
		gufc->input[0] = drop_skb;
		gufc->num_inputs = 1;
	}

Two potential performance issues: 

1) When we change rules, iptables replaces entire table from userspace.
We need pkttables (which uses incremental rule updates) to flush
intelligently.

2) Every iptables rule currently keeps pkt/byte counters, meaning we
can't bypass rules even though they might have no effect on the packet
(eg. iptables -A INPUT -i eth0 -j ETH0_RULES).  We can address this by
having pkt/byte counters in the gufc entry and a method of pushing them
back to iptables when the gufc entry is pruned, and manually traversing
the trie to flush them when the user asks for counters.
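
A small userspace sketch of the counter idea in (2), assuming each gufc entry remembers which rules it bypasses and credits them all at flush time. The struct layouts, MAX_BYPASSED_RULES, gufc_account() and gufc_flush_counters() are hypothetical and are not the real iptables structures:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BYPASSED_RULES 4

/* Hypothetical stand-ins for a rule and its pkt/byte counters. */
struct rule_counters {
	uint64_t pcnt, bcnt;
};

struct rule {
	struct rule_counters counters;
};

/* Counter part of a gufc entry: totals plus the rules it skips. */
struct gufc_entry {
	uint64_t pkts, bytes;
	unsigned int num_bypassed;
	struct rule *bypassed[MAX_BYPASSED_RULES];
};

/* Fast path: account the packet in the entry, touching no rule. */
void gufc_account(struct gufc_entry *e, unsigned int len)
{
	e->pkts++;
	e->bytes += len;
}

/*
 * Push the accumulated totals back to every bypassed rule, then reset.
 * Would be called when the entry is pruned, and for every entry in the
 * trie when the user asks for counters.
 */
void gufc_flush_counters(struct gufc_entry *e)
{
	unsigned int i;

	assert(e->num_bypassed <= MAX_BYPASSED_RULES);
	for (i = 0; i < e->num_bypassed; i++) {
		e->bypassed[i]->counters.pcnt += e->pkts;
		e->bypassed[i]->counters.bcnt += e->bytes;
	}
	e->pkts = 0;
	e->bytes = 0;
}
```

Resetting the totals after each flush keeps a second flush from double-counting, so the prune path and the user-request path can share the same function.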

> I had the idea of a lazy scheme.  When we create a GUFC entry, we
> tack it onto a DMA'able linked list the card uses.  We do not
> notify the card, we just append the update to the list.
> 
> Then, if the card misses its on-chip GUFC table on an incoming
> packet, it checks the DMA update list by reading it in from memory.
> It updates its GUFC table with whatever entries are found on this
> list, then it retries to classify the packet.

I had assumed we would simply do full lookup on non-hw-classified
packets, so async insertion is a non-issue.  Can we assume hardware will
cover entire GUFC trie?

> This seems like a possible good solution until we try to address GUFC
> entry deletion, which unfortunately cannot be evaluated in a lazy
> fashion.  It must be synchronous.  This is because if, for example, we
> just killed off a TCP socket we must make sure we don't hit the GUFC
> entry for the TCP identity of that socket any longer.

With RCU, we'll probably be marking the GUFC entry deleted and freeing
it in a callback sometime later.  This gives us a window in which we can
delete it from the card's cache.  If we hit the callback and the card
still hasn't been updated, we need to go synchronous, but maybe that
will be rare?
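
The deletion window could be modelled roughly as below. This is a plain userspace state machine, not real RCU: gufc_free_callback() merely stands in for whatever would run after the grace period (presumably something queued via call_rcu()), and the hw_state values and every function name are hypothetical:

```c
#include <assert.h>

/* What the card is assumed to know about an entry. */
enum hw_state { HW_PRESENT, HW_REMOVAL_QUEUED, HW_REMOVED };

struct gufc_entry {
	int deleted;		/* software lookups must skip the entry */
	enum hw_state hw;
};

/* Step 1: mark deleted and queue an asynchronous removal for the card. */
void gufc_mark_deleted(struct gufc_entry *e)
{
	e->deleted = 1;
	if (e->hw == HW_PRESENT)
		e->hw = HW_REMOVAL_QUEUED;
}

/* Software lookups ignore entries that are marked deleted. */
int gufc_usable(const struct gufc_entry *e)
{
	return !e->deleted;
}

/* The card may catch up by itself at some point inside the window. */
void hw_async_ack(struct gufc_entry *e)
{
	if (e->hw == HW_REMOVAL_QUEUED)
		e->hw = HW_REMOVED;
}

/*
 * Step 2: stand-in for the deferred free.  Returns 1 if the card had
 * not caught up and the removal had to be done synchronously, else 0.
 */
int gufc_free_callback(struct gufc_entry *e)
{
	assert(e->deleted);
	if (e->hw == HW_REMOVED)
		return 0;
	/* A real driver would have to wait on the card here. */
	e->hw = HW_REMOVED;
	return 1;
}
```

How often the slow branch is taken depends entirely on how quickly the card drains its removal queue relative to the grace period, which is the open question above.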

> Just something to think about, when considering how to translate these
> ideas into hardware.

Yes, it's easy to imagine a DoS pattern where we spend all our cycles
updating trie and hw table, and ever less time processing packets.

Cheers,
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
