public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: David Ahern <dsa@cumulusnetworks•com>
To: Julian Anastasov <ja@ssi•bg>
Cc: netdev@vger•kernel.org
Subject: Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
Date: Thu, 24 Mar 2016 20:05:07 -0600	[thread overview]
Message-ID: <56F49CD3.1060406@cumulusnetworks.com> (raw)
In-Reply-To: <alpine.LFD.2.11.1603242201200.1795@ja.home.ssi.bg>

On 3/24/16 4:33 PM, Julian Anastasov wrote:
> 	But for multipath routes we can also consider the
> nexthops as "alternatives", so it depends on how one uses
> the multipath mechanism. The ability to fallback to
> another nexthop assumes one connection is allowed to
> move from one ISP to another. What if the second ISP
> decides to reject the connection? What we have is a
> broken connection just because the retransmits
> were diverted to wrong place in the hurry. So, the
> nexthops can be compatible or incompatible. For your
> setup they are, for others they are not.

I am not sure I completely understand your point. Are you saying that 
within a single multipath route some connections to nexthops are allowed 
and others are not?

So to put that paragraph into an example

15.0.0.0/16
	nexthop via 12.0.0.2  dev swp1 weight 1
	nexthop via 12.0.0.3  dev swp1 weight 1

Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 
12.0.0.3 because 12.0.0.3 could be a different ISP and not allow TCP 
connections from this address space?


>
> 	For multipath setups I recall for the following
> possible cases:
>
> 1. Every packet from connection hits multipath route
> and what we want is the connection to use only one
> nexthop. If nexthop fails then the connection fails.
> Latest changes for the multipath algorithm chose this
> option: use hash to map traffic to nexthop and any
> fallbacks to another nexthop are not allowed because
> we should follow the hash mapping. Works when nexthops
> use different ISPs.
>
> 2. Only the first packet hits multipath route. With
> the help from CONNMARK all next packets from connection
> use the same nexthop by using another unicast route.
> The goal here is the multipath route to be used just to
> balance connections. This works best when the nexthop
> selection was random and not a hash based. This was
> how the previous algorithm worked. Fallbacks are
> desired because it is fine to select alive nexthop
> for the first packet, but wrong for the next packets.
>
> 	My preference was for the random selection,
> I don't know why we restricted the new algorithm. May be
> because in the common case when just a single default
> multipath route is created we want to use all nexthops
> in a ideal world where all nexthops are alive.
>
> 	So, if the kernel used a random selection
> your fallback algorithm should help. But it is fragile
> for the simple setup with single default multipath route.
> May be what we miss is the ability to choose between
> random and hash-based selection. Then your patch may be
> useful but only for setup 2 (multipath route hit only by
> first packet). So, your patch may come with a sysctl var
> that explains your current patch logic: "avoid failed nexthops,
> never probe them, wait their failed entry to be expired by
> neigh_periodic_work and just then we can use the nexthop
> by creating new entry to probe the GW". Who will trigger
> probes often enough to maintain fresh state?

First packet out does the probe -- neigh lookup fails, kernel has no 
knowledge so can't reject the nexthop based on neighbor information.

After that if it has information that says that a nexthop is dead, why 
would it continue to try to probe? Any traffic that selects that nh is 
dead. That to me defies the basis of having multiple paths.

>
> 	Also, one may argue that such decisions should
> be done in user space. It is common the direct routers
> to answer ARP even while their uplink is down. In
> the common case, one may need to ping 2-3 indirect
> gateways to decide if a path is alive and then to recreate
> the default multipath route.
>
> 	More comments below...
>
>> Signed-off-by: David Ahern <dsa@cumulusnetworks•com>
>> ---
>>   net/ipv4/fib_semantics.c | 34 ++++++++++++++++++++++++++++++++--
>>   1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index d97268e8ff10..28fc6700c2b1 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -1563,13 +1563,43 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
>>   void fib_select_multipath(struct fib_result *res, int hash)
>>   {
>>   	struct fib_info *fi = res->fi;
>> +	struct neighbour *n;
>> +	int state;
>>
>>   	for_nexthops(fi) {
>>   		if (hash > atomic_read(&nh->nh_upper_bound))
>>   			continue;
>>
>> -		res->nh_sel = nhsel;
>> -		return;
>> +		state = NUD_NONE;
>> +		n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>
> 	Sometimes nh_gw is a local IP, you can call
> neigh_lookup (or a lockless RCU-BH variant) only for the
> nh_scope == RT_SCOPE_LINK case, just like in fib_select_default.

got it, will change.

>
>> +		if (n) {
>> +			state = n->nud_state;
>> +			neigh_release(n);
>> +		}
>> +		if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>
> 	NUD_REACHABLE is part of NUD_VALID, so this is shorter:
>
> 	if (!n || (state & NUD_VALID)) {

sure.


>
>> +			res->nh_sel = nhsel;
>> +			return;
>> +		}
>> +	} endfor_nexthops(fi);
>> +
>> +	/* try the nexthops again, but covering the entries
>> +	 * skipped by the hash
>> +	 */
>> +	fi = res->fi;
>> +	for_nexthops(fi) {
>> +		if (hash <= atomic_read(&nh->nh_upper_bound))
>
> 	This is dangerous, we can try a RTNH_F_DEAD entry
> with nh_upper_bound = -1. This can work with 2 checks:

In v2 of my patch I dropped this change as it technically is not needed 
for the case I am trying to solve.

>
> 	if (hash <= atomic_read(&nh->nh_upper_bound))
> 		break;
> 	if (atomic_read(&nh->nh_upper_bound) < 0)
> 		continue;
>
>> +			continue;
>> +
>> +		state = NUD_NONE;
>> +		n = neigh_lookup(&arp_tbl, &nh->nh_gw, fi->fib_dev);
>> +		if (n) {
>> +			state = n->nud_state;
>> +			neigh_release(n);
>> +		}
>> +		if (!n || (state == NUD_REACHABLE) || (state & NUD_VALID)) {
>> +			res->nh_sel = nhsel;
>> +			return;
>> +		}
>>   	} endfor_nexthops(fi);
>
> 	If all are failed why not use the nexthop that was
> selected by the hash? Even if it is failed, new ARP probe
> can succeed.

sure.

Thanks for the comments.

David

  reply	other threads:[~2016-03-25  2:05 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-24 15:25 [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops David Ahern
2016-03-24 16:45 ` Eric Dumazet
2016-03-24 17:43   ` David Ahern
2016-03-24 17:55     ` Eric Dumazet
2016-03-24 18:26   ` David Miller
2016-03-24 22:33 ` Julian Anastasov
2016-03-25  2:05   ` David Ahern [this message]
2016-03-25  9:05     ` Julian Anastasov
2016-03-30  3:16       ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56F49CD3.1060406@cumulusnetworks.com \
    --to=dsa@cumulusnetworks$(echo .)com \
    --cc=ja@ssi$(echo .)bg \
    --cc=netdev@vger$(echo .)kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox