From: ebiederm@xmission•com (Eric W. Biederman)
To: Eric Dumazet <eric.dumazet@gmail•com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks•com>,
netdev@vger•kernel.org, Stephen Hemminger <shemminger@vyatta•com>
Subject: Re: Race condition in ipv6 code
Date: Thu, 12 Jan 2012 23:40:52 -0800 [thread overview]
Message-ID: <m1ipkgqccb.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <1326434549.2617.14.camel@edumazet-laptop> (Eric Dumazet's message of "Fri, 13 Jan 2012 07:02:29 +0100")
Eric Dumazet <eric.dumazet@gmail•com> writes:
> Le jeudi 12 janvier 2012 à 16:11 -0800, Eric W. Biederman a écrit :
>
>> Because the rtnl_lock is broad and we have ABBA deadlocks if we don't in
>> particular we hold the rtnl_lock over sysctl registration and removal.
>> sysctl removal blocks until all of the callers into the sysctl methods
>> namely addrconf_sysctl_forward in this case finish executing.
>>
>> CPU 0 CPU 1
>>
>> rtnl_lock() use_count++
>> unregister_netdevice() addrconf_ctl_foward
>> unregister_sysctl_table() rtnl_lock()
>> wait for use_count of addrconf_ctl_forward
>> to == 0
>>
>>
>> I smacked lockdep around so it would warn about the sysfs ones.
>> The proc and sysctl ones I never did manage to get lockdep warnings
>> but a ABBA deadlock is most definitely possible.
>>
>> Any solutions better than simply restarting the system call are welcome.
>>
>> Perhaps for these heavy weigh methods we should create a work struct
>> and go schedule work to perform the change instead of trying to do the
>> work synchronously in the sysctl handler.
>>
>
> This idea was discussed in netconf 2011 (I focused on the ability to
> dismantle netdevices at high rates), and I admit I had not implemented
> yet.
Interesting. That is a discussion I would have been interested to be
a part of.
In my queue I have sysctl cleanup and speed improvements I should be
posting them for review in the next week or two.
That leaves /proc/net/dev_snmp6/.... as the scalability bottleneck for
the number of network devices.
I have gotten the rcu_barrier in device unregister out from under the
trylock.
At a practical level the rcu_barrier seems to be the primary bottleneck
to remove netdevices at a high rate, although I am certain there are
other issues that crop up. If I go with a networking path that can
batch network devices, and am not bottlenecked by proc, sysfs or sysctl
I can remove 10,000 or so network devices a second.
I have been scratching my head trying to figure out if it is possible
to batch network device deletes that come in via a netlink DELLINK message.
> The problem with rtnl_trylock() is that we make very litle progress if
> many tasks are competing for rtnl, and we have no fairness.
Yes. That code path uses rtnl_trylock() simply because I needed a
simple and correct fix. It was never intended to be the long term
solution, but rather a stop gap to at least remove the possibility of
deadlock in the kernel. It is deployed fairly widely because I found
a lot of examples of that bug.
> The task holding RTNL might be descheduled and all cpus running other
> threads spinning in rtnl_trylock()/restart_syscall(). Almost a deadlock.
>
> Maybe waiting RTNL for at least a couple of ms would help, instead of
> instantly failing rtnl_trylock() and looping.
Yes. If the rtnl_trylock and syscall_restart looping is a problem
we should use a lock other than the rtnl_lock for those variables
or we should use a work queue and take the rtnl_lock in a context
where we don't need the try lock.
I don't think there is much point in getting sophisticated about
rtnl_trylock(). It was a good stopgap but we really should remove
the ABBA deadlock.
As for Stephen Hemminger's suggestion to make a: proc_do_intvec_and_rtnl_lock
function. We could make a helper that does what Francesco did but in a
slightly nicer fashion. What we can't do is make a version that does
not have the ABBA deadlock, which makes the nasty rtnl_trylock feasible.
Given that we still would have the nasty effects of the rtnl_trylock and
syscall_restart idiom I don't expect creating a helper and encouraging
people to build code that is going to need that logic is a particularly
good idea.
The other option is to look at dropping the rtnl_lock in addrconf_ifdown
around the addrconf_sysctl_unregister call. It doesn't seem obvious to
me that we can.
I just thought through the sysctl case just to be certain that we can't
take the locks in the other order, and I just don't see a way to do it.
The problem is that the use tracking guarantees that none of the user
supplied data is being used and it is safe to remove a module. If the
lock comes from the module kaboom.
Hmm. I partially take back my conclusion that I can't fix the problem
at the sysctl level. If we want to take rtnl_lock for every
networking sysctl I think I could enhance the ctl_table_set or the
ctl_table_root concept to have a mutex we could take over every
networking sysctl. With copy_from_user being in the middle of that
I'm not particularly fond of that idea. Nor am I fond of taking
the rtnl_mutex in even more situations than we take it now. Plus it
seems like more deep magic that will confuse people.
So I really think the best solution to avoid the locking craziness is to
have a wrapper that gets the value from userspace and calls
schedule_work to get another thread to actually process the change. I
don't see any problems with writing a helper function for that. The
only downside with using schedule_work is that we return to userspace
before the change has been fully installed in the kernel. I don't
expect that would be a problem but stranger things have happened.
Eric
next prev parent reply other threads:[~2012-01-13 7:38 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-01-12 2:13 Race condition in ipv6 code Francesco Ruggeri
2012-01-12 6:31 ` Eric Dumazet
2012-01-12 6:44 ` David Miller
2012-01-12 20:48 ` Francesco Ruggeri
2012-01-13 0:11 ` Eric W. Biederman
2012-01-13 6:02 ` Eric Dumazet
2012-01-13 7:40 ` Eric W. Biederman [this message]
2012-01-13 17:04 ` Ben Greear
2012-01-14 5:46 ` Eric W. Biederman
2012-01-14 18:31 ` Ben Greear
2012-01-20 2:54 ` Eric W. Biederman
2012-01-13 1:17 ` Eric W. Biederman
2012-01-13 1:57 ` Stephen Hemminger
2012-01-13 22:02 ` Francesco Ruggeri
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=m1ipkgqccb.fsf@fess.ebiederm.org \
--to=ebiederm@xmission$(echo .)com \
--cc=eric.dumazet@gmail$(echo .)com \
--cc=fruggeri@aristanetworks$(echo .)com \
--cc=netdev@vger$(echo .)kernel.org \
--cc=shemminger@vyatta$(echo .)com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox