* Re: [PATCH] net/core/sock.c remove extra wakeup [not found] <483F99090200005A00037FFE@sinclair.provo.novell.com> @ 2008-05-30 10:05 ` Gregory Haskins 2008-06-11 6:46 ` Herbert Xu 0 siblings, 1 reply; 9+ messages in thread From: Gregory Haskins @ 2008-05-30 10:05 UTC (permalink / raw) To: davem, Patrick Mullaney; +Cc: netdev Sorry for the top post (on a phone) Point taken that the fix needs to be analysed for races, but the problem is very real. We noticed that our udp netperf test would get two wake-ups per packet. I don't remember the exact path where the udp waits occur off the top of my head, but the fundamental problem is the overloaded use of the single wait-queue. So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup. So the netperf thread woke twice for one packet. Fixing this boosted performance by a large margin (don't recall exact figures off the top of my head....somewhere around 20%) Long stort short, its potentially worth fixing this, even if this patch isn't "it" :) HTH Regards -Greg -----Original Message----- From: David Miller <davem@davemloft•net> To: Patrick Mullaney <PMullaney@novell•com> Cc: Gregory Haskins <GHaskins@novell•com> Cc: <netdev@vger•kernel.org> Sent: 5/30/2008 3:52:54 AM Subject: Re: [PATCH] net/core/sock.c remove extra wakeup From: "Patrick Mullaney" <pmullaney@novell•com> Date: Thu, 29 May 2008 10:08:43 -0600 > We have noticed an extra wakeup of a task waiting on a udp receive > while testing with CONFIG_PREEMPT_RT. The following patch > eliminates the source of the extra wakeup by only waking a task > if it is waiting on the writable condition(the SOCK_NOSPACE > bit has been set). The patch is against 2.6.25. > > Signed-off-by: Patrick Mullaney <pmullaney@novell•com> This looks Ok on the surface, but I'm really worried about races. If you clear that bit, another thread which set that bit and already checked the socket space, might sleep and never wake up. I think that is fundamentally why we don't try to optimize this in that way. And why is this so expensive? Once the first wakeup occurs, the subsequent attemps will merely see that the wait queue is empty and do nothing. The only cost is the read lock on the callback lock, and that is about as expensive as this new atomic operation you are adding to clear the SOCK_NOSPACE bit. So the benefit is not clear, and neither is the correctness of this change. Sorry. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] net/core/sock.c remove extra wakeup 2008-05-30 10:05 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins @ 2008-06-11 6:46 ` Herbert Xu 2008-06-11 9:16 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller 2008-06-16 3:48 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins 0 siblings, 2 replies; 9+ messages in thread From: Herbert Xu @ 2008-06-11 6:46 UTC (permalink / raw) To: Gregory Haskins; +Cc: davem, PMullaney, netdev Gregory Haskins <GHaskins@novell•com> wrote: > > So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup. So the netperf thread woke twice for one packet. Fixing this boosted performance by a large margin (don't recall exact figures off the top of my head....somewhere around 20%) Please wrap your lines. Anyway, you've lost me with this scenario. The patch is stopping wake-ups on the write path yet you're talking about the read path. So my question is why are you getting the extra wake-up on that write path when you receive a packet? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor•apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 9+ messages in thread
* Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) 2008-06-11 6:46 ` Herbert Xu @ 2008-06-11 9:16 ` David Miller 2008-06-12 14:05 ` Herbert Xu 2008-06-16 3:48 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins 1 sibling, 1 reply; 9+ messages in thread From: David Miller @ 2008-06-11 9:16 UTC (permalink / raw) To: herbert; +Cc: GHaskins, PMullaney, netdev From: Herbert Xu <herbert@gondor•apana.org.au> Date: Wed, 11 Jun 2008 16:46:12 +1000 > Anyway, you've lost me with this scenario. The patch is stopping > wake-ups on the write path yet you're talking about the read path. > So my question is why are you getting the extra wake-up on that > write path when you receive a packet? During a short discussion on IRC with Herbert I explained that indeed it's the write wakeup path that's causing the issue here and we both agreed that the cost is likely simply from the sk->sk_callback_lock and not the wakeups themselves. Once the first wakeup happens, the wait queues should be inactive and the task awake already. The sk->sk_callback_lock is a pile of mud that has existed since around 1999 when we initially SMP threaded the networking stack. It's long overdue to kill this thing, because for the most part it is superfluous. So I just sat for a bit and compiled some notes about the situation and how we can move forward on this. If I get some energy I will attack this task some time soon: -------------------- Deletion of sk->sk_callback_lock. This lock is overloaded tremendously to protect several aspects of socket state, most of which is superfluous and can be eliminated. One aspect protected by this lock is the parent socket to child sock relationship. It makes sure that both the waitqueue and socket pointer are updated atomically wrt. threads executing the sock callbacks such as sk->sk_write_space(). A seperate lock was needed for this because socket callbacks can be invoked asynchronously in softirq context via kfree_skb() and therefore the socket lock would not be suitable for this purpose. Basically, for this usage, all the callback cares about is that the sk->sk_socket and sk->sk_sleep point to stable items. Fortunately we can ensure this by simply never letting these two things point to NULL. Instead we can point them at dummy waitqueue and socket objects when there is no real parent currently associated with the sock. This would allow large amounts of callback code to run lockless. For example, sock_def_write_space() would most not need the sk_callback_lock held. The exception being sk_wake_async() processing. sk_wake_async() uses the callback lock to guard the state of the socket's fasync list. It is taken as a reader for scanners, and taken as a writer for thread paths which add entries to the fasync list. These locking schemes date back to 1999 as indicated by comments above sock_fasync(). To be honest, fasync usage is rare. It's a way to register processes which will receive SIGIO signals during IO events. The generic fasync code fs/fcntl.c:kill_fasync() uses a global lock for list protection and nobody has noticed. But we should be careful for this observation because the other uses of fasync handling are devices like input (keyboards, mice) and TTY devices. If the "fasync on sockets is rare" observation is accurate, we could simply use fasync_helper() and kill_fasync() to implement the necessary locking. Also, fasync is how SIGURG is implemented, again another rarely used feature of sockets. The next layer of (mis-)use of the sk_callback_lock are mis-level protocol layers that sit on top of real protocol stacks. For example sunrpc and iscsi. These subsystems want to override the socket callbacks so that they can do data-ready etc. processing directly or with modified behavior. Typically these code bases stack away the callbacks they override so that when they are done with the socket they restore those pointers. During this "callback pointer swap" they hold the sk_callback_lock, which on the surface seems pretty pointless since this does not protect other threads of execution from invoking these callbacks. SUNRPC also seems to use the sk->sk_callback_lock to stabilize the state of the sk->sk_user_data pointer in the XPRT socket layer. It also has an xprt->shutdown state which is tested additionally to the NULL'ness of the sk->sk_user_data XPRT pointer. A seperate xprt->transport_lock synchornizes the processing of incoming packets for a given XID session, so the sk_callback_lock isn't needed for that purpose. sock_orphan() and sock_graft(), in include/net/sock.h, are the main transition points of the parent socket relationship state. Unfortunately, not all protocols use these helper routines consistently. In particular the radio protocols (ax25, etc.), IPX, ECONET, and some others do this work by hand and in particular without holding the sk->sk_callback_lock. This should be consolidated so that the protocols do this stuff consistently. It also points us to the crummy SOCK_DEAD state bit we have. Really, all of the things guarded by SOCK_DEAD are totally harmless if we point sk->sk_socket and sk->sk_sleep to dummy objects instead of NULL. So, there is strong evidence that SOCK_DEAD could be eliminated. Also, this lack of consistent usage of sock_orphan() and sock_graft() means that some protocols are not invoking security_sock_graft() properly. So likely the first task in all of this is to get the protocol to use sock_orphan() and sock_graft() across the board without any exceptions. At this point we can build dummy socket and wait queue objects that sock allocation and sock_orphan() can point to. I believe at this point that sk_callback_lock acquisition can be removed from sock_orphan() and sock_graft(). Next, we convert the fasync bits to use the global fasync_lock via the helper functions fasync_helper() and kill_fasync() in fs/fcntl.c instead of the per-socket locking, by-hand list insertion, and __kill_fasync() invocations the socket layer uses currently. Now, it should be possible to entirely remove the sk_callback_lock grabs in the sock callbacks. The rest of the work is composed of walking over the socket wrapping layers such as SUNRPC, OCFS, and ISCSI and auditing their sk_callback_lock usage, trying to eliminate the lock as much as possible. The unconvertable parts are probably replacable with subsystem local object locks. It should be possible to delete sk->sk_callback_lock at that point. Killing SOCK_DEAD could be a followon task. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) 2008-06-11 9:16 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller @ 2008-06-12 14:05 ` Herbert Xu 0 siblings, 0 replies; 9+ messages in thread From: Herbert Xu @ 2008-06-12 14:05 UTC (permalink / raw) To: David Miller; +Cc: GHaskins, PMullaney, netdev On Wed, Jun 11, 2008 at 02:16:42AM -0700, David Miller wrote: > > Deletion of sk->sk_callback_lock. Looks pretty reasonable. Thanks! -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor•apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] net/core/sock.c remove extra wakeup 2008-06-11 6:46 ` Herbert Xu 2008-06-11 9:16 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller @ 2008-06-16 3:48 ` Gregory Haskins 1 sibling, 0 replies; 9+ messages in thread From: Gregory Haskins @ 2008-06-16 3:48 UTC (permalink / raw) To: Herbert Xu; +Cc: davem, Patrick Mullaney, netdev >>> On Wed, Jun 11, 2008 at 2:46 AM, in message <E1K6K60-00061s-00@gondolin•me.apana.org.au>, Herbert Xu <herbert@gondor•apana.org.au> wrote: > Gregory Haskins <GHaskins@novell•com> wrote: >> >> So basically it was waiting for a packet to arrive, and the net-rx softirq > would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) > followed by an rx-wakeup. So the netperf thread woke twice for one packet. > Fixing this boosted performance by a large margin (don't recall exact figures > off the top of my head....somewhere around 20%) > > Please wrap your lines. Apologies. I don't think I have control over it in my corporate mailer environment :( but if I can figure out a way to do it I will make this change. You are not the first to complain ;) > > Anyway, you've lost me with this scenario. The patch is stopping > wake-ups on the write path yet you're talking about the read path. > So my question is why are you getting the extra wake-up on that > write path when you receive a packet? Heh..you tell me :) I don't recall the specifics off the top of my head, but IIRC it had to do with the fact that we were running a tx/rx netperf test and the way tx-completions are handled in the stack w.r.t. the softirqs. The tx-completions code is executing inside the softirq-net-rx context and is freeing up the transmitted skbs. The problem is that it is blindly signaling the wait-queue that there was now SPACE (even though no-one cared about that event at the moment...the only waiter was waiting on an rx event). So one way to look at it is the problem is that the single wait-queue serves both rx-wakeup, and tx-nospace events. Rather than open the can of worms of splitting the wait-queues into finer granularity, Pat chose to at least make the overloaded use of the single wait-queue more intelligent to know if anyone cared about the NOSPACE event or not. >From my perspective, I don't care how the issue is specifically solved...I would ultimately just like to see this wasted wake-up "go away" one way or the other. ;) It significantly degrades performance (at least in this synthetic test) and as far as I can tell there is no reason to structure the code this way. If we are mistaken in that assessment, please let me know. If we can offer any more details (such as the specific codepath that triggers the NOSPACE wakeup, for instance) we would be happy to oblige. Long story short this is easily repeatable. Regards, -Greg ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <483F99090200005A00037FFE@lucius.provo.novell.com>]
[parent not found: <483F990C0200005A00038001@lucius.provo.novell.com>]
* Re: [PATCH] net/core/sock.c remove extra wakeup [not found] ` <483F990C0200005A00038001@lucius.provo.novell.com> @ 2008-05-30 17:00 ` Patrick Mullaney 2008-06-01 21:03 ` Andi Kleen 0 siblings, 1 reply; 9+ messages in thread From: Patrick Mullaney @ 2008-05-30 17:00 UTC (permalink / raw) To: Gregory Haskins; +Cc: davem, netdev Not sure if Greg's post made it to the list but I would like to build on what he said(for those who didn't see it - see below). Point taken on the potential for a race, it was considered - I wanted to get feedback from the list to see if it was an issue. There is a somewhat similar looking piece of code in stream.c that clears this bit as well in a similar spot(sk_stream_write_space). Perhaps this is because stream apps must have a single thread per socket and this is not necessarily the case with datagram apps? The wakeup is expensive because it is completely unnecessary - the thread being woke has nothing to do. If this is a high priority thread, which is part of the reason I mentioned that we are testing on a PREEMPT_RT enabled kernel, this will cause undue jitter and degrading of performance(as Greg mentioned we are seeing a 20% degradation in loaded situations). Even in the case where it is sched_other, we see significant degradation. Would a better approach be? 1) clear the bit when the waiter returns(bottom of sock_wait_for_wmem) 2) add a new wait-queue to the socket to separate the writers 3) another idea? I'm having a hard time understanding the use of the NOSPACE flag in sock.c. It seems like it currently gets set but never gets cleared. I must be missing something. Thanks. Pat On Fri, 2008-05-30 at 10:05 +0000, Gregory Haskins wrote: > Sorry for the top post (on a phone) > > Point taken that the fix needs to be analysed for races, but the problem is very real. > We noticed that our udp netperf test would get two wake-ups per packet. > I don't remember the exact path where the udp waits occur off the top of my head, > but the fundamental problem is the overloaded use of the single wait-queue. > > So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger > a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup. > So the netperf thread woke twice for one packet. Fixing this boosted performance by a large > margin (don't recall exact figures off the top of my head....somewhere around 20%) > > Long stort short, its potentially worth fixing this, even if this patch isn't "it" :) > > HTH > > Regards > -Greg > > -----Original Message----- > From: David Miller <davem@davemloft•net> > To: Patrick Mullaney <PMullaney@novell•com> > Cc: Gregory Haskins <GHaskins@novell•com> > Cc: <netdev@vger•kernel.org> > > Sent: 5/30/2008 3:52:54 AM > Subject: Re: [PATCH] net/core/sock.c remove extra wakeup > > From: "Patrick Mullaney" <pmullaney@novell•com> > Date: Thu, 29 May 2008 10:08:43 -0600 > > > We have noticed an extra wakeup of a task waiting on a udp receive > > while testing with CONFIG_PREEMPT_RT. The following patch > > eliminates the source of the extra wakeup by only waking a task > > if it is waiting on the writable condition(the SOCK_NOSPACE > > bit has been set). The patch is against 2.6.25. > > > > Signed-off-by: Patrick Mullaney <pmullaney@novell•com> > > This looks Ok on the surface, but I'm really worried about > races. > > If you clear that bit, another thread which set that bit > and already checked the socket space, might sleep and never > wake up. > > I think that is fundamentally why we don't try to optimize > this in that way. > > And why is this so expensive? Once the first wakeup occurs, > the subsequent attemps will merely see that the wait queue > is empty and do nothing. > > The only cost is the read lock on the callback lock, and that > is about as expensive as this new atomic operation you are > adding to clear the SOCK_NOSPACE bit. > > So the benefit is not clear, and neither is the correctness of > this change. > > Sorry. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] net/core/sock.c remove extra wakeup 2008-05-30 17:00 ` Patrick Mullaney @ 2008-06-01 21:03 ` Andi Kleen 0 siblings, 0 replies; 9+ messages in thread From: Andi Kleen @ 2008-06-01 21:03 UTC (permalink / raw) To: Patrick Mullaney; +Cc: Gregory Haskins, davem, netdev "Patrick Mullaney" <pmullaney@novell•com> writes: > > Would a better approach be? > > 1) clear the bit when the waiter returns(bottom of sock_wait_for_wmem) > 2) add a new wait-queue to the socket to separate the writers > 3) another idea? A large reason this is tricky is because it is essentially lockless. So alternative would be: 4) you add a spinlock which is always taken around the various condition checks and bit manipulation. Not 100% sure if it would be bad to general SMP scalability (you would need to ensure that the bouncing lock cache line isn't a performance problem), but at least it would be much safer and makes it much easier to verify your change is correct. Actually the wait queue already has such a lock, so it might not be that expensive if you switch the wait queue to a lockless version. -Andi ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH] net/core/sock.c remove extra wakeup @ 2008-05-29 16:08 Patrick Mullaney 2008-05-30 9:52 ` David Miller 0 siblings, 1 reply; 9+ messages in thread From: Patrick Mullaney @ 2008-05-29 16:08 UTC (permalink / raw) To: davem; +Cc: Gregory Haskins, netdev From: Patrick Mullaney <pmullaney@novell•com> We have noticed an extra wakeup of a task waiting on a udp receive while testing with CONFIG_PREEMPT_RT. The following patch eliminates the source of the extra wakeup by only waking a task if it is waiting on the writable condition(the SOCK_NOSPACE bit has been set). The patch is against 2.6.25. Signed-off-by: Patrick Mullaney <pmullaney@novell•com> --- net/core/sock.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 7a0567b..df08fb2 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1628,12 +1628,17 @@ static void sock_def_readable(struct sock *sk, int len) static void sock_def_write_space(struct sock *sk) { + /* don't wake unless writer is waiting */ + if(!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) + return; + read_lock(&sk->sk_callback_lock); /* Do not wake up a writer until he can make "significant" * progress. --DaveM */ if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) { + clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags); if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_sync(sk->sk_sleep); ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] net/core/sock.c remove extra wakeup 2008-05-29 16:08 Patrick Mullaney @ 2008-05-30 9:52 ` David Miller 0 siblings, 0 replies; 9+ messages in thread From: David Miller @ 2008-05-30 9:52 UTC (permalink / raw) To: pmullaney; +Cc: GHaskins, netdev From: "Patrick Mullaney" <pmullaney@novell•com> Date: Thu, 29 May 2008 10:08:43 -0600 > We have noticed an extra wakeup of a task waiting on a udp receive > while testing with CONFIG_PREEMPT_RT. The following patch > eliminates the source of the extra wakeup by only waking a task > if it is waiting on the writable condition(the SOCK_NOSPACE > bit has been set). The patch is against 2.6.25. > > Signed-off-by: Patrick Mullaney <pmullaney@novell•com> This looks Ok on the surface, but I'm really worried about races. If you clear that bit, another thread which set that bit and already checked the socket space, might sleep and never wake up. I think that is fundamentally why we don't try to optimize this in that way. And why is this so expensive? Once the first wakeup occurs, the subsequent attemps will merely see that the wait queue is empty and do nothing. The only cost is the read lock on the callback lock, and that is about as expensive as this new atomic operation you are adding to clear the SOCK_NOSPACE bit. So the benefit is not clear, and neither is the correctness of this change. Sorry. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2008-06-16 3:48 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <483F99090200005A00037FFE@sinclair.provo.novell.com>
2008-05-30 10:05 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
2008-06-11 6:46 ` Herbert Xu
2008-06-11 9:16 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
2008-06-12 14:05 ` Herbert Xu
2008-06-16 3:48 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
[not found] <483F99090200005A00037FFE@lucius.provo.novell.com>
[not found] ` <483F990C0200005A00038001@lucius.provo.novell.com>
2008-05-30 17:00 ` Patrick Mullaney
2008-06-01 21:03 ` Andi Kleen
2008-05-29 16:08 Patrick Mullaney
2008-05-30 9:52 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox