[PATCH] net/core/sock.c remove extra wakeup

public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed

* [PATCH] net/core/sock.c remove extra wakeup
@ 2008-05-29 16:08 Patrick Mullaney
  2008-05-30  9:52 ` David Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Patrick Mullaney @ 2008-05-29 16:08 UTC (permalink / raw)
  To: davem; +Cc: Gregory Haskins, netdev


From: Patrick Mullaney <pmullaney@novell•com>

We have noticed an extra wakeup of a task waiting on a udp receive
while testing with CONFIG_PREEMPT_RT. The following patch
eliminates the source of the extra wakeup by only waking a task
if it is waiting on the writable condition(the SOCK_NOSPACE
bit has been set). The patch is against 2.6.25.

Signed-off-by: Patrick Mullaney <pmullaney@novell•com>
---

 net/core/sock.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 7a0567b..df08fb2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1628,12 +1628,17 @@ static void sock_def_readable(struct sock *sk, int len)
 
 static void sock_def_write_space(struct sock *sk)
 {
+	/* don't wake unless writer is waiting */
+	if(!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		return;
+
 	read_lock(&sk->sk_callback_lock);
 
 	/* Do not wake up a writer until he can make "significant"
 	 * progress.  --DaveM
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 			wake_up_interruptible_sync(sk->sk_sleep);
 




^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
  2008-05-29 16:08 Patrick Mullaney
@ 2008-05-30  9:52 ` David Miller
  0 siblings, 0 replies; 9+ messages in thread
From: David Miller @ 2008-05-30  9:52 UTC (permalink / raw)
  To: pmullaney; +Cc: GHaskins, netdev

From: "Patrick Mullaney" <pmullaney@novell•com>
Date: Thu, 29 May 2008 10:08:43 -0600

> We have noticed an extra wakeup of a task waiting on a udp receive
> while testing with CONFIG_PREEMPT_RT. The following patch
> eliminates the source of the extra wakeup by only waking a task
> if it is waiting on the writable condition(the SOCK_NOSPACE
> bit has been set). The patch is against 2.6.25.
> 
> Signed-off-by: Patrick Mullaney <pmullaney@novell•com>

This looks Ok on the surface, but I'm really worried about
races.

If you clear that bit, another thread which set that bit
and already checked the socket space, might sleep and never
wake up.

I think that is fundamentally why we don't try to optimize
this in that way.

And why is this so expensive?  Once the first wakeup occurs,
the subsequent attemps will merely see that the wait queue
is empty and do nothing.

The only cost is the read lock on the callback lock, and that
is about as expensive as this new atomic operation you are
adding to clear the SOCK_NOSPACE bit.

So the benefit is not clear, and neither is the correctness of
this change.

Sorry.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
       [not found] <483F99090200005A00037FFE@sinclair.provo.novell.com>
@ 2008-05-30 10:05 ` Gregory Haskins
  2008-06-11  6:46   ` Herbert Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Haskins @ 2008-05-30 10:05 UTC (permalink / raw)
  To: davem, Patrick Mullaney; +Cc: netdev

Sorry for the top post (on a phone)

Point taken that the fix needs to be analysed for races, but the problem is very real.  We noticed that our udp netperf test would get two wake-ups per packet.  I don't remember the exact path where the udp waits occur off the top of my head, but the fundamental problem is the overloaded use of the single wait-queue.

So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup.  So the netperf thread woke twice for one packet.  Fixing this boosted performance by a large margin (don't recall exact figures off the top of my head....somewhere around 20%)

Long stort short, its potentially worth fixing this, even if this patch isn't "it" :)

HTH

Regards
-Greg  

-----Original Message-----
From: David Miller <davem@davemloft•net>
To: Patrick Mullaney <PMullaney@novell•com>
Cc: Gregory Haskins <GHaskins@novell•com>
Cc:  <netdev@vger•kernel.org>

Sent: 5/30/2008 3:52:54 AM
Subject: Re: [PATCH] net/core/sock.c remove extra wakeup

From: "Patrick Mullaney" <pmullaney@novell•com>
Date: Thu, 29 May 2008 10:08:43 -0600

> We have noticed an extra wakeup of a task waiting on a udp receive
> while testing with CONFIG_PREEMPT_RT. The following patch
> eliminates the source of the extra wakeup by only waking a task
> if it is waiting on the writable condition(the SOCK_NOSPACE
> bit has been set). The patch is against 2.6.25.
> 
> Signed-off-by: Patrick Mullaney <pmullaney@novell•com>

This looks Ok on the surface, but I'm really worried about
races.

If you clear that bit, another thread which set that bit
and already checked the socket space, might sleep and never
wake up.

I think that is fundamentally why we don't try to optimize
this in that way.

And why is this so expensive?  Once the first wakeup occurs,
the subsequent attemps will merely see that the wait queue
is empty and do nothing.

The only cost is the read lock on the callback lock, and that
is about as expensive as this new atomic operation you are
adding to clear the SOCK_NOSPACE bit.

So the benefit is not clear, and neither is the correctness of
this change.

Sorry.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
       [not found] ` <483F990C0200005A00038001@lucius.provo.novell.com>
@ 2008-05-30 17:00   ` Patrick Mullaney
  2008-06-01 21:03     ` Andi Kleen
  0 siblings, 1 reply; 9+ messages in thread
From: Patrick Mullaney @ 2008-05-30 17:00 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: davem, netdev

Not sure if Greg's post made it to the list but I would like to
build on what he said(for those who didn't see it - see below).

Point taken on the potential for a race, it was considered - I wanted to
get feedback from the list to see if it was an issue. There is a
somewhat similar looking piece of code in stream.c that clears this
bit as well in a similar spot(sk_stream_write_space). Perhaps this is
because stream apps must have a single thread per socket and this
is not necessarily the case with datagram apps?

The wakeup is expensive because it is completely unnecessary - the
thread being woke has nothing to do. If this is a high priority thread,
which is part of the reason I mentioned that we are testing on a
PREEMPT_RT enabled kernel, this will cause undue jitter and degrading
of performance(as Greg mentioned we are seeing a 20% degradation in
loaded situations). Even in the case where it is sched_other, we see
significant degradation.

Would a better approach be?

1) clear the bit when the waiter returns(bottom of sock_wait_for_wmem)
2) add a new wait-queue to the socket to separate the writers
3) another idea?

I'm having a hard time understanding the use of the NOSPACE flag in
sock.c. It seems like it currently gets set but never gets cleared. I
must be missing something.

Thanks.
Pat

On Fri, 2008-05-30 at 10:05 +0000, Gregory Haskins wrote:
> Sorry for the top post (on a phone)
> 
> Point taken that the fix needs to be analysed for races, but the problem is very real.
>   We noticed that our udp netperf test would get two wake-ups per packet.
>   I don't remember the exact path where the udp waits occur off the top of my head,
>  but the fundamental problem is the overloaded use of the single wait-queue.
> 
> So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger
>  a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup.
>   So the netperf thread woke twice for one packet.  Fixing this boosted performance by a large
>  margin (don't recall exact figures off the top of my head....somewhere around 20%)
> 
> Long stort short, its potentially worth fixing this, even if this patch isn't "it" :)
> 
> HTH
> 
> Regards
> -Greg  
>  
> -----Original Message-----
> From: David Miller <davem@davemloft•net>
> To: Patrick Mullaney <PMullaney@novell•com>
> Cc: Gregory Haskins <GHaskins@novell•com>
> Cc:  <netdev@vger•kernel.org>
> 
> Sent: 5/30/2008 3:52:54 AM
> Subject: Re: [PATCH] net/core/sock.c remove extra wakeup
> 
> From: "Patrick Mullaney" <pmullaney@novell•com>
> Date: Thu, 29 May 2008 10:08:43 -0600
> 
> > We have noticed an extra wakeup of a task waiting on a udp receive
> > while testing with CONFIG_PREEMPT_RT. The following patch
> > eliminates the source of the extra wakeup by only waking a task
> > if it is waiting on the writable condition(the SOCK_NOSPACE
> > bit has been set). The patch is against 2.6.25.
> > 
> > Signed-off-by: Patrick Mullaney <pmullaney@novell•com>
> 
> This looks Ok on the surface, but I'm really worried about
> races.
> 
> If you clear that bit, another thread which set that bit
> and already checked the socket space, might sleep and never
> wake up.
> 
> I think that is fundamentally why we don't try to optimize
> this in that way.
> 
> And why is this so expensive?  Once the first wakeup occurs,
> the subsequent attemps will merely see that the wait queue
> is empty and do nothing.
> 
> The only cost is the read lock on the callback lock, and that
> is about as expensive as this new atomic operation you are
> adding to clear the SOCK_NOSPACE bit.
> 
> So the benefit is not clear, and neither is the correctness of
> this change.
> 
> Sorry.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
  2008-05-30 17:00   ` Patrick Mullaney
@ 2008-06-01 21:03     ` Andi Kleen
  0 siblings, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2008-06-01 21:03 UTC (permalink / raw)
  To: Patrick Mullaney; +Cc: Gregory Haskins, davem, netdev

"Patrick Mullaney" <pmullaney@novell•com> writes:
>
> Would a better approach be?
>
> 1) clear the bit when the waiter returns(bottom of sock_wait_for_wmem)
> 2) add a new wait-queue to the socket to separate the writers
> 3) another idea?

A large reason this is tricky is because it is essentially lockless. So
alternative would be:

4) you add a spinlock which is always taken around the various condition
checks and bit manipulation. Not 100% sure if it would be bad to 
general SMP scalability (you would need to ensure that the bouncing
lock cache line isn't a performance problem), but at least it would 
be much safer and makes it much easier to verify your change is correct.

Actually the wait queue already has such a lock, so it might not be
that expensive if you switch the wait queue to a lockless version.

-Andi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
  2008-05-30 10:05 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
@ 2008-06-11  6:46   ` Herbert Xu
  2008-06-11  9:16     ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
  2008-06-16  3:48     ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
  0 siblings, 2 replies; 9+ messages in thread
From: Herbert Xu @ 2008-06-11  6:46 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: davem, PMullaney, netdev

Gregory Haskins <GHaskins@novell•com> wrote:
> 
> So basically it was waiting for a packet to arrive, and the net-rx softirq would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) followed by an rx-wakeup.  So the netperf thread woke twice for one packet.  Fixing this boosted performance by a large margin (don't recall exact figures off the top of my head....somewhere around 20%)

Please wrap your lines.

Anyway, you've lost me with this scenario.  The patch is stopping
wake-ups on the write path yet you're talking about the read path.
So my question is why are you getting the extra wake-up on that
write path when you receive a packet?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor•apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-11  6:46   ` Herbert Xu
@ 2008-06-11  9:16     ` David Miller
  2008-06-12 14:05       ` Herbert Xu
  2008-06-16  3:48     ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
  1 sibling, 1 reply; 9+ messages in thread
From: David Miller @ 2008-06-11  9:16 UTC (permalink / raw)
  To: herbert; +Cc: GHaskins, PMullaney, netdev

From: Herbert Xu <herbert@gondor•apana.org.au>
Date: Wed, 11 Jun 2008 16:46:12 +1000

> Anyway, you've lost me with this scenario.  The patch is stopping
> wake-ups on the write path yet you're talking about the read path.
> So my question is why are you getting the extra wake-up on that
> write path when you receive a packet?

During a short discussion on IRC with Herbert I explained
that indeed it's the write wakeup path that's causing the
issue here and we both agreed that the cost is likely
simply from the sk->sk_callback_lock and not the wakeups
themselves.  Once the first wakeup happens, the wait
queues should be inactive and the task awake already.

The sk->sk_callback_lock is a pile of mud that has existed since
around 1999 when we initially SMP threaded the networking stack.
It's long overdue to kill this thing, because for the most part it
is superfluous.

So I just sat for a bit and compiled some notes about the situation
and how we can move forward on this.  If I get some energy I will
attack this task some time soon:

--------------------

Deletion of sk->sk_callback_lock.

This lock is overloaded tremendously to protect several aspects of
socket state, most of which is superfluous and can be eliminated.

One aspect protected by this lock is the parent socket to child sock
relationship.  It makes sure that both the waitqueue and socket
pointer are updated atomically wrt. threads executing the sock
callbacks such as sk->sk_write_space().

A seperate lock was needed for this because socket callbacks can be
invoked asynchronously in softirq context via kfree_skb() and
therefore the socket lock would not be suitable for this purpose.

Basically, for this usage, all the callback cares about is that the
sk->sk_socket and sk->sk_sleep point to stable items.  Fortunately we
can ensure this by simply never letting these two things point to
NULL.  Instead we can point them at dummy waitqueue and socket objects
when there is no real parent currently associated with the sock.

This would allow large amounts of callback code to run lockless.  For
example, sock_def_write_space() would most not need the
sk_callback_lock held.  The exception being sk_wake_async()
processing.

sk_wake_async() uses the callback lock to guard the state of the
socket's fasync list.  It is taken as a reader for scanners, and taken
as a writer for thread paths which add entries to the fasync list.

These locking schemes date back to 1999 as indicated by comments above
sock_fasync().

To be honest, fasync usage is rare.  It's a way to register processes
which will receive SIGIO signals during IO events.  The generic fasync
code fs/fcntl.c:kill_fasync() uses a global lock for list protection
and nobody has noticed.  But we should be careful for this observation
because the other uses of fasync handling are devices like input
(keyboards, mice) and TTY devices.

If the "fasync on sockets is rare" observation is accurate, we could
simply use fasync_helper() and kill_fasync() to implement the
necessary locking.

Also, fasync is how SIGURG is implemented, again another rarely used
feature of sockets.

The next layer of (mis-)use of the sk_callback_lock are mis-level
protocol layers that sit on top of real protocol stacks.  For example
sunrpc and iscsi.  These subsystems want to override the socket
callbacks so that they can do data-ready etc. processing directly or
with modified behavior.

Typically these code bases stack away the callbacks they override so
that when they are done with the socket they restore those pointers.
During this "callback pointer swap" they hold the sk_callback_lock,
which on the surface seems pretty pointless since this does not
protect other threads of execution from invoking these callbacks.

SUNRPC also seems to use the sk->sk_callback_lock to stabilize the
state of the sk->sk_user_data pointer in the XPRT socket layer.  It
also has an xprt->shutdown state which is tested additionally to the
NULL'ness of the sk->sk_user_data XPRT pointer.  A seperate
xprt->transport_lock synchornizes the processing of incoming packets
for a given XID session, so the sk_callback_lock isn't needed for that
purpose.

sock_orphan() and sock_graft(), in include/net/sock.h, are the main
transition points of the parent socket relationship state.
Unfortunately, not all protocols use these helper routines
consistently.  In particular the radio protocols (ax25, etc.), IPX,
ECONET, and some others do this work by hand and in particular without
holding the sk->sk_callback_lock.

This should be consolidated so that the protocols do this stuff
consistently.  It also points us to the crummy SOCK_DEAD state bit we
have.  Really, all of the things guarded by SOCK_DEAD are totally
harmless if we point sk->sk_socket and sk->sk_sleep to dummy objects
instead of NULL.  So, there is strong evidence that SOCK_DEAD could be
eliminated.

Also, this lack of consistent usage of sock_orphan() and sock_graft()
means that some protocols are not invoking security_sock_graft()
properly.

So likely the first task in all of this is to get the protocol
to use sock_orphan() and sock_graft() across the board without
any exceptions.

At this point we can build dummy socket and wait queue objects that
sock allocation and sock_orphan() can point to.  I believe at this
point that sk_callback_lock acquisition can be removed from
sock_orphan() and sock_graft().

Next, we convert the fasync bits to use the global fasync_lock
via the helper functions fasync_helper() and kill_fasync() in
fs/fcntl.c instead of the per-socket locking, by-hand list insertion,
and __kill_fasync() invocations the socket layer uses currently.

Now, it should be possible to entirely remove the sk_callback_lock
grabs in the sock callbacks.

The rest of the work is composed of walking over the socket
wrapping layers such as SUNRPC, OCFS, and ISCSI and auditing
their sk_callback_lock usage, trying to eliminate the lock as
much as possible.  The unconvertable parts are probably replacable
with subsystem local object locks.

It should be possible to delete sk->sk_callback_lock at that point.
Killing SOCK_DEAD could be a followon task.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-11  9:16     ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
@ 2008-06-12 14:05       ` Herbert Xu
  0 siblings, 0 replies; 9+ messages in thread
From: Herbert Xu @ 2008-06-12 14:05 UTC (permalink / raw)
  To: David Miller; +Cc: GHaskins, PMullaney, netdev

On Wed, Jun 11, 2008 at 02:16:42AM -0700, David Miller wrote:
> 
> Deletion of sk->sk_callback_lock.

Looks pretty reasonable.  Thanks!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor•apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] net/core/sock.c remove extra wakeup
  2008-06-11  6:46   ` Herbert Xu
  2008-06-11  9:16     ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
@ 2008-06-16  3:48     ` Gregory Haskins
  1 sibling, 0 replies; 9+ messages in thread
From: Gregory Haskins @ 2008-06-16  3:48 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, Patrick Mullaney, netdev

>>> On Wed, Jun 11, 2008 at  2:46 AM, in message
<E1K6K60-00061s-00@gondolin•me.apana.org.au>, Herbert Xu
<herbert@gondor•apana.org.au> wrote: 
> Gregory Haskins <GHaskins@novell•com> wrote:
>> 
>> So basically it was waiting for a packet to arrive, and the net-rx softirq 
> would first trigger a NOSPACE wakeup (of which was a noop by the udp-rx code) 
> followed by an rx-wakeup.  So the netperf thread woke twice for one packet.  
> Fixing this boosted performance by a large margin (don't recall exact figures 
> off the top of my head....somewhere around 20%)
> 
> Please wrap your lines.

Apologies.  I don't think I have control over it in my corporate mailer environment :( but if I can figure out a way to do it I will make this change.  You are not the first to complain ;)

> 
> Anyway, you've lost me with this scenario.  The patch is stopping
> wake-ups on the write path yet you're talking about the read path.
> So my question is why are you getting the extra wake-up on that
> write path when you receive a packet?

Heh..you tell me :)

I don't recall the specifics off the top of my head, but IIRC it had to do with the fact that we were running a tx/rx netperf test and the way tx-completions are handled in the stack w.r.t. the softirqs.

The tx-completions code is executing inside the softirq-net-rx context and is freeing up the transmitted skbs.  The problem is that it is blindly signaling the wait-queue that there was now SPACE (even though no-one cared about that event at the moment...the only waiter was waiting on an rx event).

So one way to look at it is the problem is that the single wait-queue serves both rx-wakeup, and tx-nospace events.  Rather than open the can of worms of splitting the wait-queues into finer granularity,  Pat chose to at least make the overloaded use of the single wait-queue more intelligent to know if anyone cared about the NOSPACE event or not.

>From my perspective, I don't care how the issue is specifically solved...I would ultimately just like to see this wasted wake-up "go away" one way or the other.  ;)  It significantly degrades performance (at least in this synthetic test) and as far as I can tell there is no reason to structure the code this way.  If we are mistaken in that assessment, please let me know.

If we can offer any more details (such as the specific codepath that triggers the NOSPACE wakeup, for instance) we would be happy to oblige.  Long story short this is easily repeatable.

Regards,
-Greg

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-06-16  3:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <483F99090200005A00037FFE@sinclair.provo.novell.com>
2008-05-30 10:05 ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
2008-06-11  6:46   ` Herbert Xu
2008-06-11  9:16     ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
2008-06-12 14:05       ` Herbert Xu
2008-06-16  3:48     ` [PATCH] net/core/sock.c remove extra wakeup Gregory Haskins
     [not found] <483F99090200005A00037FFE@lucius.provo.novell.com>
     [not found] ` <483F990C0200005A00038001@lucius.provo.novell.com>
2008-05-30 17:00   ` Patrick Mullaney
2008-06-01 21:03     ` Andi Kleen
2008-05-29 16:08 Patrick Mullaney
2008-05-30  9:52 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox