public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat•com>
To: Eric Dumazet <eric.dumazet@gmail•com>
Cc: Rick Jones <rick.jones2@hpe•com>,
	netdev@vger•kernel.org, brouer@redhat•com,
	Saeed Mahameed <saeedm@mellanox•com>,
	Tariq Toukan <tariqt@mellanox•com>
Subject: Re: Netperf UDP issue with connected sockets
Date: Thu, 17 Nov 2016 19:30:21 +0100	[thread overview]
Message-ID: <20161117193021.580589ae@redhat.com> (raw)
In-Reply-To: <1479399679.8455.255.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet <eric.dumazet@gmail•com> wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet <eric.dumazet@gmail•com> wrote:
> >   
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >   
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > > >     
> > > 
> > > Sure. Not enough pressure from the sender(s).
> > > 
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > > 
> > > (BTW not all NIC have expensive doorbells)  
> > 
> > I believe this NIC mlx5 (50G edition) does.
> > 
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> > 
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> > 
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >    Overhead  Command        Shared Object        Symbol
> >  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
> >    - _raw_spin_lock
> >       + 90.78% __dev_queue_xmit
> >       + 7.83% dev_queue_xmit
> >       + 1.30% ___slab_alloc
> >  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w  
> 
> Does TX completion happens on same cpu ?
> 
> >  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
> >  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup  
> 
> Why fib_table_lookup() is used with connected UDP flow ???

Don't know. I would be interested in hints howto avoid this!...

I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c


> >  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> >  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash  
> 
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by ip_route_output_flow().
 
> >  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> > 
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> > 
> > But 8.92% of the time is spend on it, which corresponds to a cost of 215
> > cycles (2414*0.0892).  This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.  
> 
> Well, with no pressure, doorbell is triggered for each packet.
> 
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is I can see a socket Send-Q forming, thus we do know the
application have something to send. Thus, and possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from
socket layer into qdisc layer, should be fairly simple (and rest of
xmit_more is already in place).  


> Note that with the patches I am working on (busypolling extentions),
> we could decide to let the busypoller doing the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting! but an opportunistically approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment making sure to pin everything
to CPU-0 and the results are the same.

sudo ethtool -L mlx5p2 combined 1

sudo sh -c '\                                                    
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

# Overhead  CPU  Command         Shared Object        Symbol                                   
# ........  ...  ..............  ...................  .........................................
#
     9.97%  000  udp_flood       [kernel.vmlinux]     [k] _raw_spin_lock                       
     4.37%  000  udp_flood       [kernel.vmlinux]     [k] fib_table_lookup                     
     3.97%  000  udp_flood       [mlx5_core]          [k] mlx5e_sq_xmit                        
     3.06%  000  udp_flood       [kernel.vmlinux]     [k] __ip_route_output_key_hash           
     2.51%  000  udp_flood       [mlx5_core]          [k] mlx5e_poll_tx_cq                     
     2.48%  000  udp_flood       [kernel.vmlinux]     [k] copy_user_enhanced_fast_string       
     2.47%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64                     
     2.42%  000  udp_flood       [kernel.vmlinux]     [k] udp_sendmsg                          
     2.39%  000  udp_flood       [kernel.vmlinux]     [k] __ip_append_data.isra.47             
     2.29%  000  udp_flood       [kernel.vmlinux]     [k] sock_def_write_space                 
     2.19%  000  udp_flood       [mlx5_core]          [k] mlx5e_get_cqe                        
     1.95%  000  udp_flood       [kernel.vmlinux]     [k] __ip_make_skb                        
     1.94%  000  udp_flood       [kernel.vmlinux]     [k] __alloc_skb                          
     1.90%  000  udp_flood       [kernel.vmlinux]     [k] sock_wfree                           
     1.85%  000  udp_flood       [kernel.vmlinux]     [k] skb_set_owner_w                      
     1.62%  000  udp_flood       [kernel.vmlinux]     [k] ip_finish_output2                    
     1.61%  000  udp_flood       [kernel.vmlinux]     [k] kfree                                
     1.54%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64_fastpath            
     1.35%  000  udp_flood       libc-2.17.so         [.] __sendmsg_nocancel                   
     1.26%  000  udp_flood       [kernel.vmlinux]     [k] __kmalloc_node_track_caller          
     1.24%  000  udp_flood       [kernel.vmlinux]     [k] __rcu_read_unlock                    
     1.23%  000  udp_flood       [kernel.vmlinux]     [k] __local_bh_enable_ip                 
     1.22%  000  udp_flood       [kernel.vmlinux]     [k] ip_idents_reserve         

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

  reply	other threads:[~2016-11-17 18:30 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-03 14:59 High perf top ip_idents_reserve doing netperf UDP_STREAM Jesper Dangaard Brouer
2014-09-03 15:17 ` Eric Dumazet
2016-11-16 12:16   ` Netperf UDP issue with connected sockets Jesper Dangaard Brouer
2016-11-16 17:46     ` Rick Jones
2016-11-16 22:40       ` Jesper Dangaard Brouer
2016-11-16 22:50         ` Rick Jones
2016-11-17  0:34         ` Eric Dumazet
2016-11-17  8:16           ` Jesper Dangaard Brouer
2016-11-17 13:20             ` Eric Dumazet
2016-11-17 13:42               ` Jesper Dangaard Brouer
2016-11-17 14:17                 ` Eric Dumazet
2016-11-17 14:57                   ` Jesper Dangaard Brouer
2016-11-17 16:21                     ` Eric Dumazet
2016-11-17 18:30                       ` Jesper Dangaard Brouer [this message]
2016-11-17 18:51                         ` Eric Dumazet
2016-11-17 21:19                           ` Jesper Dangaard Brouer
2016-11-17 21:44                             ` Eric Dumazet
2016-11-17 23:08                               ` Rick Jones
2016-11-18  0:37                                 ` Julian Anastasov
2016-11-18  0:42                                   ` Rick Jones
2016-11-18 17:12                               ` Jesper Dangaard Brouer
2016-11-21 16:03                           ` Jesper Dangaard Brouer
2016-11-21 18:10                             ` Eric Dumazet
2016-11-29  6:58                               ` [WIP] net+mlx4: auto doorbell Eric Dumazet
2016-11-30 11:38                                 ` Jesper Dangaard Brouer
2016-11-30 15:56                                   ` Eric Dumazet
2016-11-30 19:17                                     ` Jesper Dangaard Brouer
2016-11-30 19:30                                       ` Eric Dumazet
2016-11-30 22:30                                         ` Jesper Dangaard Brouer
2016-11-30 22:40                                           ` Eric Dumazet
2016-12-01  0:27                                         ` Eric Dumazet
2016-12-01  1:16                                           ` Tom Herbert
2016-12-01  2:32                                             ` Eric Dumazet
2016-12-01  2:50                                               ` Eric Dumazet
2016-12-02 18:16                                                 ` Eric Dumazet
2016-12-01  5:03                                               ` Tom Herbert
2016-12-01 19:24                                                 ` Willem de Bruijn
2016-11-30 13:50                                 ` Saeed Mahameed
2016-11-30 15:44                                   ` Eric Dumazet
2016-11-30 16:27                                     ` Saeed Mahameed
2016-11-30 17:28                                       ` Eric Dumazet
2016-12-01 12:05                                       ` Jesper Dangaard Brouer
2016-12-01 14:24                                         ` Eric Dumazet
2016-12-01 16:04                                           ` Jesper Dangaard Brouer
2016-12-01 17:04                                             ` Eric Dumazet
2016-12-01 19:17                                               ` Jesper Dangaard Brouer
2016-12-01 20:11                                                 ` Eric Dumazet
2016-12-01 20:20                                               ` David Miller
2016-12-01 22:10                                                 ` Eric Dumazet
2016-12-02 14:23                                               ` Eric Dumazet
2016-12-01 21:32                                 ` Alexander Duyck
2016-12-01 22:04                                   ` Eric Dumazet
2016-11-17 17:34                     ` Netperf UDP issue with connected sockets David Laight
2016-11-17 22:39                       ` Alexander Duyck
2016-11-17 17:42             ` Rick Jones
2016-11-28 18:33             ` Rick Jones
2016-11-28 18:40               ` Rick Jones
2016-11-30 10:43               ` Jesper Dangaard Brouer
2016-11-30 17:42                 ` Rick Jones
2016-11-30 18:11                   ` David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161117193021.580589ae@redhat.com \
    --to=brouer@redhat$(echo .)com \
    --cc=eric.dumazet@gmail$(echo .)com \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=rick.jones2@hpe$(echo .)com \
    --cc=saeedm@mellanox$(echo .)com \
    --cc=tariqt@mellanox$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox