public inbox for netdev@vger.kernel.org 
 help / color / mirror / Atom feed
From: Aditya Garg <gargaditya@linux•microsoft.com>
To: Eric Dumazet <edumazet@google•com>
Cc: "David S . Miller" <davem@davemloft•net>,
	Jakub Kicinski <kuba@kernel•org>, Paolo Abeni <pabeni@redhat•com>,
	Simon Horman <horms@kernel•org>,
	Kuniyuki Iwashima <kuniyu@google•com>,
	Willem de Bruijn <willemb@google•com>,
	netdev@vger•kernel.org, eric.dumazet@gmail•com,
	ssengar@linux•microsoft.com, gargaditya@microsoft•com
Subject: Re: [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs
Date: Wed, 12 Nov 2025 10:44:28 +0530	[thread overview]
Message-ID: <c6e3a182-b326-4a6d-901d-c445d95643eb@linux.microsoft.com> (raw)
In-Reply-To: <CANn89i+e2XDw4b_iHaj_LPTeg6M2+-+vroECks21Fs7Hfg2Aow@mail.gmail.com>

On 11-11-2025 22:47, Eric Dumazet wrote:
> On Tue, Nov 11, 2025 at 8:56 AM Aditya Garg
> <gargaditya@linux•microsoft.com> wrote:
>>
>> On Thu, Nov 06, 2025 at 08:29:34PM +0000, Eric Dumazet wrote:
>>> There is a lack of NUMA awareness and more generally lack
>>> of slab caches affinity on TX completion path.
>>>
>>> Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
>>> in per-cpu caches so that they can be recycled in RX path.
>>>
>>> Only use this if the skb was allocated on the same cpu,
>>> otherwise use skb_attempt_defer_free() so that the skb
>>> is freed on the original cpu.
>>>
>>> This removes contention on SLUB spinlocks and data structures.
>>>
>>> After this patch, I get ~50% improvement for an UDP tx workload
>>> on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).
>>>
>>> 80 Mpps -> 120 Mpps.
>>>
>>> Profiling one of the 32 cpus servicing NIC interrupts :
>>>
>>> Before:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00
>>>
>>>      31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>      12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
>>>       3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
>>>       2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
>>>       2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
>>>       2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
>>>       2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
>>>       2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
>>>       2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
>>>       1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
>>>       1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>       1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
>>>       0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>       0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
>>>       0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
>>>       0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
>>>       0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
>>>       0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
>>>       0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
>>>       0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>
>>> After:
>>>
>>> mpstat -P 511 1 1
>>>
>>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51
>>>
>>>      19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
>>>      13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
>>>      10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
>>>      10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
>>>       7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>>>       6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
>>>       5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
>>>       3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
>>>       3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
>>>       2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
>>>       2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
>>>       1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
>>>       1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
>>>       0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
>>>       0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
>>>       0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
>>>       0.53%  swapper          [kernel.kallsyms]  [k] io_idle
>>>       0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
>>>       0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
>>>       0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
>>>       0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
>>>       0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
>>>       0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
>>>       0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
>>>       0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
>>>       0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
>>>       0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
>>>       0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
>>>       0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
>>>       0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
>>>       0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run
>>>
>>> Signed-off-by: Eric Dumazet <edumazet@google•com>
>>> ---
>>>   net/core/skbuff.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index eeddb9e737ff28e47c77739db7b25ea68e5aa735..7ac5f8aa1235a55db02b40b5a0f51bb3fa53fa03 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1476,6 +1476,11 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
>>>
>>>        DEBUG_NET_WARN_ON_ONCE(!in_softirq());
>>>
>>> +     if (skb->alloc_cpu != smp_processor_id() && !skb_shared(skb)) {
>>> +             skb_release_head_state(skb);
>>> +             return skb_attempt_defer_free(skb);
>>> +     }
>>> +
>>>        if (!skb_unref(skb))
>>>                return;
>>>
>>> --
>>> 2.51.2.1041.gc1ab5b90ca-goog
>>>
>>
>> I ran these tests on latest net-next for MANA driver and I am observing a regression here.
>>
>> lisatest@lisa--747-e0-n0:~$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 48692 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.57 GBytes  39.2 Gbits/sec  586   1.04 MBytes
>> [  5]   1.00-2.00   sec  4.74 GBytes  40.7 Gbits/sec  520   1.13 MBytes
>> [  5]   2.00-3.00   sec  5.16 GBytes  44.3 Gbits/sec  191   1.20 MBytes
>> [  5]   3.00-4.00   sec  5.13 GBytes  44.1 Gbits/sec  520   1.11 MBytes
>> [  5]   4.00-5.00   sec   678 MBytes  5.69 Gbits/sec   93   1.37 KBytes
>> [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  21.00-22.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  22.00-23.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  23.00-24.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  24.00-25.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  25.00-26.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  26.00-27.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  27.00-28.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  28.00-29.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> [  5]  29.00-30.00  sec  0.00 Bytes  0.00 bits/sec    0   1.37 KBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec  1910             sender
>> [  5]   0.00-30.00  sec  20.3 GBytes  5.80 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I tested again by reverting this patch and regression was not there.
>>
>> lisatest@lisa--747-e0-n0:~/net-next$ iperf3 -c 10.0.0.4 -t 30 -l 1048576
>> Connecting to host 10.0.0.4, port 5201
>> [  5] local 10.0.0.5 port 58188 connected to 10.0.0.4 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  4.95 GBytes  42.5 Gbits/sec  541   1.10 MBytes
>> [  5]   1.00-2.00   sec  4.92 GBytes  42.3 Gbits/sec  599    878 KBytes
>> [  5]   2.00-3.00   sec  4.51 GBytes  38.7 Gbits/sec  438    803 KBytes
>> [  5]   3.00-4.00   sec  4.69 GBytes  40.3 Gbits/sec  647   1.17 MBytes
>> [  5]   4.00-5.00   sec  4.18 GBytes  35.9 Gbits/sec  1183    715 KBytes
>> [  5]   5.00-6.00   sec  5.05 GBytes  43.4 Gbits/sec  484    975 KBytes
>> [  5]   6.00-7.00   sec  5.32 GBytes  45.7 Gbits/sec  520    836 KBytes
>> [  5]   7.00-8.00   sec  5.29 GBytes  45.5 Gbits/sec  436   1.10 MBytes
>> [  5]   8.00-9.00   sec  5.27 GBytes  45.2 Gbits/sec  464   1.30 MBytes
>> [  5]   9.00-10.00  sec  5.25 GBytes  45.1 Gbits/sec  425   1.13 MBytes
>> [  5]  10.00-11.00  sec  5.29 GBytes  45.4 Gbits/sec  268   1.19 MBytes
>> [  5]  11.00-12.00  sec  4.98 GBytes  42.8 Gbits/sec  711    793 KBytes
>> [  5]  12.00-13.00  sec  3.80 GBytes  32.6 Gbits/sec  1255    801 KBytes
>> [  5]  13.00-14.00  sec  3.80 GBytes  32.7 Gbits/sec  1130    642 KBytes
>> [  5]  14.00-15.00  sec  4.31 GBytes  37.0 Gbits/sec  1024   1.11 MBytes
>> [  5]  15.00-16.00  sec  5.18 GBytes  44.5 Gbits/sec  359   1.25 MBytes
>> [  5]  16.00-17.00  sec  5.23 GBytes  44.9 Gbits/sec  265    900 KBytes
>> [  5]  17.00-18.00  sec  4.70 GBytes  40.4 Gbits/sec  769    715 KBytes
>> [  5]  18.00-19.00  sec  3.77 GBytes  32.4 Gbits/sec  1841    889 KBytes
>> [  5]  19.00-20.00  sec  3.77 GBytes  32.4 Gbits/sec  1084    827 KBytes
>> [  5]  20.00-21.00  sec  5.01 GBytes  43.0 Gbits/sec  558    994 KBytes
>> [  5]  21.00-22.00  sec  5.27 GBytes  45.3 Gbits/sec  450   1.25 MBytes
>> [  5]  22.00-23.00  sec  5.25 GBytes  45.1 Gbits/sec  338   1.18 MBytes
>> [  5]  23.00-24.00  sec  5.29 GBytes  45.4 Gbits/sec  200   1.14 MBytes
>> [  5]  24.00-25.00  sec  5.29 GBytes  45.5 Gbits/sec  518   1.02 MBytes
>> [  5]  25.00-26.00  sec  4.28 GBytes  36.7 Gbits/sec  1258    792 KBytes
>> [  5]  26.00-27.00  sec  3.87 GBytes  33.2 Gbits/sec  1365    799 KBytes
>> [  5]  27.00-28.00  sec  4.77 GBytes  41.0 Gbits/sec  530   1.09 MBytes
>> [  5]  28.00-29.00  sec  5.31 GBytes  45.6 Gbits/sec  419   1.06 MBytes
>> [  5]  29.00-30.00  sec  5.32 GBytes  45.7 Gbits/sec  222   1.10 MBytes
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Retr
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec  20301             sender
>> [  5]   0.00-30.00  sec   144 GBytes  41.2 Gbits/sec                  receiver
>>
>> iperf Done.
>>
>>
>> I am still figuring out technicalities of this patch, but wanted to share initial findings for your input. Please let me know your thoughts on this!
> 
> Perhaps try : https://patchwork.kernel.org/project/netdevbpf/patch/20251111151235.1903659-1-edumazet@google.com/
> 
> Thanks !

Thanks Eric, it works fine after above fix.

Regards,
Aditya

  reply	other threads:[~2025-11-12  5:14 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-06 20:29 [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 1/3] net: allow skb_release_head_state() to be called multiple times Eric Dumazet
2025-11-07  6:42   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-06 20:29 ` [PATCH net-next 2/3] net: fix napi_consume_skb() with alien skbs Eric Dumazet
2025-11-07  6:46   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 12:28     ` Eric Dumazet
2025-11-07 14:34       ` Toke Høiland-Jørgensen
2025-11-07 15:44   ` Jason Xing
2025-11-11 16:56   ` Aditya Garg
2025-11-11 17:17     ` Eric Dumazet
2025-11-12  5:14       ` Aditya Garg [this message]
2025-11-12 14:03   ` Jon Hunter
2025-11-12 14:08     ` Eric Dumazet
2025-11-12 15:26       ` Jon Hunter
2025-11-12 15:32         ` Eric Dumazet
2026-03-09 21:00   ` Long delay in freeing skbs Tony Battersby
2026-03-09 21:07     ` Eric Dumazet
2026-03-09 21:18       ` Tony Battersby
2026-03-09 21:24         ` Eric Dumazet
2026-03-09 21:47           ` Tony Battersby
2026-03-10 14:49             ` Tony Battersby
2026-03-10 15:25               ` Eric Dumazet
2025-11-06 20:29 ` [PATCH net-next 3/3] net: increase skb_defer_max default to 128 Eric Dumazet
2025-11-07  6:47   ` Kuniyuki Iwashima
2025-11-07 11:23   ` Toke Høiland-Jørgensen
2025-11-07 15:37   ` Jason Xing
2025-11-07 15:47     ` Eric Dumazet
2025-11-07 15:49       ` Jason Xing
2025-11-07 16:00         ` Eric Dumazet
2025-11-07 16:03           ` Jason Xing
2025-11-07 16:08             ` Eric Dumazet
2025-11-07 16:20               ` Jason Xing
2025-11-07 16:26                 ` Eric Dumazet
2025-11-07 16:59                   ` Jason Xing
2025-11-08  3:10 ` [PATCH net-next 0/3] net: use skb_attempt_defer_free() in napi_consume_skb() patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c6e3a182-b326-4a6d-901d-c445d95643eb@linux.microsoft.com \
    --to=gargaditya@linux$(echo .)microsoft.com \
    --cc=davem@davemloft$(echo .)net \
    --cc=edumazet@google$(echo .)com \
    --cc=eric.dumazet@gmail$(echo .)com \
    --cc=gargaditya@microsoft$(echo .)com \
    --cc=horms@kernel$(echo .)org \
    --cc=kuba@kernel$(echo .)org \
    --cc=kuniyu@google$(echo .)com \
    --cc=netdev@vger$(echo .)kernel.org \
    --cc=pabeni@redhat$(echo .)com \
    --cc=ssengar@linux$(echo .)microsoft.com \
    --cc=willemb@google$(echo .)com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox