tc filter insertion rate degradation

public inbox for netdev@vger.kernel.org 

* tc filter insertion rate degradation
@ 2019-01-21 11:24 Vlad Buslov
  2019-01-22 17:33 ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Vlad Buslov @ 2019-01-21 11:24 UTC
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
	Maor Gottlieb

Hi Eric,

I've been investigating a significant degradation in tc filter insertion
rate, and it appears to be caused by your commit 001c96db0181 ("net: align
gnet_stats_basic_cpu struct"). With this commit, the insertion rate drops
from ~65k rules/sec to ~43k rules/sec when inserting 1M rules from a file
in tc batch mode on my machine.

A perf profile of tc indicates that the pcpu allocator now consumes about
twice the share of CPU:

1) Before:

Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
  Children      Self  Co  Shared Object     Symbol
+   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
+    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area

2) After:

Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
  Children      Self  Co  Shared Object     Symbol
+   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
+   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area

It seems that the pcpu allocator has to do much more work to satisfy an
allocation under the new, stricter alignment requirement. I'm not sure
whether this is expected behavior in this case.
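
The two profiles have different total event counts, so the percentage
shares are not directly comparable. A rough sketch of converting them to
absolute cycles, using only the numbers quoted above (the helper names
are illustrative and not part of perf or the kernel):

```c
#include <stdint.h>

/* Cycles attributed to a symbol: total event count times its "Children" share. */
double sketch_symbol_cycles(uint64_t total_cycles, double children_pct)
{
	return (double)total_cycles * children_pct / 100.0;
}

/* Cycles spent everywhere else in the profile. */
double sketch_other_cycles(uint64_t total_cycles, double children_pct)
{
	return (double)total_cycles -
	       sketch_symbol_cycles(total_cycles, children_pct);
}
```

Feeding in (48796480071, 21.19) and (71446806550, 44.67) gives roughly
10.3e9 versus 31.9e9 cycles under pcpu_alloc, about a 3.1x increase,
while the remaining cycles stay within a few percent of each other. On
these numbers, nearly all of the extra time is spent in the allocator.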

Regards,
Vlad


* Re: tc filter insertion rate degradation
  2019-01-21 11:24 tc filter insertion rate degradation Vlad Buslov
@ 2019-01-22 17:33 ` Eric Dumazet
  2019-01-22 21:18   ` Tejun Heo
  2019-01-24 17:21   ` Dennis Zhou
  0 siblings, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 17:33 UTC
  To: Vlad Buslov, Dennis Zhou, Tejun Heo
  Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
	Maor Gottlieb

On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
>
> Hi Eric,
>
> I've been investigating significant tc filter insertion rate degradation
> and it seems it is caused by your commit 001c96db0181 ("net: align
> gnet_stats_basic_cpu struct"). With this commit insertion rate is
> reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> from file in tc batch mode on my machine.
>
> Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Co  Shared Object     Symbol
> +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Co  Shared Object     Symbol
> +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> It seems that it takes much more work for pcpu allocator to perform
> allocation with new stricter alignment requirements. Not sure if it is
> expected behavior or not in this case.
>
> Regards,
> Vlad

Hi Vlad

I guess this is more of a question for the per-cpu allocator experts / maintainers?

16-byte alignment for 16-byte objects sounds quite reasonable [1]

It also means that if your workload is mostly about setting up and
dismantling tc filters rather than actually using them, you might go
back to atomics instead of expensive per-cpu storage.

(I.e. optimize the control path instead of the data path.)

Thanks!

[1] We might even make this generic, as in:

diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
         */
        if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
                align = PCPU_MIN_ALLOC_SIZE;
-
+       while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+               if (size % (align << 1))
+                       break;
+               align <<= 1;
+       }
        size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
        bits = size >> PCPU_MIN_ALLOC_SHIFT;
        bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
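
To see what the added loop does to a given request, here is a standalone
sketch of the same logic outside the kernel. The two constants are
assumed values for illustration (the kernel defines the real ones), and
the function name is hypothetical:

```c
#include <stddef.h>

/* Assumed values for illustration only; the kernel's own definitions apply. */
#define SKETCH_MIN_ALLOC_SIZE 4
#define SKETCH_L1_CACHE_BYTES 64

/*
 * Raise `align` to the largest power-of-two multiple that still divides
 * `size`, capped at one cache line, mirroring the loop in the diff above.
 */
size_t sketch_promote_align(size_t size, size_t align)
{
	if (align < SKETCH_MIN_ALLOC_SIZE)
		align = SKETCH_MIN_ALLOC_SIZE;
	while (align < SKETCH_L1_CACHE_BYTES && (align << 1) <= size) {
		if (size % (align << 1))
			break;	/* doubling would no longer divide size */
		align <<= 1;
	}
	return align;
}
```

Under these assumptions a 16-byte object requested with 8-byte alignment
is promoted to 16-byte alignment, while a 24-byte object keeps 8-byte
alignment because 24 is not a multiple of 16.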


* Re: tc filter insertion rate degradation
  2019-01-22 17:33 ` Eric Dumazet
@ 2019-01-22 21:18   ` Tejun Heo
  2019-01-22 22:40     ` Eric Dumazet
  2019-01-24 17:21   ` Dennis Zhou
  1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2019-01-22 21:18 UTC
  To: Eric Dumazet
  Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
	Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb

Hello,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area

The allocator hint only remembers the max available size per chunk, not
the alignment. Depending on the allocation pattern, a change in the
alignment requirement can therefore lead to far more scanning per
allocation attempt. It shouldn't be too difficult to improve, though.
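
A toy model of the effect, assuming a plain first-fit search over a map
of 4-byte units. This is not the actual pcpu allocator, which keeps
per-block metadata and hints; the function name and layout are
hypothetical and only illustrate why a stricter alignment can skip holes
that a looser one would accept:

```c
#include <stddef.h>

/*
 * First-fit search over a unit map: used[i] != 0 means unit i is taken.
 * Returns the index of the first run of `bits` free units that starts on
 * a multiple of `bit_align`, or -1 if there is none. *steps counts how
 * many units were inspected along the way.
 */
long sketch_find_area(const unsigned char *used, size_t nr_units,
		      size_t bits, size_t bit_align, size_t *steps)
{
	size_t start, i;

	*steps = 0;
	for (start = 0; start + bits <= nr_units; start += bit_align) {
		for (i = 0; i < bits; i++) {
			(*steps)++;
			if (used[start + i])
				break;	/* candidate position rejected */
		}
		if (i == bits)
			return (long)start;
	}
	return -1;
}
```

In a 64-unit map with one 16-byte hole at byte offset 8 and another at a
16-byte aligned offset near the end, a request with 8-byte alignment
fits the first hole after inspecting 5 units, while the same request
with 16-byte alignment has to walk to the second hole and inspects 20.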

> I guess this is more a question for per-cpu allocator experts / maintainers ?
> 
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
> 
> It also means that if your workload is mostly being able to setup /
> dismantle tc filters,
> instead of really using them, you might go back to atomics instead of
> expensive per cpu storage.
> 
> (Ie optimize control path instead of data path)
> 
> Thanks !
> 
> [1] We even might make this generic as in :
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>          */
>         if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
>                 align = PCPU_MIN_ALLOC_SIZE;
> -
> +       while (align < L1_CACHE_BYTES && (align << 1) <= size) {
> +               if (size % (align << 1))
> +                       break;
> +               align <<= 1;
> +       }

Percpu storage is expensive and cache line sharing tends to be less of
a problem (because the data is per-cpu), so it is useful to support
custom alignments for tighter packing.

Thanks.

-- 
tejun


* Re: tc filter insertion rate degradation
  2019-01-22 21:18   ` Tejun Heo
@ 2019-01-22 22:40     ` Eric Dumazet
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 22:40 UTC
  To: Tejun Heo
  Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
	Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb

On Tue, Jan 22, 2019 at 1:18 PM Tejun Heo <tj@kernel•org> wrote:
>
> Hello,
>

> Percpu storage is expensive and cache line sharing tends to be less of
> a problem (cuz they're per-cpu), so it is useful to support custom
> alignments for tighter packing.
>


We have BPF percpu maps whose values are two 8-byte counters (a packet
counter and a byte counter), with millions of slots.

We update the pair for every packet sent on the hosts.

BPF uses an alignment of 8 (which cannot be changed or tuned, at least
not at the call sites in kernel/bpf/hashtab.c).

If we are lucky, every one of these pairs sits within a single cache line.
But when we are not lucky, 25% of the pairs straddle a cache line boundary,
reducing performance under DDoS.
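
One way to arrive at the 25% figure is sketched below. It assumes
64-byte cache lines and 16-byte pairs laid out back to back from a base
that is 8-byte but not 16-byte aligned; that layout is an interpretation
of the "unlucky" case, not something stated above, and the function
names are hypothetical:

```c
#include <stddef.h>

/* Nonzero if an object of `size` bytes at byte `offset` spans two lines. */
int sketch_crosses_line(size_t offset, size_t size, size_t line)
{
	return offset / line != (offset + size - 1) / line;
}

/* Count crossings among `n` objects packed back to back from `base`. */
size_t sketch_count_crossings(size_t base, size_t size, size_t n, size_t line)
{
	size_t i, crossings = 0;

	for (i = 0; i < n; i++)
		crossings += sketch_crosses_line(base + i * size, size, line);
	return crossings;
}
```

With a base offset of 0 none of the pairs cross a line. With a base
offset of 8 the pairs start at 8, 24, 40 and 56 within each line, and
the one at 56 spills into the next line, so one pair in four crosses.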

Using a nicer alignment does not consume more RAM in our case, and we
did not notice any extra cost from per-cpu allocations because we keep
them in the slow path (control path).


* Re: tc filter insertion rate degradation
  2019-01-22 17:33 ` Eric Dumazet
  2019-01-22 21:18   ` Tejun Heo
@ 2019-01-24 17:21   ` Dennis Zhou
  2019-01-29 19:22     ` Vlad Buslov
  1 sibling, 1 reply; 6+ messages in thread
From: Dennis Zhou @ 2019-01-24 17:21 UTC
  To: Eric Dumazet, Vlad Buslov
  Cc: Tejun Heo, Linux Kernel Network Developers, Yevgeny Kliteynik,
	Yossef Efraim, Maor Gottlieb

Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad

Would you mind sharing a little more information with me:
1) the output of /sys/kernel/debug/percpu_stats before and after a run
2) a full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(). As an
immediate guess, it might be due to constantly breaking the hint.

> 
> Hi Vlad
> 
> I guess this is more a question for per-cpu allocator experts / maintainers ?
> 
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
> 

The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, the overhead of forced alignment would be both in percpu
memory itself and in allocation time due to the stricter requirement.

Thanks,
Dennis


* Re: tc filter insertion rate degradation
  2019-01-24 17:21   ` Dennis Zhou
@ 2019-01-29 19:22     ` Vlad Buslov
  0 siblings, 0 replies; 6+ messages in thread
From: Vlad Buslov @ 2019-01-29 19:22 UTC
  To: Dennis Zhou
  Cc: Eric Dumazet, Tejun Heo, Linux Kernel Network Developers,
	Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb


On Thu 24 Jan 2019 at 17:21, Dennis Zhou <dennis@kernel•org> wrote:
> Hi Vlad and Eric,
>
> On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
>> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
>> >
>> > Hi Eric,
>> >
>> > I've been investigating significant tc filter insertion rate degradation
>> > and it seems it is caused by your commit 001c96db0181 ("net: align
>> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
>> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
>> > from file in tc batch mode on my machine.
>> >
>> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>> >
>> > 1) Before:
>> >
>> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > 2) After:
>> >
>> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > It seems that it takes much more work for pcpu allocator to perform
>> > allocation with new stricter alignment requirements. Not sure if it is
>> > expected behavior or not in this case.
>> >
>> > Regards,
>> > Vlad
>
> Would you mind sharing a little more information with me:
> 1) output before and after a run of /sys/kernel/debug/percpu_stats

Hi Dennis,

Some of these files are quite large, so I put them on my Dropbox.

Output before:

Percpu Memory Statistics
Allocation Info:
----------------------------------------
  unit_size           :       262144
  static_size         :       139160
  reserved_size       :         8192
  dyn_size            :        28776
  atom_size           :      2097152
  alloc_size          :      2097152

Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752
  nr_cur_alloc        :         2591
  nr_max_alloc        :         2598
  nr_chunks           :            3
  nr_max_chunks       :            3
  min_alloc_size      :            4
  max_alloc_size      :         8208
  empty_pop_pages     :            3

Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  max_alloc_size      :          320
  empty_pop_pages     :            0
  first_bit           :         1002
  free_bytes          :         7448
  contig_bytes        :         7424
  sum_frag            :           24
  max_frag            :           24
  cur_min_alloc       :           16
  cur_med_alloc       :           64
  cur_max_alloc       :          320

Chunk: <- First Chunk
  nr_alloc            :          479
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :         8192
  free_bytes          :            0
  contig_bytes        :            0
  sum_frag            :            0
  max_frag            :            0
  cur_min_alloc       :            4
  cur_med_alloc       :           24
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :         1925
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :        63102
  free_bytes          :          852
  contig_bytes        :           12
  sum_frag            :          852
  max_frag            :           12
  cur_min_alloc       :            4
  cur_med_alloc       :            8
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :          182
  max_alloc_size      :          936
  empty_pop_pages     :            3
  first_bit           :           21
  free_bytes          :       256452
  contig_bytes        :       255120
  sum_frag            :         1332
  max_frag            :          368
  cur_min_alloc       :            8
  cur_med_alloc       :           20
  cur_max_alloc       :          320


After: https://www.dropbox.com/s/unyzhx4vgo2x30e/stats_after?dl=0

> 2) a full perf output

https://www.dropbox.com/s/isfcxca3npn5slx/perf.data?dl=0

> 3) a reproducer

$ sudo tc -b add.0

Example batch file: https://www.dropbox.com/s/ey7cbl5nwu5p0tg/add.0?dl=0

Thanks,
Vlad
