* tc filter insertion rate degradation
@ 2019-01-21 11:24 Vlad Buslov
2019-01-22 17:33 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Vlad Buslov @ 2019-01-21 11:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
Maor Gottlieb
Hi Eric,
I've been investigating significant tc filter insertion rate degradation
and it seems it is caused by your commit 001c96db0181 ("net: align
gnet_stats_basic_cpu struct"). With this commit insertion rate is
reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
from file in tc batch mode on my machine.
Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
1) Before:
Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
  Children      Self  Co  Shared Object     Symbol
+   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
+    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
2) After:
Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
  Children      Self  Co  Shared Object     Symbol
+   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
+   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
It seems that it takes much more work for pcpu allocator to perform
allocation with new stricter alignment requirements. Not sure if it is
expected behavior or not in this case.
Regards,
Vlad
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tc filter insertion rate degradation
2019-01-21 11:24 tc filter insertion rate degradation Vlad Buslov
@ 2019-01-22 17:33 ` Eric Dumazet
2019-01-22 21:18 ` Tejun Heo
2019-01-24 17:21 ` Dennis Zhou
0 siblings, 2 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 17:33 UTC (permalink / raw)
To: Vlad Buslov, Dennis Zhou, Tejun Heo
Cc: Linux Kernel Network Developers, Yevgeny Kliteynik, Yossef Efraim,
Maor Gottlieb

On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
>
> Hi Eric,
>
> I've been investigating significant tc filter insertion rate degradation
> and it seems it is caused by your commit 001c96db0181 ("net: align
> gnet_stats_basic_cpu struct"). With this commit insertion rate is
> reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> from file in tc batch mode on my machine.
>
> Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>
> 1) Before:
>
> Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>   Children      Self  Co  Shared Object     Symbol
> +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> 2) After:
>
> Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>   Children      Self  Co  Shared Object     Symbol
> +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>
> It seems that it takes much more work for pcpu allocator to perform
> allocation with new stricter alignment requirements. Not sure if it is
> expected behavior or not in this case.
>
> Regards,
> Vlad

Hi Vlad

I guess this is more a question for per-cpu allocator experts /
maintainers ?

16-byte alignment for 16-byte objects sounds quite reasonable [1]

It also means that if your workload is mostly being able to set up /
dismantle tc filters, instead of really using them, you might go back
to atomics instead of expensive per cpu storage.

(i.e. optimize the control path instead of the data path)

Thanks !

[1] We even might make this generic as in :

diff --git a/mm/percpu.c b/mm/percpu.c
index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	 */
 	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
 		align = PCPU_MIN_ALLOC_SIZE;
-
+	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
+		if (size % (align << 1))
+			break;
+		align <<= 1;
+	}
 	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
 	bits = size >> PCPU_MIN_ALLOC_SHIFT;
 	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tc filter insertion rate degradation
2019-01-22 17:33 ` Eric Dumazet
@ 2019-01-22 21:18 ` Tejun Heo
2019-01-22 22:40 ` Eric Dumazet
2019-01-24 17:21 ` Dennis Zhou
1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2019-01-22 21:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb

Hello,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area

The allocator hint only remembers the max available size per chunk but
not the alignment, so depending on the allocation pattern, alignment
requirement change can lead to way more scanning per alloc attempt.
Shouldn't be too difficult to improve tho.

> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-byte alignment for 16-byte objects sounds quite reasonable [1]
>
> It also means that if your workload is mostly being able to set up /
> dismantle tc filters,
> instead of really using them, you might go back to atomics instead of
> expensive per cpu storage.
>
> (i.e. optimize the control path instead of the data path)
>
> Thanks !
>
> [1] We even might make this generic as in :
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 27a25bf1275b7233d28cc0b126256e0f8a2b7f4f..bbf4ad37ae893fc1da5523889dd147f046852cc7
> 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1362,7 +1362,11 @@ static void __percpu *pcpu_alloc(size_t size,
> size_t align, bool reserved,
>  	 */
>  	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
>  		align = PCPU_MIN_ALLOC_SIZE;
> -
> +	while (align < L1_CACHE_BYTES && (align << 1) <= size) {
> +		if (size % (align << 1))
> +			break;
> +		align <<= 1;
> +	}

Percpu storage is expensive and cache line sharing tends to be less of
a problem (cuz they're per-cpu), so it is useful to support custom
alignments for tighter packing.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tc filter insertion rate degradation
2019-01-22 21:18 ` Tejun Heo
@ 2019-01-22 22:40 ` Eric Dumazet
0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-01-22 22:40 UTC (permalink / raw)
To: Tejun Heo
Cc: Vlad Buslov, Dennis Zhou, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb

On Tue, Jan 22, 2019 at 1:18 PM Tejun Heo <tj@kernel•org> wrote:
>
> Hello,
>
> Percpu storage is expensive and cache line sharing tends to be less of
> a problem (cuz they're per-cpu), so it is useful to support custom
> alignments for tighter packing.
>

We have BPF percpu maps of two 8-byte counters (packets and bytes
counter), with millions of slots.

We update the pair for every packet sent on the hosts.

BPF uses an alignment of 8 (that cannot be changed/tuned, at least at
all call sites in kernel/bpf/hashtab.c)

If we are lucky, all these pairs are allocated using a single cache
line each.

But when we are not lucky, 25% of the pairs are crossing a cache line,
reducing performance under DDoS.

Using a nicer alignment in our case does not consume more ram, and we
did not notice extra cost of per-cpu allocations because we keep them
in the slow path (control path)

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tc filter insertion rate degradation
2019-01-22 17:33 ` Eric Dumazet
2019-01-22 21:18 ` Tejun Heo
@ 2019-01-24 17:21 ` Dennis Zhou
2019-01-29 19:22 ` Vlad Buslov
1 sibling, 1 reply; 6+ messages in thread
From: Dennis Zhou @ 2019-01-24 17:21 UTC (permalink / raw)
To: Eric Dumazet, Vlad Buslov
Cc: Tejun Heo, Linux Kernel Network Developers, Yevgeny Kliteynik,
Yossef Efraim, Maor Gottlieb

Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad

Would you mind sharing a little more information with me:
1) output before and after a run of /sys/kernel/debug/percpu_stats
2) a full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(), but,
as an immediate guess, it might be due to constantly breaking the hint.

>
> Hi Vlad
>
> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-byte alignment for 16-byte objects sounds quite reasonable [1]
>

The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, the overhead of forced alignment would be both in percpu
memory itself and in allocation time due to the stricter requirement.

Thanks,
Dennis

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tc filter insertion rate degradation
2019-01-24 17:21 ` Dennis Zhou
@ 2019-01-29 19:22 ` Vlad Buslov
0 siblings, 0 replies; 6+ messages in thread
From: Vlad Buslov @ 2019-01-29 19:22 UTC (permalink / raw)
To: Dennis Zhou
Cc: Eric Dumazet, Tejun Heo, Linux Kernel Network Developers,
Yevgeny Kliteynik, Yossef Efraim, Maor Gottlieb

On Thu 24 Jan 2019 at 17:21, Dennis Zhou <dennis@kernel•org> wrote:
> Hi Vlad and Eric,
>
> On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
>> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vladbu@mellanox•com> wrote:
>> >
>> > Hi Eric,
>> >
>> > I've been investigating significant tc filter insertion rate degradation
>> > and it seems it is caused by your commit 001c96db0181 ("net: align
>> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
>> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
>> > from file in tc batch mode on my machine.
>> >
>> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
>> >
>> > 1) Before:
>> >
>> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > 2) After:
>> >
>> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
>> >   Children      Self  Co  Shared Object     Symbol
>> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
>> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
>> >
>> > It seems that it takes much more work for pcpu allocator to perform
>> > allocation with new stricter alignment requirements. Not sure if it is
>> > expected behavior or not in this case.
>> >
>> > Regards,
>> > Vlad
>
> Would you mind sharing a little more information with me:
> 1) output before and after a run of /sys/kernel/debug/percpu_stats

Hi Dennis,

Some of these files are quite large, so I put them on my Dropbox.

Output before:

Percpu Memory Statistics
Allocation Info:
----------------------------------------
  unit_size           :       262144
  static_size         :       139160
  reserved_size       :         8192
  dyn_size            :        28776
  atom_size           :      2097152
  alloc_size          :      2097152

Global Stats:
----------------------------------------
  nr_alloc            :         3343
  nr_dealloc          :          752
  nr_cur_alloc        :         2591
  nr_max_alloc        :         2598
  nr_chunks           :            3
  nr_max_chunks       :            3
  min_alloc_size      :            4
  max_alloc_size      :         8208
  empty_pop_pages     :            3

Per Chunk Stats:
----------------------------------------
Chunk: <- Reserved Chunk
  nr_alloc            :            5
  max_alloc_size      :          320
  empty_pop_pages     :            0
  first_bit           :         1002
  free_bytes          :         7448
  contig_bytes        :         7424
  sum_frag            :           24
  max_frag            :           24
  cur_min_alloc       :           16
  cur_med_alloc       :           64
  cur_max_alloc       :          320

Chunk: <- First Chunk
  nr_alloc            :          479
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :         8192
  free_bytes          :            0
  contig_bytes        :            0
  sum_frag            :            0
  max_frag            :            0
  cur_min_alloc       :            4
  cur_med_alloc       :           24
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :         1925
  max_alloc_size      :         8208
  empty_pop_pages     :            0
  first_bit           :        63102
  free_bytes          :          852
  contig_bytes        :           12
  sum_frag            :          852
  max_frag            :           12
  cur_min_alloc       :            4
  cur_med_alloc       :            8
  cur_max_alloc       :         8208

Chunk:
  nr_alloc            :          182
  max_alloc_size      :          936
  empty_pop_pages     :            3
  first_bit           :           21
  free_bytes          :       256452
  contig_bytes        :       255120
  sum_frag            :         1332
  max_frag            :          368
  cur_min_alloc       :            8
  cur_med_alloc       :           20
  cur_max_alloc       :          320

After: https://www.dropbox.com/s/unyzhx4vgo2x30e/stats_after?dl=0

> 2) a full perf output

https://www.dropbox.com/s/isfcxca3npn5slx/perf.data?dl=0

> 3) a reproducer

$ sudo tc -b add.0

Example batch file: https://www.dropbox.com/s/ey7cbl5nwu5p0tg/add.0?dl=0

Thanks,
Vlad

^ permalink raw reply [flat|nested] 6+ messages in thread