From: Peter Zijlstra <peterz@infradead•org>
To: Srikar Dronamraju <srikar@linux•vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel•org>,
LKML <linux-kernel@vger•kernel.org>,
Mel Gorman <mgorman@techsingularity•net>,
Rik van Riel <riel@surriel•com>,
Thomas Gleixner <tglx@linutronix•de>,
Michael Ellerman <mpe@ellerman•id.au>,
Heiko Carstens <heiko.carstens@de•ibm.com>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd•com>,
linuxppc-dev <linuxppc-dev@lists•ozlabs.org>,
Benjamin Herrenschmidt <benh@au1•ibm.com>
Subject: Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
Date: Fri, 31 Aug 2018 13:12:53 +0200
Message-ID: <20180831111253.GJ24124@hirez.programming.kicks-ass.net>
In-Reply-To: <20180831102724.GB8437@linux.vnet.ibm.com>

On Fri, Aug 31, 2018 at 03:27:24AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead•org> [2018-08-29 10:02:19]:
> Powerpc lpars running on Phyp have 2 modes. Dedicated and shared.
>
> Dedicated lpars are similar to kvm guest with vcpupin.

Like I know what that means... I'm not big on virt. I suppose you're
saying it has a fixed virt to phys mapping.

> Shared lpars are similar to kvm guest without any pinning. When running
> shared lpar mode, Phyp allows overcommitting. Now if more lpars are
> created/destroyed, Phyp will internally move / consolidate the cores. The
> objective is similar to what autonuma tries to achieve on the host but with a
> different approach (consolidating to optimal nodes to achieve the best
> possible output). This would mean that the actual underlying cpus/node
> mapping has changed.

AFAIK Linux can _not_ handle cpu:node relations changing. And I'm pretty
sure I told you that before.

> Phyp will propagate an event upwards to the lpar. The
> lpar / os can choose to ignore or act on the same.
>
> We have found that acting on the event will provide up to 40% improvement
> over ignoring the event. Acting on the event would mean moving the cpu from
> one node to the other, and topology_work_fn does exactly that.

How? Last time I checked there was a ton of code that relies on
cpu_to_node() not changing during the runtime of the kernel.

Stuff like the per-cpu memory allocations are done using the boot time
cpu_to_node() map for instance. Similarly, kthread creation uses the
cpu_to_node() map at the time of creation.

A lot of stuff is not re-evaluated. If you're dynamically changing the
node map, you're in for a world of hurt.

> In the case where we didn't have the NUMA sched domain, we would build the
> independent (aka overlap) sched_groups. With NUMA sched domain
> introduction, we try to reuse sched_groups (aka non-overlap). This results
> in the above, which I thought I tried to explain in
> https://lwn.net/ml/linux-kernel/20180810164533.GB42350@linux.vnet.ibm.com

That email was a ton of confusion; you show an error and you don't
explain how you get there.

> In the typical case above, let's take 2 nodes and 8 cores, each core having
> SMT 8 threads. Initially all the 8 cores might come from node 0. Hence
> sched_domains_numa_masks[NODE][node1] and
> sched_domains_numa_masks[NUMA][node1], which are set in sched_init_numa(),
> will have blank cpumasks.
>
> Let's say Phyp decides to move some of the load to another node, node 1, which
> till now has 0 cpus. Hence we will see
>
> "BUG: arch topology borken \n the DIE domain not a subset of the NODE
> domain" which is probably okay. This problem was present even before the
> NODE domain was created, and systems still booted and ran.

No that is _NOT_ OKAY. The fact that it boots and runs just means we
cope with it, but it violates a base assumption when building domains.

> However with the introduction of the NODE sched_domain,
> init_sched_groups_capacity() gets called for non-overlap sched_domains, which
> gets us into even worse problems. Here we will end up in a situation where
> sgA->sgB->sgC->sgD->sgA gets converted into sgA->sgB->sgC->sgB, which ends up
> creating cpu stalls.
>
> So the request is to expose sched_domains_numa_masks_set /
> sched_domains_numa_masks_clear to arch, so that on a topology update, i.e. an
> event from phyp, arch sets the mask correctly. The scheduler seems to take
> care of everything else.

NAK, not until you've fixed every cpu_to_node() user in the kernel to
deal with that mask changing.

This is absolutely insane.