From: Alistair Popple <alistair@popple•id.au>
To: Alexey Kardashevskiy <aik@ozlabs•ru>
Cc: Reza Arbab <arbab@linux•ibm.com>,
linuxppc-dev@lists•ozlabs.org,
David Gibson <david@gibson•dropbear.id.au>
Subject: Re: [PATCH kernel v2] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
Date: Thu, 18 Oct 2018 12:05:52 +1100
Message-ID: <11968808.7zhiiyAbUF@townsend>
In-Reply-To: <45098cfd-5420-f4bd-f7fe-394ce2c704d0@ozlabs.ru>

Hi Alexey,

> > wouldn't you also need to do that somewhere? Unless the driver
> > does it at startup?
>
> VFIO performs GPU reset so I'd expect the GPUs to flush their caches
> without any software interaction. Am I hoping for too much here?
Sadly you are. It's not the GPU caches that need flushing, it's the CPU caches.
This needs to happen as part of the reset sequence, so I guess you would need
to add it to the VFIO driver.
- Alistair
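
For illustration only, here is a minimal userspace sketch of the kind of range-based CPU cache flush discussed above. It is not kernel code and not taken from any posted patch: flush_cpu_cache_range(), line_flush_fn, count_line() and the 128-byte line size are all assumptions made for the example. In a real driver the per-line callback would be replaced by the architecture's own cache flush primitive, and where the call sits within the VFIO reset sequence is left open here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for the sketch: 128-byte cache lines. */
#define CACHE_LINE_BYTES 128u

/* Hypothetical per-line flush hook; stands in for a real flush primitive. */
typedef void (*line_flush_fn)(uintptr_t line_addr, void *ctx);

/*
 * Visit every cache line that overlaps [start, start + len) exactly once,
 * in ascending order. Returns the number of lines visited.
 * The caller must ensure start + len does not wrap around.
 */
static size_t flush_cpu_cache_range(uintptr_t start, size_t len,
                                    line_flush_fn flush_line, void *ctx)
{
    const uintptr_t mask = (uintptr_t)CACHE_LINE_BYTES - 1;
    uintptr_t first, last, addr;
    size_t visited = 0;

    assert(flush_line != NULL);
    if (len == 0)
        return 0;

    first = start & ~mask;             /* line holding the first byte */
    last = (start + len - 1) & ~mask;  /* line holding the last byte */

    for (addr = first; ; addr += CACHE_LINE_BYTES) {
        flush_line(addr, ctx);
        visited++;
        if (addr == last)
            break;
    }
    return visited;
}

/* Stub hook for experiments: records how many lines were "flushed". */
struct flush_log {
    size_t lines;
    uintptr_t last_line;
};

static void count_line(uintptr_t line_addr, void *ctx)
{
    struct flush_log *log = ctx;

    log->lines++;
    log->last_line = line_addr;
}
```

The point of the sketch is only that the flush is driven from the CPU side over the address range that maps GPU RAM; a GPU reset on its own does not touch those lines.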
> >>>>>>> On Monday, 15 October 2018 6:17:51 PM AEDT Alexey Kardashevskiy wrote:
> >>>>>>>> Ping?
> >>>>>>>>
> >>>>>>>> On 02/10/2018 13:20, Alexey Kardashevskiy wrote:
> >>>>>>>>> The skiboot firmware has a hot reset handler which fences the NVIDIA V100
> >>>>>>>>> GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
> >>>>>>>>> https://github.com/open-power/skiboot/commit/fca2b2b839a67
> >>>>>>>>>
> >>>>>>>>> Now we are going to pass V100 via VFIO which most certainly involves
> >>>>>>>>> KVM guests which are often terminated without getting a chance to offline
> >>>>>>>>> GPU RAM so we end up with a running machine with misconfigured memory.
> >>>>>>>>> Accessing this memory produces hardware management interrupts (HMI)
> >>>>>>>>> which bring the host down.
> >>>>>>>>>
> >>>>>>>>> To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
> >>>>>>>>> via pci_disable_device() which switches NPU2 to a safe mode and prevents
> >>>>>>>>> HMIs.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs•ru>
> >>>>>>>>> ---
> >>>>>>>>> Changes:
> >>>>>>>>> v2:
> >>>>>>>>> * updated the commit log
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++++++++
> >>>>>>>>> 1 file changed, 10 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>>>> index cde7102..e37b9cc 100644
> >>>>>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>>>> @@ -3688,6 +3688,15 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
> >>>>>>>>>  	pnv_ioda_release_pe(pe);
> >>>>>>>>>  }
> >>>>>>>>>
> >>>>>>>>> +static void pnv_npu_disable_device(struct pci_dev *pdev)
> >>>>>>>>> +{
> >>>>>>>>> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> >>>>>>>>> +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> >>>>>>>>> +
> >>>>>>>>> +	if (eehpe && eeh_ops && eeh_ops->reset)
> >>>>>>>>> +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
> >>>>>>>>>  {
> >>>>>>>>>  	struct pnv_phb *phb = hose->private_data;
> >>>>>>>>> @@ -3732,6 +3741,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
> >>>>>>>>>  	.reset_secondary_bus = pnv_pci_reset_secondary_bus,
> >>>>>>>>>  	.dma_set_mask = pnv_npu_dma_set_mask,
> >>>>>>>>>  	.shutdown = pnv_pci_ioda_shutdown,
> >>>>>>>>> +	.disable_device = pnv_npu_disable_device,
> >>>>>>>>>  };
> >>>>>>>>>
> >>>>>>>>>  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
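
For readers less familiar with the pattern in the quoted patch, the following is a simplified userspace model of how an optional disable_device hook in a controller ops table can forward to an optional reset callback. Every model_* name, the RESET_* constants and the merged device/PE structures are stand-ins invented for this sketch; only the shape of the NULL checks mirrors pnv_npu_disable_device() above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's reset options (only "hot" is modelled). */
enum { RESET_NONE = 0, RESET_HOT = 1 };

struct model_pe {
    int last_reset;             /* records the last reset option seen */
};

struct model_dev {
    struct model_pe *pe;        /* may be NULL: device without a PE */
};

/* Model of an EEH-style ops table with an optional reset callback. */
struct model_eeh_ops {
    int (*reset)(struct model_pe *pe, int option);
};

/* Model of a controller ops table with an optional disable hook. */
struct model_controller_ops {
    void (*disable_device)(struct model_dev *dev);
};

/* May be NULL, as on a platform that registers no such ops. */
static struct model_eeh_ops *model_eeh_ops;

/* Stand-in for the firmware-backed reset: just remembers the option. */
static int model_fw_reset(struct model_pe *pe, int option)
{
    assert(pe != NULL);
    pe->last_reset = option;
    return 0;
}

/* Same guard structure as the patch: act only if every link exists. */
static void model_npu_disable_device(struct model_dev *dev)
{
    struct model_pe *pe = dev ? dev->pe : NULL;

    if (pe && model_eeh_ops && model_eeh_ops->reset)
        model_eeh_ops->reset(pe, RESET_HOT);
}

/* Generic disable path: calls the controller hook only if one is set. */
static void model_disable_device(const struct model_controller_ops *ops,
                                 struct model_dev *dev)
{
    if (ops && ops->disable_device)
        ops->disable_device(dev);
}
```

With the hook left unset, or with no reset callback registered, the disable path in this model is a no-op, which is why adding the hook to one ops table does not affect other controllers.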