From: David Gibson <david@gibson•dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs•ru>
Cc: Alex Williamson <alex.williamson@redhat•com>,
linuxppc-dev@lists•ozlabs.org, Paul Mackerras <paulus@samba•org>
Subject: Re: [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
Date: Mon, 15 Aug 2016 21:07:08 +1000 [thread overview]
Message-ID: <20160815110708.GC27484@voom.fritz.box> (raw)
In-Reply-To: <a80d4f1d-ab5b-6a31-07b1-4be92d0af7bc@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 13077 bytes --]
On Fri, Aug 12, 2016 at 04:12:17PM +1000, Alexey Kardashevskiy wrote:
> On 12/08/16 15:46, David Gibson wrote:
> > On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
> >> On Wed, 10 Aug 2016 15:37:17 +1000
> >> Alexey Kardashevskiy <aik@ozlabs•ru> wrote:
> >>
> >>> On 09/08/16 22:16, Alex Williamson wrote:
> >>>> On Tue, 9 Aug 2016 15:19:39 +1000
> >>>> Alexey Kardashevskiy <aik@ozlabs•ru> wrote:
> >>>>
> >>>>> On 09/08/16 02:43, Alex Williamson wrote:
> >>>>>> On Wed, 3 Aug 2016 18:40:55 +1000
> >>>>>> Alexey Kardashevskiy <aik@ozlabs•ru> wrote:
> >>>>>>
> >>>>>>> This exports helpers which are needed to keep a VFIO container in
> >>>>>>> memory while there are external users such as KVM.
> >>>>>>>
> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs•ru>
> >>>>>>> ---
> >>>>>>> drivers/vfio/vfio.c | 30 ++++++++++++++++++++++++++++++
> >>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
> >>>>>>> include/linux/vfio.h | 6 ++++++
> >>>>>>> 3 files changed, 51 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>>>>> index d1d70e0..baf6a9c 100644
> >>>>>>> --- a/drivers/vfio/vfio.c
> >>>>>>> +++ b/drivers/vfio/vfio.c
> >>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
> >>>>>>> EXPORT_SYMBOL_GPL(vfio_external_check_extension);
> >>>>>>>
> >>>>>>> /**
> >>>>>>> + * External user API for containers, exported by symbols to be linked
> >>>>>>> + * dynamically.
> >>>>>>> + *
> >>>>>>> + */
> >>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
> >>>>>>> +{
> >>>>>>> + struct vfio_container *container = filep->private_data;
> >>>>>>> +
> >>>>>>> + if (filep->f_op != &vfio_fops)
> >>>>>>> + return ERR_PTR(-EINVAL);
> >>>>>>> +
> >>>>>>> + vfio_container_get(container);
> >>>>>>> +
> >>>>>>> + return container;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
> >>>>>>> +
> >>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
> >>>>>>> +{
> >>>>>>> + vfio_container_put(container);
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
> >>>>>>> +
> >>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
> >>>>>>> +{
> >>>>>>> + return container->iommu_data;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
> >>>>>>> +
> >>>>>>> +/**
> >>>>>>> * Sub-module support
> >>>>>>> */
> >>>>>>> /*
> >>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> index 3594ad3..fceea3d 100644
> >>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> >>>>>>> .detach_group = tce_iommu_detach_group,
> >>>>>>> };
> >>>>>>>
> >>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
> >>>>>>> + u64 offset)
> >>>>>>> +{
> >>>>>>> + struct tce_container *container = iommu_data;
> >>>>>>> + struct iommu_table *tbl = NULL;
> >>>>>>> +
> >>>>>>> + if (tce_iommu_find_table(container, offset, &tbl) < 0)
> >>>>>>> + return NULL;
> >>>>>>> +
> >>>>>>> + iommu_table_get(tbl);
> >>>>>>> +
> >>>>>>> + return tbl;
> >>>>>>> +}
> >>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
> >>>>>>> +
> >>>>>>> static int __init tce_iommu_init(void)
> >>>>>>> {
> >>>>>>> return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> >>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
> >>>>>>> MODULE_LICENSE("GPL v2");
> >>>>>>> MODULE_AUTHOR(DRIVER_AUTHOR);
> >>>>>>> MODULE_DESCRIPTION(DRIVER_DESC);
> >>>>>>> -
> >>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>>> index 0ecae0b..1c2138a 100644
> >>>>>>> --- a/include/linux/vfio.h
> >>>>>>> +++ b/include/linux/vfio.h
> >>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
> >>>>>>> extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >>>>>>> extern long vfio_external_check_extension(struct vfio_group *group,
> >>>>>>> unsigned long arg);
> >>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
> >>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
> >>>>>>> +extern void *vfio_container_get_iommu_data_ext(
> >>>>>>> + struct vfio_container *container);
> >>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
> >>>>>>> + void *iommu_data, u64 offset);
> >>>>>>>
> >>>>>>> /*
> >>>>>>> * Sub-module helpers
> >>>>>>
> >>>>>>
> >>>>>> I think you need to take a closer look of the lifecycle of a container,
> >>>>>> having a reference means the container itself won't go away, but only
> >>>>>> having a group set within that container holds the actual IOMMU
> >>>>>> references. container->iommu_data is going to be NULL once the
> >>>>>> groups are lost. Thanks,
> >>>>>
> >>>>>
> >>>>> Container owns the iommu tables and this is what I care about here, groups
> >>>>> attached or not - this is handled separately via IOMMU group list in a
> >>>>> specific iommu_table struct, these groups get detached from iommu_table
> >>>>> when they are removed from a container.
> >>>>
> >>>> The container doesn't own anything, the container is privileged by the
> >>>> groups being attached to it. When groups are closed, they detach from
> >>>> the container and once the container group list is empty the iommu
> >>>> backend is released and iommu_data is NULL. A container reference
> >>>> doesn't give you what you're looking for. It implies nothing about the
> >>>> iommu backend.
> >>>
> >>>
> >>> Well. Backend is a part of a container and since a backend owns tables, a
> >>> container owns them too.
> >>
> >> The IOMMU backend is accessed through the container, but that backend
> >> is privileged by the groups it contains. Once those groups are gone,
> >> the IOMMU backend is released, regardless of whatever reference you
> >> have to the container itself such as you're attempting to do here. In
> >> that sense, the container does not own those tables.
> >
> > So, the thing is that what KVM fundamentally needs is a handle on the
> > container. KVM is essentially modelling the DMA address space of a
> > single guest bus, and the container is what's attached to that.
> >
> > The first part of the problem is that KVM wants to basically invoke
> > vfio_dma_map() operations without bouncing via qemu. Because
> > vfio_dma_map() works on the container level, that's the handle that
> > KVM needs to hold.
>
>
> Well, I do not need to hold the reference to the container all the time, I
> just need it to get to the IOMMU backend, get+reference an iommu_table from
> it, referencing here helps to make sure the backend is not going away
> before we reference iommu_table.
Yes, but I don't see a compelling reason *not* to hold the container
reference either - it seems like principle of least surprise would
suggest retaining the reference.
For example, I can imagine having a container reset call which threw
away the back end iommu table and created a new one. It seems like
what you'd expect in this case is for the guest bus to remain bound to
the same container, not to the now stale iommu table.
> After that I only keep a reference to the container to know if/when I can
> release a particular iommu_table. This is can workaround by counting how
> many groups were attached to this particular KVM-spapt-tce-table and
> looking at the IOMMU group list attached to an iommu_table - if the list is
> empty, decrement the iommu_table reference counter and that's it, no extra
> references to a VFIO container.
>
> Or I need an alternative way of getting iommu_table's, i.e. QEMU should
> somehow tell KVM that this LIOBN is this VFIO container fd (easy - can be
> done via region_add/region_del interface)
Um.. yes.. that's what I was expecting, I thought that was what you
were doing.x
> or VFIO IOMMU group fd(s) (more
> tricky as this needs to be done from more places - vfio-pci hotplug/unplug,
> window add/remove).
More tricky and also wrong. Again, having one group but not the whole
container bound to the guest LIOBN doesn't make any sense - by
definition, all the devices in the container should share the same DMA
address space.
> > The second part of the problem is that in order to reduce overhead
> > further, we want to operate in real mode, which means bypassing most
> > of the usual VFIO structure and going directly(ish) from the KVM
> > hcall emulation to the IOMMU backend behind VFIO. This complicates
> > matters a fair bit. Because it is, explicitly, a performance hack,
> > some degree of ugliness is probably inevitable.
> >
> > Alexey - actually implementing this in two stages might make this
> > clearer. The first stage wouldn't allow real mode, and would call
> > through the same vfio_dma_map() path as qemu calls through now. The
> > second stage would then put in place the necessary hacks to add real
> > mode support.
> >
> >>> The problem I am trying to solve here is when KVM may release the
> >>> iommu_table objects.
> >>>
> >>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
> >>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
> >>> start using tables (with referencing them).
> >>>
> >>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
> >>> from region_del() and this works if QEMU removes a window. However if QEMU
> >>> removes a vfio-pci device, region_del() is not called and KVM does not get
> >>> notified that it can release the iommu_table's because the
> >>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
> >>> still used by emulated devices or other containers).
> >>>
> >>> So it was suggested that we could do such "unset" somehow later assuming,
> >>> for example, on every "set" I could check if some of currently attached
> >>> containers are no more used - and this is where being able to know if there
> >>> is no backend helps - KVM remembers a container pointer and can check this
> >>> via vfio_container_get_iommu_data_ext().
> >>>
> >>> The other option would be changing vfio_container_get_ext() to take a
> >>> callback+opaque which container would call when it destroys iommu_data.
> >>> This looks more intrusive and not very intuitive how to make it right -
> >>> container would have to keep track of all registered external users and
> >>> vfio_container_put_ext() would have to pass the same callback+opaque to
> >>> unregister the exact external user.
> >>
> >> I'm not in favor of anything resembling the code above or extensions
> >> beyond it, the container is the wrong place to do this.
> >>
> >>> Or I could store container file* in KVM. Then iommu_data would never be
> >>> released until KVM-spapr-tce-table is destroyed.
> >>
> >> See above, holding a file pointer to the container doesn't do squat.
> >> The groups that are held by the container empower the IOMMU backend,
> >> references to the container itself don't matter. Those references will
> >> not maintain the IOMMU data.
> >>
> >>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
> >>> would "unset" container from KVM-spapr-tce-table) is not an option as there
> >>> still may be devices using this KVM-spapr-tce-table.
> >>>
> >>> What obvious and nice solution am I missing here? Thanks.
> >>
> >> The interactions with the IOMMU backend that seem relevant are
> >> vfio_iommu_drivers_ops.{detach_group,release}. The kvm-vfio pseudo
> >> device is also used to tell kvm about groups as they come and go and
> >> has a way to check extensions, and thus properties of the IOMMU
> >> backend. All of these are available for your {ab}use. Thanks,
> >
> > So, Alexey started trying to do this via the KVM-VFIO device, but it's
> > a really bad fit. As noted above, fundamentally it's a container we
> > need to attach to the kvm-spapr-tce-table object, since what that
> > represents is a guest bus DMA address space, and by definition all the
> > groups in a container must have the same DMA address space.
>
>
> Well, in a bad case a LIOBN/kvm-spapr-tce-table has multiple containers
> attached so it is not 1:1...
I never said it was. It's n:1, but it's *not* n:m. You can have
multiple containers to a LIOBN, but never multiple LIOBNs ot a
container.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
next prev parent reply other threads:[~2016-08-15 11:52 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-08-03 8:40 [PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2016-08-03 8:40 ` [PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id" Alexey Kardashevskiy
2016-08-15 4:58 ` Paul Mackerras
2016-08-03 8:40 ` [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER Alexey Kardashevskiy
2016-08-04 5:21 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2016-08-03 8:40 ` [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again Alexey Kardashevskiy
2016-08-04 5:23 ` David Gibson
2016-08-09 11:26 ` [kernel, " Michael Ellerman
2016-08-03 8:40 ` [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
2016-08-03 10:10 ` Nicholas Piggin
2016-08-05 7:00 ` Michael Ellerman
2016-08-09 5:29 ` Alexey Kardashevskiy
2016-08-09 4:43 ` Balbir Singh
2016-08-09 6:04 ` Nicholas Piggin
2016-08-09 6:17 ` Balbir Singh
2016-08-12 2:57 ` David Gibson
2016-08-12 4:56 ` Alexey Kardashevskiy
2016-08-15 10:58 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit Alexey Kardashevskiy
2016-08-03 10:11 ` Nicholas Piggin
2016-08-12 3:13 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal Alexey Kardashevskiy
2016-08-12 3:18 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2016-08-12 3:29 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2016-08-12 4:02 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2016-08-12 4:17 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2016-08-12 4:29 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2016-08-12 4:34 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2016-08-12 4:45 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users Alexey Kardashevskiy
2016-08-08 16:43 ` Alex Williamson
2016-08-09 5:19 ` Alexey Kardashevskiy
2016-08-09 12:16 ` Alex Williamson
2016-08-10 5:37 ` Alexey Kardashevskiy
2016-08-10 16:46 ` Alex Williamson
2016-08-12 5:46 ` David Gibson
2016-08-12 6:12 ` Alexey Kardashevskiy
2016-08-15 11:07 ` David Gibson [this message]
2016-08-17 8:31 ` Alexey Kardashevskiy
2016-08-12 15:22 ` Alex Williamson
2016-08-17 3:17 ` David Gibson
2016-08-18 0:22 ` Alexey Kardashevskiy
2016-08-29 6:35 ` Alexey Kardashevskiy
2016-08-29 13:27 ` David Gibson
2016-09-07 9:09 ` Alexey Kardashevskiy
2016-09-21 6:56 ` Alexey Kardashevskiy
2016-09-23 7:12 ` David Gibson
2016-10-17 6:06 ` Alexey Kardashevskiy
2016-10-18 1:42 ` David Gibson
2016-08-15 3:59 ` Paul Mackerras
2016-08-15 15:32 ` Alex Williamson
2016-08-12 5:25 ` David Gibson
2016-08-03 8:40 ` [PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160815110708.GC27484@voom.fritz.box \
--to=david@gibson$(echo .)dropbear.id.au \
--cc=aik@ozlabs$(echo .)ru \
--cc=alex.williamson@redhat$(echo .)com \
--cc=linuxppc-dev@lists$(echo .)ozlabs.org \
--cc=paulus@samba$(echo .)org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox