Up to now, vSMMUv3 has not been integrated with VFIO. VFIO
integration requires programming the physical IOMMU consistently
with the guest mappings. However, as opposed to VT-d, SMMUv3 has
no "Caching Mode" that allows easy trapping of guest mappings.
This means the vSMMUv3 cannot use the same VFIO integration as VT-d.
However, SMMUv3 has two translation stages. This was devised with
the virtualization use case in mind, where stage 1 is "owned" by the
guest whereas the host uses stage 2 for VM isolation.
This series sets up this nested translation stage. It only works
if there is one physical SMMUv3 used along with the QEMU vSMMUv3 (in
other words, it does not work if there is a physical SMMUv2).
The series uses a new kernel user API [1], still under definition.
- We force the host to use stage 2 instead of stage 1 when we
detect that a VFIO device is translated by a vSMMUv3 (a condensed
sketch of this detection follows this list). For a VFIO device
without any virtual IOMMU, we still use stage 1, as many existing
SMMUs expect this behavior.
- We introduce new IOTLB "config" notifiers, used to signal
changes in the config of a given IOMMU memory region. So now
we have notifiers for both IOTLB changes and config changes.
- vSMMUv3 calls the config notifiers when STEs (Stream Table
Entries) are updated by the guest.
- We implement a specific UNMAP notifier that conveys guest
IOTLB invalidations to the host.
- We implement a new MAP notifier, used only for MSI IOVAs, so
that the host can build a nested stage translation for MSI IOVAs.
- As the legacy MAP notifier is not called anymore, we must make
sure stage 2 mappings are set. This is achieved through another
memory listener.
- Physical SMMU faults are reported to QEMU via an eventfd
mechanism and reinjected into the guest.
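
To make the first bullet concrete, here is a condensed sketch of the
detection logic, based on patch 08 quoted later in this thread; the
helper name is ours for illustration, not part of the posted series:

/*
 * Condensed sketch of patch 08 ("hw/vfio/common: Force nested if
 * iommu requires it"); the helper name is illustrative only.
 */
static bool vfio_requires_nested(AddressSpace *as)
{
    bool force_nested = false;

    /* only an IOMMU-backed address space can request nested stages */
    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {
        IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(as->root);

        /* the vSMMUv3 answers true; the Intel vIOMMU does not */
        memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
                                     (void *)&force_nested);
    }
    return force_nested;
}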
Note: The first patch is a code cleanup and was sent separately.
Best Regards
Eric
This series can be found at:
https://github.com/eauger/qemu/tree/v4.0.0-2stage-rfcv4
Compatible with kernel series:
[PATCH v8 00/29] SMMUv3 Nested Stage Setup
(https://lkml.org/lkml/2019/5/26/95)
History:
v3 -> v4:
- adapt to changes in uapi (asid cache invalidation)
- check VFIO_PCI_DMA_FAULT_IRQ_INDEX is supported at kernel level
before attempting to set signaling for it.
- sync on 5.2-rc1 kernel headers + Drew's patch that imports sve_context.h
- fix MSI binding for plain MSI (as opposed to MSI-X)
- fix mingw compilation
v2 -> v3:
- rework fault handling
- MSI binding registration done in vfio-pci. MSI binding tear down called
on container cleanup path
- leaf parameter propagated
v1 -> v2:
- Fixed dual assignment (asid now correctly propagated on TLB invalidations)
- Integrated fault reporting
Andrew Jones (1):
update-linux-headers: Add sve_context.h to asm-arm64
Eric Auger (26):
vfio/common: Introduce vfio_set_irq_signaling helper
update-linux-headers: Import iommu.h
header update against 5.2.0-rc1 and IOMMU/VFIO nested stage APIs
memory: add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute
memory: add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
hw/arm/smmuv3: Advertise VFIO_NESTED and MSI_TRANSLATE attributes
hw/vfio/common: Force nested if iommu requires it
memory: Prepare for different kinds of IOMMU MR notifiers
memory: Add IOMMUConfigNotifier
memory: Add arch_id and leaf fields in IOTLBEntry
hw/arm/smmuv3: Store the PASID table GPA in the translation config
hw/arm/smmuv3: Implement dummy replay
hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
hw/arm/smmuv3: Fill the IOTLBEntry leaf field on NH_VA invalidation
hw/arm/smmuv3: Notify on config changes
hw/vfio/common: Introduce vfio_alloc_guest_iommu helper
hw/vfio/common: Introduce hostwin_from_range helper
hw/vfio/common: Introduce helpers to DMA map/unmap a RAM section
hw/vfio/common: Setup nested stage mappings
hw/vfio/common: Register a MAP notifier for MSI binding
vfio-pci: Expose MSI stage 1 bindings to the host
memory: Introduce IOMMU Memory Region inject_faults API
hw/arm/smmuv3: Implement fault injection
vfio-pci: register handler for iommu fault
vfio-pci: Set up fault regions
vfio-pci: Implement the DMA fault handler
exec.c | 12 +-
hw/arm/smmu-common.c | 10 +-
hw/arm/smmuv3.c | 198 +++++++++--
hw/arm/trace-events | 3 +-
hw/i386/amd_iommu.c | 2 +-
hw/i386/intel_iommu.c | 25 +-
hw/misc/tz-mpc.c | 8 +-
hw/ppc/spapr_iommu.c | 2 +-
hw/s390x/s390-pci-inst.c | 4 +-
hw/vfio/common.c | 572 ++++++++++++++++++++++++++------
hw/vfio/pci.c | 471 ++++++++++++++++----------
hw/vfio/pci.h | 4 +
hw/vfio/platform.c | 54 ++-
hw/vfio/trace-events | 8 +-
hw/virtio/vhost.c | 14 +-
include/exec/memory.h | 158 +++++++--
include/hw/arm/smmu-common.h | 1 +
include/hw/vfio/vfio-common.h | 10 +
linux-headers/linux/iommu.h | 280 ++++++++++++++++
linux-headers/linux/vfio.h | 107 ++++++
memory.c | 67 +++-
scripts/update-linux-headers.sh | 5 +-
22 files changed, 1593 insertions(+), 422 deletions(-)
create mode 100644 linux-headers/linux/iommu.h
--
2.20.1
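[Qemu-devel] [RFC v4 05/27] memory: add IOMMU_ATTR_VFIO_NESTED IOMMU memory region attribute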
We introduce a new IOMMU Memory Region attribute, IOMMU_ATTR_VFIO_NESTED,
which tells whether the virtual IOMMU requires physical nested
stages for VFIO integration. The Intel virtual IOMMU supports Caching
Mode and does not require two stages at the physical level. The virtual
ARM SMMU, however, does not implement such a caching mode and requires
the use of the physical stage 1 for VFIO integration.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
include/exec/memory.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9144a47f57..352a00169f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -204,7 +204,8 @@ struct MemoryRegionOps {
};
enum IOMMUMemoryRegionAttr {
-    IOMMU_ATTR_SPAPR_TCE_FD
+    IOMMU_ATTR_SPAPR_TCE_FD,
+    IOMMU_ATTR_VFIO_NESTED,
 };
/**
--
2.20.1
[Qemu-devel] [RFC v4 06/27] memory: add IOMMU_ATTR_MSI_TRANSLATE IOMMU memory region attribute
We introduce a new IOMMU Memory Region attribute, IOMMU_ATTR_MSI_TRANSLATE,
which tells whether the virtual IOMMU translates MSIs. The ARM SMMU
exposes this attribute since, as opposed to Intel DMAR, MSIs
are translated like any other DMA request.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
include/exec/memory.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 352a00169f..146a6096da 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -206,6 +206,7 @@ struct MemoryRegionOps {
enum IOMMUMemoryRegionAttr {
IOMMU_ATTR_SPAPR_TCE_FD,
IOMMU_ATTR_VFIO_NESTED,
+    IOMMU_ATTR_MSI_TRANSLATE,
 };
/**
--
2.20.1
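
For context, the SMMUv3 side (patch 07, whose hunk is not quoted in
this excerpt) essentially answers these two queries in its get_attr
hook. A minimal sketch of what that change amounts to:

/*
 * Minimal sketch of what patch 07 adds to smmuv3_get_attr(): the
 * vSMMUv3 requires nested stages and translates MSIs, so it answers
 * true for both attributes introduced above.
 */
static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
                           enum IOMMUMemoryRegionAttr attr,
                           void *data)
{
    if (attr == IOMMU_ATTR_VFIO_NESTED ||
        attr == IOMMU_ATTR_MSI_TRANSLATE) {
        *(bool *)data = true;
        return 0;
    }
    return -EINVAL;
}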
[Qemu-devel] [RFC v4 04/27] header update against 5.2.0-rc1 and IOMMU/VFIO nested stage APIs
This is an update against the following development branch:
https://github.com/eauger/linux/tree/v5.2.0-rc1-2stage-v8.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 linux-headers/linux/iommu.h | 280 ++++++++++++++++++++++++++++++++++++
 linux-headers/linux/vfio.h  | 107 ++++++++++++++
2 files changed, 387 insertions(+)
create mode 100644 linux-headers/linux/iommu.h
diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 0000000000..0a59d6439c
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,280 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+#define IOMMU_FAULT_PERM_READ   (1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE  (1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC   (1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV   (1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+        IOMMU_FAULT_DMA_UNRECOV = 1,    /* unrecoverable fault */
+        IOMMU_FAULT_PAGE_REQ,           /* page request fault */
+};
+
+enum iommu_fault_reason {
+        IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+        /* Could not access the PASID table (fetch caused external abort) */
+        IOMMU_FAULT_REASON_PASID_FETCH,
+
+        /* PASID entry is invalid or has configuration errors */
+        IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+        /*
+         * PASID is out of range (e.g. exceeds the maximum PASID
+         * supported by the IOMMU) or disabled.
+         */
+        IOMMU_FAULT_REASON_PASID_INVALID,
+
+        /*
+         * An external abort occurred fetching (or updating) a translation
+         * table descriptor
+         */
+        IOMMU_FAULT_REASON_WALK_EABT,
+
+        /*
+         * Could not access the page table entry (Bad address),
+         * actual translation fault
+         */
+        IOMMU_FAULT_REASON_PTE_FETCH,
+
+        /* Protection flag check failed */
+        IOMMU_FAULT_REASON_PERMISSION,
+
+        /* access flag check failed */
+        IOMMU_FAULT_REASON_ACCESS,
+
+        /* Output address of a translation stage caused Address Size fault */
+        IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from &enum iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: Requested permission access using by the incoming transaction
+ *        (IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+        __u32   reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID         (1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID          (1 << 1)
+#define IOMMU_FAULT_UNRECOV_FETCH_ADDR_VALID    (1 << 2)
+        __u32   flags;
+        __u32   pasid;
+        __u32   perm;
+        __u64   addr;
+        __u64   fetch_addr;
+};
+
+/**
+ * struct iommu_fault_page_request - Page Request data
+ * @flags: encodes whether the corresponding fields are valid and whether this
+ *         is the last page in group (IOMMU_FAULT_PAGE_REQUEST_* values)
+ * @pasid: Process Address Space ID
+ * @grpid: Page Request Group Index
+ * @perm: requested page permissions (IOMMU_FAULT_PERM_* values)
+ * @addr: page address
+ * @private_data: device-specific private information
+ */
+struct iommu_fault_page_request {
+#define IOMMU_FAULT_PAGE_REQUEST_PASID_VALID    (1 << 0)
+#define IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE      (1 << 1)
+#define IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA      (1 << 2)
+        __u32   flags;
+        __u32   pasid;
+        __u32   grpid;
+        __u32   perm;
+        __u64   addr;
+        __u64   private_data[2];
+};
+
+/**
+ * struct iommu_fault - Generic fault data
+ * @type: fault type from &enum iommu_fault_type
+ * @padding: reserved for future use (should be zero)
+ * @event: Fault event, when @type is %IOMMU_FAULT_DMA_UNRECOV
+ * @prm: Page Request message, when @type is %IOMMU_FAULT_PAGE_REQ
+ */
+struct iommu_fault {
+        __u32   type;
+        __u32   padding;
+        union {
+                struct iommu_fault_unrecoverable event;
+                struct iommu_fault_page_request prm;
+        };
+};
+
+/**
+ * struct iommu_pasid_smmuv3 - ARM SMMUv3 Stream Table Entry stage 1 related
+ *     information
+ * @version: API version of this structure
+ * @s1fmt: STE s1fmt (format of the CD table: single CD, linear table
+ *         or 2-level table)
+ * @s1dss: STE s1dss (specifies the behavior when @pasid_bits != 0
+ *         and no PASID is passed along with the incoming transaction)
+ * @padding: reserved for future use (should be zero)
+ *
+ * The PASID table is referred to as the Context Descriptor (CD) table on ARM
+ * SMMUv3. Please refer to the ARM SMMU 3.x spec (ARM IHI 0070A) for full
+ * details.
+ */
+struct iommu_pasid_smmuv3 {
+#define PASID_TABLE_SMMUV3_CFG_VERSION_1        1
+        __u32   version;
+        __u8    s1fmt;
+        __u8    s1dss;
+        __u8    padding[2];
+};
+
+/**
+ * struct iommu_pasid_table_config - PASID table data used to bind guest PASID
+ *     table to the host IOMMU
+ * @version: API version to prepare for future extensions
+ * @format: format of the PASID table
+ * @base_ptr: guest physical address of the PASID table
+ * @pasid_bits: number of PASID bits used in the PASID table
+ * @config: indicates whether the guest translation stage must
+ *          be translated, bypassed or aborted.
+ * @padding: reserved for future use (should be zero)
+ * @smmuv3: table information when @format is %IOMMU_PASID_FORMAT_SMMUV3
+ */
+struct iommu_pasid_table_config {
+#define PASID_TABLE_CFG_VERSION_1       1
+        __u32   version;
+#define IOMMU_PASID_FORMAT_SMMUV3       1
+        __u32   format;
+        __u64   base_ptr;
+        __u8    pasid_bits;
+#define IOMMU_PASID_CONFIG_TRANSLATE    1
+#define IOMMU_PASID_CONFIG_BYPASS       2
+#define IOMMU_PASID_CONFIG_ABORT        3
+        __u8    config;
+        __u8    padding[6];
+        union {
+                struct iommu_pasid_smmuv3 smmuv3;
+        };
+};
+
+/* defines the granularity of the invalidation */
+enum iommu_inv_granularity {
+        IOMMU_INV_GRANU_DOMAIN, /* domain-selective invalidation */
+        IOMMU_INV_GRANU_PASID,  /* PASID-selective invalidation */
+        IOMMU_INV_GRANU_ADDR,   /* page-selective invalidation */
+        IOMMU_INV_GRANU_NR,     /* number of invalidation granularities */
+};
+
+/**
+ * struct iommu_inv_addr_info - Address Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the address-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If ARCHID bit is set, @archid is populated and the invalidation relates
+ *   to cache entries tagged with this architecture specific ID and matching
+ *   the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - If neither PASID or ARCHID is set, global addr invalidation applies.
+ * - The LEAF flag indicates whether only the leaf PTE caching needs to be
+ *   invalidated and other paging structure caches can be preserved.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ * @addr: first stage/level input address
+ * @granule_size: page/block size of the mapping in bytes
+ * @nb_granules: number of contiguous granules to be invalidated
+ */
+struct iommu_inv_addr_info {
+#define IOMMU_INV_ADDR_FLAGS_PASID      (1 << 0)
+#define IOMMU_INV_ADDR_FLAGS_ARCHID     (1 << 1)
+#define IOMMU_INV_ADDR_FLAGS_LEAF       (1 << 2)
+        __u32   flags;
+        __u32   archid;
+        __u64   pasid;
+        __u64   addr;
+        __u64   granule_size;
+        __u64   nb_granules;
+};
+
+/**
+ * struct iommu_inv_pasid_info - PASID Selective Invalidation Structure
+ *
+ * @flags: indicates the granularity of the PASID-selective invalidation
+ * - If the PASID bit is set, the @pasid field is populated and the invalidation
+ *   relates to cache entries tagged with this PASID and matching the address
+ *   range.
+ * - If the ARCHID bit is set, the @archid is populated and the invalidation
+ *   relates to cache entries tagged with this architecture specific ID and
+ *   matching the address range.
+ * - Both PASID and ARCHID can be set as they may tag different caches.
+ * - At least one of PASID or ARCHID must be set.
+ * @pasid: process address space ID
+ * @archid: architecture-specific ID
+ */
+struct iommu_inv_pasid_info {
+#define IOMMU_INV_PASID_FLAGS_PASID     (1 << 0)
+#define IOMMU_INV_PASID_FLAGS_ARCHID    (1 << 1)
+        __u32   flags;
+        __u32   archid;
+        __u64   pasid;
+};
+
+/**
+ * struct iommu_cache_invalidate_info - First level/stage invalidation
+ *     information
+ * @version: API version of this structure
+ * @cache: bitfield that allows to select which caches to invalidate
+ * @granularity: defines the lowest granularity used for the invalidation:
+ *     domain > PASID > addr
+ * @padding: reserved for future use (should be zero)
+ * @pasid_info: invalidation data when @granularity is %IOMMU_INV_GRANU_PASID
+ * @addr_info: invalidation data when @granularity is %IOMMU_INV_GRANU_ADDR
+ *
+ * Not all the combinations of cache/granularity are valid:
+ *
+ * +--------------+---------------+---------------+---------------+
+ * | type /       |   DEV_IOTLB   |     IOTLB     |      PASID    |
+ * | granularity  |               |               |      cache    |
+ * +==============+===============+===============+===============+
+ * | DOMAIN       |      N/A      |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | PASID        |       Y       |       Y       |       Y       |
+ * +--------------+---------------+---------------+---------------+
+ * | ADDR         |       Y       |       Y       |      N/A      |
+ * +--------------+---------------+---------------+---------------+
+ *
+ * Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
+ * @version and @cache.
+ *
+ * If multiple cache types are invalidated simultaneously, they all
+ * must support the used granularity.
+ */
+struct iommu_cache_invalidate_info {
+#define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
+        __u32   version;
+/* IOMMU paging structure cache */
+#define IOMMU_CACHE_INV_TYPE_IOTLB      (1 << 0) /* IOMMU IOTLB */
+#define IOMMU_CACHE_INV_TYPE_DEV_IOTLB  (1 << 1) /* Device IOTLB */
+#define IOMMU_CACHE_INV_TYPE_PASID      (1 << 2) /* PASID cache */
+#define IOMMU_CACHE_INV_TYPE_NR         (3)
+        __u8    cache;
+        __u8    granularity;
+        __u8    padding[2];
+        union {
+                struct iommu_inv_pasid_info pasid_info;
+                struct iommu_inv_addr_info addr_info;
+        };
+};
+
+#endif /* _IOMMU_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 24f505199f..f8e355896c 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION 0
@@ -306,6 +307,10 @@ struct vfio_region_info_cap_type {
#define VFIO_REGION_TYPE_GFX (1)
#define VFIO_REGION_SUBTYPE_GFX_EDID (1)
+#define VFIO_REGION_TYPE_NESTED                 (2)
+#define VFIO_REGION_SUBTYPE_NESTED_FAULT_PROD   (1)
+#define VFIO_REGION_SUBTYPE_NESTED_FAULT_CONS   (2)
+
 /**
* struct vfio_region_gfx_edid - EDID region layout.
*
@@ -554,6 +559,7 @@ enum {
VFIO_PCI_MSIX_IRQ_INDEX,
VFIO_PCI_ERR_IRQ_INDEX,
VFIO_PCI_REQ_IRQ_INDEX,
+       VFIO_PCI_DMA_FAULT_IRQ_INDEX,
        VFIO_PCI_NUM_IRQS
};
@@ -700,6 +706,44 @@ struct vfio_device_ioeventfd {
#define VFIO_DEVICE_IOEVENTFD _IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/*
+ * Capability exposed by the Producer Fault Region
+ * @version: max fault ABI version supported by the kernel
+ */
+#define VFIO_REGION_INFO_CAP_PRODUCER_FAULT     6
+
+struct vfio_region_info_cap_fault {
+        struct vfio_info_cap_header header;
+        __u32 version;
+};
+
+/*
+ * Producer Fault Region (Read-Only from user space perspective)
+ * Contains the fault circular buffer and the producer index
+ * @version: version of the fault record uapi
+ * @entry_size: size of each fault record
+ * @offset: offset of the start of the queue
+ * @prod: producer index relative to the start of the queue
+ */
+struct vfio_region_fault_prod {
+        __u32   version;
+        __u32   nb_entries;
+        __u32   entry_size;
+        __u32   offset;
+        __u32   prod;
+};
+
+/*
+ * Consumer Fault Region (Write-Only from the user space perspective)
+ * @version: ABI version requested by the userspace
+ * @cons: consumer index relative to the start of the queue
+ */
+struct vfio_region_fault_cons {
+        __u32 version;
+        __u32 cons;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
/**
@@ -763,6 +807,69 @@ struct vfio_iommu_type1_dma_unmap {
#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
+/**
+ * VFIO_IOMMU_ATTACH_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *                      struct vfio_iommu_type1_attach_pasid_table)
+ *
+ * Passes the PASID table to the host. Calling ATTACH_PASID_TABLE
+ * while a table is already installed is allowed: it replaces the old
+ * table. DETACH does a comprehensive tear down of the nested mode.
+ */
+struct vfio_iommu_type1_attach_pasid_table {
+        __u32   argsz;
+        __u32   flags;
+        struct iommu_pasid_table_config config;
+};
+#define VFIO_IOMMU_ATTACH_PASID_TABLE   _IO(VFIO_TYPE, VFIO_BASE + 22)
+
+/**
+ * VFIO_IOMMU_DETACH_PASID_TABLE - _IOWR(VFIO_TYPE, VFIO_BASE + 23)
+ *
+ * Detaches the PASID table
+ */
+#define VFIO_IOMMU_DETACH_PASID_TABLE   _IO(VFIO_TYPE, VFIO_BASE + 23)
+
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 24,
+ *                      struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+        __u32   argsz;
+        __u32   flags;
+        struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE     _IO(VFIO_TYPE, VFIO_BASE + 24)
+
+/**
+ * VFIO_IOMMU_BIND_MSI - _IOWR(VFIO_TYPE, VFIO_BASE + 25,
+ *                      struct vfio_iommu_type1_bind_msi)
+ *
+ * Pass a stage 1 MSI doorbell mapping to the host so that the latter
+ * can build a nested stage 2 mapping
+ */
+struct vfio_iommu_type1_bind_msi {
+        __u32   argsz;
+        __u32   flags;
+        __u64   iova;
+        __u64   gpa;
+        __u64   size;
+};
+#define VFIO_IOMMU_BIND_MSI     _IO(VFIO_TYPE, VFIO_BASE + 25)
+
+/**
+ * VFIO_IOMMU_UNBIND_MSI - _IOWR(VFIO_TYPE, VFIO_BASE + 26,
+ *                      struct vfio_iommu_type1_unbind_msi)
+ *
+ * Unregister an MSI mapping
+ */
+struct vfio_iommu_type1_unbind_msi {
+        __u32   argsz;
+        __u32   flags;
+        __u64   iova;
+};
+#define VFIO_IOMMU_UNBIND_MSI   _IO(VFIO_TYPE, VFIO_BASE + 26)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
--
2.20.1
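
To illustrate how userspace consumes the uapi above, here is a hedged
sketch of propagating a guest ASID-selective IOTLB invalidation through
the new ioctl; the helper name and the container fd are illustrative
assumptions, only the struct fields and ioctl come from the patch:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>   /* pulls in linux/iommu.h with this series */

/* Sketch only: invalidate host IOTLB entries tagged with a guest ASID */
static int invalidate_asid(int container_fd, uint32_t asid)
{
    struct vfio_iommu_type1_cache_invalidate inv;

    memset(&inv, 0, sizeof(inv));
    inv.argsz = sizeof(inv);
    inv.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1;
    inv.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB;
    inv.info.granularity = IOMMU_INV_GRANU_PASID;
    /* on ARM the "archid" is the guest ASID tagging the IOTLB entries */
    inv.info.pasid_info.flags = IOMMU_INV_PASID_FLAGS_ARCHID;
    inv.info.pasid_info.archid = asid;

    return ioctl(container_fd, VFIO_IOMMU_CACHE_INVALIDATE, &inv);
}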
[Qemu-devel] [RFC v4 12/27] hw/arm/smmuv3: Store the PASID table GPA in the translation config
For VFIO integration we will need to pass the Context Descriptor (CD)
table GPA to the host. The CD table is also referred to as the PASID
table. Its GPA corresponds to the s1ctrptr field of the Stream Table
Entry. So let's decode and store it in the configuration structure.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
 hw/arm/smmuv3.c              | 1 +
 include/hw/arm/smmu-common.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 1744874e72..96d4147533 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -351,6 +351,7 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
"SMMUv3 S1 stalling fault model not allowed yet\n");
goto bad_ste;
}
+    cfg->s1ctxptr = STE_CTXPTR(ste);
 
     return 0;
bad_ste:
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 1f37844e5c..353668f4ea 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -68,6 +68,7 @@ typedef struct SMMUTransCfg {
uint8_t tbi; /* Top Byte Ignore */
uint16_t asid;
SMMUTransTableInfo tt[2];
+    dma_addr_t s1ctxptr;
     uint32_t iotlb_hits;    /* counts IOTLB hits for this asid */
uint32_t iotlb_misses; /* counts IOTLB misses for this asid */
} SMMUTransCfg;
--
2.20.1
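
To show where this field ends up, here is a hedged sketch of the
eventual VFIO_IOMMU_ATTACH_PASID_TABLE call built from the uapi in
patch 04; every value except base_ptr is an illustrative assumption,
and the helper name is ours:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: hand the guest CD (PASID) table base to the host */
static int attach_pasid_table(int container_fd, uint64_t s1ctxptr)
{
    struct vfio_iommu_type1_attach_pasid_table attach;

    memset(&attach, 0, sizeof(attach));
    attach.argsz = sizeof(attach);
    attach.config.version = PASID_TABLE_CFG_VERSION_1;
    attach.config.format = IOMMU_PASID_FORMAT_SMMUV3;
    attach.config.base_ptr = s1ctxptr;  /* STE.S1ContextPtr, a guest PA */
    attach.config.config = IOMMU_PASID_CONFIG_TRANSLATE;

    return ioctl(container_fd, VFIO_IOMMU_ATTACH_PASID_TABLE, &attach);
}

[Qemu-devel] [RFC v4 13/27] hw/arm/smmuv3: Implement dummy replay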
The default implementation of memory_region_iommu_replay() must
not be used, as it forces the translation of the whole RAM range.
The purpose of this function is to update the shadow page tables.
In the nested stage case, however, there is no shadow page table,
so we can simply return.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
hw/arm/smmuv3.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 96d4147533..8db605adab 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1507,6 +1507,11 @@ static int smmuv3_get_attr(IOMMUMemoryRegion *iommu,
return -EINVAL;
}
+static inline void
+smmuv3_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+}
+
 static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
void *data)
{
@@ -1515,6 +1520,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
imrc->translate = smmuv3_translate;
imrc->notify_flag_changed = smmuv3_notify_flag_changed;
imrc->get_attr = smmuv3_get_attr;
+    imrc->replay = smmuv3_replay;
 }
static const TypeInfo smmuv3_type_info = {
--
2.20.1
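[Qemu-devel] [RFC v4 11/27] memory: Add arch_id and leaf fields in IOTLBEntry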
TLB entries are usually tagged with IDs such as the ASID
or PASID. When propagating an invalidation command from the
guest to the host, we need to pass this ID along.
We also add a leaf field which indicates, in the case of an
invalidation notification, whether only cache entries for the
last level of translation need to be invalidated.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
include/exec/memory.h | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 701cb83367..9f107ebedb 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -69,12 +69,30 @@ typedef enum {
#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
+/**
+ * IOMMUTLBEntry - IOMMU TLB entry
+ *
+ * Structure used when performing a translation or when notifying MAP or
+ * UNMAP (invalidation) events
+ *
+ * @target_as: target address space
+ * @iova: IO virtual address (input)
+ * @translated_addr: translated address (output)
+ * @addr_mask: address mask (0xfff means 4K binding), must be multiple of 2
+ * @perm: permission flag of the mapping (NONE encodes no mapping or
+ *        invalidation notification)
+ * @arch_id: architecture specific ID tagging the TLB
+ * @leaf: when @perm is NONE, indicates whether only caches for the last
+ *        level of translation need to be invalidated.
+ */
 struct IOMMUTLBEntry {
AddressSpace *target_as;
hwaddr iova;
hwaddr translated_addr;
-    hwaddr           addr_mask;  /* 0xfff = 4k translation */
+    hwaddr           addr_mask;
     IOMMUAccessFlags perm;
+    uint32_t         arch_id;
+    bool             leaf;
 };
typedef struct IOMMUConfig {
--
2.20.1
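
The new fields exist so that a VFIO UNMAP notifier can fill the
address-selective invalidation uapi from patch 04. A hedged sketch of
that correspondence; the helper name is ours, and deriving the granule
size from addr_mask is an assumption for illustration:

/* Sketch only: turn an UNMAP notification into host invalidation data */
static void iotlb_to_addr_info(const IOMMUTLBEntry *entry,
                               struct iommu_inv_addr_info *info)
{
    memset(info, 0, sizeof(*info));
    /* arch_id carries the guest ASID on ARM */
    info->flags = IOMMU_INV_ADDR_FLAGS_ARCHID;
    info->archid = entry->arch_id;
    if (entry->leaf) {
        /* only last-level entries need to go; keep walk caches */
        info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF;
    }
    info->addr = entry->iova;
    info->granule_size = entry->addr_mask + 1;  /* e.g. 0xfff -> 4K */
    info->nb_granules = 1;
}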
[Qemu-devel] [RFC v4 14/27] hw/arm/smmuv3: Fill the IOTLBEntry arch_id on NH_VA invalidation
When the guest invalidates one S1 entry, it passes the ASID.
When propagating this invalidation down to the host, the ASID
information must be passed along as well. So let's fill the arch_id
field introduced for that purpose.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
hw/arm/smmuv3.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 8db605adab..b6eb61304d 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -822,6 +822,7 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
entry.iova = iova;
entry.addr_mask = (1 << tt->granule_sz) - 1;
entry.perm = IOMMU_NONE;
+    entry.arch_id = asid;
     memory_region_iotlb_notify_one(n, &entry);
}
--
2.20.1
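[Qemu-devel] [RFC v4 21/27] hw/vfio/common: Register a MAP notifier for MSI binding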
When the guest is exposed to a virtual IOMMU that translates
MSIs, the guest allocates an IOVA (gIOVA) that maps the virtual
doorbell (gDB). In nested mode, when the MSI is set up, we pass
this stage 1 mapping to the host so that it can use this stage 1
binding to create a nested stage translating into the physical
doorbell. Conversely, when the MSI setup is torn down, we
unregister this binding.
For registration, we directly use the IOMMU memory region
translate() callback, since the addr_mask is returned in the
IOTLB entry; address_space_translate() does not return this
information.
Now that we use a MAP notifier, let's remove the warning against
the usage of MAP notifiers (historically used along with Intel's
Caching Mode).
Signed-off-by: Eric Auger <eric.auger@redhat.com>
---
v3 -> v4:
- move the MSI binding registration in vfio_enable_vectors
to address the MSI use case
---
 hw/arm/smmuv3.c      |  8 -------
 hw/vfio/pci.c        | 50 +++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events |  2 ++
3 files changed, 51 insertions(+), 9 deletions(-)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index db03313672..a697968ace 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1521,14 +1521,6 @@ static void smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
SMMUv3State *s3 = sdev->smmu;
SMMUState *s = &(s3->smmu_state);
-    if (new & IOMMU_NOTIFIER_IOTLB_MAP) {
-        int bus_num = pci_bus_num(sdev->bus);
-        PCIDevice *pcidev = pci_find_device(sdev->bus, bus_num, sdev->devfn);
-
-        warn_report("SMMUv3 does not support notification on MAP: "
-                    "device %s will not function properly", pcidev->name);
-    }
-
     if (old == IOMMU_NOTIFIER_NONE) {
trace_smmuv3_notify_flag_add(iommu->parent_obj.name);
QLIST_INSERT_HEAD(&s->devices_with_notifiers, sdev, next);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 3095379747..b613b20501 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -358,6 +358,48 @@ static void vfio_msi_interrupt(void *opaque)
notify(&vdev->pdev, nr);
}
+static int vfio_register_msi_binding(VFIOPCIDevice *vdev, int vector_n)
+{
+    PCIDevice *dev = &vdev->pdev;
+    AddressSpace *as = pci_device_iommu_address_space(dev);
+    MSIMessage msg = pci_get_msi_message(dev, vector_n);
+    IOMMUMemoryRegionClass *imrc;
+    IOMMUMemoryRegion *iommu_mr;
+    bool msi_translate = false, nested = false;
+    IOMMUTLBEntry entry;
+
+    if (as == &address_space_memory) {
+        return 0;
+    }
+
+    iommu_mr = IOMMU_MEMORY_REGION(as->root);
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MSI_TRANSLATE,
+                                 (void *)&msi_translate);
+    memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
+                                 (void *)&nested);
+    imrc = memory_region_get_iommu_class_nocheck(iommu_mr);
+
+    if (!nested || !msi_translate) {
+        return 0;
+    }
+
+    /* MSI doorbell address is translated by an IOMMU */
+
+    rcu_read_lock();
+    entry = imrc->translate(iommu_mr, msg.address, IOMMU_WO, 0);
+    rcu_read_unlock();
+
+    if (entry.perm == IOMMU_NONE) {
+        return -ENOENT;
+    }
+
+    trace_vfio_register_msi_binding(vdev->vbasedev.name, vector_n,
+                                    msg.address, entry.translated_addr);
+
+    memory_region_iotlb_notify_iommu(iommu_mr, 0, entry);
+    return 0;
+}
+
 static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
{
struct vfio_irq_set *irq_set;
@@ -375,7 +417,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
fds = (int32_t *)&irq_set->data;
for (i = 0; i < vdev->nr_vectors; i++) {
-        int fd = -1;
+        int ret, fd = -1;
 
         /*
* MSI vs MSI-X - The guest has direct access to MSI mask and pending
@@ -390,6 +432,12 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
} else {
fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
}
+        ret = vfio_register_msi_binding(vdev, i);
+        if (ret) {
+            error_report("%s failed to register S1 MSI binding "
+                         "for vector %d(%d)", __func__, i, ret);
+            return ret;
+        }
     }
fds[i] = fd;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9f1868af2d..5de97a8882 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -117,6 +117,8 @@ vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype
vfio_dma_unmap_overflow_workaround(void) ""
vfio_iommu_addr_inv_iotlb(int asid, uint64_t addr, uint64_t size, uint64_t nb_granules, bool leaf) "nested IOTLB invalidate asid=%d, addr=0x%"PRIx64" granule_size=0x%"PRIx64" nb_granules=0x%"PRIx64" leaf=%d"
vfio_iommu_asid_inv_iotlb(int asid) "nested IOTLB invalidate asid=%d"
+vfio_register_msi_binding(const char *name, int vector, uint64_t giova, uint64_t gdb) "%s: register vector %d gIOVA=0x%"PRIx64" -> gDB=0x%"PRIx64" stage 1 mapping"
+vfio_unregister_msi_binding(const char *name, int vector, uint64_t giova) "%s: unregister vector %d gIOVA=0x%"PRIx64" stage 1 mapping"
 
 # platform.c
vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
--
2.20.1
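
For completeness, the MAP notification produced above is what patch 22
("vfio-pci: Expose MSI stage 1 bindings to the host", not quoted in
this excerpt) ultimately turns into a VFIO_IOMMU_BIND_MSI call using
the uapi from patch 04. A hedged sketch of that final step; the helper
name and error handling are illustrative assumptions:

/* Sketch only: forward the gIOVA -> gDB stage 1 MSI binding to the host */
static int bind_msi(int container_fd, IOMMUTLBEntry *entry)
{
    struct vfio_iommu_type1_bind_msi bind;

    memset(&bind, 0, sizeof(bind));
    bind.argsz = sizeof(bind);
    bind.iova = entry->iova;                /* gIOVA chosen by the guest */
    bind.gpa = entry->translated_addr;      /* guest doorbell (gDB) */
    bind.size = entry->addr_mask + 1;

    return ioctl(container_fd, VFIO_IOMMU_BIND_MSI, &bind);
}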
[Qemu-devel] [RFC v4 25/27] vfio-pci: register handler for iommu fault
Patchew URL: https://patchew.org/QEMU/20190527114203.2762-1-eric.auger@redhat.com/
Hi,
This series failed build test on s390x host. Please find the details below.
=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
set -e
CC=$HOME/bin/cc
INSTALL=$PWD/install
BUILD=$PWD/build
mkdir -p $BUILD $INSTALL
SRC=$PWD
cd $BUILD
$SRC/configure --cc=$CC --prefix=$INSTALL
make -j4
# XXX: we need reliable clean up
# make check -j4 V=1
make install
echo
echo "=== ENV ==="
env
echo
echo "=== PACKAGES ==="
rpm -qa
=== TEST SCRIPT END ===
CC ppc-softmmu/hw/display/vga.o
CC mips-softmmu/hw/mips/mips_r4k.o
/var/tmp/patchew-tester-tmp-oaqfmxu5/src/hw/ppc/spapr_iommu.c: In function ‘spapr_tce_replay’:
/var/tmp/patchew-tester-tmp-oaqfmxu5/src/hw/ppc/spapr_iommu.c:161:14: error: ‘IOMMUNotifier’ {aka ‘struct IOMMUNotifier’} has no member named ‘notify’
161 | n->notify(n, &iotlb);
| ^~
make[1]: *** [/var/tmp/patchew-tester-tmp-oaqfmxu5/src/rules.mak:69: hw/ppc/spapr_iommu.o] Error 1
The full log is available at
http://patchew.org/logs/20190527114203.2762-1-eric.auger@redhat.com/testing.s390x/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [Qemu-devel] [RFC v4 08/27] hw/vfio/common: Force nested if iommu requires it
On Mon, May 27, 2019 at 01:41:44PM +0200, Eric Auger wrote:
> In case we detect the address space is translated by
> a virtual IOMMU which requires nested stages, let's set up
> the container with the VFIO_TYPE1_NESTING_IOMMU iommu_type.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> 
> ---
> 
> v2 -> v3:
> - add "nested only is selected if requested by @force_nested"
>   comment in this patch
> ---
>  hw/vfio/common.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 1f1deff360..99ade21056 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1136,14 +1136,19 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
>   * vfio_get_iommu_type - selects the richest iommu_type (v2 first)
>   */
>  static int vfio_get_iommu_type(VFIOContainer *container,
> +                               bool force_nested,
>                                 Error **errp)
>  {
> -    int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
> +    int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
> +                          VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>                            VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
>      int i;
> 
>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
> +            if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU && !force_nested) {
If force_nested==true and the kernel does not support
VFIO_TYPE1_NESTING_IOMMU, we will still return the other iommu types?
That seems not to match what "force" means here.
What I feel like is that we want an "iommu_nest_types[]" which only
contains VFIO_TYPE1_NESTING_IOMMU. Then:
if (nested) {
target_types = iommu_nest_types;
} else {
target_types = iommu_types;
}
foreach (target_types)
...
return -EINVAL;
Might be clearer? Then we can drop [2] below since we'll fail earlier
at [1].
> +                continue;
> +            }
>              return iommu_types[i];
>          }
>      }
> @@ -1152,11 +1157,11 @@ static int vfio_get_iommu_type(VFIOContainer *container,
>  }
> 
>  static int vfio_init_container(VFIOContainer *container, int group_fd,
> -                               Error **errp)
> +                               bool force_nested, Error **errp)
>  {
>      int iommu_type, ret;
> 
> -    iommu_type = vfio_get_iommu_type(container, errp);
> +    iommu_type = vfio_get_iommu_type(container, force_nested, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
[1]
>      }
> @@ -1192,6 +1197,14 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      VFIOContainer *container;
>      int ret, fd;
>      VFIOAddressSpace *space;
> +    IOMMUMemoryRegion *iommu_mr;
> +    bool force_nested = false;
> +
> +    if (as != &address_space_memory && memory_region_is_iommu(as->root)) {
> +        iommu_mr = IOMMU_MEMORY_REGION(as->root);
> +        memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_VFIO_NESTED,
> +                                     (void *)&force_nested);
> +    }
> 
>      space = vfio_get_address_space(as);
> 
> @@ -1252,12 +1265,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      QLIST_INIT(&container->giommu_list);
>      QLIST_INIT(&container->hostwin_list);
> 
> -    ret = vfio_init_container(container, group->fd, errp);
> +    ret = vfio_init_container(container, group->fd, force_nested, errp);
>      if (ret) {
>          goto free_container_exit;
>      }
> 
> +    if (force_nested && container->iommu_type != VFIO_TYPE1_NESTING_IOMMU) {
> +        error_setg(errp, "nested mode requested by the virtual IOMMU "
> +                   "but not supported by the vfio iommu");
> +    }
[2]
> +
>      switch (container->iommu_type) {
> +    case VFIO_TYPE1_NESTING_IOMMU:
>      case VFIO_TYPE1v2_IOMMU:
>      case VFIO_TYPE1_IOMMU:
>      {
> -- 
> 2.20.1
> 
Regards,
--
Peter Xu
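
A fleshed-out sketch of the iommu_nest_types[] approach Peter suggests
above (our illustration, mirroring the quoted code; it is not code from
the posted series):

static int vfio_get_iommu_type(VFIOContainer *container, bool force_nested,
                               Error **errp)
{
    static const int default_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                                         VFIO_SPAPR_TCE_v2_IOMMU,
                                         VFIO_SPAPR_TCE_IOMMU };
    static const int nested_types[] = { VFIO_TYPE1_NESTING_IOMMU };
    const int *types = force_nested ? nested_types : default_types;
    int nr = force_nested ? ARRAY_SIZE(nested_types)
                          : ARRAY_SIZE(default_types);
    int i;

    for (i = 0; i < nr; i++) {
        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, types[i])) {
            return types[i];
        }
    }
    /* a nested request now fails here, instead of needing check [2] */
    error_setg(errp, "No available IOMMU models");
    return -EINVAL;
}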
Re: [Qemu-devel] [RFC v4 09/27] memory: Prepare for different kinds of IOMMU MR notifiers
On Mon, May 27, 2019 at 01:41:45PM +0200, Eric Auger wrote:
[...]
> @@ -3368,8 +3368,9 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>  {
>      IOMMUTLBEntry entry;
>      hwaddr size;
> -    hwaddr start = n->start;
> -    hwaddr end = n->end;
> +
(extra new line)
> +    hwaddr start = n->iotlb_notifier.start;
> +    hwaddr end = n->iotlb_notifier.end;
>      IntelIOMMUState *s = as->iommu_state;
>      DMAMap map;
[...]
>  typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
>                              IOMMUTLBEntry *data);
> 
> -struct IOMMUNotifier {
> +typedef struct IOMMUIOLTBNotifier {
>      IOMMUNotify notify;
Hi, Eric,
I wasn't following the thread much before, so sorry to ask this if it
is too late - have you thought about using the Notifier struct directly?
Because then it'll (1) allow the user to register with both IOTLB |
CONFIG flags in the same notifier, while currently we'll need to
register one for each (and this worries me a bit: when we grow the
types of flags further, one registrant can end up with quite a few
notifiers), and (2) the notifier part can be shared by different events.
Then, when notifying, the (void *) data can be a union:
struct IOMMUEvent {
    int event; // can be one of the notifier flags
    union {
        struct IOTLBEvent {
            ...
        };
        struct PASIDEvent {
            ...
        };
    };
};
Then the handler hook would be simple too:
handler(data)
{
    switch (data.event) {
    ...
    }
}
I would be fine with the current patch if this series is close to
being merged, because even if we want that, we can do it on top when
we introduce even more notifiers; just asking out loud first.
> -    IOMMUNotifierFlag notifier_flags;
>      /* Notify for address space range start <= addr <= end */
>      hwaddr start;
>      hwaddr end;
> +} IOMMUIOLTBNotifier;
> +
> +struct IOMMUNotifier {
> +    IOMMUNotifierFlag notifier_flags;
> +    union {
> +        IOMMUIOLTBNotifier iotlb_notifier;
> +    };
>      int iommu_idx;
>      QLIST_ENTRY(IOMMUNotifier) node;
>  };
> @@ -126,15 +132,18 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
> 
> -static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> -                                       IOMMUNotifierFlag flags,
> -                                       hwaddr start, hwaddr end,
> -                                       int iommu_idx)
> +static inline void iommu_iotlb_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> +                                             IOMMUNotifierFlag flags,
> +                                             hwaddr start, hwaddr end,
> +                                             int iommu_idx)
>  {
> -    n->notify = fn;
> +    assert(flags & IOMMU_NOTIFIER_IOTLB_MAP ||
> +           flags & IOMMU_NOTIFIER_IOTLB_UNMAP);
Can use IOMMU_NOTIFIER_IOTLB_ALL directly?
> +    assert(start < end);
>      n->notifier_flags = flags;
> -    n->start = start;
> -    n->end = end;
> +    n->iotlb_notifier.notify = fn;
> +    n->iotlb_notifier.start = start;
> +    n->iotlb_notifier.end = end;
>      n->iommu_idx = iommu_idx;
>  }
Otherwise the patch looks good to me.
Regards,
--
Peter Xu
Re: [Qemu-devel] [RFC v4 08/27] hw/vfio/common: Force nested if iommu requires it
Hi Peter,
On 5/28/19 4:47 AM, Peter Xu wrote:
> On Mon, May 27, 2019 at 01:41:44PM +0200, Eric Auger wrote:
> [...]
> If force_nested==true and the kernel does not support
> VFIO_TYPE1_NESTING_IOMMU, we will still return the other iommu types?
> That seems not to match what "force" means here.
>
> What I feel like is that we want an "iommu_nest_types[]" which only
> contains VFIO_TYPE1_NESTING_IOMMU. [...] Might be clearer? Then we can
> drop [2] below since we'll fail earlier at [1].
agreed. I can fail immediately in case the nested mode was requested and
not supported. This will be clearer.
Thanks!
Eric
Re: [Qemu-devel] [RFC v4 09/27] memory: Prepare for different kinds of IOMMU MR notifiers
Hi Peter,
On 5/28/19 6:48 AM, Peter Xu wrote:
> On Mon, May 27, 2019 at 01:41:45PM +0200, Eric Auger wrote:
> [...]
> I wasn't following the thread much before, so sorry to ask this if it
> is too late - have you thought about using the Notifier struct directly?
> [...]
I am currently prototyping your suggestion. I think this would clarify
some parts of the code by making the type of propagated event explicit.
I will send a separate RFC for this change.
Thanks!
Eric
On Mon, May 27, 2019 at 7:44 PM Eric Auger <eric.auger@redhat.com> wrote:
>
> Up to now, vSMMUv3 has not been integrated with VFIO. [...]
>
> This series can be found at:
> https://github.com/eauger/qemu/tree/v4.0.0-2stage-rfcv4
>
> Compatible with kernel series:
> [PATCH v8 00/29] SMMUv3 Nested Stage Setup
> (https://lkml.org/lkml/2019/5/26/95)
>
Have tested vfio mode in qemu on arm64 platform.
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
qemu: https://github.com/eauger/qemu/tree/v4.0.0-2stage-rfcv4
kernel: https://github.com/eauger/linux/tree/v5.2-rc1-2stage-v8
Hi Zhangfei,
On 7/11/19 3:53 AM, Zhangfei Gao wrote:
> On Mon, May 27, 2019 at 7:44 PM Eric Auger <eric.auger@redhat.com> wrote:
> [...]
>
> Have tested vfio mode in qemu on arm64 platform.
>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> qemu: https://github.com/eauger/qemu/tree/v4.0.0-2stage-rfcv4
> kernel: https://github.com/eauger/linux/tree/v5.2-rc1-2stage-v8
Your testing is really appreciated.
Both kernel and QEMU series will be respinned. I am currently waiting
for the 5.3 kernel window, as it will resolve some dependencies on the
fault reporting APIs. My focus is to get the updated kernel series
reviewed and tested and then refine the QEMU integration accordingly.
Thanks
Eric
On Thu, Jul 11, 2019 at 1:55 PM Auger Eric <eric.auger@redhat.com> wrote:
> Hi Zhangfei,
> [...]
> Your testing is really appreciated.
>
> Both kernel and QEMU series will be respinned. I am currently waiting
> for the 5.3 kernel window, as it will resolve some dependencies on the
> fault reporting APIs. My focus is to get the updated kernel series
> reviewed and tested and then refine the QEMU integration accordingly.
>
Thanks Eric, that's great.
One note: I found that the kernel part (drivers/iommu/arm-smmu-v3.c)
conflicts with Jean's SVA patches, especially this one:
iommu/smmuv3: Dynamically allocate s1_cfg and s2_cfg
Thanks