linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-26 10:32:25 -04:00

Author	SHA1	Message	Date
Nicolin Chen	5d5c85ff62	iommufd: Allow passing in iopt_access_list_id to iopt_remove_access() This is a preparatory change for ioas replacement support for accesses. The replacement routine does an iopt_add_access() for a new IOAS first and then iopt_remove_access() for the old IOAS upon the success of the first call. However, the first call overrides the iopt_access_list_id in the access struct, resulting in iopt_remove_access() being unable to work on the old IOAS. Add an iopt_access_list_id as a parameter to iopt_remove_access, so the replacement routine can save the id before it gets overwritten. Pass the id in iopt_remove_access() for a proper cleanup. The existing callers should just pass in access->iopt_access_list_id. Link: https://lore.kernel.org/r/7bb939b9e0102da0c099572bb3de78ab7622221e.1690523699.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Jason Gunthorpe	6583c865de	iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC Test the basic flow. Link: https://lore.kernel.org/r/19-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:41 -03:00
Jason Gunthorpe	34f327a985	iommufd: Keep track of each device's reserved regions instead of groups The driver facing API in the iommu core makes the reserved regions per-device. An algorithm in the core code consolidates the regions of all the devices in a group to return the group view. To allow for devices to be hotplugged into the group iommufd would re-load the entire group's reserved regions for each device, just in case they changed. Further iommufd already has to deal with duplicated/overlapping reserved regions as it must union all the groups together. Thus simplify all of this to just use the device reserved regions interface directly from the iommu driver. Link: https://lore.kernel.org/r/5-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:32 -03:00
Jason Gunthorpe	804ca14d04	iommufd: Do not access the area pointer after unlocking A concurrent unmap can trigger freeing of the area pointers while we are generating an unmapping notification for accesses. syzkaller reports: BUG: KASAN: slab-use-after-free in iopt_unmap_iova_range+0x5ba/0x5f0 Read of size 4 at addr ffff888075996184 by task syz-executor.2/31160 CPU: 1 PID: 31160 Comm: syz-executor.2 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023 Call Trace: <TASK> dump_stack_lvl+0xd9/0x150 print_address_description.constprop.0+0x2c/0x3c0 kasan_report+0x11c/0x130 iopt_unmap_iova_range+0x5ba/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0812c8c169 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0813914168 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f0812dabf80 RCX: 00007f0812c8c169 RDX: 0000000020000100 RSI: 0000000000003b86 RDI: 0000000000000005 RBP: 00007f0812ce7ca1 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f0812ecfb1f R14: 00007f0813914300 R15: 0000000000022000 </TASK> Allocated by task 31160: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_kmalloc+0xa2/0xb0 iopt_alloc_area_pages+0x94/0x560 iopt_map_user_pages+0x205/0x4e0 iommufd_ioas_map+0x329/0x5f0 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 31161: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x40 ____kasan_slab_free+0x160/0x1c0 slab_free_freelist_hook+0x8b/0x1c0 __kmem_cache_free+0xaf/0x2d0 iopt_unmap_iova_range+0x288/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff888075996100 which belongs to the cache kmalloc-cg-192 of size 192 The buggy address is located 132 bytes inside of freed 192-byte region [ffff888075996100, ffff8880759961c0) The buggy address belongs to the physical page: page:ffffea0001d66580 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x75996 memcg:ffff88801f1c2701 flags: 0xfff00000000200(slab\|node=0\|zone=1\|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000200 ffff88801244ddc0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080100010 00000001ffffffff ffff88801f1c2701 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER\|__GFP_NOWARN\|__GFP_NORETRY), pid 31157, tgid 31154 (syz-executor.0), ts 1984547323469, free_ts 1983933451331 post_alloc_hook+0x2db/0x350 get_page_from_freelist+0xf41/0x2c00 __alloc_pages+0x1cb/0x4a0 alloc_pages+0x1aa/0x270 allocate_slab+0x25f/0x390 ___slab_alloc+0xa91/0x1400 __slab_alloc.constprop.0+0x56/0xa0 __kmem_cache_alloc_node+0x136/0x320 kmalloc_trace+0x26/0xe0 iommufd_test+0x1328/0x2c20 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd page last free stack trace: free_unref_page_prepare+0x62e/0xcb0 free_unref_page_list+0xe3/0xa70 release_pages+0xcd8/0x1380 tlb_batch_pages_flush+0xa8/0x1a0 tlb_finish_mmu+0x14b/0x7e0 exit_mmap+0x2b2/0x930 __mmput+0x128/0x4c0 mmput+0x60/0x70 do_exit+0x9b0/0x29b0 do_group_exit+0xd4/0x2a0 get_signal+0x2318/0x25b0 arch_do_signal_or_restart+0x79/0x5c0 exit_to_user_mode_prepare+0x11f/0x240 syscall_exit_to_user_mode+0x1d/0x50 do_syscall_64+0x46/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Precompute what is needed to call the access function and do not check the area's num_accesses again as the pointer may not be valid anymore. Use a counter instead. Fixes: `51fe6141f0` ("iommufd: Data structure to provide IOVA to PFN mapping") Link: https://lore.kernel.org/r/1-v2-9a03761d445d+54-iommufd_syz2_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+1ad12d16afca0e7d2dde@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/0000000000001d40fc05fe385332@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-06-26 09:00:23 -03:00
Jason Gunthorpe	d6c55c0a20	iommufd: Change the order of MSI setup Eric points out this is wrong for the rare case of someone using allow_unsafe_interrupts on ARM. We always have to setup the MSI window in the domain if the iommu driver asks for it. Move the iommu_get_msi_cookie() setup to the top of the function and always do it, regardless of the security mode. Add checks to iommufd_device_setup_msi() to ensure the driver is not doing something incomprehensible. No current driver will set both a HW and SW MSI window, or have more than one SW MSI window. Fixes: `e8d5721003` ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/3-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-09 15:24:30 -04:00
Jason Gunthorpe	52f528583b	iommufd: Add additional invariant assertions These are on performance paths so we protect them using the CONFIG_IOMMUFD_TEST to not take a hit during normal operation. These are useful when running the test suite and syzkaller to find data structure inconsistencies early. Link: https://lore.kernel.org/r/18-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	8d40205f60	iommufd: Add kAPI toward external drivers for kernel access Kernel access is the mode that VFIO "mdevs" use. In this case there is no struct device and no IOMMU connection. iommufd acts as a record keeper for accesses and returns the actual struct pages back to the caller to use however they need. eg with kmap or the DMA API. Each caller must create a struct iommufd_access with iommufd_access_create(), similar to how iommufd_device_bind() works. Using this struct the caller can access blocks of IOVA using iommufd_access_pin_pages() or iommufd_access_rw(). Callers must provide a callback that immediately unpins any IOVA being used within a range. This happens if userspace unmaps the IOVA under the pin. The implementation forwards the access requests directly to the iopt infrastructure that manages the iopt_pages_access. Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	51fe6141f0	iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00

8 Commits