Refactor __io_uring_add_tctx_node() so that on error it never leaves
current->io_uring pointing at a half-setup tctx. This moves the
assignment of current->io_uring to the end of the function post any
failure points.
Separate out the node installation into io_tctx_install_node() to
further clean this up.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of having io_uring_alloc_task_context() return an int and
assign tsk->io_uring, just have it return the task context directly.
This enables cleaner error handling in callers, which may have
failure points post calling io_uring_alloc_task_context().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bitfields cannot be set and checked atomically, and this makes it more
clear that these are indeed in shared storage and must be checked and
set in a sane fashion. This is in preparation for annotating a few of
the known racy, but harmless, flags checking.
No intended functional changes in this patch.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Use a separate 'idx' variable to store the result of array_index_nospec()
instead of modifying the loop variable 'offset' directly. This improves
code clarity by separating the logical index from the sanitized index
used for array access.
No functional change intended.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play along nicely, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters, can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and has limited us because of it.
io_uring already has some filtering support. It requires the ring to
be setup in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run post the init phase of the request, which allows
filters to even dip into data that is being passed in struct in user
memory, as the init side of requests make that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or have filters etc know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed.
On top of that support is added for per-task filters, meaning that any
ring created with a task that has a per-task filter will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
Pull io_uring updates from Jens Axboe:
- Clean up the IORING_SETUP_R_DISABLED and submitter task checking,
mostly just in preparation for relaxing the locking for SINGLE_ISSUER
in the future.
- Improve IOPOLL by using a doubly linked list to manage completions.
Previously it was singly listed, which meant that to complete request
N in the chain 0..N-1 had to have completed first. With a doubly
linked list we can complete whatever request completes in that order,
rather than need to wait for a consecutive range to be available.
This reduces latencies.
- Improve the restriction setup and checking. Mostly in preparation for
adding further features on top of that. Coming in a separate pull
request.
- Split out task_work and wait handling into separate files. These are
mostly nicely abstracted already, but still remained in the
io_uring.c file which is on the larger side.
- Use GFP_KERNEL_ACCOUNT in a few more spots, where appropriate.
- Ensure even the idle io-wq worker exits if a task no longer has any
rings open.
- Add support for a non-circular submission queue.
By default, the SQ ring keeps moving around, even if only a few
entries are used for each submission. This can be wasteful in terms
of cachelines.
If IORING_SETUP_SQ_REWIND is set for the ring when created, each
submission will start at offset 0 instead of where we last left off
doing submissions.
- Various little cleanups
* tag 'for-7.0/io_uring-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (30 commits)
io_uring/kbuf: fix memory leak if io_buffer_add_list fails
io_uring: Add SPDX id lines to remaining source files
io_uring: allow io-wq workers to exit when unused
io_uring/io-wq: add exit-on-idle state
io_uring/net: don't continue send bundle if poll was required for retry
io_uring/rsrc: use GFP_KERNEL_ACCOUNT consistently
io_uring/futex: use GFP_KERNEL_ACCOUNT for futex data allocation
io_uring/io-wq: handle !sysctl_hung_task_timeout_secs
io_uring: fix bad indentation for setup flags if statement
io_uring/rsrc: take unsigned index in io_rsrc_node_lookup()
io_uring: introduce non-circular SQ
io_uring: split out CQ waiting code into wait.c
io_uring: split out task work code into tw.c
io_uring/io-wq: don't trigger hung task for syzbot craziness
io_uring: add IO_URING_EXIT_WAIT_MAX definition
io_uring/sync: validate passed in offset
io_uring/eventfd: remove unused ctx->evfd_last_cq_tail member
io_uring/timeout: annotate data race in io_flush_timeouts()
io_uring/uring_cmd: explicitly disallow cancelations for IOPOLL
io_uring: fix IOPOLL with passthrough I/O
...
Currently io_uring supports restricting operations on a per-ring basis.
To use those, the ring must be setup in a disabled state by setting
IORING_SETUP_R_DISABLED. Then restrictions can be set for the ring, and
the ring can then be enabled.
This commit adds support for IORING_REGISTER_RESTRICTIONS with ring_fd
== -1, like the other "blind" register opcodes which work on the task
rather than a specific ring. This allows registration of the same kind
of restrictions as can been done on a specific ring, but with the task
itself. Once done, any ring created will inherit these restrictions.
If a restriction filter is registered with a task, then it's inherited
on fork for its children. Children may only further restrict operations,
not extend them.
Inheriting restrictions include both the classic
IORING_REGISTER_RESTRICTIONS based restrictions, as well as the BPF
filters that have been registered with the task via
IORING_REGISTER_BPF_FILTER.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring keeps a per-task io-wq around, even when the task no longer has
any io_uring instances.
If the task previously used io_uring for file I/O, this can leave an
unrelated iou-wrk-* worker thread behind after the last io_uring
instance is gone.
When the last io_uring ctx is removed from the task context, mark the
io-wq exit-on-idle so workers can go away. Clear the flag on subsequent
io_uring usage.
Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_wq stores do_work and free_work function pointers which are
called on each work item. But these function pointers are always set to
io_wq_submit_work and io_wq_free_work, respectively. So remove these
function pointers and just call the functions directly.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot triggered the following WARN_ON:
WARNING: CPU: 0 PID: 16 at io_uring/tctx.c:51 __io_uring_free+0xfa/0x140 io_uring/tctx.c:51
which is the
WARN_ON_ONCE(!xa_empty(&tctx->xa));
sanity check in __io_uring_free() when a io_uring_task is going through
its final put. The syzbot test case includes injecting memory allocation
failures, and it very much looks like xa_store() can fail one of its
memory allocations and end up with ->head being non-NULL even though no
entries exist in the xarray.
Until this issue gets sorted out, work around it by attempting to
iterate entries in our xarray, and WARN_ON_ONCE() if one is found.
Reported-by: syzbot+cc36d44ec9f368e443d3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/673c1643.050a0220.87769.0066.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than store the task_struct itself in struct io_kiocb, store
the io_uring specific task_struct. The life times are the same in terms
of io_uring, and this avoids doing some dereferences through the
task_struct. For the hot path of putting local task references, we can
deref req->tctx instead, which we'll need anyway in that function
regardless of whether it's local or remote references.
This is mostly straight forward, except the original task PF_EXITING
check needs a bit of tweaking. task_work is _always_ run from the
originating task, except in the fallback case, where it's run from a
kernel thread. Replace the potentially racy (in case of fallback work)
checks for req->task->flags with current->flags. It's either the still
the original task, in which case PF_EXITING will be sane, or it has
PF_KTHREAD set, in which case it's fallback work. Both cases should
prevent moving forward with the given request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_REGISTER_USE_REGISTERED_RING, an application can register
the ring fd and use it via registered index rather than installed fd.
This allows using a registered ring for everything *except* the initial
mmap.
With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the
user, rather than requiring a subsequent mmap.
The combination of the two allows a user to operate *entirely* via a
registered ring fd, making it unnecessary to ever install the fd in the
first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make
io_uring_setup register the fd and return a registered index, without
installing the fd.
This allows an application to avoid touching the fd table at all, and
allows a library to never even momentarily install a file descriptor.
This splits out an io_ring_add_registered_file helper from
io_ring_add_registered_fd, for use by io_uring_setup.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This better describes what it does - it's incremented when the task is
currently undergoing a cancelation operation, due to exiting or exec'ing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
this is no longer needed, as submitter_task is set at creation time.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove submitter parameter from __io_uring_add_tctx_node.
It was only called from one place, and we can do that logic in that one
place.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With networking use cases we see contention on the spinlock used to
protect the task_list when multiple threads try and add completions at once.
Instead we can use a lockless list, and assume that the first caller to
add to the list is responsible for kicking off task work.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>