Commit Graph

1351148 Commits

Author SHA1 Message Date
Max Kellermann
28a3f6ab2f fs/open: make chmod_common() and chown_common() killable
Allows killing processes that are waiting for the inode lock.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/20250513150327.1373061-2-max.kellermann@ionos.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 12:03:12 +02:00
Max Kellermann
d8c5507cd1 include/linux/fs.h: add inode_lock_killable()
Prepare for making inode operations killable while they're waiting for
the lock.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/20250513150327.1373061-1-max.kellermann@ionos.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 12:03:11 +02:00
Miklos Szeredi
e0410e956b readdir: supply dir_context.count as readdir buffer size hint
This is a preparation for large readdir buffers in fuse.

Simply setting the fuse buffer size to the userspace buffer size should
work, the record sizes are similar (fuse's is slightly larger than libc's,
so no overflow should ever happen).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jaco Kroon <jaco@uls.co.za>
Link: https://lore.kernel.org/20250513151012.1476536-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:26:05 +02:00
Yafang Shao
e7b9cea718 vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
On our HDFS servers with 12 HDDs per server, a HDFS datanode[0] startup
involves scanning all files and caching their metadata (including dentries
and inodes) in memory. Each HDD contains approximately 2 million files,
resulting in a total of ~20 million cached dentries after initialization.

To minimize dentry reclamation, we set vfs_cache_pressure to 1. Despite
this configuration, memory pressure conditions can still trigger
reclamation of up to 50% of cached dentries, reducing the cache from 20
million to approximately 10 million entries. During the subsequent cache
rebuild period, any HDFS datanode restart operation incurs substantial
latency penalties until full cache recovery completes.

To maintain service stability, we need to preserve more dentries during
memory reclamation. The current minimum reclaim ratio (1/100 of total
dentries) remains too aggressive for our workload. This patch introduces
vfs_cache_pressure_denom for more granular cache pressure control. The
configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000]
effectively maintains the full 20 million dentry cache under memory
pressure, preventing datanode restart performance degradation.
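
The arithmetic behind the two configurations can be sketched in userspace.
This is an illustration only, not the kernel code: pressure_ratio() is a
hypothetical helper, and it assumes the reclaimable share of a cache is
simply count * vfs_cache_pressure / denominator, as the 1/100 minimum ratio
above suggests.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): scale an object count by
 * pressure/denom. The division is split into quotient and remainder so the
 * intermediate product stays small.
 */
static uint64_t pressure_ratio(uint64_t count, uint64_t pressure, uint64_t denom)
{
	assert(denom != 0);
	return (count / denom) * pressure + ((count % denom) * pressure) / denom;
}
```

Under that assumption, 20 million cached dentries with vfs_cache_pressure=1
give 200,000 reclaimable objects with a denominator of 100, and 2,000 with a
denominator of 10000.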

Link: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes [0]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/20250511083624.9305-1-laoar.shao@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:12:59 +02:00
Miklos Szeredi
8d9117009d fuse: don't allow signals to interrupt getdents copying
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit().  This
second phase is non-blocking as long as the userspace buffer is not paged
out, so making it interruptible makes zero sense.

Overload d_type as flags, since it only uses 4 bits from 32.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/20250513112335.1473177-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:12:11 +02:00
Petr Vaněk
678927c0c9 Documentation: fix typo in root= kernel parameter description
Fixes a typo in the root= parameter description, changing
"this a a" to "this is a".

Fixes: c0c1a7dcb6 ("init: move the nfs/cifs/ram special cases out of name_to_dev_t")
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
Link: https://lore.kernel.org/20250512110827.32530-1-arkamar@atlas.cz
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-13 09:27:57 +02:00
Christian Brauner
e68ecc161f Merge patch series "Minor namespace code simplication"
Joel Savitz <jsavitz@redhat.com> says:

The two patches are independent of each other. The first patch removes
unnecessary NULL guards from free_nsproxy() and create_new_namespaces()
in line with other usage of the put_*_ns() call sites. The second patch
slightly reduces the size of the kernel when CONFIG_CGROUPS is not
selected.

* patches from https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com:
  include/cgroup: separate {get,put}_cgroup_ns no-op case
  kernel/nsproxy: remove unnecessary guards

Link: https://lore.kernel.org/20250508184930.183040-1-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 13:14:02 +02:00
Joel Savitz
79fb8d8d93 include/cgroup: separate {get,put}_cgroup_ns no-op case
When CONFIG_CGROUPS is not selected, {get,put}_cgroup_ns become no-ops
and therefore it is not necessary to compile in the code for changing
the reference count.

When CONFIG_CGROUPS is selected, there is no valid case where
either of {get,put}_cgroup_ns() will be called with a NULL argument.

Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-3-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 13:13:54 +02:00
Joel Savitz
5caa2d89b7 kernel/nsproxy: remove unnecessary guards
In free_nsproxy() and the error path of create_new_namespaces() the
put_*_ns() calls are guarded by unnecessary NULL checks.

put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never
receive a NULL argument unless their namespace type is disabled, and in
this case all four become no-ops at compile time anyway. put_mnt_ns()
will never receive a null argument at any time.

This unguarded usage is in line with other call sites of put_*_ns().

Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 13:13:54 +02:00
Christoph Hellwig
bb01e8cc10 fs: use writeback_iter directly in mpage_writepages
Stop using write_cache_pages and use writeback_iter directly.  This
removes an indirect call per written folio and makes the code easier
to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250507062124.3933305-1-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:37:48 +02:00
Jinliang Zheng
9f81d70702 fs: remove useless plus one in super_cache_scan()
After commit 475d0db742 ("fs: Fix theoretical division by 0 in
super_cache_scan()."), there's no need to add one to prevent
division by zero.

Remove it to simplify the code.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Link: https://lore.kernel.org/20250428135050.267297-1-alexjlzheng@tencent.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-29 13:08:29 +02:00
Christian Brauner
19bbfe7b5f fs: add S_ANON_INODE
This makes it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead().

Readahead on anonymous inodes didn't work because they didn't have a
proper mode. Now that they do, we need to keep returning EINVAL,
otherwise LTP will fail.

We also need to ensure that ioctls aren't simply fired like they are for
regular files so things like inotify inodes continue to correctly call
their own ioctl handlers as in [1].

Reported-by: Xilin Wu <sophon@radxa.com>
Link: https://lore.kernel.org/3A9139D5CD543962+89831381-31b9-4392-87ec-a84a5b3507d8@radxa.com [1]
Link: https://lore.kernel.org/7a1a7076-ff6b-4cb0-94e7-7218a0a44028@sirena.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 13:20:14 +02:00
Christian Brauner
c4044870ae Merge patch series "two nits for path lookup"
Mateusz Guzik <mjguzik@gmail.com> says:

Since path lookup is being looked at, two extra nits from me:

1. some trivial jump avoidance in inode_permission()

2. but more importantly avoiding a memory access which is most likely a
cache miss when descending into devcgroup_inode_permission()

the file seems to have no maintainer fwiw

anyhow I'm confident the way forward is to add IOP_FAST_MAY_EXEC (or
similar) to elide inode_permission() in the common case to begin with.
There are quite a few branches which straight up don't need execute.
On top of that btrfs has a permission hook only to check for MAY_WRITE,
which in case of path lookup is not set. With the above flag the call
will be avoided.

* patches from https://lore.kernel.org/20250416221626.2710239-1-mjguzik@gmail.com:
  device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
  fs: touch up predicts in inode_permission()

Link: https://lore.kernel.org/20250416221626.2710239-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:59 +02:00
Christian Brauner
79beea2db0 fs: remove uselib() system call
This system call has been deprecated for quite a while now.
Let's try and remove it from the kernel completely.

Link: https://lore.kernel.org/20250415-kanufahren-besten-02ac00e6becd@brauner
Acked-by: Kees Cook <kees@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:59 +02:00
Mateusz Guzik
4ef4ac3601 device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
The routine gets called for every path component during lookup.
->i_mode is going to be cached on account of permission checks, while
->i_rdev is an area which is most likely cache-cold.

gcc 14.2 is kind enough to emit one branch:
	movzwl (%rbx),%eax
	mov    %eax,%edx
	and    $0xb000,%dx
	cmp    $0x2000,%dx
	je     11bc <inode_permission+0xec>

This patch is lazy in that I don't know if the ->i_rdev branch makes
any sense with the newly added mode check upfront. I am not changing any
semantics here though.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250416221626.2710239-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:59 +02:00
Zijun Hu
d1f482108a fs/fs_parse: Remove unused and problematic validate_constant_table()
Remove validate_constant_table() since:

- It has no caller.

- It has the 3 bugs below for a valid constant table array[], which must
  end with an empty entry; take the invocation below as an example:
  validate_constant_table(array, ARRAY_SIZE(array), ...)

  - Always return wrong value due to the last empty entry.
  - Imprecise error message for missorted case.
  - Potential NULL pointer dereference since the last pr_err() may use
    @tbl[i].name NULL pointer to print the last empty entry's name.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250415-fix_fs-v4-1-5d575124a3ff@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:59 +02:00
Mateusz Guzik
875ccc0ddc fs: touch up predicts in inode_permission()
The routine only encounters errors when people try to access things they
can't, which is a negligible amount of calls.

The only questionable bit might be the pre-existing predict around
MAY_WRITE. Currently the routine is predominantly used for MAY_EXEC, so
this makes some sense.

I verified this straightens out the asm.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250416221626.2710239-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:59 +02:00
Zijun Hu
296b67059e fs/fs_parse: Delete macro fsparam_u32hex()
Delete macro fsparam_u32hex() since:

- it has no caller.

- it uses the type @fs_param_is_u32_hex, which is never defined, so any
  caller would hit a compile error.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250411-fix_fs-v2-1-5d3395c102e4@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:58 +02:00
Mateusz Guzik
8564124c36 fs: improve codegen in link_path_walk()
Looking at the asm produced by gcc 13.3 for x86-64:
1. may_lookup() usage was not optimized for succeeding, despite the
   routine being inlined and rightfully starting with likely(!err)
2. the compiler assumed the path will have an indefinite amount of
   slashes to skip, after which the result will be an empty name

As such:
1. predict may_lookup() succeeding
2. check for one slash, no explicit predicts. do roll forward with
   skipping more slashes while predicting there is only one
3. predict the path to find was not a mere slash

This also has a side effect of shrinking the file:
add/remove: 1/1 grow/shrink: 0/3 up/down: 934/-1012 (-78)
Function                                     old     new   delta
link_path_walk                                 -     934    +934
path_parentat                                138     112     -26
path_openat                                 4864    4823     -41
path_lookupat                                418     374     -44
link_path_walk.part.constprop                901       -    -901
Total: Before=46639, After=46561, chg -0.17%

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250412110935.2267703-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:58 +02:00
Li RongQing
ef181fa11d fs: Make file-nr output the total allocated file handles
Make file-nr output the total number of allocated file handles rather
than the per-CPU cached count. This is more precise, and the code is not
in a hot path.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/20250410112117.2851-1-lirongqing@baidu.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:58 +02:00
Colin Ian King
6b24a702ec select: core_sys_select add unlikely branch hint on return path
Adding an unlikely() hint on the n < 0 comparison return path improves
run-time performance of the select() system call, the negative
value of n is very uncommon in normal select usage.
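
A minimal userspace sketch of such a hint, assuming a GCC- or
Clang-compatible compiler that provides __builtin_expect(). check_nfds() is
a hypothetical stand-in for the argument check, not the kernel function.

```c
#include <assert.h>
#include <errno.h>

/* Tell the compiler the condition is expected to be false. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in: reject a negative descriptor count. */
static int check_nfds(int n)
{
	if (unlikely(n < 0))
		return -EINVAL;	/* rare path, can be laid out out of line */
	return n;		/* common path falls straight through */
}

/* The hint only influences code layout; results are unchanged. */
static void self_check(void)
{
	assert(check_nfds(-1) == -EINVAL);
	assert(check_nfds(1000) == 1000);
}
```

The hint does not change what the function returns, only which path the
compiler treats as the fall-through case.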

Benchmarking on a Debian-based Intel(R) Core(TM) Ultra 9 285K with
a 6.15-rc1 kernel built with 14.2.0 using a select of 1000 file
descriptors with zero timeout shows a consistent call reduction from
258 ns down to 254 ns, which is a ~1.5% performance improvement.

Results based on running 25 tests with turbo disabled (to reduce clock
freq turbo changes), with 30 second run per test and comparing the number
of select() calls per second. The % standard deviation of the 25 tests
was 0.24%, so results are reliable.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/20250414092426.53529-1-colin.i.king@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 10:27:58 +02:00
Zijun Hu
1363c134ad fs/filesystems: Fix potential unsigned integer underflow in fs_name()
fs_name() has @index as unsigned int, so there is underflow risk for
operation '@index--'.

Fix by breaking the for loop when '@index == 0' which is also more proper
than '@index <= 0' for unsigned integer comparison.
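
The pattern can be sketched in userspace as follows. This is not the kernel
code: name_at(), decrement() and sample_names are hypothetical, and an array
stands in for the filesystem list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample data standing in for a list of names. */
static const char *const sample_names[] = { "first", "second", "third" };

/*
 * Return the entry at position `index`, or NULL when the list is shorter.
 * The loop leaves as soon as index reaches 0, so index-- never runs on 0.
 */
static const char *name_at(const char *const *names, size_t count,
			   unsigned int index)
{
	assert(names != NULL);
	for (size_t i = 0; i < count; i++, index--) {
		if (index == 0)
			return names[i];
	}
	return NULL;
}

/* Decrementing an unsigned zero wraps around instead of going negative. */
static unsigned int decrement(unsigned int index)
{
	return index - 1;
}
```

decrement(0) yields the largest unsigned int value, which is why a test
such as 'index <= 0' can only ever match 'index == 0'.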

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250410-fix_fs-v1-1-7c14ccc8ebaa@quicinc.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-14 13:05:59 +02:00
Zijun Hu
698d1b483c fs/fs_context: Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep()
There is no mount option with pattern "...,=key_or_value,...", so the if
condition '(value == key)' in while loop of vfs_parse_monolithic_sep() is
unlikely to be true.

Mark the condition with unlikely() to improve both performance and
readability.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250410-fix_fs-v1-5-7c14ccc8ebaa@quicinc.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-14 13:05:59 +02:00
Zijun Hu
1d17057d21 fs/fs_parse: Correct comments of fs_validate_description()
For fs_validate_description(), the comments can easily mislead the reader
into thinking that the function searches array @desc for duplicated entries
with the name specified by parameter @name, but @name is not actually used
for the search.

Fix by documenting @name as the name of the owner of these parameter
specifications.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-14 13:05:40 +02:00
Zijun Hu
916148d24d fs/fs_context: Use KERN_INFO for infof()|info_plog()|infofc()
Use KERN_INFO instead of default KERN_NOTICE for
infof()|info_plog()|infofc() to printk informational messages.

Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250410-rfc_fix_fs-v1-1-406e13b3608e@quicinc.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 16:10:51 +02:00
Colin Ian King
5730609ffd select: do_pollfd: add unlikely branch hint return path
Adding an unlikely() hint on the fd < 0 comparison return path improves
run-time performance of the poll() system call. gcov based coverage
analysis based on running stress-ng and a kernel build shows that this
return path is highly unlikely.

Benchmarking on a Debian-based Intel(R) Core(TM) Ultra 9 285K with
a 6.15-rc1 kernel and a poll of 1024 file descriptors with zero timeout
shows a call reduction from 32818 ns down to 32635 ns, which is a ~0.5%
performance improvement.

Results based on running 25 tests with turbo disabled (to reduce clock
freq turbo changes), with 30 second run per test and comparing the number
of poll() calls per second. The % standard deviation of the 25 tests
was 0.08%, so results are reliable.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/20250409155510.577490-1-colin.i.king@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 15:56:54 +02:00
David Howells
f1745496d3 netfs: Update main API document
Bring the netfs documentation up to date.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/1690127.1744208325@warthog.procyon.org.uk
Reviewed-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Timothy Day <timday@amazon.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: netfs@lists.linux.dev
cc: linux-doc@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 15:23:50 +02:00
Mateusz Guzik
e45960c279 fs: unconditionally use atime_needs_update() in pick_link()
The vast majority of the time the function returns false.

This avoids a branch to determine whether we are in RCU mode.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250408073641.1799151-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 11:08:24 +02:00
Christian Brauner
c9b380a017 Merge patch series "fs: sort out cosmetic differences between stat funcs and add predicts"
Predict fastpaths in stat and during fdput().

* patches from https://lore.kernel.org/20250406235806.1637000-1-mjguzik@gmail.com:
  fs: predict not having to do anything in fdput()
  fs: sort out cosmetic differences between stat funcs and add predicts

Link: https://lore.kernel.org/20250406235806.1637000-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 10:28:10 +02:00
Mateusz Guzik
5f3e0b4a1f fs: predict not having to do anything in fdput()
This matches the annotation in fdget().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250406235806.1637000-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 10:28:07 +02:00
Mateusz Guzik
eaec2cd167 fs: sort out cosmetic differences between stat funcs and add predicts
This is a nop, but I did verify asm improves.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250406235806.1637000-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08 10:28:07 +02:00
Christian Brauner
9d36c5145a Merge patch series "fs: harden anon inodes"
Christian Brauner <brauner@kernel.org> says:

* Anonymous inodes currently don't come with a proper mode causing
  issues in the kernel when we want to add useful VFS debug asserts. Fix
  that by giving them a proper mode and masking it off when we report it
  to userspace which relies on them not having any mode.

* Anonymous inodes currently allow to change inode attributes because
  the VFS falls back to simple_setattr() if i_op->setattr isn't
  implemented. This means the ownership and mode for every single user
  of anon_inode_inode can be changed. Block that as it's either useless
  or actively harmful. If specific ownership is needed the respective
  subsystem should allocate anonymous inodes from their own private
  superblock.

* Port pidfs to the new anon_inode_{g,s}etattr() helpers.

* Add proper tests for anonymous inode behavior.

The anonymous inode specific fixes should ideally be backported to all
LTS kernels.

* patches from https://lore.kernel.org/20250407-work-anon_inode-v1-0-53a44c20d44e@kernel.org:
  selftests/filesystems: add fourth test for anonymous inodes
  selftests/filesystems: add third test for anonymous inodes
  selftests/filesystems: add second test for anonymous inodes
  selftests/filesystems: add first test for anonymous inodes
  anon_inode: raise SB_I_NODEV and SB_I_NOEXEC
  pidfs: use anon_inode_setattr()
  anon_inode: explicitly block ->setattr()
  pidfs: use anon_inode_getattr()
  anon_inode: use a proper mode internally

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-0-53a44c20d44e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:20:15 +02:00
Christian Brauner
25a6cc9a63 selftests/filesystems: add open() test for anonymous inodes
Test that anonymous inodes cannot be open()ed.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-9-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:20:15 +02:00
Christian Brauner
f8ca403ae7 selftests/filesystems: add exec() test for anonymous inodes
Test that anonymous inodes cannot be exec()ed.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-8-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:20:14 +02:00
Christian Brauner
fcf31ec7ca selftests/filesystems: add chmod() test for anonymous inodes
Test that anonymous inodes cannot be chmod()ed.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-7-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:20:14 +02:00
Christian Brauner
c784159750 selftests/filesystems: add chown() test for anonymous inodes
Test that anonymous inodes cannot be chown()ed.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-6-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:20:14 +02:00
Christian Brauner
1ed95281c0 anon_inode: raise SB_I_NODEV and SB_I_NOEXEC
It isn't possible to execute anonymous inodes because they cannot be
opened in any way after they have been created. This includes execution:

execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH)

Anonymous inodes have inode->f_op set to no_open_fops which sets
no_open() which returns ENXIO. That means any call to do_dentry_open()
which is the endpoint of the do_open_execat() will fail. There's no
chance to execute an anonymous inode. Unless a given subsystem overrides
it ofc.

However, we should still harden this and raise SB_I_NODEV and
SB_I_NOEXEC on the superblock itself so that no one gets any creative
ideas.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-5-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # all LTS kernels
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:19:04 +02:00
Christian Brauner
c83b902496 pidfs: use anon_inode_setattr()
So far pidfs did use its own version. Just use the generic version.
We use our own wrappers because we're going to be implementing
properties soon.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-4-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:19:02 +02:00
Christian Brauner
22bdf3d658 anon_inode: explicitly block ->setattr()
It is currently possible to change the mode and owner of the single
anonymous inode in the kernel:

#include <signal.h>
#include <sys/signalfd.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int ret, sfd;
        sigset_t mask;
        struct signalfd_siginfo fdsi;

        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGQUIT);

        ret = sigprocmask(SIG_BLOCK, &mask, NULL);
        if (ret < 0)
                _exit(1);

        sfd = signalfd(-1, &mask, 0);
        if (sfd < 0)
                _exit(2);

        ret = fchown(sfd, 5555, 5555);
        if (ret < 0)
                _exit(3);

        ret = fchmod(sfd, 0777);
        if (ret < 0)
                _exit(3);

        _exit(4);
}

This is a bug. It's not really a meaningful one because anonymous inodes
don't really figure into path lookup and they cannot be reopened via
/proc/<pid>/fd/<nr> and can't be used for lookup itself. So they can
only ever serve as direct references.

But it is still completely bogus to allow the mode and ownership or any
of the properties of the anonymous inode to be changed. Block this!

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-3-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # all LTS kernels
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:18:59 +02:00
Christian Brauner
37e62dafbf pidfs: use anon_inode_getattr()
So far pidfs did use its own version. Just use the generic version. We
use our own wrappers because we're going to be implementing our own
retrieval properties soon.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-2-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:18:56 +02:00
Christian Brauner
cfd86ef7e8 anon_inode: use a proper mode internally
This allows the VFS to not trip over anonymous inodes and we can add
asserts based on the mode into the vfs. When we report it to userspace
we can simply hide the mode to avoid regressions. I've audited all
direct callers of alloc_anon_inode() and only secretmem overrides i_mode
and i_op inode operations but it already uses a regular file.

Link: https://lore.kernel.org/20250407-work-anon_inode-v1-1-53a44c20d44e@kernel.org
Fixes: af153bb63a ("vfs: catch invalid modes in may_open()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # all LTS kernels
Reported-by: syzbot+5d8e79d323a13aa0b248@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67ed3fb3.050a0220.14623d.0009.GAE@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 16:18:46 +02:00
David Disseldorp
418556fa57 docs: initramfs: update compression and mtime descriptions
Update the document to reflect that initramfs didn't replace initrd
following kernel 2.5.x.
The initramfs buffer format now supports many compression types in
addition to gzip, so include them in the grammar section.
c_mtime use is dependent on CONFIG_INITRAMFS_PRESERVE_MTIME.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Link: https://lore.kernel.org/r/20250402033949.852-2-ddiss@suse.de
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:38:01 +02:00
Linus Torvalds
0af2f6be1b Linux 6.15-rc1 v6.15-rc1
2025-04-06 13:11:33 -07:00
Thomas Weißschuh
0efdedb335 tools/include: make uapi/linux/types.h usable from assembly
The "real" linux/types.h UAPI header gracefully degrades to a NOOP when
included from assembly code.

Mirror this behaviour in the tools/ variant.

Test for __ASSEMBLER__ over __ASSEMBLY__ as the former is provided by the
toolchain automatically.

Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/lkml/af553c62-ca2f-4956-932c-dd6e3a126f58@sirena.org.uk/
Fixes: c9fbaa8795 ("selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20250321-uapi-consistency-v1-1-439070118dc0@linutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-06 12:55:31 -07:00
Linus Torvalds
710329254d Merge tag 'turbostat-2025.05.06' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:

 - support up to 8192 processors

 - add cpuidle governor debug telemetry, disabled by default

 - update default output to exclude cpuidle invocation counts

 - bug fixes

* tag 'turbostat-2025.05.06' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: v2025.05.06
  tools/power turbostat: disable "cpuidle" invocation counters, by default
  tools/power turbostat: re-factor sysfs code
  tools/power turbostat: Restore GFX sysfs fflush() call
  tools/power turbostat: Document GNR UncMHz domain convention
  tools/power turbostat: report CoreThr per measurement interval
  tools/power turbostat: Increase CPU_SUBSET_MAXCPUS to 8192
  tools/power turbostat: Add idle governor statistics reporting
  tools/power turbostat: Fix names matching
  tools/power turbostat: Allow Zero return value for some RAPL registers
  tools/power turbostat: Clustered Uncore MHz counters should honor show/hide options
2025-04-06 12:32:43 -07:00
Linus Torvalds
59f392fa7c Merge tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire fix from Vinod Koul:

 - add missing config symbol CONFIG_SND_HDA_EXT_CORE required for asoc
   driver CONFIG_SND_SOF_SOF_HDA_SDW_BPT

* tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
  ASoC: SOF: Intel: Let SND_SOF_SOF_HDA_SDW_BPT select SND_HDA_EXT_CORE
2025-04-06 12:04:53 -07:00
Len Brown
03e00e373c tools/power turbostat: v2025.05.06
Support up to 8192 processors
Add cpuidle governor debug telemetry, disabled by default
Update default output to exclude cpuidle invocation counts
Bug fixes

Signed-off-by: Len Brown <len.brown@intel.com>
2025-04-06 14:49:20 -04:00
Len Brown
ec4acd3166 tools/power turbostat: disable "cpuidle" invocation counters, by default
Create "pct_idle" counter group, the software notion of residency
so it can now be singled out, independent of other counter groups.

Create "cpuidle" group, the cpuidle invocation counts.
Disable "cpuidle", by default.

Create "swidle" = "cpuidle" + "pct_idle".
Undocument "sysfs", the old name for "swidle", but keep it working
for backwards compatibility.

Create "hwidle", all the HW idle counters

Modify "idle", enabled by default
"idle" = "hwidle" + "pct_idle" (and now excludes "cpuidle")

Signed-off-by: Len Brown <len.brown@intel.com>
2025-04-06 14:29:57 -04:00
Linus Torvalds
dda8887894 Merge tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
 "Fix a perf events time accounting bug"

* tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix child_total_time_enabled accounting bug at task exit
2025-04-06 10:48:12 -07:00
Linus Torvalds
302deb109d Merge tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Fix a nonsensical Kconfig combination

 - Remove an unnecessary rseq-notification

* tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Eliminate useless task_work on execve
  sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
2025-04-06 10:44:58 -07:00