On our HDFS servers with 12 HDDs per server, a HDFS datanode[0] startup
involves scanning all files and caching their metadata (including dentries
and inodes) in memory. Each HDD contains approximately 2 million files,
resulting in a total of ~20 million cached dentries after initialization.
To minimize dentry reclamation, we set vfs_cache_pressure to 1. Despite
this configuration, memory pressure conditions can still trigger
reclamation of up to 50% of cached dentries, reducing the cache from 20
million to approximately 10 million entries. During the subsequent cache
rebuild period, any HDFS datanode restart operation incurs substantial
latency penalties until full cache recovery completes.
To maintain service stability, we need to preserve more dentries during
memory reclamation. The current minimum reclaim ratio (1/100 of total
dentries) remains too aggressive for our workload. This patch introduces
vfs_cache_pressure_denom for more granular cache pressure control. The
configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000]
effectively maintains the full 20 million dentry cache under memory
pressure, preventing datanode restart performance degradation.
Link: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/20250511083624.9305-1-laoar.shao@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit(). This
second phase is non-blocking as long as the userspace buffer is not paged
out, making it interruptible makes zero sense.
Overload d_type as flags, since it only uses 4 bits from 32.
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/20250513112335.1473177-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
When CONFIG_CGROUPS is not selected, {get,put}_cgroup_ns become no-ops
and therefore it is not necessary to compile in the code for changing
the reference count.
When CONFIG_CGROUP is selected, there is no valid case where
either of {get,put}_cgroup_ns() will be called with a NULL argument.
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-3-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
In free_nsproxy() and the error path of create_new_namesapces() the
put_*_ns() calls are guarded by unnecessary NULL checks.
put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never
receive a NULL argument unless their namespace type is disabled, and in
this case all four become no-ops at compile time anyway. put_mnt_ns()
will never receive a null argument at any time.
This unguarded usage is in line with other call sites of put_*_ns().
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Mateusz Guzik <mjguzik@gmail.com> says:
Since path looku is being looked at, two extra nits from me:
1. some trivial jump avoidance in inode_permission()
2. but more importantly avoiding a memory access which is most likely a
cache miss when descending into devcgroup_inode_permission()
the file seems to have no maintainer fwiw
anyhow I'm confident the way forward is to add IOP_FAST_MAY_EXEC (or
similar) to elide inode_permission() in the common case to begin with.
There are quite a few branches which straight up don't need execute.
On top of that btrfs has a permission hook only to check for MAY_WRITE,
which in case of path lookup is not set. With the above flag the call
will be avoided.
* patches from https://lore.kernel.org/20250416221626.2710239-1-mjguzik@gmail.com:
device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
fs: touch up predicts in inode_permission()
Link: https://lore.kernel.org/20250416221626.2710239-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The routine gets called for every path component during lookup.
->i_mode is going to be cached on account of permission checks, while
->i_rdev is an area which is most likely cache-cold.
gcc 14.2 is kind enough to emit one branch:
movzwl (%rbx),%eax
mov %eax,%edx
and $0xb000,%dx
cmp $0x2000,%dx
je 11bc <inode_permission+0xec>
This patch is lazy in that I don't know if the ->i_rdev branch makes
any sense with the newly added mode check upfront. I am not changing any
semantics here though.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250416221626.2710239-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove validate_constant_table() since:
- It has no caller.
- It has below 3 bugs for good constant table array array[] which must
end with a empty entry, and take below invocation for explaination:
validate_constant_table(array, ARRAY_SIZE(array), ...)
- Always return wrong value due to the last empty entry.
- Imprecise error message for missorted case.
- Potential NULL pointer dereference since the last pr_err() may use
@tbl[i].name NULL pointer to print the last empty entry's name.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250415-fix_fs-v4-1-5d575124a3ff@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The routine only encounters errors when people try to access things they
can't, which is a negligible amount of calls.
The only questionable bit might be the pre-existing predict around
MAY_WRITE. Currently the routine is predominantly used for MAY_EXEC, so
this makes some sense.
I verified this straightens out the asm.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250416221626.2710239-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Looking at the asm produced by gcc 13.3 for x86-64:
1. may_lookup() usage was not optimized for succeeding, despite the
routine being inlined and rightfully starting with likely(!err)
2. the compiler assumed the path will have an indefinite amount of
slashes to skip, after which the result will be an empty name
As such:
1. predict may_lookup() succeeding
2. check for one slash, no explicit predicts. do roll forward with
skipping more slashes while predicting there is only one
3. predict the path to find was not a mere slash
This also has a side effect of shrinking the file:
add/remove: 1/1 grow/shrink: 0/3 up/down: 934/-1012 (-78)
Function old new delta
link_path_walk - 934 +934
path_parentat 138 112 -26
path_openat 4864 4823 -41
path_lookupat 418 374 -44
link_path_walk.part.constprop 901 - -901
Total: Before=46639, After=46561, chg -0.17%
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250412110935.2267703-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Adding an unlikely() hint on the n < 0 comparison return path improves
run-time performance of the select() system call, the negative
value of n is very uncommon in normal select usage.
Benchmarking on an Debian based Intel(R) Core(TM) Ultra 9 285K with
a 6.15-rc1 kernel built with 14.2.0 using a select of 1000 file
descriptors with zero timeout shows a consistent call reduction from
258 ns down to 254 ns, which is a ~1.5% performance improvement.
Results based on running 25 tests with turbo disabled (to reduce clock
freq turbo changes), with 30 second run per test and comparing the number
of select() calls per second. The % standard deviation of the 25 tests
was 0.24%, so results are reliable.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/20250414092426.53529-1-colin.i.king@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
For fs_validate_description(), its comments easily mislead reader that
the function will search array @desc for duplicated entries with name
specified by parameter @name, but @name is not used for search actually.
Fix by marking name as owner's name of these parameter specifications.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Adding an unlikely() hint on the fd < 0 comparison return path improves
run-time performance of the poll() system call. gcov based coverage
analysis based on running stress-ng and a kernel build shows that this
path return path is highly unlikely.
Benchmarking on an Debian based Intel(R) Core(TM) Ultra 9 285K with
a 6.15-rc1 kernel and a poll of 1024 file descriptors with zero timeout
shows an call reduction from 32818 ns down to 32635 ns, which is a ~0.5%
performance improvement.
Results based on running 25 tests with turbo disabled (to reduce clock
freq turbo changes), with 30 second run per test and comparing the number
of poll() calls per second. The % standard deviation of the 25 tests
was 0.08%, so results are reliable.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/20250409155510.577490-1-colin.i.king@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner <brauner@kernel.org> says:
* Anonymous inodes currently don't come with a proper mode causing
issues in the kernel when we want to add useful VFS debug assert. Fix
that by giving them a proper mode and masking it off when we report it
to userspace which relies on them not having any mode.
* Anonymous inodes currently allow to change inode attributes because
the VFS falls back to simple_setattr() if i_op->setattr isn't
implemented. This means the ownership and mode for every single user
of anon_inode_inode can be changed. Block that as it's either useless
or actively harmful. If specific ownership is needed the respective
subsystem should allocate anonymous inodes from their own private
superblock.
* Port pidfs to the new anon_inode_{g,s}etattr() helpers.
* Add proper tests for anonymous inode behavior.
The anonymous inode specific fixes should ideally be backported to all
LTS kernels.
* patches from https://lore.kernel.org/20250407-work-anon_inode-v1-0-53a44c20d44e@kernel.org:
selftests/filesystems: add fourth test for anonymous inodes
selftests/filesystems: add third test for anonymous inodes
selftests/filesystems: add second test for anonymous inodes
selftests/filesystems: add first test for anonymous inodes
anon_inode: raise SB_I_NODEV and SB_I_NOEXEC
pidfs: use anon_inode_setattr()
anon_inode: explicitly block ->setattr()
pidfs: use anon_inode_getattr()
anon_inode: use a proper mode internally
Link: https://lore.kernel.org/20250407-work-anon_inode-v1-0-53a44c20d44e@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
It isn't possible to execute anonymous inodes because they cannot be
opened in any way after they have been created. This includes execution:
execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH)
Anonymous inodes have inode->f_op set to no_open_fops which sets
no_open() which returns ENXIO. That means any call to do_dentry_open()
which is the endpoint of the do_open_execat() will fail. There's no
chance to execute an anonymous inode. Unless a given subsystem overrides
it ofc.
However, we should still harden this and raise SB_I_NODEV and
SB_I_NOEXEC on the superblock itself so that no one gets any creative
ideas.
Link: https://lore.kernel.org/20250407-work-anon_inode-v1-5-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # all LTS kernels
Signed-off-by: Christian Brauner <brauner@kernel.org>
It is currently possible to change the mode and owner of the single
anonymous inode in the kernel:
int main(int argc, char *argv[])
{
int ret, sfd;
sigset_t mask;
struct signalfd_siginfo fdsi;
sigemptyset(&mask);
sigaddset(&mask, SIGINT);
sigaddset(&mask, SIGQUIT);
ret = sigprocmask(SIG_BLOCK, &mask, NULL);
if (ret < 0)
_exit(1);
sfd = signalfd(-1, &mask, 0);
if (sfd < 0)
_exit(2);
ret = fchown(sfd, 5555, 5555);
if (ret < 0)
_exit(3);
ret = fchmod(sfd, 0777);
if (ret < 0)
_exit(3);
_exit(4);
}
This is a bug. It's not really a meaningful one because anonymous inodes
don't really figure into path lookup and they cannot be reopened via
/proc/<pid>/fd/<nr> and can't be used for lookup itself. So they can
only ever serve as direct references.
But it is still completely bogus to allow the mode and ownership or any
of the properties of the anonymous inode to be changed. Block this!
Link: https://lore.kernel.org/20250407-work-anon_inode-v1-3-53a44c20d44e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # all LTS kernels
Signed-off-by: Christian Brauner <brauner@kernel.org>
Update the document to reflect that initramfs didn't replace initrd
following kernel 2.5.x.
The initramfs buffer format now supports many compression types in
addition to gzip, so include them in the grammar section.
c_mtime use is dependent on CONFIG_INITRAMFS_PRESERVE_MTIME.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Link: https://lore.kernel.org/r/20250402033949.852-2-ddiss@suse.de
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull soundwire fix from Vinod Koul:
- add missing config symbol CONFIG_SND_HDA_EXT_CORE required for asoc
driver CONFIG_SND_SOF_SOF_HDA_SDW_BPT
* tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
ASoC: SOF: Intel: Let SND_SOF_SOF_HDA_SDW_BPT select SND_HDA_EXT_CORE
Support up to 8192 processors
Add cpuidle governor debug telemetry, disabled by default
Update default output to exclude cpuidle invocation counts
Bug fixes
Signed-off-by: Len Brown <len.brown@intel.com>
Create "pct_idle" counter group, the sofware notion of residency
so it can now be singled out, independent of other counter groups.
Create "cpuidle" group, the cpuidle invocation counts.
Disable "cpuidle", by default.
Create "swidle" = "cpuidle" + "pct_idle".
Undocument "sysfs", the old name for "swidle", but keep it working
for backwards compatibilty.
Create "hwidle", all the HW idle counters
Modify "idle", enabled by default
"idle" = "hwidle" + "pct_idle" (and now excludes "cpuidle")
Signed-off-by: Len Brown <len.brown@intel.com>
Pull perf event fix from Ingo Molnar:
"Fix a perf events time accounting bug"
* tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix child_total_time_enabled accounting bug at task exit
Pull scheduler fixes from Ingo Molnar:
- Fix a nonsensical Kconfig combination
- Remove an unnecessary rseq-notification
* tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Eliminate useless task_work on execve
sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP