There is a race condition that leads to a use-after-free situation:
because the rawdata inodes are not refcounted, an attacker can start
open()ing one of the rawdata files, and at the same time remove the
last reference to this rawdata (by removing the corresponding profile,
for example), which frees its struct aa_loaddata; as a result, when
seq_rawdata_open() is reached, i_private is a dangling pointer and
freed memory is accessed.
The rawdata inodes weren't refcounted to avoid a circular refcount and
were supposed to be held by the profile rawdata reference. However
during profile removal there is a window where the vfs and profile
destruction race, resulting in the use after free.
Fix this by moving to a double refcount scheme. Where the profile
refcount on rawdata is used to break the circular dependency. Allowing
for freeing of the rawdata once all inode references to the rawdata
are put.
Fixes: 5d5182cae4 ("apparmor: move to per loaddata files, instead of replicating in profiles")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Maxime Bélair <maxime.belair@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Differential encoding allows loops to be created if it is abused. To
prevent this the unpack should verify that a diff-encode chain
terminates.
Unfortunately the differential encode verification had two bugs.
1. it conflated states that had gone through check and already been
marked, with states that were currently being checked and marked.
This means that loops in the current chain being verified are treated
as a chain that has already been verified.
2. the order bailout on already checked states compared current chain
check iterators j,k instead of using the outer loop iterator i.
Meaning a step backwards in states in the current chain verification
was being mistaken for moving to an already verified state.
Move to a double mark scheme where already verified states get a
different mark, than the current chain being kept. This enables us
to also drop the backwards verification check that was the cause of
the second error as any already verified state is already marked.
Fixes: 031dcc8f4e ("apparmor: dfa add support for state differential encoding")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
An unprivileged local user can load, replace, and remove profiles by
opening the apparmorfs interfaces, via a confused deputy attack, by
passing the opened fd to a privileged process, and getting the
privileged process to write to the interface.
This does require a privileged target that can be manipulated to do
the write for the unprivileged process, but once such access is
achieved full policy management is possible and all the possible
implications that implies: removing confinement, DoS of system or
target applications by denying all execution, by-passing the
unprivileged user namespace restriction, to exploiting kernel bugs for
a local privilege escalation.
The policy management interface can not have its permissions simply
changed from 0666 to 0600 because non-root processes need to be able
to load policy to different policy namespaces.
Instead ensure the task writing the interface has privileges that
are a subset of the task that opened the interface. This is already
done via policy for confined processes, but unconfined can delegate
access to the opened fd, by-passing the usual policy check.
Fixes: b7fd2c0340 ("apparmor: add per policy ns .load, .replace, .remove interface files")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
if ns_name is NULL after
1071 error = aa_unpack(udata, &lh, &ns_name);
and if ent->ns_name contains an ns_name in
1089 } else if (ent->ns_name) {
then ns_name is assigned the ent->ns_name
1095 ns_name = ent->ns_name;
however ent->ns_name is freed at
1262 aa_load_ent_free(ent);
and then again when freeing ns_name at
1270 kfree(ns_name);
Fix this by NULLing out ent->ns_name after it is transferred to ns_name
Fixes: 145a0ef21c ("apparmor: fix blob compression when ns is forced on a policy load
")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Currently the number of policy namespaces is not bounded relying on
the user namespace limit. However policy namespaces aren't strictly
tied to user namespaces and it is possible to create them and nest
them arbitrarily deep which can be used to exhaust system resource.
Hard cap policy namespaces to the same depth as user namespaces.
Fixes: c88d4c7b04 ("AppArmor: core policy routines")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Reviewed-by: Ryan Lee <ryan.lee@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The profile removal code uses recursion when removing nested profiles,
which can lead to kernel stack exhaustion and system crashes.
Reproducer:
$ pf='a'; for ((i=0; i<1024; i++)); do
echo -e "profile $pf { \n }" | apparmor_parser -K -a;
pf="$pf//x";
done
$ echo -n a > /sys/kernel/security/apparmor/.remove
Replace the recursive __aa_profile_list_release() approach with an
iterative approach in __remove_profile(). The function repeatedly
finds and removes leaf profiles until the entire subtree is removed,
maintaining the same removal semantic without recursion.
Fixes: c88d4c7b04 ("AppArmor: core policy routines")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The function sets `*ns = NULL` on every call, leaking the namespace
string allocated in previous iterations when multiple profiles are
unpacked. This also breaks namespace consistency checking since *ns
is always NULL when the comparison is made.
Remove the incorrect assignment.
The caller (aa_unpack) initializes *ns to NULL once before the loop,
which is sufficient.
Fixes: dd51c84857 ("apparmor: provide base for multiple profiles to be replaced at once")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Start states are read from untrusted data and used as indexes into the
DFA state tables. The aa_dfa_next() function call in unpack_pdb() will
access dfa->tables[YYTD_ID_BASE][start], and if the start state exceeds
the number of states in the DFA, this results in an out-of-bound read.
==================================================================
BUG: KASAN: slab-out-of-bounds in aa_dfa_next+0x2a1/0x360
Read of size 4 at addr ffff88811956fb90 by task su/1097
...
Reject policies with out-of-bounds start states during unpacking
to prevent the issue.
Fixes: ad5ff3db53 ("AppArmor: Add ability to load extended policy")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Smatch static checker warning:
security/apparmor/policy_unpack.c:966 unpack_pdb()
warn: unsigned 'unpack_tags(e, &pdb->tags, info)' is never less than zero.
unpack_tags() is declared with return type size_t (unsigned) but returns
negative errno values on failure. The caller in unpack_pdb() tests the
return with `< 0`, which is always false for an unsigned type, making
error handling dead code. Malformed tag data would be silently accepted
instead of causing a load failure.
Change return type of unpack_tags() from size_t to int to match the
functions's actual semantic.
Fixes: 3d28e2397a ("apparmor: add support loading per permission tagging")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Massimiliano Pellizzer <mpellizzer.dev@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
if debugging is enabled the DEBUG statement will fail do to a bad
fat fingered cast.
Fixes: 102ada7ca3 ("apparmor: fix fmt string type error in process_strs_entry")
Signed-off-by: John Johansen <john.johansen@canonical.com>
Add .kunitconfig file to the AppArmor directory to enable easy execution of
KUnit tests.
AppArmor tests (CONFIG_SECURITY_APPARMOR_KUNIT_TEST) depend on
CONFIG_SECURITY_APPARMOR which also depends on CONFIG_SECURITY and
CONFIG_NET. Without explicitly enabling these configs in the .kunitconfig,
developers will need to specify config manually.
With the .kunitconfig, developers can run the tests:
$ ./tools/testing/kunit/kunit.py run --kunitconfig security/apparmor
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
There are two unused percpu critical sections in the buffer management
code. These are remanents from when a more complex hold algorithm was
used. Remove them, as they serve no purpose.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The buffer hold is a measure of contention, but it is tracked per cpu
where the lock is a globabl resource. On some systems (eg. real time)
there is no guarantee that the code will be on the same cpu pre, and
post spinlock acquisition, nor that the buffer will be put back to
the same percpu cache when we are done with it.
Because of this the hold value can move asynchronous to the buffers on
the cache, meaning it is possible to underflow, and potentially in really
pathelogical cases overflow.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
When aa_get_buffer() pulls from the per-cpu list it unconditionally
decrements cache->hold. If hold reaches 0 while count is still non-zero,
the unsigned decrement wraps to UINT_MAX. This keeps hold non-zero for a
very long time, so aa_put_buffer() never returns buffers to the global
list, which can starve other CPUs and force repeated kmalloc(aa_g_path_max)
allocations.
Guard the decrement so hold never underflows.
Fixes: ea9bae12d0 ("apparmor: cache buffers on percpu list if there is lock contention")
Signed-off-by: Zhengmian Hu <huzhengmian@gmail.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
This patch doesn't change current functionality, it switches the two
uses of the in_ns fns and macros into the two semantically different
cases they are used for.
xxx_in_scope for checking mediation interaction between profiles
xxx_in_view to determine which profiles are visible.The scope will
always be a subset of the view as profiles that can not see each
other can not interact.
The split can not be completely done for label_match because it has to
distinct uses matching permission against label in scope, and checking
if a transition to a profile is allowed. The transition to a profile
can include profiles that are in view but not in scope, so retain this
distinction as a parameter.
While at the moment the two uses are very similar, in the future there
will be additional differences. So make sure the semantics differences
are present in the code.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
compound match is inconsistent in returning a state or an integer error
this is problemati if the error is ever used as a state in the state
machine
Fixes: f1bd904175 ("apparmor: add the base fns() for domain labels")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The modes shouldn't be applied at the point of label match, it just
results in them being applied multiple times. Instead they should be
applied after which is already being done by all callers so it can
just be dropped from label_match.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The fast path cache check is incorrect forcing more slow path
revalidations than necessary, because the unix logic check is inverted.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Posix cpu timers requires an additional step beyond setting the rlimit.
Refactor the code so its clear when what code is setting the
limit and conditionally update the posix cpu timers when appropriate.
Fixes: baa73d9e47 ("posix-timers: Make them configurable")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
aa_cred_raw_label() and cred_label() now do the same things so
consolidate to cred_label()
Document the crit section use and constraints better and refactor
__begin_current_label_crit_section() into a base fn
__begin_cred_crit_section() and a wrapper that calls the base with
current cred.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
files with a dentry pointing aa_null.dentry where already rejected as
part of file_inheritance. Unfortunately the check in
common_file_perm() is insufficient to cover all cases causing
unnecessary audit messages without the original files context.
Eg.
[ 442.886474] audit: type=1400 audit(1704822661.616:329): apparmor="DENIED" operation="file_inherit" class="file" namespace="root//lxd-juju-98527a-0_<var-snap-lxd-common-lxd>" profile="snap.lxd.activate" name="/apparmor/.null" pid=9525 comm="snap-exec"
Further examples of this are in the logs of
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2120439https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1952084https://bugs.launchpad.net/snapd/+bug/2049099
These messages have no value and should not be sent to the logs.
AppArmor was already filtering the out in some cases but the original
patch did not catch all cases. Fix this by push the existing check
down into two functions that should cover all cases.
Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2122743
Fixes: 192ca6b55a ("apparmor: revalidate files during exec")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
aa_free_data() and free_attachment() don't guard against having
a NULL parameter passed to them. Fix this.
Reviewed-by: Ryan Lee <ryan.lee@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
In policy_unpack.c:unpack_perms_table, the perms struct is allocated via
kcalloc, with the position being reset if the allocation fails. However,
the error path results in -EPROTO being retured instead of -ENOMEM. Fix
this to return the correct error code.
Reported-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Fixes: fd1b2b95a2 ("apparmor: add the ability for policy to specify a permission table")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Ryan Lee <ryan.lee@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
If we are not in an atomic context in common_file_perm, then we don't have
to use the atomic versions, resulting in improved performance outside of
atomic contexts.
Signed-off-by: Ryan Lee <ryan.lee@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The previous value of GFP_ATOMIC is an int and not a bool, potentially
resulting in UB when being assigned to a bool. In addition, the mmap hook
is called outside of locks (i.e. in a non-atomic context), so we can pass
a fixed constant value of false instead to common_mmap.
Signed-off-by: Ryan Lee <ryan.lee@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
This new field allows reliable identification of the binary that
triggered a denial since the existing field (comm) only gives the name of
the binary, not its path. Thus comm doesn't work for binaries outside of
$PATH or works unreliably when two binaries have the same name.
Additionally comm can be modified by a program, for example, comm="(tor)"
or comm=4143504920506F6C6C6572 (= ACPI Poller).
Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Add support for the per permission tag index for a given permission
set. This will be used by both meta-data tagging, to allow annotating
accept states with context and debug information. As well as by rule
tainting and triggers to specify the taint or trigger to be applied.
Since these are low frequency ancillary data items they are stored
in a tighter packed format to that allows for sharing and reuse of the
strings between permissions and accept states. Reducing the amount of
kernel memory use at the cost of having to go through a couple if
index based indirections.
The tags are just strings that has no meaning with out context. When
used as meta-data for auditing and debugging its entirely information
for userspace, but triggers, and tainting can be used to affect the
domain. However they all exist in the same packed data set and can
be shared between different uses.
Signed-off-by: John Johansen <john.johansen@canonical.com>
The strtable is currently limited to a single entry string on unpack
even though domain has the concept of multiple entries within it. Make
this a reality as it will be used for tags and more advanced domain
transitions.
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Source blob may come from userspace and might be unaligned.
Try to optize the copying process by avoiding unaligned memory accesses.
- Added Fixes tag
- Added "Fix &" to description as this doesn't just optimize but fixes
a potential unaligned memory access
Fixes: e6e8bf4188 ("apparmor: fix restricted endian type warnings for dfa unpack")
Signed-off-by: Helge Deller <deller@gmx.de>
[jj: remove duplicate word "convert" in comment trigger checkpatch warning]
Signed-off-by: John Johansen <john.johansen@canonical.com>
strcpy() is deprecated; use memcpy() instead. Unlike strcpy(), memcpy()
does not copy the NUL terminator from the source string, which would be
overwritten anyway on every iteration when using strcpy(). snprintf()
then ensures that 'char *s' is NUL-terminated.
Replace the hard-coded path length to remove the magic number 6, and add
a comment explaining the extra 11 bytes.
Closes: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Deal with the potential that sock and sock-sk can be NULL during
socket setup or teardown. This could lead to an oops. The fix for NULL
pointer dereference in __unix_needs_revalidation shows this is at
least possible for af_unix sockets. While the fix for af_unix sockets
applies for newer mediation this is still the fall back path for older
af_unix mediation and other sockets, so ensure it is covered.
Fixes: 56974a6fcf ("apparmor: add base infastructure for socket mediation")
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
When receiving file descriptors via SCM_RIGHTS, both the socket pointer
and the socket's sk pointer can be NULL during socket setup or teardown,
causing NULL pointer dereferences in __unix_needs_revalidation().
This is a regression in AppArmor 5.0.0 (kernel 6.17+) where the new
__unix_needs_revalidation() function was added without proper NULL checks.
The crash manifests as:
BUG: kernel NULL pointer dereference, address: 0x0000000000000018
RIP: aa_file_perm+0xb7/0x3b0 (or +0xbe/0x3b0, +0xc0/0x3e0)
Call Trace:
apparmor_file_receive+0x42/0x80
security_file_receive+0x2e/0x50
receive_fd+0x1d/0xf0
scm_detach_fds+0xad/0x1c0
The function dereferences sock->sk->sk_family without checking if either
sock or sock->sk is NULL first.
Add NULL checks for both sock and sock->sk before accessing sk_family.
Fixes: 88fec3526e ("apparmor: make sure unix socket labeling is correctly updated.")
Reported-by: Jamin Mc <jaminmc@gmail.com>
Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7083
Closes: https://gitlab.com/apparmor/apparmor/-/issues/568
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: System Administrator <root@localhost>
Signed-off-by: John Johansen <john.johansen@canonical.com>
strcpy() is deprecated; replace it with a direct '/' assignment. The
buffer is already NUL-terminated, so there is no need to copy an
additional NUL terminator as strcpy() did.
Update the comment and add the local variable 'is_root' for clarity.
Closes: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: John Johansen <john.johansen@canonical.com>
strcpy() is deprecated and sprintf() does not perform bounds checking
either. Although an overflow is unlikely, it's better to proactively
avoid it by using the safer strscpy() and scnprintf(), respectively.
Additionally, unify memory allocation for 'hname' to simplify and
improve aa_policy_init().
Closes: https://github.com/KSPP/linux/issues/88
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Replace unbounded sprintf() calls with snprintf() to prevent potential
buffer overflows in aa_new_learning_profile(). While the current code
works correctly, snprintf() is safer and follows secure coding best
practices. No functional changes.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
Pull LSM updates from Paul Moore:
- Rework the LSM initialization code
What started as a "quick" patch to enable a notification event once
all of the individual LSMs were initialized, snowballed a bit into a
30+ patch patchset when everything was done. Most of the patches, and
diffstat, is due to splitting out the initialization code into
security/lsm_init.c and cleaning up some of the mess that was there.
While not strictly necessary, it does cleanup the code signficantly,
and hopefully makes the upkeep a bit easier in the future.
Aside from the new LSM_STARTED_ALL notification, these changes also
ensure that individual LSM initcalls are only called when the LSM is
enabled at boot time. There should be a minor reduction in boot times
for those who build multiple LSMs into their kernels, but only enable
a subset at boot.
It is worth mentioning that nothing at present makes use of the
LSM_STARTED_ALL notification, but there is work in progress which is
dependent upon LSM_STARTED_ALL.
- Make better use of the seq_put*() helpers in device_cgroup
* tag 'lsm-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (36 commits)
lsm: use unrcu_pointer() for current->cred in security_init()
device_cgroup: Refactor devcgroup_seq_show to use seq_put* helpers
lsm: add a LSM_STARTED_ALL notification event
lsm: consolidate all of the LSM framework initcalls
selinux: move initcalls to the LSM framework
ima,evm: move initcalls to the LSM framework
lockdown: move initcalls to the LSM framework
apparmor: move initcalls to the LSM framework
safesetid: move initcalls to the LSM framework
tomoyo: move initcalls to the LSM framework
smack: move initcalls to the LSM framework
ipe: move initcalls to the LSM framework
loadpin: move initcalls to the LSM framework
lsm: introduce an initcall mechanism into the LSM framework
lsm: group lsm_order_parse() with the other lsm_order_*() functions
lsm: output available LSMs when debugging
lsm: cleanup the debug and console output in lsm_init.c
lsm: add/tweak function header comment blocks in lsm_init.c
lsm: fold lsm_init_ordered() into security_init()
lsm: cleanup initialize_lsm() and rename to lsm_init_single()
...
At this point there are very few call chains that might lead to
d_make_discardable() on a dentry that hadn't been made persistent:
calls of simple_unlink() and simple_rmdir() in configfs and
apparmorfs.
Both filesystems do pin (part of) their contents in dcache, but
they are currently playing very unusual games with that. Converting
them to more usual patterns might be possible, but it's definitely
going to be a long series of changes in both cases.
For now the easiest solution is to have both stop using simple_unlink()
and simple_rmdir() - that allows to make d_make_discardable() warn
when given a non-persistent dentry.
Rather than giving them full-blown private copies (with calls of
d_make_discardable() replaced with dput()), let's pull the parts of
simple_unlink() and simple_rmdir() that deal with timestamps and link
counts into separate helpers (__simple_unlink() and __simple_rmdir()
resp.) and have those used by configfs and apparmorfs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>