BPF side tld_get_data() currently may return garbage when tld_data_u is
not aligned to page_size. This can happen when a small amount of memory
is allocated for tld_data_u. The misalignment is supposed to be allowed
and the BPF side will use tld_data_u->start to reference the tld_data_u
in a page. However, since "start" is within tld_data_u, there is no way
to know the correct "start" in the first place. As a result, BPF
programs will see garbage data. The selftest did not catch this since
it tries to allocate the maximum amount of data possible (i.e., a page)
such that tld_data_u->start is always correct.
Fix it by moving tld_data_u->start to tld_data_map->start. The original
field is renamed to unused rather than removed, because BPF side
tld_get_data() treats off = 0 returned from tld_fetch_key() as
uninitialized, so removing the field would make the first TLD, at
offset 0, look uninitialized.
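As an illustration of why the offset has to live outside the block it
describes, the sketch below derives a block's in-page offset purely from
its address. This is the kind of value user space could publish in the
map value so that the BPF side can add it to the base of the pinned
page. The helper names tld_page_offset() and tld_page_base() are
hypothetical and not part of the library, and page_size is assumed to be
a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: offset of addr inside the page that contains it.
 * page_size must be a non-zero power of two. */
size_t tld_page_offset(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size_t)(addr & (page_size - 1));
}

/* Hypothetical helper: base address of the page that contains addr. */
uintptr_t tld_page_base(uintptr_t addr, size_t page_size)
{
	return addr - tld_page_offset(addr, page_size);
}
```

A reader that only knows the page base cannot recover the offset from
data stored at that offset, which is why the value has to be kept in a
location whose position is known in advance.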
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260413190259.358442-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix a bug in the task local data library that may allocate more than a
page for tld_data_u. This may happen when users set TLD_DYN_DATA_SIZE
too large, so check it when creating dynamic TLD fields and fix the
corresponding selftest.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260413190259.358442-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If TLD_FREE_DATA_ON_THREAD_EXIT is not enabled in a translation unit
that calls __tld_create_key() first, another translation unit that
enables it will not get the auto cleanup feature, as the pthread key is
only created once, when allocating metadata. Fix it by always trying to
create the pthread key when __tld_create_key() is called.
Also improve the documentation:
- Discourage users from using different options in different
  translation units
- Specify calling tld_free() before thread exit as undefined behavior
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-6-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Without specifying constructor priority of the hidden constructor
function defined by TLD_DEFINE_KEY, __tld_create_key(..., dyn_data =
false) may run after tld_get_data() called from other constructors.
Threads calling tld_get_data() before __tld_create_key(..., dyn_data
= false) will not allocate enough memory for all TLDs and later result
in OOB access. Therefore, set it to the lowest value available to
users. Note that a lower value means higher priority and that 0-100 are
reserved for the compiler.
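The ordering rule can be seen with a small sketch for GCC or Clang. The
names ctor_order, ctor_count, late_ctor() and early_ctor() are made up
for illustration and do not appear in the library. The constructor with
priority 101 runs before the one without a priority, even though it is
defined later in the file.

```c
#include <assert.h>

/* Records the order in which the two constructors below ran. */
int ctor_order[2];
int ctor_count;

/* No explicit priority: runs after every prioritized constructor. */
__attribute__((constructor))
static void late_ctor(void)
{
	assert(ctor_count < 2);
	ctor_order[ctor_count++] = 0;
}

/* Priority 101 is the lowest value available to users, so this runs
 * before constructors with a larger priority or with none at all. */
__attribute__((constructor(101)))
static void early_ctor(void)
{
	assert(ctor_count < 2);
	ctor_order[ctor_count++] = 101;
}
```

By the time main() starts, both constructors have run and ctor_order
holds 101 followed by 0.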
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-4-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Simplify data allocation by always using aligned_alloc() and passing
size_pot, the size rounded up to the nearest power of two, as the
alignment.
Currently, aligned_alloc(page_size, size) is only intended to be used
with memory allocators that can fulfill the request without rounding
size up to page_size to conserve memory. This is enabled by defining
TLD_DATA_USE_ALIGNED_ALLOC. The reason to align to page_size is due to
the limitation of UPTR where only a page can be pinned to the kernel.
Otherwise, malloc(size * 2) is used to allocate memory for data.
However, we don't need to call aligned_alloc(page_size, size) to get
a contiguous memory of size bytes within a page. aligned_alloc(size_pot,
...) will also do the trick. Therefore, just use aligned_alloc(size_pot,
...) universally.
As for the size argument, create a new option,
TLD_DONT_ROUND_UP_DATA_SIZE, to specify not rounding up the size.
This preserves the current TLD_DATA_USE_ALIGNED_ALLOC behavior, allowing
memory allocators with low overhead aligned_alloc() to not waste memory.
To enable this, users need to make sure that the memory allocator does
not treat a size that is not an integral multiple of the alignment as
undefined behavior.
Compared to the current implementation, !TLD_DATA_USE_ALIGNED_ALLOC
used to always waste size bytes of memory due to malloc(size * 2).
Now the worst case becomes size - 1 and the best case is 0 when the size
is already a power of two.
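A minimal sketch of the idea, not the library's actual code: a block
whose address is aligned to size_pot and whose length is at most
size_pot cannot cross the boundary of any page that is at least size_pot
bytes large. The helper names round_up_pow2(), alloc_within_page() and
fits_in_one_page() are hypothetical. The sketch takes the conservative
route and also passes size_pot as the size, which corresponds to the
default rounding behavior rather than TLD_DONT_ROUND_UP_DATA_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the nearest power of two (size_pot in the text). */
size_t round_up_pow2(size_t size)
{
	size_t pot = 1;

	assert(size != 0 && size <= SIZE_MAX / 2 + 1);
	while (pot < size)
		pot <<= 1;
	return pot;
}

/* Hypothetical wrapper: allocate size bytes that never straddle a
 * boundary of any power-of-two page that is at least size_pot large.
 * The size is rounded up too, so it is a multiple of the alignment. */
void *alloc_within_page(size_t size)
{
	size_t size_pot = round_up_pow2(size);

	return aligned_alloc(size_pot, size_pot);
}

/* Return 1 if [p, p + size) lies entirely inside one page. */
int fits_in_one_page(const void *p, size_t size, size_t page_size)
{
	uintptr_t first = (uintptr_t)p / page_size;
	uintptr_t last = ((uintptr_t)p + size - 1) / page_size;

	return first == last;
}
```

For example, a 24-byte request is served from a 32-byte block aligned to
32 bytes, which wastes 8 bytes instead of the 24 bytes wasted by
malloc(size * 2).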
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when allocating memory for data, the size of
tld_data_u->start is not taken into account. This may cause OOB access.
Fix it by adding the size of the non-flexible array part of tld_data_u.
Besides, explicitly align tld_data_u->data to 8 bytes in case some
fields are added before data in the future. Without it, such fields
could break the assumption that every data field is 8-byte aligned, and
sizeof(tld_data_u) would no longer be equal to
offsetof(struct tld_data_u, data), which we use interchangeably.
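The effect of the explicit alignment can be checked with two simplified
stand-in structs. They are not the real tld_data_u: the extra one-byte
field is invented to play the role of a field added before data in the
future, and tld_alloc_size() is a hypothetical helper that shows the
header being counted in the allocation size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in without explicit alignment: data starts right after extra. */
struct data_unaligned_sketch {
	uint64_t start;
	uint8_t extra;		/* invented future field */
	char data[];
};

/* Stand-in with data explicitly aligned to 8 bytes. */
struct data_aligned_sketch {
	uint64_t start;
	uint8_t extra;		/* invented future field */
	char data[] __attribute__((aligned(8)));
};

/* With the alignment, sizeof and offsetof of data can be swapped. */
_Static_assert(sizeof(struct data_aligned_sketch) ==
	       offsetof(struct data_aligned_sketch, data),
	       "header size must equal the offset of data");

/* Bytes to allocate for data_size bytes of TLDs, header included. */
size_t tld_alloc_size(size_t data_size)
{
	assert(data_size <= SIZE_MAX - sizeof(struct data_aligned_sketch));
	return offsetof(struct data_aligned_sketch, data) + data_size;
}
```

In the unaligned variant data starts at offset 9, so neither the 8-byte
alignment of the data fields nor the equality of sizeof and offsetof
holds.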
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260331213555.1993883-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On arm64 systems with 64K pages, the selftest task_local_data has the following
failures:
  ...
  test_task_local_data_basic:PASS:tld_create_key 0 nsec
  test_task_local_data_basic:FAIL:tld_create_key unexpected tld_create_key: actual 0 != expected -28
  ...
  test_task_local_data_basic_thread:PASS:run task_main 0 nsec
  test_task_local_data_basic_thread:FAIL:task_main retval unexpected error: 2 (errno 0)
  test_task_local_data_basic_thread:FAIL:tld_get_data value0 unexpected tld_get_data value0: actual 0 != expected 6268
  ...
  #447/1 task_local_data/task_local_data_basic:FAIL
  ...
  #447/2 task_local_data/task_local_data_race:FAIL
  #447 task_local_data:FAIL
When TLD_DYN_DATA_SIZE is set to a 64K page size, the field 'cnt' in
  struct tld_meta_u {
          _Atomic __u8 cnt;
          __u16 size;
          struct tld_metadata metadata[];
  };
would overflow. For example, for a 4K page, 'cnt' will be
4096/64 = 64. But for a 64K page, 'cnt' will be 65536/64 = 1024,
which does not fit in a __u8 (maximum 255). To accommodate 64K pages,
'_Atomic __u8 cnt' becomes '_Atomic __u16 cnt'. A few other places
are adjusted accordingly.
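The arithmetic can be replayed in a few lines of C. The divisor 64 is
taken from the example above, and the names SKETCH_ENTRY_SIZE,
tld_max_cnt() and fits_in_u8() are hypothetical, used only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divisor used in the 4096/64 and 65536/64 examples. */
#define SKETCH_ENTRY_SIZE 64

/* Largest count reached when a whole page is consumed. */
size_t tld_max_cnt(size_t page_size)
{
	assert(page_size % SKETCH_ENTRY_SIZE == 0);
	return page_size / SKETCH_ENTRY_SIZE;
}

/* Return 1 if cnt can be stored in an 8-bit counter. */
int fits_in_u8(size_t cnt)
{
	return cnt <= UINT8_MAX;
}
```

With a 64K page the count is 1024, and converting 1024 to an 8-bit
unsigned value yields 0, so the counter silently wraps.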
In test_task_local_data.c, the value for TLD_DYN_DATA_SIZE is changed
from 4096 to (getpagesize() - 8) since the maximum buffer size for
TLD_DYN_DATA_SIZE is (getpagesize() - 8).
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Cc: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260123055122.494352-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Task local data defines an abstract storage type for storing task-
specific data (TLD). This patch provides user space and bpf
implementation as header-only libraries for accessing task local data.
Task local data is a bpf task local storage map with two UPTRs:
- tld_meta_u, shared by all tasks of a process, consists of the total
count and size of TLDs and an array of metadata of TLDs. A TLD
metadata contains the size and name. The name is used to identify a
specific TLD in bpf programs.
- u_tld_data points to a task-specific memory. It stores TLD data and
the starting offset of data in a page.
The task local data design decouples user space and bpf programs. Since
a bpf program does not know the size of TLDs at compile time, u_tld_data
is declared as a page to accommodate TLDs up to a page. As a result,
while user space will likely allocate memory smaller than a page for
actual TLDs, it needs to pin a page to the kernel. If the allocated
memory spans across a page boundary, it will pin the page that contains
enough memory.
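One way to picture the page selection is the sketch below. It assumes
that the raw allocation is at least twice as large as the data, so that
either the rest of the current page or the start of the next page always
has room. pick_in_page() is a hypothetical helper working on plain
addresses, not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the start of a raw allocation of at least 2 * size bytes,
 * return an address inside it such that [addr, addr + size) does not
 * cross a page boundary. page_size must be a power of two.
 */
uintptr_t pick_in_page(uintptr_t raw, size_t size, size_t page_size)
{
	uintptr_t next_page;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	assert(size != 0 && size <= page_size);

	next_page = (raw & ~(uintptr_t)(page_size - 1)) + page_size;
	if (next_page - raw >= size)
		return raw;	/* enough room left in the current page */
	return next_page;	/* otherwise use the start of the next page */
}
```

When the second branch is taken, fewer than size bytes were skipped, so
the chosen region still ends inside the 2 * size allocation.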
The library also creates another task local storage map, tld_key_map,
to cache keys for bpf programs to speed up the access.
Below are the core task local data APIs:
                   User space                           BPF
  Define TLD       TLD_DEFINE_KEY(), tld_create_key()   -
  Init TLD object  -                                    tld_object_init()
  Get TLD data     tld_get_data()                       tld_get_data()
- TLD_DEFINE_KEY(), tld_create_key()
A TLD is first defined by the user space with TLD_DEFINE_KEY() or
tld_create_key(). TLD_DEFINE_KEY() defines a TLD statically and
allocates just enough memory during initialization. tld_create_key()
allows creating TLDs on the fly, but has a fixed memory budget,
TLD_DYN_DATA_SIZE.
Internally, they all call __tld_create_key(), which iterates
tld_meta_u->metadata to check if a TLD can be added. The total TLD
size needs to fit into a page (limit of UPTR), and no two TLDs can
have the same name. If a TLD can be added, u_tld_meta->cnt is
increased using cmpxchg as there may be other concurrent
__tld_create_key(). After a successful cmpxchg, the last available
tld_meta_u->metadata now belongs to the calling thread. To prevent
other threads from reading incomplete metadata while it is being
updated, tld_meta_u->metadata->size is used to signal the completion.
Finally, the offset, derived from adding up prior TLD sizes, is
encapsulated as an opaque object key to prevent user misuse. The
offset is guaranteed to be 8-byte aligned to prevent load/store
tearing and allow atomic operations on it.
- tld_get_data()
User space programs can pass the key to tld_get_data() to get a
pointer to the associated TLD. The pointer will remain valid for the
lifetime of the thread.
tld_data_u is lazily allocated on the first call to tld_get_data().
Until then, trying to read task local data from bpf will result in
-ENODATA during tld_object_init(). The task-specific memory needs to
be freed manually by calling tld_free() on thread exit to prevent a
memory leak, or automatically by enabling TLD_FREE_DATA_ON_THREAD_EXIT.
- tld_object_init() (BPF)
BPF programs need to call tld_object_init() before calling
tld_get_data(). This is to avoid redundant map lookup in
tld_get_data() by storing pointers to the map values on stack.
The pointers are encapsulated as tld_object.
tld_key_map is also created the first time tld_object_init() is
called, to cache TLD keys successfully fetched by tld_get_data().
bpf_task_storage_get(.., F_CREATE) needs to be retried since it may
fail when another thread has already taken the percpu counter lock
for the task local storage.
- tld_get_data() (BPF)
BPF programs can also get a pointer to a TLD with tld_get_data().
It uses the cached key in tld_key_map to locate the data in
tld_data_u->data. If the cached key is not set yet (<= 0),
__tld_fetch_key() will be called to iterate tld_meta_u->metadata
and find the TLD by name. To prevent redundant string comparison
in the future when the search fails, tld_meta_u->cnt is stored
in the non-positive range of the key. Next time, __tld_fetch_key()
will be called only if there are new TLDs and the search will start
from the newly added tld_meta_u->metadata using the old
tld_meta_u->cnt.
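The key caching scheme can be modeled in plain user space C. This is a
sketch under stated assumptions and not the BPF implementation: the
struct layout, the 62-byte name length, the 8-byte header that keeps
every valid offset positive, and the names meta_sketch, name_compares
and fetch_key_sketch() are all invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NAME_LEN 62	/* assumed name length */
#define SKETCH_HDR_SIZE 8	/* assumed header, keeps offsets > 0 */

struct meta_sketch {
	char name[SKETCH_NAME_LEN];
	unsigned short size;
};

/* Counts string comparisons so that skipped entries are observable. */
int name_compares;

/*
 * *key > 0:  cached data offset, returned without searching.
 * *key <= 0: -(*key) entries were already searched without a match.
 * Returns the new value of *key.
 */
int fetch_key_sketch(const struct meta_sketch *meta, int cnt,
		     const char *name, int *key)
{
	int i, first, off = SKETCH_HDR_SIZE;

	assert(cnt >= 0);
	if (*key > 0)
		return *key;

	first = -*key;
	for (i = 0; i < cnt; i++) {
		/* Compare names only for entries not searched before. */
		if (i >= first) {
			name_compares++;
			if (!strncmp(meta[i].name, name, SKETCH_NAME_LEN)) {
				*key = off;
				return off;
			}
		}
		/* Every TLD occupies a multiple of 8 bytes. */
		off += (meta[i].size + 7) & ~7;
	}
	*key = -cnt;	/* remember how many entries were searched */
	return *key;
}
```

In this model, a failed search over two entries leaves the key at -2. A
later search after a third entry is added compares only that new entry,
and once the key is positive no comparison happens at all.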
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20250730185903.3574598-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>