The bridge-stp usermode helper is currently restricted to the initial
network namespace, preventing userspace STP daemons (e.g. mstpd) from
operating on bridges in other network namespaces. Since commit
ff62198553 ("bridge: Only call /sbin/bridge-stp for the initial
network namespace"), bridges in non-init namespaces silently fall
back to kernel STP with no way to use userspace STP.
Add a new bridge attribute IFLA_BR_STP_MODE that allows explicit
per-bridge control over STP mode selection:
- BR_STP_MODE_AUTO (default) - Existing behavior: invoke the
  /sbin/bridge-stp helper in init_net only; fall back to kernel STP
  if it fails or in non-init namespaces.
- BR_STP_MODE_USER - Directly enable userspace STP (BR_USER_STP)
  without invoking the helper. Works in any network namespace.
  Userspace is responsible for ensuring an STP daemon manages the
  bridge.
- BR_STP_MODE_KERNEL - Directly enable kernel STP (BR_KERNEL_STP)
  without invoking the helper.
The mode can only be changed while STP is disabled, or set to the
same value (-EBUSY otherwise). IFLA_BR_STP_MODE is processed before
IFLA_BR_STP_STATE in br_changelink(), so both can be set atomically
in a single netlink message. The mode can also be changed in the
same message that disables STP.
The stp_mode struct field is u8 since all possible values fit, while
NLA_U32 is used for the netlink attribute since it occupies the same
space in the netlink message as NLA_U8.
A new stp_helper_active boolean tracks whether the /sbin/bridge-stp
helper was invoked during br_stp_start(), so that br_stp_stop() only
calls the helper for stop when it was called for start. This avoids
calling the helper asymmetrically when stp_mode changes between
start and stop.
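The rules above can be modelled in a few lines of Python. This is only a sketch of the behavior described in this message, not the kernel code: the Bridge class, changelink(), stp_start(), stp_stop() and the run_helper callback are hypothetical stand-ins, the numeric constant values are illustrative, and the flag is set whenever the helper is invoked (a literal reading of the text above).

```python
import errno
from dataclasses import dataclass

# Names follow the commit text; the numeric values are illustrative only.
BR_STP_MODE_AUTO, BR_STP_MODE_USER, BR_STP_MODE_KERNEL = range(3)
BR_NO_STP, BR_KERNEL_STP, BR_USER_STP = range(3)


@dataclass
class Bridge:
    """Hypothetical stand-in for the bridge state discussed above."""
    in_init_net: bool = True
    stp_mode: int = BR_STP_MODE_AUTO
    stp_enabled: int = BR_NO_STP
    stp_helper_active: bool = False


def stp_start(br, run_helper):
    """run_helper(action) stands in for /sbin/bridge-stp; True means success."""
    helper_ok = False
    if br.stp_mode == BR_STP_MODE_AUTO and br.in_init_net:
        br.stp_helper_active = True  # remember that "start" was invoked
        helper_ok = run_helper("start")
    if br.stp_mode == BR_STP_MODE_USER or helper_ok:
        br.stp_enabled = BR_USER_STP
    else:
        br.stp_enabled = BR_KERNEL_STP


def stp_stop(br, run_helper):
    """Call the helper for stop only if it was called for start."""
    if br.stp_helper_active:
        run_helper("stop")
        br.stp_helper_active = False
    br.stp_enabled = BR_NO_STP


def changelink(br, run_helper, mode=None, state=None):
    """Process the mode before the state, as described for br_changelink()."""
    if mode is not None and mode != br.stp_mode:
        disabling = state == 0  # the same message also disables STP
        if br.stp_enabled != BR_NO_STP and not disabling:
            return -errno.EBUSY
        br.stp_mode = mode
    if state is not None:
        if state and br.stp_enabled == BR_NO_STP:
            stp_start(br, run_helper)
        elif not state and br.stp_enabled != BR_NO_STP:
            stp_stop(br, run_helper)
    return 0
```

In this model, a bridge started in AUTO mode and then switched to KERNEL mode in the message that disables STP still gets its helper "stop" call, which is the asymmetry the boolean is meant to avoid.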
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Link: https://patch.msgid.link/20260405205224.3163000-2-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
leasing the plan is to support two main scenarios related to single device
mode:
* For the use-case of io_uring zero-copy, the control plane can either
set up a netkit pair where the peer device can perform rxq leasing which
is then tied to the lifetime of the peer device, or the control plane
can use a regular netkit pair to connect the hostns to a Pod/container
and dynamically add/remove rxq leasing through a single device without
having to interrupt the device pair. In the case of io_uring, the memory
pool is used as skb non-linear pages, and thus the skb will go its way
through the regular stack into netkit. Things like the netkit policy when
no BPF is attached or skb scrubbing etc apply as-is in case the paired
devices are used, or if the backend memory is tied to the single device
and traffic goes through a paired device.
* For the use-case of AF_XDP, the control plane needs to use netkit in the
single device mode. The single device mode currently enforces only a
pass policy when no BPF is attached, and does not yet support BPF link
attachments for AF_XDP. skbs sent to that device get dropped at the
moment. Given AF_XDP operates at a lower layer of the stack, tying this
to the netkit pair did not make sense. In future, the plan is to allow
BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
to push selected egress traffic up to the single netkit device to the
local stack (e.g. DHCP requests), and ii) conversely, push skbs sent to the
single netkit device into the AF_XDP application (e.g. DHCP replies). Also,
the control-plane can dynamically manage rxq leasing for the single
netkit device without having to interrupt (e.g. down/up cycle) the main
netkit pair for the Pod which has traffic going in and out.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
Link: https://patch.msgid.link/20260402231031.447597-11-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 1064d521d1 ("netlink: specs: support ipv4-or-v6 for dual-stack
fields") converted several IPv4 address attributes from binary to u32,
but did not specify byte-order: big-endian. Without it, YNL tools
interpret the values in host byte order and display IPv4 addresses
incorrectly.
Add the missing byte-order: big-endian to all affected u32 IPv4
address fields to ensure correct parsing and display.
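To illustrate the effect, here is a small Python sketch (not YNL code) that decodes the same 4-byte payload as a big-endian and as a little-endian u32. The address 192.0.2.1 is an arbitrary documentation address chosen for the example.

```python
import ipaddress
import struct

# Four bytes as they appear in the netlink attribute payload (network order).
payload = bytes([192, 0, 2, 1])


def as_ipv4(value: int) -> str:
    """Render a u32 as dotted-quad text."""
    return str(ipaddress.IPv4Address(value))


# Decoding with an explicit big-endian format yields the real address.
big_endian = as_ipv4(struct.unpack(">I", payload)[0])

# Decoding as little-endian, which is what host-endian interpretation
# amounts to on a little-endian machine, reverses the octets.
little_endian = as_ipv4(struct.unpack("<I", payload)[0])

print(big_endian)     # 192.0.2.1
print(little_endian)  # 1.2.0.192
```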
Fixes: 1064d521d1 ("netlink: specs: support ipv4-or-v6 for dual-stack fields")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://patch.msgid.link/20251125112048.37631-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 1b255e1bea ("tools: ynl: add ipv4-or-v6 display hint"), we
can display either IPv4 or IPv6 addresses for a single field based on the
address family. However, most dual-stack fields still use the ipv4 display
hint. This update changes them to use the new ipv4-or-v6 display hint and
converts IPv4-only fields to use the u32 type.
Field changes:
  - v4-or-v6
    - IFA_ADDRESS, IFA_LOCAL
    - IFLA_GRE_LOCAL, IFLA_GRE_REMOTE
    - IFLA_VTI_LOCAL, IFLA_VTI_REMOTE
    - IFLA_IPTUN_LOCAL, IFLA_IPTUN_REMOTE
    - NDA_DST
    - RTA_DST, RTA_SRC, RTA_GATEWAY, RTA_PREFSRC
    - FRA_SRC, FRA_DST
  - ipv4
    - IFA_BROADCAST
    - IFLA_GENEVE_REMOTE
    - IFLA_IPTUN_6RD_RELAY_PREFIX
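As a rough illustration of what a dual-stack display hint has to do, the Python sketch below formats a binary address as either IPv4 or IPv6. It is not the YNL implementation: the function name is hypothetical, and it picks the format from the payload length as a stand-in for the address family, which is an assumption of this example.

```python
import ipaddress


def format_ipv4_or_v6(raw: bytes) -> str:
    """Format a binary address as IPv4 (4 bytes) or IPv6 (16 bytes).

    Hypothetical helper: the payload length stands in for the
    address family here.
    """
    if len(raw) == 4:
        return str(ipaddress.IPv4Address(raw))
    if len(raw) == 16:
        return str(ipaddress.IPv6Address(raw))
    raise ValueError(f"unexpected address length: {len(raw)}")


print(format_ipv4_or_v6(bytes([10, 0, 0, 1])))   # 10.0.0.1
print(format_ipv4_or_v6(bytes(15) + b"\x01"))    # ::1
```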
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20251117024457.3034-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Various network interface types make use of needed_{head,tail}room values
to efficiently reserve buffer space for additional encapsulation headers,
such as VXLAN, Geneve, IPSec, etc. However, it is not currently possible
to query these values in a generic way.
Introduce the ability to query the needed_{head,tail}room values of a
network device via rtnetlink, so that applications that need these values
can read them.
For example, the Cilium agent iterates over the present devices based on
user config (direct routing, vxlan, geneve, wireguard etc.) and will in
future configure netkit to expose the needed_{head,tail}room into K8s pods. See
b9ed315d3c ("netkit: Allow for configuring needed_{head,tail}room").
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alasdair McWilliam <alasdair@mcwilliam.dev>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250917095543.14039-1-alasdair@mcwilliam.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In YAML, scalar strings may be declared on the following lines without
explicitly declaring them as a block. Yet, they look odd and can cause
issues when ':' or '#' are present.
The modified lines didn't have issues with the special characters, but
it seems better to explicitly declare such blocks as scalar strings to
encourage people to "properly" declare future scalar strings.
The right angle bracket is used with a minus sign to indicate that the
folded style should be used without adding extra newlines. By doing
that, the output is not changed compared to what was done before this
patch.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250913-net-next-ynl-attr-doc-rst-v3-3-4f06420d87db@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
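A minimal Python sketch of the idea follows. The regular expression and both helper names are hypothetical, chosen only to show the rule that dashes pass and underscores do not; the pattern actually added to the spec schema may differ.

```python
import re

# Hypothetical strict pattern: lowercase letters, digits and dashes only.
NAME_RE = re.compile(r"[0-9a-z-]+")


def is_valid_name(name: str) -> bool:
    """Return True if the name uses dashes rather than underscores."""
    return NAME_RE.fullmatch(name) is not None


def c_name(name: str) -> str:
    """Mimic a C code generator replacing special chars with '_'."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


print(is_valid_name("gso-max-size"))  # True
print(is_valid_name("gso_max_size"))  # False
print(c_name("gso-max-size"))         # gso_max_size
```

Both spellings map to the same C identifier, which is why the rename makes no difference to generated C code.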
Fixes: b2f63d904e ("doc/netlink: Add spec for rt link messages")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250624211002.3475021-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Driver tests now require GRE tunnels. While we don't configure
them with YNL, YNL will complain when it sees link types it
doesn't recognize. Teach it to decode ip6gre tunnels. The attrs
are largely the same as IPv4 GRE.
Correct the type of encap-limit, but note that this attr is
only used in ip6gre, so the mistake didn't matter until now.
Fixes: 0d0f4174f6 ("selftests: drv-net: add a simple TSO test")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250603135357.502626-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A number of fields in the ip tunnels are lacking the big-endian
designation. I suspect this is not intentional, as decoding
the ports with the right endianness seems objectively beneficial.
Fixes: 6ffdbb93a5 ("netlink: specs: rt_link: decode ip6tnl, vti and vti6 link attrs")
Fixes: 077b6022d2 ("doc/netlink/specs: Add sub-message type to rt_link family")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250603135357.502626-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
devconf is even odder than SNMP. On input it reports an array of u32s
which seem to be indexed by the enum values - 1. On output the kernel
expects a nest where each attr has the enum type as the nla type.
sub-type: u32 is probably the best we can do right now.
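The two shapes can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the name table covers just the first three enum values for brevity, and the payloads are packed in native byte order on the assumption that the values are host-endian.

```python
import struct

# Illustrative subset: enum value -> name. Values start at 1.
DEVCONF_NAMES = {1: "forwarding", 2: "mc-forwarding", 3: "proxy-arp"}


def decode_devconf(payload: bytes) -> dict:
    """Decode the array form: slot i holds the entry with enum value i + 1."""
    count = len(payload) // 4
    values = struct.unpack(f"={count}I", payload[:count * 4])
    return {
        DEVCONF_NAMES.get(i + 1, f"unknown-{i + 1}"): value
        for i, value in enumerate(values)
    }


def encode_devconf(settings: dict) -> bytes:
    """Encode the nest form: one u32 attr per entry, nla type = enum value."""
    out = b""
    for enum_value, value in settings.items():
        # nlattr header is (u16 len, u16 type); len counts header + payload.
        out += struct.pack("=HHI", 8, enum_value, value)
    return out


array_form = struct.pack("=3I", 1, 0, 1)
print(decode_devconf(array_form))
print(encode_devconf({1: 1}).hex())
```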
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250506194101.696272-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
uAPI doesn't define structs for the SNMP counters, just enums to index
them as arrays. Switch to the same representation in the spec. C codegen
will soon need all the struct types to actually exist.
Note that the existing definition was broken, anyway, as the first
member should be the number of counters reported.
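For illustration, an array with the count in its first slot could be decoded along these lines in Python. This is a sketch under stated assumptions rather than the spec's definition: the counters are taken to be 64-bit, the count is taken to include the slot that holds it, and the names in the table are illustrative.

```python
import struct

# Illustrative names, indexed by enum value. Slot 0 holds the count.
STAT_NAMES = ["num", "in-pkts", "in-octets"]


def decode_counters(payload: bytes) -> dict:
    """Decode an array of u64 counters whose first member is the count."""
    slots = len(payload) // 8
    values = struct.unpack(f"={slots}Q", payload[:slots * 8])
    if not values:
        return {}
    # Never read past the data that is actually present.
    reported = min(values[0], slots)
    return {
        STAT_NAMES[i] if i < len(STAT_NAMES) else f"unknown-{i}": values[i]
        for i in range(1, reported)
    }


print(decode_counters(struct.pack("=3Q", 3, 10, 2000)))
```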
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250506194101.696272-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The purpose of the attribute list is to list the attributes
which will be included in a given message to shrink the objects
for families with huge attr spaces. Fixed headers are always
present in their entirety (between netlink header and the attrs)
so there's no point in listing their members. Current C codegen
doesn't expect them and tries to look them up in the attribute space.
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The rtnetlink family names are set to rt-$name within the YAML
but the files are called rt_$name. C codegen assumes that the
generated file name will match the family. The use of dashes
is in line with our general expectation that name properties
in the spec use dashes not underscores (even though, as Donald
points out, most genl families use underscores in the name).
We have three non-ideal options to choose from:
- accept the slight inconsistency with old families using _, or
- accept the slight annoyance with all languages having to do s/-/_/
when looking up family ID, or
- accept the inconsistency with all name properties in new YAML spec
being separated with - and just the family name always using _.
Pick option 1 and rename the rtnl spec files.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250410014658.782120-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>