Now that we moved all the helpers in place and make use netdev_change_owner()
to fixup the permissions when moving network devices between network
namespaces.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to change the owner of the queue entries for a network device
when it is moved between network namespaces.
Currently, when moving network devices between network namespaces the
ownership of the corresponding queue sysfs entries are not changed. This leads
to problems when tools try to operate on the corresponding sysfs files. Fix
this.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to change the owner of a network device when it is moved
between network namespaces.
Currently, when moving network devices between network namespaces the
ownership of the corresponding sysfs entries is not changed. This leads
to problems when tools try to operate on the corresponding sysfs files.
This leads to a bug whereby a network device that is created in a
network namespaces owned by a user namespace will have its corresponding
sysfs entry owned by the root user of the corresponding user namespace.
If such a network device has to be moved back to the host network
namespace the permissions will still be set to the user namespaces. This
means unprivileged users can e.g. trigger uevents for such incorrectly
owned devices. They can also modify the settings of the device itself.
Both of these things are unwanted.
For example, workloads will create network devices in the host network
namespace. Other tools will then proceed to move such devices between
network namespaces owner by other user namespaces. While the ownership
of the device itself is updated in
net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs
entry for the device is not:
drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 .
drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics
lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent
However, if a device is created directly in the network namespace then
the device's sysfs permissions will be correctly updated:
drwxr-xr-x 5 root root 0 Jan 25 18:12 .
drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
-r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len
-r--r--r-- 1 root root 4096 Jan 25 18:12 address
-r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id
-r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port
-r--r--r-- 1 root root 4096 Jan 25 18:12 dormant
-r--r--r-- 1 root root 4096 Jan 25 18:12 duplex
-rw-r--r-- 1 root root 4096 Jan 25 18:12 flags
-rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias
-r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex
-r--r--r-- 1 root root 4096 Jan 25 18:12 iflink
-r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu
-r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 root root 4096 Jan 25 18:12 operstate
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 root root 0 Jan 25 18:12 power
-rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 root root 0 Jan 25 18:12 queues
-r--r--r-- 1 root root 4096 Jan 25 18:12 speed
drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics
lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 root root 4096 Jan 25 18:12 type
-rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent
Now, when creating a network device in a network namespace owned by a
user namespace and moving it to the host the permissions will be set to
the id that the user namespace root user has been mapped to on the host
leading to all sorts of permission issues:
458752
drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 .
drwxr-xr-x 9 root root 0 Jan 25 18:08 ..
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed
drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics
lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cookie argument to devlink_trap_report() allowing driver to pass in
the user cookie. Pass on the cookie down to drop monitor code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If driver passed along the cookie, push it through Netlink.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend struct flow_action_entry in order to hold TC action cookie
specified by user inserting the action.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
A new set of changes:
* lots of small documentation fixes, from Jérôme Pouiller
* beacon protection (BIGTK) support from Jouni Malinen
* some initial code for TID configuration, from Tamizh chelvam
* I reverted some new API before it's actually used, because
it's wrong to mix controlled port and preauth
* a few other cleanups/fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The Bareudp tunnel module provides a generic L3 encapsulation
tunnelling module for tunnelling different protocols like MPLS,
IP,NSH etc inside a UDP tunnel.
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning unix_wait_for_peer()
warning: context imbalance in unix_wait_for_peer() - unexpected unlock
The root cause is the missing annotation at unix_wait_for_peer()
Add the missing annotation __releases(&unix_sk(other)->lock)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at dccp_child_process()
warning: context imbalance in dccp_child_process() - unexpected unlock
The root cause is the missing annotation at dccp_child_process()
Add the missing __releases(child) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_neigh_stop()
warning: context imbalance in nr_neigh_stop() - unexpected unlock
The root cause is the missing annotation at nr_neigh_stop()
Add the missing __releases(&nr_neigh_list_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_neigh_start()
warning: context imbalance in nr_neigh_start() - wrong count at exit
The root cause is the missing annotation at nr_neigh_start()
Add the missing __acquires(&nr_neigh_list_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_node_stop()
warning: context imbalance in nr_node_stop() - wrong count at exit
The root cause is the missing annotation at nr_node_stop()
Add the missing __releases(&nr_node_list_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_node_start()
warning: context imbalance in nr_node_start() - wrong count at exit
The root cause is the missing annotation at nr_node_start()
Add the missing __acquires(&nr_node_list_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_info_stop()
warning: context imbalance in nr_info_stop() - unexpected unlock
The root cause is the missing annotation at nr_info_stop()
Add the missing __releases(&nr_list_lock)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at nr_info_start()
warning: context imbalance in nr_info_start() - wrong count at exit
The root cause is the missing annotation at nr_info_start()
Add the missing __acquires(&nr_list_lock)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at llc_seq_start()
warning: context imbalance in llc_seq_start() - wrong count at exit
The root cause is the msiing annotation at llc_seq_start()
Add the missing __acquires(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at sctp_transport_walk_stop()
warning: context imbalance in sctp_transport_walk_stop
- wrong count at exit
The root cause is the missing annotation at sctp_transport_walk_stop()
Add the missing __releases(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at sctp_transport_walk_start()
warning: context imbalance in sctp_transport_walk_start
- wrong count at exit
The root cause is the missing annotation at sctp_transport_walk_start()
Add the missing __acquires(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports a warning at sctp_err_finish()
warning: context imbalance in sctp_err_finish() - unexpected unlock
The root cause is a missing annotation at sctp_err_finish()
Add the missing __releases(&((__sk)->sk_lock.slock)) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6mr_for_each_table() macro uses list_for_each_entry_rcu()
for traversing outside an RCU read side critical section
but under the protection of rtnl_mutex. Hence add the
corresponding lockdep expression to silence the following
false-positive warnings:
[ 4.319479] =============================
[ 4.319480] WARNING: suspicious RCU usage
[ 4.319482] 5.5.4-stable #17 Tainted: G E
[ 4.319483] -----------------------------
[ 4.319485] net/ipv6/ip6mr.c:1243 RCU-list traversed in non-reader section!!
[ 4.456831] =============================
[ 4.456832] WARNING: suspicious RCU usage
[ 4.456834] 5.5.4-stable #17 Tainted: G E
[ 4.456835] -----------------------------
[ 4.456837] net/ipv6/ip6mr.c:1582 RCU-list traversed in non-reader section!!
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
md5sig->head maybe traversed using hlist_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of socket lock.
Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_ulp_list is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of tcp_ulp_list_lock.
Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.t
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add packet traps that can report packets that were dropped during ACL
processing.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
generic_xdp_tx and xdp_do_generic_redirect are only used by builtin
code, so remove the EXPORT_SYMBOL_GPL for them.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to configure per TID retry configuration
through the NL80211_TID_CONFIG_ATTR_RETRY_SHORT and
NL80211_TID_CONFIG_ATTR_RETRY_LONG attributes. This TID specific
retry configuration will have more precedence than phy level
configuration.
Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-3-git-send-email-tamizhr@codeaurora.org
[rebase completely on top of my previous API changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make some changes to the TID-config API:
* use u16 in nl80211 (only, and restrict to using 8 bits for now),
to avoid issues in the future if we ever want to use higher TIDs.
* reject empty TIDs mask (via netlink policy)
* change feature advertising to not use extended feature flags but
have own mechanism for this, which simplifies the code
* fix all variable names from 'tid' to 'tids' since it's a mask
* change to cfg80211_ name prefixes, not ieee80211_
* fix some minor docs/spelling things.
Change-Id: Ia234d464b3f914cdeab82f540e018855be580dce
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
IEEE P802.11-REVmd/D3.0 adds support for protecting Beacon frames using
a new set of keys (BIGTK; key index 6..7) similarly to the way
group-addressed Robust Management frames are protected (IGTK; key index
4..5). Extend cfg80211 and nl80211 to allow the new BIGTK to be
configured. Add an extended feature flag to indicate driver support for
the new key index values to avoid array overflows in driver
implementations and also to indicate to user space when this
functionality is available.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-2-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
After commit 2690048c01 ("net: igmp: Allow user-space
configuration of igmp unsolicited report interval"), they
are not used now
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-02-21
The following pull-request contains BPF updates for your *net-next* tree.
We've added 25 non-merge commits during the last 4 day(s) which contain
a total of 33 files changed, 2433 insertions(+), 161 deletions(-).
The main changes are:
1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
with reuseport BPF programs, from Jakub Sitnicki.
2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
to specify the tracepoint/function dynamically, from Eelco Chaudron.
3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
guided optimizations, from Daniel Xu.
4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.
5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
BTF defined maps, from Andrii Nakryiko.
6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
been observed that former fails in envs with low memlock, from Yonghong Song.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 736b46027e ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") has introduced lazy generation of reuseport group IDs that
survive group resize.
By comparing the identifier we check if BPF reuseport program is not trying
to select a socket from a BPF map that belongs to a different reuseport
group than the one the packet is for.
Because SOCKARRAY used to be the only BPF map type that can be used with
reuseport BPF, it was possible to delay the generation of reuseport group
ID until a socket from the group was inserted into BPF map for the first
time.
Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start when reuseport group gets created.
This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines the reuseport_id access as its lifetime
now matches the longevity of reuseport object.
The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch an u64 counter like sk_cookie.
Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.
It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:
prot->unhash
`- tcp_bpf_unhash
|- tcp_bpf_remove
| `- while (sk_psock_link_pop(psock))
| `- sk_psock_unlink
| `- sock_map_delete_from_link
| `- __sock_map_delete
| `- sock_map_unref
| `- sk_psock_put
| `- sk_psock_drop
| `- rcu_assign_sk_user_data(sk, NULL)
`- inet_unhash
`- reuseport_detach_sock
`- bpf_sk_reuseport_detach
`- WRITE_ONCE(sk->sk_user_data, NULL)
Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
SOCKMAP & SOCKHASH now support storing references to listening
sockets. Nothing keeps us from using these map types a collection of
sockets to select from in BPF reuseport programs. Whitelist the map types
with the bpf_sk_select_reuseport helper.
The restriction that the socket has to be a member of a reuseport group
still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
set are not a valid target and we signal it with -EINVAL.
The main benefit from this change is that, in contrast to
REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a
listening socket can be just one BPF map at the same time.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com
Tooling that populates the SOCK{MAP,HASH} with sockets from user-space
needs a way to inspect its contents. Returning the struct sock * that the
map holds to user-space is neither safe nor useful. An approach established
by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.
Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
value size for lookup to be possible. This requires special handling on
update, though. Attempts to do a lookup on a map holding u32 values will be
met with ENOSPC error.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com
Now that sockmap/sockhash can hold listening sockets, when setting up the
psock we will (i) grab references to verdict/parser progs, and (2) override
socket upcalls sk_data_ready and sk_write_space.
However, since we cannot redirect to listening sockets so we don't need to
link the socket to the BPF progs. And more importantly we don't want the
listening socket to have overridden upcalls because they would get
inherited by child sockets cloned from it.
Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-6-jakub@cloudflare.com