This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once, it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the request types and hacks from the block code and into the
old IDE driver. There is a small amunt of code duplication due to this,
but it's not too bad.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
These values are only used by the IDE driver, so move them into it
by allowing drivers to take cmd_type values after the first private
one. Note that we have to turn cmd_type into a plain unsigned integer
so that gcc doesn't complain about mismatching enum types.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't have any arch specific scatterlist now that parisc switched over
to the generic one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.
Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.
For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.
Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Addresses the following kernel logs seen during boot of sparc systems:
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull crypto fixes from Herbert Xu:
"This fixes a build problem with bcm63xx and yet another fix to the
memzero_explicit function to ensure that the memset is not elided"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
hwrng: bcm63xx - Fix driver compilation
lib: make memzero_explicit more robust against dead store elimination
Refine the check for the presence of the "compatible" property
if the PRP0001 device ID is present in the device's list of
ACPI/PNP IDs to also print the message if _DSD is missing
entirely or the format of it is incorrect.
One special case to take into accout is that the "compatible"
property need not be provided for devices having the PRP0001
device ID in their lists of ACPI/PNP IDs if they are ancestors
of PRP0001 devices with the "compatible" property present.
This is to cover heriarchies of device objects where the kernel
is only supposed to use a struct device representation for the
topmost one and the others represent, for example, functional
blocks of a composite device.
While at it, reduce the log level of the message to "info"
and reduce the log level of the "broken _DSD" message to
"debug" (noise reduction).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Some devices advertise support for the READ/WRITE LOG DMA EXT commands
but fail when we try to issue them. This can lead to queued TRIM being
unintentionally disabled since the relevant feature flag is located in a
general purpose log page.
Fall back to unqueued READ LOG EXT if the DMA variant fails while
reading a log page.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Support for the READ/WRITE LOG DMA EXT commands can be signaled either
in page 119 or page 120. We should check both pages.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add functionality to enable the port mapper on the passive side to provide to its
clients the actual (non-mapped) ip/tcp address information of the connecting peer
1) Adding remote_info_cb() to process the address info of the connecting peer
The address info is provided by the user space port mapper service when
the connection is initiated by the peer
2) Adding a hash list to store the remote address info
3) Adding functionality to add/remove the remote address info
After the info has been provided to the port mapper client,
it is removed from the hash list
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for enabling codec wakeup override signal to allow
re-enumeration of the controller on SKL after resume from low power state.
In SKL, HDMI/DP codec and PCH HD Audio Controller are in different power
wells, so it's necessary to reset display audio codecs when power well on,
otherwise display audio codecs will disappear when resume from low power
state.
Reset steps when power on:
enable codec wakeup -> azx_init_chip() -> disable codec wakeup
v3 by Jani: Simplify to only support toggling the appropriate chicken bit.
v4 by Han: add explanation and specify the hw swquence.
Signed-off-by: Lu, Han <han.lu@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
This is just a code cleanup, make the LED trigger names const
as they're not expected to be modified by drivers.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The EMC driver needs to know the number of external memory devices and
also needs to update the EMEM configuration based on the new rate of the
memory bus.
To know how to update the EMEM config, looks up the values of the burst
regs in the DT, for a given timing.
Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The magic auth tokens we have are a simple map from cyclic IDs to drm_file
objects. Remove all the old bulk of code and replace it with a simple,
direct IDR.
The previous behavior is kept. Especially calling authmagic multiple times
on the same magic results in EINVAL except on the first call. The only
difference in behavior is that we never allocate IDs multiple times as
long as a client has its FD open.
v2:
- Fix return code of GetMagic()
- Use non-cyclic IDR allocator
- fix off-by-one in "magic > INT_MAX" sanity check
v3:
- drop redundant "magic > INT_MAX" check
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
The power allocator governor is a thermal governor that controls system
and device power allocation to control temperature. Conceptually, the
implementation divides the sustainable power of a thermal zone among
all the heat sources in that zone.
This governor relies on "power actors", entities that represent heat
sources. They can report current and maximum power consumption and
can set a given maximum power consumption, usually via a cooling
device.
The governor uses a Proportional Integral Derivative (PID) controller
driven by the temperature of the thermal zone. The output of the
controller is a power budget that is then allocated to each power
actor that can have bearing on the temperature we are trying to
control. It decides how much power to give each cooling device based
on the performance they are requesting. The PID controller ensures
that the total power budget does not exceed the control temperature.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
Add a basic power model to the cpu cooling device to implement the
power cooling device API. The power model uses the current frequency,
current load and OPPs for the power calculations. The cpus must have
registered their OPPs using the OPP library.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Kapileshwar Singh <kapileshwar.singh@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
Add three optional callbacks to the cooling device interface to allow
them to express power. In addition to the callbacks, add helpers to
identify cooling devices that implement the power cooling device API.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
A governor may need to store its current state between calls to
throttle(). That state depends on the thermal zone, so store it as
private data in struct thermal_zone_device.
The governors may have two new ops: bind_to_tz() and unbind_from_tz().
When provided, these functions let governors do some initialization
and teardown when they are bound/unbound to a tz and possibly store that
information in the governor_data field of the struct
thermal_zone_device.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
The fair share governor has the concept of weights, which is the
influence of each cooling device in a thermal zone. The current
implementation forces the weights of all cooling devices in a thermal
zone to add up to a 100. This complicates setups, as you need to know
in advance how many cooling devices you are going to have. If you bind a
new cooling device, you have to modify all the other cooling devices
weights, which is error prone. Furthermore, you can't specify a
"default" weight for platforms since that default value depends on the
number of cooling devices in the platform.
This patch generalizes the concept of weight by allowing any number to
be a "weight". Weights are now relative to each other. Platforms that
don't specify weights get the same default value for all their cooling
devices, so all their cdevs are considered to be equally influential.
It's important to note that previous users of the weights don't need to
alter the code: percentages continue to work as they used to. This
patch just removes the constraint of all the weights in a thermal zone
having to add up to a 100. If they do, you get the same behavior as
before. If they don't, fair share now works for that platform.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Durgadoss R <durgadoss.r@intel.com>
Acked-by: Durgadoss R <durgadoss.r@intel.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
Currently you can specify the weight of the cooling device in the device
tree but that information is not populated to the
thermal_bind_params where the fair share governor expects it to
be. The of thermal zone device doesn't have a thermal_bind_params
structure and arguably it's better to pass the weight inside the
thermal_instance as it is specific to the bind of a cooling device to a
thermal zone parameter.
Core thermal code is fixed to populate the weight in the instance from
the thermal_bind_params, so platform code that was passing the weight
inside the thermal_bind_params continue to work seamlessly.
While we are at it, create a default value for the weight parameter for
those thermal zones that currently don't define it and remove the
hardcoded default in of-thermal.
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Peter Feuerer <peter@piie.net>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Kukjin Kim <kgene@kernel.org>
Cc: Durgadoss R <durgadoss.r@intel.com>
Signed-off-by: Kapileshwar Singh <kapileshwar.singh@arm.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
The intel_pstate driver is difficult to debug and investigate without tsc.
Also, it is likely use of tsc, and some version of C0 percentage,
will be re-introdcued in futute.
There have also been occasions where it is desirebale to know, and
confirm, the previous target pstate.
This patch brings back tsc, adds previous target pstate,
and adds both to the trace data collection.
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Johannes Berg says:
====================
We have only a few fixes right now:
* a fix for an issue with hash collision handling in the
rhashtable conversion
* a merge issue - rhashtable removed default shrinking
just before mac80211 was converted, so enable it now
* remove an invalid WARN that can trigger with legitimate
userspace behaviour
* add a struct member missing from kernel-doc that caused
a lot of warnings
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-05-04
Here's the first bluetooth-next pull request for 4.2:
- Various fixes for at86rf230 driver
- ieee802154: trace events support for rdev->ops
- HCI UART driver refactoring
- New Realtek IDs added to btusb driver
- Off-by-one fix for rtl8723b in btusb driver
- Refactoring of btbcm driver for both UART & USB use
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When a FUA request enters its DATA stage of flush pipeline, the
request is added to mq requeue list, the request will then be added to
ctx->rq_list. blk_mq_attempt_merge() might merge the request with a bio.
Later when the request is finished the flush pipeline, the
request->__data_len is 0. Then I only saw the bio gets endio called, the
original request never finish.
Adding REQ_FLUSH_SEQ into REQ_NOMERGE_FLAGS looks an easy fix.
stable: 3.15+
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Part of led-trigger API was in the private drivers/leds/leds.h header.
Move it to the include/linux/leds.h header to unify the API location
and announce it as public. It has been already exported from
led-triggers.c with EXPORT_SYMBOL_GPL macro. The no-op definitions are
changed from macros to inline to match the style of the surrounding code.
Signed-off-by: Jacek Anaszewski <j.anaszewski@samsung.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Bryan Wu <cooloney@gmail.com>
The pmc driver was previously exporting tegra_pmc_restart, which was
assigned to machine_desc.init_machine, taking precedence over the
restart handlers registered through register_restart_handler().
Signed-off-by: David Riley <davidriley@chromium.org>
[tomeu.vizoso@collabora.com: Rebased]
Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
[treding@nvidia.com: minor cleanups]
Signed-off-by: Thierry Reding <treding@nvidia.com>
Eliminate 90 of these warnings:
Warning(..//include/net/mac80211.h:1682): No description found for parameter 'drv_priv[0]'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide clients and swgroups files in debugfs. These files show for
which clients IOMMU translation is enabled and which ASID is associated
with each SWGROUP.
Cc: Hiroshi Doyu <hdoyu@nvidia.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Subsequent patches will add debugfs files that print the status of the
SWGROUPs. Add a new names field and complement the SoC tables with the
names of the individual SWGROUPs.
Signed-off-by: Thierry Reding <treding@nvidia.com>
In commit 0b053c9518 ("lib: memzero_explicit: use barrier instead
of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
case LTO would decide to inline memzero_explicit() and eventually
find out it could be elimiated as dead store.
While using barrier() works well for the case of gcc, recent efforts
from LLVMLinux people suggest to use llvm as an alternative to gcc,
and there, Stephan found in a simple stand-alone user space example
that llvm could nevertheless optimize and thus elimitate the memset().
A similar issue has been observed in the referenced llvm bug report,
which is regarded as not-a-bug.
Based on some experiments, icc is a bit special on its own, while it
doesn't seem to eliminate the memset(), it could do so with an own
implementation, and then result in similar findings as with llvm.
The fix in this patch now works for all three compilers (also tested
with more aggressive optimization levels). Arguably, in the current
kernel tree it's more of a theoretical issue, but imho, it's better
to be pedantic about it.
It's clearly visible with gcc/llvm though, with the below code: if we
would have used barrier() only here, llvm would have omitted clearing,
not so with barrier_data() variant:
static inline void memzero_explicit(void *s, size_t count)
{
memset(s, 0, count);
barrier_data(s);
}
int main(void)
{
char buff[20];
memzero_explicit(buff, sizeof(buff));
return 0;
}
$ gcc -O2 test.c
$ gdb a.out
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000400400 <+0>: lea -0x28(%rsp),%rax
0x0000000000400405 <+5>: movq $0x0,-0x28(%rsp)
0x000000000040040e <+14>: movq $0x0,-0x20(%rsp)
0x0000000000400417 <+23>: movl $0x0,-0x18(%rsp)
0x000000000040041f <+31>: xor %eax,%eax
0x0000000000400421 <+33>: retq
End of assembler dump.
$ clang -O2 test.c
$ gdb a.out
(gdb) disassemble main
Dump of assembler code for function main:
0x00000000004004f0 <+0>: xorps %xmm0,%xmm0
0x00000000004004f3 <+3>: movaps %xmm0,-0x18(%rsp)
0x00000000004004f8 <+8>: movl $0x0,-0x8(%rsp)
0x0000000000400500 <+16>: lea -0x18(%rsp),%rax
0x0000000000400505 <+21>: xor %eax,%eax
0x0000000000400507 <+23>: retq
End of assembler dump.
As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
this in compiler-gcc.h only to be picked up. For a fallback or otherwise
unsupported compiler, we define it as a barrier. Similarly, for ecc which
does not support gcc inline asm.
Reference: https://llvm.org/bugs/show_bug.cgi?id=15495
Reported-by: Stephan Mueller <smueller@chronox.de>
Tested-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Stephan Mueller <smueller@chronox.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: mancha security <mancha1@zoho.com>
Cc: Mark Charlebois <charlebm@gmail.com>
Cc: Behan Webster <behanw@converseincode.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This was a bit too much cargo-culted, so lets make it solid:
- vblank->count doesn't need to be an atomic, writes are always done
under the protection of dev->vblank_time_lock. Switch to an unsigned
long instead and update comments. Note that atomic_read is just a
normal read of a volatile variable, so no need to audit all the
read-side access specifically.
- The barriers for the vblank counter seqlock weren't complete: The
read-side was missing the first barrier between the counter read and
the timestamp read, it only had a barrier between the ts and the
counter read. We need both.
- Barriers weren't properly documented. Since barriers only work if
you have them on boths sides of the transaction it's prudent to
reference where the other side is. To avoid duplicating the
write-side comment 3 times extract a little store_vblank() helper.
In that helper also assert that we do indeed hold
dev->vblank_time_lock, since in some cases the lock is acquired a
few functions up in the callchain.
Spotted while reviewing a patch from Chris Wilson to add a fastpath to
the vblank_wait ioctl.
v2: Add comment to better explain how store_vblank works, suggested by
Chris.
v3: Peter noticed that as-is the 2nd smp_wmb is redundant with the
implicit barrier in the spin_unlock. But that can only be proven by
auditing all callers and my point in extracting this little helper was
to localize all the locking into just one place. Hence I think that
additional optimization is too risky.
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Peter Hurley <peter@hurleysoftware.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Some users of flow keys (well just sch_choke now) need to pass
flow_keys in skbuff cb, and use them for exact comparisons of flows
so that skb->hash is not sufficient. In order to increase size of
the flow_keys structure, we introduce another structure for
the purpose of passing flow keys in skbuff cb. We limit this structure
to sixteen bytes, and we will technically treat this as a digest of
flow_keys struct hence its name flow_keys_digest. In the first
incaranation we just copy the flow_keys structure up to 16 bytes--
this is the same information previously passed in the cb. In the
future, we'll adapt this for larger flow_keys and could use something
like SHA-1 over the whole flow_keys to improve the quality of the
digest.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This calls flow_disect and __skb_get_hash to procure a hash for a
packet. Input includes a key to initialize jhash. This function
does not set skb->hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we process the address in
is_multicast_ether_addr at the same size as the other calls. This allows
us to avoid duplicate reads when used with other calls such as
is_zero_ether_addr or eth_addr_copy. In addition I have added a 64 bit
version of the function so in eth_type_trans we can process the destination
address as a 64 bit value throughout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
useless. In following example, not a single packet was ever dropped,
while average delay in codel queue is ~100 ms !
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
backlog 13626b 3p requeues 0
count 0 lastcount 0 ldelay 96.9ms drop_next 0us
maxpacket 9084 ecn_mark 0 drop_overlimit 0
This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.
codel intent was to use MTU of the device.
After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
backlog 6056b 3p requeues 0
count 1 lastcount 1 ldelay 36.3ms drop_next 0us
maxpacket 10598 ecn_mark 0 drop_overlimit 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disbaled by systcl.
Background:
IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to to define new
encaps in UDP just for getting ECMP.
Unfortunately, the initial specfications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard how to this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one. Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.
This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.
Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>