Commit Graph

94276 Commits

Author SHA1 Message Date
Ying Huang
966a967116 smp: Avoid using two cache lines for struct call_single_data
struct call_single_data is used in IPIs to transfer information between
CPUs.  Its size is bigger than sizeof(unsigned long) and less than
cache line size.  Currently it is not allocated with any explicit alignment
requirements.  This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.

This can be fixed by requiring call_single_data to be aligned with the
size of call_single_data. Currently the size of call_single_data is the
power of 2.  If we add new fields to call_single_data, we may need to
add padding to make sure the size of new definition is the power of 2
as well.

Fortunately, this is enforced by GCC, which will report bad sizes.

To set alignment requirements of call_single_data to the size of
call_single_data, a struct definition and a typedef is used.

To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt).  The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that huge number of IPIs are triggered when unmapping
memory.  In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang, Ying <ying.huang@intel.com>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 15:14:38 +02:00
Peter Zijlstra
f52be57080 locking/lockdep: Untangle xhlock history save/restore from task independence
Where XHLOCK_{SOFT,HARD} are save/restore points in the xhlocks[] to
ensure the temporal IRQ events don't interact with task state, the
XHLOCK_PROC is a fundament different beast that just happens to share
the interface.

The purpose of XHLOCK_PROC is to annotate independent execution inside
one task. For example workqueues, each work should appear to run in its
own 'pristine' 'task'.

Remove XHLOCK_PROC in favour of its own interface to avoid confusion.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: kernel-team@lge.com
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29 15:14:38 +02:00
Jiri Slaby
30d6e0a419 futex: Remove duplicated code and fix undefined behaviour
There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

This effectively distributes the Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause undefined behaviour report.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
2017-08-25 22:49:59 +02:00
Peter Zijlstra
e6f3faa734 locking/lockdep: Fix workqueue crossrelease annotation
The new completion/crossrelease annotations interact unfavourable with
the extant flush_work()/flush_workqueue() annotations.

The problem is that when a single work class does:

  wait_for_completion(&C)

and

  complete(&C)

in different executions, we'll build dependencies like:

  lock_map_acquire(W)
  complete_acquire(C)

and

  lock_map_acquire(W)
  complete_release(C)

which results in the dependency chain: W->C->W, which lockdep thinks
spells deadlock, even though there is no deadlock potential since
works are ran concurrently.

One possibility would be to change the work 'lock' to recursive-read,
but that would mean hitting a lockdep limitation on recursive locks.
Also, unconditinoally switching to recursive-read here would fail to
detect the actual deadlock on single-threaded workqueues, which do
have a problem with this.

For now, forcefully disregard these locks for crossrelease.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:06:33 +02:00
Peter Zijlstra
0e709703af mm, locking/barriers: Clarify tlb_flush_pending() barriers
Better document the ordering around tlb_flush_pending().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:06:31 +02:00
Ingo Molnar
10c9850cb2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:51 +02:00
Linus Torvalds
90a6cd5039 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma fixes from Doug Ledford:
 "Well, I thought we were going to be done for this -rc cycle. I should
  have known better than to say so though.

  We have four additional items that trickled in.

  One was a simple mistake on my part. I took a patch into my for-next
  thinking that the issue was less severe than it was. I was then
  notified that it needed to be in my -rc area instead.

  The other three were just found late in testing.

  Summary:

   - One core fix accidentally applied first to for-next and then cherry
     picked back because it needed to be in the -rc cycles instead

   - Another core fix

   - Two mlx5 fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/mlx5: Always return success for RoCE modify port
  IB/mlx5: Fix Raw Packet QP event handler assignment
  IB/core: Avoid accessing non-allocated memory when inferring port type
  RDMA/uverbs: Initialize cq_context appropriately
2017-08-24 15:48:38 -07:00
Linus Torvalds
f7bbf0754b Merge tag 'kbuild-fixes-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:

 - fix linker script regression caused by dead code elimination support

 - fix typos and outdated comments

 - specify kselftest-clean as a PHONY target

 - fix "make dtbs_install" when $(srctree) includes shell special
   characters like '~'

 - Move -fshort-wchar to the global option list because defining it
   partially emits warnings

* tag 'kbuild-fixes-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: update comments of Makefile.asm-generic
  kbuild: Do not use hyphen in exported variable name
  Makefile: add kselftest-clean to PHONY target list
  Kbuild: use -fshort-wchar globally
  fixdep: trivial: typo fix and correction
  kbuild: trivial cleanups on the comments
  kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured
2017-08-24 14:22:27 -07:00
Eric W. Biederman
311fc65c9f pty: Repair TIOCGPTPEER
The implementation of TIOCGPTPEER has two issues.

When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened the wrong
vfsmount is passed to dentry_open.  Which results in the kernel displaying
the wrong pathname for the peer.

The second is simply by caching the vfsmount and dentry of the peer it leaves
them open, in a way they were not previously Which because of the inreased
reference counts can cause unnecessary behaviour differences resulting in
regressions.

To fix these move the ioctl into tty_io.c at a generic level allowing
the ioctl to have access to the struct file on which the ioctl is
being called.  This allows the path of the slave to be derived when
opening the slave through TIOCGPTPEER instead of requiring the path to
the slave be cached.  Thus removing the need for caching the path.

A new function devpts_ptmx_path is factored out of devpts_acquire and
used to implement a function devpts_mntget.   The new function devpts_mntget
takes a filp to perform the lookup on and fsi so that it can confirm
that the superblock that is found by devpts_ptmx_path is the proper superblock.

v2: Lots of fixes to make the code actually work
v3: Suggestions by Linus
    - Removed the unnecessary initialization of filp in ptm_open_peer
    - Simplified devpts_ptmx_path as gotos are no longer required

[ This is the fix for the issue that was reverted in commit
  143c97cc65, but this time without breaking 'pbuilder' due to
  increased reference counts   - Linus ]

Fixes: 54ebbfb160 ("tty: add TIOCGPTPEER ioctl")
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-24 13:23:03 -07:00
Noa Osherovich
498ca3c82a IB/core: Avoid accessing non-allocated memory when inferring port type
Commit 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
introduced the concept of type in ah_attr:
 * During ib_register_device, each port is checked for its type which
   is stored in ib_device's port_immutable array.
 * During uverbs' modify_qp, the type is inferred using the port number
   in ib_uverbs_qp_dest struct (address vector) by accessing the
   relevant port_immutable array and the type is passed on to
   providers.

IB spec (version 1.3) enforces a valid port value only in Reset to
Init. During Init to RTR, the address vector must be valid but port
number is not mentioned as a field in the address vector, so its
value is not validated, which leads to accesses to a non-allocated
memory when inferring the port type.

Save the real port number in ib_qp during modify to Init (when the
comp_mask indicates that the port number is valid) and use this value
to infer the port type.

Avoid copying the address vector fields if the matching bit is not set
in the attr_mask. Address vector can't be modified before the port, so
no valid flow is affected.

Fixes: 44c58487d5 ('IB/core: Define 'ib' and 'roce' rdma_ah_attr types')
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-24 15:33:33 -04:00
Linus Torvalds
143c97cc65 Revert "pty: fix the cached path of the pty slave file descriptor in the master"
This reverts commit c8c03f1858.

It turns out that while fixing the ptmx file descriptor to have the
correct 'struct path' to the associated slave pty is a really good
thing, it breaks some user space tools for a very annoying reason.

The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X)
are on different mounts.  That was what caused us to have the wrong path
in the first place (we would mix up the vfsmount of the 'ptmx' node,
with the dentry of the pty slave node), but it also means that now while
we use the right vfsmount, having the pty master open also keeps the pts
mount busy.

And it turn sout that that makes 'pbuilder' very unhappy, as noted by
Stefan Lippers-Hollmann:

 "This patch introduces a regression for me when using pbuilder
  0.228.7[2] (a helper to build Debian packages in a chroot and to
  create and update its chroots) when trying to umount /dev/ptmx (inside
  the chroot) on Debian/ unstable (full log and pbuilder configuration
  file[3] attached).

  [...]
  Setting up build-essential (12.3) ...
  Processing triggers for libc-bin (2.24-15) ...
  I: unmounting dev/ptmx filesystem
  W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy
          (In some cases useful info about processes that
           use the device is found by lsof(8) or fuser(1).)"

apparently pbuilder tries to unmount the /dev/pts filesystem while still
holding at least one master node open, which is arguably not very nice,
but we don't break user space even when fixing other bugs.

So this commit has to be reverted.

I'll try to figure out a way to avoid caching the path to the slave pty
in the master pty.  The only thing that actually wants that slave pty
path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the
path at that time.

Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: Eric W Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-23 18:16:11 -07:00
Linus Torvalds
55652400fd Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Six minor and error leg fixes, plus one major change: the reversion of
  scsi-mq as the default.

  We're doing the latter temporarily (with a backport to stable) to give
  us time to fix all the issues that turned up with this default before
  trying again"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: cxgb4i: call neigh_event_send() to update MAC address
  Revert "scsi: default to scsi-mq"
  scsi: sd_zbc: Write unlock zone from sd_uninit_cmnd()
  scsi: aacraid: Fix out of bounds in aac_get_name_resp
  scsi: csiostor: fail probe if fw does not support FCoE
  scsi: megaraid_sas: fix error handle in megasas_probe_one
2017-08-23 11:34:40 -07:00
Linus Torvalds
e3181f2c0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix IGMP handling wrt VRF, from David Ahern.

 2) Fix timer access to freed object in dccp, from Eric Dumazet.

 3) Use kmalloc_array() in ptr_ring to avoid overflow cases which are
    triggerable by userspace. Also from Eric Dumazet.

 4) Fix infinite loop in unmapping cleanup of nfp driver, from Colin Ian
    King.

 5) Correct datagram peek handling of empty SKBs, from Matthew Dawson.

 6) Fix use after free in TIPC, from Eric Dumazet.

 7) When replacing a route in ipv6 we need to reset the round robin
    pointer, from Wei Wang.

 8) Fix bug in pci_find_pcie_root_port() which was unearthed by the
    relaxed ordering changes, from Thierry Redding. I made sure to get
    an explicit ACK from Bjorn this time around :-)

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  ipv6: repair fib6 tree in failure case
  net_sched: fix order of queue length updates in qdisc_replace()
  tools lib bpf: improve warning
  switchdev: documentation: minor typo fixes
  bpf, doc: also add s390x as arch to sysctl description
  net: sched: fix NULL pointer dereference when action calls some targets
  rxrpc: Fix oops when discarding a preallocated service call
  irda: do not leak initialized list.dev to userspace
  net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
  PCI: Allow PCI express root ports to find themselves
  tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
  net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set
  ipv6: reset fn->rr_ptr when replacing route
  sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
  tipc: fix use-after-free
  tun: handle register_netdevice() failures properly
  datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  bpf, doc: improve sysctl knob description
  netxen: fix incorrect loop counter decrement
  nfp: fix infinite loop on umapping cleanup
  ...
2017-08-21 13:16:27 -07:00
Oleg Nesterov
dd1c1f2f20 pids: make task_tgid_nr_ns() safe
This was reported many times, and this was even mentioned in commit
52ee2dfdd4 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.

We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups.  Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.

Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-21 12:47:31 -07:00
Konstantin Khlebnikov
68a66d149a net_sched: fix order of queue length updates in qdisc_replace()
This important to call qdisc_tree_reduce_backlog() after changing queue
length. Parent qdisc should deactivate class in ->qlen_notify() called from
qdisc_tree_reduce_backlog() but this happens only if qdisc->q.qlen in zero.

Missed class deactivations leads to crashes/warnings at picking packets
from empty qdisc and corrupting state at reactivating this class in future.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 86a7996cc8 ("net_sched: introduce qdisc_replace() helper")
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-20 20:02:00 -07:00
Linus Torvalds
e46db8d2ef Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Two fixes for the perf subsystem:

   - Fix an inconsistency of RDPMC mm struct tagging across exec() which
     causes RDPMC to fault.

   - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which
     causes incorrect timestamps and total time calculations"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix time on IOC_ENABLE
  perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20 09:20:57 -07:00
Linus Torvalds
e18a5ebc2d Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner:
 "A fix for the hardlockup watchdog to prevent false positives with
  extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
  the hrtimer which is used to verify.

  Slightly larger than the minimal fix, which just would increase the
  hrtimer frequency, but comes with extra overhead of more watchdog
  timer interrupts and thread wakeups for all users.

  With this change we restrict the overhead to the extreme Turbo-Mode
  systems"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/watchdog: Prevent false positives with turbo modes
2017-08-20 08:54:30 -07:00
Michal Hocko
6b31d5955c mm, oom: fix potential data corruption when oom_reaper races with writer
Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies, and so the
oom_reaper can tear it down, is not entirely true.

__task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
So there is a race window when some threads won't have
fatal_signal_pending while the oom_reaper could start unmapping the
address space.  Moreover some paths might not check for fatal signals
before each PF/g-u-p/copy_from_user.

We already have a protection for oom_reaper vs.  PF races by checking
MMF_UNSTABLE.  This has been, however, checked only for kernel threads
(use_mm users) which can outlive the oom victim.  A simple fix would be
to extend the current check in handle_mm_fault for all tasks but that
wouldn't be sufficient because the current check assumes that a kernel
thread would bail out after EFAULT from get_user*/copy_from_user and
never re-read the same address which would succeed because the PF path
has established page tables already.  This seems to be the case for the
only existing use_mm user currently (virtio driver) but it is rather
fragile in general.

This is even more fragile in general for more complex paths such as
generic_perform_write which can re-read the same address more times
(e.g.  iov_iter_copy_from_user_atomic to fail and then
iov_iter_fault_in_readable on retry).

Therefore we have to implement MMF_UNSTABLE protection in a robust way
and never make a potentially corrupted content visible.  That requires
to hook deeper into the PF path and check for the flag _every time_
before a pte for anonymous memory is established (that means all
!VM_SHARED mappings).

The corruption can be triggered artificially
(http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
but there doesn't seem to be any real life bug report.  The race window
should be quite tight to trigger most of the time.

Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
Fixes: aac4536355 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:01 -07:00
Pavel Tatashin
3010f87650 mm: discard memblock data later
There is existing use after free bug when deferred struct pages are
enabled:

The memblock_add() allocates memory for the memory array if more than
128 entries are needed.  See comment in e820__memblock_setup():

  * The bootstrap memblock region count maximum is 128 entries
  * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
  * than that - so allow memblock resizing.

This memblock memory is freed here:
        free_low_memory_core_early()

We access the freed memblock.memory later in boot when deferred pages
are initialized in this path:

        deferred_init_memmap()
                for_each_mem_pfn_range()
                  __next_mem_pfn_range()
                    type = &memblock.memory;

One possible explanation for why this use-after-free hasn't been hit
before is that the limit of INIT_MEMBLOCK_REGIONS has never been
exceeded at least on systems where deferred struct pages were enabled.

Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
and verifying in qemu that this code is getting excuted and that the
freed pages are sane.

Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
Fixes: 7e18adb4f8 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:01 -07:00
Luis R. Rodriguez
8ada92799e wait: add wait_event_killable_timeout()
These are the few pending fixes I have queued up for v4.13-final.  One
is a a generic regression fix for recursive loops on kmod and the other
one is a trivial print out correction.

During the v4.13 development we assumed that recursive kmod loops were
no longer possible.  Clearly that is not true.  The regression fix makes
use of a new killable wait.  We use a killable wait to be paranoid in
how signals might be sent to modprobe and only accept a proper SIGKILL.
The signal will only be available to userspace to issue *iff* a thread
has already entered a wait state, and that happens only if we've already
throttled after 50 kmod threads have been hit.

Note that although it may seem excessive to trigger a failure afer 5
seconds if all kmod thread remain busy, prior to the series of changes
that went into v4.13 we would actually *always* fatally fail any request
which came in if the limit was already reached.  The new waiting
implemented in v4.13 actually gives us *more* breathing room -- the wait
for 5 seconds is a wait for *any* kmod thread to finish.  We give up and
fail *iff* no kmod thread has finished and they're *all* running
straight for 5 consecutive seconds.  If 50 kmod threads are running
consecutively for 5 seconds something else must be really bad.

Recursive loops with kmod are bad but they're also hard to implement
properly as a selftest without currently fooling current userspace tools
like kmod [1].  For instance kmod will complain when you run depmod if
it finds a recursive loop with symbol dependency between modules as such
this type of recursive loop cannot go upstream as the modules_install
target will fail after running depmod.

These tests already exist on userspace kmod upstream though (refer to
the testsuite/module-playground/mod-loop-*.c files).  The same is not
true if request_module() is used though, or worst if aliases are used.

Likewise the issue with 64-bit kernels booting 32-bit userspace without
a binfmt handler built-in is also currently not detected and proactively
avoided by userspace kmod tools, or kconfig for all architectures.
Although we could complain in the kernel when some of these individual
recursive issues creep up, proactively avoiding these situations in
userspace at build time is what we should keep striving for.

Lastly, since recursive loops could happen with kmod it may mean
recursive loops may also be possible with other kernel usermode helpers,
this should be investigated and long term if we can come up with a more
sensible generic solution even better!

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20170809-kmod-for-v4.13-final
[1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git

This patch (of 3):

This wait is similar to wait_event_interruptible_timeout() but only
accepts SIGKILL interrupt signal.  Other signals are ignored.

Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Matt Redfearn <matt.redfearn@imgetc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:01 -07:00
Johannes Weiner
739f79fc9d mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:
     <IRQ>
     end_page_writeback+0x47/0x70
     f2fs_write_end_io+0x76/0x180 [f2fs]
     bio_endio+0x9f/0x120
     blk_update_request+0xa8/0x2f0
     scsi_end_request+0x39/0x1d0
     scsi_io_completion+0x211/0x690
     scsi_finish_command+0xd9/0x120
     scsi_softirq_done+0x127/0x150
     __blk_mq_complete_request_remote+0x13/0x20
     flush_smp_call_function_queue+0x56/0x110
     generic_smp_call_function_single_interrupt+0x13/0x30
     smp_call_function_single_interrupt+0x27/0x40
     call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614		mod_node_page_state(page_pgdat(page), idx, val);
    615		if (mem_cgroup_disabled() || !page->mem_cgroup)
    616			return;
    617		mod_memcg_state(page->mem_cgroup, idx, val);
    618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619		this_cpu_add(pn->lruvec_stat->count[idx], val);
    620	}
    621
    622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623							gfp_t gfp_mask,

The issue is that writeback doesn't hold a page reference and the page
might get freed after PG_writeback is cleared (and the mapping is
unlocked) in test_clear_page_writeback().  The stat functions looking up
the page's node or zone are safe, as those attributes are static across
allocation and free cycles.  But page->mem_cgroup is not, and it will
get cleared if we race with truncation or migration.

It appears this race window has been around for a while, but less likely
to trigger when the memcg stats were updated first thing after
PG_writeback is cleared.  Recent changes reshuffled this code to update
the global node stats before the memcg ones, though, stretching the race
window out to an extent where people can reproduce the problem.

Update test_clear_page_writeback() to look up and pin page->mem_cgroup
before clearing PG_writeback, then not use that pointer afterward.  It
is a partial revert of 62cccb8c8e ("mm: simplify lock_page_memcg()")
but leaves the pageref-holding callsites that aren't affected alone.

Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
Fixes: 62cccb8c8e ("mm: simplify lock_page_memcg()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reported-by: Bradley Bolen <bradleybolen@gmail.com>
Tested-by: Brad Bolen <bradleybolen@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:01 -07:00
Matthew Dawson
a0917e0bc6 datagram: When peeking datagrams with offset < 0 don't skip empty skbs
Due to commit e6afc8ace6 ("udp: remove
headers from UDP packets before queueing"), when udp packets are being
peeked the requested extra offset is always 0 as there is no need to skip
the udp header.  However, when the offset is 0 and the next skb is
of length 0, it is only returned once.  The behaviour can be seen with
the following python script:

from socket import *;
f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
f.bind(('::', 0));
addr=('::1', f.getsockname()[1]);
g.sendto(b'', addr)
g.sendto(b'b', addr)
print(f.recvfrom(10, MSG_PEEK));
print(f.recvfrom(10, MSG_PEEK));

Where the expected output should be the empty string twice.

Instead, make sk_peek_offset return negative values, and pass those values
to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
__skb_try_recv_from_queue will then ensure the offset is reset back to 0
if a peek is requested without an offset, unless no packets are found.

Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
greater then 0, and off is greater then or equal to skb->len, then
(_off || skb->len) must always be true assuming skb->len >= 0 is always
true.

Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
as it double checked if MSG_PEEK was set in the flags.

V2:
 - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
redundant checks
 - Fix peeking in udp{,v6}_recvmsg to report the right value when the
offset is 0

V3:
 - Marked new branch in __skb_try_recv_from_queue as unlikely.

Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 15:12:54 -07:00
Thomas Gleixner
7edaeb6841 kernel/watchdog: Prevent false positives with turbo modes
The hardlockup detector on x86 uses a performance counter based on unhalted
CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
performance counter period, so the hrtimer should fire 2-3 times before the
performance counter NMI fires. The NMI code checks whether the hrtimer
fired since the last invocation. If not, it assumess a hard lockup.

The calculation of those periods is based on the nominal CPU
frequency. Turbo modes increase the CPU clock frequency and therefore
shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
nominal frequency) the perf/NMI period is shorter than the hrtimer period
which leads to false positives.

A simple fix would be to shorten the hrtimer period, but that comes with
the side effect of more frequent hrtimer and softlockup thread wakeups,
which is not desired.

Implement a low pass filter, which checks the perf/NMI period against
kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
elapsed then the event is ignored and postponed to the next perf/NMI.

That solves the problem and avoids the overhead of shorter hrtimer periods
and more frequent softlockup thread wakeups.

Fixes: 58687acba5 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: dzickus@redhat.com
Cc: prarit@redhat.com
Cc: ak@linux.intel.com
Cc: babu.moger@oracle.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: stable@vger.kernel.org
Cc: atomlin@redhat.com
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18 12:35:02 +02:00
Ingo Molnar
0c23647913 Merge branch 'x86/asm' into locking/core
We need the ASM_UNREACHABLE() macro for a dependent patch.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-18 10:29:54 +02:00
Linus Torvalds
c8c03f1858 pty: fix the cached path of the pty slave file descriptor in the master
Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to
get a slave pty file descriptor, the resulting file descriptor doesn't
look right in /proc/<pid>/fd/<fd>.  In particular, he wanted to use
readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty
(basically implementing "ptsname{_r}()").

The reason for that was that we had generated the wrong 'struct path'
when we create the pty in ptmx_open().

In particular, the dentry was correct, but the vfsmount pointed to the
mount of the ptmx node. That _can_ be correct - in case you use
"/dev/pts/ptmx" to open the master - but usually is not.  The normal
case is to use /dev/ptmx, which then looks up the pts/ directory, and
then the vfsmount of the ptmx node is obviously the /dev directory, not
the /dev/pts/ directory.

We actually did have the right vfsmount available, but in the wrong
place (it gets looked up in 'devpts_acquire()' when we get a reference
to the pts filesystem), and so ptmx_open() used the wrong mnt pointer.

The end result of this confusion was that the pty worked fine, but when
if you did TIOCGPTPEER to get the slave side of the pty, end end result
would also work, but have that dodgy 'struct path'.

And then when doing "d_path()" on to get the pathname, the vfsmount
would not match the root of the pts directory, and d_path() would return
an empty pathname thinking that the entry had escaped a bind mount into
another mount.

This fixes the problem by making devpts_acquire() return the vfsmount
for the pts filesystem, allowing ptmx_open() to trivially just use the
right mount for the pts dentry, and create the proper 'struct path'.

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17 09:10:48 -07:00
Boqun Feng
52fa5bc5cb locking/lockdep: Explicitly initialize wq_barrier::done::map
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:

> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task   B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task   C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done

Such a dead lock can not happen because Task C's barr->done and Worker D's
barr->done can not be the same instance.

The reason of this false positive is we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class, therefore, impossible circles are reported.

To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct, as a result we won't build
a dependency between a wq_barrier with a unrelated work, and we can
differ wq barriers based on the related works, so the false positive
above is avoided.

Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 12:12:33 +02:00
Byungchul Park
ea3f2c0fdf locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
'complete' is an adjective and LOCKDEP_COMPLETE sounds like 'lockdep is complete',
so pick a better name that uses a noun.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@lge.com
Link: http://lkml.kernel.org/r/1502960261-16206-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 11:38:55 +02:00
Kees Cook
7a46ec0e2f locking/refcounts, x86/asm: Implement fast refcount overflow protection
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.

This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:

  http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

  http://cyseclabs.com/page?n=02012016

This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.

On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.

As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.

Example assembly comparison:

refcount_inc() before:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)

refcount_inc() after:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
  ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
  ...
  .text.unlikely:
  ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
  ffffffff816c36d7:       0f ff                   (bad)

These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

  2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
		    cycles		protections
  atomic_t           82249267387	none
  refcount_t-fast    82211446892	overflow, untested dec-to-zero
  refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero

This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement for repurposing this
code to be a refcount-only protection.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:40:26 +02:00
Damien Le Moal
70e42fd02c scsi: sd_zbc: Write unlock zone from sd_uninit_cmnd()
Releasing a zone write lock only when the write commnand that acquired
the lock completes can cause deadlocks due to potential command
reordering if the lock owning request is requeued and not executed. This
problem exists only with the scsi-mq path as, unlike the legacy path,
requests are moved out of the dispatch queue before being prepared and
so before locking a zone for a write command.

Since sd_uninit_cmnd() is now always called when a request is requeued,
call sd_zbc_write_unlock_zone() from that function for write requests
that acquired a zone lock instead of from sd_done(). Acquisition of a zone
lock by a write command is indicated using the new command
flag SCMD_ZONE_WRITE_LOCK.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2017-08-16 20:01:31 -04:00
Eric Dumazet
c780a049f9 ipv4: better IP_MAX_MTU enforcement
While working on yet another syzkaller report, I found
that our IP_MAX_MTU enforcements were not properly done.

gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
final result can be bigger than IP_MAX_MTU :/

This is a problem because device mtu can be changed on other cpus or
threads.

While this patch does not fix the issue I am working on, it is
probably worth addressing it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:28:47 -07:00
Eric Dumazet
81fbfe8ada ptr_ring: use kmalloc_array()
As found by syzkaller, malicious users can set whatever tx_queue_len
on a tun device and eventually crash the kernel.

Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small
ring buffer is not fast anyway.

Fixes: 2e0ab8ca83 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:28:47 -07:00
Linus Torvalds
510c8a899c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel
    Grumbach.

 2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From
    Vivien Didelot.

 3) Fix two regressions with bonding on wireless, from Andreas Born.

 4) Fix build when busypoll is disabled, from Daniel Borkmann.

 5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet.

 6) Set SKB cached route properly in inet_rtm_getroute(), from Florian
    Westphal.

 7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding
    Tianhong.

 8) Fix module refcnt leak in ULP code, from Sabrina Dubroca.

 9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric
    Dumazet.

10) Need to purge socket write queue in dccp_destroy_sock(), also from
    Eric Dumazet.

11) Make bpf_trace_printk() work properly on 32-bit architectures, from
    Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
  bpf: fix bpf_trace_printk on 32 bit archs
  PCI: fix oops when try to find Root Port for a PCI device
  sfc: don't try and read ef10 data on non-ef10 NIC
  net_sched: remove warning from qdisc_hash_add
  net_sched/sfq: update hierarchical backlog when drop packet
  net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
  ipv4: fix NULL dereference in free_fib_info_rcu()
  net: Fix a typo in comment about sock flags.
  ipv6: fix NULL dereference in ip6_route_dev_notify()
  tcp: fix possible deadlock in TCP stack vs BPF filter
  dccp: purge write queue in dccp_destroy_sock()
  udp: fix linear skb reception with PEEK_OFF
  ipv6: release rt6->rt6i_idev properly during ifdown
  af_key: do not use GFP_KERNEL in atomic contexts
  tcp: ulp: avoid module refcnt leak in tcp_set_ulp
  net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  PCI: Disable Relaxed Ordering Attributes for AMD A1100
  PCI: Disable Relaxed Ordering for some Intel processors
  PCI: Disable PCIe Relaxed Ordering if unsupported
  ...
2017-08-15 18:52:28 -07:00
Tonghao Zhang
b3dc8f772f net: Fix a typo in comment about sock flags.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:07:17 -07:00
Eric Dumazet
12d94a8049 ipv6: fix NULL dereference in ip6_route_dev_notify()
Based on a syzkaller report [1], I found that a per cpu allocation
failure in snmp6_alloc_dev() would then lead to NULL dereference in
ip6_route_dev_notify().

It seems this is a very old bug, thus no Fixes tag in this submission.

Let's add in6_dev_put_clear() helper, as we will probably use
it elsewhere (once available/present in net-next)

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 17294 Comm: syz-executor6 Not tainted 4.13.0-rc2+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff88019f456680 task.stack: ffff8801c6e58000
RIP: 0010:__read_once_size include/linux/compiler.h:250 [inline]
RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
RIP: 0010:refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178
RSP: 0018:ffff8801c6e5f1b0 EFLAGS: 00010202
RAX: 0000000000000037 RBX: dffffc0000000000 RCX: ffffc90005d25000
RDX: ffff8801c6e5f218 RSI: ffffffff82342bbf RDI: 0000000000000001
RBP: ffff8801c6e5f240 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff10038dcbe37
R13: 0000000000000006 R14: 0000000000000001 R15: 00000000000001b8
FS:  00007f21e0429700(0000) GS:ffff8801dc100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001ddbc22000 CR3: 00000001d632b000 CR4: 00000000001426e0
DR0: 0000000020000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
 in6_dev_put include/net/addrconf.h:335 [inline]
 ip6_route_dev_notify+0x1c9/0x4a0 net/ipv6/route.c:3732
 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394 [inline]
 raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1678
 call_netdevice_notifiers net/core/dev.c:1694 [inline]
 rollback_registered_many+0x91c/0xe80 net/core/dev.c:7107
 rollback_registered+0x1be/0x3c0 net/core/dev.c:7149
 register_netdevice+0xbcd/0xee0 net/core/dev.c:7587
 register_netdev+0x1a/0x30 net/core/dev.c:7669
 loopback_net_init+0x76/0x160 drivers/net/loopback.c:214
 ops_init+0x10a/0x570 net/core/net_namespace.c:118
 setup_net+0x313/0x710 net/core/net_namespace.c:294
 copy_net_ns+0x27c/0x580 net/core/net_namespace.c:418
 create_new_namespaces+0x425/0x880 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
 SYSC_unshare kernel/fork.c:2347 [inline]
 SyS_unshare+0x653/0xfa0 kernel/fork.c:2297
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512c9
RSP: 002b:00007f21e0428c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 0000000000718150 RCX: 00000000004512c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000062020200
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b973d
R13: 00000000ffffffff R14: 000000002001d000 R15: 00000000000002dd
Code: 50 2b 34 82 c7 00 f1 f1 f1 f1 c7 40 04 04 f2 f2 f2 c7 40 08 f3 f3
f3 f3 e8 a1 43 39 ff 4c 89 f8 48 8b 95 70 ff ff ff 48 c1 e8 03 <0f> b6
0c 18 4c 89 f8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f 85
RIP: __read_once_size include/linux/compiler.h:250 [inline] RSP:
ffff8801c6e5f1b0
RIP: atomic_read arch/x86/include/asm/atomic.h:26 [inline] RSP:
ffff8801c6e5f1b0
RIP: refcount_sub_and_test+0x7d/0x1b0 lib/refcount.c:178 RSP:
ffff8801c6e5f1b0
---[ end trace e441d046c6410d31 ]---

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:06:34 -07:00
David S. Miller
0a6f04184d Merge tag 'wireless-drivers-for-davem-2017-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:

====================
wireless-drivers fixes for 4.13

This time quite a few fixes for iwlwifi and one major regression fix
for brcmfmac. For the iwlwifi aggregation bug a small change was
needed for mac80211, but as Johannes is still away the mac80211 patch
is taken via wireless-drivers tree.

brcmfmac

* fix firmware crash (a recent regression in bcm4343{0,1,8}

iwlwifi

* Some simple PCI HW ID fix-ups and additions for family 9000

* Remove a bogus warning message with new FWs (bug #196915)

* Don't allow illegal channel options to be used (bug #195299)

* A fix for checksum offload in family 9000

* A fix serious throughput degradation in 11ac with multiple streams

* An old bug in SMPS where the firmware was not aware of SMPS changes

* Fix a memory leak in the SAR code

* Fix a stuck queue case in AP mode;

* Convert a WARN to a simple debug in a legitimate race case (from
  which we can recover)

* Fix a severe throughput aggregation on 9000-family devices due to
  aggregation issues, needed a small change in mac80211
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 10:19:14 -07:00
Al Viro
42b7305905 udp: fix linear skb reception with PEEK_OFF
copy_linear_skb() is broken; both of its callers actually
expect 'len' to be the amount we are trying to copy,
not the offset of the end.
Fix it keeping the meanings of arguments in sync with what the
callers (both of them) expect.
Also restore a saner behavior on EFAULT (i.e. preserving
the iov_iter position in case of failure):

The commit fd851ba9ca ("udp: harden copy_linear_skb()")
avoids the more destructive effect of the buggy
copy_linear_skb(), e.g. no more invalid memory access, but
said function still behaves incorrectly: when peeking with
offset it can fail with EINVAL instead of copying the
appropriate amount of memory.

Reported-by: Sasha Levin <alexander.levin@verizon.com>
Fixes: b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
Fixes: fd851ba9ca ("udp: harden copy_linear_skb()")
Signed-off-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Sasha Levin <alexander.levin@verizon.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 22:26:51 -07:00
dingtianhong
a99b646afa PCI: Disable PCIe Relaxed Ordering if unsupported
When bit4 is set in the PCIe Device Control register, it indicates
whether the device is permitted to use relaxed ordering.
On some platforms using relaxed ordering can have performance issues or
due to erratum can cause data-corruption. In such cases devices must avoid
using relaxed ordering.

The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that
Relaxed Ordering (RO) attribute should not be used for Transaction Layer
Packets (TLP) targeted towards these affected root complexes.

This patch checks if there is any node in the hierarchy that indicates that
using relaxed ordering is not safe. In such cases the patch turns off the
relaxed ordering by clearing the capability for this device.

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 22:14:50 -07:00
Linus Torvalds
438630ef5b Merge tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
 "Here are two tty serial driver fixes for 4.13-rc5. One is a revert of
  a -rc1 patch that turned out to not be a good idea, and the other is a
  fix for the pl011 serial driver.

  Both have been in linux-next with no reported issues"

* tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  Revert "serial: Delete dead code for CIR serial ports"
  tty: pl011: fix initialization order of QDF2400 E44
2017-08-13 12:33:35 -07:00
Linus Torvalds
dd95f18607 Merge tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/iio fixes from Greg KH:
 "Here are some Staging and IIO driver fixes for 4.13-rc5.

  Nothing major, just a number of small fixes for reported issues. All
  of these have been in linux-next for a while now with no reported
  issues. Full details are in the shortlog"

* tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: comedi: comedi_fops: do not call blocking ops when !TASK_RUNNING
  iio: aspeed-adc: wait for initial sequence.
  iio: accel: bmc150: Always restore device to normal mode after suspend-resume
  staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read
  iio: adc: axp288: Fix the GPADC pin reading often wrongly returning 0
  iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits
  iio: accel: st_accel: add SPI-3wire support
  iio: adc: Revert "axp288: Drop bogus AXP288_ADC_TS_PIN_CTRL register modifications"
  iio: adc: sun4i-gpadc-iio: fix unbalanced irq enable/disable
  iio: pressure: st_pressure_core: disable multiread by default for LPS22HB
  iio: light: tsl2563: use correct event code
2017-08-13 12:30:17 -07:00
Linus Torvalds
a99bcdce83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
 "The highlights include:

   - Fix iscsi-target payload memory leak during
     ISCSI_FLAG_TEXT_CONTINUE (Varun Prakash)

   - Fix tcm_qla2xxx incorrect use of tcm_qla2xxx_free_cmd during ABORT
     (Pascal de Bruijn + Himanshu Madhani + nab)

   - Fix iscsi-target long-standing issue with parallel delete of a
     single network portal across multiple target instances (Gary Guo +
     nab)

   - Fix target dynamic se_node GPF during uncached shutdown regression
     (Justin Maggard + nab)"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  target: Fix node_acl demo-mode + uncached dynamic shutdown regression
  iscsi-target: Fix iscsi_np reset hung task during parallel delete
  qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
  cxgbit: fix sg_nents calculation
  iscsi-target: fix invalid flags in text response
  iscsi-target: fix memory leak in iscsit_setup_text_cmd()
  cxgbit: add missing __kfree_skb()
  tcmu: free old string on reconfig
  tcmu: Fix possible to/from address overflow when doing the memcpy
2017-08-12 12:08:59 -07:00
Eric Dumazet
fd851ba9ca udp: harden copy_linear_skb()
syzkaller got crashes with CONFIG_HARDENED_USERCOPY=y configs.

Issue here is that recvfrom() can be used with user buffer of Z bytes,
and SO_PEEK_OFF of X bytes, from a skb with Y bytes, and following
condition :

Z < X < Y

kernel BUG at mm/usercopy.c:72!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 2917 Comm: syzkaller842281 Not tainted 4.13.0-rc3+ #16
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
task: ffff8801d2fa40c0 task.stack: ffff8801d1fe8000
RIP: 0010:report_usercopy mm/usercopy.c:64 [inline]
RIP: 0010:__check_object_size+0x3ad/0x500 mm/usercopy.c:264
RSP: 0018:ffff8801d1fef8a8 EFLAGS: 00010286
RAX: 0000000000000078 RBX: ffffffff847102c0 RCX: 0000000000000000
RDX: 0000000000000078 RSI: 1ffff1003a3fded5 RDI: ffffed003a3fdf09
RBP: ffff8801d1fef998 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d1ea480e
R13: fffffffffffffffa R14: ffffffff84710280 R15: dffffc0000000000
FS:  0000000001360880(0000) GS:ffff8801dc000000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000202ecfe4 CR3: 00000001d1ff8000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 check_object_size include/linux/thread_info.h:108 [inline]
 check_copy_size include/linux/thread_info.h:139 [inline]
 copy_to_iter include/linux/uio.h:105 [inline]
 copy_linear_skb include/net/udp.h:371 [inline]
 udpv6_recvmsg+0x1040/0x1af0 net/ipv6/udp.c:395
 inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
 sock_recvmsg_nosec net/socket.c:792 [inline]
 sock_recvmsg+0xc9/0x110 net/socket.c:799
 SYSC_recvfrom+0x2d6/0x570 net/socket.c:1788
 SyS_recvfrom+0x40/0x50 net/socket.c:1760
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: b65ac44674 ("udp: try to avoid 2 cache miss on dequeue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 15:00:45 -07:00
Daniel Borkmann
e4dde41273 net: fix compilation when busy poll is not enabled
MIN_NAPI_ID is used in various places outside of
CONFIG_NET_RX_BUSY_POLL wrapping, so when it's not set
we run into build errors such as:

  net/core/dev.c: In function 'dev_get_by_napi_id':
  net/core/dev.c:886:16: error: ‘MIN_NAPI_ID’ undeclared (first use in this function)
    if (napi_id < MIN_NAPI_ID)
                  ^~~~~~~~~~~

Thus, have MIN_NAPI_ID always defined to fix these errors.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:59:24 -07:00
Andreas Born
ad729bc9ac bonding: require speed/duplex only for 802.3ad, alb and tlb
The patch c4adfc822b ("bonding: make speed, duplex setting consistent
with link state") puts the link state to down if
bond_update_speed_duplex() cannot retrieve speed and duplex settings.
Assumably the patch was written with 802.3ad mode in mind which relies
on link speed/duplex settings. For other modes like active-backup these
settings are not required. Thus, only for these other modes, this patch
reintroduces support for slaves that do not support reporting speed or
duplex such as wireless devices. This fixes the regression reported in
bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547).

Fixes: c4adfc822b ("bonding: make speed, duplex setting consistent
with link state")
Signed-off-by: Andreas Born <futur.andy@googlemail.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:21:42 -07:00
Linus Torvalds
e0d0e045b8 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes that should go into this series. This contains:

   - Fix from Bart for blk-mq requeue queue running, preventing a
     continued loop of run/restart.

   - Fix for a bio/blk-integrity issue, in two parts. One from
     Christoph, fixing where verification happens, and one from Milan,
     for a NULL profile.

   - NVMe pull request, most of the changes being for nvme-fc, but also
     a few trivial core/pci fixes"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvme: fix directive command numd calculation
  nvme: fix nvme reset command timeout handling
  nvme-pci: fix CMB sysfs file removal in reset path
  lpfc: support nvmet_fc defer_rcv callback
  nvmet_fc: add defer_req callback for deferment of cmd buffer return
  nvme: strip trailing 0-bytes in wwid_show
  block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time
  bio-integrity: only verify integrity on the lowest stacked driver
  bio-integrity: Fix regression if profile verify_fn is NULL
2017-08-11 12:26:49 -07:00
Jens Axboe
4a8b53be64 Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph:

"A few more small fixes - the fc/lpfc update is the biggest by far."
2017-08-11 08:07:19 -06:00
Ingo Molnar
040cca3ab2 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/mm_types.h
	mm/huge_memory.c

I removed the smp_mb__before_spinlock() like the following commit does:

  8b1b436dd1 ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

and fixed up the affected commits.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11 13:51:59 +02:00
Linus Torvalds
b2dbdf2ca1 Merge tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "Nothing too earth shattering here, it just seems like lots of little
  things all over the place.

  msm has probably the larger amount of changes, but they all seem fine,
  otherwise, some rockchip, i915, etnaviv and exynos fixes, along with
  one nouveau regression fix for some older GPUs"

* tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux: (35 commits)
  drm/nouveau/disp/nv04: avoid creation of output paths
  drm: make DRM_STM default n
  drm/exynos: forbid creating framebuffers from too small GEM buffers
  drm/etnaviv: Fix off-by-one error in reloc checking
  drm/i915: fix backlight invert for non-zero minimum brightness
  drm/i915/shrinker: Wrap need_resched() inside preempt-disable
  drm/i915/perf: fix flex eu registers programming
  drm/i915: Fix out-of-bounds array access in bdw_load_gamma_lut
  drm/i915/gvt: Change the max length of mmio_reg_rw from 4 to 8
  drm/i915/gvt: Initialize MMIO Block with HW state
  drm/rockchip: vop: report error when check resource error
  drm/rockchip: vop: round_up pitches to word align
  drm/rockchip: vop: fix NV12 video display error
  drm/rockchip: vop: fix iommu page fault when resume
  drm/i915/gvt: clean workload queue if error happened
  drm/i915/gvt: change resetting to resetting_eng
  drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations
  drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM
  drm/msm/adreno: Prevent unclocked access when retrieving timestamps
  drm/msm: Remove __user from __u64 data types
  ...
2017-08-10 22:33:47 -07:00
Linus Torvalds
27df704d43 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "21 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
  userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
  zram: rework copy of compressor name in comp_algorithm_store()
  rmap: do not call mmu_notifier_invalidate_page() under ptl
  mm: fix list corruptions on shmem shrinklist
  mm/balloon_compaction.c: don't zero ballooned pages
  MAINTAINERS: copy virtio on balloon_compaction.c
  mm: fix KSM data corruption
  mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
  mm: make tlb_flush_pending global
  mm: refactor TLB gathering API
  Revert "mm: numa: defer TLB flush for THP migration as long as possible"
  mm: migrate: fix barriers around tlb_flush_pending
  mm: migrate: prevent racy access to tlb_flush_pending
  fault-inject: fix wrong should_fail() decision in task context
  test_kmod: fix small memory leak on filesystem tests
  test_kmod: fix the lock in register_test_dev_kmod()
  test_kmod: fix bug which allows negative values on two config options
  test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
  userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
  mm: ratelimit PFNs busy info message
  ...
2017-08-10 16:20:52 -07:00
Minchan Kim
99baac21e4 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
problem and Mel fixed it[1] and found same problem on MADV_FREE[2].

Quote from Mel Gorman:
 "The race in question is CPU 0 running madv_free and updating some PTEs
  while CPU 1 is also running madv_free and looking at the same PTEs.
  CPU 1 may have writable TLB entries for a page but fail the pte_dirty
  check (because CPU 0 has updated it already) and potentially fail to
  flush.

  Hence, when madv_free on CPU 1 returns, there are still potentially
  writable TLB entries and the underlying PTE is still present so that a
  subsequent write does not necessarily propagate the dirty bit to the
  underlying PTE any more. Reclaim at some unknown time at the future
  may then see that the PTE is still clean and discard the page even
  though a write has happened in the meantime. I think this is possible
  but I could have missed some protection in madv_free that prevents it
  happening."

This patch aims for solving both problems all at once and is ready for
other problem with KSM, MADV_FREE and soft-dirty story[3].

TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
catch there are parallel threads going on.  In that case, forcefully,
flush TLB to prevent for user to access memory via stale TLB entry
although it fail to gather page table entry.

I confirmed this patch works with [4] test program Nadav gave so this
patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
v2" in current mmotm.

NOTE:

This patch modifies arch-specific TLB gathering interface(x86, ia64,
s390, sh, um).  It seems most of architecture are straightforward but
s390 need to be careful because tlb_flush_mmu works only if
mm->context.flush_mm is set to non-zero which happens only a pte entry
really is cleared by ptep_get_and_clear and friends.  However, this
problem never changes the pte entries but need to flush to prevent
memory access from stale tlb.

[1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
[2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
[3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
[4] https://patchwork.kernel.org/patch/9861621/

[minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
  Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: Nadav Amit <namit@vmware.com>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:07 -07:00
Minchan Kim
0a2dd266dd mm: make tlb_flush_pending global
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION] but upcoming patches to solve subtle TLB flush batching
problem will use it regardless of compaction/NUMA so this patch doesn't
remove the dependency.

[akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:07 -07:00