Commit Graph

26349 Commits

Author SHA1 Message Date
Matt Fleming
4bc9f92e64 x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data
efi_mem_reserve() allows us to permanently mark EFI boot services
regions as reserved, which means we no longer need to copy the image
data out and into a separate buffer.

Leaving the data in the original boot services region has the added
benefit that BGRT images can now be passed across kexec reboot.

Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Môshe van der Sterre <me@moshe.nl>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:38 +01:00
Matt Fleming
31ce8cc681 efi/runtime-map: Use efi.memmap directly instead of a copy
Now that efi.memmap is available all of the time there's no need to
allocate and build a separate copy of the EFI memory map.

Furthermore, efi.memmap contains boot services regions but only those
regions that have been reserved via efi_mem_reserve(). Using
efi.memmap allows us to pass boot services across kexec reboot so that
the ESRT and BGRT drivers will now work.

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:36 +01:00
Matt Fleming
816e76129e efi: Allow drivers to reserve boot services forever
Today, it is not possible for drivers to reserve EFI boot services for
access after efi_free_boot_services() has been called on x86. For
ARM/arm64 it can be done simply by calling memblock_reserve().

Having this ability for all three architectures is desirable for a
couple of reasons,

  1) It saves drivers copying data out of those regions
  2) kexec reboot can now make use of things like ESRT

Instead of using the standard memblock_reserve() which is insufficient
to reserve the region on x86 (see efi_reserve_boot_services()), a new
API is introduced in this patch; efi_mem_reserve().

efi.memmap now always represents which EFI memory regions are
available. On x86 the EFI boot services regions that have not been
reserved via efi_mem_reserve() will be removed from efi.memmap during
efi_free_boot_services().

This has implications for kexec, since it is not possible for a newly
kexec'd kernel to access the same boot services regions that the
initial boot kernel had access to unless they are reserved by every
kexec kernel in the chain.

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:08:34 +01:00
Matt Fleming
dca0f971ea efi: Add efi_memmap_init_late() for permanent EFI memmap
Drivers need a way to access the EFI memory map at runtime. ARM and
arm64 currently provide this by remapping the EFI memory map into the
vmalloc space before setting up the EFI virtual mappings.

x86 does not provide this functionality which has resulted in the code
in efi_mem_desc_lookup() where it will manually map individual EFI
memmap entries if the memmap has already been torn down on x86,

  /*
   * If a driver calls this after efi_free_boot_services,
   * ->map will be NULL, and the target may also not be mapped.
   * So just always get our own virtual map on the CPU.
   *
   */
  md = early_memremap(p, sizeof (*md));

There isn't a good reason for not providing a permanent EFI memory map
for runtime queries, especially since the EFI regions are not mapped
into the standard kernel page tables.

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:07:43 +01:00
Matt Fleming
9479c7cebf efi: Refactor efi_memmap_init_early() into arch-neutral code
Every EFI architecture apart from ia64 needs to setup the EFI memory
map at efi.memmap, and the code for doing that is essentially the same
across all implementations. Therefore, it makes sense to factor this
out into the common code under drivers/firmware/efi/.

The only slight variation is the data structure out of which we pull
the initial memory map information, such as physical address, memory
descriptor size and version, etc. We can address this by passing a
generic data structure (struct efi_memory_map_data) as the argument to
efi_memmap_init_early() which contains the minimum info required for
initialising the memory map.

In the process, this patch also fixes a few undesirable implementation
differences:

 - ARM and arm64 were failing to clear the EFI_MEMMAP bit when
   unmapping the early EFI memory map. EFI_MEMMAP indicates whether
   the EFI memory map is mapped (not the regions contained within) and
   can be traversed.  It's more correct to set the bit as soon as we
   memremap() the passed in EFI memmap.

 - Rename efi_unmmap_memmap() to efi_memmap_unmap() to adhere to the
   regular naming scheme.

This patch also uses a read-write mapping for the memory map instead
of the read-only mapping currently used on ARM and arm64. x86 needs
the ability to update the memory map in-place when assigning virtual
addresses to regions (efi_map_region()) and tagging regions when
reserving boot services (efi_reserve_boot_services()).

There's no way for the generic fake_mem code to know which mapping to
use without introducing some arch-specific constant/hook, so just use
read-write since read-only is of dubious value for the EFI memory map.

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:06:38 +01:00
Matt Fleming
ab72a27da4 x86/efi: Consolidate region mapping logic
EFI regions are currently mapped in two separate places. The bulk of
the work is done in efi_map_regions() but when CONFIG_EFI_MIXED is
enabled the additional regions that are required when operating in
mixed mode are mapping in efi_setup_page_tables().

Pull everything into efi_map_regions() and refactor the test for
which regions should be mapped into a should_map_region() function.
Generously sprinkle comments to clarify the different cases.

Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:06:37 +01:00
Matt Fleming
4971531af3 x86/efi: Test for EFI_MEMMAP functionality when iterating EFI memmap
Both efi_find_mirror() and efi_fake_memmap() really want to know
whether the EFI memory map is available, not just whether the machine
was booted using EFI. efi_fake_memmap() even has a check for
EFI_MEMMAP at the start of the function.

Since we've already got other code that has this dependency, merge
everything under one if() conditional, and remove the now superfluous
check from efi_fake_memmap().

Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-09-09 16:06:34 +01:00
Waiman Long
f99fd22e4d x86/hpet: Reduce HPET counter read contention
On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single HPET counter shared by all the CPUs.
 2) HPET counter reading is a very slow operation.

Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Something
the performance slowdown can be so severe that the system may crash
because of a NMI watchdog soft lockup, for example.

During the TSC clock calibration process, the default clock source
will be set temporarily to HPET. For systems with many CPUs, it is
possible that NMI watchdog soft lockup may occur occasionally during
that short time period where HPET clocking is active as is shown in
the kernel log below:

[   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[   71.655313] Switching to clocksource hpet
[   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]

This patch addresses the above issues by reducing HPET read contention
using the fact that if more than one CPUs are trying to access HPET at
the same time, it will be more efficient when only one CPU in the group
reads the HPET counter and shares it with the rest of the group instead
of each group member trying to read the HPET counter individually.

This is done by using a combination quadword that contains a 32-bit
stored HPET value and a 32-bit spinlock.  The CPU that gets the lock
will be responsible for reading the HPET counter and storing it in
the quadword. The others will monitor the change in HPET value and
lock status and grab the latest stored HPET value accordingly. This
change is only enabled on 64-bit SMP configuration.

On a 4-socket Haswell-EX box with 144 threads (HT on), running the
AIM7 compute workload (1500 users) on a 4.8-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):

TSC		= 1042431 jobs/min
HPET w/o patch	=  798068 jobs/min
HPET with patch	= 1029445 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 11.19% without patch to 1.24% with patch.

[ tglx: It's really sad that we need to have such hacks just to deal with
  	the fact that cpu vendors have not managed to fix the TSC wreckage
  	within 15+ years. Were They Forgetting? ]

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Randy Wright <rwright@hpe.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1473182530-29175-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 15:16:19 +02:00
Dave Hansen
76de993727 x86/pkeys: Allow configuration of init_pkru
As discussed in the previous patch, there is a reliability
benefit to allowing an init value for the Protection Keys Rights
User register (PKRU) which differs from what the XSAVE hardware
provides.

But, having PKRU be 0 (its init value) provides some nonzero
amount of optimization potential to the hardware.  It can, for
instance, skip writes to the XSAVE buffer when it knows that PKRU
is in its init state.

The cost of losing this optimization is approximately 100 cycles
per context switch for a workload which lightly using XSAVE
state (something not using AVX much).  The overhead comes from a
combinaation of actually manipulating PKRU and the overhead of
pullin in an extra cacheline.

This overhead is not huge, but it's also not something that I
think we should unconditionally inflict on everyone.  So, make it
configurable both at boot-time and from debugfs.

Changes to the debugfs value affect all processes created after
the write to debugfs.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: mgorman@techsingularity.net
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163023.407672D2@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:28 +02:00
Dave Hansen
acd547b298 x86/pkeys: Default to a restrictive init PKRU
PKRU is the register that lets you disallow writes or all access to a given
protection key.

The XSAVE hardware defines an "init state" of 0 for PKRU: its most
permissive state, allowing access/writes to everything.  Since we start off
all new processes with the init state, we start all processes off with the
most permissive possible PKRU.

This is unfortunate.  If a thread is clone()'d [1] before a program has
time to set PKRU to a restrictive value, that thread will be able to write
to all data, no matter what pkey is set on it.  This weakens any integrity
guarantees that we want pkeys to provide.

To fix this, we define a very restrictive PKRU to override the
XSAVE-provided value when we create a new FPU context.  We choose a value
that only allows access to pkey 0, which is as restrictive as we can
practically make it.

This does not cause any practical problems with applications using
protection keys because we require them to specify initial permissions for
each key when it is allocated, which override the restrictive default.

In the end, this ensures that threads which do not know how to manage their
own pkey rights can not do damage to data which is pkey-protected.

I would have thought this was a pretty contrived scenario, except that I
heard a bug report from an MPX user who was creating threads in some very
early code before main().  It may be crazy, but folks evidently _do_ it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: mgorman@techsingularity.net
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:28 +02:00
Dave Hansen
f9afc6197e x86: Wire up protection keys system calls
This is all that we need to get the new system calls themselves
working on x86.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: mgorman@techsingularity.net
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163017.E3C06FD2@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:27 +02:00
Dave Hansen
e8c24d3a23 x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
	int pkey_free(int pkey);

These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors.  The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid.  A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().

These system calls are also very important given the kernel's use
of pkeys to implement execute-only support.  These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel.  The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.

The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey.  For
instance:

	pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.

The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()).  It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.

Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail.  Why?  pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this.  Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.

This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings).  Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.

Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

1. PKRU is the Protection Key Rights User register.  It is a
   usermode-accessible register that controls whether writes
   and/or access to each individual pkey is allowed or denied.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:27 +02:00
Dave Hansen
a8502b67d7 x86/pkeys: Make mprotect_key() mask off additional vm_flags
Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
Three of those bits: READ/WRITE/EXEC get translated directly in to
vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
during the mprotect() call.

We do this clearing today by first calculating the VMA flags we
want set, then clearing the ones we do not want to inherit from
the original VMA:

	vm_flags = calc_vm_prot_bits(prot, key);
	...
	newflags = vm_flags;
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

However, we *also* want to mask off the original VMA's vm_flags in
which we store the protection key.

To do that, this patch adds a new macro:

	ARCH_VM_PKEY_FLAGS

which allows the architecture to specify additional bits that it would
like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
cleared.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:26 +02:00
Dave Hansen
7d06d9c9bd mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.

Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use?  For instance:

	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.

Proposed semantics:
1. protection key 0 is special and represents the default,
   "unassigned" protection key.  It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
   protection key. A protection key of 0 (even if set explicitly)
   represents an unassigned protection key.
   2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
       key may or may not result in a mapping with execute-only
       properties.  pkey_mprotect() plus pkey_set() on all threads
       should be used to _guarantee_ execute-only semantics if this
       is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
   kernel will internally attempt to allocate and dedicate a
   protection key for the purpose of execute-only mappings.  This
   may not be possible in cases where there are no free protection
   keys available.  It can also happen, of course, in situations
   where there is no hardware support for protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:26 +02:00
Dave Hansen
e8c6226d48 x86/pkeys: Add fault handling for PF_PK page fault bit
PF_PK means that a memory access violated the protection key
access restrictions.  It is unconditionally an access_error()
because the permissions set on the VMA don't matter (the PKRU
value overrides it), and we never "resolve" PK faults (like
how a COW can "resolve write fault).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163010.DD1FE1ED@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:26 +02:00
Srinivas Pandruvada
a6cbcdd5ab ACPI / CPPC: Add support for functional fixed hardware address
The CPPC registers can also be accessed via functional fixed hardware
addresse(FFH) in X86. Add support by modifying cpc_read and cpc_write to
be able to read/write MSRs on x86 platform on per cpu basis.
Also with this change, acpi_cppc_processor_probe doesn't bail out if
address space id is not equal to PCC or memory address space and FFH
is supported on the system.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-08 23:02:14 +02:00
Prarit Bhargava
a4497a86fb x86, clock: Fix kvm guest tsc initialization
When booting a kvm guest on AMD with the latest kernel the following
messages are displayed in the boot log:

 tsc: Unable to calibrate against PIT
 tsc: HPET/PMTIMER calibration failed

aa297292d7 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")
introduced a change to account for a difference in cpu and tsc frequencies for
Intel SKL processors. Before this change the native tsc set
x86_platform.calibrate_tsc to native_calibrate_tsc() which is a hardware
calibration of the tsc, and in tsc_init() executed

	tsc_khz = x86_platform.calibrate_tsc();
	cpu_khz = tsc_khz;

The kvm code changed x86_platform.calibrate_tsc to kvm_get_tsc_khz() and
executed the same tsc_init() function.  This meant that KVM guests did not
execute the native hardware calibration function.

After aa297292d7, there are separate native calibrations for cpu_khz and
tsc_khz.  The code sets x86_platform.calibrate_tsc to native_calibrate_tsc()
which is now an Intel specific calibration function, and
x86_platform.calibrate_cpu to native_calibrate_cpu() which is the "old"
native_calibrate_tsc() function (ie, the native hardware calibration
function).

tsc_init() now does

	cpu_khz = x86_platform.calibrate_cpu();
	tsc_khz = x86_platform.calibrate_tsc();
	if (tsc_khz == 0)
		tsc_khz = cpu_khz;
	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
		cpu_khz = tsc_khz;

The kvm code should not call the hardware initialization in
native_calibrate_cpu(), as it isn't applicable for kvm and it didn't do that
prior to aa297292d7.

This patch resolves this issue by setting x86_platform.calibrate_cpu to
kvm_get_tsc_khz().

v2: I had originally set x86_platform.calibrate_cpu to
cpu_khz_from_cpuid(), however, pbonzini pointed out that the CPUID leaf
in that function is not available in KVM.  I have changed the function
pointer to kvm_get_tsc_khz().

Fixes: aa297292d7 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Len Brown <len.brown@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 16:41:55 +02:00
Paolo Bonzini
6f90f1d1d2 Merge tag 'kvm-s390-next-4.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: features and fixes for 4.9

- lazy enablement of runtime instrumentation
- up to 255 CPUs for nested guests
- rework of machine check deliver
- cleanups/fixes
2016-09-08 15:35:44 +02:00
Andy Shevchenko
f43ea76cf3 x86/platform/intel-mid: Keep SRAM powered on at boot
On Penwell SRAM has to be powered on, otherwise it prevents booting.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ca22312dc8 ("x86/platform/intel-mid: Extend PWRMU to support Penwell")
Link: http://lkml.kernel.org/r/20160908103232.137587-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 14:07:54 +02:00
Andy Shevchenko
8e522e1d32 x86/platform/intel-mid: Add Intel Penwell to ID table
Commit:

  ca22312dc8 ("x86/platform/intel-mid: Extend PWRMU to support Penwell")

... enabled the PWRMU driver on platforms based on Intel Penwell, but
unfortunately this is not enough.

Add Intel Penwell ID to pci-mid.c driver as well. To avoid confusion in the
future add a comment to both drivers.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ca22312dc8 ("x86/platform/intel-mid: Extend PWRMU to support Penwell")
Link: http://lkml.kernel.org/r/20160908103232.137587-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 14:07:53 +02:00
Suravee Suthikulpanit
411b44ba80 svm: Implements update_pi_irte hook to setup posted interrupt
This patch implements update_pi_irte function hook to allow SVM
communicate to IOMMU driver regarding how to set up IRTE for handling
posted interrupt.

In case AVIC is enabled, during vcpu_load/unload, SVM needs to update
IOMMU IRTE with appropriate host physical APIC ID. Also, when
vcpu_blocking/unblocking, SVM needs to update the is-running bit in
the IOMMU IRTE. Both are achieved via calling amd_iommu_update_ga().

However, if GA mode is not enabled for the pass-through device,
IOMMU driver will simply just return when calling amd_iommu_update_ga.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:18:58 +02:00
Suravee Suthikulpanit
5881f73757 svm: Introduce AMD IOMMU avic_ga_log_notifier
This patch introduces avic_ga_log_notifier, which will be called
by IOMMU driver whenever it handles the Guest vAPIC (GA) log entry.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:11:56 +02:00
Suravee Suthikulpanit
5ea11f2b31 svm: Introduces AVIC per-VM ID
Introduces per-VM AVIC ID and helper functions to manage the IDs.
Currently, the ID will be used to implement 32-bit AVIC IOMMU GA tag.

The ID is 24-bit one-based indexing value, and is managed via helper
functions to get the next ID, or to free an ID once a VM is destroyed.
There should be no ID conflict for any active VMs.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 12:57:20 +02:00
Wei Yang
3ec979658e x86/e820: Fix very large 'size' handling boundary condition
The (start, size) tuple represents a range [start, start + size - 1],
which means "start" and "start + size - 1" should be compared to see
whether the range overflows.

For example, a range with (start, size):

	(0xffffffff fffffff0, 0x00000000 00000010)

represents

	[0xffffffff fffffff0, 0xffffffff ffffffff]

... would be judged overflow in the original code, while actually it is not.

This patch fixes this and makes sure it still works when size is zero.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1471657213-31817-1-git-send-email-richard.weiyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 09:11:14 +02:00
Josh Poimboeuf
5a8ff54c26 x86/dumpstack: Remove unnecessary stack pointer arguments
When calling show_stack_log_lvl() or dump_trace() with a regs argument,
providing a stack pointer or frame pointer is redundant.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>d
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1694e2e955e3b9a73a3c3d5ba2634344014dd550.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
4b8afafbe7 x86/dumpstack: Add get_stack_pointer() and get_frame_pointer()
The various functions involved in dumping the stack all do similar
things with regard to getting the stack pointer and the frame pointer
based on the regs and task arguments.  Create helper functions to
do that instead.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f448914885a35f333fe04da1b97a6c2cc1f80974.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
d438f5fda3 x86/dumpstack: Make printk_stack_address() more generally useful
Change printk_stack_address() to be useful when called by an unwinder
outside the context of dump_trace().

Specifically:

- printk_stack_address()'s 'data' argument is always used as the log
  level string.  Make that explicit.

- Call touch_nmi_watchdog().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/9fbe0db05bacf66d337c162edbf61450d0cff1e2.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
3e344a0db9 oprofile/x86: Add regs->ip to oprofile trace
dump_trace() doesn't add the interrupted instruction's address to the
trace, so add it manually.  This makes the profile more useful, and also
makes it more consistent with what perf profiling does.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Robert Richter <rric@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6c745a83dbd69fc6857ef9b2f8be0f011d775936.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Josh Poimboeuf
019e579d39 perf/x86: Check perf_callchain_store() error
Add a check to perf_callchain_kernel() so that it returns early if the
callchain entry array is already full.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/dce6d60bab08be2600efd90021d9b85620646161.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:58:40 +02:00
Andy Lutomirski
6271cfdfc0 x86/mm: Improve stack-overflow #PF handling
If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow().  To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6d6cf96b3fb9b4c9aa303817e1dc4de0c7c36487.1472603235.git.luto@kernel.org
[ Minor edit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:47:20 +02:00
Ingo Molnar
2b3061c77c Merge branch 'x86/mm' into x86/asm, to unify the two branches for simplicity
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:41:52 +02:00
Andy Shevchenko
f5fbf84830 x86/cpu: Rename Merrifield2 to Moorefield
Merrifield2 is actually Moorefield.

Rename it accordingly and drop tail digit from Merrifield1.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160906184254.94440-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:13:08 +02:00
Dou Liyang
c291b01515 x86/apic: Fix num_processors value in case of failure
If the topology package map check of the APIC ID and the CPU is a failure,
we don't generate the processor info for that APIC ID yet we increase
disabled_cpus by one - which is buggy.

Only increase num_processors once we are sure we don't fail.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1473214893-16481-1-git-send-email-douly.fnst@cn.fujitsu.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:11:03 +02:00
Andy Shevchenko
bda7b072de x86/platform/intel-mid: Implement power off sequence
Tell SCU that we are about powering off the device.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160907123955.21228-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 08:03:58 +02:00
Suraj Jitindar Singh
8a7e75d47b KVM: Add provisioning for ulong vm stats and u64 vcpu stats
vms and vcpus have statistics associated with them which can be viewed
within the debugfs. Currently it is assumed within the vcpu_stat_get() and
vm_stat_get() functions that all of these statistics are represented as
u32s, however the next patch adds some u64 vcpu statistics.

Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
Since vcpu statistics are per vcpu, they will only be updated by a single
vcpu at a time so this shouldn't present a problem on 32-bit machines
which can't atomically increment 64-bit numbers. However vm statistics
could potentially be updated by multiple vcpus from that vm at a time.
To avoid the overhead of atomics make all vm statistics ulong such that
they are 64-bit on 64-bit systems where they can be atomically incremented
and are 32-bit on 32-bit systems which may not be able to atomically
increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-08 12:25:37 +10:00
Carlos Santa
8d9c20e1d1 drm/i915: Remove .is_mobile field from platform struct
As recommended by Ville Syrjala removing .is_mobile field from the
platform struct definition for vlv and hsw+ GPUs as there's no need to
make the distinction in later hardware anymore. Keep it for older GPUs
as it is still needed for ilk-ivb.

Signed-off-by: Carlos Santa <carlos.santa@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2016-09-07 16:07:07 -07:00
Linus Torvalds
ab29b33a84 Merge tag 'seccomp-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
 "Fix UM seccomp vs ptrace, after reordering landed"

* tag 'seccomp-v4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Remove 2-phase API documentation
  um/ptrace: Fix the syscall number update after a ptrace
  um/ptrace: Fix the syscall_trace_leave call
2016-09-07 10:46:06 -07:00
Jan Dakinevich
9ac7e3e815 KVM: nVMX: expose INS/OUTS information support
Expose the feature to L1 hypervisor if host CPU supports it, since
certain hypervisors requires it for own purposes.

According to Intel SDM A.1, if CPU supports the feature,
VMX_INSTRUCTION_INFO field of VMCS will contain detailed information
about INS/OUTS instructions handling. This field is already copied to
VMCS12 for L1 hypervisor (see prepare_vmcs12 routine) independently
feature presence.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini
16cb025565 KVM: VMX: not use vmcs_config in setup_vmcs_config
setup_vmcs_config takes a pointer to the vmcs_config global.  The
indirection is somewhat pointless, but just keep things consistent
for now.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini
1a6982353d KVM: x86: remove stale comments
handle_external_intr does not enable interrupts anymore, vcpu_enter_guest
does it after calling guest_exit_irqoff.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Paolo Bonzini
bbe41b9508 KVM: x86: ratelimit and decrease severity for guest-triggered printk
These are mostly related to nested VMX.  They needn't have
a loglevel as high as KERN_WARN, and mustn't be allowed to
pollute the host logs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Jan Dakinevich
119a9c01a5 KVM: nVMX: pass valid guest linear-address to the L1
If EPT support is exposed to L1 hypervisor, guest linear-address field
of VMCS should contain GVA of L2, the access to which caused EPT violation.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:28 +02:00
Wanpeng Li
f15a75eedc KVM: nVMX: make emulated nested preemption timer pinned
Commit 61abdbe0bc ("kvm: x86: make lapic hrtimer pinned") pins the emulated
lapic timer. This patch does the same for the emulated nested preemption
timer to avoid vmexit an unrelated vCPU and the latency of kicking IPI to
another vCPU.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Liang Li
72e0ae58a0 vmx: refine validity check for guest linear address
The validity check for the guest line address is inefficient,
check the invalid value instead of enumerating the valid ones.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Mickaël Salaün
ce29856a5e um/ptrace: Fix the syscall number update after a ptrace
Update the syscall number after each PTRACE_SETREGS on ORIG_*AX.

This is needed to get the potentially altered syscall number in the
seccomp filters after RET_TRACE.

This fix four seccomp_bpf tests:
> [ RUN      ] TRACE_syscall.skip_after_RET_TRACE
> seccomp_bpf.c:1560:TRACE_syscall.skip_after_RET_TRACE:Expected -1 (18446744073709551615) == syscall(39) (26)
> seccomp_bpf.c:1561:TRACE_syscall.skip_after_RET_TRACE:Expected 1 (1) == (*__errno_location ()) (22)
> [     FAIL ] TRACE_syscall.skip_after_RET_TRACE
> [ RUN      ] TRACE_syscall.kill_after_RET_TRACE
> TRACE_syscall.kill_after_RET_TRACE: Test exited normally instead of by signal (code: 1)
> [     FAIL ] TRACE_syscall.kill_after_RET_TRACE
> [ RUN      ] TRACE_syscall.skip_after_ptrace
> seccomp_bpf.c:1622:TRACE_syscall.skip_after_ptrace:Expected -1 (18446744073709551615) == syscall(39) (26)
> seccomp_bpf.c:1623:TRACE_syscall.skip_after_ptrace:Expected 1 (1) == (*__errno_location ()) (22)
> [     FAIL ] TRACE_syscall.skip_after_ptrace
> [ RUN      ] TRACE_syscall.kill_after_ptrace
> TRACE_syscall.kill_after_ptrace: Test exited normally instead of by signal (code: 1)
> [     FAIL ] TRACE_syscall.kill_after_ptrace

Fixes: 26703c636c ("um/ptrace: run seccomp after ptrace")

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: James Morris <jmorris@namei.org>
Cc: user-mode-linux-devel@lists.sourceforge.net
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-07 09:25:04 -07:00
Kees Cook
e6971009a9 x86/uaccess: force copy_*_user() to be inlined
As already done with __copy_*_user(), mark copy_*_user() as __always_inline.
Without this, the checks for things like __builtin_const_p() won't work
consistently in either hardened usercopy nor the recent adjustments for
detecting usercopy overflows at compile time.

The change in kernel text size is detectable, but very small:

 text      data     bss     dec      hex     filename
12118735  5768608 14229504 32116847 1ea106f vmlinux.before
12120207  5768608 14229504 32118319 1ea162f vmlinux.after

Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-06 12:16:42 -07:00
Sebastian Andrzej Siewior
9a20ea4b4c x86/kvm: Convert to hotplug state machine
Install the callbacks via the state machine. The online & down callbacks are
invoked on the target CPU so we can avoid using smp_call_function_single().
local_irq_disable() is used because smp_call_function_single() used to invoke
the function with interrupts disabled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160818125731.27256-15-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:25 +02:00
Jiri Olsa
79d102cbfd perf/x86/intel/cqm: Check cqm/mbm enabled state in event init
Yanqiu Zhang reported kernel panic when using mbm event
on system where CQM is detected but without mbm event
support, like with perf:

  # perf stat -e 'intel_cqm/event=3/' -a

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
  IP: [<ffffffff8100d64c>] update_sample+0xbc/0xe0
  ...
   <IRQ>
   [<ffffffff8100d688>] __intel_mbm_event_init+0x18/0x20
   [<ffffffff81113d6b>] flush_smp_call_function_queue+0x7b/0x160
   [<ffffffff81114853>] generic_smp_call_function_single_interrupt+0x13/0x60
   [<ffffffff81052017>] smp_call_function_interrupt+0x27/0x40
   [<ffffffff816fb06c>] call_function_interrupt+0x8c/0xa0
  ...

The reason is that we currently allow to init mbm event
even if mbm support is not detected.  Adding checks for
both cqm and mbm events and support into cqm's event_init.

Fixes: 33c3cc7acf ("perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init")
Reported-by: Yanqiu Zhang <yanqzhan@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1473089407-21857-1-git-send-email-jolsa@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 10:42:12 +02:00
Wanpeng Li
e12c8f36f3 KVM: lapic: adjust preemption timer correctly when goes TSC backward
TSC_OFFSET will be adjusted if discovers TSC backward during vCPU load.
The preemption timer, which relies on the guest tsc to reprogram its
preemption timer value, is also reprogrammed if vCPU is scheded in to
a different pCPU. However, the current implementation reprogram preemption
timer before TSC_OFFSET is adjusted to the right value, resulting in the
preemption timer firing prematurely.

This patch fix it by adjusting TSC_OFFSET before reprogramming preemption
timer if TSC backward.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krċmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-05 16:14:39 +02:00
Juergen Gross
99bc67536d xen: Add xen_pin_vcpu() to support calling functions on a dedicated pCPU
Some hardware models (e.g. Dell Studio 1555 laptops) require calls to
the firmware to be issued on CPU 0 only. As Dom0 might have to use
these calls, add xen_pin_vcpu() to achieve this functionality.

In case either the domain doesn't have the privilege to make the
related hypercall or the hypervisor isn't supporting it, issue a
warning once and disable further pinning attempts.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Cc: Douglas_Warzecha@dell.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: boris.ostrovsky@oracle.com
Cc: chrisw@sous-sol.org
Cc: hpa@zytor.com
Cc: jdelvare@suse.com
Cc: jeremy@goop.org
Cc: linux@roeck-us.net
Cc: pali.rohar@gmail.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1472453327-19050-5-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:52:40 +02:00