A number of people are reporting seeing the "setup_efi_pci() failed!"
error message in what used to be a quiet boot,
https://bugzilla.kernel.org/show_bug.cgi?id=81891
The message isn't all that helpful because setup_efi_pci() can return a
non-success error code for a variety of reasons, not all of them fatal.
Let's drop the return code from setup_efi_pci*() altogether, since
there's no way to process it in any meaningful way outside of the inner
__setup_efi_pci*() functions.
Reported-by: Darren Hart <dvhart@linux.intel.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Ulf Winkelvos <ulf@winkelvos.de>
Cc: Andre Müller <andre.muller@web.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Stephen Rothwell's compiler did something amazing: it unrolled a
loop, discovered that one iteration of that loop contained an
always-true test, and emitted a warning that will IMO only serve
to convince people to disable the warning.
That bogus warning caused me to wonder what prompted such an
absurdity from his compiler, and I discovered that the code in
question was, in fact, completely wrong -- I was looking things
up in the wrong array.
This affects 3.16 as well, but the only effect is to screw up
the error checking a bit. vdso2c's output is unaffected.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/53d96ad5.80ywqrbs33ZBCQej%25akpm@linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit f23cf8bd5c ("efi/x86: efistub: Move shared
dependencies to <asm/efi.h>") as well as the x86 parts of commit
f4f75ad574 ("efi: efistub: Convert into static library").
The road leading to these two reverts is long and winding.
The above two commits were merged during the v3.17 merge window and
turned the common EFI boot stub code into a static library. This
necessitated making some symbols global in the x86 boot stub which
introduced new entries into the early boot GOT.
The problem was that we weren't fixing up the newly created GOT entries
before invoking the EFI boot stub, which sometimes resulted in hangs or
resets. This failure was reported by Maarten on his Macbook pro.
The proposed fix was commit 9cb0e39423 ("x86/efi: Fixup GOT in all
boot code paths"). However, that caused issues for Linus when booting
his Sony Vaio Pro 11. It was subsequently reverted in commit
f3670394c2.
So that leaves us back with Maarten's Macbook pro not booting.
At this stage in the release cycle the least risky option is to revert
the x86 EFI boot stub to the pre-merge window code structure where we
explicitly #include efi-stub-helper.c instead of linking with the static
library. The arm64 code remains unaffected.
We can take another swing at the x86 parts for v3.18.
Conflicts:
arch/x86/include/asm/efi.h
Tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Tested-by: Leif Lindholm <leif.lindholm@linaro.org> [arm64]
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>,
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Since the arch is found locally in __audit_syscall_entry(), there is no need to
pass it in as a parameter. Delete it from the parameter list.
x86* was the only arch to call __audit_syscall_entry() directly and did so from
assembly code.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Eric Paris <eparis@redhat.com>
---
As this patch relies on changes in the audit tree, I think it
appropriate to send it through my tree rather than the x86 tree.
This patch defines syscall_get_arch() for the um platform. It adds a
new syscall.h header file to define this. It copies the HOST_AUDIT_ARCH
definition from ptrace.h. (that definition will be removed when we
switch audit to use this new syscall_get_arch() function)
Based-on-patch-by: Richard Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs
that were used to map I/O regions that are 1:1 in the p2m. This
allowed Xen to obtain the correct PFN when converting the MFNs read
from a PTE back to their PFN.
Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
returns the correct PFN by using a combination of the m2p and p2m to
determine if an MFN corresponds to a 1:1 mapping in the the p2m.
Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
future uses of the PTE flag.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Since mfn_to_pfn() returns the correct PFN for identity mappings (as
used for MMIO regions), the use of _PAGE_IOMAP is not required in
pte_mfn_to_pfn().
Do not set the _PAGE_IOMAP flag in pte_pfn_to_mfn() and do not use it
in pte_mfn_to_pfn().
This will allow _PAGE_IOMAP to be removed, making it available for
future use.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a fault on a kernel address is due to a non-present page, then it
cannot be the result of stale TLB entry from a protection change (RO
to RW or NX to X). Thus the pagetable walk in spurious_fault() can be
skipped.
See the initial if in spurious_fault() and the tests in
spurious_fault_check()) for the set of possible error codes checked
for spurious faults. These are:
IRUWP
Before x00xx && ( 1xxxx || xxx1x )
After ( 10001 || 00011 ) && ( 1xxxx || xxx1x )
Thus the new condition is a subset of the previous one, excluding only
non-present faults (I == 1 and W == 1 are mutually exclusive).
This avoids spurious_fault() oopsing in some cases if the pagetables
it attempts to walk are not accessible. This obscures the location of
the original fault.
This also fixes a crash with Xen PV guests when they access entries in
the M2P corresponding to device MMIO regions. The M2P is mapped
(read-only) by Xen into the kernel address space of the guest and this
mapping may contains holes for non-RAM regions. Read faults will
result in calls to spurious_fault(), but because the page tables for
the M2P mappings are not accessible by the guest the pagetable walk
would fault.
This was not normally a problem as MMIO mappings would not normally
result in a M2P lookup because of the use of the _PAGE_IOMAP bit the
PTE. However, removing the _PAGE_IOMAP bit requires M2P lookups for
MMIO mappings as well.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
I discovered that some needed stuff is defined/declared in headers
which are not included directly. Currently it works but if somebody
remove required headers from currently included headers then build
will break. So, just in case directly include all needed headers.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Instead of ballooning up and down dom0 memory this remaps the existing mfns
that were replaced by the identity map. The reason for this is that the
existing implementation ballooned memory up and and down which caused dom0
to have discontiguous pages. In some cases this resulted in the use of bounce
buffers which reduced network I/O performance significantly. This change will
honor the existing order of the pages with the exception of some boundary
conditions.
To do this we need to update both the Linux p2m table and the Xen m2p table.
Particular care must be taken when updating the p2m table since it's important
to limit table memory consumption and reuse the existing leaf pages which get
freed when an entire leaf page is set to the identity map. To implement this,
mapping updates are grouped into blocks with table entries getting cached
temporarily and then released.
On my test system before:
Total pages: 2105014
Total contiguous: 1640635
After:
Total pages: 2105014
Total contiguous: 2098904
Signed-off-by: Matthew Rushton <mrushton@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The DMI tables are only used in __init code, thereby can be marked as
initialization data, too. The same is true for the callback functions
referenced from the DMI tables.
This moves ~9.6 kB of code and r/o data to the init sections, marking the
memory for release after initialization.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Pull PCI fixes from Bjorn Helgaas:
"These fix:
- Boot video device detection on dual-GPU Apple systems
- Hotplug fiascos on VGA switcheroo with radeon & nouveau drivers
- Boot hang on Freescale i.MX6 systems
- Excessive "no hotplug settings from platform" warnings
In particular:
Enumeration
- Don't default exclusively to first video device (Bruno Prémont)
PCI device hotplug
- Remove "no hotplug settings from platform" warning (Bjorn Helgaas)
- Add pci_ignore_hotplug() for VGA switcheroo (Bjorn Helgaas)
Freescale i.MX6
- Put LTSSM in "Detect" state before disabling (Lucas Stach)"
* tag 'pci-v3.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
vgaarb: Drop obsolete #ifndef
vgaarb: Don't default exclusively to first video device with mem+io
ACPIPHP / radeon / nouveau: Remove acpi_bus_no_hotplug()
PCI: Remove "no hotplug settings from platform" warning
PCI: Add pci_ignore_hotplug() to ignore hotplug events for a device
PCI: imx6: Put LTSSM in "Detect" state before disabling it
MAINTAINERS: Add Lucas Stach as co-maintainer for i.MX6 PCI driver
Pull perf fixes from Ingo Molnar:
"Two kernel side fixes: a kprobes fix and a perf_remove_from_context()
fix (which does not yet fix the migration bug which is WIP)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix a race condition in perf_remove_from_context()
kprobes/x86: Free 'optinsn' cache when range check fails
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
EFI fixes, a build fix, a page table dumping (debug) fix and a clang
build fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/arm64: Fix fdt-related memory reservation
x86/mm: Apply the section attribute to the variable, not its type
x86/efi: Fixup GOT in all boot code paths
x86/efi: Only load initrd above 4g on second try
x86-64, ptdump: Mark espfix area only if existent
x86, irq: Fix build error caused by 9eabc99a63
intel_init_thermal() is called from a) at the time of system initializing
and b) at the time of system resume to initialize thermal
monitoring.
In case when thermal monitoring is handled by SMI, we get to know it via
printk(). Currently it gives the message at both cases, but its okay if
we get it only once and no need to get the same message at every time
system resumes.
So, limit showing this message only at system boot time by avoid showing
at system resume and reduce abusing kernel log buffer.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1411068135.5121.10.camel@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull EFI fix from Matt Fleming:
* Increase the number of early_ioremap() slots to fix a regression with
earlyprintk=efi after recent changes to the ACPI code (Dave Young)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In init_rmode_identity_map(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.
This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.
In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized
Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.
NOTE: In the original code, ept identity pagetable page is pinned in memroy.
As a result, it cannot be migrated/hot-removed. After this patch, since
kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
is no longer pinned in memory. And it can be migrated/hot-removed.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 20cde69402 ("x86, ia64: Move EFI_FB vga_default_device()
initialization to pci_vga_fixup()") moved boot video device detection from
efifb to x86 and ia64 pci/fixup.c.
For dual-GPU Apple computers above change represents a regression as code
in efifb did forcefully override vga_default_device while the merge did not
(vgaarb happens prior to PCI fixup).
To improve on initial device selection by vgaarb (it cannot know if PCI
device not behind bridges see/decode legacy VGA I/O or not), move the
screen_info based check from pci_video_fixup() to vgaarb's init function and
use it to refine/override decision taken while adding the individual PCI
VGA devices. This way PCI fixup has no reason to adjust vga_default_device
anymore but can depend on its value for flagging shadowed VBIOS.
This has the nice benefit of removing duplicated code but does introduce a
#if defined() block in vgaarb. Not all architectures have screen_info and
would cause compile to fail without it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=84461
Reported-and-Tested-By: Andreas Noever <andreas.noever@gmail.com>
Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Matthew Garrett <matthew.garrett@nebula.com>
CC: stable@vger.kernel.org # v3.5+
This patch replace the set_bit method by kvm_make_request
to make code more readable and consistent.
Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The options -mno-mmx and -mno-sse are unconditionally added to
KBUILD_CFLAGS in both branches of an ifeq and through a
$(cc-option) further down. We can safely remove the first
instances.
In fact, since the -mno-mmx and -mno-sse options were introduced
simultaneous with the other two options in the $(cc-option)
[according to http://www.gnu.org/software/gcc/gcc-3.1/changes.html],
and since the former were unconditionally used, one can deduce that
only gcc versions knowing about all four are supported. So also
eliminate the $(cc-option) wrap.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Peter Foley <pefoley2@pefoley.com>
Link: http://lkml.kernel.org/r/1410365139-24440-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In arch/x86/kernel/setup.c::trim_bios_range(), the codes
introduced by 1b5576e6 (base on d8a9e6a5), it updates the first
4Kb of memory to be E820_RESERVED region. That's because it's a
BIOS owned area but generally not listed in the E820 table:
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x0000000000096fff] usable
BIOS-e820: [mem 0x0000000000097000-0x0000000000097fff] reserved
...
e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
e820: remove [mem 0x000a0000-0x000fffff] usable
But the region of first 4Kb didn't register to nosave memory:
PM: Registered nosave memory: [mem 0x00097000-0x00097fff]
PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
The code in e820_mark_nosave_regions() assumes the first e820
area to be RAM, so it causes the first 4Kb E820_RESERVED region
ignored when register to nosave. This patch removed assumption
of the first e820 area.
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Link: http://lkml.kernel.org/r/1410491038-17576-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The setup_node_data() function allocates a pg_data_t object,
inserts it into the node_data[] array and initializes the
following fields: node_id, node_start_pfn and
node_spanned_pages.
However, a few function calls later during the kernel boot,
free_area_init_node() re-initializes those fields, possibly with
setup_node_data() is not used.
This causes a small glitch when running Linux as a hyperv numa
guest:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
Initmem setup node 0 [mem 0x00000000-0x7fffffff]
NODE_DATA [mem 0x7ffdc000-0x7ffeffff]
Initmem setup node 1 [mem 0x80800000-0x1081fffff]
NODE_DATA [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
In this dmesg, the SRAT table reports that the memory range for
node 1 starts at 0x80200000. However, the line starting with
"Initmem" reports that node 1 memory range starts at 0x80800000.
The "Initmem" line is reported by setup_node_data() and is
wrong, because the kernel ends up using the range as reported in
the SRAT table.
This commit drops all that dead code from setup_node_data(),
renames it to alloc_node_data() and adds a printk() to
free_area_init_node() so that we report a node's memory range
accurately.
Here's the same dmesg section with this patch applied:
SRAT: PXM 0 -> APIC 0x00 -> Node 0
SRAT: PXM 0 -> APIC 0x01 -> Node 0
SRAT: PXM 1 -> APIC 0x02 -> Node 1
SRAT: PXM 1 -> APIC 0x03 -> Node 1
SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 1 [mem 0x80200000-0xf7ffffff]
SRAT: Node 1 PXM 1 [mem 0x100000000-0x1081fffff]
NUMA: Node 1 [mem 0x80200000-0xf7ffffff] + [mem 0x100000000-0x1081fffff] -> [mem 0x80200000-0x1081fffff]
NODE_DATA(0) allocated [mem 0x7ffdc000-0x7ffeffff]
NODE_DATA(1) allocated [mem 0x1081ea000-0x1081fdfff]
crashkernel: memory value expected
[ffffea0000000000-ffffea0001ffffff] PMD -> [ffff88007de00000-ffff88007fdfffff] on node 0
[ffffea0002000000-ffffea00043fffff] PMD -> [ffff880105600000-ffff8801077fffff] on node 1
Zone ranges:
DMA [mem 0x00001000-0x00ffffff]
DMA32 [mem 0x01000000-0xffffffff]
Normal [mem 0x100000000-0x1081fffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x00001000-0x0009efff]
node 0: [mem 0x00100000-0x7ffeffff]
node 1: [mem 0x80200000-0xf7ffffff]
node 1: [mem 0x100000000-0x1081fffff]
Initmem setup node 0 [mem 0x00001000-0x7ffeffff]
On node 0 totalpages: 524174
DMA zone: 64 pages used for memmap
DMA zone: 21 pages reserved
DMA zone: 3998 pages, LIFO batch:0
DMA32 zone: 8128 pages used for memmap
DMA32 zone: 520176 pages, LIFO batch:31
Initmem setup node 1 [mem 0x80200000-0x1081fffff]
On node 1 totalpages: 524288
DMA32 zone: 7672 pages used for memmap
DMA32 zone: 491008 pages, LIFO batch:31
Normal zone: 520 pages used for memmap
Normal zone: 33280 pages, LIFO batch:7
This commit was tested on a two node bare-metal NUMA machine and
Linux as a numa guest on hyperv and qemu/kvm.
PS: The wrong memory range reported by setup_node_data() seems to be
harmless in the current kernel because it's just not used. However,
that bad range is used in kernel 2.6.32 to initialize the old boot
memory allocator, which causes a crash during boot.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When hot-adding/removing memory, sync_global_pgds() is called
for synchronizing PGD to PGD entries of all processes MM. But
when hot-removing memory, sync_global_pgds() does not work
correctly.
At first, sync_global_pgds() checks whether target PGD is none
or not. And if PGD is none, the PGD is skipped. But when
hot-removing memory, PGD may be none since PGD may be cleared by
free_pud_table(). So when sync_global_pgds() is called after
hot-removing memory, sync_global_pgds() should not skip PGD even
if the PGD is none. And sync_global_pgds() must clear PGD
entries of all processes MM.
Currently sync_global_pgds() does not clear PGD entries of all
processes MM when hot-removing memory. So when hot adding
memory which is same memory range as removed memory after
hot-removing memory, following call traces are shown:
kernel BUG at arch/x86/mm/init_64.c:206!
...
[<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
[<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
[<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
[<ffffffff815d03d9>] add_memory+0xb9/0x1b0
[<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
[<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
[<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
[<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
[<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
[<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
[<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
This patch clears PGD entries of all processes MM when
sync_global_pgds() is called after hot-removing memory
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When hot-adding memory after hot-removing memory, following call
traces are shown:
kernel BUG at arch/x86/mm/init_64.c:206!
...
[<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
[<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
[<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
[<ffffffff815d03d9>] add_memory+0xb9/0x1b0
[<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
[<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
[<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
[<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
[<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
[<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
[<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
The patch-set fixes the issue.
This patch (of 2):
remove_pagetable() gets start argument and passes the argument
to sync_global_pgds(). In this case, the argument must not be
modified. If the argument is modified and passed to
sync_global_pgds(), sync_global_pgds() does not correctly
synchronize PGD to PGD entries of all processes MM since
synchronized range of memory [start, end] is wrong.
Unfortunately the start argument is modified in
remove_pagetable(). So this patch fixes the issue.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3.16 kernel boot fail with earlyprintk=efi, it keeps scrolling at the
bottom line of screen.
Bisected, the first bad commit is below:
commit 86dfc6f339
Author: Lv Zheng <lv.zheng@intel.com>
Date: Fri Apr 4 12:38:57 2014 +0800
ACPICA: Tables: Fix table checksums verification before installation.
I did some debugging by enabling both serial and efi earlyprintk, below is
some debug dmesg, seems early_ioremap fails in scroll up function due to
no free slot, see below dmesg output:
WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:116 __early_ioremap+0x90/0x1c4()
__early_ioremap(ed00c800, 00000c80) not found slot
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.17.0-rc1+ #204
Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.15 05/09/2013
Call Trace:
dump_stack+0x4e/0x7a
warn_slowpath_common+0x75/0x8e
? __early_ioremap+0x90/0x1c4
warn_slowpath_fmt+0x47/0x49
__early_ioremap+0x90/0x1c4
? sprintf+0x46/0x48
early_ioremap+0x13/0x15
early_efi_map+0x24/0x26
early_efi_scroll_up+0x6d/0xc0
early_efi_write+0x1b0/0x214
call_console_drivers.constprop.21+0x73/0x7e
console_unlock+0x151/0x3b2
? vprintk_emit+0x49f/0x532
vprintk_emit+0x521/0x532
? console_unlock+0x383/0x3b2
printk+0x4f/0x51
acpi_os_vprintf+0x2b/0x2d
acpi_os_printf+0x43/0x45
acpi_info+0x5c/0x63
? __acpi_map_table+0x13/0x18
? acpi_os_map_iomem+0x21/0x147
acpi_tb_print_table_header+0x177/0x186
acpi_tb_install_table_with_override+0x4b/0x62
acpi_tb_install_standard_table+0xd9/0x215
? early_ioremap+0x13/0x15
? __acpi_map_table+0x13/0x18
acpi_tb_parse_root_table+0x16e/0x1b4
acpi_initialize_tables+0x57/0x59
acpi_table_init+0x50/0xce
acpi_boot_table_init+0x1e/0x85
setup_arch+0x9b7/0xcc4
start_kernel+0x94/0x42d
? early_idt_handlers+0x120/0x120
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0xf3/0x100
Quote reply from Lv.zheng about the early ioremap slot usage in this case:
"""
In early_efi_scroll_up(), 2 mapping entries will be used for the src/dst screen buffer.
In drivers/acpi/acpica/tbutils.c, we've improved the early table loading code in acpi_tb_parse_root_table().
We now need 2 mapping entries:
1. One mapping entry is used for RSDT table mapping. Each RSDT entry contains an address for another ACPI table.
2. For each entry in RSDP, we need another mapping entry to map the table to perform necessary check/override before installing it.
When acpi_tb_parse_root_table() prints something through EFI earlyprintk console, we'll have 4 mapping entries used.
The current 4 slots setting of early_ioremap() seems to be too small for such a use case.
"""
Thus increase the slot to 8 in this patch to fix this issue.
boot-time mappings become 512 page with this patch.
Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: <stable@vger.kernel.org> # v3.16
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
It used to be an ad-hoc hack defined by the x86 version of
<asm/bitops.h> that enabled a couple of library routines to know whether
an integer multiply is faster than repeated shifts and additions.
This just makes it use the real Kconfig system instead, and makes x86
(which was the only architecture that did this) select the option.
NOTE! Even for x86, this really is kind of wrong. If we cared, we would
probably not enable this for builds optimized for netburst (P4), where
shifts-and-adds are generally faster than multiplies. This patch does
*not* change that kind of logic, though, it is purely a syntactic change
with no code changes.
This was triggered by the fact that we have other places that really
want to know "do I want to expand multiples by constants by hand or
not", particularly the hash generation code.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 supports irq work self-IPIs when local apic is available. This is
partly known on runtime so lets implement arch_irq_work_has_interrupt()
accordingly.
This should be safely called after setup_arch().
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The nohz full code needs irq work to trigger its own interrupt so that
the subsystem can work even when the tick is stopped.
Lets introduce arch_irq_work_has_interrupt() that archs can override to
tell about their support for this ability.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Pull Xen bug fixes from David Vrabel:
- fix for PVHVM suspend/resume and migration
- don't pointlessly retry certain ballooning ops
- fix gntalloc when grefs have run out.
- fix PV boot if KSALR is enable or very large modules are used.
* tag 'stable/for-linus-3.17-b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: don't copy bogus duplicate entries into kernel page tables
xen/gntalloc: safely delete grefs in add_grefs() undo path
xen/gntalloc: fix oops after runnning out of grant refs
xen/balloon: cancel ballooning if adding new memory failed
xen/manage: Always freeze/thaw processes when suspend/resuming
The original motivation for these patches was for an Intel CPU
feature called MPX. The patch to add a disabled feature for it
will go in with the other parts of the support.
But, in the meantime, there are a few other features than MPX
that we can make assumptions about at compile-time based on
compile options. Add them to disabled-features.h and check them
with cpu_feature_enabled().
Note that this gets rid of the last things that needed an #ifdef
CONFIG_X86_64 in cpufeature.h. Yay!
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211524.C0EC332A@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
I believe the REQUIRED_MASK aproach was taken so that it was
easier to consult in assembly (arch/x86/kernel/verify_cpu.S).
DISABLED_MASK does not have the same restriction, but I
implemented it the same way for consistency.
We have a REQUIRED_MASK... which does two things:
1. Keeps a list of cpuid bits to check in very early boot and
refuse to boot if those are not present.
2. Consulted during cpu_has() checks, which allows us to
optimize out things at compile-time. In other words, if we
*KNOW* we will not boot with the feature off, then we can
safely assume that it will be present forever.
But, we don't have a similar mechanism for CPU features which
may be present but that we know we will not use. We simply
use our existing mechanisms to repeatedly check the status of
the bit at runtime (well, the alternatives patching helps here
but it does not provide compile-time optimization).
Adding a feature to disabled-features.h allows the bit to be
checked via a new macro: cpu_feature_enabled(). Note that
for features in DISABLED_MASK, checks with this macro have
all of the benefits of an #ifdef. Before, we would have done
this in a header:
#ifdef CONFIG_X86_INTEL_MPX
#define cpu_has_mpx cpu_has(X86_FEATURE_MPX)
#else
#define cpu_has_mpx 0
#endif
and this in the code:
if (cpu_has_mpx)
do_some_mpx_thing();
Now, just add your feature to DISABLED_MASK and you can do this
everywhere, and get the same benefits you would have from
#ifdefs:
if (cpu_feature_enabled(X86_FEATURE_MPX))
do_some_mpx_thing();
We need a new function and *not* a modification to cpu_has()
because there are cases where we actually need to check the CPU
itself, despite what features the kernel supports. The best
example of this is a hypervisor which has no control over what
features its guests are using and where the guest does not depend
on the host for support.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211513.9E35E931@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
cpu_has_pae is only referenced in one place: the X86_32 kexec
code (in a file not even built on 64-bit). It hardly warrants
its own macro, or the trouble we go to ensuring that it can't
be called in X86_64 code.
Axe the macro and replace it with a direct cpu feature check.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140911211511.AD76E774@viggo.jf.intel.com
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case. However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of
apic access page. So use this macro.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.
WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
vmap_page_range_noflush+0x2a1/0x360()
This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.
Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.
L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel
[256..511]->module
[511]->level2_fixmap_pgt[ 0..505]->module
This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.
With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:
L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
[511]->level2_fixmap_pgt[0..505]->module
And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.
Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.
A number of comments were also using the the wrong L3 offset for
level2_kernel_pgt. These have been corrected.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org