Convert the driver from using the x86 specific MTRR code to
the architecture agnostic arch_phys_wc_add(). arch_phys_wc_add()
will avoid MTRR if write-combining is available, in order to
take advantage of that also ensure the ioremap'd area is requested
as write-combining.
There are a few motivations for this:
a) Take advantage of PAT when available
b) Help bury MTRR code away, MTRR is architecture specific and on
x86 its replaced by PAT
c) Help with the goal of eventually using _PAGE_CACHE_UC over
_PAGE_CACHE_UC_MINUS on x86 on ioremap_nocache() (see commit
de33c442e titled "x86 PAT: fix performance drop for glx,
use UC minus for ioremap(), ioremap_nocache() and
pci_mmap_page_range()")
The conversion done is expressed by the following Coccinelle
SmPL patch, it additionally required manual intervention to
address all the #ifdery and removal of redundant things which
arch_phys_wc_add() already addresses such as verbose message
about when MTRR fails and doing nothing when we didn't get
an MTRR.
@ mtrr_found @
expression index, base, size;
@@
-index = mtrr_add(base, size, MTRR_TYPE_WRCOMB, 1);
+index = arch_phys_wc_add(base, size);
@ mtrr_rm depends on mtrr_found @
expression mtrr_found.index, mtrr_found.base, mtrr_found.size;
@@
-mtrr_del(index, base, size);
+arch_phys_wc_del(index);
@ mtrr_rm_zero_arg depends on mtrr_found @
expression mtrr_found.index;
@@
-mtrr_del(index, 0, 0);
+arch_phys_wc_del(index);
@ mtrr_rm_fb_info depends on mtrr_found @
struct fb_info *info;
expression mtrr_found.index;
@@
-mtrr_del(index, info->fix.smem_start, info->fix.smem_len);
+arch_phys_wc_del(index);
@ ioremap_replace_nocache depends on mtrr_found @
struct fb_info *info;
expression base, size;
@@
-info->screen_base = ioremap_nocache(base, size);
+info->screen_base = ioremap_wc(base, size);
@ ioremap_replace_default depends on mtrr_found @
struct fb_info *info;
expression base, size;
@@
-info->screen_base = ioremap(base, size);
+info->screen_base = ioremap_wc(base, size);
Generated-by: Coccinelle SmPL
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
This patch adds four new IOCTLs to amdkfd. These IOCTLs expose a H/W
debugger functionality to the userspace.
The IOCTLs are:
- AMDKFD_IOC_DBG_REGISTER:
The purpose of this IOCTL is to notify amdkfd that a process wants to use
GPU debugging facilities on itself only.
It is expected that this IOCTL would be called before any other H/W
debugger requests are sent to amdkfd and for each GPU where the H/W
debugging needs to be enabled. The use of this IOCTL ensures that only
one instance of a debugger is active in the system.
- AMDKFD_IOC_DBG_UNREGISTER:
This IOCTL detaches the debugger/debugged process from the H/W
Debug which was established by the AMDKFD_IOC_DBG_REGISTER IOCTL.
- AMDKFD_IOC_DBG_ADDRESS_WATCH:
This IOCTL allows to set different watchpoints with various conditions as
indicated by the IOCTL's arguments. The available number of watchpoints
is retrieved from topology. This operation is confined to the current
debugged process, which was registered through AMDKFD_IOC_DBG_REGISTER.
- AMDKFD_IOC_DBG_WAVE_CONTROL:
This IOCTL allows to control a wavefront as indicated by the IOCTL's
arguments. For example, you can halt/resume or kill either a
single wavefront or a set of wavefronts. This operation is confined to
the current debugged process, which was registered through
AMDKFD_IOC_DBG_REGISTER.
Because the arguments for the address watch IOCTL and wave control IOCTL
are dynamic, meaning that they could vary in size, the userspace passes a
pointer to a structure (in userspace) that contains the value of the
arguments. The kernel driver is responsible to parse this structure and
validate its contents.
v2: change void* to uint64_t inside ioctl arguments
Signed-off-by: Yair Shachar <yair.shachar@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Define an enum for gpt timer device type in include/soc/imx/timer.h to
tell the gpt block differences among SoCs. Update non-DT users (clock
drivers) to pass the device type.
As we now have include/soc/imx/timer.h, the declaration of
mxc_timer_init() is moved into there as the best fit.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Enabled DAP (debug access port) by default. This enables the hw-
breakpoint framework to make use of the breakpoints and watchpoints
supported by hardware.
[ 0.215805] hw-breakpoint: found 2 (+1 reserved) breakpoint and 1 watchpoint registers.
[ 0.224624] hw-breakpoint: maximum watchpoint size is 4 bytes.
Without this clock, the hw-breakpoint driver claims an undefined
instruction during initialization:
[ 0.227380] hw-breakpoint: Debug register access (0xee003e17) caused undefined instruction on CPU 0
[ 0.227519] hw-breakpoint: CPU 0 failed to disable vector catch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
The revision definitions and declarations are widely used by clock
drivers. As a step of moving clock drivers out of arch/arm/mach-imx,
let's create header include/soc/imx/revision.h to accommodate them.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
It adds the imx7d clock ID definitions which will be used by both imx7d
clock driver and device tree.
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Kishon writes:
phy: for 4.2 merge window
*) new Broadcom SATA3 PHY driver for Broadcom STB SoCs
*) new phy API to get PHY by index which is used in EHCI and
OHCI controller drivers
*) support specifying supply at port level used for multi-port PHYs
*) sparse warning fixes in miphy PHYs
*) fix pm_runtime issues in twl4030 driver
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Chanwoo writes:
Update extcon for v4.2
This patchset include the huge update of extcon core and add the new one extcon
driver and fix minor isseu of extcon drivers.
Detailed description for patchset:
1. Update the extcon core.
- Modify the extcon device name on sysfs from device name name to 'extcon[X]'
as following because if same extcon device are included in H/W development board,
the one of the two device driver might be failed on the probe().
: /sys/class/extcon/[device name] -> /sys/class/extcon/extcon[X]
- Use the unique id for external connectors instead of legacy string name.
Previously, extcon used the string name to identify the type of external
connectors. This way have the many potential issues. So, extcon core define the
unique id for each external connectors as following:
enum extcon {
EXTCON_NONE = 0x0,
/* USB external connector */
EXTCON_USB = 0x1,
EXTCON_USB_HOST = 0x2,
/* Charger external connector */
EXTCON_TA = 0x10,
EXTCON_FAST_CHARGER = 0x11,
EXTCON_SLOW_CHARGER = 0x12,
EXTCON_CHARGE_DOWNSTREAM = 0x13,
/* Audio and video external connector */
EXTCON_LINE_IN = 0x20,
EXTCON_LINE_OUT = 0x21,
EXTCON_MICROPHONE = 0x22,
EXTCON_HEADPHONE = 0x23,
...
};
- Update tye prototype of extcon_register_notifier() by using the unique id
(enum extcon) of external connectors.
- Add extcon_get_edev_name() API to get the name of extcon device on extcon
client driver because the name is included in 'struct extcon_dev' and 'struct
extcon_dev' should be handled in only drivers/extcon directory. So. if extcon
client need the name of extcon device, they could use this function.
- Unify the jig/dock and MHL-TA cable name on extcon driver.
: JIG-{USB-ON|USB-OFF|UART-ON|UART-OFF} -> JIG
: Dock-{Smart|Desk|Audio|Card} -> DOCK
: MHL-TA -> TA
- Use the capital letter for the name of all external connectors.
- Remove the optional print_name() function pointer from struct extcon_dev to
maintain the consistent name of extcon device.
2. Add the new extcon-axp288.c extcon driver.
- The extcon-axp288.c driver support for AXP288 PMIC which has the BC1.2
charger detection capability. So this extcon driver can detect the
EXTCON_SLOW_CHARGER, EXTCON_CHARGE_DOWNSTREAM and EXTCON_FAST_CHARGER.
3. Update the extcon-arizona.c driver.
- Add support for selective detection mode when headphone detection.
- Apply HP clamps for WM8280
4. Clean-up the extcon core and drivers.
- Add manufactor information of each extcon device.
- Fix checkpatch warning and minor coding style on extcon.c.c
- Fix build break if GPIOLIB is not enabled on extcon-usb-gpiio.c.
- Set the direction of gpio when calling devm_gpiod_get() on extcon-usb-gpio.c
This patch does two things.
Firstly it presents the Accelerator Function Unit (AFUs) behind the POWER
Service Layer (PSL) as PCI devices on a virtual PCI Host Bridge (vPHB). This
in in addition to the PSL being a PCI device itself.
As part of the Coherent Accelerator Interface Architecture (CAIA) AFUs can
provide an AFU configuration. This AFU configuration recored is architected to
be the same as a PCI config space.
This patch sets discovers the AFU configuration records, provides AFU config
space read/write functions to these configuration records. It then enumerates
the PCI bus. It also hooks in PCI ops where appropriate. It also destroys the
vPHB when the physical card is removed.
Secondly, it add an in kernel API for AFU to use CXL. AFUs must present a
driver that firstly binds as a PCI device. This PCI device can then be using
to do CXL specific operations (that can't sit in the PCI ops) using this API.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the current include file from cxl.h -> cxl-base.h. This current
include file is used only to pass information between the base driver that
needs to be built into the kernel and the cxl module.
This is to make way for a new include/misc/cxl.h which will
contain just the kernel API for other driver to use
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Given a file descriptor on an afu device, libcxl currently uses the
major/minor number obtained from fstat on the fd to construct path to
the afu's sysfs directory. However it is possible that rather than using
one of the device in /dev/cxl, a kernel driver creates its own device
which export generic cxl interface to the userspace. This causes
problems with libcxl as it tries to use a wrong major/minor number to
construct the sysfs path and fail.
So this patch introduces a new ioctl called CXL_IOCTL_GET_AFU_ID on the
afu file descriptor to fetch the cxl_afu_id struct that holds the
card/offset-id and mode information. These info is then used by libcxl to
construct the correct path to the afu sysfs directory.
Testing:
- Build against pseries be/le configs
- Testing with corresponding libcxl changes to verify that it constructs
right sysfs path to the afu.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When performing a dma_map_sg() call, the number of sg entries to map is
required. Using sg_nents to retrieve the number of sg entries will
return the total number of entries in the sg list up to the entry marked
as the end. If there happen to be unused entries in the list, these will
still be counted. Some dma_map_sg() implementations will not handle the
unused entries correctly (lib/swiotlb.c) and execute a BUG_ON.
The sg_nents_for_len() function will traverse the sg list and return the
number of entries required to satisfy the supplied length argument. This
can then be supplied to the dma_map_sg() call to successfully map the
sg.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
No new code should be using the return value of crypto_unregister_alg
as it will become void soon.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that type-safe init/exit functions exist, they often need
to access the underlying aead_instance. So this patch adds the
helper aead_alg_instance to access aead_instance from a crypto_aead
object.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As it stands the only non-type safe functions left in the new
AEAD interface are the cra_init/cra_exit functions. It means
exposing the ugly __crypto_aead_cast to every AEAD implementor.
This patch adds type-safe init/exit functions to AEAD. Existing
algorithms are unaffected while new implementations can simply
fill in these two instead of cra_init/cra_exit.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The patch updates the DocBook to cover the new AEAD interface
implementation.
Signed-off-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
By default all the sensors are runtime suspended state (lowest power
state). During Linux suspend process, all the run time suspended
devices are resumed and then suspended. This caused all sensors to
power up and introduced delay in suspend time, when we introduced
runtime PM for HID sensors. The opposite process happens during resume
process.
To fix this, we do powerup process of the sensors only when the request
is issued from user (raw or tiggerred). In this way when runtime,
resume calls for powerup it will simply return as this will not match
user requested state.
Note this is a regression fix as the increase in suspend / resume
times can be substantial (report of 8 seconds on Len's laptop!)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The naming convention is to always have the flags prefixed with
IEEE80211_HW_ so they're 'namespaced', make this flag follow it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are no drivers setting IEEE80211_HW_2GHZ_SHORT_SLOT_INCAPABLE
or IEEE80211_HW_2GHZ_SHORT_PREAMBLE_INCAPABLE, so any code using the
two flags is dead; it's also exceedingly unlikely that any new driver
could ever need to set these flags.
The wcn36xx code is almost certainly broken, but this preserves the
previous behaviour.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Clean up: Merge bc_send() into bc_svc_process().
Note: even thought this touches svc.c, it is a client-side change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This splits off the reservation of the memory occupied by the FDT
binary itself from the processing of the memory reservations it
contains. This is necessary because the physical address of the FDT,
which is needed to perform the reservation, may not be known to the
FDT driver core, i.e., it may be mapped outside the linear direct
mapping, in which case __pa() returns a bogus value.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Only include SCSI initiator header files in target code that needs
these header files, namely the SCSI pass-through code and the tcm_loop
driver. Change SCSI_SENSE_BUFFERSIZE into TRANSPORT_SENSE_BUFFER in
target code because the former is intended for initiator code and the
latter for target code. With this patch the only initiator include
directives in target code that remain are as follows:
$ git grep -nHE 'include .scsi/(scsi.h|scsi_host.h|scsi_device.h|scsi_cmnd.h)' drivers/target drivers/infiniband/ulp/{isert,srpt} drivers/usb/gadget/legacy/tcm_*.[ch] drivers/{vhost,xen} include/{target,trace/events/target.h}
drivers/target/loopback/tcm_loop.c:29:#include <scsi/scsi.h>
drivers/target/loopback/tcm_loop.c:31:#include <scsi/scsi_host.h>
drivers/target/loopback/tcm_loop.c:32:#include <scsi/scsi_device.h>
drivers/target/loopback/tcm_loop.c:33:#include <scsi/scsi_cmnd.h>
drivers/target/target_core_pscsi.c:39:#include <scsi/scsi_device.h>
drivers/target/target_core_pscsi.c:40:#include <scsi/scsi_host.h>
drivers/xen/xen-scsiback.c:52:#include <scsi/scsi_host.h> /* SG_ALL */
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Correct the comment above the definition of TCM_MAX_COMMAND_SIZE.
A quote from Christoph:
There aren't any legacy issues, we just decided to handle >
16 byte CDBs in the slow path.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
For the purpose of foreign inode detection, wb's (bdi_writeback's) are
identified by the associated memcg ID. As we create a separate wb for
each memcg, this is enough to identify the active wb's; however, when
blkcg is enabled or disabled higher up in the hierarchy, the mapping
between memcg and blkcg changes which in turn creates a new wb to
service the new mapping. The old wb is unlinked from index and
released after all references are drained. The foreign inode
detection logic can't detect this condition because both the old and
new wb's point to the same memcg and thus never decides to move inodes
attached to the old wb to the new one.
This patch adds logic to initiate switching immediately in
wbc_attach_and_unlock_inode() if the associated wb is dying. We can
make the usual foreign detection logic to distinguish the different
wb's mapped to the memcg but the dying wb is never gonna be in active
service again and there's no point in tracking the usage history and
reaching the switch verdict after enough data points are collected.
It's already known that the wb has to be switched.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
With the previous three patches, all operations which acquire wb from
inode are either under one of inode->i_lock, mapping->tree_lock or
wb->list_lock or protected by unlocked_inode_to_wb transaction. This
will be depended upon by foreign inode wb switching.
This patch adds lockdep assertion to inode_to_wb() so that usages
outside the above list locks can be caught easily. There are three
exceptions.
* locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
wb may not be the inode's. Ensuring that is the function's role
after all. Updated to deref inode->i_wb directly.
* inode_wb_stat_unlocked_begin() is usually protected by combination
of !I_WB_SWITCH and rcu_read_lock(). Updated to deref inode->i_wb
directly.
* inode_congested() wants to test whether inode->i_wb is set before
starting the transaction. Added inode_to_wb_is_valid() which tests
inode->i_wb directly.
v5: might_lock() removed. It annotates that the lock is grabbed w/
irq enabled which isn't the case and triggering lockdep warning
spuriously.
v4: might_lock() added to unlocked_inode_to_wb_begin().
v3: inode_congested() conversion added.
v2: locked_inode_to_wb_and_lock_list() was missing in the first
version.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch build the
framework for the actual switching.
This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a give inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode. While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).
To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it. To avoid
unnnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback
attempts.
The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO. The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier. Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.
Please see wb_detach_inode() for more details.
This patch only implements detection. Following patches will
implement actual switching.
v2: wbc_account_io() now checks whether the wbc is associated with a
wb before dereferencing it. This can happen when pageout() is
writing pages directly without going through the usual writeback
path. As pageout() path is single-threaded, we don't want it to
be blocked behind a slow cgroup and ultimately want it to delegate
actual writing to the usual writeback path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, for cgroup writeback, the IO submission paths directly
associate the bio's with the blkcg from inode_to_wb_blkcg_css();
however, it'd be necessary to keep more writeback context to implement
foreign inode writeback detection. wbc (writeback_control) is the
natural fit for the extra context - it persists throughout the
writeback of each inode and is passed all the way down to IO
submission paths.
This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
wbc_attach_fdatawrite_inode() which are used to associate wbc with the
inode being written back. IO submission paths now use wbc_init_bio()
instead of directly associating bio's with blkcg themselves. This
leaves inode_to_wb_blkcg_css() w/o any user. The function is removed.
wbc currently only tracks the associated wb (bdi_writeback). Future
patches will add more for foreign inode detection. The association is
established under i_lock which will be depended upon when migrating
foreign inodes to other wb's.
As currently, once established, inode to wb association never changes,
going through wbc when initializing bio's doesn't cause any behavior
changes.
v2: submit_blk_blkcg() now checks whether the wbc is associated with a
wb before dereferencing it. This can happen when pageout() is
writing pages directly without going through the usual writeback
path. As pageout() path is single-threaded, we don't want it to
be blocked behind a slow cgroup and ultimately want it to delegate
actual writing to the usual writeback path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, majority of cgroup writeback support including all the
above functions are implemented in include/linux/backing-dev.h and
mm/backing-dev.c; however, the portion closely related to writeback
logic implemented in include/linux/writeback.h and mm/page-writeback.c
will expand to support foreign writeback detection and correction.
This patch moves wb[_try]_get() and wb_put() to
include/linux/backing-dev-defs.h so that they can be used from
writeback.h and inode_{attach|detach}_wb() to writeback.h and
page-writeback.c.
This is pure reorganization and doesn't introduce any functional
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
While cgroup writeback support now connects memcg and blkcg so that
writeback IOs are properly attributed and controlled, the IO back
pressure propagation mechanism implemented in balance_dirty_pages()
and its subroutines wasn't aware of cgroup writeback.
Processes belonging to a memcg may have access to only subset of total
memory available in the system and not factoring this into dirty
throttling rendered it completely ineffective for processes under
memcg limits and memcg ended up building a separate ad-hoc degenerate
mechanism directly into vmscan code to limit page dirtying.
The previous patches updated balance_dirty_pages() and its subroutines
so that they can deal with multiple wb_domain's (writeback domains)
and defined per-memcg wb_domain. Processes belonging to a non-root
memcg are bound to two wb_domains, global wb_domain and memcg
wb_domain, and should be throttled according to IO pressures from both
domains. This patch updates dirty throttling code so that it repeats
similar calculations for the two domains - the differences between the
two are few and minor - and applies the lower of the two sets of
resulting constraints.
wb_over_bg_thresh(), which controls when background writeback
terminates, is also updated to consider both global and memcg
wb_domains. It returns true if dirty is over bg_thresh for either
domain.
This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them but the ad-hoc memcg throttling mechanism in
vmscan is still in place. The next patch will rip it out.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The amount of available memory to a memcg wb_domain can change as
memcg configuration changes. A domain's ->dirty_limit exists to
smooth out sudden drops in dirty threshold; however, when a domain's
size actually drops significantly, it hinders the dirty throttling
from adjusting to the new configuration leading to unexpected
behaviors including unnecessary OOM kills.
This patch resolves the issue by adding wb_domain_size_changed() which
resets ->dirty_limit[_tstmp] and making memcg call it on configuration
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportinately sized portion of the memory in the
whole system.
For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg. IOW, a wb will belong to two writeback domains - the global
and memcg domains.
The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain. memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.
The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
and rename it to wb_over_bg_thresh(). The function is closely tied to
the dirty throttling mechanism implemented in page-writeback.c. This
relocation will allow future updates necessary for cgroup writeback
support.
While at it, add function comment.
This is pure reorganization and doesn't introduce any behavioral
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch is a part of the series to define wb_domain which
represents a domain that wb's (bdi_writeback's) belong to and are
measured against each other in. This will enable IO backpressure
propagation for cgroup writeback.
global_dirty_limit exists to regulate the global dirty threshold which
is a property of the wb_domain. This patch moves hard_dirty_limit,
dirty_lock, and update_time into wb_domain.
This is pure reorganization and doesn't introduce any behavioral
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportinately sized portion of the memory in the
whole system.
For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg. IOW, a wb will belong to two writeback domains - the global
and memcg domains.
Currently, what constitutes the global writeback domain are scattered
across a number of global states. This patch starts collecting them
into struct wb_domain.
* fprop_global which serves as the basis for proportional bandwidth
measurement and its period timer are moved into struct wb_domain.
* global_wb_domain hosts the states for the global domain.
* While at it, flatten wb_writeout_fraction() into its callers. This
thin wrapper doesn't provide any actual benefits while getting in
the way.
This is pure reorganization and doesn't introduce any behavioral
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
__wb_update_bandwidth() is called from two places -
fs/fs-writeback.c::balance_dirty_pages() and
mm/page-writeback.c::wb_writeback(). The latter updates only the
write bandwidth while the former also deals with the dirty ratelimit.
The two callsites are distinguished by whether @thresh parameter is
zero or not, which is cryptic. In addition, the two files define
their own different versions of wb_update_bandwidth() on top of
__wb_update_bandwidth(), which is confusing to say the least. This
patch cleans up [__]wb_update_bandwidth() in the following ways.
* __wb_update_bandwidth() now takes explicit @update_ratelimit
parameter to gate dirty ratelimit handling.
* mm/page-writeback.c::wb_update_bandwidth() is flattened into its
caller - balance_dirty_pages().
* fs/fs-writeback.c::wb_update_bandwidth() is moved to
mm/page-writeback.c and __wb_update_bandwidth() is made static.
* While at it, add a lockdep assertion to __wb_update_bandwidth().
Except for the lockdep addition, this is pure reorganization and
doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The function name wb_dirty_limit(), its argument @dirty and the local
variable @wb_dirty are mortally confusing given that the function
calculates per-wb threshold value not dirty pages, especially given
that @dirty and @wb_dirty are used elsewhere for dirty pages.
Let's rename the function to wb_calc_thresh() and wb_dirty to
wb_thresh.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
[__]block_write_full_page() is used to implement ->writepage in
various filesystems. All writeback logic is now updated to handle
cgroup writeback and the block cgroup to issue IOs for is encoded in
writeback_control and can be retrieved from the inode; however,
[__]block_write_full_page() currently ignores the blkcg indicated by
inode and issues all bio's without explicit blkcg association.
This patch adds submit_bh_blkcg() which associates the bio with the
specified blkio cgroup before issuing and uses it in
__block_write_full_page() so that the issued bio's are associated with
inode_to_wb_blkcg_css(inode).
v2: Updated for per-inode wb association.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
handles s_umount locking and skips if writeback is already in
progress. The in progress test is performed on the root wb
(bdi_writeback) which isn't sufficient for cgroup writeback support.
The test must be done per-wb.
To prepare for the change, this patch factors out
__writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
@skip_if_busy and moves the in progress test right before queueing the
wb_writeback_work. try_writeback_inodes_sb_nr() now just grabs
s_umount and invokes __writeback_inodes_sb_nr() with asserted
@skip_if_busy. This way, later addition of multiple wb handling can
skip only the wb's which already have writeback in progress.
This swaps the order between in progress test and s_umount test which
can flip the return value when writeback is in progress and s_umount
is being held by someone else but this shouldn't cause any meaningful
difference. It's a fringe condition and the return value is an
unsynchronized hint anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the completion of a wb_writeback_work can be waited upon by setting
its ->done to a struct completion and waiting on it; however, for
cgroup writeback support, it's necessary to issue multiple work items
to multiple bdi_writebacks and wait for the completion of all.
This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it. It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().
Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_start_background_writeback() currently takes @bdi and kicks the
root wb (bdi_writeback). In preparation for cgroup writeback support,
make it take wb instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
writeback_in_progress() currently takes @bdi and returns whether
writeback is in progress on its root wb (bdi_writeback). In
preparation for cgroup writeback support, make it take wb instead.
While at it, make it an inline function.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_start_writeback() is a thin wrapper on top of
__wb_start_writeback() which is used only by laptop_mode_timer_fn().
This patches removes bdi_start_writeback(), renames
__wb_start_writeback() to wb_start_writeback() and makes
laptop_mode_timer_fn() use it instead.
This doesn't cause any functional difference and will ease making
laptop_mode_timer_fn() cgroup writeback aware.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
This will be used to implement bdi-wide operations which should be
distributed across all its cgroup bdi_writebacks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes. For cgroup writeback support, it
needs to take all active wb's into account. If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.
To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of the dirty state transition of each wb, the number of dirty wbs can
be counted in the bdi; however, bdi is already aggregating
wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
dip below 1. bdi_has_dirty_io() can simply test whether
bdi->tot_write_bandwidth is zero or not.
While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.
bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup writeback support needs to keep track of the sum of
avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
distribute write workload. This patch adds bdi->tot_write_bandwidth
and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
and wb_update_write_bandwidth() to adjust it as wb's gain and lose
dirty inodes and its avg_write_bandwidth gets updated.
As the update events are not synchronized with each other,
bdi->tot_write_bandwidth is an atomic_long_t.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>