This patch introduces rework to zram stats. We have per-stat sysfs nodes,
and it makes things a bit hard to use in user space: it doesn't give an
immediate stats 'snapshot', it requires user space to use more syscalls -
open, read, close for every stat file, with appropriate error checks on
every step, etc.
First, zram now accounts block layer statistics, available in
/sys/block/zram<id>/stat and /proc/diskstats files. So some new stats are
available (see Documentation/block/stat.txt), besides, zram's activities
now can be monitored by sysstat's iostat or similar tools.
Example:
cat /sys/block/zram0/stat
248 0 1984 0 251029 0 2008232 5120 0 5116 5116
Second, group currently exported on per-stat basis nodes into two
categories (files):
-- zram<id>/io_stat
accumulates device's IO stats, that are not accounted by block layer,
and contains:
failed_reads
failed_writes
invalid_io
notify_free
Example:
cat /sys/block/zram0/io_stat
0 0 0 652572
-- zram<id>/mm_stat
accumulates zram mm stats and contains:
orig_data_size
compr_data_size
mem_used_total
mem_limit
mem_used_max
zero_pages
num_migrated
Example:
cat /sys/block/zram0/mm_stat
434634752 270288572 279158784 0 579895296 15060 0
per-stat sysfs nodes are now considered to be deprecated and we plan to
remove them (and clean up some of the existing stat code) in two years (as
of now, there is no warning printed to syslog about deprecated stats being
used). User space is advised to use the above mentioned 3 files.
This patch (of 7):
Remove sysfs `num_migrated' attribute. We are moving away from per-stat
device attrs towards 3 stat files that will accumulate io and mm stats in
a format similar to block layer statistics in /sys/block/<dev>/stat. That
will be easier to use in user space, and reduce the number of syscalls
needed to read zram device statistics.
`num_migrated' will return back in zram<id>/mm_stat file.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull second vfs update from Al Viro:
"Now that net-next went in... Here's the next big chunk - killing
->aio_read() and ->aio_write().
There'll be one more pile today (direct_IO changes and
generic_write_checks() cleanups/fixes), but I'd prefer to keep that
one separate"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
->aio_read and ->aio_write removed
pcm: another weird API abuse
infinibad: weird APIs switched to ->write_iter()
kill do_sync_read/do_sync_write
fuse: use iov_iter_get_pages() for non-splice path
fuse: switch to ->read_iter/->write_iter
switch drivers/char/mem.c to ->read_iter/->write_iter
make new_sync_{read,write}() static
coredump: accept any write method
switch /dev/loop to vfs_iter_write()
serial2002: switch to __vfs_read/__vfs_write
ashmem: use __vfs_read()
export __vfs_read()
autofs: switch to __vfs_write()
new helper: __vfs_write()
switch hugetlbfs to ->read_iter()
coda: switch to ->read_iter/->write_iter
ncpfs: switch to ->read_iter/->write_iter
net/9p: remove (now-)unused helpers
p9_client_attach(): set fid->uid correctly
...
Originally Xen PV drivers only use single-page ring to pass along
information. This might limit the throughput between frontend and
backend.
The patch extends Xenbus driver to support multi-page ring, which in
general should improve throughput if ring is the bottleneck. Changes to
various frontend / backend to adapt to the new interface are also
included.
Affected Xen drivers:
* blkfront/back
* netfront/back
* pcifront/back
* scsifront/back
* vtpmfront
The interface is documented, as before, in xenbus_client.c.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
The current string_get_size() overflows when the device size goes over
2^64 bytes because the string helper routine computes the suffix from
the size in bytes. However, the entirety of Linux thinks in terms of
blocks, not bytes, so this will artificially induce an overflow on very
large devices. Fix this by making the function string_get_size() take
blocks and the block size instead of bytes. This should allow us to
keep working until the current SCSI standard overflows.
Also fix virtio_blk and mmc (both of which were also artificially
multiplying by the block size to pass a byte side to string_get_size()).
The mathematics of this is pretty simple: we're taking a product of
size in blocks (S) and block size (B) and trying to re-express this in
exponential form: S*B = R*N^E (where N, the exponent is either 1000 or
1024) and R < N. Mathematically, S = RS*N^ES and B=RB*N^EB, so if RS*RB
< N it's easy to see that S*B = RS*RB*N^(ES+EB). However, if RS*BS > N,
we can see that this can be re-expressed as RS*BS = R*N (where R =
RS*BS/N < N) so the whole exponent becomes R*N^(ES+EB+1)
[jejb: fix incorrect 32 bit do_div spotted by kbuild test robot <fengguang.wu@intel.com>]
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Konrad writes:
This pull has one fix and an cleanup.
Note that David Vrabel in the xen/tip.git tree has other changes for the
Xen block drivers that are related to his grant work - and they do not
conflict with this git pull.
This adds support for the extended metadata formats through the submit
IO ioctl, and simplifies the rest when using a separate metadata format.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Checking fails static analysis due to additional arithmetic prior to
the NULL check. Mapping doesn't return NULL here anyway, so removing
the check.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
class_create() returns ERR_PTR on failure,
so IS_ERR() should be used instead of check for NULL.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Define pr_fmt macro with {xen-blkback: } prefix, then remove all use
of DRV_PFX in the pr sentences. Replace all DPRINTK with pr sentences,
and get rid of DPRINTK macro. It will simplify the code.
And if the pr sentences miss a \n, add it in the end. If the DPRINTK
sentences have redundant \n, remove it. It will format the code.
These all make the readability of the code become better.
Signed-off-by: Tao Chen <boby.chen@huawei.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
The blkback name is like blkback.domid.xvd[a-z], if domid has four digits
(means larger than 1000), then the backmost xvd wouldn't be fully shown.
Define a BLKBACK_NAME_LEN macro to be 20, enlarge the array size of
blkback name, so it will be fully shown in any case.
Signed-off-by: Tao Chen <boby.chen@huawei.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
By returning the error code directly, we can avoid the jump label
error_out.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
dprintk has some name collisions with other frameworks and drivers. It
is also not necessary to have these custom debug print filters. Dynamic
debug offers the same amount of filtered debugging.
This patch replaces all dprintks with dev_dbg(). It also removes the
ioctl dprintk which prints the ingoing ioctls which should be
replaceable by strace or similar stuff.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block subsystem uses loff_t to store the device size. Change the
type for nbd_device bytesize to loff_t.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
kthread_run includes the wake_up_process() call, so instead of
kthread_create() followed by wake_up_process() we can use this macro.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The header is not included anywhere. Remove it and include the private
nbd_device struct in nbd.c.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Usually the admin queue depth of 64 is plenty, but for some use cases we
really need it larger. Examples are use cases like MAT, where you have
to touch all of NAND for init/format like purposes. In those cases, we
see a good 2x increase with an increased queue depth.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Keith Busch <keith.busch@intel.com>
PRP list calculation is supposed to be based on device's page size.
Systems with page size larger than device's page size cause corruption
to the name space as well as system memory with out this fix.
Systems like x86 might not experience this issue because it uses
PAGE_SIZE of 4K where as powerpc uses PAGE_SIZE of 64k while NVMe device's
page size varies depending upon the vendor.
Signed-off-by: Murali Iyer <mniyer@us.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver may issue commands to a device that may never return, so its
request_queue could always have active requests while the controller is
running. Waiting for the queue to freeze could block forever, which is
what blk-mq's hot cpu notification handler was doing when nvme drives
were in use.
This has the nvme driver make the asynchronous event command's tag
reserved and does not keep the request active. We can't have more than
one since the request is released back to the request_queue before the
command is completed. Having only one avoids potential tag collisions,
and reserving the tag for this purpose prevents other admin tasks from
reusing the tag.
I also couldn't think of a scenario where issuing AEN requests single
depth is worse than issuing them in batches, so I don't think we lose
anything with this change.
As an added bonus, doing it this way removes "Cancelling I/O" warnings
observed when unbinding the nvme driver from a device.
Reported-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This fixes a race accessing an invalid address when a controller's admin
queue is in use during a reset for failure or hot removal occurs. The
admin queue will be frozen to prevent new users from entering prior to
the doorbell queue being unmapped.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
mempool_alloc() does not support __GFP_ZERO since elements may come from
memory that has already been released by mempool_free().
Remove __GFP_ZERO from mempool_alloc() in drbd_req_new() and properly
initialize it to 0.
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
ppc has special instruction forms to efficiently load and store values
in non-native endianness. These can be accessed via the arch-specific
{ld,st}_le{16,32}() inlines in arch/powerpc/include/asm/swab.h.
However, gcc is perfectly capable of generating the byte-reversing
load/store instructions when using the normal, generic cpu_to_le*() and
le*_to_cpu() functions eaning the arch-specific functions don't have much
point.
Worse the "le" in the names of the arch specific functions is now
misleading, because they always generate byte-reversing forms, but some
ppc machines can now run a little-endian kernel.
To start getting rid of the arch-specific forms, this patch removes them
from all the old Power Macintosh drivers, replacing them with the
generic byteswappers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull block layer fixes from Jens Axboe:
"Two smaller fixes for this cycle:
- A fixup from Keith so that NVMe compiles without BLK_INTEGRITY,
basically just moving the code around appropriately.
- A fixup for shm, fixing an oops in shmem_mapping() for mapping with
no inode. From Sasha"
[ The shmem fix doesn't look block-layer-related, but fixes a bug that
happened due to the backing_dev_info removal.. - Linus ]
* 'for-linus' of git://git.kernel.dk/linux-block:
mm: shmem: check for mapping owner before dereferencing
NVMe: Fix for BLK_DEV_INTEGRITY not set
Need to define and use appropriate functions for when BLK_DEV_INTEGRITY
is not set.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This makes all sync commands uninterruptible and schedules without timeout
so the controller either has to post a completion or the timeout recovery
fails the command. This fixes potential memory or data corruption from
a command timing out too early or woken by a signal. Previously any DMA
buffers mapped for that command would have been released even though we
don't know what the controller is planning to do with those addresses.
Signed-off-by: Keith Busch <keith.busch@intel.com>
We don't track queues in a llist, subscribe to hot-cpu notifications,
or internally retry commands. Delete the unused artifacts.
Signed-off-by: Keith Busch <keith.busch@intel.com>
The driver has to end unreturned commands at some point even if the
controller has not provided a completion. The driver tried to be safe by
deleting IO queues prior to ending all unreturned commands. That should
cause the controller to internally abort inflight commands, but IO queue
deletion request does not have to be successful, so all bets are off. We
still have to make progress, so to be extra safe, this patch doesn't
clear a queue to release the dma mapping for a command until after the
pci device has been disabled.
This patch removes the special handling during device initialization
so controller recovery can be done all the time. This is possible since
initialization is not inlined with pci probe anymore.
Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
This performs the longest parts of nvme device probe in scheduled work.
This speeds up probe significantly when multiple devices are in use.
Signed-off-by: Keith Busch <keith.busch@intel.com>
This creates a new class type for nvme devices to register their
management character devices with. This is so we do not rely on miscdev
to provide enough minors for as many nvme devices some people plan to
use. The previous limit was approximately 60 NVMe controllers, depending
on the platform and kernel. Now the limit is 1M, which ought to be enough
for anybody.
Since we have a new device class, it makes sense to attach the block
devices under this as well, so part of this patch moves the management
handle initialization prior to the namespaces discovery.
Signed-off-by: Keith Busch <keith.busch@intel.com>
The original translation created collisions on Inquiry VPD 83 for many
existing devices. Newer specifications provide other ways to translate
based on the device's version can be used to create unique identifiers.
Version 1.1 provides an EUI64 field that uniquely identifies each
namespace, and 1.2 added the longer NGUID field for the same reason.
Both follow the IEEE EUI format and readily translate to the SCSI device
identification EUI designator type 2h. For devices implementing either,
the translation will use this type, defaulting to the EUI64 8-byte type if
implemented then NGUID's 16 byte version if not. If neither are provided,
the 1.0 translation is used, and is updated to use the SCSI String format
to guarantee a unique identifier.
Knowing when to use the new fields depends on the nvme controller's
revision. The NVME_VS macro was not decoding this correctly, so that is
fixed in this patch and moved to a more appropriate place.
Since the Identify Namespace structure required an update for the NGUID
field, this patch adds the remaining new 1.2 fields to the structure.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Adds support for NVMe metadata formats and exposes block devices for
all namespaces regardless of their format. Namespace formats that are
unusable will have disk capacity set to 0, but a handle to the block
device is created to simplify device management. A namespace is not
usable when the format requires host interleave block and metadata in
single buffer, has no provisioned storage, or has better data but failed
to register with blk integrity.
The namespace has to be scanned in two phases to support separate
metadata formats. The first establishes the sector size and capacity
prior to invoking add_disk. If metadata is required, the capacity will
be temporarilly set to 0 until it can be revalidated and registered with
the integrity extenstions after add_disk completes.
The driver relies on the integrity extensions to provide the metadata
buffer. NVMe requires this be a single physically contiguous region,
so only one integrity segment is allowed per command. If the metadata
is used for T10 PI, the driver provides mappings to save and restore
the reftag physical block translation. The driver provides no-op
functions for generate and verify if metadata is not used for protection
information. This way the setup is always provided by the block layer.
If a request does not supply a required metadata buffer, the command
is failed with bad address. This could only happen if a user manually
disables verify/generate on such a disk. The only exception to where
this is okay is if the controller is capable of stripping/generating
the metadata, which is possible on some types of formats.
The metadata scatter gather list now occupies the spot in the nvme_iod
that used to be used to link retryable IOD's, but we don't do that
anymore, so the field was unused.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Pull Ceph changes from Sage Weil:
"On the RBD side, there is a conversion to blk-mq from Christoph,
several long-standing bug fixes from Ilya, and some cleanup from
Rickard Strandqvist.
On the CephFS side there is a long list of fixes from Zheng, including
improved session handling, a few IO path fixes, some dcache management
correctness fixes, and several blocking while !TASK_RUNNING fixes.
The core code gets a few cleanups and Chaitanya has added support for
TCP_NODELAY (which has been used on the server side for ages but we
somehow missed on the kernel client).
There is also an update to MAINTAINERS to fix up some email addresses
and reflect that Ilya and Zheng are doing most of the maintenance for
RBD and CephFS these days. Do not be surprised to see a pull request
come from one of them in the future if I am unavailable for some
reason"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
MAINTAINERS: update Ceph and RBD maintainers
libceph: kfree() in put_osd() shouldn't depend on authorizer
libceph: fix double __remove_osd() problem
rbd: convert to blk-mq
ceph: return error for traceless reply race
ceph: fix dentry leaks
ceph: re-send requests when MDS enters reconnecting stage
ceph: show nocephx_require_signatures and notcp_nodelay options
libceph: tcp_nodelay support
rbd: do not treat standalone as flatten
ceph: fix atomic_open snapdir
ceph: properly mark empty directory as complete
client: include kernel version in client metadata
ceph: provide seperate {inode,file}_operations for snapdir
ceph: fix request time stamp encoding
ceph: fix reading inline data when i_size > PAGE_SIZE
ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
rbd: fix error paths in rbd_dev_refresh()
...