linux

mirror of https://github.com/torvalds/linux.git synced 2026-04-29 03:52:30 -04:00

Author	SHA1	Message	Date
Trond Myklebust	6edf96097b	NFS: Don't mark the data cache as invalid if it has been flushed Now that we have functions such as nfs_write_pageuptodate() that use the cache_validity flags to check if the data cache is valid or not, it is a little more important to keep the flags in sync with the state of the data cache. In particular, we'd like to ensure that if the data cache is empty, we don't start marking it as needing revalidation. Reported-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-06-24 18:46:57 -04:00
Trond Myklebust	f2467b6f64	NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size In nfs_update_inode(), if the change attribute is seen to change on the server, then we set NFS_INO_REVAL_PAGECACHE in order to make sure that we check the file size. However, if we also update the file size in the same function, we don't need to check it again. So make sure that we clear the NFS_INO_REVAL_PAGECACHE that was set earlier. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-06-24 18:46:57 -04:00
Scott Mayhew	18dd78c427	nfs: Fix cache_validity check in nfs_write_pageuptodate() NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation. We're still having some problems with data corruption when multiple clients are appending to a file and those clients are being granted write delegations on open. To reproduce: Client A: vi /mnt/`hostname -s` while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done Client B: vi /mnt/`hostname -s` while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done What's happening is that in nfs_update_inode() we're recognizing that the file size has changed and we're setting NFS_INO_INVALID_DATA accordingly, but then we ignore the cache_validity flags in nfs_write_pageuptodate() because we have a delegation. As a result, in nfs_updatepage() we're extending the write to cover the full page even though we've not read in the data to begin with. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: <stable@vger.kernel.org> # v3.11+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-06-24 18:46:56 -04:00
Oleg Nesterov	855ef0dec7	aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() ioctx_add_table() is the writer, it does not need rcu_read_lock() to protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just add the confusion. And it doesn't need rcu_dereference() by the same reason, it must see any updates previously done under the same ->ioctx_lock. We could use rcu_dereference_protected() but the patch uses rcu_dereference_raw(), the function is simple enough. The same for kill_ioctx(), although it does not update the pointer. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2014-06-24 18:10:25 -04:00
Oleg Nesterov	4b70ac5fd9	aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() On 04/30, Benjamin LaHaise wrote: > > > - ctx->mmap_size = 0; > > - > > - kill_ioctx(mm, ctx, NULL); > > + if (ctx) { > > + ctx->mmap_size = 0; > > + kill_ioctx(mm, ctx, NULL); > > + } > > Rather than indenting and moving the two lines changing mmap_size and the > kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces > the number of lines changed and avoid excessive indentation. OK. To me the code looks better/simpler with "if (ctx)", but this is subjective of course, I won't argue. The patch still removes the empty line between mmap_size = 0 and kill_ioctx(), we reset mmap_size only for kill_ioctx(). But feel free to remove this change. ------------------------------------------------------------------------------- Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() 1. We can read ->ioctx_table only once and we do not read rcu_read_lock() or even rcu_dereference(). This mm has no users, nobody else can play with ->ioctx_table. Otherwise the code is buggy anyway, if we need rcu_read_lock() in a loop because ->ioctx_table can be updated then kfree(table) is obviously wrong. 2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid munmap(), but another reason is that we simply can't do vm_munmap() unless current->mm == mm and this is not true in general, the caller is mmput(). 3. We do not really need to nullify mm->ioctx_table before return, probably the current code does this to catch the potential problems. But in this case RCU_INIT_POINTER(NULL) looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2014-06-24 18:10:24 -04:00
Benjamin LaHaise	edfbbf388f	aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10 by commit `a31ad380be`. The changes made to aio_read_events_ring() failed to correctly limit the index into ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of an arbitrary page with a copy_to_user() to copy the contents into userspace. This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and Petr for disclosing this issue. This patch applies to v3.12+. A separate backport is needed for 3.10/3.11. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: stable@vger.kernel.org	2014-06-24 13:46:01 -04:00
Benjamin LaHaise	f8567a3845	aio: fix aio request leak when events are reaped by userspace The aio cleanups and optimizations by kmo that were merged into the 3.10 tree added a regression for userspace event reaping. Specifically, the reference counts are not decremented if the event is reaped in userspace, leading to the application being unable to submit further aio requests. This patch applies to 3.12+. A separate backport is required for 3.10/3.11. This issue was uncovered as part of CVE-2014-0206. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com>	2014-06-24 13:32:27 -04:00
Steve French	ce36d9ab3b	[CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ? via the Unicode Windows to POSIX remap range) empty paths (eg when we open "" to query the root of the SMB3 directory on mount) were not null terminated so we sent garbarge as a path name on empty paths which caused SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified. mapchars is particularly important since Unix Extensions for SMB3 are not supported (yet) Signed-off-by: Steve French <smfrench@gmail.com> Cc: <stable@vger.kernel.org> Reviewed-by: David Disseldorp <ddiss@suse.de>	2014-06-24 08:10:24 -05:00
Xue jiufei	ac4fef4d23	ocfs2/dlm: do not purge lockres that is queued for assert master When workqueue is delayed, it may occur that a lockres is purged while it is still queued for master assert. it may trigger BUG() as follows. N1 N2 dlm_get_lockres() ->dlm_do_master_requery is the master of lockres, so queue assert_master work dlm_thread() start running and purge the lockres dlm_assert_master_worker() send assert master message to other nodes receiving the assert_master message, set master to N2 dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID, if it is RECOVERY lockres, it triggers the BUG(). Another BUG() is triggered when N3 become the new master and send assert_master to N1, N1 will trigger the BUG() because owner doesn't match. So we should not purge lockres when it is queued for assert master. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
jiangyiwen	b9aaac5a6b	ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during umount The following case may lead to endless loop during umount. node A node B node C node D umount volume, migrate lockres1 to B want to lock lockres1, send MASTER_REQUEST_MSG to C init block mle send MIGRATE_REQUEST_MSG to C find a block mle, and then return DLM_MIGRATE_RESPONSE_MASTERY_REF to B set C in refmap umount successfully try to umount, endless loop occurs when migrate lockres1 since C is in refmap So we can fix this endless loop case by only returning DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving MIGRATE_REQUEST_MSG. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
jiangyiwen	595297a8f9	ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod When the call to ocfs2_add_entry() failed in ocfs2_symlink() and ocfs2_mknod(), iput() will not be called during dput(dentry) because no d_instantiate(), and this will lead to umount hung. Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Yiwen Jiang	f7a14f32e7	ocfs2: fix a tiny race when running dirop_fileop_racer When running dirop_fileop_racer we found a dead lock case. 2 nodes, say Node A and Node B, mount the same ocfs2 volume. Create /race/16/1 in the filesystem, and let the inode number of dir 16 is less than the inode number of dir race. Node A Node B mv /race/16/1 /race/ right after Node A has got the EX mode of /race/16/, and tries to get EX mode of /race ls /race/16/ In this case, Node A has got the EX mode of /race/16/, and wants to get EX mode of /race/. Node B has got the PR mode of /race/, and wants to get the PR mode of /race/16/. Since EX and PR are mutually exclusive, dead lock happens. This patch fixes this case by locking in ancestor order before trying inode number order. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Xue jiufei	a270c6d3c0	ocfs2/dlm: fix misuse of list_move_tail() in dlm_run_purge_list() When a lockres in purge list but is still in use, it should be moved to the tail of purge list. dlm_thread will continue to check next lockres in purge list. However, code list_move_tail(&dlm->purge_list, &lockres->purge) will do no movements, so dlm_thread will purge the same lockres in this loop again and again. If it is in use for a long time, other lockres will not be processed. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Wengang Wang	8a8ad1c2f6	ocfs2: refcount: take rw_lock in ocfs2_reflink This patch tries to fix this crash: #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de [exception RIP: ocfs2_direct_IO_get_blocks+359] RIP: ffffffffa05dfa27 RSP: ffff88003c1cd7e8 RFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88003c1cdaa8 RCX: 0000000000000000 RDX: 000000000000000c RSI: ffff880027a95000 RDI: ffff88003c79b540 RBP: ffff88003c1cd858 R8: 0000000000000000 R9: ffffffff815f6ba0 R10: 00000000000001c9 R11: 00000000000001c9 R12: ffff88002d271500 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000001000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764 #10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc #11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2] #12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935 #13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2] #14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c #15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4 #16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880 #17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238 #18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437 #19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530 #20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159 It crashes at BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED)); in ocfs2_direct_IO_get_blocks. ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken during the time before (or inside) ocfs2_prepare_inode_for_write() and after ocfs2_direct_IO_get_blocks(). It can happen in this case: Node A(which crashes) Node B ------------------------ --------------------------- ocfs2_file_aio_write ocfs2_prepare_inode_for_write ocfs2_inode_lock ... ocfs2_inode_unlock #no refcount found .... ocfs2_reflink ocfs2_inode_lock ... ocfs2_inode_unlock #now, refcount flag set on extent ... flush change to disk ocfs2_direct_IO_get_blocks ocfs2_get_clusters #extent map miss #buffer_head miss read extents from disk found refcount flag on extent crash.. Fix: Take rw_lock in ocfs2_reflink path Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Xue jiufei	b253bfd878	ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously" `75f82eaa50` ("ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously") may cause umount hang while shutting down truncate log. The situation is as followes: ocfs2_dismout_volume -> ocfs2_recovery_exit -> free osb->recovery_map -> ocfs2_truncate_shutdown -> lock global bitmap inode -> ocfs2_wait_for_recovery -> check whether osb->recovery_map->rm_used is zero Because osb->recovery_map is already freed, rm_used can be any other values, so it may yield umount hang. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Tariq Saeed	27bf6305cf	ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletimeout closes conn Orabug: 18639535 Two node cluster and both nodes hold a lock at PR level and both want to convert to EX at the same time. Master node 1 has sent BAST and then closes the connection due to idletime out. Node 0 receives BAST, sends unlock req with cancel flag but gets error -ENOTCONN. The problem is this error is ignored in dlm_send_remote_unlock_request() on the incorrect assumption that the master is dead. See NOTE in comment why it returns DLM_NORMAL. Upon getting DLM_NORMAL, node 0 proceeds to sends convert (without cancel flg) which fails with -ENOTCONN. waits 5 sec and resends. This time gets DLM_IVLOCKID from the master since lock not found in grant, it had been moved to converting queue in response to conv PR->EX req. No way out. Node 1 (master) Node 0 ============== ====== lock mode PR PR convert PR -> EX mv grant -> convert and que BAST ... <-------- convert PR -> EX convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX)) ... BAST (want PR -> NL) ------------------> ... idle timout, conn closed ... In response to BAST, sends unlock with cancel convert flag gets -ENOTCONN. Ignores and sends remote convert request gets -ENOTCONN, waits 5 Sec, retries ... reconnects <----------------- convert req goes through on next try does not find lock on grant que status DLM_IVLOCKID ------------------> ... No way out. Fix is to keep retrying unlock with cancel flag until it succeeds or the master dies. Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
alex chen	5fb1beb069	ocfs2: should add inode into orphan dir after updating entry in ocfs2_rename() There are two files a and b in dir /mnt/ocfs2. node A node B mv a b In ocfs2_rename(), after calling ocfs2_orphan_add(), the inode of file b will be added into orphan dir. If ocfs2_update_entry() fails, ocfs2_rename return error and mv operation fails. But file b still exists in the parent dir. ocfs2_queue_orphan_scan -> ocfs2_queue_recovery_completion -> ocfs2_complete_recovery -> ocfs2_recover_orphans The inode of the file b will be put with iput(). ocfs2_evict_inode -> ocfs2_delete_inode -> ocfs2_wipe_inode -> ocfs2_remove_inode OCFS2_VALID_FL in the inode i_flags will be cleared. The file b still can be accessed on node B. ls /mnt/ocfs2 When first read the file b with ocfs2_read_inode_block(). It will validate the inode using ocfs2_validate_inode_block(). Because OCFS2_VALID_FL not set in the inode i_flags, so the file system will be readonly. So we should add inode into orphan dir after updating entry in ocfs2_rename(). Signed-off-by: alex.chen <alex.chen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-23 16:47:45 -07:00
Jeff Layton	d4c8e34fe8	nfsd: properly handle embedded newlines in fault_injection input Currently rpc_pton() fails to handle the case where you echo an address into the file, as it barfs on the newline. Ensure that we NULL out the first occurrence of any newline. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:38 -04:00
Jeff Layton	f7ce5d2842	nfsd: fix return of nfs4_acl_write_who AFAICT, the only way to hit this error is to pass this function a bogus "who" value. In that case, we probably don't want to return -1 as that could get sent back to the client. Turn this into nfserr_serverfault, which is a more appropriate error for a server bug like this. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:38 -04:00
Jeff Layton	94ec938b61	nfsd: add appropriate __force directives to filehandle generation code The filehandle structs all use host-endian values, but will sometimes stuff big-endian values into those fields. This is OK since these values are opaque to the client, but it confuses sparse. Add __force to make it clear that we are doing this intentionally. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:37 -04:00
Jeff Layton	e2afc81919	nfsd: nfsd_splice_read and nfsd_readv should return __be32 The callers expect a __be32 return and the functions they call return __be32, so having these return int is just wrong. Also, nfsd_finish_read can be made static. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:37 -04:00
Jeff Layton	b3d8d1284a	nfsd: clean up sparse endianness warnings in nfscache.c We currently hash the XID to determine a hash bucket to use for the reply cache entry, which is fed into hash_32 without byte-swapping it. Add __force to make sparse happy, and add some comments to explain why. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:37 -04:00
Jeff Layton	f419992c1f	nfsd: add __force to opaque verifier field casts sparse complains that we're stuffing non-byte-swapped values into __be32's here. Since they're supposed to be opaque, it doesn't matter much. Just add __force to make sparse happy. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:37 -04:00
Kinglong Mee	bf18f163e8	NFSD: Using exp_get for export getting Don't using cache_get besides export.h, using exp_get for export. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:36 -04:00
Kinglong Mee	0da22a919d	NFSD: Using path_get when assigning path for export Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:36 -04:00
Kinglong Mee	f15a5cf912	SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_ok rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:36 -04:00
Kinglong Mee	3c7aa15d20	NFSD: Using min/max/min_t/max_t for calculate Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:36 -04:00
Jaegeuk Kim	98397ff3cd	f2fs: fix not to allocate unnecessary blocks during fallocate This patch fixes the fallocate bug like below. (See xfstests/255) In fallocate(fd, 0, 20480), expand_inode_data processes for (index = pg_start; index <= pg_end; index++) { f2fs_reserve_block(); ... } So, even though fallocate requests 20480, 5 blocks, f2fs allocates 6 blocks including pg_end. So, this patch adds one condition to avoid block allocation. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-06-23 10:05:08 +09:00
Jaegeuk Kim	ead432756a	f2fs: recover fallocated data and its i_size together This patch arranges the f2fs_locks to cover the fallocated data and its i_size. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-06-23 10:05:08 +09:00
Jaegeuk Kim	ccfb30001f	f2fs: fix to report newly allocate region as extent Previous get_block in f2fs didn't report the newly allocated region which has NEW_ADDR. For reader, it should not report, but fiemap needs this. So, this patch introduces two get_block sharing core function. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2014-06-23 10:05:07 +09:00
Eric Sandeen	b474c7ae43	xfs: Nuke XFS_ERROR macro XFS_ERROR was designed long ago to trap return values, but it's not runtime configurable, it's not consistently used, and we can do similar error trapping with ftrace scripts and triggers from userspace. Just nuke XFS_ERROR and associated bits. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-06-22 15:04:54 +10:00
Eric Sandeen	d99831ff39	xfs: return is not a function return is not a function. "return(EIO);" is silly; "return (EIO);" moreso. return is not a function. Nuke the pointless parens. [dchinner: catch a couple of extra cases in xfs_attr_list.c, xfs_acl.c and xfs_linux.h.] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2014-06-22 15:03:54 +10:00
Linus Torvalds	2dfded8210	Merge tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux Pull file locking fixes from Jeff Layton: "File locking related bugfixes Nothing too earth-shattering here. A fix for a potential regression due to a patch in pile #1, and the addition of a memory barrier to prevent a race condition between break_deleg and generic_add_lease" * tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux: locks: set fl_owner for leases back to current->files locks: add missing memory barrier in break_deleg	2014-06-21 16:40:30 -10:00
Linus Torvalds	e13d100beb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This fixes some lockups in btrfs reported with rc1. It probably has some performance impact because it is backing off our spinning locks more often and switching to a blocking lock. I'll be able to nail that down next week, but for now I want to get the lockups taken care of. Otherwise some more stack reduction and assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix wrong error handle when the device is missing or is not writeable Btrfs: fix deadlock when mounting a degraded fs Btrfs: use bio_endio_nodec instead of open code Btrfs: fix NULL pointer crash when running balance and scrub concurrently btrfs: Skip scrubbing removed chunks to avoid -ENOENT. Btrfs: fix broken free space cache after the system crashed Btrfs: make free space cache write out functions more readable Btrfs: remove unused wait queue in struct extent_buffer Btrfs: fix deadlocks with trylock on tree nodes	2014-06-21 14:21:43 -10:00
Linus Torvalds	147f1404db	Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux Pull nfsd bugfixes from Bruce Fields: "Fixes for a new regression from the xdr encoding rewrite, and a delegation problem we've had for a while (made somewhat more annoying by the vfs delegation support added in 3.13)" * 'for-3.16' of git://linux-nfs.org/~bfields/linux: NFSD: fix bug for readdir of pseudofs NFSD: Don't hand out delegations for 30 seconds after recalling them.	2014-06-21 14:20:38 -10:00
Miao Xie	8408c716d7	Btrfs: fix wrong error handle when the device is missing or is not writeable The original bio might be submitted, so we shoud increase bi_remaining to account for it when we deal with the error that the device is missing or is not writeable, or we would skip the endio handle. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:56 -07:00
Miao Xie	c55f139640	Btrfs: fix deadlock when mounting a degraded fs The deadlock happened when we mount degraded filesystem, the reproduced steps are following: # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1> # echo 1 > /sys/block/`basename <dev0>`/device/delete # mount -o degraded <dev1> <mnt> The reason was that the counter -- bi_remaining was wrong. If the missing or unwriteable device was the last device in the mapping array, we would not submit the original bio, so we shouldn't increase bi_remaining of it in btrfs_end_bio(), or we would skip the final endio handle. Fix this problem by adding a flag into btrfs bio structure. If we submit the original bio, we will set the flag, and we increase bi_remaining counter, or we don't. Though there is another way to fix it -- decrease bi_remaining counter of the original bio when we make sure the original bio is not submitted, this method need add more check and is easy to make mistake. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:56 -07:00
Miao Xie	e990f16763	Btrfs: use bio_endio_nodec instead of open code Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:55 -07:00
Wang Shilong	298a8f9cf1	Btrfs: fix NULL pointer crash when running balance and scrub concurrently While running balance, scrub, fsstress concurrently we hit the following kernel crash: [56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132 [56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 [56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0 [56561.524325] Oops: 0000 [#1] SMP [....] [56561.527237] Call Trace: [56561.527309] [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs] [56561.527392] [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0 [56561.527476] [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs] [56561.527561] [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs] [56561.527639] [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0 [56561.527712] [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60 [56561.527788] [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0 [56561.527870] [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0 [56561.527941] [<ffffffff815707f7>] tracesys+0xdd/0xe2 [...] [56561.528304] RIP [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs] [56561.528395] RSP <ffff88004c0f5be8> [56561.528454] CR2: 0000000000000078 This is because in btrfs_relocate_chunk(), we will free @bdev directly while scrub may still hold extent mapping, and may access freed memory. Fix this problem by wrapping freeing @bdev work into free_extent_map() which is based on reference count. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:55 -07:00
Qu Wenruo	ced96edc48	btrfs: Skip scrubbing removed chunks to avoid -ENOENT. When run scrub with balance, sometimes -ENOENT will be returned, since in scrub_enumerate_chunks() will search dev_extent in COMMIT_ROOT, but btrfs_lookup_block_group() will search block group in MEMORY, so if a chunk is removed but not committed, -ENOENT will be returned. However, there is no need to stop scrubbing since other chunks may be scrubbed without problem. So this patch changes the behavior to skip removed chunks and continue to scrub the rest. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:54 -07:00
Miao Xie	e570fd27f2	Btrfs: fix broken free space cache after the system crashed When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group xxxx has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group xxx It is because we didn't update the metadata of the allocated space (in extent tree) until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time (commit transaction), those spaces were lost. In fact, only the free space that is used to store the file data had this problem, the others didn't because the metadata of them is updated in the same transaction context. There are many methods which can fix the above problem - track the allocated space, and write it out when we write out the free space cache - account the size of the allocated space that is used to store the file data, if the size is not zero, don't write out the free space cache. The first one is complex and may make the performance drop down. This patch chose the second method, we use a per-block-group variant to account the size of that allocated space. Besides that, we also introduce a per-block-group read-write semaphore to avoid the race between the allocation and the free space cache write out. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:54 -07:00
Miao Xie	5349d6c3ff	Btrfs: make free space cache write out functions more readable This patch makes the free space cache write out functions more readable, and beisdes that, it also reduces the stack space that the function -- __btrfs_write_out_cache uses from 194bytes to 144bytes. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:54 -07:00
Filipe Manana	46fefe41b5	Btrfs: remove unused wait queue in struct extent_buffer The lock_wq wait queue is not used anywhere, therefore just remove it. On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320 bytes down to 296 bytes, which means a 4Kb page can now be used for 13 extent buffers instead of 12. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:20:28 -07:00
Chris Mason	ea4ebde02e	Btrfs: fix deadlocks with trylock on tree nodes The Btrfs tree trylock function is poorly named. It always takes the spinlock and backs off if the blocking lock is held. This can lead to surprising lockups because people expect it to really be a trylock. This commit makes it a pure trylock, both for the spinlock and the blocking lock. It also reworks the nested lock handling slightly to avoid taking the read lock while a spinning write lock might be held. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-19 14:19:55 -07:00
Jeff Layton	08bc03539d	cifs: revalidate mapping prior to satisfying read_iter request with cache=loose Before satisfying a read with cache=loose, we should always check that the pagecache is valid before allowing a read to be satisfied out of it. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Steve French <smfrench@gmail.com>	2014-06-19 13:34:04 -05:00
Paul Bolle	fe786f61f3	befs: remove check for CONFIG_BEFS_RW Befs contains a check for CONFIG_BEFS_RW for over a decade now. The related Kconfig symbol never existed, so this check always evaluated to true. Remove it. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-06-19 15:28:14 +02:00
Kinglong Mee	f41c5ad2ff	NFSD: fix bug for readdir of pseudofs Commit `561f0ed498` (nfsd4: allow large readdirs) introduces a bug about readdir the root of pseudofs. Call xdr_truncate_encode() revert encoded name when skipping. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-17 16:42:48 -04:00
NeilBrown	6282cd5655	NFSD: Don't hand out delegations for 30 seconds after recalling them. If nfsd needs to recall a delegation for some reason it implies that there is contention on the file, so further delegations should not be handed out. The current code fails to do so, and the result is effectively a live-lock under some workloads: a client attempting a conflicting operation on a read-delegated file receives NFS4ERR_DELAY and retries the operation, but by the time it retries the server may already have given out another delegation. We could simply avoid delegations for (say) 30 seconds after any recall, but this is probably too heavy handed. We could keep a list of inodes (or inode numbers or filehandles) for recalled delegations, but that requires memory allocation and searching. The approach taken here is to use a bloom filter to record the filehandles which are currently blocked from delegation, and to accept the cost of a few false positives. We have 2 bloom filters, each of which is valid for 30 seconds. When a delegation is recalled the filehandle is added to one filter and will remain disabled for between 30 and 60 seconds. We keep a count of the number of filehandles that have been added, so when that count is zero we can bypass all other tests. The bloom filters have 256 bits and 3 hash functions. This should allow a couple of dozen blocked filehandles with minimal false positives. If many more filehandles are all blocked at once, behaviour will degrade towards rejecting all delegations for between 30 and 60 seconds, then resetting and allowing new delegations. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-17 16:42:47 -04:00
Konstantin Khlebnikov	ebe06187bf	epoll: fix use-after-free in eventpoll_release_file This fixes use-after-free of epi->fllink.next inside list loop macro. This loop actually releases elements in the body. The list is rcu-protected but here we cannot hold rcu_read_lock because we need to lock mutex inside. The obvious solution is to use list_for_each_entry_safe(). RCU-ness isn't essential because nobody can change this list under us, it's final fput for this file. The bug was introduced by `ae10b2b4eb` ("epoll: optimize EPOLL_CTL_DEL using rcu") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Stable <stable@vger.kernel.org> # 3.13+ Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-16 17:21:59 -10:00
Björn Baumbach	a1d0b84c30	fs/cifs: fix regression in cifs_create_mf_symlink() commit `d81b8a40e2` ("CIFS: Cleanup cifs open codepath") changed disposition to FILE_OPEN. Signed-off-by: Björn Baumbach <bb@sernet.de> Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Pavel Shilovsky <piastry@etersoft.ru> Cc: Steve French <sfrench@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2014-06-16 13:50:11 -05:00

... 103 104 105 106 107 ...

41981 Commits