Pull yet more bcachefs fixes from Kent Overstreet:
"This gets recovery working again for the affected user I've been
working with, and I'm still waiting to hear back on other bug reports
but should fix it for everyone else who's been having issues with
recovery.
- Various recovery fixes:
- fixes for the btree_insert_entry being resized on path
allocation btree_path array recently became dynamically
resizable, and btree_insert_entry along with it; this was being
observed during journal replay, when write buffer btree updates
don't use the write buffer and instead use the normal btree
update path
- multiple fixes for deadlock in recovery when we need to do lots
of btree node merges; excessive merges were clocking up the
whole pipeline
- write buffer path now correctly does btree node merges when
needed
- fix failure to go RW when superblock indicates recovery passes
needed (i.e. to complete an unfinished upgrade)
- Various unsafety fixes - test case contributed by a user who had
two drives out of a six drive array write out a whole bunch of
garbage after power failure
- New (tiny) on disk format feature: since it appears the btree node
scan tool will be a more regular thing (crappy hardware, user
error) - this adds a 64 bit per-device bitmap of regions that have
ever had btree nodes.
- A path->should_be_locked fix, from a larger patch series tightening
up invariants and assertions around btree transaction and path
locking state.
This particular fix prevents us from keeping around btree_paths
that are no longer needed"
* tag 'bcachefs-2024-04-15' of https://evilpiepirate.org/git/bcachefs: (24 commits)
bcachefs: set_btree_iter_dontneed also clears should_be_locked
bcachefs: fix error path of __bch2_read_super()
bcachefs: Check for backpointer bucket_offset >= bucket size
bcachefs: bch_member.btree_allocated_bitmap
bcachefs: sysfs internal/trigger_journal_flush
bcachefs: Fix bch2_btree_node_fill() for !path
bcachefs: add safety checks in bch2_btree_node_fill()
bcachefs: Interior known are required to have known key types
bcachefs: add missing bounds check in __bch2_bkey_val_invalid()
bcachefs: Fix btree node merging on write buffer btrees
bcachefs: Disable merges from interior update path
bcachefs: Run merges at BCH_WATERMARK_btree
bcachefs: Fix missing write refs in fs fio paths
bcachefs: Fix deadlock in journal replay
bcachefs: Go rw if running any explicit recovery passes
bcachefs: Standardize helpers for printing enum strs with bounds checks
bcachefs: don't queue btree nodes for rewrites during scan
bcachefs: fix race in bch2_btree_node_evict()
bcachefs: fix unsafety in bch2_stripe_to_text()
bcachefs: fix unsafety in bch2_extent_ptr_to_text()
...
This is part of a larger series cleaning up the semantics of
should_be_locked and adding assertions around it; if we don't need an
iterator/path anymore, it clearly doesn't need to be locked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In __bch2_read_super(), if kstrdup() fails, it needs to release memory
in sb->holder, fix to call bch2_free_super() in the error path.
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a small (64 bit) per-device bitmap that tracks ranges that
have btree nodes, for accelerating btree node scan if it is ever needed.
- New helpers, bch2_dev_btree_bitmap_marked() and
bch2_dev_bitmap_mark(), for checking and updating the bitmap
- Interior btree update path updates the bitmaps when required
- The check_allocations pass has a new fsck_err check,
btree_bitmap_not_marked
- New on disk format version, mi_btree_mitmap, which indicates the new
bitmap is present
- Upgrade table lists the required recovery pass and expected fsck error
- Btree node scan uses the bitmap to skip ranges if we're on the new
version
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouldn't be doing the unlock/relock dance when we're not using a
path - this fixes an assertion pop when called from btree node scan.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For forwards compatibilyt, we allow bkeys of unknown type in leaf nodes;
we can simply ignore metadata we don't understand. Pointers to btree
nodes must always be of known types, howwever.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree write buffer flush fastpath that avoids the main transaction
commit path had the unfortunate side effect of not doing btree node
merging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's been a bug in the btree write buffer where it wasn't triggering
btree node merges - and leaving behind a bunch of nearly empty btree
nodes.
Then during journal replay, when updates to the backpointers btree
aren't using the btree write buffer (because we require synchronization
with journal replay), we end up doing those merges all at once.
Then if it's the interior update path running them, we deadlock because
those run with the highest watermark.
There's no real need for the interior update path to be doing btree node
merges; other code paths can handle that at lower watermarks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock where the interior update path during journal
replay ends up doing a ton of merges on the backpointers btree, and
deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_key_can_insert_cached() should be checking the watermark -
BCH_TRANS_COMMIT_journal_replay really means nonblocking mode when
watermark < reclaim, it was being used incorrectly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug where we fail to start when upgrading/downgrading
because we forgot we needed to go rw.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree paths array is now dynamically resizable - and as well the
btree_insert_entries array, as it needs to be the same size.
The merge path (and interior update path) allocates new btree paths,
thus can trigger a resize; thus we need to not retain direct pointers
after invoking merge; similarly when running btree node triggers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It turns out - btree splits happen with the rest of the transaction
still locked, to avoid unnecessary restarts, which means using nofail
doesn't work here - we can deadlock.
Fortunately, we now have the ability to return errors here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull more bcachefs fixes from Kent Overstreet:
"Notable user impacting bugs
- On multi device filesystems, recovery was looping in
btree_trans_too_many_iters(). This checks if a transaction has
touched too many btree paths (because of iteration over many keys),
and isuses a restart to drop unneeded paths.
But it's now possible for some paths to exceed the previous limit
without iteration in the interior btree update path, since the
transaction commit will do alloc updates for every old and new
btree node, and during journal replay we don't use the btree write
buffer for locking reasons and thus those updates use btree paths
when they wouldn't normally.
- Fix a corner case in rebalance when moving extents on a
durability=0 device. This wouldn't be hit when a device was
formatted with durability=0 since in that case we'll only use it as
a write through cache (only cached extents will live on it), but
durability can now be changed on an existing device.
- bch2_get_acl() could rarely forget to handle a transaction restart;
this manifested as the occasional missing acl that came back after
dropping caches.
- Fix a major performance regression on high iops multithreaded write
workloads (only since 6.9-rc1); a previous fix for a deadlock in
the interior btree update path to check the journal watermark
introduced a dependency on the state of btree write buffer flushing
that we didn't want.
- Assorted other repair paths and recovery fixes"
* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
bcachefs: Fix a race in btree_update_nodes_written()
bcachefs: btree_node_scan: Respect member.data_allowed
bcachefs: Don't scan for btree nodes when we can reconstruct
bcachefs: Fix check_topology() when using node scan
bcachefs: fix eytzinger0_find_gt()
bcachefs: fix bch2_get_acl() transaction restart handling
bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
bcachefs: Rename struct field swap to prevent macro naming collision
MAINTAINERS: Add entry for bcachefs documentation
Documentation: filesystems: Add bcachefs toctree
bcachefs: JOURNAL_SPACE_LOW
bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
bcachefs: fix rand_delete unit test
bcachefs: fix ! vs ~ typo in __clear_bit_le64()
bcachefs: Fix rebalance from durability=0 device
bcachefs: Print shutdown journal sequence number
...
We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
dropping read locks in bch2_btree_node_lock_write_nofail() dates from
before we had the cycle detector; we can now tell the cycle detector
directly when taking a lock may not fail because we can't handle
transaction restarts.
This is needed for adding should_be_locked asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
One btree update might have terminated in a node update, and then while
it is in flight another btree update might free that original node.
This race has to be handled in btree_update_nodes_written() - we were
missing a READ_ONCE().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- fix return types: promoting from unsigned to ssize_t does not do what
we want here, and was pointless since the rest of the eytzinger code
is u32
- nr, not size
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_acl_from_disk() uses allocate_dropping_locks, and can thus return
a transaction restart - this wasn't handled.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating bkey_cached from bc->freed_pcpu list, it missed
decreasing the count of nr_freed_pcpu which would cause the mismatch
between the value of nr_freed_pcpu and the list items. This problem
also exists in moving new bkey_cached to bc->freed_pcpu list.
If these happened, the bug info may appear in
bch2_fs_btree_key_cache_exit by the follow code:
BUG_ON(list_count_nodes(&bc->freed_pcpu) != bc->nr_freed_pcpu);
BUG_ON(list_count_nodes(&bc->freed_nonpcpu) != bc->nr_freed_nonpcpu);
Fixes: c65c13f0ea ("bcachefs: Run btree key cache shrinker less aggressively")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Multiple bug fixes for journal iters:
- When the journal keys gap buffer is resized, we have to adjust the
iterators for moving the gap to the end
- We don't want to rewind iterators to point to the key we just
inserted if it's not for the correct btree/level
Also, add some new assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The struct field swap can collide with the swap() macro defined in
linux/minmax.h. Rename the struct field to prevent such collisions.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
"bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant
performance regression (nearly 50%) on multithreaded random writes with
fio.
The reason is that the journal watermark checks multiple things,
including the state of the btree write buffer, and on multithreaded
update heavy workloads we're bottleneked on write buffer flushing - we
don't want kicknig off btree updates to depend on the state of the write
buffer.
This isn't strictly correct; the interior btree update path does do
write buffer updates, but it's a tiny fraction of total accounting
updates and we're more concerned with space in the journal itself.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_IOCTL_FSCK_OFFLINE allows the userspace fsck tool to use the kernel
implementation of fsck - primarily when the kernel version is a better
version match.
It should look and act exactly like the normal userspace fsck that the
user expected to be invoking, so errors should never result in a kernel
panic.
We may want to consider further restricting errors=panic - it's only
intended for debugging in controlled test environments, it should have
no purpose it normal usage.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
To open an encrypted filesystem, we use request_key() to get the
encryption key from the user's keyring - but request_key() needs to
happen in the context of the process that invoked the ioctl.
This easily fixed by using bch2_fs_open() in nostart mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The ! was obviously intended to be ~. As it is, this function does
the equivalent to: "addr[bit / 64] = 0;".
Fixes: 27fcec6c27 ("bcachefs: Clear recovery_passes_required as they complete without errors")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull vfs fixes from Christian Brauner:
"This contains a few small fixes. This comes with some delay because I
wanted to wait on people running their reproducers and the Easter
Holidays meant that those replies came in a little later than usual:
- Fix handling of preventing writes to mounted block devices.
Since last kernel we allow to prevent writing to mounted block
devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the
block device is opened with restricted writes. When we switched to
opening block devices as files we altered the mechanism by which we
recognize when a block device has been opened with write
restrictions.
The detection logic assumed that only read-write mounted
filesystems would apply write restrictions to their block devices
from other openers. That of course is not true since it also makes
sense to apply write restrictions for filesystems that are
read-only.
Fix the detection logic using an FMODE_* bit. We still have a few
left since we freed up a couple a while ago. I also picked up a
patch to free up four additional FMODE_* bits scheduled for the
next merge window.
- Fix counting the number of writers to a block device. This just
changes the logic to be consistent.
- Fix a bug in aio causing a NULL pointer derefernce after we
implemented batched processing in aio.
- Finally, add the changes we discussed that allows to yield block
devices early even though file closing itself is deferred.
This also allows us to remove two holder operations to get and
release the holder to align lifetime of file and holder of the
block device"
* tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
aio: Fix null ptr deref in aio_complete() wakeup
fs,block: yield devices early
block: count BLK_OPEN_RESTRICT_WRITES openers
block: handle BLK_OPEN_RESTRICT_WRITES correctly
sysfs is limited to PAGE_SIZE, and when we're debugging strange
deadlocks/priority inversions we need to see the full list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Snapshot table accesses generally need to be checking for invalid
snapshot ID now, fix one that was missed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>