A demo fs for learning from bcachefs.
Backed by a single image file (default mode is still in-memory; pass
--image <path> to mount on a persistent file). Synchronous writes only —
no journal, no crash recovery yet.
See docs/snapshot-delete-plan.md for the delete + snapshot design, including the bcachefs mapping.
Done
- COW B-tree with split / rebalance
- Zerocopy on-disk layout (
NodeHeader/DiskEntry, 4 KB nodes) - Btree API returns
Result(Io / BadMagic / ChecksumMismatch / BlockNotFound) - Multi-tree view via key prefix: inode / dirent / extent sharing one physical Btree (bcachefs style)
- FUSE via
fuser0.17 pure-rust (nolibfuse-devdependency) -
lookup / getattr / readdir / read / write / create / mkdir— enough formkdir / echo > / cat / ls / cd .. - Multi-block writes (4 KB chunks + read-modify-write) and zero-filled sparse reads
- Btree delete with bcachefs-style
Deleted/Whiteoutdistinction - snap_id embedded in every key + iterator ancestor filter (
BTREE_ITER_FILTER_SNAPSHOTSsemantics) - Snapshot tree + Subvolume tree (
KIND_SNAPSHOT/KIND_SUBVOL) -
Btree::transactionfor atomic multi-key ops (used byunlink,rmdir,rename) -
unlink / rmdir / renameexposed via FUSE -
Fs::snapshot_subvol+Fs::switch_subvol— bcachefs-style writable snapshots: src keeps writing under a new id, snapshot subvol gets a readonly id, both inheriting from the old snap_id - Multi-bset node layout (bcachefs
bsetinfra): per-nodeBsetHeader { seq, nkeys, flags }, up toBSET_TREE_NR_MAX=4sorted runs per leaf, k-way merged search/iter (highest-seq wins on dup), append-to-last-bset insert, soft-limit roll-over, 4-bsets-full → compact, split compacts multi-bset source to single bset - In-place delete optimization: when
visible_snap == snap(delete-of-own-key), flipkindbyte fromLivetoDeletedin the existing entry instead of sort-inserting a fresh tombstone —nkeysunchanged, no new bset opened, no possible split. Cross-snap deletes still write aWhiteoutentry. - Persistence (Phase 1: image file, no journal) — single backing file; superblock at block 0 with magic / version / CRC32 / root_block / next_block_nr / next_bset_seq / next_ino / next_snap_id / next_subvol_id / current_subvol; CRC32 stamped into every node block on write and verified on read; magic mismatch and checksum mismatch surfaced as typed
btree::Errorvariants. Allocator unified between btree nodes and data blocks (store.alloc()returns next free block_nr from a single namespace).Fs::create(path)/Fs::open(path)/Fs::sync(); FUSE picks up--image <path>, auto-syncs ondestroy(). -
BlockStorewith append-only cache —elsa::sync::FrozenMapkeeps cached blocks borrow-stable across cache faults soread_node(&self, nr) -> &BtreeNodeRawdoesn't need&mut self. COW guarantees a freshly-allocatedblock_nris written exactly once before being read, matching FrozenMap's append-only contract perfectly.BtreeNodeRawandDataBlockuse separate cache lanes so 4 KB-aligned zerocopy casts stay sound.
TODO
- Btree node size: currently 4 KB to match an OS page and keep COW cheap; bcachefs uses 256 KB by default. Once persistence + journal land, raise
BLOCK_SIZEto 64 KB / 256 KB soMAX_ENTRIES(now 29) andBSET_SOFT_LIMIT(now 7) become big enough for multi-bset to actually pay off vs single-bset binary search. Needs benchmarks for COW write amplification (clone_to_heapcost scales linearly with node size). -
needs_whiteoutbit + whiteout-only compaction - Snapshot deletion: walk all btrees, drop keys at the gone snap_id, clean dependent whiteouts
- Sibling merge / rebalance on sparse leaves
-
setattr: truncate / chmod / utimens - Reclaim old data block on extent overwrite / unlink (snapshot inheritance complicates this — needs per-block refcounting)
- Subvolume management exposed via FUSE (currently only via Rust API)
TODO (persistence — phase 2 and beyond)
- Write-ahead journal + crash recovery: append-only journal region; each transaction stages dirty block_nrs, flushes, then advances the superblock; on open, find the last valid journal entry and replay.
NodeHeader.generationalready in place for this. - Block GC (mark-and-sweep, bcachefs style): periodically walk live roots (current + every snapshot's), compute
live_set, free(0..next_block_nr) - live_set, drop those entries from the cache, prefer them on nextalloc(). Until this lands, the image file grows monotonically. - Bounded cache + LRU eviction: today the
FrozenMapcache never shrinks (matches "no GC" but blows RAM on large images). Needs a real cache with eviction, which probably means dropping the FrozenMap append-only invariant and going through anRwLock<HashMap>or a per-entry lock as bcachefs does. - Direct I/O (
O_DIRECT) + 4 KB-aligned buffers: bypass the OS page cache, control writeback ordering ourselves. Today we use buffered IO and rely onfsyncfor ordering — fine for a learning project, not OK for a real fs. - Multi-superblock + watermark for atomic superblock update: bcachefs writes N copies of the superblock; we currently overwrite block 0 in place and a torn write at the wrong moment is unrecoverable.