pathlockd

Fast, scalable, opinionated path-based distributed locking primitives with embedded Multi-Raft and RocksDB, exposed over gRPC.

pathlockd is a self-contained daemon that coordinates concurrent access to a hierarchical path namespace (handler:/a/b/c) across many processes and machines. It exposes a small, precise set of locking primitives over gRPC and stores all durable state in an embedded RocksDB engine — with no external dependencies.

Replication status. Lock state is Raft-replicated across an elastic Multi-Raft cluster (one embedded openraft group per shard, SWIM/foca for discovery). Any node serves any request — writes forward to the shard's leader; a leader crash fails over within the election timeout with no lost acknowledged state; fencing tokens stay globally monotonic across failovers. 1 node works (no fault tolerance), 3+ nodes give HA; replicas join and leave at runtime and groups re-place themselves automatically.

It is opinionated: the locking model is exactly the one a virtual filesystem needs (write locks that cover a whole subtree, point reads that don't, fencing tokens to make stale writers detectable, leases that expire if a holder dies, and built-in deadlock detection) rather than a general-purpose lock manager you have to assemble yourself.

Why path locking

When many workers manipulate a shared tree (rename a folder here, upload a file there, reconcile a subtree elsewhere) you need more than a flat mutex per key. You need locks that understand containment:

A write lock on /a must conflict with any lock on /a or anywhere in the /a/... subtree, and with any write lock on an ancestor of /a: locking /a means "this whole subtree is mine".
A read lock is point-only: it protects exactly one node. An ancestor read does not cover its descendants, so it neither blocks nor is blocked by a write deeper in the tree.

pathlockd enforces this containment directly, with O(subtree) conflict checks (not O(keyspace)) via descendant indexes.

It's a reader-writer lock — but not the textbook one. A classic RWLock is symmetric and flat: one key, readers share, a writer is exclusive. pathlockd keeps the shared-readers / exclusive-writer rule but generalizes it to a tree asymmetrically: a write claims its entire subtree, while a read claims only its single node. So a write and a read collide only when the write's subtree contains the read's node — an ancestor read does not cover its descendants, and a descendant write does not block an ancestor read. That asymmetry (and the precedence between conflict reasons) is the part most worth understanding before you use it — read docs/locking-semantics.md, the normative spec for the full conflict matrix, fencing, leases, and re-entrancy rules.
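The asymmetry can be modelled in a few lines. The sketch below is illustrative only, not the daemon's implementation: the function names (`parse`, `covers`, `conflicts`) are hypothetical, it assumes the two locks belong to different owners, and it ignores re-entrancy and the precedence between conflict reasons, which docs/locking-semantics.md defines.

```python
def parse(path):
    """Split 'handler:/a/b' into ('handler', ('a', 'b'))."""
    handler, _, rest = path.partition(":")
    return handler, tuple(seg for seg in rest.split("/") if seg)


def covers(ancestor, node):
    """True if `node` is `ancestor` itself or lies inside its subtree."""
    a_handler, a_segs = parse(ancestor)
    n_handler, n_segs = parse(node)
    return a_handler == n_handler and n_segs[: len(a_segs)] == a_segs


def conflicts(held_mode, held_path, want_mode, want_path):
    """Conflict test between two locks held by different owners (simplified)."""
    if held_mode == "read" and want_mode == "read":
        return False  # readers always share
    if held_mode == "write" and want_mode == "write":
        # Two subtree claims collide when either subtree contains the other.
        return covers(held_path, want_path) or covers(want_path, held_path)
    # One write, one read: collide only if the write's subtree contains the read's node.
    if held_mode == "write":
        write_path, read_path = held_path, want_path
    else:
        write_path, read_path = want_path, held_path
    return covers(write_path, read_path)
```

For example, a write on `fs:/a` collides with a read on `fs:/a/b/c`, while a read on `fs:/a` and a write on `fs:/a/b` coexist.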

Core concepts

Concept	What it does
Owner	A caller-supplied id that owns a lock and all the paths it holds.
Read / write modes	Shared readers, exclusive writer — but hierarchical, not flat: a write covers its whole subtree, a read covers only its node. A tree-shaped RWLock, not the symmetric textbook one. Full rules: docs/locking-semantics.md.
Fencing token	A monotonic token stamped on every write-locked path. A holder can call `AssertFencing` to prove it still owns a path at its token; a stale token is rejected, so a paused-then-resumed writer can't corrupt newer state.
TTL lease + renewal	Every lock is a lease. The holder renews it; if the holder dies, the lease expires and the subtree frees itself — no orphaned locks.
Liveness & pruning	Read sets self-heal: members whose owner lease has lapsed are pruned on the next touch.
Deadlock detection	Wait edges (`owner → blocker`, plus the path/reason being waited on) form a wait-for graph. `DetectCycle` walks it and drops stale edges; a client that finds a cycle resolves it with a cooperative revoke, then a forced release if the victim doesn't yield.
Per-owner event stream	A `Subscribe` stream bound to one owner delivers only that owner's lifecycle events (`released` / `killed` / `revoke`). A lock's channel carries only that lock's information.
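As a rough mental model of the deadlock-detection row, the walk over the wait-for graph might look like the sketch below. It is an assumption-laden illustration, not the `DetectCycle` implementation: it assumes each owner has at most one outgoing wait edge, treats an edge as stale when the blocker is no longer alive, and the names `detect_cycle`, `wait_edges`, and `is_alive` are hypothetical.

```python
def detect_cycle(wait_edges, start, is_alive, max_depth=32):
    """Follow owner -> blocker edges from `start`.

    Returns the list of owners forming a cycle through `start`, or None.
    `wait_edges` maps owner -> blocker; `is_alive` is a callable(owner) -> bool.
    """
    chain = [start]
    current = start
    for _ in range(max_depth):
        blocker = wait_edges.get(current)
        if blocker is None:
            return None  # chain ends at an owner that is not waiting
        if not is_alive(blocker):
            del wait_edges[current]  # stale edge: drop it (assumed staleness rule)
            return None
        if blocker == start:
            return chain  # walked back to where we began: deadlock
        if blocker in chain:
            return None  # a cycle exists, but it does not include `start`
        chain.append(blocker)
        current = blocker
    return None  # depth budget exhausted without closing a cycle
```

A client that gets a cycle back would then pick a victim and issue the cooperative revoke described in the table.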

Architecture

   your application (one lock = one owner id = one connection)
        │  gRPC
        ▼
   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   │  pathlockd  │◄──┤  pathlockd  ├──►│  pathlockd  │  N nodes
   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘  SWIM gossip (UDP)
          │                 │                 │         Raft RPC (gRPC)
   ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
   │   RocksDB   │   │   RocksDB   │   │   RocksDB   │  embedded, per node;
   └─────────────┘   └─────────────┘   └─────────────┘  groups replicate via
                                                        their Raft logs

Self-contained binary. All durable state lives in an embedded RocksDB engine inside each node. No external coordination service is needed — start a single binary and it is ready.
RocksDB for persistence. Lock metadata is stored in 14 column families with TTL-based expiry and background GC sweeps. WAL fsync guarantees durability across process crashes.
Multi-Raft consensus. The path namespace shards into group_count Raft groups by routing domain (the handler prefix, optionally deeper): a path, its ancestors, and its whole subtree always share one group, so every lock operation is single-group — no cross-shard transactions, ever. Each group is an independent openraft core; writes commit on the group's leader and apply identically on every replica. A dedicated system group holds the cluster-global state: the monotonic fencing counter, the deadlock wait-graph, and the membership directory (replicated to every node).
Elastic membership. SWIM gossip (foca) discovers nodes and suspicion; Raft membership stays the correctness authority. Each group's leader reconciles its voter set toward an HRW placement over the stable members: new nodes are adopted learner-first and promoted via joint consensus, dead voters are replaced after an eviction window (never breaking quorum), and leadership spreads across nodes. The replication factor upgrades automatically as nodes arrive (1 → 3 → 5).
Atomicity. Each command applies inside a single WriteBatch with read-your-writes semantics, serialized by its group's Raft apply loop. Rejected outcomes (conflicts) commit nothing. One WAL fsync per batched group of appends — across all groups on the node — preserves group-commit throughput. Forwarded commands carry request ids and dedupe, so an ambiguous retry (leader change mid-flight) applies exactly once.
TTL-based leases. Every record carries an absolute expiry timestamp. Reads treat elapsed entries as absent (correctness); a background GC sweep reclaims expired records (housekeeping, configurable interval). Set entries (read sets, descendant indexes) expire per member, so a short-lived lock never shortens the visibility of a longer-lived one sharing the same key.
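The per-member expiry in the last item can be pictured with a small in-memory model. This is a sketch of the idea only; the class name `TtlSet` and its methods are hypothetical and say nothing about the actual RocksDB value encoding.

```python
class TtlSet:
    """A set whose members each carry their own absolute expiry timestamp (ms)."""

    def __init__(self):
        self._expires_at = {}  # member -> absolute expiry in ms

    def add(self, member, now_ms, ttl_ms):
        self._expires_at[member] = now_ms + ttl_ms

    def live(self, now_ms):
        # Correctness path: elapsed members read as absent even before GC runs.
        return {m for m, exp in self._expires_at.items() if exp > now_ms}

    def sweep(self, now_ms):
        # Housekeeping path: physically drop elapsed members, return how many.
        dead = [m for m, exp in self._expires_at.items() if exp <= now_ms]
        for m in dead:
            del self._expires_at[m]
        return len(dead)
```

Because each member expires on its own clock, a 100 ms lock and a 10 s lock stored under the same key never affect each other's visibility.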

Scope & limits

Write throughput scales per handler, not without bound. All mutations that touch a given handler serialize through the handler's Raft group leader. Spread load across handlers — or split a hot handler — to scale; a single hot handler is the throughput ceiling.
Descendant index size. A write lock is indexed under every ancestor up to the handler root, so the root index aggregates every write lock in the handler in one value. This bounds the practical number of concurrent locks per handler; very wide/deep trees in one handler are not yet sharded (future work).
Input limits (enforced server-side). ttl_ms must be > 0 (a 0 TTL would never expire) and ≤ 7 days; paths must be normalized (<handler>:/rooted/path, with no //, no . or .. segments, and no trailing slash); owner_id/paths are length-bounded; DetectCycle.max_depth is clamped. Malformed input is rejected with InvalidArgument.
Trust model. There is no authentication on the gRPC surface — any client can release or revoke any owner's locks. Run pathlockd on a trusted network (or behind a TLS-terminating, authenticating proxy).
Storage format. This is a pre-1.0 daemon; the on-disk value encoding may change between versions. Run against a fresh/flushed keyspace when upgrading.
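Clients can pre-validate input to avoid a round trip that ends in InvalidArgument. The sketch below mirrors the input limits listed above; the function names are hypothetical, length bounds are omitted because their values are not stated here, and accepting the bare handler root (`handler:/`) is an assumption.

```python
MAX_TTL_MS = 7 * 24 * 60 * 60 * 1000  # 7 days in milliseconds


def is_valid_ttl(ttl_ms):
    """ttl_ms must be strictly positive and at most 7 days."""
    return 0 < ttl_ms <= MAX_TTL_MS


def is_normalized(path):
    """Check the <handler>:/rooted/path shape described above."""
    handler, sep, rest = path.partition(":")
    if not handler or not sep or not rest.startswith("/"):
        return False  # missing handler, missing ':', or path not rooted
    if rest == "/":
        return True  # assumption: the bare handler root is accepted
    if rest.endswith("/"):
        return False  # trailing slash
    segments = rest[1:].split("/")
    # An empty segment means '//'; '.' and '..' must already be resolved.
    return all(seg not in ("", ".", "..") for seg in segments)
```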

Roadmap to 1.0.0 (TODO)

The following are not yet implemented and are planned for the final 1.0.0 release:

Authentication & authorization, TLS — the gRPC surface is currently unauthenticated and in plaintext; until then, run pathlockd only on a trusted network or behind a TLS-terminating, authenticating proxy.
Multitenancy — no tenant isolation yet (per-tenant authn/authz, namespacing beyond the handler convention, and quotas).

Internals are documented for contributors and tools in llmwiki/. For end-to-end, copy-pasteable usage when building a user-space virtual filesystem, see the usage guide.

Platform support

Container images are published for linux/amd64 and linux/arm64 (Apple Silicon / AWS Graviton). The Node.js client targets linux/amd64.

Running from the container image

Pre-built images are published to GHCR on every version tag (v*):

Image tag	Binary	Notes
`ghcr.io/alexpacio/pathlockd:0.6.0`	native (amd64/arm64)

Run pathlockd (single node, no external dependencies):

docker run -d --restart=unless-stopped \
  -p 50051:50051 \
  -e PATHLOCKD_BOOTSTRAP=true \
  -e PATHLOCKD_DATA_DIR=/data/pathlockd \
  -v pathlockd-data:/data/pathlockd \
  ghcr.io/alexpacio/pathlockd:0.6.0

Key env vars (see Configuration for the full list):

Env var	Default	Notes
`PATHLOCKD_LISTEN`	`0.0.0.0:50051`	gRPC bind address
`PATHLOCKD_DATA_DIR`	`/var/lib/pathlockd`	RocksDB data directory
`PATHLOCKD_NODE_ID`	`pathlockd-0`	Stable node identifier
`PATHLOCKD_BOOTSTRAP`	`false`	Bootstrap a new cluster (single node or first node)
`PATHLOCKD_SEED_NODES`	(none)	Comma-separated gossip seed addresses (multi-node)
`PATHLOCKD_PEERS`	(none)	Optional extra static event fan-out endpoints, comma-separated (members are auto-discovered via gossip)
`PATHLOCKD_LOG_LEVEL`	`info`	`trace` / `debug` / `info` / `warn` / `error`

The daemon runs as a non-root user (uid 10001) and exposes a liveness HEALTHCHECK via --health-check.

Quick start (development / playground)

Single-binary quick start — no external services required:

docker compose up --build
# pathlockd is now at localhost:50051

Try it with grpcurl:

grpcurl -plaintext -d '{}' localhost:50051 pathlockd.v1.PathLock/IncrFencingToken
grpcurl -plaintext localhost:50051 pathlockd.v1.PathLock/Health

Or use the typed Node.js client, pathlockd-nodejs-client.

To run the daemon on your host for development, see llmwiki/06-testing.md.

Production deployment

A cluster is N self-contained nodes. Each node needs three things:

a stable identity — node_id ending in a unique integer (pathlockd-0, pathlockd-1, …) that survives restarts;
a persistent volume of its own — a node must come back on its own disk (a wiped disk means rejoining as a learner and re-syncing);
addresses peers can reach: raft_addr (gRPC/TCP), gossip_addr (UDP), public_addr (client gRPC, used for event fan-out).

Exactly one node sets bootstrap = true (it initializes the cluster the first time, idempotently); every node lists seed_nodes (gossip addresses of the others). A bootstrap-flagged node restarting on an empty disk refuses to re-initialize when its cluster still answers through the seeds, and joins it instead — so the flag is safe to leave set in static configs.

Single node (dev or no-HA):

docker run -d --restart=unless-stopped -p 50051:50051 \
  -e PATHLOCKD_NODE_ID="pathlockd-0" \
  -e PATHLOCKD_BOOTSTRAP="true" \
  -e PATHLOCKD_DATA_DIR=/data/pathlockd \
  -v pathlockd-data:/data/pathlockd \
  ghcr.io/alexpacio/pathlockd:latest

Docker Swarm (3-node HA): see docker-stack.yml — a ready-to-deploy reference stack. The pattern is one single-replica service per node (pathlockd-0/1/2), because lock state is per-task and Swarm's replicas: 3 gives tasks neither stable identity nor stable volumes:

# Pin each instance to a host so it always finds its volume:
docker node update --label-add pathlockd=0 <node-A>
docker node update --label-add pathlockd=1 <node-B>
docker node update --label-add pathlockd=2 <node-C>
docker stack deploy -c docker-stack.yml pathlockd

Clients on the same overlay network reach any service (pathlockd-0:50051, …); every node serves every request, forwarding writes to the right Raft leader internally. Kill any one container/host: the other two keep serving, acknowledged locks survive, and the node rejoins and re-syncs when it returns.

On Kubernetes, the same shape is a StatefulSet with a headless Service: ordinal hostnames give the node ids, volumeClaimTemplates give per-pod disks, and seed_nodes points at the headless DNS name of pod 0 (or all pods).

Clocks. Lease expiry uses a now_ms stamped at proposal and clamped monotonically inside each group's replicated state machine, so a backwards clock step (NTP, VM resume) or a leader change to a node with a slower clock can never make later commands apply with earlier timestamps. Fencing tokens are one monotonic counter in the system Raft group.
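The monotonic clamp amounts to taking a running maximum inside the state machine. A minimal sketch, assuming the clamp is a plain `max` (the real rule may differ in detail) and using a hypothetical `GroupClock` class:

```python
class GroupClock:
    """Per-group clock: timestamps applied to commands never move backwards."""

    def __init__(self):
        self.last_applied_ms = 0

    def apply(self, proposed_now_ms):
        # Use the proposer's wall clock unless it is behind what was already applied.
        self.last_applied_ms = max(self.last_applied_ms, proposed_now_ms)
        return self.last_applied_ms
```

If a new leader's clock reads 900 after the group already applied 1000, the command is applied at 1000, so no lease can appear to un-expire.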

Event fan-out across instances

The per-owner event stream (Subscribe → released / killed / revoke) raises an event on whichever node handled the call, which may be a different node than the one holding the subscriber. Nodes discover each other via gossip and forward events peer-to-peer automatically — no configuration needed. Fan-out is best-effort by design: the client-side recheck poll is the correctness backstop, so a dropped event costs wakeup latency, never safety.
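On the client side, "events are a hint, polling is the backstop" can be sketched as a wait loop like the one below. This is an illustration rather than the Node.js client's code: `still_blocked` stands in for whatever recheck the client performs (a hypothetical callable), and `wakeup` is an event that a `Subscribe` reader thread would set.

```python
import threading


def wait_for_release(still_blocked, wakeup, poll_interval_s=0.5, max_polls=20):
    """Block until `still_blocked()` returns False; events only shorten the wait."""
    for _ in range(max_polls):
        if not still_blocked():
            return True
        # Wake early if an event arrived, or after the poll interval if it was dropped.
        wakeup.wait(timeout=poll_interval_s)
        wakeup.clear()
    return not still_blocked()
```

A dropped event therefore costs at most one poll interval of extra latency.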

Scaling and write throughput

Reads scale with nodes (any replica serves stale-tolerable reads locally). Writes scale with routing domains: every domain (handler prefix by default) serializes through one Raft group leader, and leaders spread across nodes. Many handlers → near-linear write scaling. Few handlers → set routing_prefix_segments = K to shard by the first K path segments instead, accepting that locks above depth K are rejected (containment must stay single-group). Renews should declare their domains (RenewRequest.domains) so each heartbeat touches only the groups that actually hold state.
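To make the routing domain concrete, the sketch below derives it from a path and maps it to a group. The domain derivation follows the description above; the hash-modulo mapping to a group index is an assumption made for illustration (the daemon's actual placement function is not specified here), and the function names are hypothetical.

```python
import hashlib


def routing_domain(path, prefix_segments=0):
    """Return the routing domain of `path`, or None if the path is too shallow."""
    handler, _, rest = path.partition(":")
    segments = [s for s in rest.split("/") if s]
    if len(segments) < prefix_segments:
        return None  # lock sits above depth K: it would span groups, so it is rejected
    return handler + ":/" + "/".join(segments[:prefix_segments])


def group_for(path, group_count=32, prefix_segments=0):
    """Map a path to a group index (assumed: hash of the domain modulo group_count)."""
    domain = routing_domain(path, prefix_segments)
    if domain is None:
        raise ValueError("path is above the routing depth")
    digest = hashlib.sha256(domain.encode()).digest()
    return int.from_bytes(digest[:8], "big") % group_count
```

With `prefix_segments=0` every path under one handler lands in the same group; with `prefix_segments=2`, `fs:/a/b/c` and `fs:/a/b` share a group while `fs:/a` is rejected.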

To decommission a node gracefully, mark it draining (internal RaftTransport/SetDraining RPC), or just stop it and let the eviction window re-place its groups. Scale-up is automatic on join.
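The Architecture section mentions HRW (highest-random-weight, or rendezvous) placement. A generic sketch of that technique shows why removing a node only moves the groups that node hosted. The scoring hash and function names are assumptions for illustration, not the daemon's actual placement code.

```python
import hashlib


def hrw_score(group_id, node_id):
    """Deterministic pseudo-random weight for a (group, node) pair."""
    digest = hashlib.sha256(f"{group_id}|{node_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(group_id, nodes, replication_factor=3):
    """Pick the `replication_factor` highest-scoring nodes for this group."""
    ranked = sorted(nodes, key=lambda n: hrw_score(group_id, n), reverse=True)
    return ranked[:replication_factor]
```

Because each node's score for a group is independent of the other nodes, the surviving voters keep their seats when one member leaves; only the vacated seat is refilled.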

Configuration

Configuration comes from a TOML file (--config pathlockd.toml or PATHLOCKD_CONFIG), overlaid by PATHLOCKD_* environment variables (env wins). See pathlockd.example.toml.

TOML key	Env var	Default	Meaning
`listen`	`PATHLOCKD_LISTEN`	`0.0.0.0:50051`	Client gRPC listen address
`node_id`	`PATHLOCKD_NODE_ID`	`pathlockd-0`	Stable identifier; must end in a unique integer per node
`data_dir`	`PATHLOCKD_DATA_DIR`	`/var/lib/pathlockd`	RocksDB data directory (one per node, persistent)
`public_addr`	`PATHLOCKD_PUBLIC_ADDR`	`http://localhost:50051`	Client gRPC address advertised to peers (event fan-out)
`raft_addr`	`PATHLOCKD_RAFT_ADDR`	`http://localhost:50052`	Internal Raft/forwarding gRPC address advertised to peers
`gossip_addr`	`PATHLOCKD_GOSSIP_ADDR`	`0.0.0.0:7946`	SWIM gossip UDP bind address
`gossip_advertise_addr`	`PATHLOCKD_GOSSIP_ADVERTISE_ADDR`	auto	Concrete `ip:port` advertised for gossip (set behind NAT)
`seed_nodes`	`PATHLOCKD_SEED_NODES`	`[]`	Gossip addresses of existing members (required unless bootstrapping)
`bootstrap`	`PATHLOCKD_BOOTSTRAP`	`false`	Initialize a brand-new cluster (exactly one node; guarded against re-init on empty disks)
`group_count`	`PATHLOCKD_GROUP_COUNT`	`32`	Number of Raft groups (fixed at cluster birth)
`routing_prefix_segments`	`PATHLOCKD_ROUTING_PREFIX_SEGMENTS`	`0`	Path depth of the routing domain (0 = handler only)
`replication_factor`	`PATHLOCKD_REPLICATION_FACTOR`	`3`	Voters per group (odd; auto-degrades/upgrades with node count)
`stability_window_secs`	`PATHLOCKD_STABILITY_WINDOW_SECS`	`30`	Node uptime required before group placement
`eviction_window_secs`	`PATHLOCKD_EVICTION_WINDOW_SECS`	`60`	How long a voter must be gone before replacement
`leader_balance_interval_secs`	`PATHLOCKD_LEADER_BALANCE_INTERVAL_SECS`	`60`	Leadership rebalancing cadence
`max_inflight_per_group`	`PATHLOCKD_MAX_INFLIGHT_PER_GROUP`	`1024`	Per-group write budget; overflow rejected with `UNAVAILABLE`
`raft_election_timeout_min_ms` / `_max_ms`	`PATHLOCKD_RAFT_ELECTION_TIMEOUT_*`	`1500`/`3000`	Election window (failover time ceiling)
`raft_heartbeat_interval_ms`	`PATHLOCKD_RAFT_HEARTBEAT_INTERVAL_MS`	`500`	Leader heartbeat
`raft_snapshot_interval_entries`	—	`10000`	Snapshot after this many log entries
`group_gc_interval_secs`	`PATHLOCKD_GROUP_GC_INTERVAL_SECS`	`1`	GC sweep interval (0 disables; leaders sweep their groups)
`group_gc_batch`	`PATHLOCKD_GROUP_GC_BATCH`	`1024`	Keys per GC sweep command
`gc_compact_interval_secs`	`PATHLOCKD_GC_COMPACT_INTERVAL_SECS`	`600`	Physically compact swept expiry regions (0 disables)
`rocksdb_wal_sync`	`PATHLOCKD_ROCKSDB_WAL_SYNC`	`true`	Fsync the WAL once per batched append group
`rocksdb_max_total_wal_size_mb`	`PATHLOCKD_ROCKSDB_MAX_TOTAL_WAL_SIZE_MB`	`512`	Upper bound on total WAL size
`rocksdb_max_background_jobs`	`PATHLOCKD_ROCKSDB_MAX_BACKGROUND_JOBS`	`4`	RocksDB flush/compaction parallelism
`rocksdb_block_cache_mb`	`PATHLOCKD_ROCKSDB_BLOCK_CACHE_MB`	`128`	Shared block cache size
`rocksdb_write_buffer_mb`	`PATHLOCKD_ROCKSDB_WRITE_BUFFER_MB`	`16`	Per-column-family memtable size
`peers`	`PATHLOCKD_PEERS`	`[]`	Extra static event fan-out endpoints (members are auto-discovered)
`event_buffer`	`PATHLOCKD_EVENT_BUFFER`	`8192`	In-process event channel capacity
`log_level`	`PATHLOCKD_LOG_LEVEL`	`info`	Tracing filter
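The precedence rule (defaults, then the TOML file, then environment variables) can be modelled as below. This is a sketch of the layering only: the function name is hypothetical, values are left as strings without type coercion, and it assumes the env var name is `PATHLOCKD_` plus the upper-cased TOML key, which matches most but not all rows of the table.

```python
def effective_config(defaults, file_cfg, env):
    """Merge defaults < TOML file < PATHLOCKD_* environment (env wins)."""
    merged = dict(defaults)
    merged.update(file_cfg)
    for key in defaults:
        env_name = "PATHLOCKD_" + key.upper()
        if env_name in env:
            merged[key] = env[env_name]
    return merged
```

For example, with `log_level = "debug"` in the file and `PATHLOCKD_LOG_LEVEL=warn` in the environment, the effective level is `warn`.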

OpenTelemetry

Remote APM export is configured with standard OTEL_* environment variables, not TOML. Traces and metrics are enabled when OTEL_EXPORTER_OTLP_ENDPOINT (or the signal-specific traces/metrics endpoint) is set, or when the matching OTEL_TRACES_EXPORTER / OTEL_METRICS_EXPORTER includes otlp.

Common variables:

Env var	Meaning
`OTEL_SERVICE_NAME`	service name resource attribute (defaults to `pathlockd`)
`OTEL_RESOURCE_ATTRIBUTES`	extra resource attributes, e.g. `deployment.environment.name=prod`
`OTEL_EXPORTER_OTLP_ENDPOINT`	shared OTLP collector/APM endpoint
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`	traces-only OTLP endpoint
`OTEL_EXPORTER_OTLP_METRICS_ENDPOINT`	metrics-only OTLP endpoint
`OTEL_EXPORTER_OTLP_PROTOCOL`	`http/protobuf` or `grpc`
`OTEL_EXPORTER_OTLP_HEADERS`	comma-separated auth headers for HTTP OTLP
`OTEL_SDK_DISABLED`	set to `true` to disable OTEL entirely

Example:

export OTEL_SERVICE_NAME=pathlockd
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=prod,service.namespace=locks
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

gRPC API

The full contract is in proto/pathlockd.proto. The PathLock service: Acquire, Release, ReleaseAll, Renew, ForceRelease, AssertFencing, DetectCycle, IsBlocking, IncrFencingToken, SetWaitEdge, ClearWaitEdge, SetClaim, ClearClaim, IsOwnerAlive, RequestRevoke, Subscribe (server stream), Health.

Claims (SetClaim/ClearClaim) are TTL-governed anti-starvation reservations: a waiter plants a claim on the path it is queued for, new overlapping acquires by other owners bounce with preempt_claimed while existing holders drain, and the claimant's own acquire consumes the claim atomically on grant. SetClaim is claim-if-absent — a live foreign claim is reported, never overwritten — and claims require no liveness lease, so a pure waiter (holding nothing yet) can reserve, and a crashed claimant's reservation simply expires.
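A single-path model of the claim rules may help. The sketch below is a simplification under stated assumptions: it ignores path overlap and the actual lock conflict check (so "grant" here means only "not preempted by a claim"), and the class and method names are hypothetical rather than the RPC signatures.

```python
class ClaimTable:
    """TTL-governed, claim-if-absent reservations, one per path (simplified)."""

    def __init__(self):
        self._claims = {}  # path -> (owner, expires_at_ms)

    def _live(self, path, now_ms):
        claim = self._claims.get(path)
        return claim if claim and claim[1] > now_ms else None

    def set_claim(self, path, owner, now_ms, ttl_ms):
        """Return the owner holding the claim after the call."""
        live = self._live(path, now_ms)
        if live and live[0] != owner:
            return live[0]  # live foreign claim: report it, never overwrite
        self._claims[path] = (owner, now_ms + ttl_ms)
        return owner

    def admit(self, path, owner, now_ms):
        """Gate an acquire: 'ok' or 'preempt_claimed'."""
        live = self._live(path, now_ms)
        if live and live[0] != owner:
            return "preempt_claimed"
        self._claims.pop(path, None)  # the claimant's own acquire consumes the claim
        return "ok"
```

A crashed claimant needs no cleanup in this model: once its claim's expiry passes, `admit` lets other owners through again.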

Building

Build with cargo build --release using standard Rust tooling. The Dockerfile bundles the builder stage, so docker build needs nothing on the host.

Testing

Everything runs inside containers, so Docker is the only prerequisite (no host cargo/protoc/clang). The first run builds a small cached builder image.

./scripts/test-unit.sh           # crate unit tests (no cluster needed)
cargo test --test engine_tests    # lock engine tests (RocksDB integration)
cargo test --test e2e_tests       # full e2e tests (starts a 1-node cluster, drives gRPC)
cargo test --test cluster_tests   # 3-node cluster: formation, leader-kill failover under
                                  # contention (exactly-one-holder invariant), wiped-disk
                                  # bootstrap guard, node rejoin
cargo test --test load            # throughput benchmarks
./scripts/test-e2e-stress.sh     # starts peered replicas, checks cross-replica events, runs GC stress

Engine tests and e2e tests run directly against the embedded RocksDB — no external cluster is needed. See llmwiki/06-testing.md.

Releasing

scripts/release.sh builds the linux/amd64 artifacts, tags, pushes, and publishes the GitHub release in one shot.

# 1. bump the version in Cargo.toml, commit it
# 2. write the release notes for the tag:
#      release_notes/v0.1.2/gh.md      # used as the release body + tag message
# 3. publish (tag must match Cargo.toml; tree must be clean):
./scripts/release.sh v0.1.2

# preview without tagging/pushing/publishing:
./scripts/release.sh --dry-run v0.1.2
# extra flags: --prerelease, --draft

It refuses to run on a dirty tree, on a version/tag mismatch, or if the tag or release already exists. Artifacts land in dist/<tag>/ (release + debug tarballs + SHA256SUMS).

Container images are built from the same Dockerfile and published automatically by the Docker publish workflow whenever a v* tag is pushed:

Tag pattern	`RUSTFLAGS`	Notes
`:v1.2.3`, `:1.2`	(none)	native on amd64 and arm64

Images are pushed to ghcr.io/alexpacio/pathlockd using the built-in GITHUB_TOKEN; no extra secrets are required.

License

AGPL-3.0-or-later.
