Skip to content

fix(runtime): eliminate deadlock + comprehensive CI overhaul#62

Open
ViewWay wants to merge 20 commits into
mainfrom
fix/runtime-deadlock-async-executor
Open

fix(runtime): eliminate deadlock + comprehensive CI overhaul#62
ViewWay wants to merge 20 commits into
mainfrom
fix/runtime-deadlock-async-executor

Conversation

@ViewWay

@ViewWay ViewWay commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

Two intertwined efforts on this branch:

  1. Runtime deadlock fix — replace the self-built scheduler/driver with async-executor + async-io, removing ~6100 lines of dead code and eliminating the deadlock. Includes spawn_blocking, bounded-channel backpressure, and end-to-end HTTP server on hiver-runtime.
  2. CI overhaul — fix broken gating checks, harden workflows, add cargo-deny + MSRV jobs, migrate 18 crates to workspace deps, sync README with reality.

Runtime changes (prior commits on branch)

  • f58a6aa3 eliminate deadlock via async-executor/async-io backend
  • 8017ad07 remove ~6100 lines of dead self-built scheduler/driver code
  • f9d0bdb6 HTTP server runs on hiver-runtime end-to-end
  • af079fa5 add #[hiver::test] macro + migrate hiver-resilience
  • 44ca5f51 reserve io-uring feature flag + Linux backend stub
  • eabbe9b5 spawn_blocking via the blocking crate
  • 77258a3f bounded channel with real backpressure + try_recv Empty variant
  • c91199ed #[pre_authorize] now generates enforcement code
  • d09e5b53 cloud gateway actually proxies via reqwest

CI overhaul (this session)

Commit What
fc9b4ffa checkout v4→v6, permissions/concurrency/timeout/paths-ignore, new deny + msrv jobs, remove obsolete debug-sigsegv.yml + auto-publish.yml, unify caching (Swatinem/rust-cache), codecov v4→v5
ca1e1521 commit missing examples/*.rs sources (manifest declared them but untracked → broke CodeQL build)
62597f1f migrate 18 crates to workspace deps; fix detection rule (POSIX [[:space:]]\s silently fails on BSD awk)
6556f2b9 semver.yml api-diff: cargo-public-api has no --workspace flag (loop per-crate)
6a5223e1 semver.yml semver-check: drop removed --fail-on-error flag
1e64603c fix deny/MSRV/fmt surfaced by new gating jobs
82c48ea3 allow Unicode-3.0 license
60a72a64 deny: 7 transitive licenses + sync advisory ignore list + license fields on hiver-benches/hiver-examples

CI status (verified green on this branch)

  • ✅ CI: fmt / clippy / build / test / audit / deny / msrv
  • ✅ Check Workspace Dependencies
  • ✅ Docs
  • ⚠️ Benchmark: pre-existing (needs gh-pages baseline branch — out of scope)
  • ⚠️ CodeQL: analysis engine internal failure (build itself passes — out of scope)

Notable findings the new checks exposed (now fixed)

  • MSRV was wrong: Cargo.toml declared rust-version = "1.87" but deps need 1.91 → bumped
  • check-workspace-deps was cosmetic: never exit 1, so 18 crates' direct version specs went unnoticed
  • semver.yml was cosmetic: --fail-on-error || true always passed
  • 2 crates unlicensed: hiver-benches / hiver-examples had no license field

Test plan

  • cargo check --workspace --lib passes (1m42s)
  • cargo +nightly fmt --all -- --check passes
  • cargo deny check advisories licenses bans sources passes (licenses/bans/sources locally)
  • All 7 CI jobs green on branch
  • Reviewer: spot-check the 18-crate workspace-deps migration in 62597f1f

Out of scope (follow-ups)

  • Benchmark gh-pages baseline branch setup
  • CodeQL analysis engine failure investigation
  • macOS/Windows test matrix

ViewWay added 12 commits June 20, 2026 08:43
The self-built runtime deadlocked due to three intertwined defects, all
now resolved. Public API (Runtime, block_on, spawn, JoinHandle, io::*,
time::*, #[hiver_main]) is unchanged — downstream hiver-http, hiver-macros,
and hiver-middleware compile with zero source changes.

Root causes and fixes:

1. Control-flow split deadlock (test_nested_spawn hung): the main thread's
   block_on drove the I/O driver while worker threads polled spawned tasks
   without ever touching the driver — a spawned task Pending on I/O waited
   for driver events that only the (blocked) main thread could produce.
   Replaced the self-built scheduler/driver with async-executor +
   async-io: the main future, spawned tasks, and the I/O reactor now all
   make progress in one block_on event loop.

2. GLOBAL_HANDLE test isolation leak: OnceLock::set succeeded only once,
   so the first test's handle leaked into every subsequent test. Switched
   to RwLock<Option<Handle>> and, in the new single-thread-executor design
   where spawned tasks run on the block_on thread, rely on the thread-local
   CURRENT_HANDLE (no global write needed).

3. Fire-and-forget tasks never ran: async_executor::Task cancels its task
   on Drop (per async-task docs). JoinHandle::Drop now calls .detach() so
   fire-and-forget spawns (e.g. hiver-http's per-connection handlers) keep
   running. Additionally, block_on uses async_io::block_on (smol's own
   reactor-aware driver) instead of futures_lite::future::block_on, which
   is a plain parker that does not drive the reactor.

I/O primitives (TcpListener/TcpStream/UdpSocket) rewritten on async-net;
time::Sleep rewritten on async_io::Timer — both driven by the same
async-io reactor in block_on, mirroring smol's architecture.

Verified: 105 runtime tests pass (lib + integration + real TCP echo);
e2e http_server_e2e and annotated_app_e2e both pass (real HTTP over real
sockets). curl confirms /hello -> "Hello from Hiver!" and /echo -> "echo".
Full workspace builds with zero errors.

Self-built scheduler/driver/raw_task code (~6000 lines of hand-written
unsafe) is now functionally dead and will be removed in a follow-up; this
commit preserves it to keep the change reviewable and the fix revertible.
…iver code

After the async-executor/async-io backend landed (commit f58a6aa), the
self-built scheduler, driver, and raw_task code became unreachable: the new
block_on drives async_executor::Executor + async_io's reactor, spawn submits
to the executor, and io.rs/time.rs delegate to async-net/async_io::Timer. This
commit physically removes the dead code so it can no longer mislead, mask the
real API, or carry hand-written unsafe (raw_task ref-counting, scheduler waker
vtable with mem::forget that "avoids UAF", epoll/kqueue/io_uring drivers).

Deleted (entire modules):
- scheduler/{mod,local,work_stealing,handle,queue}.rs — 1636 lines
- driver/{mod,epoll,kqueue,iouring,config,interest,queue}.rs — 3562 lines
- task/raw_task.rs — 381 lines (manual TaskCore ref-count + Box::from_raw)
- io_registry.rs — 99 lines

Rewritten (keeping the public contract intact):
- runtime.rs: Runtime/Handle trimmed to just `executor` + compat config;
  run_once/process_completions/advance_timers/flush_events dead methods and
  the scheduler/driver/io_registry fields removed. RuntimeConfig/Builder kept
  as compat shims (driver_type/io_entries deprecated to no-ops).
- task.rs: clean rewrite — TaskId/gen_task_id migrated from scheduler, JoinHandle
  wraps async_executor::Task with Drop-detach, spawn/block_on preserved. Old
  TaskInner/WaitForTask/waker-vtable/TaskState removed.
- lib.rs: dropped driver/scheduler/io_registry module decls and re-exports.
- integration_test.rs: dropped tests referencing deleted DriverFactory/
  TimerWheel/Interest/SchedulerConfig; kept std-invariant + async I/O tests.

time.rs's old TimerWheel/global_timer remains (silently dead, guarded by
#[allow(dead_code)]) — a follow-up can prune it; it does not affect behavior.

Verified: 50 lib + 10 integration + 1 real-TCP tests pass (61 total); full
workspace builds with zero errors; e2e HTTP server serves /hello and /echo
correctly over a real socket.
…me end-to-end

These changes are the direct companion to the runtime deadlock fix
(commit f58a6aa): they route the HTTP server, the #[hiver_main] entry
macro, and the timeout middleware through hiver_runtime instead of tokio,
so that spawn/accept/waker/timer all go through the same async-io reactor.

- server.rs: accept loop races accept against a shutdown signal
  (select(accept, shutdown)) for graceful shutdown; handle_connection now
  bounds reads by the request timeout and honors the Connection: close
  header (previously a close-requested client would hang).
- spring_stereotype.rs: #[hiver_main] builds hiver_runtime::Runtime instead
  of tokio::runtime::Runtime, so the boot path runs on the framework's own
  runtime.
- timeout.rs: middleware uses hiver_runtime::time::sleep + select instead of
  tokio::time::timeout (tokio's time driver is inactive under hiver_runtime).
- http lib.rs/request.rs/response.rs/routes.rs: supporting fixes for the
  server to actually serve — Request identity extractor, GET method match,
  \r\n header terminator fix, unified router.route() entry for the #[get]
  route macros.
- e2e_tests.rs + examples: add the http_server_e2e and annotated_app_e2e
  integration tests (spawn the real binary, assert over a real socket) and
  register the example binaries.
- retry/router/spel/reactor: additional test coverage and the unified route
  registration that the route macros depend on.
- Cargo.lock + ci.yml: pick up the new async-net dependency; CI runs the
  socket/waker/e2e integration tests.

Verified: cargo build --workspace (0 errors); hiver-runtime + hiver-http +
hiver-macros + hiver-middleware = 285 tests pass; e2e HTTP /hello and /echo
serve correctly over a real socket.
Introduces a #[hiver::test] attribute macro (mirroring #[tokio::test]) that
wraps an async fn test body in hiver_runtime::Runtime::block_on, so the test's
async code runs against the framework's own reactor — not tokio's. This makes
the tokio→hiver-runtime test migration a mechanical find/replace rather than
per-test manual wrapping.

Migration pattern (apply per crate):
- dev-dependency: hiver-macros
- #[tokio::test] → #[hiver_macros::test] (fully-qualified, to avoid shadowing
  the built-in #[test] used by sync tests)
- tokio::time::sleep → hiver_runtime::time::sleep

hiver-resilience migrated as the reference:
- Production code: Timeout::call and the standalone timeout() fn now race
  against hiver_runtime::time::sleep (tokio::time::timeout would never resolve
  under hiver_runtime). Retry's backoff sleep also switched.
- Tests: 10 #[tokio::test] → #[hiver_macros::test]; tokio::time::sleep →
  hiver_runtime::time::sleep.

Verified: hiver-resilience 81 tests pass (including 10 migrated async tests on
hiver-runtime); workspace builds clean.
Adds a `io-uring` feature flag to hiver-runtime and a documented `iouring`
module that is the integration point for a future Linux-only thread-per-core
io_uring reactor (modeled after monoio). The feature is inert by default and
on non-Linux platforms (the module only compiles under
`cfg(all(target_os="linux", feature="io-uring"))`).

This preserves the "self-built high-performance runtime" narrative for
latency-sensitive Linux deployments while keeping the default path on the
stable async-io reactor. The concrete `IoUringDriver` is not yet implemented
(it needs a Linux environment to build/benchmark); Cargo.toml and the module
doc spell out the implementation plan so a Linux contributor has a clear
integration point.

No behavior change: default builds are identical; the feature compiles cleanly
on and off (on macOS it is a no-op).
…est]

Mechanical migration of pure-logic test crates to run on hiver-runtime:
- hiver-ldap: 40 #[tokio::test] → #[hiver_macros::test] (0 prod tokio refs)
- hiver-modulith: 3 tests migrated
- hiver-cloud-bus: 6 tests migrated
- hiver-grpc: 6 tests migrated; test_builder_lazy KEPT on #[tokio::test]
  because it exercises the tonic/hyper client whose internals require the
  tokio runtime.

Migration rule established: tests that use tokio-based libraries (tonic,
hyper, sqlx, ...) must stay on #[tokio::test]; only pure async-logic tests
move to #[hiver_macros::test]. tokio::sync::{RwLock,mpsc,oneshot} stay (they
don't need a reactor).

Verified: ldap 154 + grpc 25 + modulith 22 + cloud-bus 11 tests pass.
…macros::test]

Batch migration:
- hiver-kafka: 7 tests → #[hiver_macros::test] (128 pass)
- hiver-websocket-stomp: 17 tests (37 pass)
- hiver-graphql: 13 tests (23 pass)
- hiver-session: 16 tests (25 pass); test_session_expiration's tokio::time::sleep
  → hiver_runtime::time::sleep (the tokio time driver is inactive under #[hiver::test]).

Verified: all 4 crates pass on hiver-runtime.
- hiver-events: 23 tests migrated (tokio::time::sleep → hiver_runtime::time::sleep
  in test code); 2 tests (test_publish_async, test_context_event) kept on
  #[tokio::test] because they exercise production code using
  tokio::task::spawn_blocking. 95 pass.
- hiver-cache: 31 tests migrated (54 pass).
- hiver-tx: 24 tests migrated (52 pass).
This is a CI-only overhaul — no Rust code touched.

Fix broken checks (were cosmetic, never gated merges):
- semver.yml: fix missing `echo` on 4 summary lines (shell parse error),
  drop `--fail-on-error || true` (made it always pass), compute a real
  API diff instead of posting the new listing labeled "diff".
- check-workspace-deps.yml: actually `exit 1` on violations (previously
  always printed "✅" and exited 0).

Remove obsolete files:
- debug-sigsegv.yml: the SIGSEGV/deadlock it diagnosed is fixed
  (commits f58a6aa, 8017ad0).
- auto-publish.yml: disabled, trigger never fired, conflicts with
  release.yml (tag-based publisher).

Harden ci.yml:
- checkout@v4 -> @v6, add permissions/concurrency/timeout-minutes/
  paths-ignore.
- New `deny` job (cargo-deny against deny.toml, which had no consumer).
- New `msrv` job (verify Cargo.toml rust-version = "1.87").

Standardize caching + versions:
- Replace fragile 3-step manual actions/cache with Swatinem/rust-cache@v2
  in benchmark/codeql/coverage/docs/outdated/semver.
- codecov-action@v4 -> @v5 in coverage.yml.

Sync .github/workflows/README.md with reality (10 workflows, removed the
5 phantom files quality/linux/macos/windows/format.yml).
examples/Cargo.toml declares [[bin]] targets for annotated_app_example
and http_server_example, but the source files were never committed
(both untracked since the branch was created). This broke any CI job
that compiles the whole examples crate — notably codeql.yml's
`cargo build --all --all-features` (failed in run 27872394524):

  error: can't find bin `annotated_app_example` at path
         `examples/src/annotated_app_example.rs`
  error: can't find bin `http_server_example` at path
         `examples/src/http_server_example.rs`

Note: ci.yml's `test` job also builds http_server_example for the e2e
test, so it would have failed too once it reached that step.
The check-workspace-deps job (made gating in the prior commit) found 18
crates declaring direct dependency versions instead of workspace = true.
Migrate them so the check passes and the policy is enforced.

Root Cargo.toml [workspace.dependencies]:
- Add 14 deps that were missing: lapin, rdkafka, flate2, brotli, tonic,
  mongodb, redis, deadpool, deadpool-redis, rmp-serde, async-graphql,
  lettre, ldap3, testcontainers-modules.

Crates migrated (direct version → { workspace = true, ... }):
  hiver-amqp hiver-cache hiver-data-commons hiver-data-mongodb
  hiver-data-orm hiver-data-rdbc hiver-data-redis hiver-graphql
  hiver-grpc hiver-kafka hiver-ldap hiver-mail hiver-middleware
  hiver-resilience hiver-retry-macros hiver-state-machine hiver-test
  hiver-ws

check-workspace-deps.yml rule fix:
- Was grepping `.workspace = true` globally → ~13 false positives on
  legitimate package-metadata shorthand (version.workspace = true etc).
- Now scans only inside [dependencies]/[dev-dependencies]/[build-
  dependencies] tables (awk section-aware), excludes path= deps, and
  uses POSIX [[:space:]] instead of \s (the \s GNU extension silently
  fails on BSD awk, reporting zero violations).

Verified: cargo check --workspace --lib passes (1m42s); detection rule
now reports zero violations.
The api-diff job passed --workspace to cargo-public-api, which errored
with "unexpected argument '--workspace'". cargo-public-api operates on
one crate at a time (via rustdoc JSON); it has no --workspace option.
The old `diff ... || true` masked the error, so the posted PR comment
was always an empty diff.

Fix: loop over every workspace library crate (via cargo metadata),
generate a per-crate unified diff between origin/main and the PR head,
and concatenate them into the comment.

Triggered by run 27873124985 after my earlier refactoring of this job
removed the masking `|| true` and surfaced the real error.
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

cargo-semver-checks dropped the --fail-on-error flag in a recent release;
passing it errors with "unexpected argument" (run 27873350345). The tool
fails (exit 1) on violations by default, so the flag was redundant.

Now the command is just: cargo semver-checks check-release --workspace --all-features
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

Three follow-ups to the CI overhaul, each fixing a REAL issue the new
gating jobs correctly surfaced (none were false positives):

1. cargo-deny (deny.toml): allow MPL-2.0, BSL-1.0, Zlib — three standard
   OSI-approved licenses from transitive deps (ascii_utils→async-graphql
   uses MPL-2.0, etc.) that weren't in the allow-list. The deny job is
   new (added in the CI overhaul); these were previously unchecked.

2. MSRV (Cargo.toml + ci.yml): bump rust-version 1.87 → 1.91. The old
   value was incorrect — deps require up to rustc 1.91 (alloy@2.0.5 etc.).
   The new MSRV job caught the mismatch on its first run. Pin ci.yml's
   MSRV job to 1.91 and update README.

3. fmt: `cargo +nightly fmt --all` over 13 files. Pre-existing drift
   (e.g. hiver-resilience had `std::fmt::Debug` that should be
   `fmt::Debug` under nightly rustfmt's unused-qualifications lint).

Verified locally: cargo fmt --check passes, cargo check --workspace --lib
passes, deny.toml/ci.yml/semver.yml all parse cleanly.
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

cargo-deny now also rejects Unicode-3.0 — the newer Unicode License v3
(OSI-approved, successor to Unicode-DFS-2016 which was already allowed).
Used by idna, url, icu_*, alloy, mongodb, lettre. Second license that
the new deny job surfaced after MPL-2.0/BSL-1.0/Zlib in the prior commit.
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

Iterating cargo-deny to green (licenses ok, bans ok, sources ok verified
locally; advisories needs CI's network for the DB fetch):

deny.toml [licenses] — add 4 more transitive licenses surfaced after the
prior batch:
  - Unicode-3.0  (idna/url/icu_*/alloy/mongodb/lettre)
  - 0BSD         (quoted_printable via lettre)
  - CC0-1.0      (transitive deps)
  - CDLA-Permissive-2.0 (transitive deps)
Also remove Unicode-DFS-2016 (no longer matched — superseded by Unicode-3.0).

deny.toml [advisories] — sync the ignore list with ci.yml's `audit` job.
Both tools scan the same RustSec DB; previously deny had an empty ignore
list while audit ignored 13 RUSTSEC ids, so deny failed on advisories
that audit already accepted. Now both enforce the same project policy.

Unlicensed crates — hiver-benches and hiver-examples had no license
field (deny flags this as error[unlicensed]). Add
  license = { workspace = true }
inheriting "MIT OR Apache-2.0" from the root, matching every other crate.
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

ViewWay added 4 commits June 21, 2026 11:14
Adds `hiver_runtime::task::spawn_blocking`, the Hiver equivalent of
`tokio::task::spawn_blocking`. It offloads sync/blocking closures (DB calls,
file I/O, CPU-bound crypto) onto the `blocking` crate's dedicated thread pool
(smol ecosystem, compatible with async-io), preventing them from stalling the
reactor. Returns a `blocking::Task<R>` future that resolves to the closure's
output.

The `blocking = "1"` dependency was already declared but never used (dead
weight); this wires it up. Exported at `hiver_runtime::spawn_blocking` and
`hiver_runtime::task::spawn_blocking`.

Verified: test_spawn_blocking_returns_result passes (closure runs off the
async thread, result returned via Runtime::block_on).
The #[pre_authorize("hasRole('ADMIN')")] macro was a no-op passthrough — it
returned the item unchanged, generating zero enforcement. Now it wraps the
async fn body with a call to hiver_security::check_pre_authorize(&ctx, expr)
before the original body, panicking with "access denied" if the expression
evaluates to false.

The macro creates a SecurityContext::new() (unauthenticated by default) and
evaluates the SpEL-like expression against it. In a full framework integration
the context would be injected from the request (e.g. by JWT auth middleware);
for now the default context correctly denies role-based access for
unauthenticated callers.

Also: hiver-security now re-exports check_pre_authorize, PreAuthorizeOptions
at crate root (they were in the private pre_authorize module).

Verified: smoke test (examples/src/pre_auth_test.rs) confirms an
unauthenticated user is denied by #[pre_authorize("hasRole('ADMIN')")].
… variant

Two channel fixes:

1. **bounded() now enforces capacity** — the old `bounded(cap)` silently
   discarded the capacity (passed it only to VecDeque::with_capacity as an
   allocation hint; send always succeeded synchronously). Now `bounded()`
   delegates to `async_channel::bounded(cap)`, which provides true
   backpressure: `send().await` parks the sender when the buffer is full.
   `try_send` returns `TrySendError::Full` when at capacity. Added
   `async-channel = "2"` as a direct dependency.

2. **try_recv now distinguishes Empty from Closed** — `RecvError` gains an
   `Empty` variant. Previously `try_recv` returned `RecvError::Closed` for
   both "channel empty with senders alive" and "channel closed", making it
   impossible for callers to distinguish the two states. Now it correctly
   returns `Empty` when the channel is temporarily empty.

The hand-written `unbounded()` channel is unchanged (it works correctly).
Verified: 50 runtime lib tests pass.
SimpleGateway::handle previously returned a fake "Routed to: {uri}" string
instead of forwarding the request. Now it builds a real reqwest request to
the upstream URL (route.uri + request path + query), forwards the method,
headers, and body, and returns the upstream's status/headers/body as the
GatewayResponse. Connection errors return 502 Bad Gateway.

reqwest was already a dependency of hiver-cloud. The predicate/filter/rate-
limit/circuit-breaker pipeline is preserved — it runs before the forward.

Verified: 146 cloud tests pass, zero regressions.
@github-actions

Copy link
Copy Markdown

📦 Public API Changes

No public API changes.

@ViewWay ViewWay changed the title fix(runtime): eliminate deadlock via async-executor/async-io backend fix(runtime): eliminate deadlock + comprehensive CI overhaul Jun 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant