fix(runtime): eliminate deadlock + comprehensive CI overhaul#62
Open
ViewWay wants to merge 20 commits into
Open
fix(runtime): eliminate deadlock + comprehensive CI overhaul#62ViewWay wants to merge 20 commits into
ViewWay wants to merge 20 commits into
Conversation
The self-built runtime deadlocked due to three intertwined defects, all now resolved. Public API (Runtime, block_on, spawn, JoinHandle, io::*, time::*, #[hiver_main]) is unchanged — downstream hiver-http, hiver-macros, and hiver-middleware compile with zero source changes. Root causes and fixes: 1. Control-flow split deadlock (test_nested_spawn hung): the main thread's block_on drove the I/O driver while worker threads polled spawned tasks without ever touching the driver — a spawned task Pending on I/O waited for driver events that only the (blocked) main thread could produce. Replaced the self-built scheduler/driver with async-executor + async-io: the main future, spawned tasks, and the I/O reactor now all make progress in one block_on event loop. 2. GLOBAL_HANDLE test isolation leak: OnceLock::set succeeded only once, so the first test's handle leaked into every subsequent test. Switched to RwLock<Option<Handle>> and, in the new single-thread-executor design where spawned tasks run on the block_on thread, rely on the thread-local CURRENT_HANDLE (no global write needed). 3. Fire-and-forget tasks never ran: async_executor::Task cancels its task on Drop (per async-task docs). JoinHandle::Drop now calls .detach() so fire-and-forget spawns (e.g. hiver-http's per-connection handlers) keep running. Additionally, block_on uses async_io::block_on (smol's own reactor-aware driver) instead of futures_lite::future::block_on, which is a plain parker that does not drive the reactor. I/O primitives (TcpListener/TcpStream/UdpSocket) rewritten on async-net; time::Sleep rewritten on async_io::Timer — both driven by the same async-io reactor in block_on, mirroring smol's architecture. Verified: 105 runtime tests pass (lib + integration + real TCP echo); e2e http_server_e2e and annotated_app_e2e both pass (real HTTP over real sockets). curl confirms /hello -> "Hello from Hiver!" and /echo -> "echo". Full workspace builds with zero errors. Self-built scheduler/driver/raw_task code (~6000 lines of hand-written unsafe) is now functionally dead and will be removed in a follow-up; this commit preserves it to keep the change reviewable and the fix revertible.
…iver code After the async-executor/async-io backend landed (commit f58a6aa), the self-built scheduler, driver, and raw_task code became unreachable: the new block_on drives async_executor::Executor + async_io's reactor, spawn submits to the executor, and io.rs/time.rs delegate to async-net/async_io::Timer. This commit physically removes the dead code so it can no longer mislead, mask the real API, or carry hand-written unsafe (raw_task ref-counting, scheduler waker vtable with mem::forget that "avoids UAF", epoll/kqueue/io_uring drivers). Deleted (entire modules): - scheduler/{mod,local,work_stealing,handle,queue}.rs — 1636 lines - driver/{mod,epoll,kqueue,iouring,config,interest,queue}.rs — 3562 lines - task/raw_task.rs — 381 lines (manual TaskCore ref-count + Box::from_raw) - io_registry.rs — 99 lines Rewritten (keeping the public contract intact): - runtime.rs: Runtime/Handle trimmed to just `executor` + compat config; run_once/process_completions/advance_timers/flush_events dead methods and the scheduler/driver/io_registry fields removed. RuntimeConfig/Builder kept as compat shims (driver_type/io_entries deprecated to no-ops). - task.rs: clean rewrite — TaskId/gen_task_id migrated from scheduler, JoinHandle wraps async_executor::Task with Drop-detach, spawn/block_on preserved. Old TaskInner/WaitForTask/waker-vtable/TaskState removed. - lib.rs: dropped driver/scheduler/io_registry module decls and re-exports. - integration_test.rs: dropped tests referencing deleted DriverFactory/ TimerWheel/Interest/SchedulerConfig; kept std-invariant + async I/O tests. time.rs's old TimerWheel/global_timer remains (silently dead, guarded by #[allow(dead_code)]) — a follow-up can prune it; it does not affect behavior. Verified: 50 lib + 10 integration + 1 real-TCP tests pass (61 total); full workspace builds with zero errors; e2e HTTP server serves /hello and /echo correctly over a real socket.
…me end-to-end These changes are the direct companion to the runtime deadlock fix (commit f58a6aa): they route the HTTP server, the #[hiver_main] entry macro, and the timeout middleware through hiver_runtime instead of tokio, so that spawn/accept/waker/timer all go through the same async-io reactor. - server.rs: accept loop races accept against a shutdown signal (select(accept, shutdown)) for graceful shutdown; handle_connection now bounds reads by the request timeout and honors the Connection: close header (previously a close-requested client would hang). - spring_stereotype.rs: #[hiver_main] builds hiver_runtime::Runtime instead of tokio::runtime::Runtime, so the boot path runs on the framework's own runtime. - timeout.rs: middleware uses hiver_runtime::time::sleep + select instead of tokio::time::timeout (tokio's time driver is inactive under hiver_runtime). - http lib.rs/request.rs/response.rs/routes.rs: supporting fixes for the server to actually serve — Request identity extractor, GET method match, \r\n header terminator fix, unified router.route() entry for the #[get] route macros. - e2e_tests.rs + examples: add the http_server_e2e and annotated_app_e2e integration tests (spawn the real binary, assert over a real socket) and register the example binaries. - retry/router/spel/reactor: additional test coverage and the unified route registration that the route macros depend on. - Cargo.lock + ci.yml: pick up the new async-net dependency; CI runs the socket/waker/e2e integration tests. Verified: cargo build --workspace (0 errors); hiver-runtime + hiver-http + hiver-macros + hiver-middleware = 285 tests pass; e2e HTTP /hello and /echo serve correctly over a real socket.
Introduces a #[hiver::test] attribute macro (mirroring #[tokio::test]) that wraps an async fn test body in hiver_runtime::Runtime::block_on, so the test's async code runs against the framework's own reactor — not tokio's. This makes the tokio→hiver-runtime test migration a mechanical find/replace rather than per-test manual wrapping. Migration pattern (apply per crate): - dev-dependency: hiver-macros - #[tokio::test] → #[hiver_macros::test] (fully-qualified, to avoid shadowing the built-in #[test] used by sync tests) - tokio::time::sleep → hiver_runtime::time::sleep hiver-resilience migrated as the reference: - Production code: Timeout::call and the standalone timeout() fn now race against hiver_runtime::time::sleep (tokio::time::timeout would never resolve under hiver_runtime). Retry's backoff sleep also switched. - Tests: 10 #[tokio::test] → #[hiver_macros::test]; tokio::time::sleep → hiver_runtime::time::sleep. Verified: hiver-resilience 81 tests pass (including 10 migrated async tests on hiver-runtime); workspace builds clean.
Adds a `io-uring` feature flag to hiver-runtime and a documented `iouring` module that is the integration point for a future Linux-only thread-per-core io_uring reactor (modeled after monoio). The feature is inert by default and on non-Linux platforms (the module only compiles under `cfg(all(target_os="linux", feature="io-uring"))`). This preserves the "self-built high-performance runtime" narrative for latency-sensitive Linux deployments while keeping the default path on the stable async-io reactor. The concrete `IoUringDriver` is not yet implemented (it needs a Linux environment to build/benchmark); Cargo.toml and the module doc spell out the implementation plan so a Linux contributor has a clear integration point. No behavior change: default builds are identical; the feature compiles cleanly on and off (on macOS it is a no-op).
…est]
Mechanical migration of pure-logic test crates to run on hiver-runtime:
- hiver-ldap: 40 #[tokio::test] → #[hiver_macros::test] (0 prod tokio refs)
- hiver-modulith: 3 tests migrated
- hiver-cloud-bus: 6 tests migrated
- hiver-grpc: 6 tests migrated; test_builder_lazy KEPT on #[tokio::test]
because it exercises the tonic/hyper client whose internals require the
tokio runtime.
Migration rule established: tests that use tokio-based libraries (tonic,
hyper, sqlx, ...) must stay on #[tokio::test]; only pure async-logic tests
move to #[hiver_macros::test]. tokio::sync::{RwLock,mpsc,oneshot} stay (they
don't need a reactor).
Verified: ldap 154 + grpc 25 + modulith 22 + cloud-bus 11 tests pass.
…macros::test] Batch migration: - hiver-kafka: 7 tests → #[hiver_macros::test] (128 pass) - hiver-websocket-stomp: 17 tests (37 pass) - hiver-graphql: 13 tests (23 pass) - hiver-session: 16 tests (25 pass); test_session_expiration's tokio::time::sleep → hiver_runtime::time::sleep (the tokio time driver is inactive under #[hiver::test]). Verified: all 4 crates pass on hiver-runtime.
- hiver-events: 23 tests migrated (tokio::time::sleep → hiver_runtime::time::sleep in test code); 2 tests (test_publish_async, test_context_event) kept on #[tokio::test] because they exercise production code using tokio::task::spawn_blocking. 95 pass. - hiver-cache: 31 tests migrated (54 pass). - hiver-tx: 24 tests migrated (52 pass).
This is a CI-only overhaul — no Rust code touched. Fix broken checks (were cosmetic, never gated merges): - semver.yml: fix missing `echo` on 4 summary lines (shell parse error), drop `--fail-on-error || true` (made it always pass), compute a real API diff instead of posting the new listing labeled "diff". - check-workspace-deps.yml: actually `exit 1` on violations (previously always printed "✅" and exited 0). Remove obsolete files: - debug-sigsegv.yml: the SIGSEGV/deadlock it diagnosed is fixed (commits f58a6aa, 8017ad0). - auto-publish.yml: disabled, trigger never fired, conflicts with release.yml (tag-based publisher). Harden ci.yml: - checkout@v4 -> @v6, add permissions/concurrency/timeout-minutes/ paths-ignore. - New `deny` job (cargo-deny against deny.toml, which had no consumer). - New `msrv` job (verify Cargo.toml rust-version = "1.87"). Standardize caching + versions: - Replace fragile 3-step manual actions/cache with Swatinem/rust-cache@v2 in benchmark/codeql/coverage/docs/outdated/semver. - codecov-action@v4 -> @v5 in coverage.yml. Sync .github/workflows/README.md with reality (10 workflows, removed the 5 phantom files quality/linux/macos/windows/format.yml).
examples/Cargo.toml declares [[bin]] targets for annotated_app_example
and http_server_example, but the source files were never committed
(both untracked since the branch was created). This broke any CI job
that compiles the whole examples crate — notably codeql.yml's
`cargo build --all --all-features` (failed in run 27872394524):
error: can't find bin `annotated_app_example` at path
`examples/src/annotated_app_example.rs`
error: can't find bin `http_server_example` at path
`examples/src/http_server_example.rs`
Note: ci.yml's `test` job also builds http_server_example for the e2e
test, so it would have failed too once it reached that step.
The check-workspace-deps job (made gating in the prior commit) found 18
crates declaring direct dependency versions instead of workspace = true.
Migrate them so the check passes and the policy is enforced.
Root Cargo.toml [workspace.dependencies]:
- Add 14 deps that were missing: lapin, rdkafka, flate2, brotli, tonic,
mongodb, redis, deadpool, deadpool-redis, rmp-serde, async-graphql,
lettre, ldap3, testcontainers-modules.
Crates migrated (direct version → { workspace = true, ... }):
hiver-amqp hiver-cache hiver-data-commons hiver-data-mongodb
hiver-data-orm hiver-data-rdbc hiver-data-redis hiver-graphql
hiver-grpc hiver-kafka hiver-ldap hiver-mail hiver-middleware
hiver-resilience hiver-retry-macros hiver-state-machine hiver-test
hiver-ws
check-workspace-deps.yml rule fix:
- Was grepping `.workspace = true` globally → ~13 false positives on
legitimate package-metadata shorthand (version.workspace = true etc).
- Now scans only inside [dependencies]/[dev-dependencies]/[build-
dependencies] tables (awk section-aware), excludes path= deps, and
uses POSIX [[:space:]] instead of \s (the \s GNU extension silently
fails on BSD awk, reporting zero violations).
Verified: cargo check --workspace --lib passes (1m42s); detection rule
now reports zero violations.
The api-diff job passed --workspace to cargo-public-api, which errored with "unexpected argument '--workspace'". cargo-public-api operates on one crate at a time (via rustdoc JSON); it has no --workspace option. The old `diff ... || true` masked the error, so the posted PR comment was always an empty diff. Fix: loop over every workspace library crate (via cargo metadata), generate a per-crate unified diff between origin/main and the PR head, and concatenate them into the comment. Triggered by run 27873124985 after my earlier refactoring of this job removed the masking `|| true` and surfaced the real error.
📦 Public API ChangesNo public API changes.
|
cargo-semver-checks dropped the --fail-on-error flag in a recent release; passing it errors with "unexpected argument" (run 27873350345). The tool fails (exit 1) on violations by default, so the flag was redundant. Now the command is just: cargo semver-checks check-release --workspace --all-features
📦 Public API ChangesNo public API changes.
|
Three follow-ups to the CI overhaul, each fixing a REAL issue the new gating jobs correctly surfaced (none were false positives): 1. cargo-deny (deny.toml): allow MPL-2.0, BSL-1.0, Zlib — three standard OSI-approved licenses from transitive deps (ascii_utils→async-graphql uses MPL-2.0, etc.) that weren't in the allow-list. The deny job is new (added in the CI overhaul); these were previously unchecked. 2. MSRV (Cargo.toml + ci.yml): bump rust-version 1.87 → 1.91. The old value was incorrect — deps require up to rustc 1.91 (alloy@2.0.5 etc.). The new MSRV job caught the mismatch on its first run. Pin ci.yml's MSRV job to 1.91 and update README. 3. fmt: `cargo +nightly fmt --all` over 13 files. Pre-existing drift (e.g. hiver-resilience had `std::fmt::Debug` that should be `fmt::Debug` under nightly rustfmt's unused-qualifications lint). Verified locally: cargo fmt --check passes, cargo check --workspace --lib passes, deny.toml/ci.yml/semver.yml all parse cleanly.
📦 Public API ChangesNo public API changes.
|
cargo-deny now also rejects Unicode-3.0 — the newer Unicode License v3 (OSI-approved, successor to Unicode-DFS-2016 which was already allowed). Used by idna, url, icu_*, alloy, mongodb, lettre. Second license that the new deny job surfaced after MPL-2.0/BSL-1.0/Zlib in the prior commit.
📦 Public API ChangesNo public API changes.
|
Iterating cargo-deny to green (licenses ok, bans ok, sources ok verified
locally; advisories needs CI's network for the DB fetch):
deny.toml [licenses] — add 4 more transitive licenses surfaced after the
prior batch:
- Unicode-3.0 (idna/url/icu_*/alloy/mongodb/lettre)
- 0BSD (quoted_printable via lettre)
- CC0-1.0 (transitive deps)
- CDLA-Permissive-2.0 (transitive deps)
Also remove Unicode-DFS-2016 (no longer matched — superseded by Unicode-3.0).
deny.toml [advisories] — sync the ignore list with ci.yml's `audit` job.
Both tools scan the same RustSec DB; previously deny had an empty ignore
list while audit ignored 13 RUSTSEC ids, so deny failed on advisories
that audit already accepted. Now both enforce the same project policy.
Unlicensed crates — hiver-benches and hiver-examples had no license
field (deny flags this as error[unlicensed]). Add
license = { workspace = true }
inheriting "MIT OR Apache-2.0" from the root, matching every other crate.
📦 Public API ChangesNo public API changes.
|
Adds `hiver_runtime::task::spawn_blocking`, the Hiver equivalent of `tokio::task::spawn_blocking`. It offloads sync/blocking closures (DB calls, file I/O, CPU-bound crypto) onto the `blocking` crate's dedicated thread pool (smol ecosystem, compatible with async-io), preventing them from stalling the reactor. Returns a `blocking::Task<R>` future that resolves to the closure's output. The `blocking = "1"` dependency was already declared but never used (dead weight); this wires it up. Exported at `hiver_runtime::spawn_blocking` and `hiver_runtime::task::spawn_blocking`. Verified: test_spawn_blocking_returns_result passes (closure runs off the async thread, result returned via Runtime::block_on).
The #[pre_authorize("hasRole('ADMIN')")] macro was a no-op passthrough — it
returned the item unchanged, generating zero enforcement. Now it wraps the
async fn body with a call to hiver_security::check_pre_authorize(&ctx, expr)
before the original body, panicking with "access denied" if the expression
evaluates to false.
The macro creates a SecurityContext::new() (unauthenticated by default) and
evaluates the SpEL-like expression against it. In a full framework integration
the context would be injected from the request (e.g. by JWT auth middleware);
for now the default context correctly denies role-based access for
unauthenticated callers.
Also: hiver-security now re-exports check_pre_authorize, PreAuthorizeOptions
at crate root (they were in the private pre_authorize module).
Verified: smoke test (examples/src/pre_auth_test.rs) confirms an
unauthenticated user is denied by #[pre_authorize("hasRole('ADMIN')")].
… variant Two channel fixes: 1. **bounded() now enforces capacity** — the old `bounded(cap)` silently discarded the capacity (passed it only to VecDeque::with_capacity as an allocation hint; send always succeeded synchronously). Now `bounded()` delegates to `async_channel::bounded(cap)`, which provides true backpressure: `send().await` parks the sender when the buffer is full. `try_send` returns `TrySendError::Full` when at capacity. Added `async-channel = "2"` as a direct dependency. 2. **try_recv now distinguishes Empty from Closed** — `RecvError` gains an `Empty` variant. Previously `try_recv` returned `RecvError::Closed` for both "channel empty with senders alive" and "channel closed", making it impossible for callers to distinguish the two states. Now it correctly returns `Empty` when the channel is temporarily empty. The hand-written `unbounded()` channel is unchanged (it works correctly). Verified: 50 runtime lib tests pass.
SimpleGateway::handle previously returned a fake "Routed to: {uri}" string
instead of forwarding the request. Now it builds a real reqwest request to
the upstream URL (route.uri + request path + query), forwards the method,
headers, and body, and returns the upstream's status/headers/body as the
GatewayResponse. Connection errors return 502 Bad Gateway.
reqwest was already a dependency of hiver-cloud. The predicate/filter/rate-
limit/circuit-breaker pipeline is preserved — it runs before the forward.
Verified: 146 cloud tests pass, zero regressions.
📦 Public API ChangesNo public API changes.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two intertwined efforts on this branch:
async-executor+async-io, removing ~6100 lines of dead code and eliminating the deadlock. Includesspawn_blocking, bounded-channel backpressure, and end-to-end HTTP server on hiver-runtime.cargo-deny+ MSRV jobs, migrate 18 crates to workspace deps, sync README with reality.Runtime changes (prior commits on branch)
f58a6aa3eliminate deadlock via async-executor/async-io backend8017ad07remove ~6100 lines of dead self-built scheduler/driver codef9d0bdb6HTTP server runs on hiver-runtime end-to-endaf079fa5add#[hiver::test]macro + migrate hiver-resilience44ca5f51reserve io-uring feature flag + Linux backend stubeabbe9b5spawn_blocking via the blocking crate77258a3fbounded channel with real backpressure + try_recv Empty variantc91199ed#[pre_authorize]now generates enforcement coded09e5b53cloud gateway actually proxies via reqwestCI overhaul (this session)
fc9b4ffadeny+msrvjobs, remove obsoletedebug-sigsegv.yml+auto-publish.yml, unify caching (Swatinem/rust-cache), codecov v4→v5ca1e1521examples/*.rssources (manifest declared them but untracked → broke CodeQL build)62597f1f[[:space:]]—\ssilently fails on BSD awk)6556f2b9cargo-public-apihas no--workspaceflag (loop per-crate)6a5223e1--fail-on-errorflag1e64603c82c48ea360a72a64CI status (verified green on this branch)
gh-pagesbaseline branch — out of scope)Notable findings the new checks exposed (now fixed)
Cargo.tomldeclaredrust-version = "1.87"but deps need 1.91 → bumpedexit 1, so 18 crates' direct version specs went unnoticed--fail-on-error || truealways passedlicensefieldTest plan
cargo check --workspace --libpasses (1m42s)cargo +nightly fmt --all -- --checkpassescargo deny check advisories licenses bans sourcespasses (licenses/bans/sources locally)62597f1fOut of scope (follow-ups)
gh-pagesbaseline branch setup