From de2e1d9a97f1914db303f5f6026790a8a785dffb Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 29 May 2026 14:46:45 +0800 Subject: [PATCH] docs(sdk): codify FFI panic-safety contract at the napi/PyO3 boundary (#32) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Audit outcome for #32 (FFI panic-unwind safety, napi 2.16 / pyo3 0.23): the boundary is panic-safe today, but the invariant holds by *convention*, not structure, so it can silently regress. Documents the contract in each SDK's module doc where future FFI edits will see it. Key finding (verified against the cargo cache, not memory): napi 2.x does NOT wrap sync #[napi] bodies in catch_unwind by default — a panic there aborts the Node process on Rust >= 1.81, it is not a catchable JS error. Only #[napi] async fns (-> rejected Promise) and #[napi(catch_unwind)] are safe; ThreadsafeFunction callbacks, spawned tasks, Drop, and module init are not. PyO3 0.23 does catch #[pyfunction]/#[pymethods]/module-init bodies (-> PanicException) but not worker-thread with_gil bridges or spawned tasks. No code change: independent verification confirmed the only production panic site in each SDK is the lazy Tokio-runtime .expect(), reached from caught contexts; the genuinely-uncaught paths (spawned tasks, the Python with_gil callback bridges, no Drop impls exist) are panic-free by construction via ? / unwrap_or_else / fail-closed defaults. Converting get_runtime() to a Result would thread through ~75 call sites to defend an OS-exhaustion panic that is already caught — not worth it. --- sdk/node/src/lib.rs | 21 +++++++++++++++++++++ sdk/python/src/lib.rs | 17 +++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/sdk/node/src/lib.rs b/sdk/node/src/lib.rs index 77afb8c..e6e70fa 100644 --- a/sdk/node/src/lib.rs +++ b/sdk/node/src/lib.rs @@ -13,6 +13,27 @@ //! const result = await session.send('What files handle auth?'); //! console.log(result.text); //! ``` +//! +//! ## Panic safety at the FFI boundary +//! +//! napi 2.x does **not** wrap exported bodies in `catch_unwind` by default. A +//! Rust panic that reaches the `extern "C"` boundary aborts the whole Node +//! process (Rust ≥ 1.81) — it does *not* become a catchable JS error. Only two +//! contexts are panic-safe: a `#[napi]` **async** fn / `impl Future` (panic → +//! rejected Promise) and a sync fn explicitly tagged `#[napi(catch_unwind)]`. +//! Everything else aborts (or silently loses the panic): default **sync** +//! `#[napi]` fns, `ThreadsafeFunction` callbacks (a panic there — or a +//! return-value conversion `Err` — aborts via `napi_fatal_error` under *both* +//! `ErrorStrategy` variants), `tokio::spawn`'d task bodies (panic swallowed, +//! never surfaced), `Drop`/finalizers, and module init. +//! +//! Convention this crate follows so the boundary stays safe: never +//! `.unwrap()` / `.expect()` / `panic!` in those contexts. Propagate with `?` +//! into a `napi::Error`, or fail closed with `unwrap_or_else` inside +//! threadsafe callbacks. (Audited 2026-05: the only production panic site is +//! the lazy Tokio-runtime build in `fallback_runtime()`, reached from within +//! `#[napi]` bodies; the spawned-task and threadsafe-callback paths are +//! panic-free by construction.) #[macro_use] extern crate napi_derive; diff --git a/sdk/python/src/lib.rs b/sdk/python/src/lib.rs index 9a3c876..b76522b 100644 --- a/sdk/python/src/lib.rs +++ b/sdk/python/src/lib.rs @@ -13,6 +13,23 @@ //! result = session.send("What files handle auth?") //! print(result.text) //! ``` +//! +//! ## Panic safety at the FFI boundary +//! +//! PyO3 0.23 wraps `#[pyfunction]` / `#[pymethods]` / `#[pymodule]`-init bodies +//! in `catch_unwind`, so a panic there surfaces as a Python `PanicException` +//! (a `BaseException` subclass) rather than UB. It does **not** cover panics +//! inside `std::thread` / `tokio::spawn` task bodies, or `Python::with_gil` +//! closures invoked from a worker thread *outside* a pyfunction frame — those +//! are silently lost, and a panicking `Drop` during an unwind aborts the +//! process. +//! +//! Convention this crate follows so the boundary stays safe: the Rust→Python +//! bridges that run on tokio worker threads (`PythonCallbackHandler`, +//! `PyBudgetGuard`, `PySlashCommand`) never `.unwrap()` / `panic!`; they use +//! `.ok()` / `unwrap_or_else` and fail closed. (Audited 2026-05: the only +//! production panic site is the lazy Tokio-runtime build in `get_runtime()`, +//! reached only from caught pyfunction frames.) use a3s_code_core::commands::{ CommandContext as RustCommandContext, CommandOutput as RustCommandOutput,