Skip to content

audit: sync code that could be async in fbuild-core + foundational crates (sub-issue of #813) #816

Description

@zackees

Sub-issue of #813. Audit area: the foundational crates that every higher-level crate depends on — fbuild-core, fbuild-paths, fbuild-config, fbuild-header-scan, fbuild-library-select.

Intro

The single highest-leverage change in this audit is in crates/fbuild-core/src/subprocess.rs. Every subprocess fbuild spawns — every compiler invocation, every linker call, every esptool / avrdude / addr2line / picotool / c++filt / clang-format / dumpversion probe — routes through run_command / run_command_with_stdin / run_command_passthrough. Today all three are synchronous wrappers around running_process::NativeProcess, whose start() + wait() calls block the calling thread. From inside the daemon (a tokio runtime) every one of these calls must go through spawn_blocking or risk starving the runtime — and most call sites today do not. Converting run_command to async fn and routing it through tokio::process::Command::output().await is the cascade event that unblocks tokio-console visibility (the #813 goal) across the entire compile pipeline: the daemon's HTTP handlers stop being interleaved with multi-second blocked threads, and every downstream callable becomes async fn by force.

There are 63 call sites of these three functions across 34 files (counted via grep -rE 'run_command(_with_stdin|_passthrough)?\(' crates/). They are concentrated in fbuild-build (compilers, linkers, per-platform orchestrators), fbuild-deploy (flashers), fbuild-daemon (emulator runners), and fbuild-packages (toolchain probes). All of them become async at the leaves of the call tree; the migration is mechanical but wide.

The run_command migration proposal

Today (sync, blocking, owns no runtime)

// crates/fbuild-core/src/subprocess.rs
pub fn run_command(
    args: &[&str],
    cwd: Option<&Path>,
    env: Option<&[(&str, &str)]>,
    timeout: Option<Duration>,
) -> Result<ToolOutput> {
    let config = build_config(args, cwd, env, true, StdinMode::Null)?;
    run_captured(config, args, timeout)  // NativeProcess::start() + wait()
}

NativeProcess::start() spawns reader threads (a process-internal mini-runtime); wait(timeout) blocks until exit. From an async caller, the whole thing pegs a worker thread for the entire compile.

Proposed (async, uses caller's tokio runtime)

// crates/fbuild-core/src/subprocess.rs
pub async fn run_command(
    args: &[&str],
    cwd: Option<&Path>,
    env: Option<&[(&str, &str)]>,
    timeout: Option<Duration>,
) -> Result<ToolOutput> {
    let mut cmd = tokio::process::Command::new(args[0]);
    cmd.args(&args[1..]);
    if let Some(d) = cwd { cmd.current_dir(d); }
    apply_env(&mut cmd, args[0], env);          // existing Windows PATH-prepend + MSYS-strip logic, lifted out of build_config
    cmd.stdin(Stdio::null())
       .stdout(Stdio::piped())
       .stderr(Stdio::piped());

    // Containment: re-uses the existing tokio path
    crate::containment::tokio_spawn::configure(&mut cmd);
    let mut child = cmd.spawn().map_err(|e| spawn_err(args, e.into()))?;
    crate::containment::tokio_spawn::post_spawn(&child)?;

    let fut = child.wait_with_output();
    let output = match timeout {
        Some(d) => match tokio::time::timeout(d, fut).await {
            Ok(r) => r.map_err(|e| FbuildError::Other(format!("command {:?} failed: {}", args, e)))?,
            Err(_) => {
                // child.kill() needs &mut, but wait_with_output consumed it —
                // use the alternate shape: spawn, then drive stdout/stderr/wait in select! with timeout
                return Err(FbuildError::Timeout(format!("command timed out after {}s", d.as_secs())));
            }
        },
        None => fut.await.map_err(|e| FbuildError::Other(format!("command {:?} failed: {}", args, e)))?,
    };
    Ok(ToolOutput {
        stdout: String::from_utf8_lossy(&output.stdout).into_owned(),
        stderr: String::from_utf8_lossy(&output.stderr).into_owned(),
        exit_code: output.status.code().unwrap_or(-1),
    })
}

Containment survives unchanged

crates/fbuild-core/src/containment.rs already has a tokio_spawn submodule with configure() + post_spawn() that does exactly what spawn_contained does for std::process::Command:

  • Unix — installs pre_exec(setpgid(0, 0) + PR_SET_PDEATHSIG(SIGKILL)) via command.pre_exec(...). tokio::process::Command re-exposes CommandExt::pre_exec, so the closure is identical to the std-path one.
  • Windowspost_spawn(&tokio::process::Child) reads child.raw_handle() and calls the same windows_job::assign(...) helper that spawn_contained uses. Same Job Object, same JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE flag, same kill-on-daemon-exit semantics.

In short, the Job Object / pgroup machinery already supports tokio. The migration is wire-only.

Cascade — caller-site count

grep -rE 'run_command(_with_stdin|_passthrough)?\(' crates/ | wc -l
# 63 occurrences across 34 files

Distribution (file → count):

  • fbuild-build/: ~22 sites (compiler, linker, per-platform orchestrators, symbol_analyzer, shrink probe, script_runtime, zccache embedded)
  • fbuild-deploy/: 6 sites (avr, lpc, esp32, teensy)
  • fbuild-daemon/: 3 sites (emulator runners, avr8js npm bootstrap)
  • fbuild-packages/: 4 sites (library compiler, disk cache budget)
  • fbuild-cli/: 6 sites (pio passthrough, daemon command, build)
  • fbuild-serial/: 1 site (crash decoder addr2line)

Every one of these becomes async fn once run_command is async fn. That's the goal of #813 — once propagation reaches the daemon's HTTP/WS handlers, they no longer hide blocked work behind spawn_blocking.

Suggested migration shape: land run_command_async as a parallel function first, migrate fbuild-build/src/compiler.rs + fbuild-build/src/linker.rs (the hot path), validate tokio-console no longer shows blocked workers, then sweep the rest and delete the sync versions. The _passthrough variant only has 6 sites and can be migrated last.

Findings table

Severity Crate File:line Current pattern Proposed async replacement
CRITICAL fbuild-core subprocess.rs:81-89 run_command NativeProcess::new(...).start() + .wait(timeout) — blocking async fn run_command using tokio::process::Command::spawn + wait_with_output().await, containment via existing containment::tokio_spawn::{configure, post_spawn}
CRITICAL fbuild-core subprocess.rs:100-144 run_command_with_stdin Sync process.write_stdin(...) + .wait(...) async fn — use child.stdin.take().unwrap().write_all().await then child.wait_with_output().await. Tokio already drains stdout/stderr concurrently with the stdin write
CRITICAL fbuild-core subprocess.rs:151-174 run_command_passthrough Sync NativeProcess::start + wait async fn using inherited stdio + child.wait().await
HIGH fbuild-header-scan walker.rs:139 std::fs::read_to_string(p) inside par_iter() rayon parallel sync reads — one blocking read per source file, per LDF wave This is the per-compile hot path. Choices: (a) tokio::fs::read_to_string driven by futures::stream::iter(...).buffer_unordered(N), OR (b) keep rayon::par_iter (it's CPU+IO mixed — header tokenization is significant CPU work) and wrap the whole walk_with_state call in tokio::task::spawn_blocking at the async boundary. Recommend (b) — preserves rayon parallelism, just makes the boundary explicit
HIGH fbuild-library-select cache.rs:89,128-131,234-235,289 std::fs::read / std::fs::write / std::fs::rename for K/V cache tokio::fs::read / tokio::fs::write / tokio::fs::rename — called per-resolution and per-seed in cache_key construction. The seed-hash loop at cache.rs:231-241 reads every seed file synchronously to hash it; under tokio this should be futures::future::try_join_all over tokio::fs::read
MEDIUM fbuild-core response_file.rs:128,135,181 write_response_file std::fs::create_dir_all, std::fs::write, plus cleanup_stale_response_files walks read_dir synchronously tokio::fs::create_dir_all + tokio::fs::write. Called once per compile invocation (per-source-file) — same hot path as run_command, so worth migrating together. Cleanup loop is fine to keep sync inside spawn_blocking (rare path)
MEDIUM fbuild-config ini_parser/mod.rs:60 PlatformIOConfig::from_path std::fs::read_to_string async fn from_pathtokio::fs::read_to_string. Called once per build at boundary; lower priority but it's pure I/O and cheap to convert
MEDIUM fbuild-config board/loaders.rs:122 BoardConfig::from_boards_txt std::fs::read_to_string async fn + tokio::fs::read_to_string
MEDIUM fbuild-config board/db.rs:139 project-local board JSON read std::fs::read_to_string for <project>/boards/<id>.json tokio::fs::read_to_string. Called from get_board_defaults_with_project_dir per build
MEDIUM fbuild-config sdkconfig.rs:103,109 from_project_dir std::fs::read_to_string for sdkconfig / sdkconfig.defaults tokio::fs::read_to_string — called per ESP32 build
MEDIUM fbuild-paths lib.rs:184 read_port_from_file std::fs::read_to_string for daemon port discovery tokio::fs::read_to_string. Daemon-port discovery happens at every CLI invocation start — low-frequency but in the boot-time critical path
MEDIUM fbuild-core install_status.rs callback chain Sync Fn(InstallStatus) subscriber called from publisher thread; if subscriber blocks (e.g. acquiring a write lock or doing I/O), publisher blocks See dedicated section below
LOW fbuild-core subprocess.rs:53 link_env_for_build std::fs::create_dir_all for .lto-tmp tokio::fs::create_dir_all once caller is async. One-time-per-link, low cost — only worth converting alongside the larger sweep
LOW fbuild-core usb/data.rs:117,161 install_online_cache* std::fs::read_to_string / std::fs::read of USB vendor overlay JSON / proto.zstd One-shot startup config read for USB device naming. Out-of-scope per audit guidance, but listed for completeness — would become async if loaded from a network source in future
LOW fbuild-core emulator.rs:52 EmulatorArtifactBundle::from_build_dir std::fs::read_dir to discover firmware artifacts One-shot per emulator launch. tokio::fs::read_dir is straightforward but the call is rare and per-launch

What is NOT a finding (already correct or out of scope)

  • containment.rs::spawn_contained — explicitly the sync std-path entry point for non-tokio callers (CLI, tests). Its tokio sibling tokio_spawn::spawn_contained already exists and is what the proposed run_command_async will use.
  • containment.rs Windows Job Object globals (JOB: OnceLock<JobHandle>) — process-lifetime singletons, not a sync I/O hazard.
  • board/db.rs::BOARD_DB: OnceLock<HashMap<...>> — populated from include_dir! (compile-time-embedded data, not runtime I/O). CPU-bound JSON parse on first access; correctly sync.
  • usb/embedded.rs::VENDOR_MAP: OnceLock<HashMap<u16, String>> — same: compile-time-embedded archive, sync extraction is correct.
  • symbol_analysis/ — pure in-memory graph analysis, no I/O.
  • All tests/ and #[cfg(test)] std::fs::* calls in fbuild-paths/lib.rs, fbuild-library-select, fbuild-header-scan/walker.rs, fbuild-core/{emulator,response_file,usb/data} — test fixtures, out of scope per audit guidance.

install_status callback chain (dedicated section)

crates/fbuild-core/src/install_status.rs is the publish/subscribe channel that drives the daemon's WebSocket install-progress stream and the CLI's TTY progress bar. The current shape:

type Subscriber = Arc<dyn Fn(InstallStatus) + Send + Sync + 'static>;
static SUBSCRIBER: OnceLock<RwLock<Option<Subscriber>>> = OnceLock::new();

pub fn set_install_status_subscriber<F>(subscriber: F)
where F: Fn(InstallStatus) + Send + Sync + 'static { ... }

pub fn publish_install_status(status: InstallStatus) {
    // ... synchronously invokes the subscriber on the publisher's thread
    subscriber(status);
}

Problem

The subscriber runs on whatever thread calls publish_install_status. Today the daemon's subscriber typically forwards events into a tokio::sync::broadcast channel via a try_send (non-blocking), which is fine — but the contract doesn't require the callback to be non-blocking, and historically subscribers have done synchronous lock acquisition. As we move toward async-everywhere this becomes a footgun: a future subscriber that does block_on(some_send) from inside publish_install_status would deadlock the same tokio worker that called the publisher. The single-slot RwLock<Option<Subscriber>> is also a fanout limitation — only one consumer can listen at a time.

Recommendation

Replace the subscriber callback with a tokio::sync::broadcast::Sender<InstallStatus> directly. Rationale:

  1. Subscribers are already (in practice) shaped like "forward into a channel" — making the channel the API removes the trampoline.
  2. broadcast::Sender::send is non-blocking by design and returns the lagged-receiver count instead of blocking the publisher.
  3. Multiple async receivers (daemon WS handler, CLI progress bar, test observer) can subscribe independently — today only one subscriber slot exists, gated by RwLock<Option<...>>.
  4. No async fn callback signature is needed — the broadcast channel sidesteps the "async callback in a trait object" Rust pain entirely.

Proposed shape:

static INSTALL_STATUS_TX: OnceLock<tokio::sync::broadcast::Sender<InstallStatus>> = OnceLock::new();

pub fn subscribe_install_status() -> tokio::sync::broadcast::Receiver<InstallStatus> {
    INSTALL_STATUS_TX
        .get_or_init(|| tokio::sync::broadcast::channel(256).0)
        .subscribe()
}

pub fn publish_install_status(status: InstallStatus) {
    let tx = INSTALL_STATUS_TX.get_or_init(|| tokio::sync::broadcast::channel(256).0);
    let _ = tx.send(status);  // ignore RecvError: lagged subscribers are tolerated
}

Migration cost: the existing set_install_status_subscriber callers in the daemon and CLI become tokio::spawn loops over recv().await. Trivial migration, much cleaner contract.

Alternative (lower-cost, lower-payoff): keep Fn(InstallStatus) but stamp the documented contract as "must be non-blocking; send into a channel and return immediately." Subscriber discipline by docs only — strictly worse than the channel approach.

What was searched

  • Crates audited (crates/{fbuild-core,fbuild-paths,fbuild-config,fbuild-header-scan,fbuild-library-select}/src/**/*.rs)
  • Patterns:
    • Command::new / .output() / .wait() — subprocess concerns (only matched inside subprocess.rs and containment.rs themselves; no rogue direct spawns in the foundational crates).
    • std::fs::(read|read_to_string|write|create_dir_all|read_dir|remove_file|rename|canonicalize|metadata) — sync file I/O.
    • OnceLock / OnceCell / Mutex::new / RwLock::new / mpsc:: — sync primitives where async equivalents exist (tokio::sync::*, OnceCell async-init).
    • std::thread::spawn / JoinHandle — manual thread spawning (none found in these crates; rayon is used by fbuild-header-scan instead).
    • block_on — async-from-sync (zero hits in these 5 crates; concentrated in fbuild-python, fbuild-packages/toolchain/*, which are out of scope for this audit area).
    • set_install_status_subscriber — the callback bridge (covered in dedicated section above).
  • Caller-count grep for run_command(_with_stdin|_passthrough)?\( across the whole workspace to size the cascade (63 sites / 34 files).

Out-of-scope notes — CPU-bound work to keep sync

  • crates/fbuild-header-scan/src/scanner.rs::scan() — pure tokenization of source text. CPU-bound. Stays sync; the caller (walker.rs::walk_with_state) can stay rayon-parallel and the whole walk gets wrapped in tokio::task::spawn_blocking at the async boundary.
  • crates/fbuild-config/src/board/db.rs::get_board_db() — JSON parse of compile-time-embedded board entries. CPU-bound. Already correctly OnceLock + sync.
  • crates/fbuild-core/src/usb/embedded.rs — zstd decompress + tar extract of embedded USB vendor archive. CPU-bound. Correctly sync.
  • crates/fbuild-core/src/symbol_analysis/ — in-memory ELF / call-graph analysis. CPU-bound. Correctly sync.
  • crates/fbuild-config/src/ini_parser/parser.rs — INI tokenizer + extends resolution. CPU-bound. The file read should be async (covered in findings table), but the parse itself stays sync.
  • Drop impls in running-process and containment.rs — must stay sync (Rust requires sync Drop).
  • One-shot startup config reads in fbuild-paths::read_port_from_file are arguably also "keep sync, just call from spawn_blocking at the boundary" — flagged MEDIUM rather than CRITICAL because they're not in the per-compile hot path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions