Skip to content

audit: sync code that could be async in fbuild-deploy (sub-issue of #813) #819

Description

@zackees

Intro

This is the fbuild-deploy slice of the whole-app async migration meta tracked
in #813 — the goal is to put every flash-tool subprocess, every USB
re-enumeration poll, and the post-deploy port-recovery loop on the daemon's
shared tokio runtime so tokio-console sees them and so a single shutdown signal
cancels them. This audit COMPLEMENTS #804 (the "blocking operations with no
timeout" audit for the same crate): every flash-tool spawn already has a
timeout via fbuild_core::subprocess::run_command, but the spawn itself is
synchronous and runs on a spawn_blocking worker thread today, not on the
tokio runtime. The two fixes compose; the ideal end state is "async with
deadline".

The end state for fbuild-deploy is: Deployer::deploy(...) returns an
async fn future, every flash tool is invoked through
tokio::process::Command::output().await, every poll loop is
tokio::time::sleep, and the daemon's
crates/fbuild-daemon/src/handlers/operations/deploy.rs drops the
tokio::task::spawn_blocking wrappers it currently uses to bridge to the
sync trait (deploy.rs:418 for deploy(...), deploy.rs:772 for
post_deploy_recovery). Those spawn_blocking calls are the smoking gun
that the entire crate is sync.

The Deployer trait question

crates/fbuild-deploy/src/lib.rs:95-136 defines a sync pub trait Deployer
with fn deploy(...) -> Result<DeploymentResult> plus
fn post_deploy_recovery(&self, port: &str) -> Result<()> whose default impl
is a 3-second serialport::new(...).open() poll with
std::thread::sleep(100ms) between probes.

Should this become async fn? Yes — that's the highest-leverage single
change in this crate.

Pros:

  • Runtime unity: today deploy.rs:418-748 wraps the entire deployer.deploy(...)
    call in tokio::task::spawn_blocking, which means the work is invisible to
    tokio-console (a single non-tokio worker, no per-region task spans, no
    hierarchical trace context). Making the trait async lets every flash region
    write be a tokio::task whose name shows up in the console.
  • The daemon already pays for one runtime; reusing it removes the
    spawn_blocking budget pressure (default tokio blocking pool is 512
    threads, but a stuck flash tool holds one for ~60 s, and 4-8 simultaneous
    flashes on a CI matrix would each block one).
  • Cancellation: today a daemon shutdown can't abort a stuck teensy_loader_cli
    retry loop (teensy/flash.rs:90-175) — spawn_blocking is uncancellable.
    Switching to tokio::process::Command::output().await makes the future
    cancellable, so the daemon's tokio::sync::watch shutdown signal collapses
    in-flight deploys cleanly.
  • post_deploy_recovery's 3s sync serialport::open retry +
    std::thread::sleep poll currently runs on a blocking thread; an async
    version with tokio::time::sleep would yield to other handlers between
    probes.
  • Per-region write_bin_to_flash calls inside the native espflash path
    (esp32_native/write.rs:108-148) could each be a spawn_blocking task with
    a parent span — observable in tokio-console as nested work units.

Cons:

  • Cascade: four sync deployer impls today
    (Esp32Deployer/esp32/deployer.rs:709, AvrDeployer/avr.rs:86,
    LpcDeployer/lpc.rs:165, TeensyDeployer/teensy/mod.rs:148), plus the
    ObservableDeployer and DefaultRecoveryDeployer test types in
    lib.rs:199-247, all need to become async fn. Total: 4 production impls
    • 2 test impls.
  • async_trait vs native: the crate already depends on async-trait via
    Cargo.toml:28. Either route is viable; native async fn in traits is
    stable as of 1.75 (MSRV 1.94.1 satisfies it). Object-safety: the daemon
    uses Box<dyn Deployer> (see deploy.rs:424, deploy.rs:651-652,
    lib.rs:253), which native async fn in traits does not currently make
    object-safe — so async-trait is the realistic choice here unless we
    switch the daemon to a non-dyn dispatch.
  • espflash crate is sync. espflash 4.x exposes
    Flasher::write_bin_to_flash(...) as a fully sync API atop
    serialport::open_native (see esp32_native/write.rs:88-148 and
    esp32_native/verify.rs:60-110). It cannot itself become async — the
    calls inside an async fn deploy(...) must remain wrapped in
    tokio::task::spawn_blocking. That's fine: this audit only asks "could
    this be async on the runtime?" — for espflash the honest answer is "wrap
    the sync call in spawn_blocking from inside the async deploy method,"
    which still gives tokio-console a parent task to attach the work to.
  • fbuild_core::subprocess::run_command is the choke point. It's used by
    every non-espflash deployer (avrdude, esptool, lpc21isp, teensy_loader_cli)
    and is sync (running_process::NativeProcess::wait(),
    fbuild-core/src/subprocess.rs:182). Either give it an
    async fn run_command_async(...) sibling routed through
    tokio::process::Command, or wrap each call in spawn_blocking at the
    deploy callsite. The former is the right end-state and is the same fix
    recommended by the fbuild-core audit slice of audit: go fully async — whole-app tokio runtime sharing for tokio-console (meta) #813.

Recommended sequencing:

  1. Land fbuild_core::subprocess::run_command_async (async sibling using
    tokio::process::Command, same ToolOutput return shape, same timeout
    semantics). This is a no-op for sync callers but unblocks every deployer.
  2. Switch the Deployer trait to #[async_trait] pub trait Deployer with
    async fn deploy(...) and async fn post_deploy_recovery(...).
  3. Migrate each deployer impl to .await the new run_command_async. The
    teensy retry loop (teensy/flash.rs:90-175) becomes
    tokio::time::sleep(backoff).await and .await-driven.
  4. Drop the two spawn_blocking wrappers in
    fbuild-daemon/src/handlers/operations/deploy.rs:418 and :772.
  5. For the native espflash path, keep the inner Flasher::* calls inside a
    per-region spawn_blocking so they're cancellable at region boundaries
    and visible to tokio-console as nested blocking tasks.

Findings

Severity File:line Current pattern Proposed async replacement Tool
CRITICAL avr.rs:123 run_command(&args_ref, None, None, Some(timeout))? — sync subprocess spawn for the full avrdude flash (up to timeout_secs, board JSON default 60s) run_command_async(...).await after Deployer::deploy becomes async fn avrdude
CRITICAL lpc.rs:217 run_command(&args_ref, ..., Some(Duration::from_secs(self.timeout_secs)))? — sync lpc21isp flash (default 60s) run_command_async(...).await lpc21isp
CRITICAL esp32/deployer.rs:330 run_command(&args_ref, ..., Some(Duration::from_secs(30)))? — sync esptool verify-flash (~6-10s) run_command_async(...).await esptool
CRITICAL esp32/deployer.rs:671 run_command(&args_ref, ..., Some(Duration::from_secs(120)))? — sync esptool write-flash for SelectiveFlash (~5-30s) run_command_async(...).await esptool
CRITICAL esp32/deployer.rs:746 run_command(&args_ref, ..., Some(Duration::from_secs(120)))? — sync esptool write-flash for FullFlash (~5-60s) run_command_async(...).await esptool
CRITICAL teensy/flash.rs:122 run_command(&args_ref, None, None, Some(attempt_timeout))? inside a retry loop of up to retries + 1 attempts — blocks one worker for (retries+1) * flash_timeout_secs worst case (default 6 × 30s = 3 min) run_command_async(...).await; surrounding loop becomes async-aware teensy_loader_cli
HIGH teensy/flash.rs:167 std::thread::sleep(Duration::from_millis(backoff_ms)) between retry attempts (default 1500 ms) tokio::time::sleep(...).await (loop control)
HIGH lib.rs:122-132 Deployer::post_deploy_recovery default impl: 3-second sync poll loop using serialport::new(...).open() + std::thread::sleep(Duration::from_millis(100)) Async port-availability probe + tokio::time::sleep(...).await (USB re-enum)
HIGH teensy/port_discovery.rs:97-124 wait_for_new_cdc_port polls serialport::available_ports() with std::thread::sleep(100ms) until timeout — called post-flash to discover the re-enumerated CDC ACM port (budget post_flash_port_discovery_secs, default 5s) tokio::task::spawn_blocking per snapshot + tokio::time::sleep(...).await between polls (USB re-enum)
HIGH teensy/halfkay_probe.rs:42-55 wait_for_disappearance polls available_ports() with std::thread::sleep(75ms) — 3s blocking after baud-134 trigger to confirm device left CDC class Same: blocking enumerate + tokio::time::sleep(...).await between polls (HID enum)
HIGH esp32_native/write.rs:88-148 espflash::flasher::Flasher::connect(...) + per-region flasher.write_bin_to_flash(...) are fully sync; entire native write path runs on a single spawn_blocking worker today Keep sync (espflash 4.x has no async API) but invoke each region inside its own tokio::task::spawn_blocking, so tokio-console sees per-region nested tasks under the parent deploy span espflash (native)
HIGH esp32_native/verify.rs:60-110 Same as write: sync Flasher::connect + sync verify, all on one blocking worker Wrap in spawn_blocking per call site after trait goes async espflash (native)
MEDIUM lib.rs:95-103 Deployer::deploy trait method is sync fn #[async_trait] async fn deploy(...) -> Result<DeploymentResult> — cascades to 4 production impls + 2 test impls + crates/fbuild-daemon/src/handlers/operations/deploy.rs:418 (drop spawn_blocking) (trait)
MEDIUM lib.rs:121-135 Deployer::post_deploy_recovery trait method is sync fn; daemon also wraps it in spawn_blocking (fbuild-daemon/.../deploy.rs:772) async fn post_deploy_recovery(...) -> Result<()> with body using tokio::time::sleep and an async port-availability probe; drop the daemon spawn_blocking (trait)
MEDIUM reset.rs:72,96,101,127,129 std::thread::sleep(Duration::from_millis(50-100)) in DTR/RTS toggle sequences for ESP32/AVR/Teensy/generic reset paths tokio::time::sleep(...).await after wrapping the serialport write calls in spawn_blocking (serialport has no async write_data_terminal_ready) (DTR/RTS)
MEDIUM reset.rs:64-69,84-108,117-129 serialport::new(port, ...).open() + sync write_data_terminal_ready / write_request_to_send calls for device reset Wrap port open + control-line writes in spawn_blocking (no async alternative in serialport), interleave with tokio::time::sleep for the sleep gaps (DTR/RTS)
MEDIUM teensy/first_byte_probe.rs:44-77 serialport::open + sync Read loop with 100ms internal serial.read() timeout for up to first_byte_timeout_secs (default 10s); runs post-flash Use tokio_serial::SerialStream for the open + tokio::io::AsyncReadExt::read with tokio::time::timeout; or keep spawn_blocking wrap with explicit tokio::time::sleep between polls (CDC byte probe)
MEDIUM teensy/soft_reboot.rs:40-69 serialport::new(port, 134).open() + std::thread::sleep(Duration::from_millis(100)) for baud-134 trigger tokio_serial::SerialStream::open or wrap open in spawn_blocking + tokio::time::sleep(...).await for the 100 ms hold (baud-134)
MEDIUM esp32_native/write.rs:69-76, esp32_native/verify.rs:60-65 serialport::new(port, 115_200).flow_control(...).timeout(...).open_native() — sync open_native() call to obtain the handle espflash will consume Cannot become async (espflash needs the sync TTYPort handle); leave but make the surrounding call inside a spawn_blocking from within an async fn so the rest of the deploy can proceed espflash (native)
MEDIUM esp32_native/write.rs:109, esp32_native/verify.rs:102 std::fs::read(&r.path)? for firmware/bootloader/partitions region bytes — can be 1-3 MB per region tokio::fs::read(&r.path).await (issued before entering the spawn_blocking that owns the espflash Flasher) (file I/O)
MEDIUM esp32/image.rs:36,56,260 std::fs::read(firmware_path)?, std::fs::read(elf_path)?, std::fs::File::open(input_path)? — used in QEMU-specific image patching + flash-image assembly tokio::fs::* once the calling deploy method is async; patch loop itself stays sync (CPU-bound) (QEMU image prep)
MEDIUM esp32/image.rs:46-50,243-260 std::fs::OpenOptions::new().write(true).open(...) + seek + write_all for the assembled flash image tokio::fs::OpenOptions + tokio::io::AsyncWriteExt (small overhead but consistent with the runtime) (QEMU image prep)
LOW size_check.rs:65 std::fs::metadata(artifact) (one stat call before deploy) tokio::fs::metadata(artifact).await for consistency once the entry point is async; impact is negligible (pre-deploy size guard)
LOW esp32/qemu.rs:63,66 std::fs::create_dir_all(parent)? + std::fs::File::create(output_path)? for QEMU image output tokio::fs::* for consistency (QEMU image output)

Not flagged (deliberate sync, out-of-scope):

  • teensy/usb_type.rs:91,99fs::read_to_string on the sibling usb_type.txt (~tens of bytes, one-shot read at deploy entry).
  • lib.rs:201,252std::sync::Mutex<Option<String>> used only inside #[cfg(test)] deployer fixtures.
  • teensy/soft_reboot.rs:87static TEST_ENV_LOCK: std::sync::Mutex<()> test-only env serialization mutex.
  • All tests.rs std::fs::read/std::fs::write calls (test fixtures).
  • containment::* interactions in fbuild-core (covered by the fbuild-core audit slice of audit: go fully async — whole-app tokio runtime sharing for tokio-console (meta) #813).

What was searched

Patterns run via Grep against crates/fbuild-deploy/src/:

  • Command::new|std::process|tokio::process — no direct hits in the crate; subprocess work is centralized through fbuild_core::subprocess::run_command (sync) — that's the choke point. No tokio::process use at all.
  • std::thread|tokio::spawn|JoinHandle|block_on|spawn_blocking — 10 hits, all std::thread::sleep (in reset.rs, teensy/*, lib.rs); zero block_on; zero spawn_blocking (the daemon does it externally).
  • serialport::|tokio_serial:: — 11 hits, all serialport::* (sync); zero tokio_serial:: use.
  • espflash::|use espflash — confirmed espflash 4.x sync API is used directly (Flasher::connect, flasher.write_bin_to_flash, no block_on shim, no async path available upstream).
  • std::fs::|tokio::fs:: — 14 production-code hits in esp32/image.rs, esp32_native/write.rs, esp32_native/verify.rs, size_check.rs, esp32/qemu.rs, teensy/usb_type.rs; zero tokio::fs:: use.
  • Mutex|RwLock|mpsc — only test fixtures + a test-env lock; nothing held across .await.
  • reqwest:: — no hits (OTA endpoints don't exist in this crate today).
  • post_deploy_recovery — one trait default in lib.rs:121-135 (sync poll loop); two daemon call sites both wrapped in spawn_blocking.

Per-deployer modules audited:

  • avr.rs — single sync run_command call site (CRITICAL).
  • lpc.rs — single sync run_command call site (CRITICAL).
  • esp32/deployer.rs — three sync run_command call sites: verify, write, selective write (3 × CRITICAL).
  • esp32/image.rs, esp32/qemu.rs — sync std::fs::* for image prep (MEDIUM/LOW).
  • esp32_native/{write,verify}.rs — sync espflash 4.x API (HIGH but bounded by upstream); sync serialport::open_native; sync std::fs::read of region binaries.
  • teensy/mod.rs + sub-modules — sync run_command retry loop (CRITICAL), sync USB re-enumeration polls (2 × HIGH), sync baud-134 + first-byte probe (2 × MEDIUM).
  • reset.rs — sync serialport::open + std::thread::sleep DTR/RTS sequences (MEDIUM).
  • lib.rs — sync Deployer trait + sync default post_deploy_recovery poll (2 × MEDIUM, the cascading items).

Daemon dispatch boundary corroborated:

  • crates/fbuild-daemon/src/handlers/operations/deploy.rs:418 — entire deploy wrapped in tokio::task::spawn_blocking.
  • crates/fbuild-daemon/src/handlers/operations/deploy.rs:772post_deploy_recovery wrapped in a second tokio::task::spawn_blocking.

Both wrappers vanish once the trait goes async.

Platform coverage note

The audit instructions list 10 candidate platforms; fbuild-deploy currently
implements four (AVR, ESP32, LPC, Teensy). RP2040 / picotool, STM32 / dfu-util,
nrf52 / pyocd, sam / bossac, silabs / Simplicity Commander, ch32v / wch-isp
have no Rust deployer in this crate today — the only RP2040 / pyocd / dfu-util
references in the workspace are the pyocd / dfu-util / picotool hook ban
named in CLAUDE.md (#694's scope) and the LPC composite-USB
recovery note in lpc.rs:1-30. New deployer impls under #694 should be filed
as async from day one rather than retrofitted.

Out-of-scope notes

  • The espflash crate is sync upstream (4.x). The honest async story for
    ESP32 native is "an async fn deploy(...) whose body uses spawn_blocking
    per region for the espflash Flasher::write_bin_to_flash call." That's
    still a strict improvement over today (one spawn_blocking for the entire
    deploy in the daemon) because each region becomes a separately cancellable,
    separately spannable tokio task.
  • running_process::NativeProcess::wait() (the underlying engine in
    fbuild_core::subprocess::run_command) is sync. The right fix is to add a
    sibling async runner using tokio::process::Command::output().await and
    port the deploy callers to it — this is also the recommendation for the
    fbuild-core audit slice.
  • Pre-deploy size check (size_check.rs) is a single metadata call —
    converting it is fine but the perf delta is below noise. Flagged LOW only.
  • This audit does not touch crates/fbuild-deploy/src/method_validation.rs
    (pure CPU, no I/O) or crates/fbuild-deploy/src/esp32/parse.rs (pure
    byte parsing).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions