Skip to content

audit: sync code that could be async in fbuild-packages (sub-issue of #813) #821

Description

@zackees

Sub-issue of #813fbuild-packages audit lens: convert sync code to async on the daemon's tokio runtime. Partner audit is #805 ("blocking HTTP without timeout"); this issue covers the same call sites under a different lens (async-ification). Per #813's intro: the fixes compose — the end-state is a single shared async reqwest::Client configured with .timeout(...) and .connect_timeout(...), so #805's CRITICAL findings collapse to "set a timeout on the shared client" once this issue's plumbing lands.

The crate is in surprisingly good shape on the network side — almost every HTTP call already uses async reqwest::get(...).await. The dominant async anti-pattern is not reqwest::blocking, it's the block_on(...) sandwich: every Package::ensure_installed is a sync trait method that owns an async staged_install, so each implementation does tokio::runtime::Handle::block_on(...). That's ~30 call sites in this crate alone, hides every install from tokio-console, and prevents axum from doing anything else on the worker during a 30-second toolchain download.

The shared async Client proposal

Today every HTTP call in the crate uses the convenience reqwest::get(url), which constructs a brand-new reqwest::Client per request — no connection pooling, no shared timeout config, no shared connector. The seven call sites that need this are:

  • downloader.rs:113 (get_with_retry)
  • downloader.rs:178 (open_with_retry for streaming download)
  • toolchain/clang.rs:147 (clang manifest fetch)
  • library/registry.rs:24 / :77 / :189 (PlatformIO registry: 3 calls)
  • library/arduino_api.rs:51 (the only reqwest::blocking::* in the crate)

Proposed: add crates/fbuild-packages/src/http.rs:

use std::time::Duration;
use tokio::sync::OnceCell;
use reqwest::Client;

static HTTP_CLIENT: OnceCell<Client> = OnceCell::const_new();

pub async fn client() -> &'static Client {
    HTTP_CLIENT
        .get_or_init(|| async {
            Client::builder()
                .connect_timeout(Duration::from_secs(10))
                .timeout(Duration::from_secs(60 * 10)) // big toolchains
                .pool_idle_timeout(Duration::from_secs(90))
                .user_agent(concat!("fbuild/", env!("CARGO_PKG_VERSION")))
                .build()
                .expect("reqwest client builder is infallible with our config")
        })
        .await
}

Then every call site becomes http::client().await.get(url).send().await — picks up timeouts uniformly (closes #805) and reuses TCP/TLS connections across the daemon's lifetime. The blocking feature on the workspace reqwest = { features = [..., "blocking"] } can be dropped once arduino_api.rs migrates.

OnceCell (tokio) is the right primitive — OnceLock (std) doesn't have async init, and the existing pattern in disk_cache/lease.rs:15 uses OnceLock for non-async state, so the precedent is "tokio for async init, std for sync init".

Findings table

Severity File:line Current pattern Proposed async replacement Notes
CRITICAL library/arduino_api.rs:30-66 (ensure_arduino_api) sync fn; reqwest::blocking::get(ARDUINO_API_URL) + sync std::fs::write + sync extractor::extract make ensure_arduino_api async; use http::client().await; await response body; wrap extractor::extract in spawn_blocking Only reqwest::blocking left in the crate. Called from arduino_core.rs, attiny_core.rs, arduino_mbed_core.rs framework validators — currently runs inside staged_install's validate closure, which is sync. Either widen the validate signature to FnOnce(&Path) -> BoxFuture<...> or arrange for ArduinoCore-API injection to happen outside the sync validate.
CRITICAL downloader.rs:113, :178 reqwest::get(url).await (creates a new Client per call) http::client().await.get(url).send().await Composes with #805 (timeout). Each Client allocation costs a TLS connector teardown — currently every retry allocates a fresh connector.
CRITICAL lib.rs:60-67 (block_on_package_future) tokio::task::block_in_place(|| handle.block_on(future)) delete; make Package::ensure_installed an async fn (or return BoxFuture) Bridge is invoked from ~30 call sites (see below). block_in_place is a yellow-flag — it shifts the current worker into a blocking pool slot just so it can .block_on. Tokio-console sees a gap during the entire install.
HIGH library/library_manager.rs:290-322 (ensure_libraries_sync) handle.block_on(ensure_libraries(...)) from sync entry delete the _sync wrapper; caller (build orchestrator) is already async or should be Already has an async ensure_libraries. Callers in fbuild-build should .await directly.
HIGH library/library_compiler.rs:202-261 (compile_library_with_jobs parallel path) std::thread::scope + std::sync::Mutex work-stealing queue across N OS threads JoinSet<Result<...>> of spawn_blocking tasks (compiler subprocess is sync), or a Semaphore-bounded FuturesUnordered of tokio::process::Command::output().await futures Subprocess spawn is sync today via fbuild_core::subprocess::run_command. CPU is dominated by gcc/g++ child process, not by fbuild itself, so this can stay rayon/spawn_blocking — but the std::thread::scope ties N tokio-invisible OS threads to a single library compile. The work-queue mutex must remain std::sync::Mutex (never held across .await) on the spawn_blocking path; tokio's Mutex is wrong here.
HIGH All 22 toolchain/*.rs + library/*_core.rs + library/*_framework.rs Package::ensure_installed impls each one: if let Ok(handle) = Handle::try_current() { handle.block_on(staged_install) } else { Runtime::new().block_on(staged_install) } change Package trait: async fn ensure_installed(&self) -> Result<PathBuf> (or fn ensure_installed(&self) -> BoxFuture<...> if dyn-compatibility is needed) The entire duplicated Handle::try_current / fresh-runtime fallback dance disappears. Full call-site list in the "What was searched" section. Each one is a 1-2 hour mechanical conversion once the trait flips.
HIGH lnk/resolver.rs:117 sync fetch_lnk_blob calls block_on_package_future(async { download_file(...) }) make fetch_lnk_blob (and callers up to Resolver::resolve) async; .await directly lnk/resolver.rs is in the link-blob fetch path — runs at link time on the daemon.
HIGH toolchain/esp32_metadata.rs:74-84 (resolve_toolchain_url_sync) crate::block_on_package_future(resolve_toolchain_url(...)) delete _sync wrapper; only call site is toolchain/esp32.rs:79 which is already inside an ensure_installed that should be async Pure transitive consequence of the Package trait flip.
HIGH library/esp32_framework/libs.rs:124, :132, :184, :192 nested handle.block_on(crate::downloader::download_file(...)) inside ESP32 framework lib download make the surrounding function async; await directly Currently the block_on is inside another sync function called from ensure_installed. Doubly anti-pattern.
MEDIUM downloader.rs:299-319 (download_all) tokio::spawn per URL + await each JoinHandle sequentially use futures::stream::FuturesUnordered or tokio::task::JoinSet for first-error fail-fast The current pattern works but doesn't fail-fast: if URL #1 takes 60s and URL #5 errors at 2s, we wait for URL #1 before observing the error. JoinSet::join_next would surface URL #5's error immediately.
MEDIUM library/library_manager.rs:73-93 parallel download collect loop Vec<JoinHandle> collected sequentially via handle.await JoinSet with join_next() for early-error reporting Same fail-fast argument as download_all.
MEDIUM downloader.rs:322-339 (verify_checksum, sync) sync std::fs::read + CPU hashing make callers always use the existing verify_checksum_async at :342 (uses tokio::fs::read) and delete sync version The sync version is called from staged_install (lib.rs:405) inside an async fn — i.e. blocking I/O on the runtime. Easy fix. Hashing itself can stay sync (CPU-bound).
LOW lib.rs:37-54 dir_size (called from record_install_in_disk_cache inside staged_install) sync recursive std::fs::read_dir walk on potentially large extracted toolchains wrap in tokio::task::spawn_blocking Runs once per install (post-extract), not on a hot path. Thousands of stat syscalls on large ESP-IDF tarballs.
LOW lib.rs:354, :376-380, :421, :427, :435 sync std::fs::create_dir_all / remove_dir_all / remove_file / rename / write inside staged_install (which is async) tokio::fs::* equivalents Each call is microseconds (single inode op), so this is genuine LOW. Still: a recursive remove_dir_all of a partial toolchain extract is unbounded in size.

What was searched

Patterns (all under crates/fbuild-packages/src/):

  • reqwest:: — 11 hits, all async except arduino_api.rs:51
  • Client::(new|default|builder) — 0 hits (no shared client today)
  • std::thread::spawn / JoinHandle / thread::scope — 1 hit (library_compiler.rs:206)
  • rayon:: / par_iter — 0 hits
  • std::process::Command::new / Command::new — 0 hits in this crate (subprocess is via fbuild_core::subprocess::run_command, audited under that crate's sub-issue)
  • block_on — 30+ hits across all Package::ensure_installed impls + 4 _sync wrappers + lnk/resolver.rs:117
  • Mutex::new / RwLock::new — only std::sync::Mutex for in-process throttling and rusqlite conn protection; none held across .await
  • mpsc:: — 0 hits
  • OnceLock / OnceCellOnceLock used for nonce/touch throttle (sync init, correct); no OnceCell yet
  • flate2 / tar / zip / xz2 / bzip2 / zstd — out of scope per spec

Modules audited:

  • downloader.rs, extractor.rs (extraction excluded per spec)
  • install_lock.rsalready correct (tokio::time::sleep, no std::thread::sleep)
  • lib.rs (PackageBase::staged_install, block_on_package_future)
  • library/library_manager.rs, library/library_downloader.rs, library/library_compiler.rs
  • library/arduino_api.rs, library/registry.rs
  • toolchain/{clang,avr,arm,arm_gcc8,esp32,esp32_metadata,esp8266,esp_qemu,riscv,rp2040_pqt,teensy_arm}.rs
  • Every library/*_core.rs and library/*_framework.rs ensure_installed impl

Full block_on ensure_installed call-site list (severity HIGH, one trait-flip fixes all):
toolchain/avr.rs:115, toolchain/arm.rs:106, toolchain/arm_gcc8.rs:74, toolchain/esp32.rs:169, toolchain/esp8266.rs:98, toolchain/esp_qemu.rs:172, toolchain/riscv.rs:183, toolchain/rp2040_pqt.rs:104, toolchain/teensy_arm.rs:97, library/apollo3_core.rs:131, library/arduino_core.rs:129, library/arduino_core_lpc8xx.rs:171, library/arduino_mbed_core.rs:180, library/attiny_core.rs:106, library/avr_framework.rs:177, library/ch32v_core.rs:140, library/cmsis_atmel.rs:107, library/cmsis_framework.rs:93, library/esp32_framework/mod.rs:136, library/esp32_platform.rs:178, library/esp8266_framework.rs:193, library/nrf52_core.rs:137, library/renesas_core.rs:161, library/rp2040_core.rs:161, library/sam_core.rs:116, library/samd_core.rs:120, library/silabs_core.rs:107, library/stm32_core.rs:129, library/teensy_core.rs:151.

Out-of-scope notes

  • Archive extraction (extractor.rs, flate2 / tar / zip / xz2 / bzip2 / zstd): sync crates with no async equivalents fbuild already uses. Per spec, do not rewrite — but each call from async context should be wrapped in tokio::task::spawn_blocking to vacate the tokio worker during a multi-GB ESP-IDF tarball decompress. Currently staged_install calls extractor::extract(&archive_path, &staging_path) directly on the runtime worker (lib.rs:418). That's worth a follow-up issue, but doesn't belong in this audit's scope.
  • SHA-256 hashing (sha2::Sha256 in downloader.rs:325, library_compiler.rs:13): CPU-bound, sync is correct. The verify_checksum_async wrapper does the right thing (async file read + sync hash). If invoked from async hot paths, wrap in spawn_blocking if the file is large.
  • Test code: lib.rs:502 (test-only tokio::spawn), avr.rs:364 (gated #[ignore] integration test), install_lock.rs test module — all out of scope.
  • Drop impls: install_lock.rs:163 InstallLockGuard::drop does sync std::fs::remove_dir_all — must stay sync (Drop is not async).
  • In-process Mutex (lib.rs:480-483 mark_package_touch_needed): std::sync::Mutex<HashSet<String>> for in-process throttle, never held across .await. Correct as-is.

Relationship to #805

#805 will land timeouts via the same reqwest::Client::builder() change proposed above. The clean fix is for one of these issues to land the shared http::client() module first (this issue does the lift, #805 then adds the timeout config to the builder) — recommend coordinating so they touch the same file once, not twice.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions