fix: harden flox builds for reliable nixpkgs and manifest packaging #21

Merged: limeytexan merged 3 commits into main from fix/harden-flox-builds on Jun 16, 2026

@limeytexan

Contributor

Summary

  • Pin midline-flush dependency correctly: tests/midline-flush.sh now declares its dependency on the compiled tests/midline-flush binary directly, closing a race condition exposed by parallel Nix builds.
  • Fix midline-flush timing: increased the inter-stream sleep in tests/midline-flush.c from 10µs to 10ms — 10µs was too tight under Nix sandbox scheduling pressure, causing non-deterministic line ordering and intermittent test failures.
  • Fix manifest build flag hygiene: introduce OPTFLAGS ?= -g in the Makefile so packaging builds can override to -O -static-libgcc without changing the dev default; the manifest build now uses this to avoid embedding debug-symbol references to build-time Nix store paths.
  • Add flox-build-test to make test: when flox is in PATH, make test now exercises both flox build t3 and flox build nixpkgs-t3 to catch Nix-sandbox-specific failures that don't surface in local builds; silently skips when flox is absent.

Workaround note

glibc.out has been added to the manifest's runtime-packages to satisfy the package-builder's dependency checker. This is a workaround: the package-builder should treat glibc as an implicit runtime dependency of compiled C binaries without requiring it to be declared explicitly. A follow-up issue should be filed against flox/flox to track this.

Test plan

  • make test passes locally (20/20 runs of midline-flush test)
  • flox build nixpkgs-t3 passes
  • flox build t3 passes (with result-t3-buildCache cleared)
  • make flox-build-test runs both flox builds end-to-end
  • Inside the Nix sandbox, flox-build-test correctly skips with "flox not in PATH"
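
A repeat-run check such as the 20/20 figure above could be scripted along these lines. This is a minimal sketch, not the procedure the author used: `run_repeatedly` is a hypothetical helper, and the `printf` call is only a stand-in for whatever command actually runs the midline-flush test.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and report how many runs
# produced output identical to the first run (stdout and stderr merged).
run_repeatedly() {
    runs=$1
    shift
    first=$("$@" 2>&1)
    same=1          # the first run trivially matches itself
    i=1
    while [ "$i" -lt "$runs" ]; do
        out=$("$@" 2>&1)
        [ "$out" = "$first" ] && same=$((same + 1))
        i=$((i + 1))
    done
    echo "$same/$runs"
}

# Stand-in command; substitute the real test invocation here.
run_repeatedly 20 printf 'stderr line\nstdout line\n'
```

A result below N/N indicates that at least one run differed from the first, which is the symptom the timing fix is meant to remove.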

🤖 Generated with Claude Code

limeytexan and others added 2 commits June 16, 2026 11:35
Fix three related issues that caused `flox build t3` and `flox build
nixpkgs-t3` to fail intermittently or consistently on Linux:

1. Pin midline-flush dependency on correct target (Makefile)

   The compiled `tests/midline-flush` binary must be declared as a
   dependency of `tests/midline-flush.sh` rather than
   `tests/midline-flush/run`, which is the target that invokes it.
   The indirection exposed a race condition in parallel Nix builds.

2. Fix midline-flush timing (tests/midline-flush.c)

   The 10µs sleep between flushing stderr and stdout was too short
   under Nix sandbox scheduling pressure, causing non-deterministic
   line ordering. Increased to 10ms, matching the inter-iteration
   shell sleep already in use.

3. Fix manifest build dependency and flag hygiene (Makefile, manifest)

   Introduce OPTFLAGS (default -g) so packaging builds can override
   to -O -static-libgcc without touching the dev default. The manifest
   build now passes OPTFLAGS="-O -static-libgcc" to avoid embedding
   debug-symbol references to build-time Nix store paths in the binary.
   Add glibc.out to runtime-packages as a workaround pending
   package-builder treating glibc as an implicit C runtime dependency.

   Also add `flox-build-test` to `make test`: when flox is in PATH,
   exercises both `flox build t3` and `flox build nixpkgs-t3` to catch
   Nix-sandbox-specific failures that don't surface in local builds.
   Silently skips when flox is absent (e.g. inside the Nix sandbox
   itself, or in CI environments without flox).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The manifest build command passed OPTFLAGS="-O -static-libgcc" on every
platform, but -static-libgcc is a GCC-only flag and the pure Nix sandbox on
Darwin compiles with clang, which rejects it:

    clang: error: unsupported option '-static-libgcc'

This broke `flox build t3` on macOS, and since flox-build-test is part of
`make test`, it broke `make test` on macOS for anyone with flox in PATH.

Select the flag by platform in the build command: keep "-O -static-libgcc" on
Linux (where it avoids a runtime libgcc dependency, the original intent) and
use plain "-O" elsewhere, where -static-libgcc is both unsupported and
unnecessary. Verified on macOS: `flox build t3`, `flox build nixpkgs-t3`, and a
full `make test` all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@limeytexan requested a review from Copilot on June 16, 2026 13:21

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@limeytexan

limeytexan commented Jun 16, 2026

Contributor Author

Reviewed and tested thoroughly on macOS. The non-flox changes are sound (default build, the OPTFLAGS indirection, the midline-flush timing — 20/20 deterministic with golden files unchanged — and the tests/midline-flush.sh dependency), and flox build nixpkgs-t3 passes on Darwin.

One macOS regression found and fixed (pushed as 89c0270):

flox build t3 failed on macOS because the manifest command passed -static-libgcc unconditionally, but the pure sandbox on Darwin compiles with clang, which rejects it:

    clang -Wall -O -static-libgcc -DVERSION='"unknown"'  t3.c  -o t3
    clang: error: unsupported option '-static-libgcc'

Since flox-build-test is wired into make test, this also broke make test on macOS (I reproduced it — exit 2). It was invisible in a plain make OPTFLAGS=... because the dev cc is gcc; only the sandbox uses clang.

The fix selects the flag by platform in the build command — -O -static-libgcc on Linux (the original intent: avoid a runtime libgcc dependency), plain -O elsewhere where the flag is both unsupported and unnecessary:

    case "$(uname -s)" in
      Linux) optflags="-O -static-libgcc" ;;
      *)     optflags="-O" ;;
    esac

Re-locked so manifest.lock matches (the only lock change is the command string — no package churn). Verified on macOS after the fix: flox build t3, flox build nixpkgs-t3, and a full make test (with flox-build-test) all pass (exit 0); the nested sandbox builds correctly skip their own flox-build-test ("flox not in PATH"), so no recursion. The Linux flox-build CI job here will confirm -static-libgcc still applies on Linux.
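
The platform selection described in this comment can be exercised without switching machines by passing the kernel name in as an argument. The following is a sketch only: `select_optflags` is a hypothetical wrapper around the same `case` logic and is not part of the manifest.

```shell
#!/bin/sh
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to the optimization flags for the packaging build.
select_optflags() {
    case "$1" in
        Linux) echo "-O -static-libgcc" ;;  # avoid a runtime libgcc dependency
        *)     echo "-O" ;;                 # e.g. Darwin, where clang rejects the flag
    esac
}

# On the current host:
optflags=$(select_optflags "$(uname -s)")
echo "OPTFLAGS=$optflags"
```

Taking the kernel name as a parameter keeps the branch for other platforms testable on a Linux host and vice versa.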

Minor (non-blocking): the lockfile's outputs_to_install for glibc has duplicated "bin" entries — harmless, but a bug tracked in https://github.com/flox/floxhub/issues/729.

@limeytexan

Contributor Author

PR Review: fix: harden flox builds for reliable nixpkgs and manifest packaging

Reviewed as a single-reviewer pass across Security & Correctness, Performance & Architecture, and Conventions & Tests. This is build/packaging infrastructure (Makefile, flox manifest/lock, a nixpkgs override expression, and a test-helper timing tweak) — no product C logic changes. Overall the change is sound and the stated goals (reproducible packaging builds, a CI-visible flox build gate, deterministic midline-flush ordering) are achieved. Findings below; none are blocking on their own, but per Forge review policy any Minor finding should be acknowledged before merge.

Stage 1: Spec Compliance — Pass

No slice/design doc applies (standalone utility, infra change). The PR body's stated objectives map cleanly onto the diff: OPTFLAGS hook, Linux-only -static-libgcc, glibc runtime-package workaround (with a documented follow-up), the flox-build-test gate, and the usleep timing fix. The test plan documents local make test, both flox build variants, and the in-sandbox skip path. Proceeding to Stage 2.

Stage 2: Code Quality

Critical (C) — Must Fix

None.

Important (I) — Should Fix

I1: make test now transitively invokes flox build, which re-enters make test — expensive and a potential surprise in CI

  • File: Makefile:187 (test: flox-build-test) together with Makefile:175-177
  • Issue: Making flox-build-test a prerequisite of the default test target means that any developer or CI job with flox in PATH running make test will now trigger two full Nix sandbox builds (flox build t3, flox build nixpkgs-t3), each of which runs make ... && make test ... again inside the sandbox. The nested invocation correctly skips (flox absent in sandbox), so there is no infinite recursion, but the wall-clock cost of make test changes dramatically and silently depending on whether flox happens to be on PATH. A contributor running the historically-fast make test loop will suddenly pay for two sandboxed Nix builds.
  • Fix: Consider keeping flox-build-test out of the default test aggregate and wiring it into a dedicated CI target (e.g. make ci or invoke make flox-build-test explicitly in the workflow). If the intent really is to gate every make test, document the cost prominently and consider gating on an opt-in variable (e.g. FLOX_BUILD_TEST=1) rather than mere PATH presence, so the behavior is deterministic rather than environment-dependent.

Minor (M) — Nice to Have

M1: Locked outputs_to_install for glibc is bin, not the requested out

  • File: .flox/env/manifest.lock (glibc entries, both Linux systems) vs .flox/env/manifest.toml:6 (glibc.outputs = [ "out" ])
  • Issue: The manifest requests glibc.outputs = ["out"] (the output carrying libc.so, which is what a compiled C binary needs at runtime), but the lock records outputs_to_install = ["bin", ...] for glibc — the only package in the lock whose installed output diverges from its request. The build passes per the test plan, so runtime-packages resolution evidently still pulls the needed closure, but the lock does not faithfully reflect the manifest's stated intent. Worth a sanity check that the runtime closure of the built t3 actually references glibc.out and not just glibc.bin. This looks like a flox locker behavior rather than an authoring mistake; if confirmed, fold it into the same follow-up issue mentioned in the PR body.

M2: Duplicated entries in outputs_to_install across the lock

  • File: .flox/env/manifest.lock (e.g. coreutils x86_64-linux shows out nine times; glibc x86_64-linux shows bin five times)
  • Issue: Informational, not introduced by this PR — the duplication is pre-existing and affects nearly every package (apple-sdk ["out","out"], gcc ["man","man","out","out","out"], etc.), so it is clearly a flox lock-generation quirk rather than anything the author hand-edited. Calling it out only so a reviewer does not mistake the glibc duplication for a packaging error in this diff. No action required here; belongs in the upstream flox follow-up.

M3: flox-build-test cleans only the t3 build cache, not the nixpkgs-t3 result

  • File: Makefile:175 (rm -f result-t3-buildCache)
  • Issue: The recipe removes result-t3-buildCache before building but leaves any result-* symlinks from flox build nixpkgs-t3 behind. Minor housekeeping; if the goal is a clean, reproducible gate, consider also clearing the nixpkgs-t3 result/symlink (and confirming neither result* artifact is committed). Not a correctness problem.

Confirmed correct

  • tests/midline-flush.sh: tests/midline-flush prerequisite (Makefile:148): Correct and an improvement over the prior tests/midline-flush/run: tests/midline-flush. The compiled helper binary has no explicit build rule and relies on make's implicit %: %.c rule, so under parallel make -j the /run phony could fire before the binary materialized. Attaching the binary as a prerequisite of the .INTERMEDIATE $(test).sh (which $(test)/run depends on) puts the ordering edge on a real file target, which is the more robust place for it. The binary is still built before the test runs via the transitive chain.
  • usleep 10µs → 10ms in tests/midline-flush.c:21,25: Correct fix. The .log golden file asserts strict per-iteration ordering (stderr line before stdout line), so the inter-stream delay is load-bearing; 10µs is well within scheduler jitter under sandbox pressure. The new comments accurately describe the why (sandbox scheduling) and drop the previous comment's inaccurate "only takes a millisecond" claim. This is a behavior-asserting test, not a data test.
  • OPTFLAGS ?= -g indirection (Makefile:6-7): Clean. Preserves the -g dev default while letting packaging override to -O -static-libgcc. OPTFLAGS correctly propagates into CFLAGS, so the implicitly-built test binaries pick up the same flags. Dropping -g in the packaged build to avoid embedding build-time store paths in debug references is a reasonable packaging choice.
  • Linux-only -static-libgcc guard (manifest.toml build command): Correctly scoped via uname -s. -static-libgcc is a GCC-only flag and clang (Darwin sandbox) rejects it. The same $optflags is threaded through make, make test, and make install so all three phases compile consistently.
  • .flox/pkgs/nixpkgs-t3.nix: Idiomatic flox expression-package: { t3 }: t3.overrideAttrs (_: { src = ../../.; }) rebuilds the upstream nixpkgs t3 from local source. The .nix suffix is correct — it is a source file processed by a tool, not an executable script entrypoint, so the repo's no-language-suffix-on-entrypoints convention does not apply. flox build nixpkgs-t3 in the test plan confirms it resolves.
  • manifest.toml ↔ manifest.lock consistency for the new glibc install: The dotted-key install syntax (glibc.pkg-path / glibc.outputs / glibc.systems) matches the existing apple-sdk convention, is Linux-scoped, and the lock's embedded manifest.install.glibc mirrors the TOML. The pin is a concrete nixpkgs rev with a derivation hash (9ae611a..., glibc-2.42-61), so the addition is reproducible.
  • No recursion hazard: nested make test inside the sandbox skips flox-build-test (flox not in PATH), so the build gate does not loop.
  • glibc as a documented workaround: The PR body correctly frames runtime-packages = ["glibc"] as a workaround for the package-builder's dependency checker and commits to filing an upstream follow-up — appropriate handling of introduced debt.

Summary

Must fix before merge: None
Should address: I1 (env-dependent cost/behavior of folding flox-build-test into default make test)
Also consider: M1 (glibc locked output is bin, manifest requests out — verify runtime closure), M2 (pre-existing lock duplication, informational), M3 (nixpkgs-t3 result not cleaned)

Verdict

Changes requested — the change is functionally sound and well-documented; please weigh I1 (make the flox build gate opt-in/deterministic rather than PATH-triggered) and confirm M1 (glibc out vs bin in the runtime closure) before merge. The remaining items are minor/informational.


Via Forge (interactive) • 3ca4fd79

@limeytexan

limeytexan commented Jun 16, 2026

Contributor Author

Thanks for the thorough pass — and for confirming the -static-libgcc Linux guard, OPTFLAGS indirection, midline-flush timing/dependency, and nixpkgs-t3.nix as correct. Dispositions on the open findings (updated after acting on I1):

I1 — flox-build-test folded into make test: keeping the gate, and closed the recursion edge.
Keeping it in make test is deliberate, and it just earned its keep — the -static-libgcc/clang breakage was invisible to a plain make test and to make OPTFLAGS=… (dev cc is gcc); it only surfaced because make test ran flox build on Darwin. The PATH-gated skip is the feature: the plain CI build job and non-flox contributors get the fast suite, while the flox-build job and flox-equipped devs get the full gate.

You're right, though, that the no-recursion guarantee was resting entirely on the sandbox type. With a pure sandbox flox isn't on PATH so the in-build make test skips and there's no loop — but flip the build to a non-pure sandbox and flox would be present, and the build's make test would re-enter flox build indefinitely. Fixed in db36795: added a DISABLE_FLOX_BUILD_TEST make variable that skips the gate regardless of flox's presence, and the manifest build now passes it to the inner make test. The guarantee is now explicit and independent of the sandbox type. Verified on macOS: flox build t3 and a full make test pass, with the in-build make test skipping via DISABLE_FLOX_BUILD_TEST rather than by flox happening to be absent.

M1 — glibc locked output: the lock is faithful; the divergence is a read of the wrong section.
The lockfile has two relevant parts: manifest.install (the authoritative install spec) and packages (the resolved package descriptors). manifest.install.glibc.outputs is ["out"] — it matches manifest.toml exactly. The outputs_to_install = ["bin", …] in the packages descriptor section is not the install spec; it's the same cosmetic catalog-server data as M2. So the lock faithfully reflects the manifest's intent — no out-vs-bin divergence in the install spec, and nothing to verify in the runtime closure.

M2 — duplicated outputs_to_install across the lock: confirmed upstream, tracked.
It's a catalog-server bug, tracked in flox/floxhub#729 (open since January). Not author/edit-introduced: a from-scratch re-lock (rm manifest.lock + relock) reproduces the lockfile byte-for-byte, duplicates included — the data arrives that way from the catalog-server and the CLI records it faithfully. Nothing this repo can scrub; it clears when #729 is fixed server-side. (Same root cause as the packages-section glibc entries in M1.)

M3 — flox-build-test leaves result-nixpkgs-t3 behind: fair, and harmless.
All result-* artifacts are gitignored (verified), so nothing is committed — this is just a stray working-dir symlink. Happy to add a cleanup of the result-* symlinks to the recipe if you'd like the gate to leave the tree clean; say the word and I'll push it. (The rm result-t3-buildCache up front is specifically to force a non-incremental t3 build; the nixpkgs-t3 expression build has no cache, hence the asymmetry.)
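
If the cleanup were added, it might look something like the sketch below. This is an illustration, not the recipe from the Makefile: `clean_result_links` is a hypothetical helper, and it assumes the result-* artifacts are symbolic links, as the comment above describes for the stray nixpkgs-t3 result.

```shell
#!/bin/sh
# Hypothetical helper: remove result-* symlinks from a directory,
# leaving regular files and directories untouched.
clean_result_links() {
    dir=${1:-.}
    for entry in "$dir"/result-*; do
        # -L is true only for symbolic links (including dangling ones);
        # it also skips the unexpanded pattern when nothing matches.
        if [ -L "$entry" ]; then
            rm -f -- "$entry"
        fi
    done
}

clean_result_links .
```

Restricting the removal to symlinks means a regular file that happens to match result-* is never deleted by accident.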

Net: I1's recursion edge is now closed (db36795); M1's install spec is correct (the divergence is a read of the packages section, not the install block); M2 is the upstream catalog bug (#729); M3 is optional tidiness on request.

flox-build-test runs only when flox is on PATH, which a "pure" sandbox does
not provide, so the in-build `make test` skips it and there is no recursion
today. But that safety relies entirely on the sandbox type: switching the
build to a non-pure sandbox would put flox on PATH inside the build, and the
build's `make test` would then re-enter `flox build`, looping indefinitely.

Add a DISABLE_FLOX_BUILD_TEST make variable that skips the gate regardless of
flox's presence, and have the manifest build pass it to the inner `make test`.
This makes the no-recursion guarantee explicit and independent of the sandbox
type. Verified on macOS: `flox build t3` and a full `make test` pass, with the
in-build `make test` now skipping via DISABLE_FLOX_BUILD_TEST rather than by
flox happening to be absent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
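
The order of checks this commit describes can be sketched in shell as follows. The real gate is a Makefile conditional that is not reproduced in this thread, so `flox_build_test_action` is a hypothetical stand-in that only reports the decision. The assumption that any non-empty value of DISABLE_FLOX_BUILD_TEST disables the gate is mine.

```shell
#!/bin/sh
# Hypothetical decision function mirroring the described order of checks:
# the explicit opt-out wins, then presence of flox on PATH decides.
flox_build_test_action() {
    if [ -n "${DISABLE_FLOX_BUILD_TEST:-}" ]; then
        echo "skip: DISABLE_FLOX_BUILD_TEST is set"
    elif command -v flox >/dev/null 2>&1; then
        echo "run: flox build t3 && flox build nixpkgs-t3"
    else
        echo "skip: flox not in PATH"
    fi
}

flox_build_test_action
```

Checking the variable before probing PATH is what makes the skip independent of the sandbox type: an inner `make test` invoked with the variable set never reaches the flox lookup.
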
@limeytexan

Contributor Author

Verified the two new commits and the dispositions against the tree — all resolved, and the M1 pushback is correct.

I1 — recursion edge closed (db36795) — Good fix, and you're right that the guarantee was previously resting on the sandbox type. DISABLE_FLOX_BUILD_TEST is now a Makefile conditional checked before the FLOX-on-PATH branch (Makefile:183-194), and the manifest's inner make test sets DISABLE_FLOX_BUILD_TEST=1 (manifest.toml:46), so the no-recursion guarantee holds independent of pure-vs-non-pure sandbox. Keeping the gate PATH-triggered for the fast/full split is a reasonable call — and the clang -static-libgcc breakage it caught (89c0270, correctly uname -s-gated) is a good argument for keeping flox build in the loop.

M1 — withdrawn; you're right, our reviewer read the wrong section. Confirmed manifest.install.glibc.outputs in the lock is ["out"] (lock lines 31-33), matching manifest.toml exactly. The outputs_to_install: ["bin","bin"] we cited lives in the packages descriptor block (lock lines 657-660) — catalog-server data, same source as M2, not the install spec. No out-vs-bin divergence and nothing to verify in the runtime closure. Apologies for the noise.

M2 — agreed, upstream. The byte-for-byte reproducible re-lock is conclusive that it's catalog-server data faithfully recorded, not anything this repo can scrub. flox/floxhub#729 is the right home.

M3 — confirmed harmless, no need to act. /result-* is gitignored (verified), so nothing is committed; the stray result-nixpkgs-t3 symlink is just working-dir tidiness. The rm result-t3-buildCache-only asymmetry makes sense given only the t3 build has a cache to force-bust. Not worth a round-trip — leave it unless it bugs you.

No remaining review blockers from my side.


Via Forge (interactive) • 3ca4fd79

@limeytexan requested a review from billlevine on June 16, 2026 14:14

@billlevine billlevine left a comment

LGTM

@limeytexan merged commit f270d32 into main on Jun 16, 2026 (4 checks passed)
@limeytexan deleted the fix/harden-flox-builds branch on June 16, 2026 15:39
3 participants