Route deployment artifacts through object storage #135
Draft
Anveio wants to merge 150 commits into
7916194 to
9997ea2
Compare
Move the two root //patches files to live beside the components that own the patched dependency, matching the temporal-platform/patches precedent:
- tigerbeetle-go -> //src/infrastructure-components/tigerbeetle/patches
- pg-query-go -> //src/tools/dev/patches (sqlc transitive dep)
Update the go_deps.module_override labels in MODULE.bazel and drop the now-empty root patches/ package.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path
Collapse `guardian run <tool>` resolution to a single hermetic path: locate
<root>/<name>/bin/<exe> and exec, where root is GUARDIAN_TOOLS_ROOT (the golden
image) or the Bazel-materialized .verself/tools/root (dev). The mirror fetch,
sha256 re-hash, and admission gate are removed; the cosign-verified image
manifest digest is the trust boundary.
- toolrun: delete fetchMirror/fetchHTTPS/copyFile and the download half of
EnsureExecutable; new Locate (stat + exec-bit assert, fail loud) + Exec.
- toolcatalog: PlatformTool/ResolvedTool collapse to {executable}; the bazel pin
(url+sha256) moves to the Bazel module graph so no digest is authored twice in
the catalog.
- bazel becomes a controller_http_file pin; tools/BUILD.bazel adds a per-tool
layered tools_root pkg_tar; `aspect dev install` materializes it via
stage-zero bazel (so rebuilding guardian can't deadlock its own resolver).
- guardian run <tool> --which/--verify report root + source instead of
ref/digest/admission.
Verified: 3/3 Go test targets pass; with egress blocked (sudo unshare -n,
loopback only) `guardian run bazel -- query //src/... --output=package` resolves
from the root and returns all packages offline; `aspect dev install` + dev
`guardian run bazel -- version` -> 9.1.0 from .verself/tools/root.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erated tool catalog
Eliminate the two preflight network leaks and collapse the tool pins to one generated source of truth.
No-leaks (TB-4):
- podman: pin the full 39-deb noble closure by sha256 (resolved against a clean base) under src/infrastructure-components/podman; runtime_artifact extracts a self-contained prefix. preflight installs it + a CNI containers.conf + systemd podman.socket from the prefix instead of `apt install podman`.
- rsync: seed the pinned rsync binary over ssh and drive remote rsync via --rsync-path; drop rsync from the (now deleted) apt task.
- ansible-core: a reproducible, offline, controller-side ansible-core 2.20.3 bundle (python + hash-pinned deps + the already-pinned collections) replaces the `uvx --from ansible-core` PyPI fetch in runPreflightPlaybook.
Single source of truth (TB-2):
- .config/guardian/tools.cue is generated from the Bazel pin graph (tools:catalog.bzl) via gen-tools-catalog, gated by write_source_files + diff_test; the tool root materializes all 34 tools so `guardian run <tool> --` resolves every pinned tool.
- bazel_pin_consistency_test fails if the bazel digest drifts across the pin, the stage-zero bootstrap, and the make-skill release lockfile.
Verified: podman pulls+runs a container from the prefix (crun+conmon+overlay+cni); ansible-playbook + a pinned-collection task run offline; tools_root resolves 34 tools offline; all guardian/toolrun/catalog tests + both drift gates pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ventory
preflight: stop podman and clear /var/lib/containers/storage (+ /run/containers/storage) before reconfiguring from the pinned prefix, so a stale storage DB left by a prior podman (e.g. apt's, with a different graph driver) cannot crash the podman API service with "database graph driver ... mismatch". The Nomad podman driver then comes up healthy on a clean reconverge.
Document the hermetic preflight (pinned podman/rsync/ansible-core, no apt/PyPI), the podman-from-prefix model, the generated single-source tool catalog, and that the postgresql-runtime role drift is resolved by the disaster-recovery path (verified on gamma: role-not-found gone, fresh allocation starts; full postgres convergence then chains on the Cloudflare account-admin operator import).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gamma)
After the OpenBao wipe + Cloudflare account-admin import (R2 recovery produced), postgresql clears the postgresql-runtime role, secret, and R2 gates and reaches its setup prestart for the first time, which then crashes decoding the `postgresql-recovery --action=info` output as utf-8. Record it as an open postgresql-recovery component bug (emit valid utf-8 JSON), distinct from the deploy path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut PostgreSQL off raw_exec/host orchestration onto the podman driver, matching the profile/distribution-service pattern, with state on durable bind-mounts.
- Image: first libc OCI base in the repo (ubuntu:24.04 pinned by digest via rules_oci oci.pull), layered with a computed 16-deb OS closure (krb5/ldap/python3.12), the existing postgresql_runtime.tar (postgres 16 + pgBackRest + postgresql-recovery), and an identity layer (postgres uid/gid 999). oci_load emits the docker-archive the podman driver loads; nomad var becomes postgresql_image_sha256.
- nomad.hcl: setup/server/reconcile now driver="podman", user="postgres", network host, shm_size=512m. Durable host bind-mounts for PGDATA, config, log, socket, pgBackRest spool/log, secrets, and the projected document; server and reconcile run readonly_rootfs + cap_drop=all. Deleted all host-glue: host user creation, runtime-tar extraction + releases/current versioning, LD_LIBRARY_PATH juggling, runuser indirection, and the openssl shell-out (now python stdlib secrets). The image is the runtime; runtime root is the fixed /opt/verself/postgresql prefix.
- recovery binary: the info action now sanitizes pgBackRest output to valid utf-8 (strings.ToValidUTF8) before stdout, so a non-utf8 byte can't crash the consumer's decode; other actions keep raw passthrough. Tests added.
- Drop the now-dead runtimeArtifact/runtimeRoot CRD fields from the schema and the gamma instance (the image supersedes them); no raw_exec trace remains.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two gaps the fresh-reimaged gamma node exposed (the previous node masked them with leftover apt podman):
- catatonit (podman's init binary, used by `init = true` tasks) is a podman Recommends that --no-install-recommends dropped from the pinned closure. Pin it and point containers.conf init_path at the prefix copy; container creation failed with "lookup init binary: catatonit not found in $PATH" without it.
- podman/crun/conmon are dynamically linked against prefix-only libs (libyajl, libgpgme, ...) and podman does not propagate LD_LIBRARY_PATH to the crun it spawns, so on a clean host crun died with "libyajl.so.2: cannot open shared object file" (crun start exit 127). Register the prefix lib dirs via /etc/ld.so.conf.d + ldconfig so every prefix binary resolves its libs; also put the prefix bin on the podman service PATH.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…efix machinery)
Replace the dpkg-deb-x-into-/opt-prefix approach with shipping the pinned podman .deb closure as a bundle and installing it on the node with an offline `dpkg -i ./*.deb`. dpkg owns placement, ldconfig, /usr/libexec/podman/catatonit, helper dirs, the stock podman.socket/.service unit, and the stock /etc/containers config, so the "could not load shared library / not found in PATH" failure class (catatonit missing, crun libyajl.so.2) becomes structurally impossible. Hermeticity is preserved: the .deb bytes are the same pinned content-addressed artifacts, installed with no apt-archive fetch.
- BUILD: runtime_artifact now tars the .deb files (podman-debs.tar) instead of an extracted prefix.
- preflight.yml: the prefix block (extract, ld.so.conf+ldconfig, custom containers.conf/storage.conf/CNI, LD_LIBRARY_PATH/init_path units, storage clear) is replaced by: unpack the bundle + `dpkg -i ./*.deb` + enable the deb's stock podman.socket.
- main.go: target path -> podman-debs.tar; podmanPrefix const and guardian_podman_prefix var removed.
Verified offline in a clean ubuntu:24.04 container (network severed): dpkg -i installs the closure, podman 4.9.3 runs, ldd crun resolves libyajl, catatonit is at /usr/libexec/podman/catatonit, and `podman run --network none` executes a container.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on, postmaster, R2 TLS)
Fixes the chain of blockers that kept the postgresql Nomad job from reaching steady-state green on gamma under the podman driver:
- provision-host-dirs: a raw_exec root prestart creates the durable + socket bind sources (/var/lib/postgresql/16/verself, /var/run/postgresql). The podman driver does not auto-create bind-mount sources (unlike docker -v), and /var/run is tmpfs (wiped each boot), so this runs per-alloc.
- server/reconcile gain real restart stanzas (mode=delay) and the group a self-healing reschedule, so a slow first start no longer fails the reconcile sidecar and SIGKILLs the alloc (was Exit 143, "Sibling reconcile failed").
- clear_stale_postmaster replaces wait_for_previous_postmaster: under init=true the wrapper is itself pid 2, so the old os.kill(pid,0) liveness check self-aliased and never cleared a stale postmaster.pid. The socket is the namespace-independent authority for whether a postmaster is serving.
- wait_for_ready deadline 60s -> 300s to absorb crash recovery.
- stanza-create is unconditional + idempotent; the missing-stanza string heuristic skipped it for a fresh empty-list repo and broke check.
- pgBackRest TLS to R2: mount host /etc/ssl/certs and set repo1-s3-ca-file in the generated config. The minimal image has no CA roots and OpenSSL's default path is not /etc/ssl/certs; the config setting covers the recovery commands and postgres's archive_command alike.
CRD ephemeral paths (config/log/report/pgbackrest) move to /alloc so only the durable dataDir and cross-component socketDir remain host bind-mounts.
Verified on gamma: report.json status=healthy, 16 service databases + 17 roles created, pgBackRest full backup 20260608-091822F status ok.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new-binary skill classifies how every binary is pinned, declared, and run (source axis x run axis; Level 1 containerized vs Level 2 native). This adds the governing rule for recover binaries: they own domain-state convergence only. Install, host provisioning, and process supervision belong to the OCI image, the golden-image build, and Nomad respectively, and are deleted (not relocated, and never pushed into Ansible) when a component goes Level 1. Stateless components delete their recover binary outright; stateful components collapse it to the domain FSM (postgresql/internal/recoveryfsm is the reference).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The PostgreSQL inventory row and the info-decode observed-error row were stale. Recon showed the info-decode UnicodeDecodeError already resolved (setup exits 0), which exposed a downstream chain (postmaster PID-namespace staleness under init=true, the reconcile/restart supervision race (Exit 143), host-dir provisioning, and pgBackRest R2 CA trust), now fixed in nomad.hcl and verified: report.json status=healthy, 16 databases + 17 roles, full backup ok.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
- `object-storage-service` write sessions instead of direct Cloudflare/R2 control-plane coupling.
- `cloudflare-r2-control-plane` component and keep Cloudflare provider authority behind the Cloudflare integration/object-storage boundary.
- `aspect integrations cloudflare-control-plane --action=import-admin-pair` with `0600` token files and provider verification before OpenBao persistence.

Validation
- `aspect tidy`
- `aspect check`
- `bazelisk test //src/tools/deployment/internal/sitebootstrap:sitebootstrap_test //src/integrations/cloudflare/control-plane/cmd/cloudflare-control-plane:cloudflare-control-plane_test //src/integrations/cloudflare/control-plane/internal/r2control:r2control_test`
- `bazelisk test //src/services/deployment-service/deploycontract:deploycontract_test`

Live gamma status
- `aspect site bootstrap-deploy --site=gamma --sha=$(git rev-parse HEAD) --openbao-site-root-token-file=/tmp/verself-bootstrap/gamma-openbao-site-root.token` reaches Nomad/OpenBao/Postgres bootstrap reconciliation, then correctly stops because object-storage R2 runtime credentials are not imported:
- Current imported Cloudflare account-admin authority is insufficient. Live provider verification fails with Cloudflare 403 because token A cannot read account token metadata and needs `Account API Tokens Read` on account `c3eaeffaadf7d4847684d4775c16d598`.
- Host cleanup was checked after bootstrap attempts: `/run/verself/bootstrap/openbao-site-root.token` and `/run/verself/bootstrap/openbao-root.token` are absent.

Remaining before ready
- `import-admin-pair`.
- `rotate-object-storage-provider` to create and persist the four object-storage R2 runtime credentials.
- `aspect deploy --site=gamma` for the full exit condition.