Route deployment artifacts through object storage #135

Draft
Anveio wants to merge 150 commits into main from codex/object-storage-deployment-artifacts

Anveio commented Jun 3, 2026

Contributor

Summary

  • Route deployment artifacts through object-storage-service write sessions instead of direct Cloudflare/R2 control-plane coupling.
  • Delete the old cloudflare-r2-control-plane component and keep Cloudflare provider authority behind the Cloudflare integration/object-storage boundary.
  • Keep first bootstrap explicit: local build, SSH artifact copy, loopback-only artifact server, OpenBao/Postgres reconciliation, then Nomad job registration.
  • Add explicit Cloudflare account-admin import through aspect integrations cloudflare-control-plane --action=import-admin-pair with 0600 token files and provider verification before OpenBao persistence.
  • Improve bootstrap diagnostics by listing every missing active external OpenBao runtime secret in one failure.
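The last bullet (reporting every missing secret in one failure rather than stopping at the first) can be illustrated with a minimal Python sketch. The names `require_secrets` and `MissingSecretsError` are hypothetical and not taken from the repository; only the four key names come from this PR.

```python
from typing import Iterable, List, Mapping

# The four runtime secret keys named in this PR's gamma status.
R2_RUNTIME_SECRETS = [
    "object-storage-service.r2.admin_access_key_id",
    "object-storage-service.r2.admin_secret_access_key",
    "object-storage-service.r2.proxy_access_key_id",
    "object-storage-service.r2.proxy_secret_access_key",
]


class MissingSecretsError(RuntimeError):
    """Hypothetical error type: raised once, carrying every missing key."""

    def __init__(self, missing: List[str]) -> None:
        self.missing = list(missing)
        super().__init__(
            "missing OpenBao runtime secrets: " + ", ".join(self.missing)
        )


def require_secrets(required: Iterable[str], present: Mapping[str, str]) -> None:
    """Collect all absent or empty keys first, then fail a single time."""
    missing = sorted(key for key in required if not present.get(key))
    if missing:
        raise MissingSecretsError(missing)
```

Collecting before raising means an operator sees the full import worklist from one bootstrap attempt instead of one key per attempt.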

Validation

  • aspect tidy
  • aspect check
  • bazelisk test //src/tools/deployment/internal/sitebootstrap:sitebootstrap_test //src/integrations/cloudflare/control-plane/cmd/cloudflare-control-plane:cloudflare-control-plane_test //src/integrations/cloudflare/control-plane/internal/r2control:r2control_test
  • bazelisk test //src/services/deployment-service/deploycontract:deploycontract_test

Live gamma status

aspect site bootstrap-deploy --site=gamma --sha=$(git rev-parse HEAD) --openbao-site-root-token-file=/tmp/verself-bootstrap/gamma-openbao-site-root.token reaches Nomad/OpenBao/Postgres bootstrap reconciliation, then correctly stops because object-storage R2 runtime credentials are not imported:

object-storage-service.r2.admin_access_key_id
object-storage-service.r2.admin_secret_access_key
object-storage-service.r2.proxy_access_key_id
object-storage-service.r2.proxy_secret_access_key

The currently imported Cloudflare account-admin authority is insufficient. Live provider verification fails with a Cloudflare 403 because token A cannot read account token metadata; it needs the Account API Tokens Read permission on account c3eaeffaadf7d4847684d4775c16d598.

Host cleanup was checked after bootstrap attempts: /run/verself/bootstrap/openbao-site-root.token and /run/verself/bootstrap/openbao-root.token are absent.

Remaining before ready

  • Import two valid Cloudflare account-admin API token files through import-admin-pair.
  • Run rotate-object-storage-provider to create and persist the four object-storage R2 runtime credentials.
  • Rerun the gamma bootstrap, then run aspect deploy --site=gamma to meet the full exit condition.
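The Summary states that import-admin-pair expects 0600 token files. This sketch shows the kind of pre-import check an operator could run on a token file. It assumes "0600" means the permission bits must be exactly owner read/write, and `check_token_file_mode` is a hypothetical helper, not the tool's actual implementation.

```python
import os
import stat


def check_token_file_mode(path: str) -> None:
    """Raise if the file's permission bits are anything other than 0600.

    Assumption: exact equality with 0600 is required; the real tool may
    apply a different rule (for example, only rejecting group/other bits).
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        raise PermissionError(f"{path}: mode {mode:04o}, expected 0600")
```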

Anveio force-pushed the codex/object-storage-deployment-artifacts branch 2 times, most recently from 7916194 to 9997ea2 on June 7, 2026 09:04
Anveio and others added 26 commits June 7, 2026 03:00
Move the two root //patches files to live beside the components that own
the patched dependency, matching the temporal-platform/patches precedent:

  tigerbeetle-go -> //src/infrastructure-components/tigerbeetle/patches
  pg-query-go    -> //src/tools/dev/patches (sqlc transitive dep)

Update the go_deps.module_override labels in MODULE.bazel and drop the
now-empty root patches/ package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path

Collapse `guardian run <tool>` resolution to a single hermetic path: locate
<root>/<name>/bin/<exe> and exec, where root is GUARDIAN_TOOLS_ROOT (the golden
image) or the Bazel-materialized .verself/tools/root (dev). The mirror fetch,
sha256 re-hash, and admission gate are removed; the cosign-verified image
manifest digest is the trust boundary.

- toolrun: delete fetchMirror/fetchHTTPS/copyFile and the download half of
  EnsureExecutable; new Locate (stat + exec-bit assert, fail loud) + Exec.
- toolcatalog: PlatformTool/ResolvedTool collapse to {executable}; the bazel pin
  (url+sha256) moves to the Bazel module graph so no digest is authored twice in
  the catalog.
- bazel becomes a controller_http_file pin; tools/BUILD.bazel adds a per-tool
  layered tools_root pkg_tar; `aspect dev install` materializes it via
  stage-zero bazel (so rebuilding guardian can't deadlock its own resolver).
- guardian run <tool> --which/--verify report root + source instead of
  ref/digest/admission.

Verified: 3/3 Go test targets pass; with egress blocked (sudo unshare -n,
loopback only) `guardian run bazel -- query //src/... --output=package` resolves
from the root and returns all packages offline; `aspect dev install` + dev
`guardian run bazel -- version` -> 9.1.0 from .verself/tools/root.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
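The single-path resolution in the commit above (root, then <root>/<name>/bin/<exe>, then a stat plus exec-bit assert that fails loudly) can be sketched as follows. The real toolrun package is Go; this is a Python analogue. The function names, the `workspace` parameter, and the chosen exception types are assumptions for illustration.

```python
import os
import stat
from pathlib import Path
from typing import Mapping


def tools_root(env: Mapping[str, str] = os.environ, workspace: str = ".") -> Path:
    """Use GUARDIAN_TOOLS_ROOT when set, else the dev root under the workspace."""
    root = env.get("GUARDIAN_TOOLS_ROOT")
    if root:
        return Path(root)
    return Path(workspace) / ".verself" / "tools" / "root"


def locate(root: Path, name: str, exe: str) -> Path:
    """Resolve <root>/<name>/bin/<exe>; never fetch, never fall back."""
    path = Path(root) / name / "bin" / exe
    st = path.stat()  # raises FileNotFoundError if the tool is absent
    if not stat.S_ISREG(st.st_mode):
        raise FileNotFoundError(f"{path}: not a regular file")
    if not st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        raise PermissionError(f"{path}: no executable bit set")
    return path
```

A caller would then hand the returned path to something like `os.execv`, with no download or re-hash step in between.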
…erated tool catalog

Eliminate the two preflight network leaks and collapse the tool pins to one
generated source of truth.

No-leaks (TB-4):
- podman: pin the full 39-deb noble closure by sha256 (resolved against a clean
  base) under src/infrastructure-components/podman; runtime_artifact extracts a
  self-contained prefix. preflight installs it + a CNI containers.conf + systemd
  podman.socket from the prefix instead of `apt install podman`.
- rsync: seed the pinned rsync binary over ssh and drive remote rsync via
  --rsync-path; drop rsync from the (now deleted) apt task.
- ansible-core: a reproducible, offline, controller-side ansible-core 2.20.3
  bundle (python + hash-pinned deps + the already-pinned collections) replaces
  the `uvx --from ansible-core` PyPI fetch in runPreflightPlaybook.

Single source of truth (TB-2):
- .config/guardian/tools.cue is generated from the Bazel pin graph
  (tools:catalog.bzl) via gen-tools-catalog, gated by write_source_files +
  diff_test; the tool root materializes all 34 tools so `guardian run <tool> --`
  resolves every pinned tool.
- bazel_pin_consistency_test fails if the bazel digest drifts across the pin,
  the stage-zero bootstrap, and the make-skill release lockfile.

Verified: podman pulls+runs a container from the prefix (crun+conmon+overlay+
cni); ansible-playbook + a pinned-collection task run offline; tools_root
resolves 34 tools offline; all guardian/toolrun/catalog tests + both drift gates
pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
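The drift gate described in the commit above (one bazel digest that must agree across the pin, the stage-zero bootstrap, and the release lockfile) reduces to a set-equality check. This is a minimal sketch; `assert_single_digest` and the source labels are hypothetical, and how each digest is read from its file is left out.

```python
from typing import Mapping


def assert_single_digest(pins: Mapping[str, str]) -> str:
    """Return the shared digest, or fail naming every source and its value."""
    distinct = set(pins.values())
    if len(distinct) != 1:
        detail = ", ".join(f"{src}={digest}" for src, digest in sorted(pins.items()))
        raise ValueError("bazel digest drift: " + detail)
    return distinct.pop()
```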
…ventory

preflight: stop podman and clear /var/lib/containers/storage (+ /run/containers
/storage) before reconfiguring from the pinned prefix, so a stale storage DB left
by a prior podman (e.g. apt's, with a different graph driver) cannot crash the
podman API service with "database graph driver ... mismatch". The Nomad podman
driver then comes up healthy on a clean reconverge.

Document the hermetic preflight (pinned podman/rsync/ansible-core, no apt/PyPI),
the podman-from-prefix model, the generated single-source tool catalog, and that
the postgresql-runtime role drift is resolved by the disaster-recovery path
(verified on gamma: role-not-found gone, fresh allocation starts; full postgres
convergence then chains on the Cloudflare account-admin operator import).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… gamma)

After the OpenBao wipe + Cloudflare account-admin import (R2 recovery produced),
postgresql clears the postgresql-runtime role, secret, and R2 gates and reaches
its setup prestart for the first time, which then crashes decoding the
`postgresql-recovery --action=info` output as utf-8. Record it as an open
postgresql-recovery component bug (emit valid utf-8 JSON), distinct from the
deploy path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut PostgreSQL off raw_exec/host orchestration onto the podman driver, matching
the profile/distribution-service pattern, with state on durable bind-mounts.

- Image: first libc OCI base in the repo — ubuntu:24.04 pinned by digest via
  rules_oci oci.pull — layered with a computed 16-deb OS closure (krb5/ldap/
  python3.12), the existing postgresql_runtime.tar (postgres 16 + pgBackRest +
  postgresql-recovery), and an identity layer (postgres uid/gid 999). oci_load
  emits the docker-archive the podman driver loads; nomad var becomes
  postgresql_image_sha256.
- nomad.hcl: setup/server/reconcile now driver="podman", user="postgres",
  network host, shm_size=512m. Durable host bind-mounts for PGDATA, config, log,
  socket, pgBackRest spool/log, secrets, and the projected document; server and
  reconcile run readonly_rootfs + cap_drop=all. Deleted all host-glue: host
  user creation, runtime-tar extraction + releases/current versioning,
  LD_LIBRARY_PATH juggling, runuser indirection, and the openssl shell-out
  (now python stdlib secrets). The image is the runtime; runtime root is the
  fixed /opt/verself/postgresql prefix.
- recovery binary: the info action now sanitizes pgBackRest output to valid
  utf-8 (strings.ToValidUTF8) before stdout, so a non-utf8 byte can't crash the
  consumer's decode; other actions keep raw passthrough. Tests added.
- Drop the now-dead runtimeArtifact/runtimeRoot CRD fields from the schema and
  the gamma instance (the image supersedes them); no raw_exec trace remains.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
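The info-action fix above uses Go's strings.ToValidUTF8 on the producer side. The following Python sketch shows the consumer-side failure it prevents and a comparable sanitization. `sanitize_utf8` is a hypothetical name, and Python's `errors="replace"` is an analogue of the Go call, not an exact equivalent.

```python
def sanitize_utf8(raw: bytes) -> bytes:
    """Replace undecodable bytes with U+FFFD so strict decoders cannot fail."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# A single stray 0xff byte is enough to break a strict utf-8 decode.
raw_info = b'{"status":"ok\xff"}'
try:
    raw_info.decode("utf-8")
    decoded_without_error = True
except UnicodeDecodeError:
    decoded_without_error = False

clean_text = sanitize_utf8(raw_info).decode("utf-8")
```

Sanitizing only the info action keeps the other actions byte-exact, which matches the raw passthrough the commit describes.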
Two gaps the fresh-reimaged gamma node exposed (the previous node masked them
with leftover apt podman):

- catatonit (podman's init binary, used by `init = true` tasks) is a podman
  Recommends that --no-install-recommends dropped from the pinned closure. Pin
  it and point containers.conf init_path at the prefix copy; container creation
  failed with "lookup init binary: catatonit not found in $PATH" without it.
- podman/crun/conmon are dynamically linked against prefix-only libs (libyajl,
  libgpgme, ...) and podman does not propagate LD_LIBRARY_PATH to the crun it
  spawns, so on a clean host crun died with "libyajl.so.2: cannot open shared
  object file" (crun start exit 127). Register the prefix lib dirs via
  /etc/ld.so.conf.d + ldconfig so every prefix binary resolves its libs; also
  put the prefix bin on the podman service PATH.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…efix machinery)

Replace the dpkg-deb-x-into-/opt-prefix approach with shipping the pinned podman
.deb closure as a bundle and installing it on the node with an offline
`dpkg -i ./*.deb`. dpkg owns placement, ldconfig, /usr/libexec/podman/catatonit,
helper dirs, the stock podman.socket/.service unit, and the stock
/etc/containers config — so the "could not load shared library / not found in
PATH" failure class (catatonit missing, crun libyajl.so.2) becomes structurally
impossible. Hermeticity is preserved: the .deb bytes are the same pinned
content-addressed artifacts, installed with no apt-archive fetch.

- BUILD: runtime_artifact now tars the .deb files (podman-debs.tar) instead of an
  extracted prefix.
- preflight.yml: the prefix block (extract, ld.so.conf+ldconfig, custom
  containers.conf/storage.conf/CNI, LD_LIBRARY_PATH/init_path units, storage
  clear) is replaced by: unpack the bundle + `dpkg -i ./*.deb` + enable the
  deb's stock podman.socket.
- main.go: target path -> podman-debs.tar; podmanPrefix const and
  guardian_podman_prefix var removed.

Verified offline in a clean ubuntu:24.04 container (network severed): dpkg -i
installs the closure, podman 4.9.3 runs, ldd crun resolves libyajl, catatonit is
at /usr/libexec/podman/catatonit, and `podman run --network none` executes a
container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on, postmaster, R2 TLS)

Fixes the chain of blockers that kept the postgresql Nomad job from reaching
steady-state green on gamma under the podman driver:

- provision-host-dirs: a raw_exec root prestart creates the durable + socket
  bind sources (/var/lib/postgresql/16/verself, /var/run/postgresql). The
  podman driver does not auto-create bind-mount sources (unlike docker -v), and
  /var/run is tmpfs (wiped each boot), so this runs per-alloc.
- server/reconcile gain real restart stanzas (mode=delay) and the group a
  self-healing reschedule, so a slow first start no longer fails the reconcile
  sidecar and SIGKILLs the alloc (was Exit 143, "Sibling reconcile failed").
- clear_stale_postmaster replaces wait_for_previous_postmaster: under init=true
  the wrapper is itself pid 2, so the old os.kill(pid,0) liveness check
  self-aliased and never cleared a stale postmaster.pid. The socket is the
  namespace-independent authority for whether a postmaster is serving.
- wait_for_ready deadline 60s -> 300s to absorb crash recovery.
- stanza-create is unconditional + idempotent; the missing-stanza string
  heuristic skipped it for a fresh empty-list repo and broke check.
- pgBackRest TLS to R2: mount host /etc/ssl/certs and set repo1-s3-ca-file in
  the generated config. The minimal image has no CA roots and OpenSSL's default
  path is not /etc/ssl/certs; the config setting covers the recovery commands
  and postgres's archive_command alike.

CRD ephemeral paths (config/log/report/pgbackrest) move to /alloc so only the
durable dataDir and cross-component socketDir remain host bind-mounts.

Verified on gamma: report.json status=healthy, 16 service databases + 17 roles
created, pgBackRest full backup 20260608-091822F status ok.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
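The clear_stale_postmaster change above treats the Unix socket, not the recorded pid, as the liveness authority. This is a minimal sketch of that idea, not the code in nomad.hcl: the function signatures, the connect-based probe, and the default port are assumptions. Only the names clear_stale_postmaster, postmaster.pid, and the os.kill(pid, 0) failure mode come from the commit message.

```python
import os
import socket
from pathlib import Path


def postmaster_is_serving(socket_dir: str, port: int = 5432, timeout: float = 1.0) -> bool:
    """True only if something accepts connections on the PostgreSQL socket."""
    path = os.path.join(socket_dir, f".s.PGSQL.{port}")
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(timeout)
    try:
        probe.connect(path)
        return True
    except OSError:  # missing file, stale socket file, refused, or timeout
        return False
    finally:
        probe.close()


def clear_stale_postmaster(data_dir: str, socket_dir: str, port: int = 5432) -> bool:
    """Remove postmaster.pid when no postmaster is serving; report if removed."""
    pid_file = Path(data_dir) / "postmaster.pid"
    if not pid_file.exists():
        return False
    if postmaster_is_serving(socket_dir, port):
        return False
    pid_file.unlink()
    return True
```

The pid check fails in a fresh PID namespace because a pid number recorded by a previous container can coincide with a live process in the new one, including the wrapper itself; `os.kill(os.getpid(), 0)` always succeeds, so a pid-based check would report a dead postmaster as alive.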
The new-binary skill classifies how every binary is pinned, declared, and run
(source axis x run axis; Level 1 containerized vs Level 2 native). This adds the
governing rule for recover binaries: they own domain-state convergence only.

Install, host provisioning, and process supervision belong to the OCI image,
the golden-image build, and Nomad respectively, and are deleted (not relocated,
and never pushed into Ansible) when a component goes Level 1. Stateless
components delete their recover binary outright; stateful components collapse it
to the domain FSM (postgresql/internal/recoveryfsm is the reference).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The PostgreSQL inventory row and the info-decode observed-error row were stale.
Recon showed the info-decode UnicodeDecodeError already resolved (setup exits 0),
which exposed a downstream chain — postmaster PID-namespace staleness under
init=true, the reconcile/restart supervision race (Exit 143), host-dir
provisioning, and pgBackRest R2 CA trust — now fixed in nomad.hcl and verified:
report.json status=healthy, 16 databases + 17 roles, full backup ok.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>