feat(postgres): resource runtime, bindings, and Local controller #89
d54094b to
d0ad57a
Compare
Greptile Summary
This PR introduces the Postgres resource for the Alien platform, covering the full stack from cloud/local controllers through bindings, the workload SDK, and CI for shipping a pgvector binary. The DB password runs through two carefully separated channels: the synced
Confidence Score: 5/5
Safe to merge: the critical password-leak fix is correct and well-tested across both the Rust and TypeScript runtimes. The core security invariant (password never reaching persisted/synced state) is implemented correctly in
| Filename | Overview |
|---|---|
| crates/alien-local/src/postgres_manager.rs | New file: manages embedded Postgres processes. Password stored in 0600 metadata; pgvector install now caches correctly. Lock-per-database in restart_exited is correct. Minor: pgvector archive downloaded without a checksum guard. |
| crates/alien-infra/src/postgres/local.rs | New file: LocalPostgresController — create/ready/update/delete flows. Password correctly stripped in get_binding_params and re-resolved from 0600 metadata in resolve_binding_params. #[serde(skip)] binding re-populated lazily in ready handler. |
| packages/sdk/src/bindings/postgres.ts | New file: getPostgresConnection resolves cloud secrets lazily. GCP path uses getProjectId() (workload project only); failures mis-attributed to accessSecretVersion. Encoding parity with Rust resolver is well-tested. |
| crates/alien-core/src/bindings/postgres.rs | New file: PostgresBinding enum. Local/External carry inline passwords with redacting Debug impls. Cloud variants carry only secret locators. Serialization tests are thorough. |
| crates/alien-bindings/src/traits.rs | Adds Postgres trait, PostgresConnectionParams, SslMode, and encode_userinfo. Password redacted in Debug. encode_userinfo matches the TS encodeUserinfo byte-for-byte. |
| crates/alien-infra/src/core/environment_variables.rs | Switches to async resolve_binding_params so Local Postgres can re-read password from manager metadata after deserialize. Adds ExternalBinding::Postgres serialization. Correct. |
| crates/alien-bindings/src/providers/postgres/local.rs | New file: LocalPostgres resolver for Local/External variants; cloud variants rejected by design. Cross-runtime encoding tests pin the TS–Rust contract. |
| .github/workflows/release-pgvector.yml | New workflow: builds and publishes per-(PG-major × target) pgvector zips via OIDC. No published checksum artifact for consumer verification. |
| crates/alien-local/src/local_bindings_provider.rs | Wires LocalPostgresManager into LocalBindingsProvider; spawns monitor task and registers for graceful shutdown. |
| crates/alien-core/src/resources/postgres.rs | New file: Postgres resource with validate_update enforcing immutable id/backend and monotonic storage/version. Defaults and outputs well-formed. |
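The `validate_update` rule summarized in the last row (immutable id/backend, monotonic storage/version) can be illustrated with a minimal sketch. The `PostgresSpec` struct, its field names, and the `String` error type are assumptions made for this example; the real resource type in `crates/alien-core/src/resources/postgres.rs` may look quite different.

```rust
/// Hypothetical, simplified stand-in for the Postgres resource config.
#[derive(Debug, Clone, PartialEq)]
pub struct PostgresSpec {
    pub id: String,
    pub backend: String,
    pub storage_gb: u32,
    pub major_version: u32,
}

/// Rejects an update that changes an immutable field or moves a
/// monotonic field backwards. Returns the first violation found.
pub fn validate_update(current: &PostgresSpec, next: &PostgresSpec) -> Result<(), String> {
    if current.id != next.id {
        return Err("id is immutable".to_string());
    }
    if current.backend != next.backend {
        return Err("backend is immutable".to_string());
    }
    if next.storage_gb < current.storage_gb {
        return Err(format!(
            "storage cannot shrink from {} GB to {} GB",
            current.storage_gb, next.storage_gb
        ));
    }
    if next.major_version < current.major_version {
        return Err(format!(
            "version cannot be downgraded from {} to {}",
            current.major_version, next.major_version
        ));
    }
    Ok(())
}
```

Growing storage or bumping the version passes; shrinking, downgrading, or switching backend is rejected before any cloud call is made.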
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant C as LocalPostgresController
participant M as LocalPostgresManager
participant E as EnvironmentVariableBuilder
participant W as Workload (env var)
participant SDK as TS SDK (getPostgresConnection)
Note over C,M: CREATE flow
C->>M: start_postgres(id, version)
M->>M: load_or_init_metadata (password generated once, 0600 metadata.json)
M->>M: boot() setup, start, create_database, install_pgvector
C->>M: get_binding(id)
M-->>C: PostgresBinding::Local with password
C->>C: "self.binding = Some(binding)"
Note over C,E: SYNCED channel (no password)
C->>E: get_binding_params()
Note right of C: strips password key before returning
E-->>E: remote_binding_params password-free to control plane
Note over C,W: WORKER-ENV channel (full binding)
C->>E: resolve_binding_params(ctx, id)
Note right of C: re-reads manager metadata if binding is None
E->>W: "ALIEN_NAME_BINDING = JSON with password"
Note over W,SDK: Workload connects
W->>SDK: getPostgresConnection(my-db)
SDK->>SDK: reads env var, parses JSON
SDK-->>W: PostgresConnection with host port user password ssl connectionString
Reviews (2): Last reviewed commit: "fix(postgres): address review feedback o..."
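The summary notes that `encode_userinfo` in Rust and `encodeUserinfo` in TypeScript must agree byte-for-byte when building the `connectionString`. A minimal sketch of that kind of encoder follows. It assumes the encoder percent-encodes every byte outside the RFC 3986 unreserved set; the exact character set the PR uses is not shown in this conversation, and the names `percent_encode_userinfo` and `connection_string` are hypothetical.

```rust
/// Percent-encodes every byte outside the RFC 3986 "unreserved" set
/// (ALPHA / DIGIT / "-" / "." / "_" / "~"). Assumption: the real
/// encode_userinfo may allow a different set of characters.
pub fn percent_encode_userinfo(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for &byte in input.as_bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Uppercase hex keeps the output stable across runtimes.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds a postgres:// URL with encoded user and password.
/// Host and database are passed through untouched in this sketch.
pub fn connection_string(user: &str, password: &str, host: &str, port: u16, database: &str) -> String {
    format!(
        "postgres://{}:{}@{}:{}/{}",
        percent_encode_userinfo(user),
        percent_encode_userinfo(password),
        host,
        port,
        database
    )
}
```

Encoding per byte rather than per character is what makes a cross-runtime parity test meaningful: a multi-byte UTF-8 character must expand to the same `%XX` sequence on both sides.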
        Ok(result)
    }

    async fn load_postgres(&self, binding_name: &str) -> Result<Arc<dyn Postgres>> {
        if let Some(cached) = self
            .get_cached::<Arc<dyn Postgres>>("postgres", binding_name)
            .await
        {
            return Ok(cached);
        }

        let binding_json = self.bindings.get(binding_name).ok_or_else(|| {
            AlienError::new(ErrorData::BindingConfigInvalid {
                binding_name: binding_name.to_string(),
                reason: "Binding not found".to_string(),
            })
        })?;

        let binding: PostgresBinding = serde_json::from_value(binding_json.clone())
            .into_alien_error()
            .context(ErrorData::BindingConfigInvalid {
                binding_name: binding_name.to_string(),
                reason: "Failed to parse Postgres binding".to_string(),
            })?;

        let result: Arc<dyn Postgres> = Arc::new(
            crate::providers::postgres::local::LocalPostgres::from_binding(binding_name, &binding)?,
        );

        self.put_cache("postgres", binding_name, result.clone())
            .await;
        Ok(result)
    }

    async fn load_queue(&self, binding_name: &str) -> Result<Arc<dyn Queue>> {
        if let Some(cached) = self
**Cloud-variant `load_postgres` silently broken for Rust cloud workers**
load_postgres in the production BindingsProvider always delegates to LocalPostgres::from_binding, which explicitly rejects Aurora, CloudSql, and FlexibleServer variants with an error at runtime. The same applies to GrpcBindingsProvider::load_postgres. Any Rust workload linked to a cloud-backed Postgres resource and calling the Postgres binding API will receive:
BindingConfigInvalid: "Aurora (AWS) Postgres bindings are resolved in-process by the workload SDK, not this Rust provider"
This is documented as intentional, but the error arrives at runtime with no compile-time signal. The BindingsProviderApi trait forces implementors to provide a load_postgres method, making it look like a fully-supported operation when it is not for cloud backends. A clarifying note in the trait doc (or a separate trait gate) would surface this constraint to Rust workload authors before they reach a production error.
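One possible shape for the "separate trait gate" suggested here is a narrowed type that only the Rust-resolvable variants can convert into, so a provider that accepts the narrowed type cannot be handed a cloud binding by accident. This is a sketch, not the PR's code: the variant fields, the `InProcessBinding` name, and the error wording for the Cloud SQL and Flexible Server cases are assumptions (only the Aurora message is quoted in the review).

```rust
use std::convert::TryFrom;

/// Simplified stand-in for the PR's PostgresBinding enum; fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum PostgresBinding {
    Local { host: String, port: u16 },
    External { host: String, port: u16 },
    Aurora { secret_locator: String },
    CloudSql { secret_locator: String },
    FlexibleServer { secret_locator: String },
}

/// Hypothetical narrowed type: only the variants the Rust provider can resolve.
#[derive(Debug, Clone, PartialEq)]
pub enum InProcessBinding {
    Local { host: String, port: u16 },
    External { host: String, port: u16 },
}

fn cloud_error(backend: &str) -> String {
    format!(
        "{} Postgres bindings are resolved in-process by the workload SDK, not this Rust provider",
        backend
    )
}

impl TryFrom<PostgresBinding> for InProcessBinding {
    type Error = String;

    /// The single place where cloud variants are rejected.
    fn try_from(binding: PostgresBinding) -> Result<Self, Self::Error> {
        match binding {
            PostgresBinding::Local { host, port } => Ok(InProcessBinding::Local { host, port }),
            PostgresBinding::External { host, port } => Ok(InProcessBinding::External { host, port }),
            PostgresBinding::Aurora { .. } => Err(cloud_error("Aurora (AWS)")),
            PostgresBinding::CloudSql { .. } => Err(cloud_error("Cloud SQL (GCP)")),
            PostgresBinding::FlexibleServer { .. } => Err(cloud_error("Flexible Server (Azure)")),
        }
    }
}
```

The rejection still happens at runtime when the binding JSON is parsed, but the constraint is now visible in a type signature instead of only in a doc comment.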
Addressed the review feedback in 5a0b31f (and ran our internal final-review on the changes):
|
Four pre-existing Azure/GCP infrastructure bugs, surfaced while getting the Postgres cloud e2e green. They are independent of the Postgres feature but sit underneath it, so the runtime (#89) and setup (#90) stack on this PR. Splitting them out lets them land and be reviewed on their own.

## What was broken, and what I did

Four small, self-contained fixes:

- **GCP network import dropped the subnet name.** The importer built everything from the subnet but never recorded `subnetwork_name`, so VPC egress to a private PSC Cloud SQL had nothing to resolve against. It now parses the name out of the subnet self-link on import.
- **Azure build read a frozen import as fatal drift.** An imported (frozen) build arrives with its managed environment, identity, and `resource_prefix` unset; the controller treated those as drift and failed. The heartbeat now resolves all three from their dependencies, so an imported build can submit jobs without waiting for an update.
- **Azure worker pointed the DNS CNAME at the wrong host.** It targeted the public display FQDN, which can equal the record name and make the record point at itself; the provider rejects that as a loop and the worker hangs waiting for DNS. It now targets the Container App's own ingress host.
- **Azure Terraform left the Container Apps environment outside the VNet.** Added the VNet integration (and the matching network emitter) so the environment lands in the stack VNet.

## Files touched

- `crates/alien-infra/src/network/gcp_import.rs`: subnet name on import
- `crates/alien-infra/src/build/azure.rs`: heartbeat resolves env, identity, prefix
- `crates/alien-infra/src/worker/azure.rs` (+ `azure_import.rs`): CNAME targets the ingress host
- `crates/alien-terraform/src/emitters/azure/*`: Container Apps environment VNet integration
- `crates/alien-infra/tests/importers.rs` + the azure generator/snapshot tests: coverage

## How I tested

- `cargo test -p alien-infra` and the `alien-terraform` azure generator + snapshot tests pass.
- Validated end to end as part of the full Postgres cloud e2e on AWS, GCP, and Azure (the stack on top of this PR provisions, connects, and runs pgvector).

Base of the stacked Postgres work; #89 (runtime) and #90 (setup) build on it. Supersedes the infrastructure portion of the original combined PR.
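The GCP fix parses the subnet name out of the subnet self-link. A minimal sketch of that parse, assuming the usual self-link layout ending in `.../regions/{region}/subnetworks/{name}`; the function name is hypothetical and the importer's real code is not shown here.

```rust
/// Extracts the subnetwork name from a GCP self-link such as
/// https://www.googleapis.com/compute/v1/projects/p/regions/r/subnetworks/my-subnet
/// Returns None when the link does not end in `subnetworks/{name}`.
pub fn subnetwork_name_from_self_link(self_link: &str) -> Option<&str> {
    // Walk the path segments from the right: name first, then its collection.
    let mut segments = self_link.trim_end_matches('/').rsplit('/');
    let name = segments.next()?;
    let collection = segments.next()?;
    if collection == "subnetworks" && !name.is_empty() {
        Some(name)
    } else {
        None
    }
}
```

Checking the preceding `subnetworks` segment keeps the parser from silently accepting a network or instance self-link.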
…ng-params channel
…s, encoding-parity test, comment discipline)
…nced channel
get_binding_params feeds the synced, persisted remote_binding_params, so it strips the Local password. A linked out-of-process worker still needs the live connection, so resolve_binding_params delivers the full binding on the worker-env channel, which is set on the dependent's resource and never persisted in Alien's synced state.
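The split this commit describes (a password-free view for synced state, a full view for the worker environment) can be sketched as follows. The `LocalBinding` struct, its fields, and the method names `synced_params` and `worker_env_params` are assumptions for illustration; they are not the PR's actual types.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical, simplified local binding carrying an inline password.
pub struct LocalBinding {
    pub host: String,
    pub port: u16,
    pub user: String,
    pub password: String,
}

/// Debug never prints the password, so logs and error chains stay clean.
impl fmt::Debug for LocalBinding {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("LocalBinding")
            .field("host", &self.host)
            .field("port", &self.port)
            .field("user", &self.user)
            .field("password", &"<redacted>")
            .finish()
    }
}

impl LocalBinding {
    /// Worker-env channel: the full binding, password included.
    pub fn worker_env_params(&self) -> BTreeMap<String, String> {
        let mut params = BTreeMap::new();
        params.insert("host".to_string(), self.host.clone());
        params.insert("port".to_string(), self.port.to_string());
        params.insert("user".to_string(), self.user.clone());
        params.insert("password".to_string(), self.password.clone());
        params
    }

    /// Synced channel: same map with the password key removed.
    pub fn synced_params(&self) -> BTreeMap<String, String> {
        let mut params = self.worker_env_params();
        params.remove("password");
        params
    }
}
```

Deriving the synced view from the full one by removal means a newly added non-secret field appears in both channels automatically, while the secret has exactly one place where it is dropped.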
- install_pgvector skips the networked install when the pinned pgvector version is already present, so recovering an existing database stays off the release host; a version bump still falls through and reinstalls the pin.
- restart_exited collects exited IDs under a brief lock then re-acquires per-database, so the monitor no longer holds the runtimes lock across the whole restart batch.
- Document that load_postgres is local-only (cloud Postgres is resolved by the TypeScript SDK; a Rust worker that requests it gets a runtime error).
- Document that the external binding's ssl is always false (node-postgres has no prefer mode).
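The `restart_exited` locking change can be sketched with a plain `std::sync::Mutex`. This is an illustration under assumptions: the real manager most likely uses an async lock and real process handles, and `Manager`, `State`, and the in-place state flip standing in for a restart are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Running,
    Exited,
}

/// Hypothetical stand-in for the manager's runtime table.
pub struct Manager {
    runtimes: Mutex<HashMap<String, State>>,
}

impl Manager {
    pub fn new(initial: HashMap<String, State>) -> Self {
        Manager { runtimes: Mutex::new(initial) }
    }

    /// Restarts every exited database and returns the restarted IDs, sorted.
    pub fn restart_exited(&self) -> Vec<String> {
        // Phase 1: hold the lock only long enough to snapshot the exited IDs.
        let exited: Vec<String> = {
            let guard = self.runtimes.lock().unwrap();
            guard
                .iter()
                .filter(|(_, state)| **state == State::Exited)
                .map(|(id, _)| id.clone())
                .collect()
        };

        let mut restarted = Vec::new();
        for id in exited {
            // Phase 2: re-acquire per database, so other callers can
            // interleave between restarts instead of waiting for the batch.
            let mut guard = self.runtimes.lock().unwrap();
            // Re-check: the entry may have been removed or restarted meanwhile.
            if let Some(state) = guard.get_mut(&id) {
                if *state == State::Exited {
                    *state = State::Running; // stand-in for the actual restart
                    restarted.push(id);
                }
            }
        }
        restarted.sort();
        restarted
    }
}
```

The re-check in phase 2 is the part that matters: once the lock is released between phases, the snapshot is only a hint.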
biome check flagged pre-existing formatting (the discriminated-union schema and the makeConnection objects on single lines, percentEncode using string concatenation). Mechanical biome check --write fixes; no behavior change.
theseus-rs tarballs unpack into a top-level postgresql-<ver>-<triple>/ dir, so the pgdist/bin/pg_config path (and PGROOT=pgdist on Windows) never resolved and the build step failed. release-pgvector is dispatch-only and had never run, so the bug was never caught.
…r metadata
A worker or daemon linked to a local Postgres received its binding with the generated password inline, and the local manager persisted the whole env to a world-readable (0644) metadata.json. This contradicted the feature's own design, where the controller strips the password from synced state and the Postgres manager writes its own metadata 0600.

The binding now reaches the running process but is re-resolved live on every (re)start instead of being persisted. A runtime-only BindingsProviderApi channel resolves the local Postgres binding from the manager's 0600 metadata; the worker and daemon managers strip it from the persisted env and merge it into the live process env; only the non-secret link names persist, so recovery re-resolves. The metadata file is written 0600 as defense in depth.

External (Remote Access) Postgres inlines a password with no local manager to re-resolve it, so it is rejected on the local platform rather than leaked.
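Writing a metadata file with 0600 permissions on Unix can be sketched as below. This is a Unix-only illustration using the standard library; the function name `write_private` is hypothetical and the manager's actual write path is not shown in this conversation.

```rust
use std::fs::{OpenOptions, Permissions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Writes `contents` to `path` so that only the owner can read or write it.
pub fn write_private(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600) // applies only when the file is newly created
        .open(path)?;
    // An existing file keeps its old mode on open, so tighten it explicitly
    // before any secret bytes are written.
    file.set_permissions(Permissions::from_mode(0o600))?;
    file.write_all(contents)?;
    file.sync_all()
}
```

Setting the mode before writing closes the window in which a pre-existing 0644 file would briefly hold the secret with loose permissions.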
GCP and Azure pick the smallest tier that satisfies both cpu and memory; AWS (Aurora Serverless v2) sizes from memory and ignores cpu.
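The GCP/Azure rule (smallest tier that satisfies both cpu and memory) can be sketched as a filter followed by a minimum. The `Tier` struct, the placeholder tier names, and the ordering used to define "smallest" (vCPUs first, then memory) are assumptions for this example; real tier catalogs and the PR's ordering may differ.

```rust
/// Hypothetical tier description; names and sizes are placeholders.
#[derive(Debug, Clone, PartialEq)]
pub struct Tier {
    pub name: &'static str,
    pub vcpus: u32,
    pub memory_gib: u32,
}

/// Returns the smallest tier meeting both requirements, or None if no tier fits.
/// "Smallest" is defined here as fewest vCPUs, ties broken by least memory.
pub fn smallest_satisfying(tiers: &[Tier], cpu: u32, memory_gib: u32) -> Option<&Tier> {
    tiers
        .iter()
        .filter(|t| t.vcpus >= cpu && t.memory_gib >= memory_gib)
        .min_by_key(|t| (t.vcpus, t.memory_gib))
}
```

A request that is light on cpu but heavy on memory still lands on a larger tier, because both constraints must hold at once.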
a13f0ea to
8d624b0
Compare
Local Postgres delivered a linked worker's password through a dedicated resolve_binding_params method on the base ResourceController trait, overridden only by Local Postgres. That duplicated the runtime-only binding path the worker/daemon managers already use to re-resolve the password live and keep it out of persisted metadata, so the method was dead weight. Delete the trait method, its macro plumbing, and the Local override; the env-builder now uses the plain get_binding_params, so the local Postgres password never even enters the deploy-computed env. Worker and daemon still receive it live via the runtime-only binding path; behavior is unchanged.
Local Postgres is an embedded native process bound to 127.0.0.1, so worker/daemon (host processes) connected fine but a local container could not reach it (a container's host.docker.internal lands on the docker bridge gateway, not the host loopback). Have pg also listen on the docker bridge gateway when present, found by inspecting the default bridge, restricting it to the private 172.16.0.0/12 range, and bind-testing it, so pg never lands on a routable interface and stays loopback-only on Docker Desktop and when Docker is absent. A pg_hba rule scopes that subnet to scram auth. Never 0.0.0.0, so pg stays off every public/LAN interface. Re-add the container binding delivery and the 127.0.0.1 to host.docker.internal rewrite so a linked container receives the full binding. Adds an ignored container-to-Postgres integration test asserting a same-stack container connects, a wrong password is refused, and the LAN interface is refused.
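The gateway selection described in this commit (accept the docker bridge gateway only if it sits in 172.16.0.0/12 and passes a bind test, otherwise stay loopback-only) can be sketched as follows. Discovering the gateway by inspecting the default bridge is left out and taken as an input; the function names are hypothetical.

```rust
use std::net::{Ipv4Addr, TcpListener};

/// True when `addr` falls inside the private 172.16.0.0/12 range,
/// i.e. 172.16.0.0 through 172.31.255.255.
pub fn in_docker_private_range(addr: Ipv4Addr) -> bool {
    let [a, b, _, _] = addr.octets();
    a == 172 && (16..=31).contains(&b)
}

/// Bind test: the address must belong to a local interface to succeed.
/// Port 0 lets the OS pick a free port, so nothing is reserved for long.
pub fn can_bind(addr: Ipv4Addr) -> bool {
    TcpListener::bind((addr, 0)).is_ok()
}

/// Always listens on loopback; adds the gateway only when it is both
/// in the private range and bindable. Never returns 0.0.0.0.
pub fn listen_addresses(gateway: Option<Ipv4Addr>) -> Vec<Ipv4Addr> {
    let mut addrs = vec![Ipv4Addr::LOCALHOST];
    if let Some(gw) = gateway {
        if in_docker_private_range(gw) && can_bind(gw) {
            addrs.push(gw);
        }
    }
    addrs
}
```

The range check runs before the bind test, so a routable address is rejected without ever opening a socket on it.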
Postgres setup, stacked on the runtime (#89): the cloud permission sets, the preflights that prepare each cloud, Azure private networking, the secret redaction on cloud create requests, and the end-to-end coverage that exercises Postgres on all three clouds.

What happens when a stack with a Postgres provisions:

1. Preflights enable the required cloud services (Azure / GCP) and provision the secrets vault for the connection password.
2. The permission sets grant the management identity a prefix-scoped set of management-plane actions. **← the heart: management-plane only, never DB-contents access**
3. On Azure, private networking creates the dedicated Private Endpoint subnet (and the importer records it) so the database is reachable only from inside the stack.
4. The end-to-end apps then provision, connect, and run a `pgvector` query against the real cloud database.

This PR adds the setup half of the resource: it turns the runtime model and bindings (#89) into something that can actually be provisioned, secured, and torn down on a real cloud account.

## What's in the layer

- **Permission sets** (`alien-permissions/permission-sets/postgres/*`): `data-access` (the worker's scoped read of the connection secret), `provision`, `management`, `heartbeat`. The management / provision / heartbeat sets are management-plane only and never grant access to the DB contents.
- **Preflights** (`alien-preflights`): enable the required cloud services, provision the secrets vault, and assert the network prerequisites before provisioning starts.
- **Azure private networking** (`alien-infra/src/network/azure*`): the dedicated Private Endpoint subnet and the importer that records it.
- **Cloud create-request redaction** (`alien-client-core/src/request_utils.rs`): the master password rides the cloud create-request body; it is scrubbed before that body can land in an error chain or synced state.
- **End-to-end coverage** (`alien-test`, `tests/e2e/test-apps/*`): the comprehensive Rust and TypeScript apps now bind a Postgres and run a query.

## How I tested

- `cargo test` across `alien-permissions` (incl. the AWS ABAC + permission-set validation tests), `alien-preflights`, `alien-infra`, `alien-terraform` (azure generator/snapshots), and `alien-test`: all pass.
- End to end on AWS, GCP, and Azure: provision the database → worker connects → `CREATE EXTENSION vector` + a `pgvector` query succeeds → teardown. Run from the comprehensive Rust and TypeScript test apps.

Security walk (this PR touches permissions and the secret):

- The management-plane permission sets grant no DB-contents access; the sensitive-data validation tests assert this.
- The master password on a cloud create request is redacted before it can reach a persisted error or synced state.
- Permission scopes are pinned to the stack / resource prefix, never broad cloud admin.
- Nothing turned up.
The Postgres resource runtime, stacked on the infra fixes (#88): the resource model and bindings, the Local (developer) controller, the cloud client SDKs for the managed Postgres backends, and the TypeScript SDK surface. The part worth careful review is how the generated DB password is handled.
How the local DB password flows
The password is generated once and has to reach a linked worker for a direct connection, without ever landing in control-plane state. So it runs through two separate channels:
Step 1 is the property that matters, and review caught a real gap there: the password was reaching the synced channel. Fixed by stripping it in `get_binding_params`; the `#[serde(skip)]` on the field alone was not enough.
What's in the layer
- `alien-core`: the Postgres resource, its binding shapes, the heartbeat data.
- `alien-local` + `alien-infra/src/postgres/local.rs`: runs an embedded Postgres for local development.
- alien-aws-clients/gcp/azure: thin wrappers over the managed cloud Postgres APIs (Aurora, Cloud SQL, Flexible Server).
- `packages/core`, `packages/sdk`: the generated schemas and the TypeScript binding.
How I tested
- `cargo test` across the touched crates (`alien-core`, `alien-bindings`, `alien-local`, `alien-infra`), including the binding round-trip and encoding-parity tests.
- …alien-local).
Security walk for the password (this PR touches the secret):