
feat(postgres): resource runtime, bindings, and Local controller#89

Merged
alongubkin merged 18 commits into main from feat/alien-35-oss-2-foundation on Jul 2, 2026

Conversation

@ItamarZand88 ItamarZand88 commented Jun 26, 2026

Contributor

The Postgres resource runtime, stacked on the infra fixes (#88): the resource model and bindings, the Local (developer) controller, the cloud client SDKs for the managed Postgres backends, and the TypeScript SDK surface. The part worth careful review is how the generated DB password is handled.

How the local DB password flows

The password is generated once and has to reach a linked worker for a direct connection, without ever landing in control-plane state. So the binding travels over two separate channels, and only the second one carries the password:

  1. It is stripped from the synced binding params and never written to serialized controller state — so it cannot reach control-plane storage or status responses.
  2. It is handed to the worker at runtime through the worker's environment, resolved per request, never persisted.

Channel 1 is the property that matters, and review caught a real gap there: the password was reaching the synced channel. The fix strips it in get_binding_params; the #[serde(skip)] on the field alone was not enough.
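A minimal sketch of the two-channel split described above. The map type, the `password` key, and both function names are assumptions made for illustration; the real types live in alien-core and alien-infra and are not shown in this description.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the binding params (not the alien-core type).
type BindingParams = BTreeMap<String, String>;

/// Assumed key name for the secret; the real field name is not shown here.
const PASSWORD_KEY: &str = "password";

/// Synced channel: returns a copy with the password removed, so nothing that
/// serializes or persists the result can ever see it.
fn synced_binding_params(full: &BindingParams) -> BindingParams {
    let mut out = full.clone();
    out.remove(PASSWORD_KEY);
    out
}

/// Runtime channel: the full binding, intended only for the worker's
/// environment and never for persisted state.
fn runtime_binding_params(full: &BindingParams) -> BindingParams {
    full.clone()
}

/// Illustrative binding with made-up values.
fn sample_binding() -> BindingParams {
    let mut p = BindingParams::new();
    p.insert("host".to_string(), "127.0.0.1".to_string());
    p.insert("port".to_string(), "5432".to_string());
    p.insert(PASSWORD_KEY.to_string(), "generated-once".to_string());
    p
}
```

Stripping at the point where the synced params are produced makes the guarantee independent of how any downstream struct is serialized, which is the reason a field-level skip attribute alone is a weaker defense.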

What's in the layer

  • Resource model + bindings (alien-core) — the Postgres resource, its binding shapes, the heartbeat data.
  • The Local controller (alien-local + alien-infra/src/postgres/local.rs) — runs an embedded Postgres for local development.
  • Cloud client SDKs (alien-aws-clients / gcp / azure) — thin wrappers over the managed cloud Postgres APIs (Aurora, Cloud SQL, Flexible Server).
  • SDK surface (packages/core, packages/sdk) — the generated schemas and the TypeScript binding.

How I tested

  • cargo test across the touched crates (alien-core, alien-bindings, alien-local, alien-infra), including the binding round-trip and encoding-parity tests.
  • The local embedded-Postgres integration test (alien-local).
  • Exercised end to end in the full Postgres cloud e2e. The setup layer (#90, feat(postgres): setup permissions, private networking, and e2e) stacks on this PR.

Security walk for the password (this PR touches the secret):

  • Synced and persisted state never carry the password — the round-trip test asserts it is absent from the serialized binding params.
  • The runtime worker-env delivery is per request and not persisted.
  • Errors from the secret path are redacted (request body scrubbed before it can reach an error chain).
  • The one gap that existed (password on the synced channel) is the one this PR fixes. Nothing else turned up.
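One common way to keep a secret out of error chains and logs is a wrapper type whose `Debug` output is fixed. The sketch below is an assumption about the general technique, not the redaction code in this PR; `Secret`, `CreateRequest`, and `describe_failure` are hypothetical names.

```rust
use std::fmt;

/// Hypothetical secret wrapper; the PR's own redaction types are not shown.
struct Secret(String);

impl Secret {
    /// Explicit accessor, so every read of the raw value is easy to grep for.
    fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Secret {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Never print the inner value, whatever the formatter flags are.
        f.write_str("[REDACTED]")
    }
}

/// Hypothetical request shape carrying a password in its body.
#[derive(Debug)]
struct CreateRequest {
    db_name: String,
    master_password: Secret,
}

/// Builds an error message that embeds the request's Debug form.
fn describe_failure(req: &CreateRequest) -> String {
    format!("create failed for request {:?}", req)
}
```

Because the derived `Debug` on `CreateRequest` delegates to the field's `Debug` impl, any error that formats the request picks up the redaction without extra work at the call site.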

@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-2-foundation branch from d54094b to d0ad57a Compare June 26, 2026 19:58
@ItamarZand88 ItamarZand88 marked this pull request as ready for review June 26, 2026 20:17
greptile-apps Bot commented Jun 26, 2026


Greptile Summary

This PR introduces the Postgres resource for the Alien platform, covering the full stack from cloud/local controllers through bindings, the workload SDK, and CI for shipping a pgvector binary. The DB password runs through two carefully separated channels: the synced get_binding_params strips it before it reaches control-plane state, while resolve_binding_params (never synced) delivers the full binding to each linked workload's environment variable.

  • Rust side: LocalPostgresManager (embedded process lifecycle, pgvector install, monitor-and-recover loop), LocalPostgresController (create/ready/update/delete flows), LocalPostgres binding resolver, and new cloud client stubs for RDS, Cloud SQL, and Azure Flexible Server.
  • TypeScript SDK: getPostgresConnection resolves local/external bindings directly and fetches cloud secrets lazily using the workload's own identity; encoding parity between the Rust and TS resolvers is pinned by cross-runtime tests.
  • CI: release-pgvector.yml builds and publishes per-(PG-major × target) zips to releases.alien.dev; PGVECTOR_VERSION is pinned in lockstep between the workflow and the manager constant.
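To illustrate why encoding parity between the two resolvers matters, here is a sketch of percent-encoding the userinfo part of a connection string. It encodes every byte outside the RFC 3986 unreserved set; the actual `encode_userinfo` in alien-bindings may use a different allowed set, and both function names below are hypothetical.

```rust
/// Percent-encodes every byte outside the RFC 3986 "unreserved" set
/// (ALPHA / DIGIT / "-" / "." / "_" / "~"). Hypothetical helper: the crate's
/// real encoder may leave more characters unescaped.
fn encode_userinfo_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            // Uppercase hex keeps the output byte-for-byte comparable
            // across runtimes.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Assembles a connection string with encoded user and password.
fn connection_string(user: &str, password: &str, host: &str, port: u16, db: &str) -> String {
    format!(
        "postgres://{}:{}@{}:{}/{}",
        encode_userinfo_sketch(user),
        encode_userinfo_sketch(password),
        host,
        port,
        db
    )
}
```

If the two runtimes disagreed on even one character (for example, whether `@` is escaped), a password containing it would yield a connection string that parses differently on each side, which is the failure a cross-runtime test is meant to catch.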

Confidence Score: 5/5

Safe to merge — the critical password-leak fix is correct and well-tested across both the Rust and TypeScript runtimes.

The core security invariant — password never reaching persisted/synced state — is implemented correctly in get_binding_params (strip) and resolve_binding_params (re-read from manager metadata). The two-channel design is clearly documented, encoding parity between Rust and TS is pinned by cross-runtime tests, and the pgvector install-cache guard correctly skips the network download on recovery. The two findings are hardening suggestions that do not affect correctness.

packages/sdk/src/bindings/postgres.ts (GCP same-project assumption) and crates/alien-local/src/postgres_manager.rs (no archive checksum) are worth a second look before expanding to multi-project GCP deployments.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/alien-local/src/postgres_manager.rs | New file: manages embedded Postgres processes. Password stored in 0600 metadata; pgvector install now caches correctly. Lock-per-database in restart_exited is correct. Minor: pgvector archive downloaded without a checksum guard. |
| crates/alien-infra/src/postgres/local.rs | New file: LocalPostgresController, with create/ready/update/delete flows. Password correctly stripped in get_binding_params and re-resolved from 0600 metadata in resolve_binding_params. #[serde(skip)] binding re-populated lazily in ready handler. |
| packages/sdk/src/bindings/postgres.ts | New file: getPostgresConnection resolves cloud secrets lazily. GCP path uses getProjectId() (workload project only); failures mis-attributed to accessSecretVersion. Encoding parity with Rust resolver is well-tested. |
| crates/alien-core/src/bindings/postgres.rs | New file: PostgresBinding enum. Local/External carry inline passwords with redacting Debug impls. Cloud variants carry only secret locators. Serialization tests are thorough. |
| crates/alien-bindings/src/traits.rs | Adds Postgres trait, PostgresConnectionParams, SslMode, and encode_userinfo. Password redacted in Debug. encode_userinfo matches the TS encodeUserinfo byte-for-byte. |
| crates/alien-infra/src/core/environment_variables.rs | Switches to async resolve_binding_params so Local Postgres can re-read password from manager metadata after deserialize. Adds ExternalBinding::Postgres serialization. Correct. |
| crates/alien-bindings/src/providers/postgres/local.rs | New file: LocalPostgres resolver for Local/External variants; cloud variants rejected by design. Cross-runtime encoding tests pin the TS–Rust contract. |
| .github/workflows/release-pgvector.yml | New workflow: builds and publishes per-(PG-major × target) pgvector zips via OIDC. No published checksum artifact for consumer verification. |
| crates/alien-local/src/local_bindings_provider.rs | Wires LocalPostgresManager into LocalBindingsProvider; spawns monitor task and registers for graceful shutdown. |
| crates/alien-core/src/resources/postgres.rs | New file: Postgres resource with validate_update enforcing immutable id/backend and monotonic storage/version. Defaults and outputs well-formed. |
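The update rule summarized for crates/alien-core/src/resources/postgres.rs (immutable id/backend, monotonic storage/version) can be sketched as follows. The struct, its field names, and the error strings are assumptions for illustration; only the rule itself comes from the summary.

```rust
/// Hypothetical, simplified view of the resource's updatable fields.
#[derive(Debug, Clone, PartialEq)]
struct PostgresSpec {
    id: String,
    backend: String,
    storage_gb: u32,
    major_version: u32,
}

/// Rejects updates that change an immutable field or move a monotonic
/// field backwards. Equal values are allowed (a no-op update is valid).
fn validate_update(current: &PostgresSpec, next: &PostgresSpec) -> Result<(), String> {
    if current.id != next.id {
        return Err("id is immutable".to_string());
    }
    if current.backend != next.backend {
        return Err("backend is immutable".to_string());
    }
    if next.storage_gb < current.storage_gb {
        return Err("storage cannot shrink".to_string());
    }
    if next.major_version < current.major_version {
        return Err("version cannot be downgraded".to_string());
    }
    Ok(())
}
```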

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant C as LocalPostgresController
    participant M as LocalPostgresManager
    participant E as EnvironmentVariableBuilder
    participant W as Workload (env var)
    participant SDK as TS SDK (getPostgresConnection)

    Note over C,M: CREATE flow
    C->>M: start_postgres(id, version)
    M->>M: load_or_init_metadata (password generated once, 0600 metadata.json)
    M->>M: boot() setup, start, create_database, install_pgvector
    C->>M: get_binding(id)
    M-->>C: PostgresBinding::Local with password
    C->>C: "self.binding = Some(binding)"

    Note over C,E: SYNCED channel (no password)
    C->>E: get_binding_params()
    Note right of C: strips password key before returning
    E-->>E: remote_binding_params password-free to control plane

    Note over C,W: WORKER-ENV channel (full binding)
    C->>E: resolve_binding_params(ctx, id)
    Note right of C: re-reads manager metadata if binding is None
    E->>W: "ALIEN_NAME_BINDING = JSON with password"

    Note over W,SDK: Workload connects
    W->>SDK: getPostgresConnection(my-db)
    SDK->>SDK: reads env var, parses JSON
    SDK-->>W: PostgresConnection with host port user password ssl connectionString

Reviews (2): Last reviewed commit: "fix(postgres): address review feedback o..."

Comment thread crates/alien-local/src/postgres_manager.rs
Comment thread crates/alien-local/src/postgres_manager.rs
Comment on lines 1197 to 1232
        Ok(result)
    }

    async fn load_postgres(&self, binding_name: &str) -> Result<Arc<dyn Postgres>> {
        if let Some(cached) = self
            .get_cached::<Arc<dyn Postgres>>("postgres", binding_name)
            .await
        {
            return Ok(cached);
        }

        let binding_json = self.bindings.get(binding_name).ok_or_else(|| {
            AlienError::new(ErrorData::BindingConfigInvalid {
                binding_name: binding_name.to_string(),
                reason: "Binding not found".to_string(),
            })
        })?;

        let binding: PostgresBinding = serde_json::from_value(binding_json.clone())
            .into_alien_error()
            .context(ErrorData::BindingConfigInvalid {
                binding_name: binding_name.to_string(),
                reason: "Failed to parse Postgres binding".to_string(),
            })?;

        let result: Arc<dyn Postgres> = Arc::new(
            crate::providers::postgres::local::LocalPostgres::from_binding(binding_name, &binding)?,
        );

        self.put_cache("postgres", binding_name, result.clone())
            .await;
        Ok(result)
    }

    async fn load_queue(&self, binding_name: &str) -> Result<Arc<dyn Queue>> {
        if let Some(cached) = self

P2 Cloud-variant load_postgres silently broken for Rust cloud workers

load_postgres in the production BindingsProvider always delegates to LocalPostgres::from_binding, which explicitly rejects Aurora, CloudSql, and FlexibleServer variants with an error at runtime. The same applies to GrpcBindingsProvider::load_postgres. Any Rust workload linked to a cloud-backed Postgres resource and calling the Postgres binding API will receive:

BindingConfigInvalid: "Aurora (AWS) Postgres bindings are resolved in-process by the workload SDK, not this Rust provider"

This is documented as intentional, but the error arrives at runtime with no compile-time signal. The BindingsProviderApi trait forces implementors to provide a load_postgres method, making it look like a fully-supported operation when it is not for cloud backends. A clarifying note in the trait doc (or a separate trait gate) would surface this constraint to Rust workload authors before they reach a production error.
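The behavior the comment describes, where a resolver accepts some enum variants and rejects the rest only at runtime, looks roughly like the sketch below. The variant payloads, the `LocalPostgres` struct, and the function signature are simplified assumptions; the error text follows the message quoted above.

```rust
/// Simplified stand-in for the binding enum (payloads are hypothetical).
#[derive(Debug)]
enum PostgresBinding {
    Local { host: String, port: u16 },
    External { host: String, port: u16 },
    Aurora,
    CloudSql,
    FlexibleServer,
}

/// Simplified stand-in for the local resolver.
#[derive(Debug, PartialEq)]
struct LocalPostgres {
    endpoint: String,
}

/// Error text shaped like the message quoted in the review comment.
fn cloud_rejection(name: &str) -> String {
    format!(
        "{} Postgres bindings are resolved in-process by the workload SDK, not this Rust provider",
        name
    )
}

/// Accepts Local/External; cloud variants compile fine but fail at runtime.
fn from_binding(binding: &PostgresBinding) -> Result<LocalPostgres, String> {
    match binding {
        PostgresBinding::Local { host, port } | PostgresBinding::External { host, port } => {
            Ok(LocalPostgres { endpoint: format!("{}:{}", host, port) })
        }
        PostgresBinding::Aurora => Err(cloud_rejection("Aurora (AWS)")),
        PostgresBinding::CloudSql => Err(cloud_rejection("Cloud SQL (GCP)")),
        PostgresBinding::FlexibleServer => Err(cloud_rejection("Flexible Server (Azure)")),
    }
}
```

Nothing in the type signature distinguishes the supported variants from the rejected ones, which is why the comment asks for a doc note or a separate trait to make the limit visible before runtime.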

Path: crates/alien-bindings/src/provider.rs
Line: 1197-1232


Comment thread packages/sdk/src/bindings/postgres.ts
@ItamarZand88

Contributor Author

Addressed the review feedback in 5a0b31f (and ran our internal final-review on the changes):

  1. pgvector on boot: install now skips the networked download when the pinned version is already installed (version-aware, so a PGVECTOR_VERSION bump still falls through and reinstalls), so recovering an existing database no longer depends on the release host being reachable.
  2. restart_exited mutex: collects exited IDs under a brief lock then re-acquires per-database, so the monitor no longer holds runtimes across the whole restart batch.
  3. cloud load_postgres: added a trait doc note that cloud Postgres is TypeScript-SDK-only (runtime-rejected, no compile gate). Kept by-design.
  4. external ssl: added a PostgresConnection.ssl JSDoc note that it is always false for external (node-postgres has no prefer mode). Kept as the documented v1 limitation.
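The locking change in item 2 follows a familiar pattern: snapshot the work list under a short lock, release it, then take the lock again for each item. The sketch below uses a synchronous `std::sync::Mutex` and a toy state enum; the real manager is async and its types are not shown, so every name here is an assumption.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Running,
    Exited,
}

/// Toy manager: one lock guarding all database runtimes.
struct Manager {
    runtimes: Mutex<HashMap<String, State>>,
}

impl Manager {
    /// Returns the IDs that were restarted, in sorted order.
    fn restart_exited(&self) -> Vec<String> {
        // Phase 1: hold the lock only long enough to copy the exited IDs.
        let mut exited: Vec<String> = {
            let guard = self.runtimes.lock().unwrap();
            guard
                .iter()
                .filter(|(_, state)| **state == State::Exited)
                .map(|(id, _)| id.clone())
                .collect()
        };
        exited.sort();

        // Phase 2: re-acquire per database, so other callers can interleave
        // between restarts instead of waiting for the whole batch.
        let mut restarted = Vec::new();
        for id in exited {
            let mut guard = self.runtimes.lock().unwrap();
            // Re-check: the entry may have changed while the lock was released.
            if let Some(state) = guard.get_mut(&id) {
                if *state == State::Exited {
                    *state = State::Running;
                    restarted.push(id);
                }
            }
        }
        restarted
    }
}
```

The re-check inside the second phase matters because the snapshot can be stale by the time each entry is revisited.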

alongubkin pushed a commit that referenced this pull request Jul 1, 2026
Four pre-existing Azure/GCP infrastructure bugs, surfaced while getting
the Postgres cloud e2e green. They are independent of the Postgres
feature but sit underneath it, so the runtime (#89) and setup (#90)
stack on this PR. Splitting them out lets them land and be reviewed on
their own.

## What was broken, and what I did
Four small, self-contained fixes:

- **GCP network import dropped the subnet name.** The importer built
everything from the subnet but never recorded `subnetwork_name`, so VPC
egress to a private PSC Cloud SQL had nothing to resolve against. It now
parses the name out of the subnet self-link on import.
- **Azure build read a frozen import as fatal drift.** An imported
(frozen) build arrives with its managed environment, identity, and
`resource_prefix` unset; the controller treated those as drift and
failed. The heartbeat now resolves all three from their dependencies, so
an imported build can submit jobs without waiting for an update.
- **Azure worker pointed the DNS CNAME at the wrong host.** It targeted
the public display FQDN, which can equal the record name and make the
record point at itself; the provider rejects that as a loop and the
worker hangs waiting for DNS. It now targets the Container App's own
ingress host.
- **Azure Terraform left the Container Apps environment outside the
VNet.** Added the VNet integration (and the matching network emitter) so
the environment lands in the stack VNet.
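For the first fix, extracting the subnet name from a self-link amounts to taking the path segment after `subnetworks/`. A minimal sketch, assuming the standard Compute Engine self-link layout (`.../regions/{region}/subnetworks/{name}`); the function name is hypothetical and the importer's real parsing is not shown.

```rust
/// Returns the subnetwork name from a GCP subnet self-link, or None when the
/// input does not end in `subnetworks/{name}`. Hypothetical helper.
fn subnetwork_name_from_self_link(self_link: &str) -> Option<&str> {
    // Walk the path from the right: last segment is the name, the one
    // before it must be the `subnetworks` collection.
    let mut parts = self_link.trim_end_matches('/').rsplit('/');
    let name = parts.next()?;
    let collection = parts.next()?;
    if collection == "subnetworks" && !name.is_empty() {
        Some(name)
    } else {
        None
    }
}
```

Checking the collection segment guards against passing a network self-link by mistake, which would otherwise yield a plausible-looking but wrong name.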

## Files touched
- `crates/alien-infra/src/network/gcp_import.rs` — subnet name on import
- `crates/alien-infra/src/build/azure.rs` — heartbeat resolves env,
identity, prefix
- `crates/alien-infra/src/worker/azure.rs` (+ `azure_import.rs`) — CNAME
targets the ingress host
- `crates/alien-terraform/src/emitters/azure/*` — Container Apps
environment VNet integration
- `crates/alien-infra/tests/importers.rs` + the azure generator/snapshot
tests — coverage

## How I tested
- `cargo test -p alien-infra` and the `alien-terraform` azure generator
+ snapshot tests pass.
- Validated end to end as part of the full Postgres cloud e2e on AWS,
GCP, and Azure (the stack on top of this PR provisions, connects, and
runs pgvector).

Base of the stacked Postgres work; #89 (runtime) and #90 (setup) build
on it. Supersedes the infrastructure portion of the original combined
PR.
Base automatically changed from feat/alien-35-oss-1-fixes to main July 1, 2026 09:57
…s, encoding-parity test, comment discipline)
…nced channel

get_binding_params feeds the synced, persisted remote_binding_params, so it strips the
Local password. A linked out-of-process worker still needs the live connection, so
resolve_binding_params delivers the full binding on the worker-env channel, which is set
on the dependent's resource and never persisted in Alien's synced state.
- install_pgvector skips the networked install when the pinned pgvector version is already present, so recovering an existing database stays off the release host; a version bump still falls through and reinstalls the pin.
- restart_exited collects exited IDs under a brief lock then re-acquires per-database, so the monitor no longer holds the runtimes lock across the whole restart batch.
- Document that load_postgres is local-only (cloud Postgres is resolved by the TypeScript SDK; a Rust worker that requests it gets a runtime error).
- Document that the external binding's ssl is always false (node-postgres has no prefer mode).
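The version-aware install skip from the first bullet can be sketched as a guard in front of the download. How the manager detects the installed version is not shown in this commit message, so the `installed` argument, the function signature, and the version strings in the usage are all assumptions.

```rust
/// Runs `download` only when the installed version differs from the pin.
/// Returns Ok(true) if a download happened, Ok(false) if it was skipped.
/// Hypothetical signature: the real installer's inputs are not shown.
fn install_pgvector<F>(installed: Option<&str>, pinned: &str, download: F) -> Result<bool, String>
where
    F: FnOnce(&str) -> Result<(), String>,
{
    if installed == Some(pinned) {
        // Already at the pin: recovery must not depend on the network.
        return Ok(false);
    }
    // Missing or different version (for example after a pin bump): reinstall.
    download(pinned)?;
    Ok(true)
}
```

With this shape, restarting an existing database while the release host is unreachable succeeds as long as the pinned version is already present, and only a first install or a pin bump needs the network.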
biome check flagged pre-existing formatting (the discriminated-union schema and the makeConnection objects on single lines, percentEncode using string concatenation). Mechanical biome check --write fixes; no behavior change.
theseus-rs tarballs unpack into a top-level postgresql-<ver>-<triple>/ dir, so the pgdist/bin/pg_config path (and PGROOT=pgdist on Windows) never resolved and the build step failed. release-pgvector is dispatch-only and had never run, so the bug was never caught.
…r metadata

A worker or daemon linked to a local Postgres received its binding with the generated password inline, and the local manager persisted the whole env to a world-readable (0644) metadata.json. This contradicted the feature's own design, where the controller strips the password from synced state and the Postgres manager writes its own metadata 0600.

The binding now reaches the running process but is re-resolved live on every (re)start instead of being persisted. A runtime-only BindingsProviderApi channel resolves the local Postgres binding from the manager's 0600 metadata; the worker and daemon managers strip it from the persisted env and merge it into the live process env; only the non-secret link names persist, so recovery re-resolves. The metadata file is written 0600 as defense in depth.

External (Remote Access) Postgres inlines a password with no local manager to re-resolve it, so it is rejected on the local platform rather than leaked.
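Writing a metadata file with owner-only permissions, as the 0600 requirement above calls for, might look like this on Unix. This is a sketch of the general technique with hypothetical function names, not the manager's code, and it does not compile on non-Unix targets.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Writes `contents` to `path` so that only the owner can read or write it.
fn write_private(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        // Applied only when the file is newly created.
        .mode(0o600)
        .open(path)?;
    // Tighten explicitly too, in case the file already existed with a
    // looser mode (for example 0644 from an earlier version).
    file.set_permissions(fs::Permissions::from_mode(0o600))?;
    file.write_all(contents)
}

/// Returns the permission bits (lowest nine) of `path`.
fn mode_bits(path: &Path) -> io::Result<u32> {
    Ok(fs::metadata(path)?.permissions().mode() & 0o777)
}
```

Setting the mode at creation avoids a window in which the file exists with default permissions, and the explicit `set_permissions` call before any secret bytes are written covers files left behind by older versions.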
GCP and Azure pick the smallest tier that satisfies both cpu and memory; AWS
(Aurora Serverless v2) sizes from memory and ignores cpu.
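The GCP/Azure rule, picking the smallest tier that satisfies both cpu and memory, can be sketched as a filter followed by a minimum. The tier names and sizes in the usage are invented for illustration and do not correspond to any real cloud SKU list; ordering by vCPUs first and memory second is also an assumption about what "smallest" means.

```rust
/// Hypothetical tier description; not a real cloud SKU.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Tier {
    name: &'static str,
    vcpus: u32,
    memory_gb: u32,
}

/// Returns the smallest tier whose vCPUs and memory both meet the request,
/// ordering candidates by (vcpus, memory_gb). None if nothing fits.
fn smallest_fitting(tiers: &[Tier], cpu: u32, memory_gb: u32) -> Option<Tier> {
    tiers
        .iter()
        .copied()
        .filter(|t| t.vcpus >= cpu && t.memory_gb >= memory_gb)
        .min_by_key(|t| (t.vcpus, t.memory_gb))
}

/// Made-up tier ladder used only for the example.
fn example_tiers() -> Vec<Tier> {
    vec![
        Tier { name: "tier-s", vcpus: 1, memory_gb: 2 },
        Tier { name: "tier-m", vcpus: 2, memory_gb: 8 },
        Tier { name: "tier-l", vcpus: 4, memory_gb: 16 },
    ]
}
```

A request for 1 vCPU and 4 GB lands on `tier-m` here, because the smallest tier meets the cpu requirement but not the memory one.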
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-2-foundation branch from a13f0ea to 8d624b0 Compare July 1, 2026 10:36
Local Postgres delivered a linked worker's password through a dedicated
resolve_binding_params method on the base ResourceController trait,
overridden only by Local Postgres. That duplicated the runtime-only
binding path the worker/daemon managers already use to re-resolve the
password live and keep it out of persisted metadata, so the method was
dead weight.

Delete the trait method, its macro plumbing, and the Local override; the
env-builder now uses the plain get_binding_params, so the local Postgres
password never even enters the deploy-computed env. Worker and daemon
still receive it live via the runtime-only binding path; behavior is
unchanged.
Local Postgres is an embedded native process bound to 127.0.0.1, so
worker/daemon (host processes) connected fine but a local container
could not reach it (a container's host.docker.internal lands on the
docker bridge gateway, not the host loopback).

Have pg also listen on the docker bridge gateway when present, found by
inspecting the default bridge, restricting it to the private
172.16.0.0/12 range, and bind-testing it, so pg never lands on a
routable interface and stays loopback-only on Docker Desktop and when
Docker is absent. A pg_hba rule scopes that subnet to scram auth. Never
0.0.0.0, so pg stays off every public/LAN interface. Re-add the
container binding delivery and the 127.0.0.1 to host.docker.internal
rewrite so a linked container receives the full binding.

Adds an ignored container-to-Postgres integration test asserting a
same-stack container connects, a wrong password is refused, and the LAN
interface is refused.
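The gateway checks described in this commit (restrict to 172.16.0.0/12, then bind-test) can be sketched with the standard library. How the candidate address is discovered from the default bridge is not shown here, so it is passed in as an argument; all function names are hypothetical.

```rust
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener};

/// True when `ip` falls inside 172.16.0.0/12 (second octet 16 through 31).
fn in_private_172_range(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    o[0] == 172 && (o[1] & 0xF0) == 16
}

/// True when a TCP listener can be opened on `ip` (ephemeral port), meaning
/// the address is actually assigned to a local interface.
fn can_bind(ip: Ipv4Addr) -> bool {
    TcpListener::bind(SocketAddrV4::new(ip, 0)).is_ok()
}

/// Returns the extra listen address only if it passes both checks; otherwise
/// the server should stay loopback-only.
fn extra_listen_address(candidate: Option<Ipv4Addr>) -> Option<Ipv4Addr> {
    candidate.filter(|ip| in_private_172_range(*ip) && can_bind(*ip))
}
```

The range check runs before the bind test, so an address outside the private block is rejected without touching the network stack at all.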
@alongubkin alongubkin merged commit 25ec806 into main Jul 2, 2026
14 checks passed
@alongubkin alongubkin deleted the feat/alien-35-oss-2-foundation branch July 2, 2026 18:10
alongubkin pushed a commit that referenced this pull request Jul 3, 2026
Postgres setup, stacked on the runtime (#89): the cloud permission sets,
the preflights that prepare each cloud, Azure private networking, the
secret redaction on cloud create requests, and the end-to-end coverage
that exercises Postgres on all three clouds.

What happens when a stack with a Postgres provisions:

1. Preflights enable the required cloud services (Azure / GCP) and
provision the secrets vault for the connection password.
2. The permission sets grant the management identity a prefix-scoped set
of management-plane actions. **← the heart: management-plane only, never
DB-contents access**
3. On Azure, private networking creates the dedicated Private Endpoint
subnet (and the importer records it) so the database is reachable only
from inside the stack.
4. The end-to-end apps then provision, connect, and run a `pgvector`
query against the real cloud database.

This PR adds the setup half of the resource: it turns the runtime model
and bindings (#89) into something that can actually be provisioned,
secured, and torn down on a real cloud account.

## What's in the layer
- **Permission sets** (`alien-permissions/permission-sets/postgres/*`) —
`data-access` (the worker's scoped read of the connection secret),
`provision`, `management`, `heartbeat`. The management / provision /
heartbeat sets are management-plane only and never grant access to the
DB contents.
- **Preflights** (`alien-preflights`) — enable the required cloud
services, provision the secrets vault, and assert the network
prerequisites before provisioning starts.
- **Azure private networking** (`alien-infra/src/network/azure*`) — the
dedicated Private Endpoint subnet and the importer that records it.
- **Cloud create-request redaction**
(`alien-client-core/src/request_utils.rs`) — the master password rides
the cloud create-request body; it is scrubbed before that body can land
in an error chain or synced state.
- **End-to-end coverage** (`alien-test`, `tests/e2e/test-apps/*`) — the
comprehensive Rust and TypeScript apps now bind a Postgres and run a
query.

## How I tested
- `cargo test` across `alien-permissions` (incl. the AWS ABAC +
permission-set validation tests), `alien-preflights`, `alien-infra`,
`alien-terraform` (azure generator/snapshots), and `alien-test` — all
pass.
- End to end on AWS, GCP, and Azure: provision the database → worker
connects → `CREATE EXTENSION vector` + a `pgvector` query succeeds →
teardown. Run from the comprehensive Rust and TypeScript test apps.

Security walk (this PR touches permissions and the secret):
- The management-plane permission sets grant no DB-contents access — the
sensitive-data validation tests assert this.
- The master password on a cloud create request is redacted before it
can reach a persisted error or synced state.
- Permission scopes are pinned to the stack / resource prefix, never
broad cloud admin.
- Nothing turned up.

2 participants