feat(postgres): setup permissions, private networking, and e2e #90
Greptile Summary
This PR wires the Postgres resource into the full provisioning lifecycle: cloud permission sets, preflights (service activation + secrets vault), Azure Private Endpoint subnet provisioning, master-password redaction on cloud create requests, and end-to-end test coverage across the Rust and TypeScript test apps.
Confidence Score: 5/5
Safe to merge; the changes are well-scoped, thoroughly tested, and the security-sensitive paths (secret redaction, binding-params gate, permission scopes) are each covered by targeted unit tests. All three cloud permission paths are action-scoped to stack/resource prefixes and explicitly validated by the existing permission-set validation tests. The executor change that prevents inline secrets from reaching synced state has a dedicated binding-sync test. The ARM The four postgres permission-set files under
| Filename | Overview |
|---|---|
| crates/alien-infra/src/network/azure.rs | Adds CreatingPrivateEndpointSubnet / WaitingForPrivateEndpointSubnet states plus BYO-VNet validation; PE subnet uses /24 index-3 CIDR (non-overlapping, tested); fail-fast guard in create_start if stack has Postgres but no PE subnet name on BYO-VNet. |
| crates/alien-infra/src/core/executor.rs | Gates remote_binding_params sync on the resource's remote_access flag, preventing Local Postgres passwords (and other inline-secret bindings) from reaching persisted control-plane state; test coverage in binding_sync_tests.rs confirms both sides. |
| crates/alien-permissions/permission-sets/postgres/provision.jsonc | Comprehensive provision permissions across AWS/GCP/Azure; previously-flagged issues (subnet group ARN gap, missing Azure PE/DNS read actions) are noted in prior threads. |
| crates/alien-permissions/permission-sets/postgres/heartbeat.jsonc | AWS and GCP heartbeat grants are action-scoped; Azure stack binding uses the broad built-in Reader role at RG scope rather than explicit action-level grants, inconsistent with the management set and other cloud heartbeat entries. |
| crates/alien-azure-clients/src/azure/flexible_server.rs | Fixes storageSizeGB serde rename — ARM uses capital "GB" while rename_all = camelCase would emit storageSizeGb, silently breaking GET response deserialization; pinned with both a serialization and a deserialization unit test. |
| crates/alien-preflights/src/mutations/secrets_vault.rs | Changes auto-created secrets vault from remote_access: false to true; needed so sync_secrets_to_vault can resolve the vault locator from synced state after the executor starts gating remote_binding_params on remote_access. |
| crates/alien-infra/src/postgres/local.rs | Adds version field to controller state; update_start now rejects in-place major-version changes with a clear error rather than a silent no-op, with unit tests for both the rejection and the cpu/memory no-op path. |
| crates/alien-deploy-cli/src/commands/up.rs | Two "warn-and-continue" paths in push_initial_setup are converted to fail-fast; release fetch uses ConfigurationError, environment-info collection uses retryable DeploymentFailed. Also adds private_endpoint_subnet_name: None for CLI-YAML BYO-VNet, which will correctly trigger the controller's validation error if a Postgres is present. |
| crates/alien-permissions/permission-sets/postgres/data-access.jsonc | Grants only secretsmanager:GetSecretValue (AWS) and secretmanager.versions.access + secretmanager.secrets.get (GCP) with stack/resource prefix conditions; Azure is intentionally empty (shared vault already covers it) with the design rationale documented. |
| crates/alien-aws-clients/src/aws/rds.rs | Extracts modify_db_cluster_form helper, adds Serverless v2 scaling config (ACU ceiling resize) with unit tests; ServerlessV2ScalingConfiguration deserializes from DescribeDBClusters XML for day-2 memory-change detection. |
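The "/24 index-3 CIDR" wording in the `azure.rs` row can be illustrated with a small sketch. It assumes that the managed VNet prefix is no longer than /24 and that subnets are carved as consecutive /24 blocks counted from index 0. The function name `nth_slash24` and the `10.0.0.0/16` base used in the note below are hypothetical and are not taken from the PR.

```rust
use std::net::Ipv4Addr;

/// Returns the network address of the `index`-th /24 block inside the VNet
/// `base`/`prefix_len`, or `None` if that block would fall outside the VNet.
/// Hypothetical helper: the real controller may compute this differently.
fn nth_slash24(base: Ipv4Addr, prefix_len: u8, index: u32) -> Option<Ipv4Addr> {
    if prefix_len > 24 {
        return None; // a VNet smaller than /24 cannot contain a /24 subnet
    }
    // Number of /24 blocks that fit in the VNet.
    let available = 1u32 << (24 - prefix_len);
    if index >= available {
        return None;
    }
    // Clear the host bits so a non-aligned base still maps to its network.
    let mask = if prefix_len == 0 {
        0
    } else {
        u32::MAX << (32 - prefix_len)
    };
    let network = u32::from(base) & mask;
    // Each /24 block spans 256 addresses, hence the shift by 8 bits.
    Some(Ipv4Addr::from(network + (index << 8)))
}
```

Under these assumptions, `nth_slash24(Ipv4Addr::new(10, 0, 0, 0), 16, 3)` yields `10.0.3.0`, so the Private Endpoint subnet would be `10.0.3.0/24`. Distinct indices give disjoint /24 blocks, which is why index 3 cannot overlap the blocks at indices 0 to 2.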
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Stack with Postgres] --> B{Platform?}
B -- Azure --> C[AzureNetworkController create_start]
C --> D{BYO-VNet?}
D -- Yes, no PE subnet --> E[❌ Fail fast: set privateEndpointSubnetName]
D -- Yes, PE subnet named --> F[Resolve customer PE subnet]
D -- No, managed --> G[CreatingVnet]
G --> G1[CreatingPublicSubnet]
G1 --> G2[CreatingPrivateSubnet]
G2 --> G3[CreatingApplicationGatewaySubnet]
G3 --> H[CreatingPrivateEndpointSubnet NEW]
H --> I[CreatingPublicIp → NAT → NSG → Running]
F --> I
B -- AWS --> J[Preflights: SecretsVaultMutation]
J --> K[Aurora Serverless v2 cluster]
K --> L[data-access: secretsmanager:GetSecretValue scoped to prefix]
B -- GCP --> M[Preflights: sqladmin + compute + secretmanager APIs]
M --> N[Cloud SQL + PSC endpoint]
N --> O[data-access: secretmanager.versions.access scoped to prefix]
B -- Local --> P[LocalPostgresController: reject version change in update_start]
P --> Q[Embedded pgvector, NoTls, TEMP table round-trip in e2e]
I --> R[Executor gates remote_binding_params on remote_access flag]
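The last node of the flowchart, the executor gating `remote_binding_params` on the `remote_access` flag, can be sketched as follows. This is a simplified illustration, not the code in `executor.rs`: the `ResourceState` struct and its fields are hypothetical stand-ins for whatever the executor actually holds.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of a resource as the executor sees it.
struct ResourceState {
    /// Whether the resource is meant to be reachable from outside its host.
    remote_access: bool,
    /// Binding params, which may contain inline secrets (e.g. a password).
    binding_params: BTreeMap<String, String>,
}

/// Returns the params that may be written to synced control-plane state.
/// Resources without remote access contribute nothing, so inline secrets
/// such as a Local Postgres password never leave the local machine.
fn remote_binding_params(resource: &ResourceState) -> Option<&BTreeMap<String, String>> {
    if resource.remote_access {
        Some(&resource.binding_params)
    } else {
        None
    }
}
```

A gate like this also explains the `secrets_vault.rs` change in the table above: once params are only synced for `remote_access: true` resources, the auto-created vault has to be marked that way for its locator to appear in synced state.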
Reviews (3): Last reviewed commit: "chore(core): regenerate stack schemas fo..."
Addressed the review feedback in 7b8b829 (and ran our internal review on the changes):
Four pre-existing Azure/GCP infrastructure bugs, surfaced while getting the Postgres cloud e2e green. They are independent of the Postgres feature but sit underneath it, so the runtime (#89) and setup (#90) stack on this PR. Splitting them out lets them land and be reviewed on their own.

## What was broken, and what I did

Four small, self-contained fixes:

- **GCP network import dropped the subnet name.** The importer built everything from the subnet but never recorded `subnetwork_name`, so VPC egress to a private PSC Cloud SQL had nothing to resolve against. It now parses the name out of the subnet self-link on import.
- **Azure build read a frozen import as fatal drift.** An imported (frozen) build arrives with its managed environment, identity, and `resource_prefix` unset; the controller treated those as drift and failed. The heartbeat now resolves all three from their dependencies, so an imported build can submit jobs without waiting for an update.
- **Azure worker pointed the DNS CNAME at the wrong host.** It targeted the public display FQDN, which can equal the record name and make the record point at itself; the provider rejects that as a loop and the worker hangs waiting for DNS. It now targets the Container App's own ingress host.
- **Azure Terraform left the Container Apps environment outside the VNet.** Added the VNet integration (and the matching network emitter) so the environment lands in the stack VNet.

## Files touched

- `crates/alien-infra/src/network/gcp_import.rs`: subnet name on import
- `crates/alien-infra/src/build/azure.rs`: heartbeat resolves env, identity, prefix
- `crates/alien-infra/src/worker/azure.rs` (+ `azure_import.rs`): CNAME targets the ingress host
- `crates/alien-terraform/src/emitters/azure/*`: Container Apps environment VNet integration
- `crates/alien-infra/tests/importers.rs` + the azure generator/snapshot tests: coverage

## How I tested

- `cargo test -p alien-infra` and the `alien-terraform` azure generator + snapshot tests pass.
- Validated end to end as part of the full Postgres cloud e2e on AWS, GCP, and Azure (the stack on top of this PR provisions, connects, and runs pgvector).

Base of the stacked Postgres work; #89 (runtime) and #90 (setup) build on it. Supersedes the infrastructure portion of the original combined PR.
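The GCP import fix, parsing the subnet name out of the subnet self-link, might look roughly like the sketch below. It assumes the usual Compute Engine self-link shape ending in `/subnetworks/<name>`; the function name `subnetwork_name_from_self_link` is hypothetical, and the real importer in `gcp_import.rs` may handle more input forms.

```rust
/// Extracts the subnetwork name from a self-link such as
/// `https://www.googleapis.com/compute/v1/projects/p/regions/r/subnetworks/name`.
/// Returns `None` when the link has no `/subnetworks/` segment or when that
/// segment is not followed by exactly one non-empty path component.
fn subnetwork_name_from_self_link(self_link: &str) -> Option<&str> {
    // Split on the last occurrence so earlier path segments cannot interfere.
    let (_, tail) = self_link.rsplit_once("/subnetworks/")?;
    if tail.is_empty() || tail.contains('/') {
        None
    } else {
        Some(tail)
    }
}
```

Returning `Option` rather than an empty string keeps the "name missing" case explicit, which is the failure mode the fix is about: an import that silently records no `subnetwork_name`.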
The Postgres resource runtime, stacked on the infra fixes (#88): the resource model and bindings, the Local (developer) controller, the cloud client SDKs for the managed Postgres backends, and the TypeScript SDK surface. The part worth careful review is how the generated DB password is handled.

## How the local DB password flows

The password is generated once and has to reach a linked worker for a direct connection, without ever landing in control-plane state. So it runs through two separate channels:

1. **It is stripped from the synced binding params and never written to serialized controller state**, so it cannot reach control-plane storage or status responses.
2. It is handed to the worker at runtime through the worker's environment, resolved per request, never persisted.

Step 1 is the property that matters, and review caught a real gap there: the password was reaching the synced channel. Fixed by stripping it in `get_binding_params`; the `#[serde(skip)]` on the field alone was not enough.

## What's in the layer

- Resource model + bindings (`alien-core`): the Postgres resource, its binding shapes, the heartbeat data.
- The Local controller (`alien-local` + `alien-infra/src/postgres/local.rs`): runs an embedded Postgres for local development.
- Cloud client SDKs (`alien-aws-clients` / `gcp` / `azure`): thin wrappers over the managed cloud Postgres APIs (Aurora, Cloud SQL, Flexible Server).
- SDK surface (`packages/core`, `packages/sdk`): the generated schemas and the TypeScript binding.

## How I tested

- `cargo test` across the touched crates (`alien-core`, `alien-bindings`, `alien-local`, `alien-infra`), including the binding round-trip and encoding-parity tests.
- The local embedded-Postgres integration test (`alien-local`).
- Exercised end to end in the full Postgres cloud e2e (the setup layer, #90, stacks on this).

Security walk for the password (this PR touches the secret):

- Synced and persisted state never carry the password: the round-trip test asserts it is absent from the serialized binding params.
- The runtime worker-env delivery is per request and not persisted.
- Errors from the secret path are redacted (request body scrubbed before it can reach an error chain).
- The one gap that existed (password on the synced channel) is the one this PR fixes. Nothing else turned up.
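The two-channel split described in this commit message can be sketched as below. Only the method name `get_binding_params` comes from the text above; the `LocalPostgres` struct, its fields, the key names, and the `worker_env` method are hypothetical simplifications for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified controller state for a local Postgres.
struct LocalPostgres {
    host: String,
    port: u16,
    database: String,
    /// Generated once; must never reach synced or persisted state.
    password: String,
}

impl LocalPostgres {
    /// Channel 1: params that may be synced. The password is left out here
    /// explicitly, rather than relying on a serialization attribute alone.
    fn get_binding_params(&self) -> BTreeMap<&'static str, String> {
        BTreeMap::from([
            ("host", self.host.clone()),
            ("port", self.port.to_string()),
            ("database", self.database.clone()),
        ])
    }

    /// Channel 2: values handed to the worker at runtime. Built on demand
    /// for each request and not stored anywhere.
    fn worker_env(&self) -> BTreeMap<&'static str, String> {
        let mut env = self.get_binding_params();
        env.insert("password", self.password.clone());
        env
    }
}
```

The point of building the synced map by hand is that a skip attribute on a struct field only protects serialization of that struct; any code path that copies the field into another map or type bypasses it, which matches the gap the review found.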
… networking Adds the postgres permission sets and preflight checks, the model's private-endpoint network field and heartbeat data, Azure private networking in the Terraform emitter, the registration data importers, and redaction of the cloud create-request bodies that carry the master password.
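One common way to keep a master password out of logs and error chains is a wrapper type whose `Debug` output is fixed. The sketch below shows that type-level technique; it is not the body scrubbing this PR implements in `request_utils.rs`, and the `Redacted` and `CreateServerRequest` names and fields are hypothetical.

```rust
use std::fmt;

/// Wrapper that hides its inner value from `Debug` formatting.
struct Redacted<T>(T);

impl<T> Redacted<T> {
    /// Explicit accessor, so every read of the secret is easy to find.
    fn expose(&self) -> &T {
        &self.0
    }
}

impl<T> fmt::Debug for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

/// Hypothetical create-request body carrying the master password.
#[derive(Debug)]
struct CreateServerRequest {
    administrator_login: String,
    administrator_login_password: Redacted<String>,
}
```

With this in place, formatting a `CreateServerRequest` with `{:?}` prints the login but shows `[REDACTED]` for the password, so attaching the request to an error for context does not leak the secret.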
- Scope the AWS DB subnet group ARN to the resource binding, matching the per-resource subnet group name, so the resource-scoped role can manage its subnet group lifecycle (previously only the stack binding carried it).
- Document that an imported Azure network reports a None PE subnet (import data does not carry it) and the Postgres controller fails fast on it.
- Document that the Azure provision role omits /read on the network resources by design (the controller confirms via the LRO header, not a GET).
biome check flagged pre-existing array formatting (single-line resource arrays, one multi-line permissions array) in data-access.jsonc and provision.jsonc. Mechanical biome check --write; no semantic change.
The generated stack schemas now carry the Azure private-endpoint subnet field from the Postgres setup layer, alongside main's compute settings that a stale conflict resolution would otherwise have dropped.
Postgres setup, stacked on the runtime (#89): the cloud permission sets, the preflights that prepare each cloud, Azure private networking, the secret redaction on cloud create requests, and the end-to-end coverage that exercises Postgres on all three clouds.
What happens when a stack with a Postgres provisions:
`pgvector` query against the real cloud database.

This PR adds the setup half of the resource: it turns the runtime model and bindings (#89) into something that can actually be provisioned, secured, and torn down on a real cloud account.
What's in the layer
- `alien-permissions/permission-sets/postgres/*`: `data-access` (the worker's scoped read of the connection secret), `provision`, `management`, `heartbeat`. The management / provision / heartbeat sets are management-plane only and never grant access to the DB contents.
- `alien-preflights`: enable the required cloud services, provision the secrets vault, and assert the network prerequisites before provisioning starts.
- `alien-infra/src/network/azure*`: the dedicated Private Endpoint subnet and the importer that records it.
- `alien-client-core/src/request_utils.rs`: the master password rides the cloud create-request body; it is scrubbed before that body can land in an error chain or synced state.
- `alien-test`, `tests/e2e/test-apps/*`: the comprehensive Rust and TypeScript apps now bind a Postgres and run a query.

How I tested

- `cargo test` across `alien-permissions` (incl. the AWS ABAC + permission-set validation tests), `alien-preflights`, `alien-infra`, `alien-terraform` (azure generator/snapshots), and `alien-test`: all pass.
- `CREATE EXTENSION vector` + a `pgvector` query succeeds → teardown. Run from the comprehensive Rust and TypeScript test apps.

Security walk (this PR touches permissions and the secret):