feat(postgres): setup permissions, private networking, and e2e#90

Merged: alongubkin merged 9 commits into main from feat/alien-35-oss-3-setup on Jul 3, 2026

@ItamarZand88 ItamarZand88 commented Jun 26, 2026

Contributor

Postgres setup, stacked on the runtime (#89): the cloud permission sets, the preflights that prepare each cloud, Azure private networking, the secret redaction on cloud create requests, and the end-to-end coverage that exercises Postgres on all three clouds.

What happens when a stack with a Postgres provisions:

  1. Preflights enable the required cloud services (Azure / GCP) and provision the secrets vault for the connection password.
  2. The permission sets grant the management identity a prefix-scoped set of management-plane actions. This is the core guarantee: management-plane only, never access to the database contents.
  3. On Azure, private networking creates the dedicated Private Endpoint subnet (and the importer records it) so the database is reachable only from inside the stack.
  4. The end-to-end apps then provision, connect, and run a pgvector query against the real cloud database.

This PR adds the setup half of the resource: it turns the runtime model and bindings (#89) into something that can actually be provisioned, secured, and torn down on a real cloud account.

What's in the layer

  • Permission sets (alien-permissions/permission-sets/postgres/*) — data-access (the worker's scoped read of the connection secret), provision, management, heartbeat. The management / provision / heartbeat sets are management-plane only and never grant access to the DB contents.
  • Preflights (alien-preflights) — enable the required cloud services, provision the secrets vault, and assert the network prerequisites before provisioning starts.
  • Azure private networking (alien-infra/src/network/azure*) — the dedicated Private Endpoint subnet and the importer that records it.
  • Cloud create-request redaction (alien-client-core/src/request_utils.rs) — the master password rides the cloud create-request body; it is scrubbed before that body can land in an error chain or synced state.
  • End-to-end coverage (alien-test, tests/e2e/test-apps/*) — the comprehensive Rust and TypeScript apps now bind a Postgres and run a query.
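
The redaction bullet above can be illustrated with a minimal sketch. It is not the code in request_utils.rs: the struct, its field names, and the `error_context` helper are hypothetical, and it uses a hand-written `Debug` impl as one common way to keep a secret out of formatted error text.

```rust
use std::fmt;

/// Hypothetical create-request body; the field names are illustrative only.
pub struct CreateDatabaseRequest {
    pub server_name: String,
    pub admin_login: String,
    pub admin_password: String,
}

// Manual Debug impl: any `{:?}` formatting (for example inside an error
// message) prints a placeholder instead of the real password.
impl fmt::Debug for CreateDatabaseRequest {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("CreateDatabaseRequest")
            .field("server_name", &self.server_name)
            .field("admin_login", &self.admin_login)
            .field("admin_password", &"[REDACTED]")
            .finish()
    }
}

/// Builds the context string that would be attached to an error (hypothetical helper).
pub fn error_context(req: &CreateDatabaseRequest) -> String {
    format!("create request failed: {:?}", req)
}
```

Because the placeholder is produced by the type itself, every caller that formats the request for logging or error reporting gets the scrubbed form without having to remember to redact.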

How I tested

  • cargo test across alien-permissions (incl. the AWS ABAC + permission-set validation tests), alien-preflights, alien-infra, alien-terraform (azure generator/snapshots), and alien-test — all pass.
  • End to end on AWS, GCP, and Azure: provision the database → worker connects → CREATE EXTENSION vector + a pgvector query succeeds → teardown. Run from the comprehensive Rust and TypeScript test apps.

Security walk (this PR touches permissions and the secret):

  • The management-plane permission sets grant no DB-contents access — the sensitive-data validation tests assert this.
  • The master password on a cloud create request is redacted before it can reach a persisted error or synced state.
  • Permission scopes are pinned to the stack / resource prefix, never broad cloud admin.
  • Nothing turned up.
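
As a rough illustration of what "pinned to the stack / resource prefix" means, the sketch below expands a pattern of the form `subgrp:${stackPrefix}-${resourceName}-*` and checks a concrete name against it. The helper names are hypothetical, and the matcher only understands a single trailing `*`, which is narrower than real cloud policy wildcards.

```rust
/// Expands the `${stackPrefix}` and `${resourceName}` placeholders in a pattern.
pub fn expand(pattern: &str, stack_prefix: &str, resource_name: &str) -> String {
    pattern
        .replace("${stackPrefix}", stack_prefix)
        .replace("${resourceName}", resource_name)
}

/// Returns true when `name` is covered by `pattern`.
/// Simplification: only one trailing `*` wildcard is supported.
pub fn covers(pattern: &str, name: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => pattern == name,
    }
}
```

With a stack prefix of `acme` and a resource named `db`, the expanded pattern covers `subgrp:acme-db-subnets` but not a subnet group that belongs to another stack.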

@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-2-foundation branch from d54094b to d0ad57a Compare June 26, 2026 19:58
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-3-setup branch 2 times, most recently from 3717d5d to d919d77 Compare June 26, 2026 22:47
@ItamarZand88 ItamarZand88 marked this pull request as ready for review June 26, 2026 22:48
greptile-apps Bot commented Jun 26, 2026

Greptile Summary

This PR wires the Postgres resource into the full provisioning lifecycle: cloud permission sets, preflights (service activation + secrets vault), Azure Private Endpoint subnet provisioning, master-password redaction on cloud create requests, and end-to-end test coverage across the Rust and TypeScript test apps.

  • Permission sets (data-access, heartbeat, management, provision) are added for all three clouds, scoped to stack/resource prefixes and management-plane only; data-access for Azure is intentionally empty because the shared secrets vault already covers secret reads there.
  • Azure network gains a dedicated PE-subnet state machine (CreatingPrivateEndpointSubnet / WaitingForPrivateEndpointSubnet), a CIDR at index 3 (non-overlapping with public/private/appgw), and a fail-fast guard in create_start for BYO-VNet stacks that declare a Postgres without a privateEndpointSubnetName.
  • Security gate in executor now suppresses remote_binding_params sync for remote_access: false resources, preventing inline secrets (e.g. Local Postgres passwords) from reaching persisted control-plane state; the secrets vault preflight is updated to remote_access: true so its locator reference continues to be synced as required.
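
The executor gate described in the last bullet can be sketched as follows. The `ResourceState` type and `params_to_sync` function are hypothetical simplifications; only the `remote_access` flag and the `remote_binding_params` name come from the summary above.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of a resource as the executor might see it.
pub struct ResourceState {
    pub remote_access: bool,
    pub remote_binding_params: Option<BTreeMap<String, String>>,
}

/// Returns the binding params that may be synced to control-plane state.
/// Resources without remote access sync nothing, so inline secrets stay local.
pub fn params_to_sync(resource: &ResourceState) -> Option<&BTreeMap<String, String>> {
    if resource.remote_access {
        resource.remote_binding_params.as_ref()
    } else {
        None
    }
}
```

Under this model a Local Postgres with `remote_access: false` contributes nothing to synced state, while a vault marked `remote_access: true` still exposes its locator.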

Confidence Score: 5/5

Safe to merge; the changes are well-scoped, thoroughly tested, and the security-sensitive paths (secret redaction, binding-params gate, permission scopes) are each covered by targeted unit tests.

All three cloud permission paths are action-scoped to stack/resource prefixes and explicitly validated by the existing permission-set validation tests. The executor change that prevents inline secrets from reaching synced state has a dedicated binding-sync test. The ARM storageSizeGB key fix is pinned by both a serialization and deserialization test. The only non-blocking finding is that the Azure heartbeat stack-level binding uses the built-in Reader role at RG scope rather than explicit action-level grants, which is broader than necessary but read-only and scoped to a single resource group.
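
To see why the ARM `storageSizeGB` key needed an explicit rename, here is a small standalone model of a camelCase conversion. It does not use serde; the Rust field name `storage_size_gb` is an assumption, and the function only mimics the usual rule of upper-casing the first letter after each underscore.

```rust
/// Converts a snake_case field name to camelCase by upper-casing the first
/// letter after each underscore (a simplified model of a camelCase rename).
pub fn snake_to_camel(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    let mut upper_next = false;
    for ch in field.chars() {
        if ch == '_' {
            // Drop the underscore and capitalize whatever follows it.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}
```

Only the `g` is capitalized, so the derived key ends in `Gb` rather than the `GB` that ARM uses, which is why a per-field rename is required.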

The four postgres permission-set files under alien-permissions/permission-sets/postgres/ deserve the closest look, heartbeat.jsonc in particular: its Azure stack binding grant is broader than the equivalent AWS/GCP entries.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/alien-infra/src/network/azure.rs | Adds CreatingPrivateEndpointSubnet / WaitingForPrivateEndpointSubnet states plus BYO-VNet validation; PE subnet uses /24 index-3 CIDR (non-overlapping, tested); fail-fast guard in create_start if stack has Postgres but no PE subnet name on BYO-VNet. |
| crates/alien-infra/src/core/executor.rs | Gates remote_binding_params sync on the resource's remote_access flag, preventing Local Postgres passwords (and other inline-secret bindings) from reaching persisted control-plane state; test coverage in binding_sync_tests.rs confirms both sides. |
| crates/alien-permissions/permission-sets/postgres/provision.jsonc | Comprehensive provision permissions across AWS/GCP/Azure; previously-flagged issues (subnet group ARN gap, missing Azure PE/DNS read actions) are noted in prior threads. |
| crates/alien-permissions/permission-sets/postgres/heartbeat.jsonc | AWS and GCP heartbeat grants are action-scoped; Azure stack binding uses the broad built-in Reader role at RG scope rather than explicit action-level grants, inconsistent with the management set and other cloud heartbeat entries. |
| crates/alien-azure-clients/src/azure/flexible_server.rs | Fixes storageSizeGB serde rename: ARM uses capital "GB" while rename_all = camelCase would emit storageSizeGb, silently breaking GET response deserialization; pinned with both a serialization and a deserialization unit test. |
| crates/alien-preflights/src/mutations/secrets_vault.rs | Changes auto-created secrets vault from remote_access: false to true; needed so sync_secrets_to_vault can resolve the vault locator from synced state after the executor starts gating remote_binding_params on remote_access. |
| crates/alien-infra/src/postgres/local.rs | Adds version field to controller state; update_start now rejects in-place major-version changes with a clear error rather than a silent no-op, with unit tests for both the rejection and the cpu/memory no-op path. |
| crates/alien-deploy-cli/src/commands/up.rs | Two "warn-and-continue" paths in push_initial_setup are converted to fail-fast; release fetch uses ConfigurationError, environment-info collection uses retryable DeploymentFailed. Also adds private_endpoint_subnet_name: None for CLI-YAML BYO-VNet, which will correctly trigger the controller's validation error if a Postgres is present. |
| crates/alien-permissions/permission-sets/postgres/data-access.jsonc | Grants only secretsmanager:GetSecretValue (AWS) and secretmanager.versions.access + secretmanager.secrets.get (GCP) with stack/resource prefix conditions; Azure is intentionally empty (shared vault already covers it) with the design rationale documented. |
| crates/alien-aws-clients/src/aws/rds.rs | Extracts modify_db_cluster_form helper, adds Serverless v2 scaling config (ACU ceiling resize) with unit tests; ServerlessV2ScalingConfiguration deserializes from DescribeDBClusters XML for day-2 memory-change detection. |
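
The "/24 index-3 CIDR" noted for azure.rs can be pictured with a short sketch. It assumes a /16 VNet and that indices 0 to 2 hold the public, private, and Application Gateway subnets; both are assumptions for illustration, and the function names are hypothetical.

```rust
use std::net::Ipv4Addr;

/// Returns the base address of the `index`-th /24 inside a VNet.
/// Assumption: the VNet is a /16, so the index fills the third octet.
pub fn subnet_24(vnet_base: Ipv4Addr, index: u8) -> Ipv4Addr {
    let base = u32::from(vnet_base) & 0xFFFF_0000; // keep the /16 network bits
    Ipv4Addr::from(base | (u32::from(index) << 8))
}

/// Two addresses fall in the same /24 exactly when their upper 24 bits match.
pub fn overlaps_24(a: Ipv4Addr, b: Ipv4Addr) -> bool {
    (u32::from(a) >> 8) == (u32::from(b) >> 8)
}
```

For a VNet based at 10.0.0.0, index 3 yields 10.0.3.0/24, which shares no addresses with the /24 blocks at indices 0, 1, and 2.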

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Stack with Postgres] --> B{Platform?}

    B -- Azure --> C[AzureNetworkController create_start]
    C --> D{BYO-VNet?}
    D -- Yes, no PE subnet --> E[❌ Fail fast: set privateEndpointSubnetName]
    D -- Yes, PE subnet named --> F[Resolve customer PE subnet]
    D -- No, managed --> G[CreatingVnet]
    G --> G1[CreatingPublicSubnet]
    G1 --> G2[CreatingPrivateSubnet]
    G2 --> G3[CreatingApplicationGatewaySubnet]
    G3 --> H[CreatingPrivateEndpointSubnet NEW]
    H --> I[CreatingPublicIp → NAT → NSG → Running]
    F --> I

    B -- AWS --> J[Preflights: SecretsVaultMutation]
    J --> K[Aurora Serverless v2 cluster]
    K --> L[data-access: secretsmanager:GetSecretValue scoped to prefix]

    B -- GCP --> M[Preflights: sqladmin + compute + secretmanager APIs]
    M --> N[Cloud SQL + PSC endpoint]
    N --> O[data-access: secretmanager.versions.access scoped to prefix]

    B -- Local --> P[LocalPostgresController: reject version change in update_start]
    P --> Q[Embedded pgvector, NoTls, TEMP table round-trip in e2e]

    I --> R[Executor gates remote_binding_params on remote_access flag]

Reviews (3): Last reviewed commit: "chore(core): regenerate stack schemas fo..."

Comment thread crates/alien-permissions/permission-sets/postgres/provision.jsonc
Comment thread crates/alien-infra/src/network/azure_import.rs Outdated
Comment thread crates/alien-permissions/permission-sets/postgres/provision.jsonc
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-3-setup branch from d919d77 to fab6e64 Compare June 26, 2026 23:03
@ItamarZand88

Contributor Author

Addressed the review feedback in 7b8b829 (and ran our internal review on the changes):

  1. AWS resource binding subgrp ARN: added subgrp:${stackPrefix}-${resourceName}-* to the resource binding, matching the per-resource subnet group name ({prefix}-{id}-subnets); the resource-scoped role can now manage its subnet group lifecycle.
  2. Imported network None PE subnet: this is not silent. The Postgres controller reads it with .ok_or_else and fails fast with an actionable error (set privateEndpointSubnetName, or let Alien manage the VNet). Clarified the importer comment; carrying the configured PE subnet through import is a follow-up.
  3. Azure provision missing read: documented the intent. The /read omission is deliberate (the controller confirms each PE/DNS-zone op via the Azure-AsyncOperation LRO header, not a GET), kept minimal per least-privilege.
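
The fail-fast read in item 2 follows a familiar `Option` to `Result` pattern. The sketch below is a guess at its shape only: `NetworkState` and `require_pe_subnet` are hypothetical names, and the error text paraphrases the remedy quoted above.

```rust
/// Hypothetical network state; `None` models an imported network that does
/// not carry the Private Endpoint subnet.
pub struct NetworkState {
    pub private_endpoint_subnet_name: Option<String>,
}

/// Returns the PE subnet name, or an actionable error when it is missing.
pub fn require_pe_subnet(network: &NetworkState) -> Result<&str, String> {
    network
        .private_endpoint_subnet_name
        .as_deref()
        .ok_or_else(|| {
            "Postgres needs a Private Endpoint subnet: set privateEndpointSubnetName, \
             or let Alien manage the VNet"
                .to_string()
        })
}
```

Turning the missing value into an error at the point of use means the failure surfaces before any cloud call is made, and the message tells the operator what to change.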

@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-3-setup branch from cb62fa8 to 6bae5f4 Compare June 28, 2026 17:17
alongubkin pushed a commit that referenced this pull request Jul 1, 2026
Four pre-existing Azure/GCP infrastructure bugs, surfaced while getting
the Postgres cloud e2e green. They are independent of the Postgres
feature but sit underneath it, so the runtime (#89) and setup (#90)
stack on this PR. Splitting them out lets them land and be reviewed on
their own.

## What was broken, and what I did
Four small, self-contained fixes:

- **GCP network import dropped the subnet name.** The importer built
everything from the subnet but never recorded `subnetwork_name`, so VPC
egress to a private PSC Cloud SQL had nothing to resolve against. It now
parses the name out of the subnet self-link on import.
- **Azure build read a frozen import as fatal drift.** An imported
(frozen) build arrives with its managed environment, identity, and
`resource_prefix` unset; the controller treated those as drift and
failed. The heartbeat now resolves all three from their dependencies, so
an imported build can submit jobs without waiting for an update.
- **Azure worker pointed the DNS CNAME at the wrong host.** It targeted
the public display FQDN, which can equal the record name and make the
record point at itself; the provider rejects that as a loop and the
worker hangs waiting for DNS. It now targets the Container App's own
ingress host.
- **Azure Terraform left the Container Apps environment outside the
VNet.** Added the VNet integration (and the matching network emitter) so
the environment lands in the stack VNet.
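
A minimal sketch of the first fix, parsing the subnet name out of a subnet self-link. The function name is hypothetical, and the sketch assumes the usual self-link shape in which the name is the path segment after `/subnetworks/`.

```rust
/// Extracts the subnet name from a subnetwork self-link, i.e. the path
/// segment that follows `/subnetworks/`.
/// Returns None when that segment is absent or empty.
pub fn subnet_name_from_self_link(self_link: &str) -> Option<&str> {
    let (_, rest) = self_link.split_once("/subnetworks/")?;
    // Ignore anything after the name, such as a trailing slash.
    let name = rest.split('/').next()?;
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}
```

Returning `Option` rather than an empty string keeps a malformed self-link from silently producing a blank `subnetwork_name`.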

## Files touched
- `crates/alien-infra/src/network/gcp_import.rs` — subnet name on import
- `crates/alien-infra/src/build/azure.rs` — heartbeat resolves env,
identity, prefix
- `crates/alien-infra/src/worker/azure.rs` (+ `azure_import.rs`) — CNAME
targets the ingress host
- `crates/alien-terraform/src/emitters/azure/*` — Container Apps
environment VNet integration
- `crates/alien-infra/tests/importers.rs` + the azure generator/snapshot
tests — coverage

## How I tested
- `cargo test -p alien-infra` and the `alien-terraform` azure generator
+ snapshot tests pass.
- Validated end to end as part of the full Postgres cloud e2e on AWS,
GCP, and Azure (the stack on top of this PR provisions, connects, and
runs pgvector).

Base of the stacked Postgres work; #89 (runtime) and #90 (setup) build
on it. Supersedes the infrastructure portion of the original combined
PR.
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-2-foundation branch from a13f0ea to 8d624b0 Compare July 1, 2026 10:36
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-3-setup branch 4 times, most recently from 27b116a to e483726 Compare July 2, 2026 11:15
alongubkin pushed a commit that referenced this pull request Jul 2, 2026
The Postgres resource runtime, stacked on the infra fixes (#88): the
resource model and bindings, the Local (developer) controller, the cloud
client SDKs for the managed Postgres backends, and the TypeScript SDK
surface. The part worth careful review is how the generated DB password
is handled.

## How the local DB password flows
The password is generated once and has to reach a linked worker for a
direct connection, without ever landing in control-plane state. So it
runs through two separate channels:

1. **It is stripped from the synced binding params and never written to
serialized controller state** — so it cannot reach control-plane storage
or status responses.
2. It is handed to the worker at runtime through the worker's
environment, resolved per request, never persisted.

Step 1 is the property that matters, and review caught a real gap there:
the password was reaching the synced channel. Fixed by stripping it in
`get_binding_params` — the `#[serde(skip)]` on the field alone was not
enough.
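
The gap is easier to see in a sketch. A field-level skip only affects how the state struct itself is serialized; binding params built by hand need their own explicit omission. The types below are hypothetical and serde is left out so the example stays self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical controller state for a local Postgres.
pub struct LocalPostgresState {
    pub host: String,
    pub port: u16,
    /// Kept in memory only; never part of the synced params.
    pub password: String,
}

impl LocalPostgresState {
    /// Builds the params that are allowed onto the synced channel.
    /// The password is deliberately not copied into the map.
    pub fn get_binding_params(&self) -> BTreeMap<String, String> {
        let mut params = BTreeMap::new();
        params.insert("host".to_string(), self.host.clone());
        params.insert("port".to_string(), self.port.to_string());
        params
    }
}
```

A round-trip test in the spirit of the one mentioned in the security walk would then assert that neither the key nor the secret value appears in the returned map.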

## What's in the layer
- Resource model + bindings (`alien-core`) — the Postgres resource, its
binding shapes, the heartbeat data.
- The Local controller (`alien-local` +
`alien-infra/src/postgres/local.rs`) — runs an embedded Postgres for
local development.
- Cloud client SDKs (`alien-aws-clients` / `gcp` / `azure`) — thin
wrappers over the managed cloud Postgres APIs (Aurora, Cloud SQL,
Flexible Server).
- SDK surface (`packages/core`, `packages/sdk`) — the generated schemas
and the TypeScript binding.

## How I tested
- `cargo test` across the touched crates (`alien-core`,
`alien-bindings`, `alien-local`, `alien-infra`), including the binding
round-trip and encoding-parity tests.
- The local embedded-Postgres integration test (`alien-local`).
- Exercised end to end in the full Postgres cloud e2e (the setup layer,
#90, stacks on this).

Security walk for the password (this PR touches the secret):
- Synced and persisted state never carry the password — the round-trip
test asserts it is absent from the serialized binding params.
- The runtime worker-env delivery is per request and not persisted.
- Errors from the secret path are redacted (request body scrubbed before
it can reach an error chain).
- The one gap that existed (password on the synced channel) is the one
this PR fixes. Nothing else turned up.
Base automatically changed from feat/alien-35-oss-2-foundation to main July 2, 2026 18:10
… networking

Adds the postgres permission sets and preflight checks, the model's private-endpoint
network field and heartbeat data, Azure private networking in the Terraform emitter, the
registration data importers, and redaction of the cloud create-request bodies that carry
the master password.
- Scope the AWS DB subnet group ARN to the resource binding, matching the per-resource subnet group name, so the resource-scoped role can manage its subnet group lifecycle (previously only the stack binding carried it).
- Document that an imported Azure network reports a None PE subnet (import data does not carry it) and the Postgres controller fails fast on it.
- Document that the Azure provision role omits /read on the network resources by design (the controller confirms via the LRO header, not a GET).

biome check flagged pre-existing array formatting (single-line resource arrays, one multi-line permissions array) in data-access.jsonc and provision.jsonc. Mechanical biome check --write; no semantic change.

The generated stack schemas now carry the Azure private-endpoint subnet
field from the Postgres setup layer, alongside main's compute settings
that a stale conflict resolution would otherwise have dropped.
@ItamarZand88 ItamarZand88 force-pushed the feat/alien-35-oss-3-setup branch from e483726 to dfc7c2d Compare July 2, 2026 19:56
@alongubkin alongubkin merged commit 42a1fbd into main Jul 3, 2026
15 checks passed
@alongubkin alongubkin deleted the feat/alien-35-oss-3-setup branch July 3, 2026 12:21
2 participants