fix: Azure/GCP infrastructure fixes (Postgres groundwork) #88

Merged
alongubkin merged 5 commits into main from feat/alien-35-oss-1-fixes
Jul 1, 2026

@ItamarZand88 ItamarZand88 commented Jun 26, 2026

Contributor

Four pre-existing Azure/GCP infrastructure bugs, surfaced while getting the Postgres cloud e2e green. They are independent of the Postgres feature but sit underneath it, so the runtime (#89) and setup (#90) stack on this PR. Splitting them out lets them land and be reviewed on their own.

What was broken, and what I did

Four small, self-contained fixes:

  • GCP network import dropped the subnet name. The importer built everything from the subnet but never recorded subnetwork_name, so VPC egress to a private PSC Cloud SQL had nothing to resolve against. It now parses the name out of the subnet self-link on import.
  • Azure build read a frozen import as fatal drift. An imported (frozen) build arrives with its managed environment, identity, and resource_prefix unset; the controller treated those as drift and failed. The heartbeat now resolves all three from their dependencies, so an imported build can submit jobs without waiting for an update.
  • Azure worker pointed the DNS CNAME at the wrong host. It targeted the public display FQDN, which can equal the record name and make the record point at itself; the provider rejects that as a loop and the worker hangs waiting for DNS. It now targets the Container App's own ingress host.
  • Azure Terraform left the Container Apps environment outside the VNet. Added the VNet integration (and the matching network emitter) so the environment lands in the stack VNet.
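The first fix above (recovering the subnet name from the self-link) can be sketched as follows. This is a minimal illustration, not the importer's actual code: the function name `subnetwork_name_from_self_link` is hypothetical, and returning `None` for an empty final segment is an assumption about how the guard behaves.

```rust
/// Hypothetical helper: returns the last path segment of a GCP subnet
/// self-link, or `None` when the link is empty or ends in a trailing slash.
fn subnetwork_name_from_self_link(self_link: &str) -> Option<String> {
    self_link
        .rsplit('/')                       // iterate segments from the right
        .next()                            // the final segment is the subnet name
        .filter(|name| !name.is_empty())   // guard against a trailing slash
        .map(str::to_owned)
}
```

A self-link ending in `/subnetworks/my-subnet` yields `my-subnet`; a link with a trailing slash yields `None` rather than an empty name.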

Files touched

  • crates/alien-infra/src/network/gcp_import.rs — subnet name on import
  • crates/alien-infra/src/build/azure.rs — heartbeat resolves env, identity, prefix
  • crates/alien-infra/src/worker/azure.rs (+ azure_import.rs) — CNAME targets the ingress host
  • crates/alien-terraform/src/emitters/azure/* — Container Apps environment VNet integration
  • crates/alien-infra/tests/importers.rs + the azure generator/snapshot tests — coverage

How I tested

  • cargo test -p alien-infra and the alien-terraform azure generator + snapshot tests pass.
  • Validated end to end as part of the full Postgres cloud e2e on AWS, GCP, and Azure (the stack on top of this PR provisions, connects, and runs pgvector).

Base of the stacked Postgres work; #89 (runtime) and #90 (setup) build on it. Supersedes the infrastructure portion of the original combined PR.

@ItamarZand88 ItamarZand88 marked this pull request as ready for review June 26, 2026 16:46

greptile-apps Bot commented Jun 26, 2026

Greptile Summary

This PR fixes several Azure and GCP infrastructure issues that were blocking the upcoming Postgres work. All changes are targeted bug fixes with accompanying regression tests.

  • GCP: The network importer now extracts subnetwork_name from the subnet self-link so imported workers can configure Direct VPC egress; without this, Cloud Run services couldn't reach private Cloud SQL PSC endpoints.
  • Azure Build: The heartbeat handler resolves managed_environment_id, managed_identity_id, and resource_prefix for imported (Frozen) builds instead of raising a non-retryable RESOURCE_DRIFT error — also fixes get_binding_params returning None for imported builds.
  • Azure Worker: A new container_app_url field captures the Container App's own ingress host so DNS CNAME records target the correct host even when url is overridden to a public FQDN.
  • Azure Terraform: The Container Apps environment is now VNet-integrated when a Network resource is present, and the private subnet is delegated to Microsoft.App/environments.
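The Azure Worker change can be pictured roughly as below: prefer the Container App's own ingress host for the CNAME target and fall back to `url` only when it is unknown. The struct and helper names (`WorkerUrls`, `host_of`, `cname_target`, `is_self_referential`) are hypothetical stand-ins; only the `url` and `container_app_url` field names come from the summary.

```rust
/// Hypothetical stand-in for the two worker state fields named in the summary.
struct WorkerUrls {
    /// Public-facing URL; may be overridden to a public FQDN.
    url: Option<String>,
    /// The Container App's own ingress URL.
    container_app_url: Option<String>,
}

/// Strips an optional scheme and any path, leaving only the host.
fn host_of(url: &str) -> &str {
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    without_scheme.split('/').next().unwrap_or(without_scheme)
}

/// Picks the CNAME target: the ingress host when known, else the public URL.
fn cname_target(urls: &WorkerUrls) -> Option<&str> {
    urls.container_app_url
        .as_deref()
        .or(urls.url.as_deref())
        .map(host_of)
}

/// True when a CNAME for `record_name` would point at itself
/// (ignoring case and a trailing root dot).
fn is_self_referential(record_name: &str, target: &str) -> bool {
    record_name
        .trim_end_matches('.')
        .eq_ignore_ascii_case(target.trim_end_matches('.'))
}
```

With both fields set, the target is the ingress host, so a record named after the public FQDN no longer points at itself.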

Confidence Score: 5/5

All four fixes are self-contained and well-tested; no change widens an attack surface or modifies shared state in a way that could affect unrelated resources.

Each fix addresses a clearly described regression with a targeted change and a companion regression test. The GCP subnet name extraction, the Azure build heartbeat resolution, the Azure worker CNAME split, and the Terraform VNet wiring all follow existing patterns in the codebase and the snapshot diffs confirm the expected Terraform output.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/alien-infra/src/build/azure.rs | Heartbeat now lazily resolves resource_prefix, managed_environment_id, and managed_identity_id for imported builds instead of raising a non-retryable RESOURCE_DRIFT error; regression test verifies binding params are non-None after the first tick. |
| crates/alien-infra/src/network/gcp_import.rs | Extracts subnetwork_name from the subnet self-link via rsplit('/'); empty-string guard handles trailing slashes; mirrors the Azure importer's pattern. |
| crates/alien-infra/src/worker/azure.rs | Adds container_app_url to store the raw Container App ingress host separately from the potentially-overridden public url; build_outputs now targets container_app_url with fallback to url, and the field is set at create, update, and heartbeat. |
| crates/alien-infra/src/worker/azure_import.rs | Initialises container_app_url: None in the importer; the heartbeat rebuilds it on the first tick. |
| crates/alien-terraform/src/emitters/azure/container_apps_environment.rs | Adds VNet integration (infrastructure_subnet_id, internal_load_balancer_enabled=false) when a Network resource is present; infrastructure_subnet_id correctly handles both create/use-default (managed resource) and ByoVnetAzure (data source) modes. |
| crates/alien-terraform/src/emitters/azure/network.rs | Adds Microsoft.App/environments delegation block to the private subnet in create_topology (UseDefault and Create modes); BYO mode expects the pre-existing subnet to already be delegated. |
| crates/alien-infra/tests/importers.rs | New gcp_network_import_derives_subnetwork_name test verifies the importer reconstructs subnetworkName from the self-link URL. |
| crates/alien-terraform/tests/generator/azure_full_stack_tests.rs | Adds whitespace-normalised literal assertions for VNet wiring (infrastructure_subnet_id, internal_load_balancer_enabled, delegation) on top of the existing snapshot. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph GCP["GCP Network Import Fix"]
        GI["GcpNetworkImporter"] -->|"subnet_self_links[0]"| SL["rsplit('/').next()"]
        SL -->|"subnetwork_name"| GNC["GcpNetworkController\n(subnetwork_name set)"]
        GNC --> VPC["get_vpc_access()\nDirect VPC Egress ✓"]
    end

    subgraph AzureBuild["Azure Build Heartbeat Fix"]
        IB["Imported Build\nReady state\n(resource_prefix=None)"] --> HB["ready() heartbeat"]
        HB -->|"is_none()"| RP["resource_prefix resolved\nfrom ctx"]
        HB -->|"is_none()"| ME["managed_environment_id\nresolved"]
        HB -->|"is_none()"| MI["managed_identity_id\nresolved"]
        RP --> BP["get_binding_params()\nreturns Some(…) ✓"]
        ME --> BP
        MI --> BP
    end

    subgraph AzureWorker["Azure Worker DNS Fix"]
        CAP["Container App\ncreate/update"] -->|"extract_url"| CAU["container_app_url\n(ingress host)"]
        PU["public_urls override"] --> URL["url\n(public FQDN)"]
        CAU --> BO["build_outputs()"]
        URL -->|"fallback only"| BO
        BO --> LB["LoadBalancerEndpoint\ndns_name = container_app_url ✓"]
    end

    subgraph TF["Azure Terraform VNet Wiring"]
        NS["azurerm_subnet\n(private)"] -->|"delegation"| DEL["Microsoft.App/environments"]
        CAE["azurerm_container_app_environment"] -->|"infrastructure_subnet_id"| NS
    end

Reviews (2): Last reviewed commit: "fix(azure-build): resolve resource_prefi..."

Comment thread crates/alien-infra/src/build/azure.rs
…uilds

Heartbeat resolved managed_environment_id and managed_identity_id for imported (Frozen) builds but left resource_prefix None, so get_binding_params returned None and the build could not submit jobs until an update ran. Resolve it alongside the others; the test now asserts binding params become non-None.
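
A rough sketch of the behaviour described in this comment, under stated assumptions: `BuildState`, `Resolved`, and `BindingParams` are hypothetical types, and only the three field names and `get_binding_params` come from the PR. The heartbeat fills whichever fields are still `None`, and binding params become available only once all three are set.

```rust
/// Hypothetical subset of the build controller state.
#[derive(Debug, Default, Clone, PartialEq)]
struct BuildState {
    resource_prefix: Option<String>,
    managed_environment_id: Option<String>,
    managed_identity_id: Option<String>,
}

/// Hypothetical values the heartbeat can look up from its dependencies.
struct Resolved {
    resource_prefix: String,
    managed_environment_id: String,
    managed_identity_id: String,
}

/// Hypothetical binding params; present only when every field is known.
#[derive(Debug, PartialEq)]
struct BindingParams {
    resource_prefix: String,
    managed_environment_id: String,
    managed_identity_id: String,
}

impl BuildState {
    /// Fills only the fields that are still unset; existing values are kept.
    fn heartbeat(&mut self, deps: &Resolved) {
        if self.resource_prefix.is_none() {
            self.resource_prefix = Some(deps.resource_prefix.clone());
        }
        if self.managed_environment_id.is_none() {
            self.managed_environment_id = Some(deps.managed_environment_id.clone());
        }
        if self.managed_identity_id.is_none() {
            self.managed_identity_id = Some(deps.managed_identity_id.clone());
        }
    }

    /// Returns `None` as soon as any of the three fields is missing.
    fn get_binding_params(&self) -> Option<BindingParams> {
        Some(BindingParams {
            resource_prefix: self.resource_prefix.clone()?,
            managed_environment_id: self.managed_environment_id.clone()?,
            managed_identity_id: self.managed_identity_id.clone()?,
        })
    }
}
```

In this sketch, leaving any one field unresolved (as happened with resource_prefix) keeps `get_binding_params` at `None`, which is the symptom the regression test guards against.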
@alongubkin alongubkin merged commit 5ae268c into main Jul 1, 2026
14 checks passed
@alongubkin alongubkin deleted the feat/alien-35-oss-1-fixes branch July 1, 2026 09:57
alongubkin pushed a commit that referenced this pull request Jul 2, 2026
The Postgres resource runtime, stacked on the infra fixes (#88): the
resource model and bindings, the Local (developer) controller, the cloud
client SDKs for the managed Postgres backends, and the TypeScript SDK
surface. The part worth careful review is how the generated DB password
is handled.

## How the local DB password flows
The password is generated once and has to reach a linked worker for a
direct connection, without ever landing in control-plane state. So it
runs through two separate channels:

1. **It is stripped from the synced binding params and never written to
serialized controller state** — so it cannot reach control-plane storage
or status responses.
2. It is handed to the worker at runtime through the worker's
environment, resolved per request, never persisted.

Step 1 is the property that matters, and review caught a real gap there:
the password was reaching the synced channel. Fixed by stripping it in
`get_binding_params` — the `#[serde(skip)]` on the field alone was not
enough.
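
A minimal sketch of the explicit strip, using only the standard library. Everything here except the name `get_binding_params` is hypothetical (the state struct, the `runtime_params` helper, the `"password"` key), and the reason the field attribute was insufficient is not spelled out above; the sketch assumes the params map is built by hand, so a field-level skip never applies to it.

```rust
use std::collections::BTreeMap;

/// Hypothetical key under which the generated password is stored.
const PASSWORD_KEY: &str = "password";

/// Hypothetical local Postgres controller state.
struct LocalPostgresState {
    host: String,
    port: u16,
    password: String,
}

impl LocalPostgresState {
    /// Full params including the secret; intended only for the runtime
    /// worker-environment channel, never for synced state.
    fn runtime_params(&self) -> BTreeMap<String, String> {
        let mut params = BTreeMap::new();
        params.insert("host".to_string(), self.host.clone());
        params.insert("port".to_string(), self.port.to_string());
        params.insert(PASSWORD_KEY.to_string(), self.password.clone());
        params
    }

    /// Params that are safe to sync: the password is removed explicitly
    /// instead of relying on a serialization attribute on the field.
    fn get_binding_params(&self) -> BTreeMap<String, String> {
        let mut params = self.runtime_params();
        params.remove(PASSWORD_KEY);
        params
    }
}
```

A round-trip test in this style would assert that the synced map has no password entry while the runtime map still does.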

## What's in the layer
- Resource model + bindings (`alien-core`) — the Postgres resource, its
binding shapes, the heartbeat data.
- The Local controller (`alien-local` +
`alien-infra/src/postgres/local.rs`) — runs an embedded Postgres for
local development.
- Cloud client SDKs (`alien-aws-clients` / `gcp` / `azure`) — thin
wrappers over the managed cloud Postgres APIs (Aurora, Cloud SQL,
Flexible Server).
- SDK surface (`packages/core`, `packages/sdk`) — the generated schemas
and the TypeScript binding.

## How I tested
- `cargo test` across the touched crates (`alien-core`,
`alien-bindings`, `alien-local`, `alien-infra`), including the binding
round-trip and encoding-parity tests.
- The local embedded-Postgres integration test (`alien-local`).
- Exercised end to end in the full Postgres cloud e2e (the setup layer,
#90, stacks on this).

Security walk for the password (this PR touches the secret):
- Synced and persisted state never carry the password — the round-trip
test asserts it is absent from the serialized binding params.
- The runtime worker-env delivery is per request and not persisted.
- Errors from the secret path are redacted (request body scrubbed before
it can reach an error chain).
- The one gap that existed (password on the synced channel) is the one
this PR fixes. Nothing else turned up.