infra: two-host k3s topology for horizontal web/worker scaling #802

## Context

The current production cluster is a single-node k3s deployment on a \$12 DigitalOcean droplet. The stateless workloads (`glaze-web`, `glaze-worker`) can already scale horizontally, but two blockers prevent scheduling them across multiple nodes:

1. **Local-path storage** — Postgres (10Gi) and Redis (1Gi) PVCs are pinned to the host's local disk.
2. **Single-host firewall** — nftables rules assume all pod traffic is intra-node; flannel inter-node traffic is not permitted.

This issue captures the proposed two-host topology and the monitoring thresholds that should trigger acting on it.

---

## Proposed Topology

### Node layout

| Node | Role | Droplet size | Cost |
|---|---|---|---|
| `glaze-prod` (existing) | k3s server + workloads | \$12 (1 vCPU, 2 GB) | \$12/mo |
| `glaze-worker-1` (new) | k3s agent + workloads | \$12 (1 vCPU, 2 GB) | \$12/mo |
| DO Managed Postgres | External, removes storage blocker | Basic 1 GB | ~\$15/mo |
| DO Managed Redis *(optional)* | External, or keep self-hosted w/ Sentinel | Basic | ~\$15/mo |

**Baseline cost delta:** ~\$27/mo (new droplet + managed Postgres) to ~\$42/mo (with managed Redis as well), on top of the current \$12/mo.
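
The delta range can be reproduced from the table prices. This is only an arithmetic sketch; the managed-service prices are the approximate figures quoted above, not a live quote.

```python
# Monthly prices in USD, copied from the node layout table.
CURRENT_MONTHLY = 12   # existing glaze-prod droplet
NEW_WORKER = 12        # glaze-worker-1
MANAGED_POSTGRES = 15  # approximate
MANAGED_REDIS = 15     # approximate, optional


def cost_delta(include_managed_redis: bool) -> int:
    """Return the added monthly cost on top of the current droplet."""
    delta = NEW_WORKER + MANAGED_POSTGRES
    if include_managed_redis:
        delta += MANAGED_REDIS
    return delta


low = cost_delta(include_managed_redis=False)
high = cost_delta(include_managed_redis=True)
print(f"delta: ${low}-{high}/mo, total: ${CURRENT_MONTHLY + low}-{CURRENT_MONTHLY + high}/mo")
# delta: $27-42/mo, total: $39-54/mo
```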

### What changes

| Component | Change |
|---|---|
| **Postgres** | Migrate from `StatefulSet` + `local-path` PVC → DO Managed Postgres. Remove `glaze-postgres` Helm release. Update `DATABASE_URL` secret in Infisical. |
| **Redis** | Option A: Migrate to DO Managed Redis. Option B: Keep self-hosted, add Redis Sentinel for HA. Update `REDIS_URL` secret. |
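| **Storage class** | Drop `local-path` from all remaining PVCs (currently only Postgres + Redis use it). Under Redis Option B, the Redis PVC still pins that pod to one node unless its storage class changes too. |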
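| **Firewall** | Open UDP 8472 (flannel VXLAN) between the two droplets' private IPs in the nftables config. The agent must also reach the server on TCP 6443 to join, and k3s lists TCP 10250 (kubelet) between nodes for metrics. |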
| **k3s join** | Update `ensure_cluster.sh` to provision and join the agent node (k3s agent join token, server URL). |
| **k3s config** | Loosen `max-requests-inflight`, `max-mutating-requests-inflight`, and controller sync concurrency limits, which were throttled to fit a single-core node. |
| **Web/Worker** | No changes needed — already stateless Deployments with no affinity constraints. HPA or manual replica bumps work immediately after the above. |
| **Traefik** | No changes — already a LoadBalancer service, routes across nodes transparently. |
| **Tailscale operator** | Verify multi-node awareness; proxy pods should reschedule without config changes. |
| **Rollout strategy** | Remove the sequential web-then-worker pause in `cd.yml`; it exists only to avoid memory spikes when both roll out on a single node. |
| **cert-manager / ESO / external-dns** | No changes expected — cluster-scoped, node-agnostic. |
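
A rough sketch of the firewall and join rows, useful mainly as a checklist for what `ensure_cluster.sh` would need to emit. The nftables table and chain names (`inet filter`, `input`) and the example IPs are assumptions, not taken from the current config; `K3S_URL` and `K3S_TOKEN` are the environment variables the k3s install script reads when joining an agent.

```python
def flannel_nft_rules(peer_ip: str, table: str = "inet filter", chain: str = "input") -> list[str]:
    """Render the nft command that admits flannel VXLAN traffic from one peer node.

    The table and chain defaults are hypothetical; match them to the real ruleset.
    """
    return [
        f"nft add rule {table} {chain} ip saddr {peer_ip} udp dport 8472 accept",
    ]


def agent_join_env(server_ip: str, token: str) -> dict[str, str]:
    """Environment for the k3s install script when joining a node as an agent."""
    return {
        "K3S_URL": f"https://{server_ip}:6443",  # default k3s API/supervisor port
        "K3S_TOKEN": token,                      # node token issued by the server
    }


if __name__ == "__main__":
    # Hypothetical private IPs: 10.0.0.2 = glaze-prod, 10.0.0.3 = glaze-worker-1.
    for rule in flannel_nft_rules("10.0.0.3"):
        print(rule)
    print(agent_join_env("10.0.0.2", "<node-token>"))
```

Each node needs the rule for the other node's private IP, so the server gets the worker's address and the worker gets the server's.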

### What stays the same

- Traefik ingress (public + Tailscale paths)
- cert-manager + Let's Encrypt
- External Secrets Operator → Infisical
- GitHub Actions CI/CD flow
- Helm release structure

---

## Scaling trigger metrics

Monitor these in Grafana Cloud (already wired via `glaze-otelcol`). Act on this issue when **two or more** "Act" thresholds are sustained for >30 minutes during normal usage hours.
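
The trigger rule above can be sketched as a small check over time series. This is an illustration, not the alerting config: the `Sample` type and function names are made up here, "sustained" is read as "continuously above the threshold with no dips", and the "normal usage hours" filter is left out.

```python
from dataclasses import dataclass

SUSTAIN_SECONDS = 30 * 60  # ">30 minutes" from the rule above
MIN_BREACHED = 2           # "two or more" thresholds


@dataclass
class Sample:
    ts: float     # Unix timestamp, seconds
    value: float  # metric value at that time


def breach_duration(samples: list[Sample], threshold: float, now: float) -> float:
    """Seconds the metric has stayed continuously above threshold, up to now."""
    start = None
    for s in sorted(samples, key=lambda s: s.ts):
        if s.value > threshold:
            if start is None:
                start = s.ts  # a new run above the threshold begins here
        else:
            start = None      # any dip resets the run
    return 0.0 if start is None else now - start


def should_act(series: dict[str, tuple[list[Sample], float]], now: float) -> bool:
    """True when at least MIN_BREACHED metrics are breached for longer than SUSTAIN_SECONDS."""
    breached = [
        name
        for name, (samples, threshold) in series.items()
        if breach_duration(samples, threshold, now) > SUSTAIN_SECONDS
    ]
    return len(breached) >= MIN_BREACHED
```

In Grafana Cloud the same idea would normally be expressed as alert rules with a pending period rather than application code; the sketch only pins down what "two or more, sustained" means.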

### CPU
| Metric | Watch | Act |
|---|---|---|
| Node CPU utilization | >60% sustained | >80% sustained |
| Web pod CPU throttling (`container_cpu_cfs_throttled_periods_total`) | Any throttling | Frequent throttling |
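
To put a number on throttling, the throttled-periods counter is usually divided by its cAdvisor companion counter, `container_cpu_cfs_periods_total`, over the same window. A minimal sketch, assuming both counters are sampled at the start and end of the window; where "frequent" begins is deliberately not defined here.

```python
def throttle_ratio(
    throttled_start: float,
    throttled_end: float,
    periods_start: float,
    periods_end: float,
) -> float:
    """Fraction of CFS periods in the window during which the container was throttled.

    Inputs are raw counter values at the start and end of the window.
    Returns 0.0 if the counters reset or no periods elapsed.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0 or throttled < 0:
        return 0.0
    return throttled / periods


# Example: 50 of 200 periods throttled in the window.
print(throttle_ratio(100, 150, 1000, 1200))  # 0.25
```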

### Memory
| Metric | Watch | Act |
|---|---|---|
| Node memory utilization | >75% | >85% |
| Web/worker OOMKill events | Any | — |

### Request latency
| Metric | Watch | Act |
|---|---|---|
| p95 response time (`traefik_service_request_duration_seconds`) | >500ms | >1s |
| p99 response time | >1s | >2s |
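
Percentiles from a duration histogram are estimates interpolated from bucket counts, so their precision depends on the bucket boundaries. The sketch below mirrors the linear interpolation that Prometheus-style `histogram_quantile` performs; the bucket values are made up, and it assumes the Traefik metric arrives as cumulative `le` buckets with the lowest bucket starting at 0.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count), ascending, last bound = +inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets: 100 requests, 98 of them under 1s.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

With these buckets a p99 can never read above 1s, because everything slower lands in the overflow bucket. For the >1s and >2s thresholds to be observable, the histogram needs finite bucket bounds above those values.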

### Database
| Metric | Watch | Act |
|---|---|---|
| Postgres active connections | >40 (of max 60) | >50 |
| Postgres query latency p95 | >100ms | >300ms |

### Queue depth (Celery / Redis)
| Metric | Watch | Act |
|---|---|---|
| Redis memory utilization | >60% of 1Gi | >80% |
| Celery queue depth (default queue) | >20 tasks pending | >50 tasks pending |
| Celery task age (oldest pending) | >30s | >2min |
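
With the Redis broker, Celery keeps pending tasks in a Redis list named after the queue, and the default queue is named `celery`, so queue depth is a single `LLEN`. A minimal sketch under that assumption; `queue_state` and `SupportsLlen` are names invented for this example, and tasks already prefetched by workers are not counted.

```python
from typing import Protocol


class SupportsLlen(Protocol):
    """Anything with a Redis-style LLEN, e.g. a redis-py client."""

    def llen(self, name: str) -> int: ...


WATCH_DEPTH = 20  # ">20 tasks pending"
ACT_DEPTH = 50    # ">50 tasks pending"


def queue_state(client: SupportsLlen, queue: str = "celery") -> tuple[int, str]:
    """Return (pending task count, "ok" | "watch" | "act") for one Celery queue."""
    depth = int(client.llen(queue))
    if depth > ACT_DEPTH:
        return depth, "act"
    if depth > WATCH_DEPTH:
        return depth, "watch"
    return depth, "ok"
```

A redis-py client created with `redis.Redis.from_url(...)` from the `REDIS_URL` secret should satisfy `SupportsLlen`. Task age needs a different source, since `LLEN` says nothing about how long the oldest task has waited.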

### Availability
| Metric | Watch | Act |
|---|---|---|
| HTTP 5xx rate | >0.1% of requests | >1% |
| Pod restart count | Any unexpected restarts | Frequent restarts |
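
The 5xx rate is a share of all requests in the window, not a raw count, so low-traffic periods can swing it sharply. A small sketch of the calculation, assuming per-status-code request counts for the window are already available (for example from Traefik's request counters); the sample numbers are invented.

```python
def error_rate_percent(status_counts: dict[int, int]) -> float:
    """Percentage of requests in the window that returned a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 100.0 * errors / total


# Hypothetical window: 20 server errors out of 10,000 requests.
window = {200: 9950, 404: 30, 500: 15, 503: 5}
rate = error_rate_percent(window)
print(f"{rate:.2f}% 5xx")  # above the 0.1% watch line, below the 1% act line
```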

---

## Prerequisites before acting

- [ ] Grafana Cloud dashboards expose the above metrics (verify OTel collector scrape config covers Postgres + Redis exporters)
- [ ] DO Managed Postgres migration runbook written and dry-run tested against a local restore
- [ ] Backup/restore verified for managed Postgres (current hourly Dropbox backup strategy may need adaptation)
- [ ] `ensure_cluster.sh` updated to support agent node provisioning
- [ ] Firewall config tested in staging or documented as a one-liner

---

## Deliberately deferred

No action needed until monitoring thresholds are hit. This issue exists to have a ready plan when invited users generate real load.
