infra: two-host k3s topology for horizontal web/worker scaling #802

@shaoster

Context

The current production cluster is a single-node k3s deployment on a $12 DigitalOcean droplet. The stateless workloads (`glaze-web`, `glaze-worker`) can already scale horizontally, but two blockers prevent scheduling them across multiple nodes:

  1. Local-path storage — Postgres (10Gi) and Redis (1Gi) PVCs are pinned to the host's local disk.
  2. Single-host firewall — nftables rules assume all pod traffic is intra-node; flannel inter-node traffic is not permitted.

This issue captures the proposed two-host topology and the monitoring thresholds that should trigger acting on it.


Proposed Topology

Node layout

| Node | Role | Droplet size | Cost |
| --- | --- | --- | --- |
| glaze-prod (existing) | k3s server + workloads | $12 (2 vCPU, 2 GB) | $12/mo |
| glaze-worker-1 (new) | k3s agent + workloads | $12 (2 vCPU, 2 GB) | $12/mo |
| DO Managed Postgres | External, removes storage blocker | Basic 1 GB | ~$15/mo |
| DO Managed Redis (optional) | External, or keep self-hosted w/ Sentinel | Basic | ~$15/mo |

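Baseline cost delta: about +$27/mo (agent node + managed Postgres) to +$42/mo (with managed Redis as well) on top of the current $12/mo, for a total of roughly $39–54/mo.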
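
The cost delta can be reproduced from the node layout prices. This is a minimal sketch of that arithmetic; the constant and function names are made up for illustration, and the managed database prices are the approximate figures quoted above.

```python
# Monthly prices in USD, copied from the node layout table.
# The managed database prices are approximate ("~$15/mo") in the source.
CURRENT = 12           # glaze-prod, already paid for today
AGENT_NODE = 12        # glaze-worker-1
MANAGED_POSTGRES = 15
MANAGED_REDIS = 15     # optional (Redis Option A)


def monthly_delta(include_managed_redis: bool) -> int:
    """Extra monthly spend relative to the current single-node setup."""
    delta = AGENT_NODE + MANAGED_POSTGRES
    if include_managed_redis:
        delta += MANAGED_REDIS
    return delta


for with_redis in (False, True):
    delta = monthly_delta(with_redis)
    print(f"managed Redis={with_redis}: +${delta}/mo, total ${CURRENT + delta}/mo")
```

Keeping Redis self-hosted (Option B) lands at the low end of the range; moving it to DO Managed Redis lands at the high end.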

What changes

| Component | Change |
| --- | --- |
| Postgres | Migrate from StatefulSet + local-path PVC → DO Managed Postgres. Remove the glaze-postgres Helm release. Update the DATABASE_URL secret in Infisical. |
| Redis | Option A: migrate to DO Managed Redis. Option B: keep self-hosted and add Redis Sentinel for HA. Either way, update the REDIS_URL secret. |
| Storage class | Drop local-path from all remaining PVCs (currently only Postgres + Redis need it). |
| Firewall | Open UDP 8472 (flannel VXLAN) between the two droplet private IPs in the nftables config. |
| k3s join | Update ensure_cluster.sh to provision and join the agent node (k3s agent join token, server URL). |
| k3s config | Loosen max-requests-inflight, max-mutating-requests-inflight, and the controller sync concurrency limits; they were throttled for a single-core host. |
| Web/Worker | No changes needed: already stateless Deployments with no affinity constraints. HPA or manual replica bumps work immediately after the above. |
| Traefik | No changes: already a LoadBalancer service, routes across nodes transparently. |
| Tailscale operator | Verify multi-node awareness; proxy pods should reschedule without config changes. |
| Rollout strategy | Remove the sequential web-then-worker pause in cd.yml; it exists only to avoid memory spikes on the single-core node. |
| cert-manager / ESO / external-dns | No changes expected: cluster-scoped, node-agnostic. |
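
For the Firewall row, the rule can be written as a one-liner per node. The sketch below renders such a command under stated assumptions: the `inet filter` table, the `input` chain, and the `10.0.0.x` addresses are placeholders, not values from the real nftables config. It covers only the VXLAN port named above; the k3s networking docs also list TCP 6443 (agent to server API) and TCP 10250 (kubelet metrics) for multi-node clusters, which may need the same treatment.

```python
import ipaddress

FLANNEL_VXLAN_PORT = 8472  # UDP, as listed in the Firewall row


def vxlan_accept_rule(peer_ip: str, table: str = "inet filter", chain: str = "input") -> str:
    """Render an nft command that accepts flannel VXLAN traffic from one peer.

    The default table and chain names are assumptions; substitute the ones
    used by the actual nftables config.
    """
    peer = ipaddress.ip_address(peer_ip)  # raises ValueError on a malformed address
    family = "ip" if peer.version == 4 else "ip6"
    return (
        f"nft add rule {table} {chain} "
        f"{family} saddr {peer} udp dport {FLANNEL_VXLAN_PORT} accept"
    )


# Placeholder private IPs; each node accepts VXLAN traffic from the other.
nodes = {"glaze-prod": "10.0.0.2", "glaze-worker-1": "10.0.0.3"}
for name in nodes:
    for other, peer_ip in nodes.items():
        if other != name:
            print(f"# run on {name}")
            print(vxlan_accept_rule(peer_ip))
```

Restricting the source address to the peer's private IP keeps port 8472 closed to the public interface, which matches the "between the two droplet private IPs" wording.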

What stays the same

  • Traefik ingress (public + Tailscale paths)
  • cert-manager + Let's Encrypt
  • External Secrets Operator → Infisical
  • GitHub Actions CI/CD flow
  • Helm release structure

Scaling trigger metrics

Monitor these in Grafana Cloud (already wired via glaze-otelcol). Act on this issue when two or more "Act" thresholds are sustained for more than 30 minutes during normal usage hours. A metric in the "Watch" range is a signal to look more closely, not a trigger by itself.
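
The trigger rule can be sketched as a small check over time series. This is an illustration only: it assumes the rule refers to the "Act" column, treats "sustained" as an unbroken run above the limit, and does not model "normal usage hours". The metric names and sample data are hypothetical.

```python
from datetime import datetime, timedelta

SUSTAIN = timedelta(minutes=30)  # "sustained for >30 minutes"
MIN_BREACHED = 2                 # "two or more thresholds"


def sustained(samples, limit, window=SUSTAIN):
    """True if the value stays above `limit` for longer than `window`.

    `samples` is a time-ordered list of (timestamp, value) pairs.
    Any sample at or below the limit resets the run.
    """
    run_start = None
    for ts, value in samples:
        if value > limit:
            if run_start is None:
                run_start = ts
            if ts - run_start > window:
                return True
        else:
            run_start = None
    return False


def should_act(series, limits):
    """Return (decision, breached metric names) for the two-or-more rule."""
    breached = [name for name, samples in series.items()
                if sustained(samples, limits[name])]
    return len(breached) >= MIN_BREACHED, breached


def flat(value, minutes=40, step=5):
    """Hypothetical helper: a constant series sampled every `step` minutes."""
    start = datetime(2024, 1, 1, 12, 0)
    return [(start + timedelta(minutes=m), value) for m in range(0, minutes + 1, step)]


limits = {"node_cpu_pct": 80, "node_mem_pct": 85, "p95_latency_s": 1.0}
series = {
    "node_cpu_pct": flat(85),    # above the Act threshold for 40 minutes
    "node_mem_pct": flat(70),    # below threshold
    "p95_latency_s": flat(1.2),  # above the Act threshold for 40 minutes
}
print(should_act(series, limits))
```

In practice the same condition would more likely live in a Grafana alert rule; the sketch only pins down what "two or more thresholds sustained for >30 minutes" means operationally.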

CPU

| Metric | Watch | Act |
| --- | --- | --- |
| Node CPU utilization | >60% sustained | >80% sustained |
| Web pod CPU throttling (`container_cpu_cfs_throttled_periods_total`) | Any throttling | Frequent throttling |

Memory

| Metric | Watch | Act |
| --- | --- | --- |
| Node memory utilization | >75% | >85% |
| Web/worker OOMKill events | Any | |

Request latency

| Metric | Watch | Act |
| --- | --- | --- |
| p95 response time (`traefik_service_request_duration_seconds`) | >500ms | >1s |
| p99 response time | >1s | >2s |

Database

| Metric | Watch | Act |
| --- | --- | --- |
| Postgres active connections | >40 (of max 60) | >50 |
| Postgres query latency p95 | >100ms | >300ms |

Queue depth (Celery / Redis)

| Metric | Watch | Act |
| --- | --- | --- |
| Redis memory utilization | >60% of 1Gi | >80% |
| Celery queue depth (default queue) | >20 tasks pending | >50 tasks pending |
| Celery task age (oldest pending) | >30s | >2min |

Availability

| Metric | Watch | Act |
| --- | --- | --- |
| HTTP 5xx rate | >0.1% of requests | >1% |
| Pod restart count | Any unexpected restarts | Frequent restarts |
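
The numeric rows in the tables above can be kept as data so that a reading maps directly to "ok", "watch", or "act". A minimal sketch, assuming hypothetical metric keys and units normalized to percent, seconds, or counts; the qualitative rows (throttling, OOMKills, restarts) are left out because the tables give no numbers for them.

```python
# (watch, act) pairs taken from the threshold tables; the keys are made up.
THRESHOLDS = {
    "node_cpu_pct": (60, 80),
    "node_mem_pct": (75, 85),
    "p95_response_s": (0.5, 1.0),
    "p99_response_s": (1.0, 2.0),
    "pg_active_connections": (40, 50),   # of max 60
    "pg_query_p95_s": (0.1, 0.3),
    "redis_mem_pct": (60, 80),           # of 1Gi
    "celery_queue_depth": (20, 50),
    "celery_oldest_task_age_s": (30, 120),
    "http_5xx_pct": (0.1, 1.0),
}


def classify(metric: str, value: float) -> str:
    """Map a single reading to 'ok', 'watch', or 'act'.

    Thresholds are strict ('>'), matching how the tables are written.
    """
    watch, act = THRESHOLDS[metric]
    if value > act:
        return "act"
    if value > watch:
        return "watch"
    return "ok"


readings = {"pg_active_connections": 45, "http_5xx_pct": 1.5, "node_cpu_pct": 60}
for metric, value in readings.items():
    print(f"{metric}={value}: {classify(metric, value)}")
```

A single "act" reading is not enough to act on its own; the rule above still requires two or more thresholds sustained over time.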

Prerequisites before acting

  • Grafana Cloud dashboards expose the above metrics (verify OTel collector scrape config covers Postgres + Redis exporters)
  • DO Managed Postgres migration runbook written and dry-run tested against a local restore
  • Backup/restore verified for managed Postgres (current hourly Dropbox backup strategy may need adaptation)
  • ensure_cluster.sh updated to support agent node provisioning
  • Firewall config tested in staging or documented as a one-liner

Deliberately deferred

No action is needed until the monitoring thresholds above are hit. This issue exists so that a plan is ready when invited users start generating real load.

Labels: blocked (waiting on external blocker), tech-debt (code quality or architectural debt to address over time)