Context
The current production cluster is a single-node k3s deployment on a $12 DigitalOcean droplet. The stateless workloads (`glaze-web`, `glaze-worker`) can already scale horizontally, but two blockers prevent scheduling them across multiple nodes:
- Local-path storage — Postgres (10Gi) and Redis (1Gi) PVCs are pinned to the host's local disk.
- Single-host firewall — nftables rules assume all pod traffic is intra-node; flannel inter-node traffic is not permitted.
This issue captures the proposed two-host topology and the monitoring thresholds that should trigger acting on it.
Proposed Topology
Node layout
| Node | Role | Droplet size | Cost |
| --- | --- | --- | --- |
| `glaze-prod` (existing) | k3s server + workloads | $12 (2 vCPU, 2 GB) | $12/mo |
| `glaze-worker-1` (new) | k3s agent + workloads | $12 (2 vCPU, 2 GB) | $12/mo |
| DO Managed Postgres | External, removes storage blocker | Basic 1 GB | ~$15/mo |
| DO Managed Redis (optional) | External, or keep self-hosted w/ Sentinel | Basic | ~$15/mo |
Baseline cost delta: ~$27–42/mo vs. current $12/mo.
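The delta can be reproduced from the prices in the node layout table. The sketch below is only a worked check of that arithmetic; the constant and function names are illustrative, and the managed-service prices are the approximate figures quoted above, not billing data.

```python
# Monthly prices in USD, copied from the node layout table.
CURRENT_MONTHLY = 12   # existing glaze-prod droplet
WORKER_NODE = 12       # new glaze-worker-1 droplet
MANAGED_POSTGRES = 15  # approximate
MANAGED_REDIS = 15     # approximate, optional


def cost_delta(include_managed_redis: bool) -> int:
    """Added monthly cost on top of the current single-node setup."""
    delta = WORKER_NODE + MANAGED_POSTGRES
    if include_managed_redis:
        delta += MANAGED_REDIS
    return delta


low = cost_delta(include_managed_redis=False)
high = cost_delta(include_managed_redis=True)
print(f"delta: ${low}-{high}/mo")
print(f"new total: ${CURRENT_MONTHLY + low}-{CURRENT_MONTHLY + high}/mo")
```

The lower bound corresponds to keeping Redis self-hosted; the upper bound adds DO Managed Redis.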
What changes
| Component | Change |
| --- | --- |
| Postgres | Migrate from `StatefulSet` + `local-path` PVC → DO Managed Postgres. Remove `glaze-postgres` Helm release. Update `DATABASE_URL` secret in Infisical. |
| Redis | Option A: Migrate to DO Managed Redis. Option B: Keep self-hosted, add Redis Sentinel for HA. Update `REDIS_URL` secret. |
| Storage class | Drop `local-path` from all remaining PVCs (currently only Postgres + Redis need it). |
| Firewall | Open UDP 8472 (flannel VXLAN) between the two droplet private IPs in nftables config. |
| k3s join | Update `ensure_cluster.sh` to provision and join the agent node (k3s agent join token, server URL). |
| k3s config | Loosen `max-requests-inflight`, `max-mutating-requests-inflight`, and controller sync concurrency limits; they were throttled for single-core. |
| Web/Worker | No changes needed: already stateless Deployments with no affinity constraints. HPA or manual replica bumps work immediately after the above. |
| Traefik | No changes: already a LoadBalancer service, routes across nodes transparently. |
| Tailscale operator | Verify multi-node awareness; proxy pods should reschedule without config changes. |
| Rollout strategy | Remove the sequential web-then-worker pause in `cd.yml`; it exists only to avoid memory spikes on a single core. |
| cert-manager / ESO / external-dns | No changes expected: cluster-scoped, node-agnostic. |
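For the firewall change, a minimal sketch of how the per-node rule could be rendered is shown here. It assumes the nftables config has an `inet filter` table with an `input` chain, and the private IPs are placeholders; neither comes from this issue, so adjust both to the real config before use.

```python
import ipaddress

FLANNEL_VXLAN_PORT = 8472  # UDP port named in the Firewall row

# Hypothetical private IPs; substitute the droplets' real VPC addresses.
NODES = {
    "glaze-prod": "10.114.0.2",
    "glaze-worker-1": "10.114.0.3",
}


def flannel_rule(peer_private_ip: str,
                 table: str = "inet filter",   # assumed table name
                 chain: str = "input") -> str:  # assumed chain name
    """Render one nft command admitting VXLAN traffic from a single peer."""
    peer = ipaddress.ip_address(peer_private_ip)  # raises ValueError on typos
    if not peer.is_private:
        raise ValueError(f"{peer} is not a private address")
    family = "ip" if peer.version == 4 else "ip6"
    return (f"nft add rule {table} {chain} "
            f"{family} saddr {peer} udp dport {FLANNEL_VXLAN_PORT} accept")


def rules_for(node: str) -> list:
    """Rules to install on `node`: one per other cluster member."""
    return [flannel_rule(ip) for name, ip in NODES.items() if name != node]


for node in NODES:
    print(node, rules_for(node))
```

Restricting the source address to the peer's private IP keeps the VXLAN port closed to the public interface, which matches the "between the two droplet private IPs" wording above.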
What stays the same
- Traefik ingress (public + Tailscale paths)
- cert-manager + Let's Encrypt
- External Secrets Operator → Infisical
- GitHub Actions CI/CD flow
- Helm release structure
Scaling trigger metrics
Monitor these in Grafana Cloud (already wired via `glaze-otelcol`). Pull the trigger on this issue when two or more thresholds are sustained for >30 minutes during normal usage hours.
CPU
| Metric | Watch | Act |
| --- | --- | --- |
| Node CPU utilization | >60% sustained | >80% sustained |
| Web pod CPU throttling (`container_cpu_cfs_throttled_periods_total`) | Any throttling | Frequent throttling |
Memory
| Metric | Watch | Act |
| --- | --- | --- |
| Node memory utilization | >75% | >85% |
| Web/worker OOMKill events | Any | - |
Request latency
| Metric | Watch | Act |
| --- | --- | --- |
| p95 response time (`traefik_service_request_duration_seconds`) | >500ms | >1s |
| p99 response time | >1s | >2s |
Database
| Metric | Watch | Act |
| --- | --- | --- |
| Postgres active connections | >40 (of max 60) | >50 |
| Postgres query latency p95 | >100ms | >300ms |
Queue depth (Celery / Redis)
| Metric | Watch | Act |
| --- | --- | --- |
| Redis memory utilization | >60% of 1Gi | >80% |
| Celery queue depth (default queue) | >20 tasks pending | >50 tasks pending |
| Celery task age (oldest pending) | >30s | >2min |
Availability
| Metric | Watch | Act |
| --- | --- | --- |
| HTTP 5xx rate | >0.1% of requests | >1% |
| Pod restart count | Any unexpected restarts | Frequent restarts |
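The trigger rule (two or more thresholds sustained for >30 minutes) can be sketched as a small check over exported samples. This is an illustration under stated assumptions, not the actual alerting config: the metric keys are made-up names, the rule is read as applying to the numeric Act thresholds, and the "normal usage hours" filter is not modeled.

```python
from dataclasses import dataclass

SUSTAIN_SECONDS = 30 * 60  # ">30 minutes"
MIN_BREACHED = 2           # "two or more thresholds"

# Numeric Act thresholds from the tables; keys are hypothetical names.
ACT_THRESHOLDS = {
    "node_cpu_pct": 80.0,
    "node_memory_pct": 85.0,
    "http_p95_seconds": 1.0,
    "http_p99_seconds": 2.0,
    "pg_active_connections": 50.0,
    "pg_query_p95_seconds": 0.3,
    "redis_memory_pct": 80.0,
    "celery_queue_depth": 50.0,
    "celery_oldest_task_seconds": 120.0,
    "http_5xx_pct": 1.0,
}


@dataclass
class Sample:
    timestamp: float  # seconds since any fixed epoch
    value: float


def sustained_above(samples, threshold, window=SUSTAIN_SECONDS) -> bool:
    """True if the most recent unbroken run of breaching samples spans more than `window`."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    run_start = None
    for sample in reversed(ordered):
        if sample.value <= threshold:
            break  # the breaching run ends here
        run_start = sample.timestamp
    if run_start is None:
        return False
    return ordered[-1].timestamp - run_start > window


def should_scale(series, thresholds=ACT_THRESHOLDS) -> bool:
    """`series` maps a metric key to its list of Sample objects."""
    breached = [
        name for name, limit in thresholds.items()
        if sustained_above(series.get(name, []), limit)
    ]
    return len(breached) >= MIN_BREACHED
```

A single metric spiking past its threshold, or two metrics breaching for only a few minutes, does not satisfy `should_scale`; that mirrors the intent of waiting for sustained load rather than reacting to bursts.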
Prerequisites before acting
- `ensure_cluster.sh` updated to support agent node provisioning
Deliberately deferred
No action needed until monitoring thresholds are hit. This issue exists to have a ready plan when invited users generate real load.