Fix Target Allocator startup crashes, PrometheusCR watcher, and Prometheus config pod restart #386

Open
musa-asad wants to merge 3 commits into aws:main from musa-asad:ta-register-prometheus-cr-watcher-flag

@musa-asad musa-asad commented Jun 16, 2026

Contributor

Summary

Multiple defects prevent the Target Allocator (TA) from working correctly in both Helm and EKS Add-On deployments:

  1. TA CrashLoopBackOff: The TA binary never registered the --enable-prometheus-cr-watcher flag that the operator passes, so argument parsing under pflag.ExitOnError fails on the unknown flag and the process exits with status 2 immediately.
  2. Three latent defects (unreachable until defect 1 is fixed): an empty-namespace panic on the synthetic Prometheus object, a parse failure on an empty evaluation_interval, and nil SD metrics that leave 0 targets discovered.
  3. Prometheus config changes don't restart agent pods: Only spec.config changes triggered restarts — spec.prometheus changes were ignored.
  4. scrape_protocols regression risk: No test guarded that the Prometheus dependency defaults scrape_protocols on config load.

Changes

TA binary (commit 1):

  • config/flags.go — register enable-prometheus-cr-watcher Bool flag + getter
  • config/config.go — OR the flag into Config.PrometheusCR.Enabled
  • watcher/promOperator.go — set ObjectMeta.Namespace from OTELCOL_NAMESPACE env; set EvaluationInterval to ScrapeInterval
  • main.go — initialize sdMetrics with discovery.CreateAndRegisterSDMetrics()
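
The flag registration and the OR into the config can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the standard library `flag` package as a stand-in for pflag (the unknown-flag error text differs), and `prometheusCRConfig`, `parseArgs`, and `applyFlag` are hypothetical names. It uses `ContinueOnError` so the parse error can be observed instead of terminating the process.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// prometheusCRConfig is a hypothetical stand-in for the YAML-backed setting.
type prometheusCRConfig struct {
	Enabled bool
}

// parseArgs parses args with a fresh flag set. When register is false the
// flag is left unregistered, which reproduces the pre-fix parse failure.
func parseArgs(args []string, register bool) (bool, error) {
	fs := flag.NewFlagSet("target-allocator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	var enabled *bool
	if register {
		enabled = fs.Bool("enable-prometheus-cr-watcher", false, "enable the PrometheusCR watcher")
	}
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return enabled != nil && *enabled, nil
}

// applyFlag ORs the flag value into the config, so either source can enable
// the watcher and neither can disable what the other enabled.
func applyFlag(cfg prometheusCRConfig, flagValue bool) prometheusCRConfig {
	cfg.Enabled = cfg.Enabled || flagValue
	return cfg
}

func main() {
	args := []string{"--enable-prometheus-cr-watcher"}

	if _, err := parseArgs(args, false); err != nil {
		fmt.Println("unregistered flag:", err)
	}

	enabled, err := parseArgs(args, true)
	fmt.Println("registered flag:", enabled, err)
	fmt.Println("effective setting:", applyFlag(prometheusCRConfig{Enabled: false}, enabled).Enabled)
}
```

With `ExitOnError` in place of `ContinueOnError`, the first call would end the process rather than return an error, which is the crash described in the summary.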

Pod restart (commit 3):

  • internal/manifests/collector/annotations.go: configHashInput() appends serialized Spec.Prometheus to the hash, gated by Prometheus.IsEmpty()

Regression test (commit 2):

  • config/scrape_protocols_regression_test.go: TestScrapeProtocolsDefaultedOnLoad

Test Output

A/B Comparison — Upstream OTel vs CWA Helm (fixed) vs CWA Add-On (fixed)

All 3 environments deployed on the same EKS cluster, same OTEL version, same scrape targets. The fixed images were deployed on both the Helm chart path AND the EKS managed Add-On path (add-on installed via aws eks create-addon, then TA/operator deployments patched to the fixed images).

| Check | upstream-otel | cwa-helm (fixed) | cwa-addon (fixed) |
| --- | --- | --- | --- |
| TA pod Running | Running 14h, 0 restarts | Running 21m, 0 restarts | Running 11h, 0 restarts |
| /livez | 200 OK | 200 (mTLS) | 200 (mTLS) |
| /readyz | 200 OK | 200 (mTLS) | 200 (mTLS) |
| Collectors registered | both collector-0 & -1 receive targets | both agent-0 & -1 receive targets | both agent-0 & -1 receive targets |
| Jobs discovered | 4 (2 static + 2 prometheusCR) | 5 (3 static + 2 prometheusCR) | 4 (2 static + 2 prometheusCR) |
| scrape_protocols present | present on all 4 jobs, 0 MISSING | present on all 5 jobs, 0 MISSING | present on all 4 jobs, 0 MISSING |
| Agent pods Running (0 restarts) | 2/2 Running, 0 restarts | 2/2 Running, 0 restarts | 2/2 Running, 0 restarts |
| Error strings (unknown flag / scrape_protocols / Failed to apply / See you next time) | 0/0/0/0 | 0/0/0/0 | 0/0/0/0 |
| Config change rolls pods | N/A | YES (both UIDs changed) | YES (both UIDs changed; TA also rolled) |

Conclusion: Both fixed paths match upstream on every check. The --enable-prometheus-cr-watcher flag is accepted (no crash), prometheusCR jobs are discovered, scrape_protocols is present, targets are sharded across collectors, and prometheus config changes trigger pod restarts.

Raw output — Helm path (StatefulSet, 2 replicas, TA + prometheusCR enabled)

$ kubectl get pods -n <test-ns-helm>
cloudwatch-agent-0                                                1/1     Running   0          2m38s
cloudwatch-agent-1                                                1/1     Running   0          2m38s
cloudwatch-agent-target-allocator-5c5b6bd49d-smsb5                1/1     Running   0          2m38s

$ kubectl logs <ta-pod> --tail=10 | grep -E 'flag|error|Starting'
{"level":"info","ts":...,"msg":"Starting target allocator"}
{"level":"info","ts":...,"msg":"Starting target watcher","strategy":"consistent-hashing"}

# TA args confirm flag accepted:
$ kubectl get pod <ta-pod> -o jsonpath='{.spec.containers[0].args}'
["--enable-prometheus-cr-watcher"]

# Error string counts (all 0):
$ for p in cloudwatch-agent-0 cloudwatch-agent-1; do echo "$p:"; kubectl logs $p | grep -c 'unknown flag'; kubectl logs $p | grep -c 'scrape_protocols cannot be empty'; done
cloudwatch-agent-0: 0 0
cloudwatch-agent-1: 0 0

# Config change pod restart proof:
BEFORE: pod UIDs c9fa...  cbe0...
(changed prometheus relabel value)
AFTER:  pod UIDs 066f...  1310...   (ALL NEW — pods rolled)

Raw output — EKS Add-On path (add-on installed, images patched)

$ kubectl get pods -n <test-ns-addon>
amazon-cloudwatch-observability-controller-manager-f6cdf4fqs4bp   1/1     Running   0          21m
cloudwatch-agent-0                                                1/1     Running   0          20m
cloudwatch-agent-1                                                1/1     Running   0          20m
cloudwatch-agent-target-allocator-77b76f576d-r688k                1/1     Running   0          11h

# TA args confirm flag accepted on add-on path:
$ kubectl get pod <ta-pod> -n <test-ns-addon> -o jsonpath='{.spec.containers[0].args}'
["--enable-prometheus-cr-watcher"]

# Error string counts (all 0):
$ for p in cloudwatch-agent-0 cloudwatch-agent-1; do echo "$p:"; kubectl logs -n <test-ns-addon> $p | grep -c 'unknown flag'; kubectl logs -n <test-ns-addon> $p | grep -c 'scrape_protocols cannot be empty'; done
cloudwatch-agent-0: 0 0
cloudwatch-agent-1: 0 0

# Config change pod restart proof:
BEFORE: pod UIDs ad31...  912d...
(changed prometheus relabel value)
AFTER:  pod UIDs 9586...  814a...   (ALL NEW — pods rolled); TA also rolled

Raw output — Upstream OTel (control)

$ kubectl get pods -n <test-ns-upstream>
otel-targetallocator-0                     1/1     Running   0          14h
otel-collector-0                           1/1     Running   0          14h
otel-collector-1                           1/1     Running   0          14h

# Same targets discovered and allocated:
$ curl -s http://localhost:8080/jobs | python3 -m json.tool | grep job_name
"kubernetes-pods-annotated"
"serviceMonitor/upstream-otel/nginx/0"
"podMonitor/upstream-otel/node-exporter/0"
"prometheus-sample-app"

Unit tests

$ go test ./cmd/amazon-cloudwatch-agent-target-allocator/... -count=1
ok   .../config     0.297s
ok   .../allocation 0.149s
ok   .../watcher    0.402s

$ go test ./internal/manifests/collector/... -count=1
ok   .../collector   0.027s

Related

Commits

  1. 1376451 — Register enable-prometheus-cr-watcher flag and fix PrometheusCR watcher startup
  2. 0405f4d — test(target-allocator): guard scrape_protocols defaulting on config load
  3. 0318a47 — fix(collector): roll pods when Prometheus config changes

…er startup

The target-allocator declared the enable-prometheus-cr-watcher flag name as a
constant but never registered it on the flag set, while the operator passes
--enable-prometheus-cr-watcher whenever PrometheusCR.enabled is true. Because
args are parsed with pflag.ExitOnError, the unregistered flag caused the binary
to print 'unknown flag' and exit(2), putting the target-allocator pod into
CrashLoopBackOff.

This change registers the flag and ORs it with the YAML prometheus_cr.enabled
setting, then fixes three latent defects that were previously unreachable
because the binary crashed first:

- promOperator: set a non-empty Namespace on the synthetic Prometheus object so
  the prometheus-operator config generator no longer panics with
  'namespace can't be empty' in store.ForNamespace.
- promOperator: set EvaluationInterval so the generated config does not render an
  empty global.evaluation_interval, which the prometheus config parser rejects
  with 'empty duration string'.
- main: create and register service-discovery metrics and pass them to
  discovery.NewManager; passing a nil sdMetrics map makes every SD provider fail
  to register, yielding zero discovered targets.

RELEASE_NOTES updated.
@musa-asad musa-asad changed the title from "Register enable-prometheus-cr-watcher flag and fix PrometheusCR watcher startup" to "Fix Target Allocator startup crashes, PrometheusCR watcher, and Prometheus config pod restart" Jun 16, 2026
@musa-asad musa-asad self-assigned this Jun 16, 2026
@musa-asad musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 2 times, most recently from b936969 to d198928 Compare June 16, 2026 18:28
Add a regression test asserting that loading a Target Allocator config whose
static scrape job omits scrape_protocols still yields a non-empty
ScrapeProtocols on every loaded scrape config. This is defaulted by the pinned
Prometheus library during yaml.UnmarshalStrict into the prometheus Config type,
so the distributed /scrape_configs payload is never empty and the agent's
prometheus-receiver validation passes. The test fails fast if a future
dependency or load-path change drops this defaulting.
@musa-asad musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 2 times, most recently from 36d23fa to 0318a47 Compare June 16, 2026 18:32
@musa-asad musa-asad requested review from okankoAMZ and sky333999 June 16, 2026 18:37
@musa-asad musa-asad marked this pull request as ready for review June 17, 2026 14:53
@musa-asad musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 4 times, most recently from ac8b92d to b27505e Compare June 17, 2026 21:09
The pod-template restart-trigger sha256 was computed from Spec.Config only,
so a change to Spec.Prometheus (rendered into a separate ConfigMap) left the
pod template byte-identical and the workload controller did not roll the pods.

Fold the serialized Spec.Prometheus (PrometheusConfig.Yaml()) into the hash
input when it is non-empty, so a Prometheus-only change bumps the pod-template
annotation and triggers a rolling restart, matching agent-config behavior.
When no Prometheus config is set the hash input is byte-identical to the agent
config alone, leaving non-Prometheus agents unaffected.
@musa-asad musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch from b27505e to d61d693 Compare June 18, 2026 21:41