Fix Target Allocator startup crashes, PrometheusCR watcher, and Prometheus config pod restart by musa-asad · Pull Request #386 · aws/amazon-cloudwatch-agent-operator

musa-asad · 2026-06-16T09:22:45Z

Summary

Multiple defects prevent the Target Allocator (TA) from working correctly in both Helm and EKS Add-On deployments:

TA CrashLoopBackOff: The TA binary never registered the --enable-prometheus-cr-watcher flag that the operator passes. Because the flag set is parsed with pflag.ExitOnError, the unknown flag makes the process exit with status 2 immediately.
Three latent defects (unreachable until the crash above is fixed): the synthetic Prometheus object has an empty namespace and causes a panic, an empty evaluation_interval fails to parse, and nil SD metrics result in 0 targets discovered.
Prometheus config changes don't restart agent pods: Only spec.config changes triggered restarts — spec.prometheus changes were ignored.
scrape_protocols regression risk: No test verified that the Prometheus dependency defaults scrape_protocols on config load.

Changes

TA binary (commit 1):

config/flags.go — register enable-prometheus-cr-watcher Bool flag + getter
config/config.go — OR the flag into Config.PrometheusCR.Enabled
watcher/promOperator.go — set ObjectMeta.Namespace from OTELCOL_NAMESPACE env; set EvaluationInterval to ScrapeInterval
main.go — initialize sdMetrics with discovery.CreateAndRegisterSDMetrics()

Pod restart (commit 3):

internal/manifests/collector/annotations.go — configHashInput() appends serialized Spec.Prometheus to the hash, gated by Prometheus.IsEmpty()

Regression test (commit 2):

config/scrape_protocols_regression_test.go — TestScrapeProtocolsDefaultedOnLoad

Test Output

A/B Comparison — Upstream OTel vs CWA Helm (fixed) vs CWA Add-On (fixed)

All 3 environments deployed on the same EKS cluster, same OTEL version, same scrape targets. The fixed images were deployed on both the Helm chart path AND the EKS managed Add-On path (add-on installed via aws eks create-addon, then TA/operator deployments patched to the fixed images).

Check	upstream-otel	cwa-helm (fixed)	cwa-addon (fixed)
TA pod Running	Running 14h, 0 restarts	Running 21m, 0 restarts	Running 11h, 0 restarts
/livez	200 OK	200 (mTLS)	200 (mTLS)
/readyz	200 OK	200 (mTLS)	200 (mTLS)
Collectors registered	both collector-0 & -1 receive targets	both agent-0 & -1 receive targets	both agent-0 & -1 receive targets
Jobs discovered	4 (2 static + 2 prometheusCR)	5 (3 static + 2 prometheusCR)	4 (2 static + 2 prometheusCR)
scrape_protocols present	present on all 4 jobs, 0 MISSING	present on all 5 jobs, 0 MISSING	present on all 4 jobs, 0 MISSING
Agent pods Running (0 restarts)	2/2 Running, 0 restarts	2/2 Running, 0 restarts	2/2 Running, 0 restarts
Error strings (unknown flag / scrape_protocols / Failed to apply / See you next time)	0/0/0/0	0/0/0/0	0/0/0/0
Config change rolls pods	N/A	YES — both UIDs changed	YES — both UIDs changed; TA also rolled

Conclusion: Both fixed paths match upstream on every check. The --enable-prometheus-cr-watcher flag is accepted (no crash), prometheusCR jobs are discovered, scrape_protocols is present, targets are sharded across collectors, and prometheus config changes trigger pod restarts.

Raw output — Helm path (StatefulSet, 2 replicas, TA + prometheusCR enabled)

$ kubectl get pods -n <test-ns-helm>
cloudwatch-agent-0                                                1/1     Running   0          2m38s
cloudwatch-agent-1                                                1/1     Running   0          2m38s
cloudwatch-agent-target-allocator-5c5b6bd49d-smsb5                1/1     Running   0          2m38s

$ kubectl logs <ta-pod> --tail=10 | grep -E 'flag|error|Starting'
{"level":"info","ts":...,"msg":"Starting target allocator"}
{"level":"info","ts":...,"msg":"Starting target watcher","strategy":"consistent-hashing"}

# TA args confirm flag accepted:
$ kubectl get pod <ta-pod> -o jsonpath='{.spec.containers[0].args}'
["--enable-prometheus-cr-watcher"]

# Error string counts (all 0):
$ for p in cloudwatch-agent-0 cloudwatch-agent-1; do echo "$p:"; kubectl logs $p | grep -c 'unknown flag'; kubectl logs $p | grep -c 'scrape_protocols cannot be empty'; done
cloudwatch-agent-0: 0 0
cloudwatch-agent-1: 0 0

# Config change pod restart proof:
BEFORE: pod UIDs c9fa...  cbe0...
(changed prometheus relabel value)
AFTER:  pod UIDs 066f...  1310...   (ALL NEW — pods rolled)

Raw output — EKS Add-On path (add-on installed, images patched)

$ kubectl get pods -n <test-ns-addon>
amazon-cloudwatch-observability-controller-manager-f6cdf4fqs4bp   1/1     Running   0          21m
cloudwatch-agent-0                                                1/1     Running   0          20m
cloudwatch-agent-1                                                1/1     Running   0          20m
cloudwatch-agent-target-allocator-77b76f576d-r688k                1/1     Running   0          11h

# TA args confirm flag accepted on add-on path:
$ kubectl get pod <ta-pod> -n <test-ns-addon> -o jsonpath='{.spec.containers[0].args}'
["--enable-prometheus-cr-watcher"]

# Error string counts (all 0):
$ for p in cloudwatch-agent-0 cloudwatch-agent-1; do echo "$p:"; kubectl logs -n <test-ns-addon> $p | grep -c 'unknown flag'; kubectl logs -n <test-ns-addon> $p | grep -c 'scrape_protocols cannot be empty'; done
cloudwatch-agent-0: 0 0
cloudwatch-agent-1: 0 0

# Config change pod restart proof:
BEFORE: pod UIDs ad31...  912d...
(changed prometheus relabel value)
AFTER:  pod UIDs 9586...  814a...   (ALL NEW — pods rolled); TA also rolled

Raw output — Upstream OTel (control)

$ kubectl get pods -n <test-ns-upstream>
otel-targetallocator-0                     1/1     Running   0          14h
otel-collector-0                           1/1     Running   0          14h
otel-collector-1                           1/1     Running   0          14h

# Same targets discovered and allocated:
$ curl -s http://localhost:8080/jobs | python3 -m json.tool | grep job_name
"kubernetes-pods-annotated"
"serviceMonitor/upstream-otel/nginx/0"
"podMonitor/upstream-otel/node-exporter/0"
"prometheus-sample-app"

Unit tests

$ go test ./cmd/amazon-cloudwatch-agent-target-allocator/... -count=1
ok   .../config     0.297s
ok   .../allocation 0.149s
ok   .../watcher    0.402s

$ go test ./internal/manifests/collector/... -count=1
ok   .../collector   0.027s

Commits

1376451 — Register enable-prometheus-cr-watcher flag and fix PrometheusCR watcher startup
0405f4d — test(target-allocator): guard scrape_protocols defaulting on config load
0318a47 — fix(collector): roll pods when Prometheus config changes

The target-allocator declared the enable-prometheus-cr-watcher flag name as a constant but never registered it on the flag set, while the operator passes --enable-prometheus-cr-watcher whenever PrometheusCR.enabled is true. Because args are parsed with pflag.ExitOnError, the unregistered flag caused the binary to print 'unknown flag' and exit(2), putting the target-allocator pod into CrashLoopBackOff.

This change registers the flag and ORs it with the YAML prometheus_cr.enabled setting, then fixes three latent defects that were previously unreachable because the binary crashed first:

- promOperator: set a non-empty Namespace on the synthetic Prometheus object so the prometheus-operator config generator no longer panics with 'namespace can't be empty' in store.ForNamespace.
- promOperator: set EvaluationInterval so the generated config does not render an empty global.evaluation_interval, which the prometheus config parser rejects with 'empty duration string'.
- main: create and register service-discovery metrics and pass them to discovery.NewManager; passing a nil sdMetrics map makes every SD provider fail to register, yielding zero discovered targets.

RELEASE_NOTES updated.
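The crash and its fix can be illustrated with a minimal sketch. It assumes the standard library flag package as a stand-in for pflag, and it uses flag.ContinueOnError so the parse error can be inspected instead of terminating the process (with ExitOnError the same parse failure exits with status 2). The function parseWatcherFlag and its register and yamlEnabled parameters are hypothetical names for illustration only, not the operator's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseWatcherFlag parses args and reports whether the PrometheusCR watcher
// should be enabled. register models the before/after of the fix: when false,
// the flag is never defined on the flag set. yamlEnabled stands in for the
// prometheus_cr.enabled value loaded from the YAML config (hypothetical name).
func parseWatcherFlag(args []string, register bool, yamlEnabled bool) (bool, error) {
	// ContinueOnError returns the parse error to the caller;
	// ExitOnError would call os.Exit(2) on the same error.
	fs := flag.NewFlagSet("target-allocator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress usage output in this sketch

	cliEnabled := new(bool)
	if register {
		cliEnabled = fs.Bool("enable-prometheus-cr-watcher", false, "enable the PrometheusCR watcher")
	}
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	// The watcher is on if either the YAML setting or the CLI flag enables it.
	return yamlEnabled || *cliEnabled, nil
}

func main() {
	args := []string{"--enable-prometheus-cr-watcher"}

	// Before the fix: the flag is passed but was never registered.
	_, err := parseWatcherFlag(args, false, false)
	fmt.Println("unregistered flag rejected:", err != nil)

	// After the fix: the flag is registered and ORed with the YAML setting.
	enabled, err := parseWatcherFlag(args, true, false)
	fmt.Println("registered flag enables watcher:", enabled, "error:", err)
}
```

ORing the two sources means a deployment that enables the watcher only in YAML keeps working, and the operator-supplied flag can turn the watcher on but never turn it off.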

Add a regression test asserting that loading a Target Allocator config whose static scrape job omits scrape_protocols still yields a non-empty ScrapeProtocols on every loaded scrape config. This is defaulted by the pinned Prometheus library during yaml.UnmarshalStrict into the prometheus Config type, so the distributed /scrape_configs payload is never empty and the agent's prometheus-receiver validation passes. The test fails fast if a future dependency or load-path change drops this defaulting.

The pod-template restart-trigger sha256 was computed from Spec.Config only, so a change to Spec.Prometheus (rendered into a separate ConfigMap) left the pod template byte-identical and the workload controller did not roll the pods. Fold the serialized Spec.Prometheus (PrometheusConfig.Yaml()) into the hash input when it is non-empty, so a Prometheus-only change bumps the pod-template annotation and triggers a rolling restart, matching agent-config behavior. When no Prometheus config is set the hash input is byte-identical to the agent config alone, leaving non-Prometheus agents unaffected.
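The hashing behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's configHashInput(): the function configHash and its string parameters are hypothetical, an empty-string check stands in for Prometheus.IsEmpty(), and plain concatenation stands in for however the real code joins the two inputs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// configHash returns a hex-encoded sha256 over the agent config, with the
// serialized Prometheus config appended only when it is non-empty.
// Both parameter names are hypothetical stand-ins for Spec.Config and the
// output of PrometheusConfig.Yaml().
func configHash(agentConfig, prometheusYAML string) string {
	input := agentConfig
	if prometheusYAML != "" {
		// Only fold Prometheus in when set, so agents without a Prometheus
		// config keep exactly the same hash as before the change.
		input += prometheusYAML
	}
	sum := sha256.Sum256([]byte(input))
	return fmt.Sprintf("%x", sum)
}

func main() {
	agent := `{"logs":{}}`

	base := configHash(agent, "")
	withProm := configHash(agent, "scrape_interval: 30s\n")
	changed := configHash(agent, "scrape_interval: 60s\n")

	fmt.Println("no Prometheus config is stable:", base == configHash(agent, ""))
	fmt.Println("adding Prometheus config changes hash:", base != withProm)
	fmt.Println("editing Prometheus config changes hash:", withProm != changed)
}
```

Because the hash is written into a pod-template annotation, any change to the hash input changes the pod template, which is what prompts the workload controller to roll the pods.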

musa-asad mentioned this pull request Jun 16, 2026

fix(collector): roll pods when Prometheus config changes #387

Closed

musa-asad changed the title ~~Register enable-prometheus-cr-watcher flag and fix PrometheusCR watcher startup~~ Fix Target Allocator startup crashes, PrometheusCR watcher, and Prometheus config pod restart Jun 16, 2026

musa-asad self-assigned this Jun 16, 2026

musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 2 times, most recently from b936969 to d198928, June 16, 2026 18:28

musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 2 times, most recently from 36d23fa to 0318a47, June 16, 2026 18:32

musa-asad requested review from okankoAMZ and sky333999 June 16, 2026 18:37

This was referenced Jun 16, 2026

Do not warn on target allocator config load for non-TA Prometheus configs aws/amazon-cloudwatch-agent#2157

Open

Add default Prometheus EMF metric_declaration when target allocator is enabled aws-observability/helm-charts#319

Closed

musa-asad marked this pull request as ready for review June 17, 2026 14:53

musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch 4 times, most recently from ac8b92d to b27505e, June 17, 2026 21:09

musa-asad force-pushed the ta-register-prometheus-cr-watcher-flag branch from b27505e to d61d693, June 18, 2026 21:41

musa-asad wants to merge 3 commits into aws:main from musa-asad:ta-register-prometheus-cr-watcher-flag
