Fix Target Allocator startup crashes, PrometheusCR watcher, and Prometheus config pod restart#386
Open
musa-asad wants to merge 3 commits into
Open
Conversation
…er startup The target-allocator declared the enable-prometheus-cr-watcher flag name as a constant but never registered it on the flag set, while the operator passes --enable-prometheus-cr-watcher whenever PrometheusCR.enabled is true. Because args are parsed with pflag.ExitOnError, the unregistered flag caused the binary to print 'unknown flag' and exit(2), putting the target-allocator pod into CrashLoopBackOff. This change registers the flag and ORs it with the YAML prometheus_cr.enabled setting, then fixes three latent defects that were previously unreachable because the binary crashed first: - promOperator: set a non-empty Namespace on the synthetic Prometheus object so the prometheus-operator config generator no longer panics with 'namespace can't be empty' in store.ForNamespace. - promOperator: set EvaluationInterval so the generated config does not render an empty global.evaluation_interval, which the prometheus config parser rejects with 'empty duration string'. - main: create and register service-discovery metrics and pass them to discovery.NewManager; passing a nil sdMetrics map makes every SD provider fail to register, yielding zero discovered targets. RELEASE_NOTES updated.
b936969 to
d198928
Compare
Add a regression test asserting that loading a Target Allocator config whose static scrape job omits scrape_protocols still yields a non-empty ScrapeProtocols on every loaded scrape config. This is defaulted by the pinned Prometheus library during yaml.UnmarshalStrict into the prometheus Config type, so the distributed /scrape_configs payload is never empty and the agent's prometheus-receiver validation passes. The test fails fast if a future dependency or load-path change drops this defaulting.
36d23fa to
0318a47
Compare
ac8b92d to
b27505e
Compare
The pod-template restart-trigger sha256 was computed from Spec.Config only, so a change to Spec.Prometheus (rendered into a separate ConfigMap) left the pod template byte-identical and the workload controller did not roll the pods. Fold the serialized Spec.Prometheus (PrometheusConfig.Yaml()) into the hash input when it is non-empty, so a Prometheus-only change bumps the pod-template annotation and triggers a rolling restart, matching agent-config behavior. When no Prometheus config is set the hash input is byte-identical to the agent config alone, leaving non-Prometheus agents unaffected.
b27505e to
d61d693
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Multiple defects prevent the Target Allocator (TA) from working correctly in both Helm and EKS Add-On deployments:
--enable-prometheus-cr-watcherflag the operator passes, sopflag.ExitOnErrorcallsos.Exit(2)immediately.evaluation_intervalparse failure, nil SD metrics → 0 targets discovered.spec.configchanges triggered restarts —spec.prometheuschanges were ignored.scrape_protocolsregression risk: No test guarded that the Prometheus dependency defaultsscrape_protocolson config load.Changes
TA binary (commit 1):
config/flags.go— registerenable-prometheus-cr-watcherBool flag + getterconfig/config.go— OR the flag intoConfig.PrometheusCR.Enabledwatcher/promOperator.go— setObjectMeta.NamespacefromOTELCOL_NAMESPACEenv; setEvaluationIntervaltoScrapeIntervalmain.go— initializesdMetricswithdiscovery.CreateAndRegisterSDMetrics()Pod restart (commit 3):
internal/manifests/collector/annotations.go—configHashInput()appends serializedSpec.Prometheusto the hash, gated byPrometheus.IsEmpty()Regression test (commit 2):
config/scrape_protocols_regression_test.go—TestScrapeProtocolsDefaultedOnLoadTest Output
A/B Comparison — Upstream OTel vs CWA Helm (fixed) vs CWA Add-On (fixed)
All 3 environments deployed on the same EKS cluster, same OTEL version, same scrape targets. The fixed images were deployed on both the Helm chart path AND the EKS managed Add-On path (add-on installed via
aws eks create-addon, then TA/operator deployments patched to the fixed images).Conclusion: Both fixed paths match upstream on every check. The
--enable-prometheus-cr-watcherflag is accepted (no crash), prometheusCR jobs are discovered,scrape_protocolsis present, targets are sharded across collectors, and prometheus config changes trigger pod restarts.Raw output — Helm path (StatefulSet, 2 replicas, TA + prometheusCR enabled)
Raw output — EKS Add-On path (add-on installed, images patched)
Raw output — Upstream OTel (control)
Unit tests
Related
Commits
1376451— Register enable-prometheus-cr-watcher flag and fix PrometheusCR watcher startup0405f4d— test(target-allocator): guard scrape_protocols defaulting on config load0318a47— fix(collector): roll pods when Prometheus config changes