[aws_rds_otel] Add ML anomaly detection module #19920

Draft
JM-elastic wants to merge 1 commit into elastic:main from JM-elastic:add-ml-aws_rds_otel

Conversation

@JM-elastic

What

Adds a machine-learning anomaly-detection module (kibana/ml_module/) to the aws_rds_otel integration, proposing anomaly detection as an addition alongside the integration's existing dashboards, alert rules, and SLO templates. Modeled on the kubernetes_otel ML module (#19030).

Why — complements the threshold alerts, doesn't duplicate them

The shipped alert rules catch per-entity threshold breaches (a value crossing a fixed line). These ML jobs model each metric per entity against its own history, catching the drift those miss — e.g. a connection count climbing toward the pool ceiling, or latency and freeable memory creeping past baseline but below the fixed floor. Each detector's description defers per-entity spikes to the alert rules, the same split kubernetes_otel uses. The detectors are drawn from the service's own signals and real failure modes — not tailored to any specific workflow.
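To make the split concrete, here is a toy illustration of a drift that stays under a fixed threshold but is far outside an entity's own history. It is only a sketch: the z-score check stands in for the ML job's modelling (which is considerably more sophisticated), and the connection counts, the limit of 90, and the `z_limit` value are all made-up assumptions.

```python
from statistics import mean, stdev


def threshold_breach(values, limit):
    """Fixed-threshold rule: fire when any value exceeds the limit."""
    return any(v > limit for v in values)


def baseline_deviation(history, recent, z_limit=3.0):
    """Toy per-entity baseline: compare the recent mean to the entity's own history.

    A simple z-score stand-in, not the model the anomaly detection job uses.
    """
    mu, sigma = mean(history), stdev(history)
    z = (mean(recent) - mu) / sigma
    return abs(z) > z_limit, z


# Hypothetical connection counts for one DB instance.
history = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22]
recent = [55, 60, 66]  # climbing, but still below the assumed limit of 90

alert_fired = threshold_breach(recent, limit=90)
drift_flagged, z_score = baseline_deviation(history, recent)
print(f"threshold alert: {alert_fired}, baseline drift: {drift_flagged}")
```

In this made-up case the threshold rule stays silent while the baseline check flags the climb, which is the gap the jobs are meant to cover.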

Jobs

  • aws_rds_instance_resource_anomaly: high_mean/low_mean per DBInstanceIdentifier (partition cloud.region) on DatabaseConnections, CPUUtilization, ReadLatency, WriteLatency, FreeableMemory (low), DiskQueueDepth.
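For reviewers who want to picture the detector list, a rough sketch of how such a job config could be assembled is below. It is not the file in this PR: it assumes "per DBInstanceIdentifier" maps to `by_field_name`, that only FreeableMemory uses `low_mean`, and that `bucket_span` is `15m`; the descriptions and influencers are illustrative.

```python
import json

# Metric names come from the PR description; the high/low assignment
# (only FreeableMemory on low_mean) is an assumption.
DETECTORS = [
    ("DatabaseConnections", "high_mean"),
    ("CPUUtilization", "high_mean"),
    ("ReadLatency", "high_mean"),
    ("WriteLatency", "high_mean"),
    ("FreeableMemory", "low_mean"),
    ("DiskQueueDepth", "high_mean"),
]


def build_job_config(bucket_span="15m"):
    """Build an anomaly detection job body with one detector per metric."""
    detectors = [
        {
            "detector_description": f"{function}({metric}) by DBInstanceIdentifier",
            "function": function,
            "field_name": metric,
            # Assumption: "per DBInstanceIdentifier" is expressed as by_field_name.
            "by_field_name": "DBInstanceIdentifier",
            "partition_field_name": "cloud.region",
        }
        for metric, function in DETECTORS
    ]
    return {
        "analysis_config": {
            "bucket_span": bucket_span,  # assumed value, open to tuning
            # Needed when the datafeed sends pre-aggregated buckets.
            "summary_count_field_name": "doc_count",
            "detectors": detectors,
            "influencers": ["DBInstanceIdentifier", "cloud.region"],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job = build_job_config()
print(json.dumps(job["analysis_config"]["detectors"][0], indent=2))
```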

Datafeeds are composite-aggregated — required, because these metrics-aws.*.otel-* indices contain aggregate_metric_double fields that a plain (non-aggregating) ML datafeed cannot read.
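As a sketch of what a composite-aggregated datafeed looks like, the helper below builds the aggregation body: a `date_histogram` source on the time field, `terms` sources for the entity fields, a `max` on `@timestamp`, and one `avg` per metric. The field paths passed in (`metric_fields`), the interval, and the page size are assumptions, not the values in this PR.

```python
def build_datafeed_aggs(metric_fields, bucket_span="15m", page_size=1000):
    """Build a composite aggregation body for an ML datafeed.

    metric_fields maps the aggregation name used by the detector's
    field_name to the indexed field path (hypothetical here).
    """
    sources = [
        # A date_histogram source on the time field is required for ML datafeeds.
        {"time_bucket": {"date_histogram": {"field": "@timestamp",
                                            "fixed_interval": bucket_span}}},
        {"cloud.region": {"terms": {"field": "cloud.region"}}},
        {"DBInstanceIdentifier": {"terms": {"field": "DBInstanceIdentifier"}}},
    ]
    # The max on the time field gives each bucket its timestamp.
    sub_aggs = {"@timestamp": {"max": {"field": "@timestamp"}}}
    for agg_name, field_path in metric_fields.items():
        # avg is one of the aggregations aggregate_metric_double fields support.
        sub_aggs[agg_name] = {"avg": {"field": field_path}}
    return {
        "buckets": {
            "composite": {"size": page_size, "sources": sources},
            "aggregations": sub_aggs,
        }
    }


# Hypothetical mapping; the real field paths depend on the indexed documents.
aggs = build_datafeed_aggs({"CPUUtilization": "CPUUtilization",
                            "FreeableMemory": "FreeableMemory"})
datafeed = {"indices": ["metrics-aws.*.otel-*"], "aggregations": aggs}
```

The aggregation names have to match the detectors' `field_name` values, since the job only sees the aggregated output.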

Validation

Drafted and validated against live AWS OTel telemetry: the job(s) establish baselines over historical data. The RDS connection-pool-exhaustion case was scored against a known injected incident and detected it on the correct entity (recall/precision/f1 = 1.0).

Methodology, tooling, and the scoring harness: https://github.com/elastic/aws_otel_ml_draft
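For readers unfamiliar with entity-level scoring, the recall/precision/f1 figures can be computed along these lines. This is a minimal sketch and not the harness in the linked repository; the function name and the instance ID are hypothetical.

```python
def score_detection(flagged, injected):
    """Entity-level precision, recall, and F1.

    flagged:  entities the job reported as anomalous in the incident window
    injected: entities where an incident was deliberately injected
    """
    flagged, injected = set(flagged), set(injected)
    true_positives = len(flagged & injected)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# One injected incident, detected on exactly that entity.
precision, recall, f1 = score_detection({"db-example-1"}, {"db-example-1"})
print(precision, recall, f1)  # 1.0 1.0 1.0
```

With a single injected incident, any extra flagged entity would halve precision, so a score of 1.0 only says the one incident was found without false positives in that window.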

Notes for reviewers (@elastic/obs-infraobs-integrations)

  • Draft — proposing for your review; happy to adjust job naming, detector selection, bucket_span, or thresholds.
  • Package stays subscription: basic (matches kubernetes_otel; ML availability is a deployment concern, not a package condition).
  • Entity fields use the raw CloudWatch dimensions present in the indexed documents (DBInstanceIdentifier), not the normalized fields the alert-rule termFields reference (those are not present in the documents).
  • Open tuning item: the low-variance detectors (DiskQueueDepth, ReadLatency) can over-fire on idle instances — a candidate for bucket_span / multi-bucket tuning.

Add ML anomaly detection module for RDS instance resource metrics (connections, CPU, latency, memory, disk queue).
JM-elastic force-pushed the add-ml-aws_rds_otel branch from 93c56de to 940f13c on July 1, 2026 21:51
@github-actions

github-actions Bot commented Jul 1, 2026

✅ Elastic Docs Style Checker (Vale)

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.

@elastic-vault-github-plugin-prod

✅ All changelog entries have the correct PR link.

@infra-vault-gh-plugin-prod

💚 Build Succeeded

@andrewkroh andrewkroh added the Integration:aws_rds_otel AWS RDS Metrics OpenTelemetry Assets label Jul 2, 2026