
[aws_sqs_otel] Add ML anomaly detection module#19924

Draft
JM-elastic wants to merge 1 commit into
elastic:mainfrom
JM-elastic:add-ml-aws_sqs_otel


@JM-elastic


What

Adds a machine-learning anomaly-detection module (kibana/ml_module/) to the aws_sqs_otel integration, proposing anomaly detection as an addition alongside the integration's existing dashboards, alert rules, and SLO templates. Modeled on the kubernetes_otel ML module (#19030).

Why — complements the threshold alerts, doesn't duplicate them

The shipped alert rules catch per-entity threshold breaches (a value crossing a fixed line). These ML jobs model each metric per entity against its own history, catching the drift that the alert rules miss: for example, a queue whose backlog grows abnormally relative to its own baseline (a stalling or under-scaled consumer) before it crosses the fixed depth/age threshold. Each detector's description defers per-entity spikes to the alert rules, the same split kubernetes_otel uses. The detectors are drawn from the service's own signals and real failure modes and are not tailored to any specific workflow.
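To make the split concrete, here is a toy sketch of a fixed threshold next to a per-entity baseline check. It is not how the Elastic ML job models data internally; the z-score test, the sample history, and the limit of 1000 are illustrative assumptions only.

```python
from statistics import mean, stdev

def threshold_breach(value, limit):
    """Alert-rule style check: fire only when the value crosses a fixed line."""
    return value > limit

def baseline_deviation(history, value, z_limit=3.0):
    """Toy per-entity check: fire when the value is far above this entity's own history.

    A plain z-score stands in for the real model here (an assumption for illustration).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return (value - mu) / sigma > z_limit

# Hypothetical queue that normally holds about 10 visible messages.
quiet_queue_history = [10, 12, 9, 11, 10, 13, 11, 10]
current_depth = 60
FIXED_DEPTH_LIMIT = 1000  # hypothetical alert-rule threshold

print(threshold_breach(current_depth, FIXED_DEPTH_LIMIT))      # fixed line not crossed
print(baseline_deviation(quiet_queue_history, current_depth))  # abnormal for this queue
```

In this sketch a depth of 60 is far below the fixed limit but well outside the queue's usual range, which is the kind of drift the ML jobs are meant to surface.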

Jobs

  • aws_sqs_queue_backlog_anomaly — per QueueName (partition cloud.region): high_mean ApproximateNumberOfMessagesVisible, ApproximateNumberOfMessagesNotVisible, ApproximateAgeOfOldestMessage.
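A minimal sketch of what this job's configuration could look like, written as a Python dict that mirrors the anomaly detection job JSON. The job id, detector function, and entity fields come from the description above; mapping "per QueueName" to `by_field_name`, the `15m` bucket span, the description text, the influencers, and the use of the bare CloudWatch metric names as `field_name` values are assumptions, and the files in kibana/ml_module/ may differ.

```python
import json

JOB_ID = "aws_sqs_queue_backlog_anomaly"

# Metric names as listed in the PR description; the exact indexed field paths are assumed.
METRICS = [
    "ApproximateNumberOfMessagesVisible",
    "ApproximateNumberOfMessagesNotVisible",
    "ApproximateAgeOfOldestMessage",
]

def build_job_config(bucket_span="15m"):
    """Build one high_mean detector per metric, split by queue and partitioned by region."""
    detectors = [
        {
            "function": "high_mean",
            "field_name": metric,
            "by_field_name": "QueueName",            # assumed mapping of "per QueueName"
            "partition_field_name": "cloud.region",
        }
        for metric in METRICS
    ]
    return {
        "description": "SQS queue backlog anomalies (illustrative text)",
        "analysis_config": {
            "bucket_span": bucket_span,               # assumed value
            # Needed when the datafeed sends pre-aggregated buckets.
            "summary_count_field_name": "doc_count",
            "detectors": detectors,
            "influencers": ["QueueName", "cloud.region"],
        },
        "data_description": {"time_field": "@timestamp"},
    }

if __name__ == "__main__":
    print(json.dumps({JOB_ID: build_job_config()}, indent=2))
```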

Datafeeds are composite-aggregated — required, because these metrics-aws.*.otel-* indices contain aggregate_metric_double fields that a plain (non-aggregating) ML datafeed cannot read.
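The shape of a composite-aggregated datafeed can be sketched as follows, again as a Python dict mirroring the datafeed JSON. The index pattern and entity fields are taken from this description; the `15m` interval, the page size, the `avg` sub-aggregations, and the bare metric field names are assumptions rather than the contents of the shipped module.

```python
import json

# Metric names from the PR description; the exact indexed field paths are assumed.
METRICS = [
    "ApproximateNumberOfMessagesVisible",
    "ApproximateNumberOfMessagesNotVisible",
    "ApproximateAgeOfOldestMessage",
]

def build_datafeed_config(bucket_span="15m", page_size=1000):
    """Composite aggregation: one date_histogram source plus a terms source per entity field."""
    # Each sub-aggregation is named after the field the detector reads.
    metric_aggs = {m: {"avg": {"field": m}} for m in METRICS}
    # ML datafeeds expect a max aggregation on the time field inside each bucket.
    metric_aggs["@timestamp"] = {"max": {"field": "@timestamp"}}
    return {
        "job_id": "aws_sqs_queue_backlog_anomaly",
        "indices": ["metrics-aws.*.otel-*"],
        "aggregations": {
            "buckets": {
                "composite": {
                    "size": page_size,
                    "sources": [
                        {"time_bucket": {"date_histogram": {
                            "field": "@timestamp",
                            "fixed_interval": bucket_span,
                        }}},
                        {"cloud.region": {"terms": {"field": "cloud.region"}}},
                        {"QueueName": {"terms": {"field": "QueueName"}}},
                    ],
                },
                "aggregations": metric_aggs,
            }
        },
    }

if __name__ == "__main__":
    print(json.dumps(build_datafeed_config(), indent=2))
```

Because the metric values are computed by the `avg` sub-aggregations rather than read from raw documents, the datafeed never has to fetch the aggregate_metric_double fields directly.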

Validation

Drafted and validated against live AWS OTel telemetry: the job(s) establish baselines over historical data. The RDS connection-pool-exhaustion case was scored against a known injected incident and detected it on the correct entity (recall/precision/f1 = 1.0).
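For reference, the reported scores follow from the standard definitions: a single injected incident detected on the correct entity, with no other entity flagged, gives one true positive and no false positives or false negatives. The sketch below is not the scoring harness from the linked repository; the function and the entity names are hypothetical.

```python
def score(detected, injected):
    """Return (precision, recall, f1) for sets of flagged vs. truly affected entities."""
    tp = len(detected & injected)   # flagged and truly affected
    fp = len(detected - injected)   # flagged but not affected
    fn = len(injected - detected)   # affected but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical entity names: one injected incident, flagged on the right entity only.
print(score({"entity-a"}, {"entity-a"}))
# A spurious extra flag would lower precision while recall stays at 1.0.
print(score({"entity-a", "entity-b"}, {"entity-a"}))
```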

Methodology, tooling, and the scoring harness: https://github.com/elastic/aws_otel_ml_draft

Notes for reviewers (@elastic/obs-infraobs-integrations)

  • Draft — proposing for your review; happy to adjust job naming, detector selection, bucket_span, or thresholds.
  • Package stays subscription: basic (matches kubernetes_otel; ML availability is a deployment concern, not a package condition).
  • Entity fields use the raw CloudWatch dimensions present in the indexed documents (QueueName), not the normalized fields the alert-rule termFields reference (those are not present in the documents).

Add ML anomaly detection module for SQS queue backlog, in-flight, and oldest-message age.
@JM-elastic JM-elastic force-pushed the add-ml-aws_sqs_otel branch from 3d5dd35 to 73a0cde Compare July 1, 2026 21:51
@github-actions

github-actions Bot commented Jul 1, 2026


✅ Elastic Docs Style Checker (Vale)

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.

@elastic-vault-github-plugin-prod


✅ All changelog entries have the correct PR link.

@infra-vault-gh-plugin-prod


💚 Build Succeeded

@andrewkroh andrewkroh added the Integration:aws_sqs_otel AWS SQS Metrics OpenTelemetry Assets label Jul 2, 2026

Labels

Integration:aws_sqs_otel AWS SQS Metrics OpenTelemetry Assets


2 participants