MON-4059: Update TelemeterClientFailures alert #550

Open
slashpai wants to merge 1 commit into openshift:main from slashpai:MON-4059

@slashpai slashpai commented Nov 19, 2024

Member

Use the new metricsclient_http_requests_total metric, which distinguishes 4xx errors (e.g. a bad pull secret) from 5xx errors (an issue on the Red Hat side).

Ref: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/TelemeterClientFailures.md#diagnosis

Use the new `metricsclient_http_requests_total` metric
which would tell the difference between 4xx errors
(e.g. bad pull secret) and
5xx (issue on Red Hat side).

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Nov 19, 2024
@openshift-ci-robot

openshift-ci-robot commented Nov 19, 2024

Contributor

@slashpai: This pull request references MON-4059 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

@openshift-ci openshift-ci Bot added the approved label Nov 19, 2024

@simonpasquier simonpasquier left a comment

Contributor

I'd recommend creating two alerting rules: one for the federate_from client and another for the federate_to client. They can share the same name but have different descriptions.
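A minimal sketch of how the two-rule layout recommended here might look in a plain Prometheus rule file. The group name, the `for` duration, the severity, and the description texts are placeholders. The `client` label values `federate_from` and `federate_to` are assumed from the client names mentioned in this comment, and the expression follows the failure-ratio form discussed in this review. None of it is taken from the PR's actual diff.

```yaml
groups:
  - name: telemeter-client.rules          # hypothetical group name
    rules:
      # Rule 1: failures of the client assumed to be labelled client="federate_from".
      - alert: TelemeterClientFailures
        expr: |
          sum by(client, status_code, namespace) (
            rate(metricsclient_http_requests_total{job="telemeter-client", client="federate_from", status_code!~"2.."}[15m])
          )
          /
          on(client, namespace) group_left()
          sum by(client, namespace) (
            rate(metricsclient_http_requests_total{job="telemeter-client", client="federate_from"}[15m])
          ) > 0.2
        for: 1h                           # placeholder duration
        labels:
          severity: warning               # placeholder severity
        annotations:
          description: >-
            Placeholder text: the {{ $labels.client }} client is receiving
            HTTP {{ $labels.status_code }} responses for more than 20% of its requests.

      # Rule 2: same alert name, different client and description.
      - alert: TelemeterClientFailures
        expr: |
          sum by(client, status_code, namespace) (
            rate(metricsclient_http_requests_total{job="telemeter-client", client="federate_to", status_code!~"2.."}[15m])
          )
          /
          on(client, namespace) group_left()
          sum by(client, namespace) (
            rate(metricsclient_http_requests_total{job="telemeter-client", client="federate_to"}[15m])
          ) > 0.2
        for: 1h                           # placeholder duration
        labels:
          severity: warning               # placeholder severity
        annotations:
          description: >-
            Placeholder text: the {{ $labels.client }} client is receiving
            HTTP {{ $labels.status_code }} responses for more than 20% of its requests.
            A 4xx code may point to a bad pull secret, a 5xx code to an issue on the Red Hat side.
```

Keeping one alert name with two descriptions lets existing routing and silences keep matching on the name, while each description can point at the likely cause for that client.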

- sum by (namespace) (
-   rate(federate_requests_total{job="telemeter-client"}[15m])
- ) > 0.2
+ sum by(client, status_code) (rate(metricsclient_http_requests_total{status_code!~"200"}[15m])) > 0

Contributor

(suggestion) We need to preserve the namespace label. Let's also alert on the ratio of failed requests to total requests rather than on the raw failure rate.

Suggested change
- sum by(client, status_code) (rate(metricsclient_http_requests_total{status_code!~"200"}[15m])) > 0
+ sum by(client, status_code, namespace) (rate(metricsclient_http_requests_total{status_code!~"2..",job="telemeter-client"}[15m]))
+ /
+ on(client, namespace) group_left() sum by(client, namespace) (rate(metricsclient_http_requests_total{job="telemeter-client"}[15m])) > 0.2
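One way to sanity-check the suggested ratio is a `promtool test rules` file that evaluates the raw expression against synthetic series. The sketch below is illustrative only: the file name, the `openshift-monitoring` namespace value, the `federate_to` client value, and the sample data are all invented for the example. Both series grow at the same pace, so half of all requests fail and the expression should return 0.5 for the 403 series.

```yaml
# Run with: promtool test rules ratio_test.yaml   (hypothetical file name)
rule_files: []            # no rule file needed; only the raw expression is evaluated
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Successful requests: the counter grows by 6 every minute.
      - series: 'metricsclient_http_requests_total{job="telemeter-client", namespace="openshift-monitoring", client="federate_to", status_code="200"}'
        values: '0+6x60'
      # Failed requests (403): same growth, so 50% of all requests fail.
      - series: 'metricsclient_http_requests_total{job="telemeter-client", namespace="openshift-monitoring", client="federate_to", status_code="403"}'
        values: '0+6x60'
    promql_expr_test:
      - expr: |
          sum by(client, status_code, namespace) (
            rate(metricsclient_http_requests_total{status_code!~"2..", job="telemeter-client"}[15m])
          )
          /
          on(client, namespace) group_left()
          sum by(client, namespace) (
            rate(metricsclient_http_requests_total{job="telemeter-client"}[15m])
          ) > 0.2
        eval_time: 30m
        exp_samples:
          # group_left() keeps the labels of the left-hand (per status code) side.
          - labels: '{client="federate_to", namespace="openshift-monitoring", status_code="403"}'
            value: 0.5
```

The left-hand side carries one series per status code while the right-hand side has one series per client and namespace, which is why the many-to-one `group_left()` match on `client` and `namespace` is needed.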

@openshift-bot

Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale label Apr 22, 2025
@openshift-bot

Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/rotten label and removed the lifecycle/stale label May 31, 2025
@openshift-bot

Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci Bot closed this Jun 30, 2025
@openshift-ci

openshift-ci Bot commented Jun 30, 2025

Contributor

@openshift-bot: Closed this PR.

@simonpasquier

Contributor

/reopen

@openshift-ci openshift-ci Bot reopened this Jun 30, 2025
@openshift-ci

openshift-ci Bot commented Jun 30, 2025

Contributor

@simonpasquier: Reopened this PR.

@openshift-ci-robot

openshift-ci-robot commented Jun 30, 2025

Contributor

@slashpai: This pull request references MON-4059 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

@simonpasquier

Contributor

remove-lifecycle rotten

@simonpasquier

Contributor

/remove-lifecycle rotten

@openshift-ci openshift-ci Bot removed the lifecycle/rotten label Jun 30, 2025
@openshift-ci

openshift-ci Bot commented Jun 30, 2025

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slashpai

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale label Sep 29, 2025
@openshift-bot

Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 29, 2025
@openshift-bot

Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci Bot closed this Nov 29, 2025
@openshift-ci

openshift-ci Bot commented Nov 29, 2025

Contributor

@openshift-bot: Closed this PR.

@simonpasquier

Contributor

/reopen
/remove-lifecycle rotten

@simonpasquier

Contributor

/lifecycle frozen

@openshift-ci

openshift-ci Bot commented Dec 1, 2025

Contributor

@simonpasquier: The lifecycle/frozen label cannot be applied to Pull Requests.

@openshift-ci openshift-ci Bot reopened this Dec 1, 2025
@openshift-ci

openshift-ci Bot commented Dec 1, 2025

Contributor

@simonpasquier: Reopened this PR.

@openshift-ci-robot

openshift-ci-robot commented Dec 1, 2025

Contributor

@slashpai: This pull request references MON-4059 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

@openshift-ci openshift-ci Bot removed the lifecycle/rotten label Dec 1, 2025
@openshift-bot

Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale label Mar 2, 2026
@juzhao

juzhao commented Mar 9, 2026

/remove-lifecycle stale

@openshift-ci openshift-ci Bot removed the lifecycle/stale label Mar 9, 2026
@juzhao

juzhao commented Mar 9, 2026

/test e2e-aws-ovn

@openshift-ci

openshift-ci Bot commented Mar 9, 2026

Contributor

@slashpai: all tests passed!

Labels

approved, jira/valid-reference

5 participants