Troubleshooting Alarms

Use troubleshooting information to identify and address common issues that can occur while working with alarms in Monitoring.

Before troubleshooting, ensure that you understand how alarms are evaluated. See Illustration of Alarm Evaluation.

Alarm Doesn't Fire

The alarm met the condition for firing, but it didn't fire. For example, a compute instance went down.

Cause: Long trigger delay

The alarm expression didn't stay true for every consecutive minute of the trigger delay period.

The following image of an alarm's metric chart includes a shaded area to indicate the trigger delay period. In this example, the alarm summary shown on the alarm details page is "Alarm fires when the Mean of CpuUtilization is greater than the threshold value of 80, with a trigger delay of 10 minutes." The trigger delay starts at 1:30 (when the threshold is exceeded) and ends at 1:40. The alarm expression evaluates to true at 1:30, then evaluates to false at 1:32. This true evaluation doesn't continue for the full ten-minute trigger delay period, so the alarm doesn't fire.


Trigger delay superimposed on an alarm metric chart.

To view the metric chart for an alarm, get its history.

For more information about how alarms are evaluated, see Illustration of Alarm Evaluation.
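The trigger delay behavior described above can be sketched in a few lines of Python. This is a simplified model, not the Monitoring service's implementation: it assumes one evaluation per minute and that the alarm fires only after the expression has been true for `trigger_delay` consecutive evaluations. The function name and the value assumed for 1:31 are illustrative.

```python
from typing import List


def firing_minutes(evaluations: List[bool], trigger_delay: int) -> List[bool]:
    """Return, for each minute, whether the alarm is firing.

    Assumption: the alarm fires once the expression has been true for
    `trigger_delay` consecutive one-minute evaluations and clears on the
    first false evaluation.
    """
    firing = []
    consecutive_true = 0
    for is_true in evaluations:
        consecutive_true = consecutive_true + 1 if is_true else 0
        firing.append(consecutive_true >= trigger_delay)
    return firing


# Minutes 1:30 through 1:40. True at 1:30 (and, as an assumption, at 1:31),
# then false from 1:32 onward, as in the example chart.
evaluations = [True, True] + [False] * 9

print(any(firing_minutes(evaluations, trigger_delay=10)))  # never fires
print(any(firing_minutes(evaluations, trigger_delay=1)))   # fires at 1:30
```

Under these assumptions, the ten-minute trigger delay never fires for this data, while a one-minute trigger delay fires as soon as the threshold is breached.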

Remedy: Shorten the trigger delay

If the trigger delay is too long, and you want the alarm to fire immediately after breaching the threshold, then update the alarm to use a shorter trigger delay. For example, set the trigger delay to one minute. See Defining the Trigger Delay for an Alarm and Monitoring Query Language (MQL) Reference.

Cause: Interval is shorter than the emission frequency

The alarm expression evaluated to true, causing the alarm to fire, but at the next interval the alarm cleared, even though the last data point exceeded the threshold. The alarm cleared because the interval is shorter than the frequency of emission for the selected metric.

The following image of an alarm's metric chart shows hourly data points for the selected metric, StoredBytes, from the oci_object_storage metric namespace. The alarm query is StoredBytes[1m].sum() > 800000000, which specifies a one-minute interval. This interval is shorter than the metric's emission frequency, which is one hour. (The frequency is documented at Object Storage Metrics.)


Alarm metric chart for a metric with an hourly emission frequency.

In this example, the alarm fires at 3:00 and clears at 3:01. If the interval had been set to one hour, then the alarm expression would continue evaluating to true, and the alarm would continue firing, until 4:00.

To view the metric chart for an alarm, get its history.

For more information about how alarms are evaluated, see Illustration of Alarm Evaluation.
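The following Python sketch illustrates why a one-minute interval clears so quickly when the metric is emitted hourly. It is a simplified model under stated assumptions: each evaluation sums the data points in the window that ends at the evaluation time, an empty window never meets the condition, and the data point values are hypothetical.

```python
from typing import Dict, Optional

THRESHOLD = 800_000_000


def window_sum(points: Dict[int, int], end_minute: int, interval: int) -> Optional[int]:
    """Sum data points in the window (end_minute - interval, end_minute].

    Returns None when the window contains no data points.
    The window boundaries are an assumption made for this sketch.
    """
    values = [v for t, v in points.items() if end_minute - interval < t <= end_minute]
    return sum(values) if values else None


def condition_met(points: Dict[int, int], end_minute: int, interval: int) -> bool:
    """True when the window sum exists and exceeds the threshold."""
    total = window_sum(points, end_minute, interval)
    return total is not None and total > THRESHOLD


# Hypothetical hourly data points, keyed by minutes after midnight
# (120 = 2:00, 180 = 3:00, 240 = 4:00).
points = {120: 700_000_000, 180: 900_000_000, 240: 750_000_000}

print(condition_met(points, 180, interval=1))   # 3:00 with [1m]
print(condition_met(points, 181, interval=1))   # 3:01 with [1m]: empty window
print(condition_met(points, 181, interval=60))  # 3:01 with [1h]: still sees 3:00
```

With the one-minute interval, the window at 3:01 is empty, so the condition is no longer met. With the one-hour interval, every window from 3:00 through 3:59 still contains the 3:00 data point.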

Remedy: Increase the interval

If you want the alarm to keep firing, then update the alarm interval to be the same as or longer than the metric's emission frequency. For example, for the StoredBytes metric, update the alarm interval to at least one hour if you want the alarm in the previous example to fire at 3:00 and continue firing until 4:00. See Selecting the Interval for an Alarm Query and Monitoring Query Language (MQL) Reference.
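As a rough pre-check, you can compare the interval in an alarm query against the metric's documented emission frequency. The following Python sketch is not part of any Oracle tooling: the helper names are hypothetical, and the regular expression assumes the interval appears in square brackets with an `m`, `h`, or `d` unit, as in the queries shown in this topic.

```python
import re

# Minutes per interval unit (assumed units: minutes, hours, days).
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}


def query_interval_minutes(query: str) -> int:
    """Extract the interval from a query such as 'StoredBytes[1m].sum()'."""
    match = re.search(r"\[(\d+)([mhd])\]", query)
    if match is None:
        raise ValueError(f"no interval found in query: {query!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MINUTES[unit]


def interval_covers_emission(query: str, emission_minutes: int) -> bool:
    """True when the query interval is at least the emission frequency."""
    return query_interval_minutes(query) >= emission_minutes


# StoredBytes is emitted hourly (60 minutes).
print(interval_covers_emission("StoredBytes[1m].sum() > 800000000", 60))
print(interval_covers_emission("StoredBytes[1h].sum() > 800000000", 60))
```

The first query fails the check because its one-minute interval is shorter than the hourly emission frequency. The second query passes.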

Cause: Wrong dimensions

The alarm expression didn't evaluate to true when a resource met the condition defined in the alarm because the resource was filtered out using dimensions.

For example, consider an alarm with dimensions selected for availability domain 1. The resource that met the condition is in availability domain 2. Alarm evaluation considers only resources that match the specified dimensions.

Remedy: Update dimensions

Either remove the dimensions, or update them to include the resource. See Selecting Dimensions for an Alarm Query.
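The effect of dimension filtering can be pictured with a small Python sketch. The stream records, dimension names, and values below are hypothetical and only illustrate that a resource outside the selected dimensions is never evaluated.

```python
from typing import Dict, List


def matching_streams(
    streams: List[Dict[str, str]], selected: Dict[str, str]
) -> List[Dict[str, str]]:
    """Keep only the streams whose dimensions match every selected key/value."""
    return [
        stream
        for stream in streams
        if all(stream.get(key) == value for key, value in selected.items())
    ]


# Hypothetical metric streams, one per compute instance.
streams = [
    {"resourceId": "instance-a", "availabilityDomain": "AD-1"},
    {"resourceId": "instance-b", "availabilityDomain": "AD-2"},
]

# With AD-1 selected, instance-b (in AD-2) is filtered out.
print(matching_streams(streams, {"availabilityDomain": "AD-1"}))

# With no dimensions selected, both instances are evaluated.
print(matching_streams(streams, {}))
```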

Cause: Wrong query

Common examples:

  • The alarm query might specify the metric MemoryUtilization when you meant to select CpuUtilization.
  • The alarm query might specify the statistic mean() when instead you want the alarm to monitor the sum of data points in an interval (sum()).

To check the query for an alarm, get its details.

For information about query elements, see Monitoring Query Language (MQL) Reference. For more information about how alarms are evaluated, see Illustration of Alarm Evaluation.
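To see how much the choice of statistic matters, consider this minimal Python sketch. The data points and threshold are hypothetical, and the helper only mimics the idea of applying a statistic to the data points in one interval.

```python
from statistics import mean
from typing import Callable, Sequence


def condition_met(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    threshold: float,
) -> bool:
    """Apply the statistic to one interval's data points and compare to the threshold."""
    return statistic(values) > threshold


# Hypothetical data points collected in a single interval.
data_points = [30, 30, 30]
threshold = 80

print(condition_met(data_points, mean, threshold))  # mean is 30
print(condition_met(data_points, sum, threshold))   # sum is 90
```

With the same data and threshold, the mean stays below the threshold while the sum exceeds it, so only the sum-based condition is met.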

Remedy: Update the query

Update the alarm to specify the metric and statistic that you want. To edit the MQL directly, see Editing the MQL Expression When Updating an Alarm.

Cause: The alarm is disabled

Remedy: Enable the alarm

Note: These steps are for the Console. For complete instructions, see Enabling an Alarm.

  1. Open the navigation menu and click Observability & Management. Under Monitoring, click Alarm Definitions.
  2. Click the name of the alarm that you want to update.
  3. On the alarm details page, select Alarm is enabled.

Alarm Doesn't Send a Notification

When the alarm fires, it doesn't send a notification.

Cause: The alarm or dimension is suppressed

Remedy: Remove the suppression

Cause: Subscription isn't part of the configured topic

For example, suppose that you aren't getting alarm messages in your inbox. The topic specified for the alarm might not have an email subscription for the email address that you want.

To check if the topic includes the expected subscription, see Getting a Topic's Details.

Remedy: Update topic to include subscription

See Creating a Subscription.

You could also update the alarm to reference a new topic and subscription, or an existing topic that includes the subscription that you want. See Selecting a Topic as Notification Destination for an Alarm.

Alarm Sends Too Many Notifications

When the alarm fires, it sends more notifications than expected.

Cause: Repeat notifications are enabled

The alarm is configured to repeat alarm notifications when the alarm keeps firing without interruption.

Remedy: Disable repeat notifications

Note: These steps are for the Console. For complete instructions, see Repeating Notifications for an Alarm.

  1. Open the navigation menu and click Observability & Management. Under Monitoring, click Alarm Definitions.
  2. Click the name of the alarm that you want to update.
  3. On the alarm details page, click Actions and then select Edit alarm.
  4. Under Define alarm notifications, clear the Repeat notification? check box.
  5. Click Save alarm.

Cause: Split notifications are enabled

The alarm is configured to send a notification for each metric stream that fires. For example, if 50 metric streams fire, then the alarm sends 50 notifications. This is expected behavior for split notifications. See Scenario: Split Messages by Metric Stream.

For example, the following image shows an alarm metric chart with two metric streams that exceed the threshold at 1:30, causing the alarm to fire.


Two metric streams fire at 1:30.

Following is the alarm message sent for the compute instance with the metric value of 87.

Email message sent for the first firing metric stream in the example.

Following is the alarm message sent for the compute instance with the metric value of 95.

Email message sent for the second firing metric stream in the example.

To view the metric chart for an alarm, get its history.

If you didn't intend for the alarm to send a notification for each firing metric stream, then consider updating the alarm to group notifications instead. See When to Group Notifications. After this update, the alarm sends a single notification when the alarm fires, regardless of the number of metric streams that are firing.
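The difference between split and grouped notifications can be sketched as follows. This Python example is illustrative only: the stream names, the threshold of 80, and the message text are assumptions, and the real message format differs.

```python
from typing import Dict, List


def build_notifications(
    stream_values: Dict[str, float], threshold: float, split: bool
) -> List[str]:
    """Return one message per firing stream (split) or one message in total (grouped)."""
    firing = {name: value for name, value in stream_values.items() if value > threshold}
    if not firing:
        return []
    if split:
        return [f"{name} is firing (value {value})" for name, value in firing.items()]
    return [f"{len(firing)} metric stream(s) firing: " + ", ".join(sorted(firing))]


# Two hypothetical compute instances with the metric values from the example.
stream_values = {"instance-a": 87, "instance-b": 95}

print(build_notifications(stream_values, threshold=80, split=True))   # two messages
print(build_notifications(stream_values, threshold=80, split=False))  # one message
```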

Alarm Doesn't Save (404 Error)

When trying to save a new or updated alarm, you see a 404 error preventing the creation or update of the alarm.

Cause: Insufficient policies

A 404 error indicates that you don't have the required IAM policies.

Remedy: Obtain required policies

See IAM Policies.

Alarm Fires and Clears Continually

Troubleshoot an alarm that keeps switching between Firing and OK status values.

Either the alarm interval is too short or the trigger delay is too long (or both). The resource emits the specified metric less often than the alarm evaluates it; that is, the alarm interval is shorter than the metric's emission frequency.

For example, consider the metric DatabaseAvailability, which is emitted every 5 minutes.

API request (relevant portions):

  "isNotificationsPerMetricDimensionEnabled":false,
  "namespace":"oci_autonomous_database",
  "query":"DatabaseAvailability[1m].absent()",
  "pendingDuration":"PT3M",

Console configuration:

Field              Value
Metric namespace   oci_autonomous_database
Metric name        DatabaseAvailability
Interval           1 minute
Statistic          Mean
Trigger rule       • Operator: absent
                   • Trigger delay minutes: 3
Message grouping   Group notifications across metric streams

Example: Alarm Switches Status

Following is an example of an alarm's status switching between Firing and OK status values from 1:00 to 1:08. Note the OK status at 1:01, 1:02, 1:06, and 1:07. At these times, the alarm evaluation results met the condition for the one-minute interval, but the status change was internally pending because of the three-minute trigger delay. The alarm status changed to Firing at 1:03 and 1:08 because three consecutive evaluations met the condition.

Time   Value in metric chart*   Alarm condition met?                       Alarm status
1:00   0                        No                                         OK
1:01   1                        Yes. Status change is internally pending   OK
1:02   1                        Yes. Status change is internally pending   OK
1:03   1                        Yes                                        Firing
1:04   1                        Yes                                        Firing
1:05   0                        No                                         OK
1:06   1                        Yes. Status change is internally pending   OK
1:07   1                        Yes. Status change is internally pending   OK
1:08   1                        Yes                                        Firing

*For value in metric chart, 0 means the metric is present while 1 means the metric is absent. For an example metric chart, see Creating an Absence Alarm.
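The status column in the preceding table can be reproduced with a short Python sketch. It assumes a simplified model in which the alarm changes to Firing only after the condition has been met for `trigger_delay` consecutive one-minute evaluations; the function name is illustrative.

```python
from typing import List, Tuple


def alarm_rows(values: List[int], trigger_delay: int) -> List[Tuple[bool, str]]:
    """Return (condition met, status) for each one-minute evaluation.

    A chart value of 1 means the metric is absent, so the condition is met.
    """
    rows = []
    consecutive = 0
    for value in values:
        met = value == 1
        consecutive = consecutive + 1 if met else 0
        status = "Firing" if consecutive >= trigger_delay else "OK"
        rows.append((met, status))
    return rows


# Chart values from 1:00 through 1:08 in the table.
chart_values = [0, 1, 1, 1, 1, 0, 1, 1, 1]

for minute, (met, status) in enumerate(alarm_rows(chart_values, trigger_delay=3)):
    pending = met and status == "OK"  # condition met, status change still pending
    print(f"1:{minute:02d}  met={met}  pending={pending}  status={status}")
```

Each time the 5-minute emission resets the count, the alarm returns to OK and needs three more evaluations before it fires again, which produces the repeated status changes.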

To remedy this situation, update the alarm configuration: increase the interval, shorten the trigger delay, or both.

For example, update the interval to 10 minutes and update the trigger delay to 1 minute.

API request (relevant portions):

  "isNotificationsPerMetricDimensionEnabled":false,
  "namespace":"oci_autonomous_database",
  "query":"DatabaseAvailability[10m].absent()",
  "pendingDuration":"PT1M",

Console configuration:

Field              Value
Metric namespace   oci_autonomous_database
Metric name        DatabaseAvailability
Interval           10 minutes
Statistic          Mean
Trigger rule       • Operator: absent
                   • Trigger delay minutes: 1
Message grouping   Group notifications across metric streams

Example: Metric is Present, Alarm is OK

In this example, the metric is present at the expected times (every five minutes): 2:00, 2:05, and 2:10. At each evaluation, the alarm checks whether the metric was present during the last ten minutes. The alarm's status remains OK for the listed times.

Time   Value in metric chart*   Alarm condition met?   Alarm status
2:00   0                        No                     OK
2:01   1                        No                     OK
2:02   1                        No                     OK
2:03   1                        No                     OK
2:04   1                        No                     OK
2:05   0                        No                     OK
2:06   1                        No                     OK
2:07   1                        No                     OK
2:08   1                        No                     OK
2:09   1                        No                     OK
2:10   0                        No                     OK
2:11   1                        No                     OK

*For value in metric chart, 0 means the metric is present while 1 means the metric is absent. For an example metric chart, see Creating an Absence Alarm.

Example: Metric is Absent, Alarm is Firing

In this example, the metric is present at 2:00, but absent at 2:05 and 2:10. Because the alarm interval is ten minutes, the alarm condition wasn't met at 2:05. At 2:10 the alarm changes to Firing status because the alarm condition is met (no data points were present during the ten-minute interval).

Time   Value in metric chart*   Alarm condition met?   Alarm status
2:00   0                        No                     OK
2:01   1                        No                     OK
2:02   1                        No                     OK
2:03   1                        No                     OK
2:04   1                        No                     OK
2:05   1                        No                     OK
2:06   1                        No                     OK
2:07   1                        No                     OK
2:08   1                        No                     OK
2:09   1                        No                     OK
2:10   1                        Yes                    Firing
2:11   1                        Yes                    Firing

*For value in metric chart, 0 means the metric is present while 1 means the metric is absent. For an example metric chart, see Creating an Absence Alarm.
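Both ten-minute examples can be reproduced with the following Python sketch. It is a simplified model, not the service's implementation: it assumes that each one-minute evaluation looks back over the window (minute - interval, minute], that the metric was also emitted at 1:55, and that a one-minute trigger delay means the alarm fires on the first evaluation that meets the condition. Minutes are counted relative to 2:00.

```python
from typing import Iterable, List, Tuple


def absence_statuses(
    present_minutes: Iterable[int], start: int, end: int, interval: int = 10
) -> List[Tuple[int, str]]:
    """Return (minute, status) for each evaluation from start to end, inclusive.

    The alarm fires when no data point falls inside the lookback window.
    """
    present = set(present_minutes)
    rows = []
    for minute in range(start, end + 1):
        # Assumed window: (minute - interval, minute]
        window = range(minute - interval + 1, minute + 1)
        absent = not any(m in present for m in window)
        rows.append((minute, "Firing" if absent else "OK"))
    return rows


# Metric present every five minutes (1:55, 2:00, 2:05, 2:10): status stays OK.
metric_present = absence_statuses([-5, 0, 5, 10], start=0, end=11)

# Metric present at 1:55 and 2:00 only: Firing from 2:10 onward.
metric_absent = absence_statuses([-5, 0], start=0, end=11)

for minute, status in metric_absent:
    print(f"2:{minute:02d}  {status}")
```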