Exadata Cloud Infrastructure Critical and Information Event Types
Exadata Cloud Infrastructure resources emit "critical" and "information" data plane events that allow you to receive notifications when your infrastructure resource needs attention.
About Event Types on Exadata Cloud Infrastructure
Learn about the event types available for Exadata Cloud Infrastructure resources.
Exadata Cloud Infrastructure resources emit events, which are structured messages that indicate changes in resources. For more information about Oracle Cloud Infrastructure Events, see Overview of Events. You can subscribe to events and be notified when they occur using the Oracle Notification service; see Notifications Overview.
The following prerequisites are required for events to flow out of the VM Cluster.
The Events service requires the following:
Events on the VM Cluster depend on the Oracle Trace File Analyzer (TFA) agent. Ensure that these components are up and running. AHF version 22.2.2 or higher is required for capturing events from the VM Cluster. To start, stop, or check the status of TFA, see Incident Logs and Trace Files. To enable AHF telemetry for the VM Cluster using the dbcli utility, see AHF Telemetry Commands.
The following network configurations are required.
Egress rules for outgoing traffic: The default egress rules are sufficient to enable the required network path. For more information, see Default Security List. If you have blocked outgoing traffic by modifying the default egress rules on your Virtual Cloud Network (VCN), you will need to revert the settings to allow outgoing traffic. The default egress rule allowing outgoing traffic (as shown in Security Rules for the Oracle Exadata Database Service on Dedicated Infrastructure) is as follows:
Stateless: No (all rules must be stateful)
Destination Type: CIDR
Destination CIDR: All <region> Services in Oracle
Services Network
IP Protocol: TCP
Destination Port: 443 (HTTPS)
Public IP or Service Gateway: The database server host must have
either a public IP address or a service gateway to be able to send database
server host metrics to the Monitoring service.
If the instance does not
have a public IP address, set up a service gateway on the virtual cloud
network (VCN). The service gateway lets the instance send database
server host metrics to the Monitoring service without the traffic going
over the internet. Here are special notes for setting up the service
gateway to access the Monitoring service:
When creating the service gateway, enable the service label called
All <region> Services in Oracle Services Network. It
includes the Monitoring service.
When setting up routing for the subnet that contains the
instance, set up a route rule with Target Type set to
Service Gateway, and the Destination Service
set to All <region> Services in Oracle Services
Network.
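As an illustrative sketch only (the OCIDs are placeholders, the service label shown is for the Phoenix region, and the parameter names should be verified against the OCI CLI reference), the route rule for the service gateway might be added with a command of this form:

oci network route-table update --rt-id ocid1.routetable.oc1.phx.<unique_ID> --route-rules '[{"destinationType": "SERVICE_CIDR_BLOCK", "destination": "all-phx-services-in-oracle-services-network", "networkEntityId": "ocid1.servicegateway.oc1.phx.<unique_ID>"}]'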
Oracle Exadata Database Service on Dedicated Infrastructure Event Types
The events in this section are emitted by the cloud Exadata infrastructure resource.
Note
Exadata systems that use the old DB system resource model are deprecated and will be desupported in a future release. The DB system events are not described here.
Oracle Exadata Database Service on Dedicated Infrastructure Maintenance Event Types
The events in this section are emitted by the cloud Exadata infrastructure resource for maintenance events.
Note
Exadata systems that use the old DB system resource model are deprecated and will be desupported in a future release. The DB system events are not described here.
This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid> in approximately NO_OF_DAYS_LEFT days on SCHEDULE_TIME. The maintenance includes: EA1, full DB server update for DB servers 1, 3, rolling, without custom action time; EA2, full server update for DB servers 2, 4, non-rolling, with 30 minutes custom action time between servers, selected per the scheduling plan.
You will get a notification on the start of the quarterly maintenance update.
This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid>. Your maintenance update started at <start-time>. You will get a notification on the completion of the quarterly maintenance update.
Success: This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid>. Your maintenance update started at <start-time> and was completed successfully at <end-time>. You have successfully completed maintenance updates for this window.
Failed: This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid>. Your maintenance update started at <start-time> and failed to complete as planned. Our operations team is evaluating the failure and will notify you of the next steps to complete the maintenance update for this quarter.
Canceled: This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid>. Your maintenance update started at <start-time>. Your maintenance has been canceled as requested. A new window will be created according to the time given.
Duration Exceeded: This is an Oracle Cloud Operations notification for quarterly maintenance update of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> part of Maintenance Run <mr-display-name>, OCID <mr-ocid>. Your maintenance update started at <start-time>. Your window was configured for a WINDOW_DURATION duration.
Your maintenance is taking longer than the configured window duration. This window has duration enforcement enabled. Oracle automation will reschedule all updates that have not started to a future maintenance window. Please acknowledge the updates rescheduled to run in a future unplanned maintenance window created by Oracle.
Cloud Exadata Infrastructure - Maintenance Custom action time Begin (ROLLING)
This is an Oracle Cloud Operations notification for custom action time configured for your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Server <db-server-name>, OCID <db-server-ocid>. Your custom action time started at <start-time>. You will get a notification on the completion of the custom action time for the Database Server.
Cloud Exadata Infrastructure - Maintenance Custom action time End (ROLLING)
This is an Oracle Cloud Operations notification for custom action time configured for your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Server <db-server-name>, OCID <db-server-ocid>. Your custom action time started at <start-time> ended at <end-time>.
Cloud Exadata Infrastructure - Maintenance Custom action time Begin (NONROLLING)
This is an Oracle Cloud Operations notification for custom action time configured for your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Servers <db-server-name>, OCID <dbserver-ocid> | <db-server-name>, OCID <dbserver-ocid>. Your custom action time started at <start-time>. You will get a notification on the completion of the custom action time for the Database Servers.
Cloud Exadata Infrastructure - Maintenance Custom action time End (NONROLLING)
This is an Oracle Cloud Operations notification for custom action time configured for your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Servers <db-server-name>, OCID <dbserver-ocid> | <db-server-name>, OCID <dbserver-ocid>. Your custom action time started at <start-time> ended at <end-time>.
Cloud Exadata Infrastructure - maintenance Storage servers Begin
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Storage Servers of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Storage Server(s) count <cell-count>. Your maintenance update started at <start-time>. You will get a notification on the completion of the quarterly maintenance update for the Storage Servers.
Cloud Exadata Infrastructure - maintenance Storage servers End
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Storage Servers of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Storage Server(s) count <cell-count>. Your maintenance update started at <start-time> and was completed successfully at <end-time>.
Cloud Exadata Infrastructure - maintenance Database Servers Begin (ROLLING)
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Database Server component of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Server <db-server-name>, OCID <db-server-ocid>. Your maintenance update started at <start-time>. You will get a notification on the completion of the quarterly maintenance update for the Database Server.
Cloud Exadata Infrastructure - maintenance Database Servers End (ROLLING)
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Database Server component of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Server <db-server-name>, OCID <db-server-ocid>. Your maintenance update started at <start-time> and was completed successfully at <end-time>.
Cloud Exadata Infrastructure - maintenance Database Servers Begin (NONROLLING)
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Database Server component of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Servers <db-server-name>, OCID <dbserver-ocid> | <db-server-name>, OCID <dbserver-ocid>. Your maintenance update started at <start-time>. You will get a notification on the completion of the quarterly maintenance update for the Database Servers.
Cloud Exadata Infrastructure - maintenance Database Servers End (NONROLLING)
This is an Oracle Cloud Operations notification for quarterly maintenance update of the Database Server component of your ExaDB-D Exadata Infrastructure <infra-name>, OCID <infra-ocid> for Database Servers <db-server-name>, OCID <dbserver-ocid> | <db-server-name>, OCID <dbserver-ocid>. Your maintenance update started at <start-time> and was completed successfully at <end-time>.
Exadata Cloud Infrastructure Critical and Information Event Types
Exadata Cloud Service infrastructure resources emit "critical" and "information" data plane events that allow you to receive notifications when your infrastructure resource needs urgent attention ("critical" events), or notifications for events that are not critical, but which you may want to monitor ("information" events). The eventType values for these events are the following:
com.oraclecloud.databaseservice.exadatainfrastructure.critical
com.oraclecloud.databaseservice.exadatainfrastructure.information
These events use the additionalDetails section of the event message to
provide specific details about what is happening within the infrastructure resource
emitting the event. In the additionalDetails section, the
eventName field provides the name of the critical or information
event. (Note that some fields in the example that follows have been omitted for
brevity.)
{
"eventType" : "com.oraclecloud.databaseservice.exadatainfrastructure.critical",
....
"data" : {
....
"additionalDetails" : {
....
"description" : "SQL statement terminated by Oracle Database Resource Manager due to excessive consumption of CPU and/or I/O.
The execution plan associated with the terminated SQL stmt is quarantined. Please find the sql identifier in
sqlId field of this JSON payload. This feature protects an Oracle database from performance degradation.
Please review the SQL statement. You can see the statement using the following commands: \"set serveroutput off\",
\"select sql_id, sql_text from v$sqltext where sql_id =<sqlId>\", \"set serveroutput on\"",
"component" : "storage",
"infrastructureType" : "exadata",
"eventName" : "HEALTH.INFRASTRUCTURE.CELL.SQL_QUARANTINE",
"quarantineMode" : "\"FULL Quarantine\""
....
}
},
"eventID" : "<unique_ID>",
....
}
}
In the tables below, you can read about the conditions and operations that
trigger "critical" and "information" events. Each condition or operation is identified
by a unique eventName value.
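As an illustrative sketch (the helper name is hypothetical), a Notifications subscriber could pull the identifying fields out of an event message like the example shown above:

```python
import json

def summarize_event(message):
    """Return the fields that identify a critical/information event."""
    event = json.loads(message)
    details = event["data"]["additionalDetails"]
    return {
        # The eventType suffix is "critical" or "information".
        "severity": event["eventType"].rsplit(".", 1)[-1],
        "eventName": details["eventName"],
        "description": details.get("description", ""),
    }

# Abbreviated event payload, modeled on the example above.
sample = json.dumps({
    "eventType": "com.oraclecloud.databaseservice.exadatainfrastructure.critical",
    "data": {"additionalDetails": {
        "eventName": "HEALTH.INFRASTRUCTURE.CELL.SQL_QUARANTINE",
        "description": "SQL statement terminated by Oracle Database Resource Manager.",
    }},
})
info = summarize_event(sample)
print(info["severity"], info["eventName"])
# critical HEALTH.INFRASTRUCTURE.CELL.SQL_QUARANTINE
```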
Critical events for Exadata Cloud Service infrastructure:
Critical Event -
EventName
Description
HEALTH.INFRASTRUCTURE.CELL.SQL_QUARANTINE
SQL statement terminated by Oracle Database Resource Manager due to
excessive consumption of CPU and/or I/O. The execution plan
associated with the terminated SQL stmt is quarantined. Please find
the sql identifier in sqlId field of this JSON payload. This feature
protects an Oracle database from performance degradation. Please
review the SQL statement. You can see the statement using the
following commands:
set serveroutput off
select sql_id, sql_text from v$sqltext where sql_id = <sqlId>
set serveroutput on
Informational events for Exadata Cloud Service infrastructure:
Information Event -
EventName
Description
HEALTH.INFRASTRUCTURE.CELL.FLASH_DISK_FAILURE
Flash Disk Failure has been detected. This is being investigated by
Oracle Exadata team and the disk will be replaced if needed. No action
needed from the customer.
In the following example of a "critical" event, you can see within the
additionalDetails section of the event message that this
particular message concerns an SQL statement that was terminated by Oracle Database
Resource Manager because it was consuming excessive CPU or I/O resources. The
eventName and description fields within the
additionalDetails section provide information regarding the
critical situation:
{
"eventType" : "com.oraclecloud.databaseservice.exadatainfrastructure.critical",
"cloudEventsVersion" : "0.1",
"eventTypeVersion" : "2.0",
"source" : "Exadata Storage",
"eventTime" : "2021-07-30T04:53:18Z",
"contentType" : "application/json",
"data" : {
"compartmentId" : "ocid1.tenancy.oc1.<unique_ID>",
"compartmentName" : "example_name",
"resourceName" : "my_exadata_resource",
"resourceId" : "ocid1.dbsystem.oc1.phx.<unique_ID>",
"availabilityDomain" : "phx-ad-2",
"additionalDetails" : {
"serviceType" : "exacs",
"sqlID" : "gnwfm1jgqcfuu",
"systemId" : "ocid1.dbsystem.oc1.eu-frankfurt-1.<unique_ID>",
"creationTime" : "2021-05-14T13:29:28+00:00",
"dbUniqueID" : "1558836122",
"quarantineType" : "SQLID",
"dbUniqueName" : "AB0503_FRA1S6",
"description" : "SQL statement terminated by Oracle Database Resource Manager due to excessive consumption of CPU and/or I/O.
The execution plan associated with the terminated SQL stmt is quarantined. Please find the sql identifier in sqlId
field of this JSON payload. This feature protects an Oracle database from performance degradation.
Please review the SQL statement. You can see the statement using the following commands: \"set serveroutput off\",
\"select sql_id, sql_text from v$sqltext where sql_id =<sqlId>\", \"set serveroutput on\"",
"quarantineReason" : "Manual",
"asmClusterName" : "None",
"component" : "storage",
"infrastructureType" : "exadata",
"name" : "143",
"eventName" : "HEALTH.INFRASTRUCTURE.CELL.SQL_QUARANTINE",
"comment" : "None",
"quarantineMode" : "\"FULL Quarantine\"",
"rpmVersion" : "OSS_20.1.8.0.0_LINUX.X64_210317",
"cellsrvChecksum" : "14f73eb107dc1be0bde757267e931991",
"quarantinePlan" : "SYSTEM"
}
},
"eventID" : "<unique_ID>",
"extensions" : {
"compartmentId" : "ocid1.tenancy.oc1.<unique_ID>"
}
}
In the following example of an "information" event, you can see within
the additionalDetails section of the event message that this
particular message concerns a flash disk failure that is being investigated by the
Oracle Exadata operations team. The eventName and
description fields within the
additionalDetails section provide information regarding the
event:
{
"eventType" : "com.oraclecloud.databaseservice.exadatainfrastructure.information",
"cloudEventsVersion" : "0.1",
"eventTypeVersion" : "2.0",
"source" : "Exadata Storage",
"eventTime" : "2021-12-17T19:14:42Z",
"contentType" : "application/json",
"data" : {
"compartmentId" : "ocid1.tenancy.oc1..aaaaaaaao3lj36x6lwxyvc4wausjouca7pwyjfwb5ebsq5emrpqlql2gj5iq",
"compartmentName" : "intexadatateam",
"resourceId" : "ocid1.dbsystem.oc1.phx.abyhqljt5y3taezn7ug445fzwlngjfszbedxlcbctw45ykkaxyzc5isxoula",
"availabilityDomain" : "phx-ad-2",
"additionalDetails" : {
"serviceType" : "exacs",
"component" : "storage",
"systemId" : "ocid1.dbsystem.oc1.phx.abyhqljt5y3taezn7ug445fzwlngjfszbedxlcbctw45ykkaxyzc5isxoula",
"infrastructureType" : "exadata",
"description" : "Flash Disk Failure has been detected. This is being investigated by Oracle Exadata team and the disk will be
replaced if needed. No action needed from the customer.",
"eventName" : "HEALTH.INFRASTRUCTURE.CELL.FLASH_DISK_FAILURE",
"FLASH_1_1" : "S2T7NA0HC01251 failed",
"otto-ingestion-time" : "2021-12-17T19:14:43.205Z",
"otto-send-EventService-time" : "2021-12-17T19:14:44.198Z"
}
},
"eventID" : "30130ab4-42fa-4285-93a7-47e49522c698",
"extensions" : {
"compartmentId" : "ocid1.tenancy.oc1..aaaaaaaao3lj36x6lwxyvc4wausjouca7pwyjfwb5ebsq5emrpqlql2gj5iq"
}
}
Review the list of event types that the Data Guard group and Data Guard association resources emit.
Note
To receive events related to Data Guard actions on multiple standby databases, subscribe to the Data Guard group resource events. If you have not switched to the new model, you can continue to subscribe to the Data Guard Associations resource events. However, after switching to the new model, you will need to explicitly subscribe to the new Data Guard Group resource events.
Data Guard Event Types (Data Guard Group resource)
Review the list of event types that Data Guard groups emit.
The Database Service Events feature enables you to get notified about health issues with your Oracle Databases or other components on the Guest VM.
It is possible that Oracle Database or Clusterware may not be healthy, or that various system components may be running out of space in the Guest VM. You are not notified of this situation unless you opt in.
Note
You are opting in with the understanding that the list of events can change in the future. You can opt out of this feature at any time.
The Database Service Events feature generates events for Guest VM operations and conditions, as well as notifications for customers, by leveraging the existing OCI Events service and Notifications mechanisms in their tenancy. Customers can then create topics and subscribe to these topics through email, functions, or streams.
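Conceptually, an Events rule fires when the attributes in its condition match the event. The following is a minimal sketch of that matching logic (simplified: the real Events service also supports wildcard matching and filters on event data):

```python
import json

def rule_matches(condition, event):
    """True when every attribute in the rule condition equals the event's value."""
    wanted = json.loads(condition)
    return all(event.get(key) == value for key, value in wanted.items())

condition = '{"eventType": "com.oraclecloud.databaseservice.exadatainfrastructure.critical"}'
critical_event = {"eventType": "com.oraclecloud.databaseservice.exadatainfrastructure.critical"}
info_event = {"eventType": "com.oraclecloud.databaseservice.exadatainfrastructure.information"}
print(rule_matches(condition, critical_event))  # True
print(rule_matches(condition, info_event))      # False
```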
Note
The flow of events on Exadata Cloud Infrastructure depends on the following components: Oracle Trace File Analyzer (TFA), sysLens, and the Oracle Database Cloud Service (DBCS) agent. Ensure that these components are up and running.
Manage Oracle Trace File Analyzer
To check the run status of Oracle Trace File Analyzer, run the
tfactl status command as root or a
non-root
user:
# tfactl status
.-------------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status|
+----------------+---------------+--------+------+------------+----------------------+------------+
| node1 | RUNNING | 41312 | 5000 | 22.1.0.0.0 | 22100020220310214615 | COMPLETE |
| node2 | RUNNING | 272300 | 5000 | 22.1.0.0.0 | 22100020220310214615 | COMPLETE |
'----------------+---------------+--------+------+------------+----------------------+------------'
To start the Oracle Trace File Analyzer daemon on the local node, run the
tfactl start command as
root:
# tfactl start
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
Successfully started TFA Process..
. . . . .
TFA Started and listening for commands
To stop the Oracle Trace File Analyzer daemon on the local node, run the
tfactl stop command as
root:
# tfactl stop
Stopping TFA from the Command Line
Nothing to do !
Please wait while TFA stops
Please wait while TFA stops
TFA-00002 Oracle Trace File Analyzer (TFA) is not running
TFA Stopped Successfully
Successfully stopped TFA..
Manage sysLens
If sysLens is running, then data is collected in the local domU once every 15 minutes to discover the events to be reported. To check if sysLens is running, run the systemctl status syslens command as root in the domU:
# systemctl status syslens
● syslens.service
Loaded: loaded (/etc/systemd/system/syslens.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2022-03-16 18:08:59 UTC; 34s ago
Main PID: 358039 (python3)
Memory: 31.6M
CGroup: /system.slice/syslens.service
└─358039 /usr/bin/python3 /var/opt/oracle/syslens/bin/syslens_main.py --archive /var/opt/oracle/log/...
Mar 16 18:08:59 node1 systemd[1]: Started syslens.service.
Mar 16 18:09:09 node1 su[360495]: (to oracle) root on none
Mar 16 18:09:09 node1 su[360539]: (to grid) root on none
Mar 16 18:09:10 node1 su[360611]: (to grid) root on none
Mar 16 18:09:11 node1 su[360653]: (to oracle) root on none
If sysLens is enabled, it starts automatically when the domU reboots. To validate whether sysLens is enabled to collect telemetry, run the systemctl is-enabled syslens command as root in the domU:
# systemctl is-enabled syslens
enabled
To validate if sysLens is configured to notify events, run the
/usr/bin/syslens --config
/var/opt/oracle/syslens/data/exacc.syslens.config --get-key
enable_telemetry command as root in the
domU:
# /usr/bin/syslens --config /var/opt/oracle/syslens/data/exacc.syslens.config --get-key enable_telemetry
syslens Collection 2.3.3
on
Manage Database Service Agent
View the /opt/oracle/dcs/log/dcs-agent.log file to identify
issues with the agent.
To check the status of the Database Service Agent, run the
systemctl status
command:
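For example, assuming the agent runs as the dbcsagent systemd unit (the unit name is an assumption and may differ on your image):

# systemctl status dbcsagent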
Receive Notifications about Database Service Events
Subscribe to the Database Service Events and get notified.
To receive notifications, subscribe to Database Service events and get notified using the Oracle Notification service; see Notifications Overview. For more information about Oracle Cloud Infrastructure Events, see Overview of Events.
Review the list of event types that the Database Service
emits.
Note
Critical events are triggered by several types of critical conditions and errors that cause disruption to the database and other critical components: for example, database hang errors, and availability errors for databases, database nodes, and database systems, to let you know if a resource becomes unavailable.
Information events are triggered when the database and other critical
components work as expected. For example, a clean shutdown of CRS, CDB, client, or scan
listener, or a startup of these components will create an event with the severity of
INFORMATION.
Threshold limits reduce the number of notifications customers will receive for similar
incident events whilst at the same time ensuring they receive the incident events and
are reminded in a timely fashion.
Table 6-3 Database Service Events
Friendly Name
Event Name
Remediation
Event Type
Threshold
Resource Utilization - Disk Usage
HEALTH.DB_GUEST.FILESYSTEM.FREE_SPACE
This event is reported when VM guest file system free space falls
below 10% free, as determined by the operating system df(1)
command, for the following file systems:
A DOWN event is created when a SCAN listener goes down. The event is
of type INFORMATION when a SCAN listener is shutdown due to user action, such as
with the Server Control Utility (srvctl) or Listener Control
(lsnrctl) commands, or any Oracle Cloud maintenance action that
uses those commands, such as performing a grid infrastructure software update. The
event is of type CRITICAL when a SCAN listener goes down unexpectedly. A
corresponding DOWN_CLEARED event is created when a SCAN listener is started.
There are three SCAN listeners per cluster called
LISTENER_SCAN[1,2,3].
A DOWN event is created when a client listener goes down. The event is
of type INFORMATION when a client listener is shutdown due to user action, such as
with the Server Control Utility (srvctl) or Listener Control
(lsnrctl) commands, or any Oracle Cloud maintenance action that
uses those commands, such as performing a grid infrastructure software update. The
event is of type CRITICAL when a client listener goes down unexpectedly. A
corresponding DOWN_CLEARED event is created when a client listener is started.
There is one client listener per node, each called LISTENER.
A DOWN event is created when a database instance goes down. The event
is of type INFORMATION when a database instance is shutdown due to user action, such
as with the SQL*Plus (sqlplus) or Server Control Utility
(srvctl) commands, or any Oracle Cloud maintenance action that
uses those commands, such as performing a database home software update. The event
is of type CRITICAL when a database instance goes down unexpectedly. A corresponding
DOWN_CLEARED event is created when a database instance is started.
AVAILABILITY.DB_GUEST.CRS_INSTANCE.EVICTION
An event of type CRITICAL is created when the Cluster Ready Service (CRS) evicts a node from the cluster. The CRS alert.log is parsed for the CRS-1632 error indicating that a node is being removed from the cluster.
N/A
Critical DB Errors
HEALTH.DB_CLUSTER.CDB.CORRUPTION
Database corruption has been detected on your primary or standby
database. The database alert.log is parsed for any specific errors that are
indicative of physical block corruptions, logical block corruptions, or logical
block corruptions caused by lost writes.
An event of type CRITICAL is created if a CDB is either unable to
archive the active online redo log or unable to archive the active online redo log
fast enough to the log archive destinations.
An event of type CRITICAL is created when an ASM disk group reaches
space usage of 90% or higher. An event of type INFORMATION is created when the ASM
disk group space usage drops below 90%.
A Write-then-Read operation with a dummy file has failed for a file
system, typically indicating the operating system had detected an I/O error or
inconsistency (i.e. corruption) with the file system and changed the file system
mount mode from read-write to read-only. The following file systems are tested:
Oracle EXAchk is Exadata database platform's holistic health check that includes
software, infrastructure and database configuration checks. CRITICAL check alerts
should be addressed in 24 hours to maintain the maximum stability and availability
of your system. This database service event alerts every 24 hours whenever there are
any CRITICAL checks that are flagged in the most recent Oracle EXAchk report. The
event will point to the latest Oracle EXAchk zip report.
Temporarily Restrict Automatic Diagnostic Collections for Specific Events
Use the tfactl blackout command to temporarily suppress
automatic diagnostic collections.
If you set blackout for a target, then Oracle Trace File Analyzer stops
automatic diagnostic collections if it finds events in the alert
logs for that target while scanning. By default, blackout will be in
effect for 24 hours.
You can also restrict automatic diagnostic collection at a granular
level, for example, only for ORA-00600 or even only
ORA-00600 with specific arguments.
targettype host|crs|asm|asmdg|database|dbbackup|listener|service|os
Limits blackout only to the specified target type.
host: The whole node is under
blackout. If there is host blackout, then every
blackout element that's shown true in the
Telemetry JSON will have the reason for the
blackout.
crs: Blackout the availability
of the Oracle Clusterware resource or events in
the Oracle Clusterware logs.
asm: Blackout the availability
of Oracle Automatic Storage Management (Oracle
ASM) on this machine or events in the Oracle ASM
alert logs.
asmdg: Blackout an Oracle ASM
disk group.
database: Blackout the
availability of an Oracle Database, Oracle
Database backup, tablespace, and so on, or events
in the Oracle Database alert logs.
dbbackup: Blackout Oracle
Database backup events (such as CDB or archive
backups).
listener: Blackout the
availability of a listener.
service: Blackout the
availability of a service.
os: Blackout one or more
operating system records.
target all|name
Specify the target for blackout. You can specify a comma-delimited list of targets.
By default, the target is set to all.
container name
Specify the database container
name (db_unique_name) where the
blackout will take effect (for PDB,
DB_TABLESPACE, and
PDB_TABLESPACE).
pdb pdb_name
Specify the PDB where the blackout
will take effect (for
PDB_TABLESPACE only).
events all|"str1,str2"
Limits blackout only to the availability events, or event strings, which should not trigger auto collections, or be marked as blacked out in telemetry JSON.
all: Blackout everything for the target specified.
string: Blackout for incidents where any part of the line contains the strings specified.
Specify a comma-delimited list of strings.
timeout nh|nd|none
Specify the duration for blackout in number of hours or days before timing out. By default, the timeout is set to 24 hours (24h).
c|local
Specify if blackout should be set to cluster-wide or local.
By default, blackout is set to local.
reason comment
Specify a descriptive reason for the blackout.
docollection
Use this option to do an automatic diagnostic collection even if a blackout is set for this target.
Example 6-62 tfactl blackout
To blackout event: ORA-00600 on target
type: database, target:
mydb
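The corresponding command could look like the following (a sketch inferred from the host example in this section and the blackout printout below; confirm the exact options with the tfactl help):

tfactl blackout add -targettype database -event ORA-00600 -target mydb -timeout 24h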
To blackout ALL events on target
type: host, target:
all
tfactl blackout add -targettype host -event all -target all -timeout 1h -reason "Disabling all events during patching"
To print blackout
details
tfactl blackout print
.-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------.
| myhostname |
+---------------+---------------------+-----------+------------------------------+------------------------------+--------+---------------+--------------------------------------+
| Target Type | Target | Events | Start Time | End Time | Status | Do Collection | Reason |
+---------------+---------------------+-----------+------------------------------+------------------------------+--------+---------------+--------------------------------------+
| HOST | ALL | ALL | Thu Mar 24 16:48:39 UTC 2022 | Thu Mar 24 17:48:39 UTC 2022 | ACTIVE | false | Disabling all events during patching |
| DATABASE | MYDB | ORA-00600 | Thu Mar 24 16:39:03 UTC 2022 | Fri Mar 25 16:39:03 UTC 2022 | ACTIVE | false | NA |
| DATABASE | ALL | ORA-04031 | Thu Mar 24 16:39:54 UTC 2022 | Thu Mar 24 17:39:54 UTC 2022 | ACTIVE | false | NA |
| DB_DATAGUARD | MYDB | ALL | Thu Mar 24 16:41:38 UTC 2022 | Thu Mar 24 17:11:38 UTC 2022 | ACTIVE | false | NA |
| DBBACKUP | MYDB | ALL | Thu Mar 24 16:40:47 UTC 2022 | Fri Mar 25 16:40:47 UTC 2022 | ACTIVE | false | NA |
| DB_TABLESPACE | SYSTEM_CDBNAME_MYDB | ALL | Thu Mar 24 16:45:56 UTC 2022 | Thu Mar 24 17:15:56 UTC 2022 | ACTIVE | false | NA |
'---------------+---------------------+-----------+------------------------------+------------------------------+--------+---------------+--------------------------------------'
To remove blackout for event: ORA-00600 on
target type: database, target:
mydb
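The remove form mirrors the add form (a sketch; confirm the exact options with the tfactl help):

tfactl blackout remove -targettype database -event ORA-00600 -target mydb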
Problem Statement: One or more VM guest file systems have free space below 10%.
Risk: Insufficient VM guest file system free space can cause disk
space allocation failure, which can result in wide-ranging errors and failures in Oracle
software (Database, Clusterware, Cloud, Exadata).
Action:
Oracle Cloud and Exadata utilities run automatically to purge old log files and trace
files created by Oracle software to reclaim file system space.
If the automatic file system space reclamation utilities cannot sufficiently purge old
files to clear this event, then perform the following actions:
Remove unneeded files and directories created manually or by customer-installed applications or utilities. Files created by customer-installed software are outside the scope of Oracle's automatic file system space reclamation utilities. The following operating system command, run as the opc user, is useful for identifying directories consuming excessive disk space:
$ sudo du -hx file-system-mount-point | sort -hr
Only remove files or directories you are certain can be safely removed.
Reclaim /u02 file system disk space by removing
Database Homes that have no databases. For more information about managing Database
Homes, see Manage Oracle Database Homes on Exadata Database
Service on Exadata Cloud Infrastructure Instance.
Open a service request to receive additional guidance about reducing file system space use.
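As an illustrative sketch (not an Oracle-provided utility), free space across mounted file systems can be checked against the 10% threshold with df and awk; the low_free helper name is an assumption:

```shell
# low_free: read `df -P` output on stdin and report file systems whose
# free space is below the given threshold percentage.
# (Illustrative helper; not an Oracle-provided tool.)
low_free() {
    awk -v t="$1" 'NR > 1 {
        used = substr($5, 1, length($5) - 1)   # strip trailing "%"
        free = 100 - used
        if (free < t) printf "%s on %s: %d%% free\n", $1, $6, free
    }'
}

# The event in this section corresponds to a 10% free-space threshold:
df -P -l | low_free 10
```

Any file system this prints is a candidate for the cleanup steps above.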
Problem Statement: The Cluster Ready Services (CRS) stack is offline or has failed.
Risk: If the Cluster Ready Service is offline on a node, then the
node cannot provide database services for the application.
Action:
Check whether CRS was stopped by your administrator as part of a planned maintenance event or a scale up or down of local storage.
The following patching events will stop CRS:
Grid Infrastructure patching
Exadata VM patching of the guest
Exadata VM patching of the host
If CRS has stopped unexpectedly, then check the current status by issuing the crsctl check crs command.
If the node is not responding, then the VM node may be rebooting. Wait for the node reboot to finish; CRS is normally started through the init process.
If CRS is still down, then investigate the cause of the failure by referring to the alert.log found in /u01/app/grid/diag/crs/<node_name>/crs/trace. Review the log entries corresponding to the date/time of the down event and act on any potential remediation.
Restart CRS by issuing the crsctl start crs command.
A successful restart of CRS will generate the clearing event:
AVAILABILITY.DB_GUEST.CRS_INSTANCE.DOWN_CLEARED.
Problem Statement: A SCAN listener is down and unable to
accept application connections.
Risk: If all SCAN listeners are down, then application connections to
the database through the SCAN listener will fail.
Action:
Start the SCAN listener to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
If the event was caused by an Oracle Cloud maintenance action, such as
performing a Grid Infrastructure software update, then no action is required. The
affected SCAN listener will automatically failover to an available instance.
If the event was caused by user action, then start the SCAN listener at the next
opportunity.
DOWN event of type CRITICAL
Check SCAN status and restart the SCAN listener.
Log in to the VM as the opc user and sudo to the grid user:
sudo su - grid
Check the SCAN listener status on any node:
srvctl status scan_listener
Start the SCAN listener:
srvctl start scan_listener
Recheck the SCAN listener status on any node. If the SCAN listener is still down, then investigate the cause of the failure:
Collect both the CRS and operating system logs from 30 minutes before to 10 minutes after the event for the <hostName> indicated in the log. Note that the time in the event payload is always provided in UTC. For the tfactl collection, adjust the time to the time zone of the VM cluster. As the grid user:
tfactl diagcollect -crs -os -node <hostName> -from "<eventTime adjusted for local vm timezone> - 30 minutes" -to "<eventTime adjusted for local vm timezone> + 10 minutes"
Review the SCAN listener log located under
/u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace
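Because the event payload time is always UTC, it can help to compute the adjusted -from/-to window before running tfactl diagcollect. A minimal sketch using GNU date; the sample event time and time zone are illustrative:

```shell
# Compute the tfactl -from/-to window in the VM's local time zone from the
# UTC <eventTime> in the event payload. Requires GNU date; the sample
# event time and time zone below are illustrative.
EVENT_TIME_UTC="2022-03-24 16:48:39 UTC"   # <eventTime> from the payload
VM_TZ="America/New_York"                    # the VM cluster's time zone
FROM=$(TZ="$VM_TZ" date -d "$EVENT_TIME_UTC - 30 minutes" '+%Y-%m-%d %H:%M:%S')
TO=$(TZ="$VM_TZ" date -d "$EVENT_TIME_UTC + 10 minutes" '+%Y-%m-%d %H:%M:%S')
echo "tfactl window: -from \"$FROM\" -to \"$TO\""
```

The printed window can then be pasted into the tfactl diagcollect command.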
Problem Statement: A client listener is down and unable to
accept application connections.
Risk:
If the node's client listener is down, then the database instances on the node
cannot provide services for the application.
If the client listener is down on all nodes, then any application that connects
to any database using the SCAN or VIP will fail.
Action:
Start the client listener to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
If the event was caused by an Oracle Cloud maintenance action, such as
performing a Grid Infrastructure software update, then no action is required. The
affected client listener will automatically restart when maintenance affecting the
grid instance is complete.
If the event was caused by user action, then start the client listener
at the next opportunity.
DOWN event of type CRITICAL
Check the client listener status and then restart the client listener.
Log in to the VM as the opc user and sudo to the grid user:
[opc@vm ~] sudo su - grid
Check the client listener status on any node:
[grid@vm ~] srvctl status listener
Start the client listener:
[grid@vm ~] srvctl start listener
Recheck the client listener status on any node. If the client listener is still down, then investigate the cause of the failure:
Use tfactl to collect both the CRS and operating system logs from 30 minutes before to 10 minutes after the event for the <hostName> indicated in the log. Note that the time in the event payload is always provided in UTC. For the tfactl collection, adjust the time to the time zone of the VM cluster.
[grid@vm ~] tfactl diagcollect -crs -os -node <hostName> -from "<eventTime adjusted for local vm timezone> - 30 minutes" -to "<eventTime adjusted for local vm timezone> + 10 minutes"
Review the listener log located under
/u01/app/grid/diag/tnslsnr/<hostName>/<listenerName>/trace
Problem Statement: A database instance has gone down.
Risk: A database instance has gone down, which may result in reduced
performance if database instances are available on other nodes in the cluster, or
complete downtime if database instances on all nodes are down.
Action:
Start the database instance to receive the DOWN_CLEARED event.
DOWN event of type INFORMATION
If the event was caused by an Oracle Cloud maintenance action, such as
performing a Database Home software update, then no action is required. The
affected database instance will automatically restart when maintenance affecting
the instance is complete.
If the event was caused by user action, then start the affected database
instance at the next opportunity.
DOWN event of type CRITICAL
Check database status and restart the down database instance.
Problem Statement: Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if a critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing impaired members.
Risk: During the time it takes to restart the evicted node, the node cannot provide database services for the application.
Action: CRS node eviction can be caused by the OCSSD (CSS daemon), CSSDAGENT, or CSSDMONITOR processes. Determine which process was responsible for the node eviction and review the relevant log files. Common causes of OCSSD eviction are network failures or latencies, I/O issues with CSS voting disks, or a member kill escalation. CSSDAGENT or CSSDMONITOR evictions can be caused by an OS scheduler problem or a hung thread within the CSS daemon.
Log files to review include:
clusterware alert log
cssdagent log
cssdmonitor log
ocssd log
lastgasp log
/var/log/messages
CHM/OS Watcher data
opatch lsinventory detail
For more information on collecting most of these files, see Autonomous Health Framework (AHF) - Including TFA and ORAchk/EXAchk (Doc ID 2550798.1).
For more information on troubleshooting CRS node eviction, see
Troubleshooting Clusterware Node Evictions (Reboots) (Doc ID
1050693.1).
Problem Statement: Corruptions can lead to application or database errors and, in the worst case, result in significant data loss if not addressed promptly.
A corrupt block is a block that was changed so that it differs from what
Oracle Database expects to find. Block corruptions can be categorized as physical or
logical:
In a physical block corruption, which is also called a media
corruption, the database does not recognize the block at all; the checksum is
invalid or the block contains all zeros. An example of a more sophisticated
block corruption is when the block header and footer do not match.
In a logical block corruption, the contents of the block are
physically sound and pass the physical block checks; however, the block can be
logically inconsistent. Examples of logical block corruption include incorrect
block type, incorrect data or redo block sequence number, corruption of a row
piece or index entry, or data dictionary corruptions.
For more information, see Physical and Logical Block
Corruptions. All you wanted to know about it. (Doc ID 840978.1).
Block corruptions can also be divided into interblock corruption and
intrablock corruption:
In an intrablock corruption, the corruption occurs in the block
itself and can be either a physical or a logical block corruption.
In an interblock corruption, the corruption occurs between blocks
and can only be a logical block corruption.
Oracle checks for the following errors in the
alert.log:
ORA-01578
ORA-00752
ORA-00753
ORA-00600 [3020]
ORA-00600 [kdsgrp1]
ORA-00600 [kclchkblk_3]
ORA-00600 [13013]
ORA-00600 [5463]
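As an illustrative sketch, the error patterns above can be scanned for in an alert log with grep; the scan_for_corruption helper name, the log path, and the sample log content are assumptions for demonstration:

```shell
# scan_for_corruption: grep an alert log for the corruption-related
# errors listed above. (Illustrative helper; the path and sample content
# below are assumptions.)
scan_for_corruption() {
    grep -E 'ORA-01578|ORA-00752|ORA-00753|ORA-00600.*\[(3020|kdsgrp1|kclchkblk_3|13013|5463)\]' "$1"
}

# Demonstrate against a sample alert log excerpt:
cat > /tmp/alert_sample.log <<'EOF'
Thu Mar 24 16:48:39 2022
ORA-01578: ORACLE data block corrupted (file # 7, block # 1234)
ORA-00600: internal error code, arguments: [kdsgrp1], [], []
EOF
scan_for_corruption /tmp/alert_sample.log
```

On a live system, point the helper at the database alert log instead of the sample file.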
Risk: A data corruption outage occurs when a hardware, software, or
network component causes corrupt data to be read or written. The service-level impact of
a data corruption outage may vary, from a small portion of the application or database
(down to a single database block) to a large portion of the application or database
(making it essentially unusable). If remediation action is not taken promptly, then
potential downtime and data loss can increase.
Action:
This event notification triggers on physical block corruptions (ORA-01578), lost writes (ORA-00752, ORA-00753, and ORA-00600 with first argument 3020), and logical corruptions (typically detected from ORA-00600 with first argument kdsgrp1, kclchkblk_3, 13013, or 5463).
Oracle recommends the following steps:
Confirm that these corruptions were reported in the alert.log trace file. Log a Service Request (SR) with the latest EXAchk report, an excerpt of the alert.log and trace file containing the corruption errors, any history of recent application, database, or software changes, and any system, clusterware, and database logs for the same time period. For all these cases, a TFA collection should be available and attached to the SR.
For repair recommendations, refer to Handling Oracle Database Corruption
Issues (Doc ID 1088018.1).
For physical corruptions or ORA-1578 errors, the following notes will be helpful:
Doc ID 1578.1 : OERR: ORA-1578 "ORACLE data block corrupted (file # %s, block #
%s)" Primary Note
Doc ID 472231.1 : How to identify all the Corrupted Objects in the Database
reported by RMAN
Doc ID 819533.1 : How to identify the corrupt Object reported by ORA-1578 / RMAN
/ DBVERIFY
Depending on the object that has the corruption, follow the guidance in Doc ID 1088018.1. Note that RMAN can be used to recover one or many data blocks that are physically corrupted. Also, with Active Data Guard and real-time apply, automatic block repair of physical data corruptions would have occurred automatically.
For logical corruptions caused by lost writes (ORA-00752, ORA-00753, and ORA-00600 with first argument 3020) on the primary or standby databases, the corruption is detected on the primary or by the standby's redo apply process. The following notes will be helpful:
Follow the guidance in Doc ID 1088018.1.
If you have a standby and lost write corruption on the primary or standby, refer
to Resolving ORA-00752 or ORA-00600 [3020] During Standby Recovery (Doc ID
1265884.1)
For logical corruptions (typically detected from ORA-00600 with arguments kdsgrp1, kclchkblk_3, 13013, or 5463):
Follow Doc ID 1088018.1 for specific guidance on the error that was detected.
If you have a standby and logical corruption on the primary, refer to Resolving
Logical Block Corruption Errors in a Physical Standby Database (Doc ID
2821699.1)
Problem Statement: A CDB RAC instance may temporarily or permanently stall due to the log writer's (LGWR) inability to write the log buffers to an online redo log. This occurs because all online logs need archiving. Once the archiver (ARC) can archive at least one online redo log, LGWR can resume writing the log buffers to online redo logs and the application impact is alleviated.
Risk: If the archiver hang is temporary, then this can result in a small application brownout or stall for application processes attempting to commit their database changes. If the archiver is not unblocked, applications can experience extended delays in processing.
Action:
See Script To Find Redo Log Switch History And Find Archivelog Size For Each Instance In RAC (Doc ID 2373477.1) to determine the hourly frequency for each thread/instance.
If any hourly bucket is greater than 12, then consider resizing the online redo
logs. See item 2 below for resizing steps.
If the database hangs are temporary, then the archiver may be unable to keep up with the redo log generated. Check the alert.log, $ORACLE_BASE/diag/rdbms/<dbName>/<instanceName>/trace/alert_<instanceName>.log, for "All online logs need archiving"; multiple events in a short period indicate two possible solutions.
If the number of redo log groups per thread is less than 4, then consider adding additional log groups to reach 4; see item 1 below for the add redo log steps.
The other possible solution is to resize the redo logs; see item 2 below for resizing steps.
For both Data Guard and non-Data Guard configurations, review the Configure Online Redo Logs Appropriately section of Oracle Database High Availability Overview and Best Practices for sizing guidelines.
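As a rough worked example (an assumption, not an official sizing formula), if the hourly switch count exceeds 12, scaling the current log size proportionally brings switches back toward 12 per hour:

```shell
# Rough resizing arithmetic (an assumption, not an official formula):
# keep log switches at or below about 12 per hour by scaling the size.
CURRENT_SIZE_BYTES=4294967296   # 4 GB current redo log size, illustrative
SWITCHES_PER_HOUR=24            # worst hourly bucket from the Doc ID 2373477.1 query
TARGET_SWITCHES=12
NEW_SIZE_BYTES=$(( CURRENT_SIZE_BYTES * SWITCHES_PER_HOUR / TARGET_SWITCHES ))
echo "resize redo logs from $CURRENT_SIZE_BYTES to $NEW_SIZE_BYTES bytes"
```

Here 24 switches per hour at 4 GB suggests roughly doubling the log size; always validate against the sizing guidelines referenced above.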
Add a redo log group for each thread. The additional redo log should equal the
current log size.
Use the following query:
select max(group#) ending_group_number, thread#, count(*) number_of_groups_per_thread, bytes redo_size_in_bytes from v$log group by thread#, bytes;
Add one new group per thread using the same size as the current redo logs:
alter database add logfile thread <thread_number> group <max_group + 1> ('<DATA_DISKGROUP>') size <redo_size_in_bytes>;
Resize the online redo logs by adding larger redo logs and dropping the
current smaller redo logs.
Use the following query:
select max(group#) ending_group_number, thread#, count(*) number_of_groups_per_thread, bytes redo_size_in_bytes from v$log group by thread#, bytes;
Add the same number of redo logs for each thread (<number_of_groups_per_thread>) that currently exist. The <new_redo_size_in_bytes> should be based on the Configure Online Redo Logs Appropriately section of Oracle Database High Availability Overview and Best Practices.
alter database add logfile thread <thread_number> group <max_group + 1> ('<DATA_DISKGROUP>') size <new_redo_size_in_bytes>;
The original smaller redo logs should be deleted. A redo log can only be deleted if its status is INACTIVE.
To determine the status of the redo logs, issue:
select group#, thread#, status, bytes from v$log order by bytes, group#, thread#;
To delete the original smaller redo logs:
alter database drop logfile group <group#>;
If the database is hung, the primary log archive destination and alternate may be
full. Review the HEALTH.DB_CLUSTER.DISK_GROUP.FREE_SPACE for details on
freeing space in RECO and DATA disk groups.
Problem Statement: Hang management detected a process hang and generated an ORA-32701 error message. Additionally, this event may be raised if the Diagnostic Process (DIA0) detects a hang in a critical database process.
Risk: A hang can indicate resource, operating system, or application
coding related issues.
Action:
Investigate the cause of the session hang.
Review TFA events for the database for the following message patterns
corresponding to the date/time of the event: ORA-32701, "DIA0 Critical Database
Process Blocked" or "DIA0 Critical Database Process As
Root".
Problem Statement: A daily incremental backup of the CDB failed.
Risk: A failure of the backup can compromise the ability to use the backups for restore/recovery of the database. The Recovery Point Objective (RPO) and the Recovery Time Objective (RTO) can be impacted.
Action:
Review the RMAN logs corresponding to the date/time of the event. Note that the event time stamp <eventTime> is in UTC; adjust as necessary for the VM's timezone.
For Exadata Cloud Infrastructure Oracle
Managed Backups or User Configured Backups under dbaascli:
RMAN output can be found at
/var/opt/oracle/log/<DB_NAME>/obkup.
Daily incremental logs have the format obkup_yyyy-mm-dd_24hh:mm:ss.zzzzzzzzzzzz.log within the obkup directory. The logs are located on the lowest active node/instance of the database when the backup was initiated.
Review the log for any failures:
If the failure is due to an external event outside of RMAN, for example, a full backup location or a networking issue, resolve the external issue.
For other RMAN script errors, collect the diagnostic
logs and open a Service Request. See DBAAS
Tooling: Using dbaascli to Collect Cloud Tooling Logs and
Perform a Cloud Tooling Health Check.
If the issue is transient or is resolved, take a new
incremental backup: See dbaascli database
backup.
For customer-owned and managed backups taken through RMAN:
Problem Statement: ASM disk group space usage is at or exceeds
90%.
Risk: Insufficient ASM disk group space can cause database creation
failure, tablespace and data file creation failure, automatic data file extension
failure, or ASM rebalance failure.
Action:
ASM disk group used space is determined by running the following query while connected to the ASM instance:
[opc@node ~] sudo su - grid
[grid@node ~] sqlplus / as sysasm
SQL> select 'ora.'||name||'.dg', total_mb, free_mb, round ((1-(free_mb/total_mb))*100,2) pct_used from v$asm_diskgroup;
NAME TOTAL_MB FREE_MB PCT_USED
------------------------------ ---------- ---------- ----------
ora.DATAC1.dg 75497472 7408292 90.19
ora.RECOC1.dg 18874368 17720208 6.11
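The PCT_USED column in the output above is (1 - free_mb/total_mb) * 100. The same arithmetic can be reproduced outside the database; the pct_used helper name is an assumption:

```shell
# pct_used: reproduce the PCT_USED arithmetic from the query above,
# (1 - free_mb/total_mb) * 100, rounded to two decimals.
# (Illustrative helper; values below come from the sample output.)
pct_used() {
    awk -v total="$1" -v free="$2" 'BEGIN { printf "%.2f\n", (1 - free / total) * 100 }'
}
pct_used 75497472 7408292     # DATAC1  -> 90.19
pct_used 18874368 17720208    # RECOC1  -> 6.11
```

The DATAC1 result is above the 90% event threshold, matching the sample output.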
ASM disk group capacity can be increased in the following ways:
Scale Exadata VM Cluster storage to add more ASM disk group capacity.
See Scaling an Exadata Cloud Infrastructure Instance.
Scale Exadata Infrastructure storage to add more ASM disk group
capacity. See Scaling Exadata X8M and X9M Compute and
Storage.
DATA disk group space use can be reduced in the following ways:
Drop unused data files and temp files from databases. See
Dropping Data Files.
Terminate unused databases (e.g. test databases). See Using the
Console to Terminate a Database.
RECO disk group space use can be reduced in the following ways:
Drop unnecessary Guaranteed Restore Points. See Using Normal and
Guaranteed Restore Points.
Delete archived redo logs or database backups already backed up outside
the Flash Recovery Area (FRA). See Maintaining the Fast Recovery
Area.
SPARSE disk group space use can be reduced in the following ways:
Move full copy test master databases to another disk group (e.g. DATA).
Drop unused snapshot databases or test master databases. See
Managing Exadata Snapshots.
For more information about managing the log and diagnostic files, see Managing the Log and Diagnostic Files on Oracle Exadata Database Service
on Dedicated Infrastructure.
Managing the Log and Diagnostic Files on Oracle Exadata Database Service on Dedicated Infrastructure
The software components in Oracle Exadata Database Service on Dedicated
Infrastructure generate a variety of log
and diagnostic files, and not all these files are automatically archived and purged.
Thus, managing the identification and removal of these files to avoid running out of
file storage space is an important administrative task.
Database deployments on ExaDB-D include
the cleandblogs script to simplify this administrative task. The script
runs daily as a cron job on each compute node to archive key files and
remove old log and diagnostic files.
The cleandblogs script operates by using the adrci
(Automatic Diagnostic Repository [ADR] Command Interpreter) tool to identify and purge
target diagnostic folders and files for each Oracle Home listed in
/etc/oratab. It also targets Oracle Net Listener logs, audit files,
and core dumps.
On ExaDB-D, the script is run separately
as the oracle user to clean log and diagnostic files that are
associated with Oracle Database, and as the grid user to clean log and
diagnostic files that are associated with Oracle Grid Infrastructure.
The cleandblogs script uses a configuration file to determine how long
to retain each type of log or diagnostic file. You can edit the file to change the
default retention periods. The file is located at
/var/opt/oracle/cleandb/cleandblogs.cfg on each compute node.
Note
Configure an optimal retention period for each type of log or diagnostic file. An
insufficient retention period will hinder root cause analysis and problem
investigation.
Parameter
Description and Default Value
AlertRetention
Alert log (alert_instance.log) retention value
in days.
Default value: 14
ListenerRetention
Listener log (listener.log) retention value in
days.
Default value: 14
AuditRetentionDB
Database audit (*.aud) retention value in
days.
Default value: 1
CoreRetention
Core dump/files (*.cmdp*) retention value in
days.
Default value: 7
TraceRetention
Trace file (*.tr* and
*.prf) retention value in days.
Default value: 7
longpRetention
Data designated in the Automatic Diagnostic Repository (ADR) as
having a long life (the LONGP_POLICY attribute).
For information about ADR, see Automatic Diagnostic Repository
(ADR) in the Oracle Database Administrator's
Guide.
Default value: 14
shortpRetention
Data designated in the Automatic Diagnostic Repository (ADR) as
having a short life (the SHORTP_POLICY attribute).
For information about ADR, see Automatic Diagnostic Repository
(ADR) in the Oracle Database Administrator's
Guide.
Default value: 7
LogRetention
Log file retention in days for files under
/var/opt/oracle/log and log files in ACFS
under /var/opt/oracle/dbaas_acfs/log.
Default value: 14
LogDirRetention
cleandblogs logfile retention in days.
Default value: 14
ScratchRetention
Temporary file retention in days for files under
/scratch.
Default value: 7
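Assuming a simple parameter=value file format (the exact syntax of cleandblogs.cfg may differ), the defaults in the table above would look like:

```
AlertRetention=14
ListenerRetention=14
AuditRetentionDB=1
CoreRetention=7
TraceRetention=7
longpRetention=14
shortpRetention=7
LogRetention=14
LogDirRetention=14
ScratchRetention=7
```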
Archiving Alert Logs and Listener Logs
When cleaning up alert and listener logs, cleandblogs first archives
and compresses the logs, operating as follows:
The current log file is copied to an archive file that ends with a date
stamp.
The current log file is emptied.
The archive file is compressed using gzip.
Any existing compressed archive files older than the retention period are
deleted.
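The four archiving steps above can be sketched in shell as follows; the log path and the 14-day retention are illustrative, and this is not the cleandblogs implementation itself:

```shell
# Sketch of the archive-and-compress steps above; the log path and the
# 14-day retention are illustrative, not the cleandblogs implementation.
LOG=/tmp/demo_alert.log
echo "sample log line" > "$LOG"
STAMP=$(date '+%Y%m%d')

cp "$LOG" "${LOG}.${STAMP}"    # 1. copy the current log to a date-stamped archive
: > "$LOG"                     # 2. empty the current log file
gzip -f "${LOG}.${STAMP}"      # 3. compress the archive with gzip
# 4. delete compressed archives older than the retention period
find /tmp -name 'demo_alert.log.*.gz' -mtime +14 -delete
```

After these steps, the live log is empty and its previous contents survive in the compressed, date-stamped archive.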
Running the cleandblogs Script Manually
The cleandblogs script automatically runs daily on each compute
node, but you can also run the script manually if the need arises.
Connect to the compute node as the oracle user to clean log and
diagnostic files that are associated with Oracle Database, or connect as the
grid user to clean log and diagnostic files that are
associated with Oracle Grid Infrastructure.
For detailed instructions, see
Connecting to a Virtual Machine with SSH.
Change to the directory containing the cleandblogs script:
$ cd /var/opt/oracle/cleandb
Run the cleandblogs script:
$ ./cleandblogs.pl
When running the script manually, you can specify an alternate configuration file to use instead of cleandblogs.cfg by using the --pfile option:
Problem Statement: Too much VM memory is allocated for
HugePages use.
Risk: Excessive memory allocated to HugePages may result in poor
database performance, or the system running out of memory, experiencing excessive
swapping, or having crucial system services fail, causing system crash or node
eviction.
Action:
Reduce HugePages memory use. To determine the proper setting for operating system
parameter vm.nr_hugepages, see My Oracle Support document 361323.1.
Scale up VM memory. For more information about scaling VM memory, see
Introduction to Scale Up or Scale Down Operations.
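As illustrative arithmetic only (see Doc ID 361323.1 for the supported calculation script), vm.nr_hugepages should roughly cover the combined SGA of all instances divided by the huge page size:

```shell
# Illustrative arithmetic only; see Doc ID 361323.1 for the supported
# calculation script. vm.nr_hugepages should roughly cover the combined
# SGA of all instances divided by the huge page size.
TOTAL_SGA_MB=49152       # sum of SGA sizes across all instances, illustrative
HUGEPAGE_SIZE_MB=2       # default huge page size on Linux x86-64
NR_HUGEPAGES=$(( TOTAL_SGA_MB / HUGEPAGE_SIZE_MB ))
echo "vm.nr_hugepages = $NR_HUGEPAGES"
```

Allocating substantially more than this estimate is what produces the excess HugePages memory described above.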
Problem Statement: A file system that is expected to be
read-write can no longer be written to.
Risk: Oracle software (Linux, Database, Clusterware, Cloud, Exadata)
requires write access to file systems to operate correctly.
Action:
/u01 and /u02 file systems:
Stop running services, if any, that are using the file system, such as Oracle
Clusterware, Trace File Analyzer (TFA), and Enterprise Manager (EM) agent.
Unmount the file system.
Run file system check and repair.
ext4: Refer to Checking and Repairing a File
System.
xfs: Refer to Checking and Repairing an XFS File
System.
If the file system cannot be repaired then open a service request with
Oracle Support for assistance.
Mount the file system.
Start the services.
/ (root) file system:
Open a service request with Oracle Support for assistance.
If there is VM access, then collect full dmesg(1) command output
and provide it to Oracle Support.
Note that / (root) file system repair is possible only with console
access.
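As an illustrative sketch, mounts that have gone read-only can be spotted by parsing the mount table; the ro_mounts helper name is an assumption:

```shell
# ro_mounts: read mount-table lines in /proc/mounts format on stdin and
# print mount points whose mount options include "ro".
# (Illustrative helper; run it against /proc/mounts on the VM.)
ro_mounts() {
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }'
}

# Sample input; on a live VM use: ro_mounts < /proc/mounts
printf '/dev/a /u02 ext4 ro,relatime 0 0\n/dev/b / xfs rw,noatime 0 0\n' | ro_mounts
```

A file system expected to be read-write appearing in this output matches the event described above.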
Problem Statement: A CRITICAL Exachk check failed and should
be reviewed and addressed as soon as possible.
Risk: A CRITICAL check is expected to impact a large number of
customers AND should be addressed immediately (for example, within 24 hours) AND meets
one or more of the following criteria:
On-disk corruption or data loss
Intermittent wrong results with Exadata feature usage (e.g. smart scan)
System wide availability impact
Severe system-wide performance impact seriously affecting application Service Level Agreements (SLAs)
Compromised redundancy and inability to restore redundancy
Inability to update software in a rolling manner
Configuration error that could lead to an unexpected or unknown
impact
Action:
Oracle recommends that you open the EXAchk HTML report from the latest EXAchk zip file, click "view" on each CRITICAL check, and follow the recommendation guidance, which contains Benefit/Impact, Risk, and Action/Repair guidance. Once the CRITICAL check is addressed, the next EXAchk run will pass that check. For more information about Oracle EXAchk, see Oracle Exadata Database Machine Exachk (Doc ID 1070954.1).
As the root user, you can re-run the EXAchk command by issuing:
If the check results are returning false data, then log a Service Request.
If there is a CRITICAL check that needs to be temporarily excluded, then follow the
"Skipping Specific Best Practice Checks in Exadata Cloud" section of
Oracle Exadata Database Machine Exachk (Doc ID 1070954.1).
Viewing Audit Log Events Oracle Cloud Infrastructure Audit service provides records of API operations performed against supported services as a list of log events.
An audit event is generated when you connect to the serial console using a Secure Shell
(SSH) connection. Navigate to Audit in the Console and search for
VmConsoleConnected. When you navigate to Audit in the Console, a list of
results is generated for the current compartment. Audit logs are organized by compartment,
so if you are looking for a particular event, you must know which compartment the event
occurred in. You can filter the list in the following ways:
Date and time
Request Action Types (operations)
Keywords
For more information, see Viewing Audit Log Events.
Example 6-64 Serial Console Connection Audit Event Example
This is a reference event for a Serial Console Connection: