Troubleshooting Exadata Database Service on Cloud@Customer Systems

These topics cover some common issues you might run into and how to address them.

Patching Failures on Exadata Database Service on Cloud@Customer Systems

Patching operations can fail for various reasons. Typically, an operation fails because a database node is down, there is insufficient space on the file system, or the virtual machine cannot access the object store.

Determining the Problem

In the Console, you can identify a failed patching operation by viewing the patch history of an Exadata Database Service on Cloud@Customer system or an individual database.

A patch that was not successfully applied displays a status of Failed and includes a brief description of the error that caused the failure. If the error message does not contain enough information to point you to a solution, you can use the database CLI and log files to gather more data. Then, refer to the applicable section in this topic for a solution.

Troubleshooting and Diagnosis

Diagnose the most common issues that can occur during the patching process of any of the Exadata Database Service on Cloud@Customer components.

Database Server VM Issues

One or more of the following conditions on the database server VM can cause patching operations to fail.

Database Server VM Connectivity Problems

Cloud tooling relies on proper networking and connectivity configuration between the virtual machines of a given VM cluster. If the configuration is incorrect, operations that require cross-node processing can fail. For example, the tooling might be unable to download the files required to apply a patch.

In that case, perform the following actions:

  • Verify that your DNS configuration is correct so that the relevant virtual machine addresses are resolvable within the VM cluster.
  • Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.
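
The DNS check above can be sketched as a small shell helper. This is only an illustration, not part of the cloud tooling: the function name unresolved_hosts and the host names in the usage comment are hypothetical, and it assumes the getent utility is available on the VM.

```shell
# Print every host name from the argument list that does not resolve.
# getent consults the same name-service sources (hosts file, DNS) that
# the operating system resolver uses.
unresolved_hosts() {
    local host
    for host in "$@"; do
        if ! getent hosts "$host" >/dev/null 2>&1; then
            printf '%s\n' "$host"
        fi
    done
}

# Example with hypothetical VM host names:
#   unresolved_hosts vm1.example.com vm2.example.com
```

Run it on each VM of the cluster with the host names of the other VMs. Any name it prints is not resolvable from that VM.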

Oracle Grid Infrastructure Issues

One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.

Oracle Grid Infrastructure is Down

Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. The cluster software program must be up and running on the VM Cluster for patching operations to complete. Occasionally you might need to restart the Oracle Clusterware to resolve a patching failure.

In such cases, verify the status of the Oracle Grid Infrastructure as follows:
./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

If Oracle Grid Infrastructure is down, restart it by running the following commands:
crsctl start cluster -all
crsctl check cluster
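
If you want to script this check, the following sketch tests whether all three messages shown above are present in the crsctl output. The function name cluster_stack_online is hypothetical, and the check assumes that a healthy stack prints exactly the CRS-4537, CRS-4529, and CRS-4533 lines from the example.

```shell
# Succeed only if the "crsctl check cluster" output supplied on standard
# input reports Cluster Ready Services, Cluster Synchronization Services,
# and Event Manager as online.
cluster_stack_online() {
    local output code
    output=$(cat)
    for code in CRS-4537 CRS-4529 CRS-4533; do
        printf '%s\n' "$output" | grep -q "^${code}:.*is online" || return 1
    done
    return 0
}

# Example:
#   crsctl check cluster | cluster_stack_online || echo "stack is not fully online"
```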

Oracle Databases Issues

An improper database state can lead to patching failures.

Oracle Database is Down

The database must be active and running on all the active nodes so the patching operations can be completed successfully across the cluster.

Use the following command to check the state of your database, and ensure that any problems that might have put the database in an improper state are resolved:
srvctl status database -d db_unique_name -verbose

The system returns a message including the database instance status. The instance status must be Open for the patching operation to succeed.

If the database is not running, use the following command to start it:
srvctl start database -d db_unique_name -o open
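
The status check can be scripted along the following lines. This sketch assumes that the verbose output contains one line per instance, that each line begins with "Instance", and that an open instance includes the text "Instance status: Open". The function name instances_not_open and the instance and node names in the example are hypothetical.

```shell
# Read "srvctl status database ... -verbose" output on standard input and
# print every instance line that does not report the status Open.
instances_not_open() {
    awk '/^Instance / && !/Instance status: Open/'
}

# Example:
#   srvctl status database -d db_unique_name -verbose | instances_not_open
```

If the function prints nothing, every reported instance is open.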

Obtaining Further Assistance

If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.

Collecting Cloud Tooling Logs

Collect the log files that can help Oracle Support investigate and resolve the issue.

DBAASCLI Logs

/var/opt/oracle/log/dbaascli
  • dbaascli.log
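
To hand the log directory to Oracle Support as a single file, you could archive it with tar. The following is a minimal sketch, not a replacement for the dbaascli diag collect command described below. The function name bundle_logs, the archive file name, and the default output directory /tmp are assumptions.

```shell
# Archive a log directory into a timestamped tarball and print the path
# of the archive that was created.
bundle_logs() {
    local log_dir="$1" out_dir="${2:-/tmp}" archive
    if [ ! -d "$log_dir" ]; then
        echo "no such directory: $log_dir" >&2
        return 1
    fi
    archive="$out_dir/dbaascli_logs_$(date +%Y%m%d_%H%M%S).tar.gz"
    tar -czf "$archive" -C "$(dirname "$log_dir")" "$(basename "$log_dir")" &&
        printf '%s\n' "$archive"
}

# Example:
#   bundle_logs /var/opt/oracle/log/dbaascli /tmp
```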

Collecting Oracle Diagnostics

To collect the relevant Oracle diagnostic information and logs, run the dbaascli diag collect command.

For more information about the usage of this utility, see DBAAS Tooling: Using dbaascli to Collect Cloud Tooling Logs and Perform a Cloud Tooling Health Check.

VM Operating System Update Hangs During Database Connection Drain

Description: This is an intermittent issue. During a virtual machine operating system update with 19c Grid Infrastructure and running databases, dbnodeupdate.sh waits for RHPhelper to drain the connections. The drain does not progress because of a known bug, "DBNODEUPDATE.SH HANGS IN RHPHELPER TO DRAIN SESSIONS AND SHUTDOWN INSTANCE".

Symptoms: There are two possible outcomes due to this bug:
  1. VM operating system update hangs in rhphelper
    • Hangs the automation
    • Some or none of the database connections will have drained, and some or all of the database instances will remain running.
  2. VM operating system update does not drain database connections because rhphelper crashed
    • Does not hang automation
    • Some or none of the database connection draining completes

The last entry in /var/log/cellos/dbnodeupdate.trc is similar to the following:

(ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
(trace:/u01/app/grid/crsdata/scaqak04dv0201/rhp//executeRHPDrain.150721125206.trc)

Action:
  1. Upgrade Grid Infrastructure version to 19.11 or above.

    (OR)

    Disable rhphelper before updating and re-enable it after updating.

    To disable rhphelper before the update starts:
    /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid 19.10.0.0.0 -setDrainAttributes ENABLE=false
    To re-enable rhphelper after the update completes:
    /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid oracle-home-current-version -setDrainAttributes ENABLE=true

    If you disable rhphelper, database connections are not drained before the database services and instances on a node are shut down for the operating system update.

  2. If you did not disable RHPhelper and the update is hung without progressing, then the draining of services is taking a long time. In that case:
    1. Inspect the /var/log/cellos/dbnodeupdate.trc trace file, which contains an entry similar to the following:
      (ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
      (trace: /u01/app/grid/crsdata/<nodename>/rhp//executeRHPDrain.150721125206.trc)
    2. Open the /var/log/cellos/dbnodeupdate.trc trace file.
      If rhphelper fails, then the trace file contains the following message:
      "Failed execution of RHPhelper"
      If rhphelper hangs, then the trace file contains the following message:
      (ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
    3. Identify the rhphelper processes running at the operating system level and kill them.

      Two processes have the string "rhphelper" in their command line: a Bash shell and the underlying Java program, which is the actual rhphelper process.

      rhphelper runs as root, so you must kill it as root (using sudo from the opc user).

      For example:
      [opc@<HOST> ~] pgrep -lf rhphelper
      191032 rhphelper
      191038 java

      [opc@<HOST> ~] sudo kill -KILL 191032 191038
    4. Verify that the dbnodeupdate.trc file continues to advance and that the Grid Infrastructure stack on the node is shut down.

    For more information about RHPhelper, see Using RHPhelper to Minimize Downtime During Planned Maintenance on Exadata (Doc ID 2385790.1).
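
The trace file inspection in the steps above can be sketched as a shell function. It relies only on the two messages quoted in this topic. The function name rhp_drain_state and its three result labels are hypothetical, and "draining" only means that the drain message is the most recent entry. Whether the update is actually hung still depends on the trace file no longer advancing.

```shell
# Classify the drain phase recorded in a dbnodeupdate trace file.
# Prints one of: failed | draining | other
rhp_drain_state() {
    local trc="$1"
    if grep -q 'Failed execution of RHPhelper' "$trc"; then
        echo failed
    elif tail -n 2 "$trc" | grep -q 'Executing RHPhelper to drain sessions'; then
        echo draining
    else
        echo other
    fi
}

# Example (run as a user that can read the trace file):
#   rhp_drain_state /var/log/cellos/dbnodeupdate.trc
```

If the function reports draining on two runs a few minutes apart and the modification time of the trace file has not changed, the update is likely in the hung state described above.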

Adding a VM to a VM Cluster Fails

Description: When adding a VM to a VM cluster, you might encounter the following issue:
[FATAL] [INS-32156] Installer has detected that there are non-readable files in oracle home.
CAUSE: Following files are non-readable, due to insufficient permission oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc
ACTION: Ensure the above files are readable by grid.

Cause: The installer detected a non-readable trace file, oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc, that Autonomous Health Framework (AHF) created in the Oracle home. This file causes adding a cluster VM to fail.

AHF, run as root, created the trace file with root ownership, so the grid user cannot read it.

Action: Ensure that the AHF trace files are readable by the grid user before you add VMs to a VM cluster. To fix the permission issue, run the following commands as root on all the existing VM cluster VMs:
chown grid:oinstall /u01/app/19.0.0.0/grid/srvm/admin/logging.properties
chown -R grid:oinstall /u01/app/19.0.0.0/grid/oracle.ahf*
chown -R grid:oinstall /u01/app/grid/oracle.ahf*
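
Before or after running the chown commands, you can list the files that a given user does not own. This is a minimal sketch using the standard find utility. The function name not_owned_by is hypothetical.

```shell
# List every file and directory under the given path that is not owned
# by the given user (a user name or a numeric user ID).
not_owned_by() {
    local dir="$1" owner="$2"
    find "$dir" ! -user "$owner" -print
}

# Example (run as root so that every directory can be traversed):
#   not_owned_by /u01/app/grid/oracle.ahf grid
```

If the function prints nothing after the chown commands have run, all files under that directory are owned by the grid user.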

Nodelist is not Updated for Data Guard-Enabled Databases

Description: Adding a VM to a VM cluster completes successfully. However, for Data Guard-enabled databases, the new VM is not added to the nodelist in the /var/opt/oracle/creg/<db>.ini file.

Cause: Data Guard-enabled databases are not extended to the newly added VM. Therefore, the <db>.ini file is not updated, because the database instance is not configured on the new VM.

Action: To add an instance to primary and standby databases and to the new VMs (Non-Data Guard), and to remove an instance from a Data Guard environment, see My Oracle Support note 2811352.1.

CPU Offline Scaling Fails

Description: CPU offline scaling fails with the following error:
** CPU Scale Update **An error occurred during module execution. Please refer to the log file for more information

Cause: After a VM cluster is provisioned, the /var/opt/oracle/cprops/cprops.ini file, which is automatically generated by the database as a service (DBaaS), is not updated with the common_dcs_agent_bindHost and common_dcs_agent_port parameters. This causes CPU offline scaling to fail.

Action: As the root user, manually add the following entries to the /var/opt/oracle/cprops/cprops.ini file:
common_dcs_agent_bindHost=<IP_Address>
common_dcs_agent_port=7070

Note: The common_dcs_agent_port value is always 7070.

Run the following command to get the IP address:
netstat -tunlp | grep 7070

For example:
netstat -tunlp | grep 7070
tcp 0 0 <IP address 1>:7070 0.0.0.0:* LISTEN 42092/java
tcp 0 0 <IP address 2>:7070 0.0.0.0:* LISTEN 42092/java

You can specify either of the two IP addresses, <IP address 1> or <IP address 2>, for the common_dcs_agent_bindHost parameter.
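
The manual steps above can be sketched as two shell helpers: one extracts the listening addresses from the netstat output, and one appends a property only if the file does not already define it. The function names dcs_agent_listen_ips and ensure_property are hypothetical, and the sketch assumes the netstat column layout shown in the example. Back up cprops.ini before changing it.

```shell
# Read "netstat -tunlp" output on standard input and print the local IP
# addresses that listen on TCP port 7070, one per line.
dcs_agent_listen_ips() {
    awk '$1 ~ /^tcp/ && $4 ~ /:7070$/ { sub(/:7070$/, "", $4); print $4 }'
}

# Append key=value to a properties file unless a line for that key
# already exists.
ensure_property() {
    local file="$1" key="$2" value="$3"
    grep -q "^${key}=" "$file" 2>/dev/null ||
        printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Example (as root), using the first address that netstat reports:
#   ip=$(netstat -tunlp | dcs_agent_listen_ips | head -n 1)
#   ensure_property /var/opt/oracle/cprops/cprops.ini common_dcs_agent_bindHost "$ip"
#   ensure_property /var/opt/oracle/cprops/cprops.ini common_dcs_agent_port 7070
```

Because ensure_property skips keys that are already present, running the example twice does not create duplicate entries.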