Roving Edge Infrastructure Troubleshooting

Use troubleshooting information to identify and address common issues that can occur while working with Roving Edge Infrastructure.

General

Getting Oracle Support

If after reviewing and using these troubleshooting tips you still need help, open a service request for your issue. See Open a Support Ticket for more information.

Device is locked again

Roving Edge Infrastructure devices require that you unlock them after every reboot and power cycle. If a Roving Edge Device (RED) is unexpectedly locked, verify that the power connection is steady and check whether the device recently restarted.

Exiting the Serial Console

Disconnect the Serial Console by inputting the following in sequential order: Enter, ~ (tilde), . (period).

Device Console URL gives "unavailable" or "not trusted" message

The Device Console communicates with TLS/HTTPS on port 8015 of each Roving Edge Infrastructure device. When your browser displays a security warning indicating the URL is unavailable or is not a trusted URL, ensure that the TLS certificate is installed and trusted on your host computer.

If the Device Console's TLS certificate is not installed and trusted on your host computer, add the TLS certificate from the Device Console using the browser to your host computer's keychain/certificate collection and mark it as trusted. In browsers such as Chrome, Edge, and Firefox, the TLS certificate resides in the browser window to the left of the URL. Consult your particular browser's documentation for more information on how to download the certificate.

An "unavailable" or "not trusted" message might also occur if the system is partially down. Examples include when rebooting for a system upgrade or starting for the first time after a power outage. To help diagnose whether the issue is related to the TLS certificate or a system outage, check the response from the https://<host>:12060/v1/tenants/orei endpoint in your browser or with a tool like curl. If accessing that endpoint results in a security warning, check that the Roving Edge Infrastructure device's TLS certificate is properly installed and trusted. If the endpoint times out or returns a non-200 response, the system might be experiencing a partial outage.
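The distinction above can be sketched as a small shell helper. This is an illustrative sketch only, not an Oracle-provided tool: the function names classify_probe and probe_device are hypothetical, and the mapping relies on curl's documented exit codes (60 for a certificate that cannot be verified, 28 for a timeout, 7 for a failed connection).

```shell
# Sketch only: classify a probe of the endpoint named in the text.
# classify_probe and probe_device are hypothetical helper names.
classify_probe() {
  # $1 = curl exit status, $2 = HTTP status code (000 if no response)
  if [ "$1" -eq 60 ]; then
    echo "certificate-not-trusted"   # curl exit 60: certificate could not be verified
  elif [ "$1" -eq 28 ] || [ "$1" -eq 7 ]; then
    echo "possible-partial-outage"   # 28: timed out, 7: could not connect
  elif [ "$1" -eq 0 ] && [ "$2" = "200" ]; then
    echo "healthy"
  elif [ "$1" -eq 0 ]; then
    echo "possible-partial-outage"   # a response arrived, but it was not 200
  else
    echo "unknown"
  fi
}

probe_device() {
  # $1 = device host name or IP address
  code=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "https://$1:12060/v1/tenants/orei")
  classify_probe "$?" "$code"
}
```

For example, running probe_device with your device's host name prints one of the labels above. Treat the result as a hint about where to look next, not as a definitive diagnosis.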

Browser Security Warning When Accessing the Device Console

The Device Console communicates with TLS/HTTPS on port 8015 of a given device. When your browser displays a security warning while you access the Device Console, ensure that the device's TLS certificate is installed and trusted on your host computer. If the Device Console's TLS certificate is not installed and trusted on the host computer, add the TLS certificate from the Device Console in the browser to the host computer's keychain/certificate collection. Then mark it as trusted. In browsers such as Chrome, Edge, and Firefox, the TLS certificate resides in the browser window to the left of the URL. Consult your browser documentation for more information on how to download the certificate.

"Service unknown" when creating policies for "service rover"

If you get the error "Service unknown" when creating policies for "service rover," you might need to create a child tenancy in Oracle Cloud Infrastructure. See Creating a New Child Tenancy in the Oracle Cloud Infrastructure documentation for more information on this feature.

System Upgrade

System Upgrade loading icon keeps spinning

The System Upgrade tool persists in its loading state until a timeout occurs, after which it indicates that the system upgrade status cannot be determined. This timeout occurs most often when the REDs are disconnected from the internet. The System Upgrade requires a connection to OCI to determine whether an upgrade for the RED is available.

If your device is disconnected from the internet, you can update your device using the disconnected upgrade process. See Updating the Roving Edge Infrastructure Device Software while Disconnected for more information.

System Upgrade bundle download process fails

Check your internet connection and press Download Upgrade to attempt the download. If the download is not successful after multiple attempts, reach out to Oracle support for help.

Networking

IP address range for the public IP pool configuration does not get submitted

After typing an IP range and pressing Enter, press Enter again on the blank input line to submit. If more IP ranges are required, press Enter after each range to open another line of input. Submit a blank input line as the last entry to submit everything. To cancel and go back, press Ctrl+C.

Cannot access public service endpoints (169.254.169.254 at ports 8015, 18336, and so forth)

Ensure that the firewall on the instance does not block the 169.254.0.0/16 address range. It is common for an OCI-exported image to block the link-local address range by default. If so, remove the rule that is blocking connections to 169.254.0.0/16 from the firewall settings. Consult your operating system documentation regarding the firewall configuration procedure.
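As a rough illustration, the following sketch filters a firewall rule listing for rules that reject or drop traffic involving link-local (169.254.x.x) addresses. It assumes the instance uses iptables, which might not be true for your image (some use nftables or firewalld instead). The function name find_blocking_link_local_rules is hypothetical.

```shell
# Sketch only: read a rule listing on standard input and print the rules
# that mention a 169.254.x.x address and end in a DROP or REJECT target.
find_blocking_link_local_rules() {
  grep -E '169\.254\.' | grep -E -e '-j (DROP|REJECT)' || true
}
```

On an instance that uses iptables, you might run sudo iptables -S | find_blocking_link_local_rules. Any rule that is printed is a candidate for review. Whether to remove it depends on your image and security requirements.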

Storage

Lack of available storage space causes block volume operations to fail

Lack of available storage space might cause block storage operations to fail. Free up space by deleting resources that are no longer needed, such as Object Storage objects, boot and block volumes, and VMs. Regularly check your REDs' available storage to ensure you are not at risk of running out. See Roving Edge Infrastructure Device Monitoring for more information.

Low object storage available capacity triggers warnings and read-only

When the system reaches 80% capacity used, it triggers a Warning status in the Monitoring page. When the system reaches 95% capacity used, it enters read-only mode, and the Monitoring page shows Object Storage status as Degraded or Warning.

Oracle recommends avoiding running intensive writing operations when the system is functioning at 80% capacity used. If you are at or close to 80%, transfer data to the OCI cloud until the system is well below 80% capacity.

If the system exceeds the 95% capacity-used threshold, it enters read-only mode, and core functionality (including Compute and Object Storage) is limited. All Compute operations, such as those involving custom VMs, boot volumes, and block volumes, and all Object Storage operations are suspended. This suspension prevents you from writing to a storage device when durability and redundancy cannot be guaranteed.

If no available storage space remains on the device, you can free more space by deleting resources that are no longer needed, such as objects in Object Storage, boot and block volumes, and VMs. If delete requests fail because no storage space remains and the system is in read-only mode, you can activate Safe Mode through the Serial Console. Safe Mode permits you to make the necessary deletions.
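The thresholds in this section can be summarized in a short sketch. The function name storage_state is hypothetical, and treating the exact 80% and 95% values as already over the threshold is an assumption. The Monitoring page remains the authoritative source for the device's status.

```shell
# Sketch only: map a whole-number "capacity used" percentage to the
# behavior described in this section.
storage_state() {
  # $1 = percentage of capacity used (integer)
  if [ "$1" -ge 95 ]; then
    echo "read-only"   # Compute and Object Storage operations are suspended
  elif [ "$1" -ge 80 ]; then
    echo "warning"     # avoid write-intensive operations; transfer data to OCI
  else
    echo "ok"
  fi
}
```

For example, storage_state 85 prints warning, which suggests moving data off the device before starting any write-heavy work.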

Avoiding oversubscribing storage problems

Follow best practice recommendations on how to set up or plan Compute, Block Storage, and Object Storage resource consumption to avoid oversubscription problems. Block Storage and Compute do not reserve storage space for volumes in advance. Instead, storage space is consumed when data is written to the volume. For example, if a 100 GB block volume is created, it does not mean that 100 GB is reserved from the total available storage space for this volume. The storage space remains available to all services and can be exhausted before the 100 GB volume is filled with data.

Also, Compute and Block Storage do not validate the specified size of a created volume against the available storage space. This lack of validation can lead to oversubscription when the total size of created volumes exceeds the storage space available on the device. Do not rely on block volume sizes to calculate storage space utilization. Instead, follow the information about storage space usage displayed in the Device Console's Monitoring page.
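To make the oversubscription risk concrete, the following sketch adds up the provisioned sizes of your volumes and compares the total against a capacity figure that you supply. Both function names are hypothetical, and the capacity value is whatever the Monitoring page reports for your device. The result indicates planning risk only. It does not measure actual storage usage.

```shell
# Sketch only: sum provisioned volume sizes (in GB, one per line on
# standard input) and print the total.
provisioned_total() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Succeeds when the provisioned total exceeds the device capacity.
is_oversubscribed() {
  # $1 = total provisioned GB, $2 = device capacity in GB
  [ "$1" -gt "$2" ]
}
```

For example, three volumes of 100 GB, 250 GB, and 50 GB total 400 GB of provisioned space. On a device with 300 GB available, those volumes could not all be filled, even though each was created without error.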

Monitoring page shows Object Storage status as "Degraded" or "Warning"

If the storage function within a RED malfunctions or has physical problems, the Monitoring page on the Device Console might periodically show the Warning or Degraded status for the Object Storage service. If this situation occurs, the RED attempts to rebalance its storage and recover the declared redundancy level. Eventually it shows a healthy state if the RED has available space and is able to recover enough redundant copies on the remaining devices being used for storage.

Image import from Object Storage to Compute is taking a long time

If an image does not appear in the Custom Images list, the import has failed. If the import fails, check the device node's Details page:

  1. Open the navigation menu and select Node Management > Nodes. The Nodes page appears, displaying the service and feature status of all your Roving Edge Infrastructure devices in tabular format.

  2. Select the node whose status you want to monitor and view its Details page.

  3. Click the Storage tab and review what percentage of the device's storage has been used.

If the Object Storage service is not healthy, the Monitoring page displays Degraded or Warning as the status. If Object Storage is healthy, check the Monitoring page to ensure that enough available space exists. If insufficient space is available, remove any images, objects, VMs, and other items to make room for the wanted image.

Objects with certain version IDs can cause problems

Running a CLI command where the object's version ID starts with a dash ("-") and contains the characters h or i causes the CLI to enter interactive mode. For example:

    oci os object get ... --version-id '-WhjCQ.-IYgDLuoZ9gbxpn.8Q.q-iZt' ...

If this occurs, you can use one of the following workarounds:

  • Include the equal sign ("=") between the --version-id parameter and its value. Do not include any spaces before or after the =. For example:

    oci os object get ... --version-id="-WhjCQ.-IYgDLuoZ9gbxpn.8Q.q-iZt" ...

    Use only double quotes around the value.

  • Include the --from-json parameter in the command and specify the input in a JSON format. See Advanced JSON Options for more information.

Compute/Instances

Instance creation attempt results in "Out of Capacity" message

Instance capacity is limited by the number of available cores and available memory. Terminate some of the existing instances that are not in use and try again. Stopped instances count toward the resources used.

Image Import Failure

Large images take a while to import, and much longer if other disk-heavy applications or operations are ongoing. If an import is taking too long and you want to end it, select Terminate from the Import menu. An image import automatically times out and cancels after four hours.

VM launches into Running state, but loops on boot messages upon connection

Roving Edge Infrastructure only supports .oci and .qcow2 images, with UEFI booting. To check for image-related problems, open the Device Console and go to the Details page of the compute VM instance. Check whether the image format is .oci, .qcow2, or another type. Images exported from OCI cloud are usually .oci type. Confirm the image and boot type with the provider of the image.

On a Linux machine, use the qemu-img utility to see image info using the following command:

qemu-img info image_file
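If you want to check the format from a script, the following sketch extracts the value of the "file format:" line that qemu-img info prints in its default human-readable output. The function name image_format is hypothetical. Note that this reports only the container format, so you still need to confirm UEFI boot support with the image provider.

```shell
# Sketch only: read `qemu-img info` output on standard input and print
# the value of the "file format:" line (for example, qcow2 or raw).
image_format() {
  sed -n 's/^file format: //p' | head -n 1
}
```

For example, qemu-img info image_file | image_format prints qcow2 for a QCOW2 image.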

Cannot Access External Resource from VM Instance

  1. If the domain name is referencing an external resource, ensure that external DNS resolvers are added to the list of nameservers inside the instance. Consult your operating system documentation regarding DNS configuration procedure.

    For example, on some Linux-based systems, nameserver IPs need to be added to the /etc/resolv.conf file.

  2. Ensure that the RED external connectivity settings are correct. See Setting Up Devices.

  3. Ensure that the instance firewall settings do not block outgoing connections. Consult your operating system documentation regarding firewall configuration procedures.

Cannot Connect to VM Instance Using SSH

  1. Ensure that the VM instance is running. Open the Device Console and check the Details page of the compute VM instance to ensure that the instance state is RUNNING. If the instance is not running, select Start to launch the instance. Wait for the state to change to RUNNING.

  2. Ensure that the VM instance has a public IP address assigned. Open the Device Console and go to the Details page of the compute VM instance. Click the instance name and verify that the instance has a public IP assigned by reviewing the Public IP Address value under the Instance Access section.

    If the instance does not have a public IP assigned, add one using the following steps:

    1. Open the Details page of the compute VM instance.

    2. Click Attached VNICs under Resources to display the list of attached VNICs.

    3. Select the primary VNIC.

    4. The Details page for the primary VNIC appears.

    5. Click Edit.

      Alternatively, select the Actions menu for the VNIC you want to edit and click Edit.

      The Edit VNIC dialog box appears.

    6. Select the Ephemeral Public IP option.

    7. Click Update.

    8. If public IP assignment fails, open the Serial Console and select Network Configuration to ensure that the RED's public IP address pool is set up and has IPs available.

  3. Ensure that RED external connectivity settings are correct. Open the Serial Console and select Configuring Devices. Ensure the RED's IP address, network prefix length, and gateway IP address are set up correctly.

  4. Ensure that the VM instance is reachable through ICMP requests. Run the following command:

    ping 100.100.1.10

    where 100.100.1.10 is the target instance's public IP address. If the command is successful, the problem might be with the instance configuration (SSH service, firewall rules). Consult your operating system documentation regarding SSH and firewall setup for more information.

  5. Ensure that the VM instance has started correctly. If running the ping 100.100.1.10 command is not successful, check the instance console history to look for a successful start sequence. See Console History Capture for Roving Edge Infrastructure.

  6. Reboot the node using the device's power button or through the Serial Console.
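The ping-based branch in steps 4 and 5 can be sketched as follows. The function name check_instance_reachable and the PING_CMD override are hypothetical, and the -c option assumes a Linux or macOS ping. The sketch only tells you which step to follow next.

```shell
# Sketch only: ping the instance's public IP address and report which
# troubleshooting step applies.
check_instance_reachable() {
  # $1 = instance public IP address
  # PING_CMD is a hypothetical override, useful for dry runs.
  if ${PING_CMD:-ping} -c 3 "$1" >/dev/null 2>&1; then
    echo "reachable"     # step 4: review the SSH service and firewall rules
  else
    echo "unreachable"   # step 5: review the instance console history
  fi
}
```

For example, check_instance_reachable 100.100.1.10 prints reachable when the instance answers ICMP requests. A result of unreachable can also mean that ICMP is blocked, so confirm with the console history before rebooting the node.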

Cannot Access a Port on a VM Instance From the External Machine

  1. Ensure that RED external connectivity settings are correct. See Setting Up Devices.

  2. Ensure that instance firewall settings do not block incoming connections. Consult your operating system documentation regarding firewall configuration procedure.

  3. Ensure that you are accessing the VM instance by its public IP address, not by its private IP address or fully qualified domain name (FQDN). The instance's private IP address is visible only inside the VCN subnet. The instance's FQDN is visible only when the default VCN internal DNS service is being used (169.254.169.254), which is not accessible outside of the VCN network.

Cannot access VM instance from another instance

  1. Ensure that the target instance is running. Open the Device Console and check the Details page of the compute VM instance to ensure that the target instance state is Running.

  2. Ensure that the request-sending instance has its network configuration, such as IP address, network mask, and gateway, set up correctly. Follow the subnet settings guidelines when performing the configuration. Consult your operating system documentation regarding network configuration for more information.

    On Linux-based systems, verify the setup using the following commands:

    ip addr show
    ip route show default

  3. Ensure that the target instance firewall settings do not block incoming connections. Consult your operating system documentation regarding firewall configuration procedure for more information.

  4. Ensure that the request-sending instance firewall settings do not block outgoing connections. Consult your operating system documentation regarding firewall configuration procedure.

  5. If ICMP is not blocked on the target instance, ensure the ping command is successful. Run the following command from the request-sending instance shell:

    ping 10.0.0.2

    where 10.0.0.2 is the target instance's private IP.

  6. If the ping command result is No route to host, ensure that the default route is set to the subnet gateway. Consult your operating system documentation regarding default route settings. For example, for Linux-based operating systems, the command might be:

    ip route show default

    with the expected output:

    default via 10.0.0.1 dev eth0

    where 10.0.0.1 is the gateway IP address of the 10.0.0.0/24 subnet (the VCN subnet gateway always uses the first address in the subnet range).
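The default route check in step 6 can be scripted on Linux-based instances. The following sketch extracts the gateway from the output of ip route show default and compares it with the gateway that you expect. Both function names are hypothetical.

```shell
# Sketch only: read `ip route show default` output on standard input and
# print the gateway address of the first default route.
default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Succeeds when the default route's gateway equals the expected address.
gateway_matches() {
  # $1 = expected gateway IP address; route output arrives on standard input
  [ "$(default_gateway)" = "$1" ]
}
```

For example, ip route show default | gateway_matches 10.0.0.1 succeeds when the default route points at the 10.0.0.0/24 subnet gateway used in the example above.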

Cannot access another VM instance by fully qualified domain name

Ensure that the target instance is running. Open the Device Console and check the Details page of the compute VM instance to ensure that the target instance state is Running. If the target instance is Stopped, start it. Then confirm that the request-sending instance has 169.254.169.254 set as a nameserver. Consult your operating system documentation regarding the DNS configuration procedure.
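On Linux-based instances that read their resolvers from a resolv.conf-style file, the nameserver check can be sketched as follows. The function name has_nameserver is hypothetical. Some distributions manage DNS through a local stub resolver instead, in which case /etc/resolv.conf might not list the upstream nameserver directly.

```shell
# Sketch only: succeed when the resolver file lists the given address
# on a "nameserver" line.
has_nameserver() {
  # $1 = resolver configuration file, $2 = nameserver IP address
  awk -v ip="$2" '$1 == "nameserver" && $2 == ip { found = 1 } END { exit !found }' "$1"
}
```

For example, has_nameserver /etc/resolv.conf 169.254.169.254 && echo configured prints configured when the VCN internal DNS service is listed.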

VM launches, but there's no public IP address to connect to using SSH

When creating a VM instance, select the Assign a public IP address option. Ensure that the public IP pool specified during device setup (using the Serial Console) has enough addresses for the number of VMs (including ones in Stopped state). If not enough addresses exist, terminate some VMs to free up addresses, or create more public IPs using the Serial Console.

VM creation goes right to Terminating state

This is likely because of one of the following:

  • Lack of public IPs: This can occur because the public IP pool has not been set up in the Serial Console, or because the pool has run out of IPs. Check that the RED's public IP pool range has been set (if creating a VM with the default option of public IP):

    1. Open the Serial Console.

    2. Select Configure Networking (option 3).

    3. Select Display Public IP Pool Status (option 4).

    If the public IP pool has not been set, go back and select Public IP Pool Range for Compute Instances. Follow the displayed instructions to input public IP ranges. The Serial Console includes a usage guide for more information.

  • Full Ceph object/block storage: The inability to allocate space for the VM's boot volume can cause the VM to enter the Terminating state. Ensure that the object/block storage is not full by checking the top of the Monitoring page in the Device Console.

  • Full CPU usage: A maximum of 32 OCPUs in total is available across all VMs, including OCPUs allocated to stopped VMs. On the Device Console's Compute page, ensure that the total OCPU count of existing VMs is less than the maximum of 32. If all 32 OCPUs are being used, terminate some instances to free up resources.

  • Full GPU usage: A RED can have only a single GPU-shape VM provisioned at a time, and a stopped GPU-shape VM counts toward this limit. Attempts to create more GPU-shape instances terminate during provisioning. On the Device Console's Compute page, ensure that there are no instances with a GPU shape in the Running or Stopped state. If a GPU-shape instance exists, terminate it.

  • Invalid image: Roving Edge Infrastructure only supports .oci and .qcow2 image formats, with UEFI booting. On the Device Console's Compute page, open the Instances section and determine which VM is terminating. Click the terminating VM to open its Details page, where you can note the image name. The image name and extension indicate whether it is .oci, .qcow2, or another type. Images exported from OCI cloud are usually .oci type. Verify the image and boot type with the person who provided the image.

    On a Linux machine, use the qemu-img utility to see image info using the following command:

    qemu-img info image_file
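For the OCPU limit in the list above, a quick tally can show how much headroom remains before you create another VM. The following sketch assumes you have copied the OCPU count of each existing VM (running or stopped) from the Compute page, one count per line. The function name ocpu_headroom is hypothetical.

```shell
# Sketch only: read one OCPU count per line on standard input and print
# how many of the 32 OCPUs remain unallocated.
ocpu_headroom() {
  awk -v max=32 '{ used += $1 } END { print max - used }'
}
```

For example, three VMs with 8, 16, and 4 OCPUs leave a headroom of 4, so a new VM with a shape larger than 4 OCPUs would not fit until another instance is terminated.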

Slow VM performance or slow terminal usage using SSH

Slow RED performance can result when other VMs are experiencing heavy usage, such as those running disk- or network-intensive applications. Resource-heavy device operations, such as importing large object storage contents or compute images, can also degrade performance. If you are working with an intensive application, use a VM shape with a higher OCPU count, because such shapes also come with more RAM. Stop or terminate the current VM, then create another VM using the same image, but with the bigger shape.

Your VM launches into Running state, but SSH rejects your key, refuses the connection, or times out

If you launch a VM whose state is listed as Running, but SSH rejects your key, refuses the connection, or times out, try the following:

  • Ensure that you are trying to connect to the VM's public IP address using SSH.

  • Ensure that you are using the private key (not public) as part of the SSH command on your host computer.

  • Give the VM a minute or longer to fully launch. Providing this time allows the SSH service to load. Then try again to connect.

  • In rare cases, if the image you uploaded or imported already contains public user SSH keys, the new keys uploaded or copied and pasted as part of the VM creation process might not be included. Take a snapshot of the original image with the wanted keys added, and use that modified image.

VM instance stuck for a long time

Provisioning of certain images and resources, such as boot volumes, GPU, and bigger shapes, can take 10 minutes or more. If a VM instance has been stuck for a long time, do the following:

  1. Access the Device Console and open the Details page for the instance.

  2. Review the Attached Block Volumes and Attached VNICs sections, and note any resources stuck in Attaching or Detaching state.

  3. If any block volumes or VNICs are seen stuck in attaching/detaching state, check the Monitoring page to see if Block Storage and VCN services are healthy.

    • If used storage space is nearly full, there might not be enough capacity to provision an instance. Consider terminating other instances, removing block volumes, or both to free up space.

    • If the public IP pool is used up, provisioning a new instance with a public IP (specified by default) is not possible. Either terminate existing instances to free up IPs, or add public IPs using the Serial Console.

  4. Review the Monitoring page for any other services that are unhealthy.

If the solutions listed here do not solve the issue, consider terminating the instance.

Stuck instances are cleared out automatically after a few hours. If they are not, you might need to terminate them manually.

Data Synchronization

Create Task Fails with Error "Same or Circular Task Exists"

Data Sync tasks are uni-directional and are sensitive to circular references. You cannot set up a bi-directional sync using two tasks and the same object storage buckets used by OCI and the REDs. Ensure that the task you are creating does not attempt to reverse the sync direction of a previously created task. If it does, modify one of the tasks so that it does not reverse the direction of the other.

Tasks are specified, but sync operations do not start

Data Sync requires that you assign a connection for each RED to an OCI cloud location where you want the data sync operations to occur. Check the OCI status page to see if OCI services are running. If network or object storage issues occur, resolve these issues before attempting to run or schedule a data sync. Next, verify connectivity between Roving Edge Infrastructure and OCI by pinging OCI from the host machine. If pinging OCI does not work, verify that no firewall or network rules are blocking connectivity.

If you create a Data Sync task job for synchronizing a bucket from RED-to-OCI or OCI-to-RED and its estimated runtime is more than 12 hours, the job fails at the 12-hour mark because the authentication token expires every 12 hours. If the Data Sync job fails after running more than 12 hours, do the following:

  1. Open the navigation menu and select Data Sync.

    The Data Sync Tasks page appears. All data sync tasks are listed in tabular form.

  2. Check the data sync task that failed.

  3. Click Start.

    Alternatively, select the Actions menu for the data sync task that you checked and click Start.

  4. Confirm the start when prompted.
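To anticipate the 12-hour token expiry before you start a large sync, you can estimate the runtime from the bucket size and the throughput you expect on your link. The following sketch is a rough planning aid only. Both function names are hypothetical, the throughput figure is an assumption you must supply, and actual Data Sync runtimes depend on factors this estimate ignores.

```shell
# Sketch only: estimate sync duration in hours.
sync_hours() {
  # $1 = data size in GB, $2 = assumed sustained throughput in MB/s
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", (gb * 1024) / mbps / 3600 }'
}

# Succeeds when the estimate is longer than the 12-hour token lifetime.
exceeds_token_lifetime() {
  # $1 = data size in GB, $2 = assumed sustained throughput in MB/s
  awk -v gb="$1" -v mbps="$2" 'BEGIN { exit !((gb * 1024) / mbps / 3600 > 12) }'
}
```

For example, sync_hours 1000 50 prints 5.7, which fits within the token lifetime, while 5000 GB at the same rate is estimated at more than 28 hours and would need to be restarted using the steps above.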