You can enable GPU metrics with the Oracle Cloud Agent High Performance Computing plugin on your instances.
| Current OCI HPC package | New OCA plugin | Description |
|---|---|---|
| oci-cn-auth | Compute HPC RDMA Authentication (oci-rdma-authentication) | Configures RDMA/RoCE network interfaces with settings such as QoS and MTU, and maintains authentication. |
| oci-hpc-mlx-configure | Compute HPC RDMA Auto-Configuration (oci-hpc-configure) | Configures Mellanox ConnectX-5 firmware and PCIe settings. |
| oci-hpc-rdma-configure | Compute HPC RDMA Auto-Configuration (oci-hpc-configure) | Configures RDMA interface IP addresses. |
| oci-hpc-dapl-configure | Compute HPC RDMA Auto-Configuration (oci-hpc-configure) | Configures the legacy MPI DAPL oci-dat.conf file. |
Note
You can transition from Python-based solutions to the Oracle Cloud Agent High Performance Computing plugin.
Enabling Compute HPC RDMA Authentication and Auto-Configuration on an Existing Instance
To enable HPC RDMA authentication and auto-configuration on a host that is running the current OCI HPC packages, follow these steps.
Note
Do not perform this workflow on a running workload. These actions can be disruptive and result in data loss.
Determine which version of Oracle Cloud Agent is installed. Version 1.35.0 or above is required. If the version is not 1.35.0 or above, contact support to obtain the installation package.
# sudo systemctl status oracle-cloud-agent
# sudo systemctl status oracle-cloud-agent-updater
On Ubuntu 20:
# sudo systemctl status snap.oracle-cloud-agent.oracle-cloud-agent.service
# sudo systemctl status snap.oracle-cloud-agent.oracle-cloud-agent-updater.service
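Once you know the installed version, the 1.35.0 minimum can be checked in a script. The following is a minimal sketch, not an Oracle-provided tool: `version_ge` is a hypothetical helper, `installed_version` is a placeholder value, and the comparison assumes GNU `sort -V` is available on the host.

```shell
# Succeeds (exit status 0) when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort), which is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical value; substitute the version reported on your instance.
installed_version="1.36.0"

if version_ge "$installed_version" "1.35.0"; then
    echo "Oracle Cloud Agent $installed_version meets the 1.35.0 minimum"
else
    echo "Oracle Cloud Agent $installed_version is older than 1.35.0; contact support"
fi
```

Version sort is used instead of a plain string comparison so that, for example, 1.9.0 is correctly treated as older than 1.35.0. How you read the installed version depends on the image; the package manager (for example `rpm -q` or `snap list` with the agent package name) is one likely source, but confirm this for your OS.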
Download the current agent configuration on the instance. See Managing Plugins for information on how to enable the plugin.
Verify that the plugin is running. It can take several minutes for the agentConfig changes to propagate to the Oracle Cloud Agent.
# ps -leaf | grep oci-rdma-authentication
Confirm that every RDMA network interface has a wpa_supplicant process running:
# ps -leaf | grep wpa_supplicant
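On hosts with many RDMA interfaces, scanning the process list by eye is error-prone. The sketch below reports interfaces that have no wpa_supplicant process. It is illustrative only: `missing_supplicants` is a hypothetical helper, the interface names and the sample listing are made up, and it assumes wpa_supplicant is started with its standard `-i <interface>` option, which may not match how the plugin launches it.

```shell
# Reads a process listing on standard input and prints every interface name
# passed as an argument that has no matching wpa_supplicant process.
# Assumption: wpa_supplicant is started with "-i <iface>" or "-i<iface>".
missing_supplicants() {
    listing=$(grep 'wpa_supplicant' | grep -v 'grep')
    for iface in "$@"; do
        if ! printf '%s\n' "$listing" | grep -Eq -e "-i ?${iface}( |\$)"; then
            echo "$iface"
        fi
    done
}

# Hypothetical listing and interface names, for illustration only.
sample='root 101 1 wpa_supplicant -B -i rdma0 -c /etc/wpa.conf
root 102 1 wpa_supplicant -B -irdma1 -c /etc/wpa.conf'

# Prints "rdma2", the only interface without a wpa_supplicant process.
printf '%s\n' "$sample" | missing_supplicants rdma0 rdma1 rdma2
```

On a real host, you would pipe `ps -leaf` into the function and pass the actual RDMA interface names as arguments; empty output means every listed interface is covered.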
Launching an Instance with the HPC RDMA Authentication Plug-in Enabled
If the custom image has Oracle Cloud Agent 1.35.0 or above and the OCI HPC packages are not present, use LaunchInstanceDetails to apply the agentConfig with the plug-in enabled. The OS must have the NVIDIA GPU drivers and Mellanox OFED drivers installed.
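As an illustration of what the agentConfig portion of a launch request might contain, the sketch below prints a JSON fragment that enables the plug-ins by name. The `pluginsConfig`, `name`, and `desiredState` field names reflect the agentConfig structure as commonly documented for the Compute API, but treat them as an assumption and verify them against the current API reference. `agent_config_json` is a hypothetical helper.

```shell
# Prints an agentConfig JSON fragment that enables each plug-in named in the
# arguments. Field names are an assumption; verify against the API reference.
agent_config_json() {
    printf '{"pluginsConfig": ['
    sep=''
    for name in "$@"; do
        printf '%s{"name": "%s", "desiredState": "ENABLED"}' "$sep" "$name"
        sep=', '
    done
    printf ']}\n'
}

agent_config_json "Compute HPC RDMA Authentication" "Compute HPC RDMA Auto-Configuration"
```

The plug-in names are the ones listed in the table of new OCA plugins. The helper does not escape quotes inside names, which is acceptable here because the names are fixed strings.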
Oracle Cloud Agent 1.35.0 adds functionality to monitor RDMA and GPU. To enable this functionality on an existing instance, do the following:
Download the current agent configuration on the instance. The sections below are only one way of enabling the plug-in. For more information, see Oracle Cloud Agent.
If you use a private VPN, you need Service Gateway. If you use a public internet gateway, Service Gateway is not required.
For information on how to use the Monitoring service, see Securing Monitoring.
Create a dynamic group
This example creates a group that contains all instances in a specific compartment.
Any {instance.compartment.id = '<compartment_ocid>'}
Create a policy
Create a policy that uses the dynamic group to allow instances to publish metrics. The HPC monitoring plug-in creates two custom namespaces, both of which are billed:
gpu_infrastructure_health
rdma_infrastructure_health
Allow dynamic-group <group_name> to use metrics in compartment <compartment_name> where target.metrics.namespace='<metric_namespace>'
Allow dynamic-group <group_name> to read metrics in compartment <compartment_name>
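Because the plug-in uses two namespaces, the `use metrics` statement is needed once per namespace. The following sketch generates both statements from the template above. `metrics_policy` is a hypothetical helper, and `hpc-instances` and `hpc-compartment` are placeholder names to replace with your own dynamic group and compartment.

```shell
# Prints a "use metrics" policy statement for one dynamic group, compartment,
# and metric namespace, following the template shown in this section.
metrics_policy() {
    printf "Allow dynamic-group %s to use metrics in compartment %s where target.metrics.namespace='%s'\n" "$1" "$2" "$3"
}

# Placeholder group and compartment names; the namespaces come from this section.
for ns in gpu_infrastructure_health rdma_infrastructure_health; do
    metrics_policy hpc-instances hpc-compartment "$ns"
done
```

The generated statements can then be pasted into the policy, together with the `read metrics` statement.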
For information on how to publish custom metrics to the Monitoring service, see Publishing Custom Metrics.