Overview of Metrics for an Instance and Related Resources
This section gives an overall picture of the different types of metrics available for an instance and its storage and network devices. See the following diagram and table for a summary.
| Metric Namespace | Resource ID | Where Measured | Available Metrics |
|---|---|---|---|
| oci_computeagent | Instance OCID | On the instance. The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs. | |
IAM policies: To monitor resources, you must be granted the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services as well as the resources being monitored. If you try to perform an action and get a message that you don't have permission or are unauthorized, contact the administrator to find out what type of access you were granted and which compartment you need to work in. For more information about user authorizations for monitoring, see IAM Policies.
Metrics exist in Monitoring: The resources that you want to monitor must emit metrics to the Monitoring service.
Compute instances: To emit metrics, the Compute Instance Monitoring plugin must be enabled on the instance, and plugins must be running. The instance must also have either a service gateway or a public IP address to send metrics to the Monitoring service. For more information, see Enabling Monitoring for Compute Instances.
Available Metrics: oci_computeagent
The compute instance metrics help you measure activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace oci_computeagent, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
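The one-minute aggregation described above can be illustrated with a small sketch. Only the ten-second sampling interval and the six-point batch come from the text; the function name `aggregate_minute` and the sample values are hypothetical and chosen purely for illustration.

```python
from statistics import mean

SAMPLE_INTERVAL_SECONDS = 10
# Six data points are collected per one-minute batch (60 / 10).
SAMPLES_PER_MINUTE = 60 // SAMPLE_INTERVAL_SECONDS


def aggregate_minute(samples):
    """Return the count, sum, and mean of one one-minute batch of samples."""
    if len(samples) != SAMPLES_PER_MINUTE:
        raise ValueError(
            f"expected {SAMPLES_PER_MINUTE} samples, got {len(samples)}"
        )
    return {"count": len(samples), "sum": sum(samples), "mean": mean(samples)}


# Hypothetical values sampled every ten seconds within one minute.
batch = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
result = aggregate_minute(batch)
print(result)
```

At coarser granularities the count would presumably scale with the number of minutes covered, but that is an inference, not something the text states.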
You can also use the Monitoring service to create custom queries.
Each metric includes the following dimensions:
availabilityDomain: The availability domain where the instance resides.
Network transmission throughput. Expressed as bytes transmitted.
1. This metric is a cumulative counter that shows monotonically increasing behavior for each session of the Oracle Cloud Agent software, resetting when the operating system is restarted.
2. The Networking service provides more metrics (in the oci_vcn metric namespace) for each VNIC on the instance. For more information, see Networking Metrics.
3. The Block Volume service provides more metrics (in the oci_blockstore metric namespace) for each volume attached to the instance. For more information, see Block Volume Metrics.
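Because the first footnote describes a cumulative counter that resets when the operating system restarts, raw readings usually need to be converted to per-interval increases before they are charted or summed. The following is a minimal sketch of one way to do that. The function name `counter_deltas` and the readings are hypothetical, and the sketch assumes the counter restarts from zero after a reset, which the text does not state explicitly.

```python
def counter_deltas(readings):
    """Convert cumulative counter readings into per-interval increases.

    A reading lower than its predecessor is treated as a counter reset.
    ASSUMPTION: the counter restarts from zero, so the new reading itself
    is taken as the increase since the reset.
    """
    deltas = []
    for previous, current in zip(readings, readings[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            deltas.append(current)  # counter restarted
    return deltas


# Hypothetical readings; the drop from 400 to 30 represents an OS restart.
readings = [100, 250, 400, 30, 90]
deltas = counter_deltas(readings)
print(deltas)  # [150, 150, 30, 60]
```

Any increase that occurred between the last reading before the restart and the restart itself is lost with this approach.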
Available Metrics: gpu_infrastructure_health
The compute instance metrics help you measure the activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace gpu_infrastructure_health, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
You can also use the Monitoring service to create custom queries.
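As a rough illustration of a custom query, the sketch below assembles a Monitoring Query Language (MQL) expression for a metric in this namespace, such as GpuUtilization. The helper `build_query` is hypothetical, and the OCID is a placeholder. The expression follows the general MQL pattern of metric, interval, optional dimension filter, and statistic, so check it against the MQL reference before relying on it.

```python
def build_query(metric, interval="1m", statistic="mean", dimensions=None):
    """Build an MQL expression string: metric[interval]{filters}.statistic()."""
    selector = ""
    if dimensions:
        # Sort the dimension names so the generated string is deterministic.
        pairs = ", ".join(
            f'{name} = "{value}"' for name, value in sorted(dimensions.items())
        )
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}()"


# Placeholder OCID; substitute the OCID of a real instance.
query = build_query(
    "GpuUtilization",
    dimensions={"resourceId": "ocid1.instance.oc1..example"},
)
print(query)
```

The metric namespace (here, gpu_infrastructure_health) is supplied separately from the query text when the query is submitted.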
Each metric includes the following dimensions:
component: GPU or rdma_nic
timestamp: UTC time when the payload or heartbeat is emitted
version: The payload version number, for compatibility
| Metric | Metric Display Name | Unit | Description |
|---|---|---|---|
| GpuUtilization | GPU utilization | percent | Activity level of the GPU. Expressed as a percentage of total time. For instance pools, the value is averaged across all instances in the pool. |
| GpuMemoryUtilization | GPU memory utilization | percent | The percentage of the GPU memory resource in use. |
| GpuPowerDraw | GPU power draw | integer | The amount of GPU power used. |
| GpuTemperature | GPU temperature | integer | The GPU temperature reported. |
| GpuEccSingleBitErrors | GPU single-bit errors | integer | The number of GPU single-bit ECC errors reported. |
| GpuEccDoubleBitErrors | GPU double-bit errors | integer | The number of GPU double-bit ECC errors reported. |

Dimensions: availabilityDomain, faultDomain, gpuId, imageId, instancePoolId, region, resourceDisplayName, resourceId, shape
Fault Metrics: gpu_infrastructure_health
| Metric | Metric Display Name | Unit | Description |
|---|---|---|---|
| Fault | GPU fault | count | If the value is 0, there are no faults. If the value is 1, faults are detected. |

Dimensions: availabilityDomain, faultCode, faultDomain, gpuId, imageId, instancePoolId, pcieAddress, region, resourceDisplayName, resourceId, shape
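Since the Fault metric is 0 when healthy and 1 when a fault is detected, and each data point carries gpuId and faultCode dimensions, a consumer can group faults by GPU. The sketch below shows one possible approach. The dictionary layout of each data point, the function name `faulty_gpus`, and the fault code strings are all hypothetical; the real payload shape depends on how you retrieve the metric data.

```python
from collections import defaultdict


def faulty_gpus(datapoints):
    """Map each gpuId to the sorted fault codes whose Fault value is nonzero."""
    faults = defaultdict(set)
    for point in datapoints:
        # A value of 0 means no fault; anything above 0 is treated as a fault.
        if point["value"] > 0:
            dims = point["dimensions"]
            faults[dims["gpuId"]].add(dims["faultCode"])
    return {gpu: sorted(codes) for gpu, codes in faults.items()}


# Hypothetical data points; the fault codes are placeholders, not real codes.
datapoints = [
    {"value": 0, "dimensions": {"gpuId": "0", "faultCode": "EXAMPLE-A"}},
    {"value": 1, "dimensions": {"gpuId": "1", "faultCode": "EXAMPLE-B"}},
    {"value": 1, "dimensions": {"gpuId": "1", "faultCode": "EXAMPLE-A"}},
]
report = faulty_gpus(datapoints)
print(report)  # {'1': ['EXAMPLE-A', 'EXAMPLE-B']}
```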
Available Metrics: rdma_infrastructure_health
The compute instance metrics help you measure activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace rdma_infrastructure_health, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
You can also use the Monitoring service to create custom queries.
Each metric includes the following dimensions:
component: GPU or rdma_nic
timestamp: UTC time when the payload or heartbeat is emitted
version: The payload version number, for compatibility
| Metric | Metric Display Name | Unit | Description |
|---|---|---|---|
| RdmaTxBytes | RDMA aggregate network transmit bytes | bytes | The bytes transmitted on the RDMA interface. |
| RdmaRxBytes | RDMA aggregate network receive bytes | bytes | The bytes received on the RDMA interface. |
| RdmaTxPackets | RDMA aggregate network transmit packets | integer | The number of RDMA interface packets transmitted. |
| RdmaRxPackets | RDMA aggregate network receive packets | integer | The number of RDMA interface packets received. |

Dimensions: availabilityDomain, faultDomain, imageId, instancePoolId, rdmaId, region, resourceDisplayName, resourceId, shape
Fault Metrics: rdma_infrastructure_health
| Metric | Metric Display Name | Unit | Description |
|---|---|---|---|
| RdmaLinkSpeedFault | Faults | count | Detects if a link speed fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |
| RdmaPcieAddressFault | Faults | count | Detects if a PCIe address fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |
| RdmaPcieBerCheckFault | Faults | count | Detects if a PCIe BER fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |
| RdmaPcieCableFlapFault | Faults | count | Detects if a PCIe cable flap fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |
| RdmaPcieCablePlugFault | Faults | count | Detects if a PCIe cable plug fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |
| RdmaPcieCableStateFault | Faults | count | Detects if a PCIe cable state fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. |

Dimensions: availabilityDomain, faultDomain, imageId, instancePoolId, pcieAddress, rdmaId, region, resourceDisplayName, resourceId, shape
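Because every RDMA fault metric uses the same convention (0 for no fault, 1 for a detected fault), one threshold expression can be generated for each of them. The sketch below builds such expressions in MQL alarm-query form. The helper `fault_alarm_query`, the one-minute interval, and the choice of the max() statistic are assumptions made for illustration, not settings taken from the documentation.

```python
# Fault metric names as listed for the rdma_infrastructure_health namespace.
RDMA_FAULT_METRICS = [
    "RdmaLinkSpeedFault",
    "RdmaPcieAddressFault",
    "RdmaPcieBerCheckFault",
    "RdmaPcieCableFlapFault",
    "RdmaPcieCablePlugFault",
    "RdmaPcieCableStateFault",
]


def fault_alarm_query(metric, interval="1m"):
    """Build an expression that is true when any data point in the interval is above 0."""
    return f"{metric}[{interval}].max() > 0"


queries = {metric: fault_alarm_query(metric) for metric in RDMA_FAULT_METRICS}
for expression in queries.values():
    print(expression)
```

Using max() rather than mean() means a single faulty sample within the interval is enough to satisfy the condition.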
For an attached block volume: While viewing the instance's details, under Resources, click Attached block volumes, and then click the volume that you're interested in. Under Resources, click Metrics to see the volume's charts. For more information about the emitted metrics, see Block Volume Metrics.
For the attached boot volume: While viewing the instance's details, under Resources, click Boot volume, and then click the volume that you're interested in. Under Resources, click Metrics to see the volume's charts. For more information about the emitted metrics, see Block Volume Metrics.
For an attached VNIC: While viewing the instance's details, under Resources, click Attached VNICs, and then click the VNIC that you're interested in. Under Resources, click Metrics to see the charts for the VNIC. For more information about the emitted metrics, see Networking Metrics.