Observability
Service Mesh provides observability features, including metrics and logging.
Metrics
Installing Service Mesh gives you observability features that collect telemetry data throughout the mesh and your application. Because both inbound and outbound traffic flows through Service Mesh proxies, key operating statistics such as latency, failures, and request counts are collected.
What Metrics does Service Mesh Emit?
Service Mesh uses Envoy as its proxy technology. Envoy emits many statistics, depending on configuration. Generally, the statistics fall into three categories:
- Downstream: Downstream statistics relate to incoming connections and requests that arrive at the proxy.
- Upstream: Upstream statistics relate to outgoing connections and requests that the proxy makes.
- Server: Server statistics describe how the Envoy instance itself is working. Statistics such as server uptime or the amount of allocated memory fall into this category.
For a list of metrics that Service Mesh proxies emit, refer to the following Envoy statistics documentation:
- Listener Stats: https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/stats
- Server Stats: https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/statistics
- Cluster Manager Stats: https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats
- Connection Manager Stats: https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats#config-http-conn-man-stats
Envoy exposes metrics through the admin endpoint /stats/prometheus. Prometheus instances can scrape Envoy metrics from this endpoint. To start monitoring these metrics, install Prometheus with a scrape configuration, along with Grafana. For setup instructions, see Add Application Monitoring and Graphing Support.
After completing the setup, Prometheus scrapes telemetry data emitted by Service Mesh proxies. You can then access Grafana through the service external IP to query and graph telemetry data collected in Prometheus.
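To make the scrape step concrete, here is a minimal Python sketch of reading the /stats/prometheus output and totaling each metric across its series. The function names and the sample text are illustrative assumptions, not part of Service Mesh. The admin address you pass to fetch_stats depends on your proxy configuration and is not specified here.

```python
import urllib.request


def fetch_stats(admin_url: str) -> str:
    """Fetch Prometheus-format stats text from an Envoy admin endpoint.

    admin_url is an assumption, for example the scheme, host, and port at
    which the proxy's admin interface is reachable in your environment.
    """
    with urllib.request.urlopen(f"{admin_url}/stats/prometheus", timeout=5) as resp:
        return resp.read().decode("utf-8")


def sum_by_metric(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring comments and labels.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Invented sample text standing in for a real scrape.
SAMPLE = """\
# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{cluster_name="a"} 10
envoy_cluster_upstream_rq_total{cluster_name="b"} 5
envoy_cluster_upstream_rq_completed{cluster_name="a"} 9
"""

totals = sum_by_metric(SAMPLE)
print(totals)
```

In practice Prometheus performs this collection for you. The sketch only illustrates the shape of the data that the endpoint returns.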
As a starting point, consider monitoring the following service mesh metrics.
- envoy_http_ingress_http_downstream_rq_time_sum (Downstream Request Time)
- envoy_http_ingress_http_downstream_rq_xxx (Downstream Response Code Count)
- envoy_http_ingress_http_downstream_rq_total (Total Downstream Requests)
- envoy_cluster_upstream_rq_total (Total Upstream Requests)
- envoy_cluster_upstream_rq_completed (Total Upstream Requests Completed)
- envoy_cluster_upstream_rq_xxx (Upstream Response Code Count)
- envoy_cluster_upstream_rq_time_sum (Upstream Request Time)
Service Mesh Tagging
Service Mesh also adds specific tags to all the stats exposed by Envoy. These tags let you filter metrics by the mesh resources they are associated with. Tags include the following:
- Mesh OCID
- VirtualService OCID (if available for resource)
- VirtualService Name (if available for resource)
- Envoy Cluster Name (for cluster stats)
- Deployment Type (either virtual_deployment or ingress_deployment)
- Virtual Deployment Name (if deployment type is virtual_deployment)
- Ingress Deployment Name (if deployment type is ingress_deployment)
- Deployment OCID
Example Metric with Service Mesh Tags Applied
envoy_cluster_upstream_rq_completed{mesh_id="ocid1.mesh.oc1.iad.aaa...",
virtual_service_id="ocid1.meshvirtualservice.oc1.iad.aaa...",
virtual_service_name="pet-rescue/pets",deployment_type="virtual_deployment",virtual_deployment_name="pet-rescue/pets-v1",
deployment_id="ocid1.meshvirtualdeployment.oc1.iad.aaa...",
cluster_name="in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa..."} 568
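As a sketch of how these tags can be used for filtering, the following Python snippet splits one exposition line into its metric name, tags, and value, then checks a tag. The parse_sample helper and the regular expression are illustrative assumptions. The sample line is a shortened copy of the metric shown above.

```python
import re

# Matches key="value" pairs, allowing escaped characters inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one exposition line into (metric name, tags dict, value)."""
    series, _, value = line.strip().rpartition(" ")
    name, _, label_part = series.partition("{")
    labels = dict(LABEL_RE.findall(label_part))
    return name, labels, float(value)


line = (
    'envoy_cluster_upstream_rq_completed{mesh_id="ocid1.mesh.oc1.iad.aaa...",'
    'virtual_service_name="pet-rescue/pets",deployment_type="virtual_deployment",'
    'virtual_deployment_name="pet-rescue/pets-v1",'
    'cluster_name="in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa..."} 568'
)

name, labels, value = parse_sample(line)

# Keep only samples that belong to one virtual service.
if labels.get("virtual_service_name") == "pet-rescue/pets":
    print(name, labels["deployment_type"], value)
```

The same filter is normally written as a label matcher in a Prometheus query. The Python version only shows which tag keys are available.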
In the example, the value aaa... is an abbreviation for the full OCID value.
Naming Conventions
As part of proxy configuration setup, Service Mesh internally generates names for various resources, and these names become part of the stat names. For example, the following cluster configuration shows the cluster name generated for a virtual deployment:
{
  "version_info": "5",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...",
    "type": "STATIC",
    "connect_timeout": "0.250s",
    "load_assignment": {
      "cluster_name": "in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "127.0.0.1",
                    "port_value": 9080
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "last_updated": "2022-04-25T21:31:24.730Z"
}
Cluster stats associated with this deployment look like the following when you scrape the /stats/prometheus endpoint. The cluster_name is added as a tag on the stat and removed from the stat name:
envoy_cluster_external_upstream_rq_2xx{mesh_id="ocid1.mesh.oc1.iad.aaa...",
virtual_service_id="ocid1.meshvirtualservice.oc1.iad.aaa...",
virtual_service_name="pet-rescue/pets",deployment_type="virtual_deployment",virtual_deployment_name="pet-rescue/pets-v1",
deployment_id="ocid1.meshvirtualdeployment.oc1.iad.aaa...",
cluster_name="in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa..."} 568
For non-cluster stats such as virtual hosts, stat names look like the following:
envoy_vhost_in_HTTP_9080_ocid1_meshvirtualdeployment_oc1_iad_aaa..._vcluster_other_upstream_rq_timeout
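Comparing the stat name above with the generated cluster name suggests that the | and . separators appear as underscores once the name is embedded in a stat name. The following Python sketch reproduces that mapping under the assumption that every character outside letters, digits, and underscore is replaced. The helper name and the exampleid suffix are hypothetical and stand in for a full OCID.

```python
import re


def to_stat_fragment(resource_name: str) -> str:
    """Replace every character outside [A-Za-z0-9_] with an underscore.

    This replacement rule is an assumption inferred from the example stat name.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", resource_name)


# "exampleid" is a hypothetical placeholder for the elided part of the OCID.
resource = "in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.exampleid"
stat = f"envoy_vhost_{to_stat_fragment(resource)}_vcluster_other_upstream_rq_timeout"
print(stat)
```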
The following table provides the name format for the various types in the proxy configuration:
Type | Format | Example Value |
---|---|---|
Cluster Name | <traffic_direction>\|<protocol>\|<port>\|<virtual_deployment_ocid>\|<certificate_ocid> | Ingress: Egress: |
Route Config Name | <traffic_direction>\|<protocol>\|<port>\|<virtual_deployment_ocid> | Ingress: Egress: |
Virtual Hostname | <virtual_service_name>\|<port> | |
Listener Name | <traffic_direction>\|<port>\|<virtual_deployment_ocid> | Ingress: Egress: |
Ingress Gateway Route Config Name | ig\|<port>\|<ingress_gateway_ocid>\|<hostnames> | |
Ingress Gateway Virtual Hostname | ig\|<port>\|<ingress_gateway_ocid>\|<hostnames> | |
Ingress Gateway Listener Name | ig\|<port>\|<ingress_gateway_ocid> | |
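The cluster name format in the table can be split back into its fields, which is useful when grouping metrics by the cluster_name tag. The following Python sketch is one way to do that. The ClusterName type and parse_cluster_name function are hypothetical names. Treating the certificate OCID as optional is an assumption based on the four-field cluster name shown earlier.

```python
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    traffic_direction: str
    protocol: str
    port: int
    virtual_deployment_ocid: str
    certificate_ocid: Optional[str]  # assumed optional; absent in the earlier example


def parse_cluster_name(name: str) -> ClusterName:
    """Split a generated cluster name on the | separator."""
    parts = name.split("|")
    if len(parts) not in (4, 5):
        raise ValueError(f"unexpected cluster name format: {name!r}")
    direction, protocol, port, deployment = parts[:4]
    certificate = parts[4] if len(parts) == 5 else None
    return ClusterName(direction, protocol, int(port), deployment, certificate)


parsed = parse_cluster_name("in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...")
print(parsed.traffic_direction, parsed.protocol, parsed.port)
```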
Logging
OCI Logging is activated on virtual deployments and ingress gateways after you install a mesh. The OCI Logging service collects logs for later analysis. Service Mesh provides two types of logs: error logs and traffic logs. You can use these logs to generate log-based statistics or to debug 404 and 503 issues. The following example shows a log search result:
{
  "results": [
    {
      "data": {
        "datetime": "XXXXXXX",
        "logContent": {
          "data": {
            "message": "2022-02-11T17:58:54.435653464+00:00 stderr F I0311 17:58:54.752392 1 httplog.go:90] verb=\"GET\" URI=\"/openapi/v2\" latency=15.083521ms resp=304 UserAgent=\"\" srcIP=\"x.x.x.x:xxxx\": ",
            "tailed_path": "/var/log/containers/packageserver-aaaa.log"
          },
          "id": "7acddd...",
          "oracle": {
            "compartmentid": "ocid1.compartment.oc1..aaaaaaaa...",
            "ingestedtime": "2022-02-11T17:59:03.950Z",
            "instanceid": "ocid1.instance.oc1.iad.aaaaa...",
            "loggroupid": "ocid1.loggroup.oc1.iad.amaaaaaa...",
            "logid": "ocid1.log.oc1.iad.amaaaa...",
            "tenantid": "ocid1.tenancy.oc1..aaaa..."
          },
          "source": "oke-cqcs...",
          "specversion": "1.0",
          "subject": "/var/log/containers/packageserver-aaaa.log",
          "time": "2022-02-11T17:58:54.881Z",
          "type": "com.oraclecloud.logging.custom.inputsource"
        }
      }
    }
  ]
}
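As a sketch of log-based debugging, the following Python snippet walks a result shaped like the one above and pulls the key=value fields out of the message text. The helper names and the regular expression are illustrative assumptions. The sample is trimmed to the fields the code reads, and real messages may use a different layout.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare token.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(message: str) -> dict[str, str]:
    """Return the key=value pairs found in a log message, without quotes."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(message)}


def is_error(fields: dict[str, str]) -> bool:
    """Treat any response code of 400 or above as an error."""
    return int(fields.get("resp", "0")) >= 400


# Trimmed copy of the search result above, keeping only the message path.
search_result = {
    "results": [
        {
            "data": {
                "logContent": {
                    "data": {
                        "message": (
                            "2022-02-11T17:58:54.435653464+00:00 stderr F I0311 "
                            "17:58:54.752392 1 httplog.go:90] verb=\"GET\" "
                            "URI=\"/openapi/v2\" latency=15.083521ms resp=304 "
                            "UserAgent=\"\" srcIP=\"x.x.x.x:xxxx\": "
                        )
                    }
                }
            }
        }
    ]
}

for item in search_result["results"]:
    fields = extract_fields(item["data"]["logContent"]["data"]["message"])
    print(fields["verb"], fields["URI"], fields["resp"], is_error(fields))
```

Filtering on is_error, or on specific codes such as 404 and 503, narrows a large result set to the requests worth investigating.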