Observability

Service Mesh provides observability features, including metrics and logging.

Metrics

Installing Service Mesh gives you observability features that collect telemetry data throughout the mesh and your application. Because both inbound and outbound traffic flows through Service Mesh proxies, the proxies collect key operating statistics such as latency, failures, and request counts.

What Metrics Does Service Mesh Emit?

Service Mesh utilizes Envoy as the proxy technology. Envoy emits many statistics depending on configuration. Generally the statistics fall into three categories:

  • Downstream: Downstream statistics relate to connections and requests coming into the proxy.

  • Upstream: Upstream statistics relate to connections and requests that the proxy makes to other endpoints.

  • Server: Server statistics describe how the Envoy instance is working. Statistics like server uptime or amount of allocated memory are categorized here.

For a list of metrics that Service Mesh proxies emit, refer to the Envoy statistics documentation.

Envoy exposes metrics through the admin /stats/prometheus endpoint. Prometheus instances can scrape Envoy metrics from this endpoint. To start monitoring these metrics, install Prometheus with a scrape configuration, and install Grafana. For setup instructions, see Add Application Monitoring and Graphing Support.

After completing the setup, Prometheus scrapes telemetry data emitted by Service Mesh proxies. You can then access Grafana through the service external IP to query and graph telemetry data collected in Prometheus.
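Before wiring up Prometheus, it can help to look at what a proxy exposes. The following is a minimal sketch, not part of Service Mesh itself: the admin address `http://localhost:9901` and the helper names `fetch_metrics` and `filter_metrics` are assumptions for illustration, so substitute the address where your proxy's admin interface is reachable.

```python
from urllib.request import urlopen

# Hypothetical admin address; replace it with your proxy's admin listener.
ADMIN_URL = "http://localhost:9901/stats/prometheus"


def fetch_metrics(url=ADMIN_URL, timeout=5):
    """Fetch the Prometheus text exposition from the Envoy admin endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def filter_metrics(text, prefix):
    """Return sample lines whose metric name starts with the given prefix.

    Comment lines such as '# TYPE ...' start with '#', so they never match
    a metric-name prefix and are dropped.
    """
    return [line for line in text.splitlines() if line.startswith(prefix)]
```

Calling `filter_metrics(fetch_metrics(), "envoy_cluster_upstream_rq")` from a host that can reach the admin endpoint would list only the upstream request statistics.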

As a starting point, consider monitoring the following service mesh metrics.

  • envoy_http_ingress_http_downstream_rq_time_sum (Downstream Request Time)
  • envoy_http_ingress_http_downstream_rq_xxx (Downstream Response Code Count)
  • envoy_http_ingress_http_downstream_rq_total (Total Downstream Requests)
  • envoy_cluster_upstream_rq_total (Total Upstream Requests)
  • envoy_cluster_upstream_rq_completed (Total Upstream Requests Completed)
  • envoy_cluster_upstream_rq_xxx (Upstream Response Code Count)
  • envoy_cluster_upstream_rq_time_sum (Upstream Request Time)
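Request totals such as envoy_cluster_upstream_rq_total are cumulative counters, so dashboards usually graph their rate of change rather than the raw value. The sketch below shows that arithmetic for two consecutive scrapes. The function name `counter_rate` is hypothetical, and the reset handling is a simplification of what a monitoring system does.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate between two scrapes of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        # The counter restarted from zero (for example, the proxy restarted),
        # so only the current value is known to have accrued in this interval.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / interval_seconds
```

For example, a counter that moves from 500 to 568 over a 15-second scrape interval corresponds to 68 requests in 15 seconds.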

Service Mesh Tagging

Service Mesh also adds specific tags to all the stats that Envoy exposes, so you can filter metrics by the mesh resources they are associated with. Tags include the following:

  • Mesh OCID
  • VirtualService OCID (if available for resource)
  • VirtualService Name (if available for resource)
  • Envoy Cluster Name (for cluster stats)
  • Deployment Type (either virtual_deployment or ingress_deployment)
  • Virtual Deployment Name (if deployment type is virtual_deployment)
  • Ingress Deployment Name (if deployment type is ingress_deployment)
  • Deployment OCID

Example Metric with Service Mesh Tags Applied

envoy_cluster_upstream_rq_completed{mesh_id="ocid1.mesh.oc1.iad.aaa...",
virtual_service_id="ocid1.meshvirtualservice.oc1.iad.aaa...",
virtual_service_name="pet-rescue/pets",deployment_type="virtual_deployment",virtual_deployment_name="pet-rescue/pets-v1",
deployment_id="ocid1.meshvirtualdeployment.oc1.iad.aaa...",
cluster_name="in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa..."} 568

Note: In the example, the value aaa... is an abbreviation for the full OCID value.
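To filter on these tags outside Prometheus, a scraped sample line can be split into its metric name, tags, and value. The following is a minimal sketch under stated assumptions: `parse_sample` is a hypothetical helper, it expects the whole sample on a single line, and it does not handle optional timestamps or unescape label values.

```python
import re

# metric_name{key="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one Prometheus text-format sample into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))
```

With the parsed tags in a dictionary, selecting samples for one deployment is a matter of comparing `labels["virtual_deployment_name"]` or `labels["deployment_type"]`.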

Naming Conventions

As part of proxy configuration setup, Service Mesh internally generates names for various resources in a fixed format. These names become part of the stat names. For example, the following cluster name is generated for a virtual deployment:

{
 "version_info": "5",
 "cluster": {
  "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
  "name": "in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...",
  "type": "STATIC",
  "connect_timeout": "0.250s",
  "load_assignment": {
   "cluster_name": "in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...",
   "endpoints": [
    {
     "lb_endpoints": [
      {
       "endpoint": {
        "address": {
         "socket_address": {
          "address": "127.0.0.1",
          "port_value": 9080
         }
        }
       }
      }
     ]
    }
   ]
  }
 },
 "last_updated": "2022-04-25T21:31:24.730Z"
}

Cluster stats associated with this deployment look like the following when you scrape the /stats/prometheus endpoint. The cluster_name is added as a tag on the stat and removed from the stat name:

envoy_cluster_external_upstream_rq_2xx{mesh_id="ocid1.mesh.oc1.iad.aaa...",
virtual_service_id="ocid1.meshvirtualservice.oc1.iad.aaa...",
virtual_service_name="pet-rescue/pets",deployment_type="virtual_deployment",virtual_deployment_name="pet-rescue/pets-v1",
deployment_id="ocid1.meshvirtualdeployment.oc1.iad.aaa...",
cluster_name="in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa..."} 568 

For non-cluster stats such as virtual hosts, stat names look like the following:

envoy_vhost_in_HTTP_9080_ocid1_meshvirtualdeployment_oc1_iad_aaa..._vcluster_other_upstream_rq_timeout

The following table provides the name format for the various types in the proxy configuration:

Type: Cluster Name
Format: <traffic_direction> | <protocol> | <port> | <virtual_deployment_ocid> | <certificate_ocid>
Example values:
  Ingress: in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...
  Egress: out|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...
  Egress: out|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...|ocid1.certificate.oc1.iad.aaa...

Type: Route Config Name
Format: <traffic_direction> | <protocol> | <port> | <virtual_deployment_ocid>
Example values:
  Ingress: in|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...
  Egress: out|HTTP|9080|ocid1.meshvirtualdeployment.oc1.iad.aaa...

Type: Virtual Hostname
Format: <virtual_service_name> | <port>
Example value: product|9080

Type: Listener Name
Format: <traffic_direction> | <port> | <virtual_deployment_ocid>
Example values:
  Ingress: in|15000|ocid1.meshvirtualdeployment.oc1.iad.aaa...
  Egress: out|15001|ocid1.meshvirtualdeployment.oc1.iad.aaa...

Type: Ingress Gateway Route Config Name
Format: ig | <port> | <ingress_gateway_ocid> | <hostnames>
Example value: ig|8080|ocid1.meshingressgateway.oc1.iad.aaa...|example.com|www.example.com

Type: Ingress Gateway Virtual Hostname
Format: ig | <port> | <ingress_gateway_ocid> | <hostnames>
Example value: ig|8080|ocid1.meshingressgateway.oc1.iad.aaa...|example.com|www.example.com

Type: Ingress Gateway Listener Name
Format: ig | <port> | <ingress_gateway_ocid>
Example value: ig|8080|ocid1.meshingressgateway.oc1.iad.aaa...
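
Because the generated names are pipe-delimited, a cluster_name tag value can be split back into its parts when you post-process metrics. The following is a minimal sketch based only on the cluster name format and examples shown above. The `ClusterName` type and `parse_cluster_name` function are hypothetical names, and the sketch assumes the certificate OCID is the only optional field.

```python
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    traffic_direction: str
    protocol: str
    port: int
    virtual_deployment_ocid: str
    certificate_ocid: Optional[str]


def parse_cluster_name(name):
    """Split a generated cluster name into its pipe-delimited fields."""
    parts = name.split("|")
    if len(parts) not in (4, 5):
        raise ValueError(f"unexpected cluster name format: {name!r}")
    direction, protocol, port, deployment_ocid = parts[:4]
    if direction not in ("in", "out"):
        raise ValueError(f"unexpected traffic direction: {direction!r}")
    # The certificate OCID appears only on some egress clusters.
    certificate_ocid = parts[4] if len(parts) == 5 else None
    return ClusterName(direction, protocol, int(port), deployment_ocid, certificate_ocid)
```

For instance, `parse_cluster_name(labels["cluster_name"]).traffic_direction` distinguishes ingress (`in`) clusters from egress (`out`) clusters when grouping stats, where `labels` stands for the tags of one scraped sample.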

Logging

OCI Logging is activated on virtual deployments and ingress gateways after you install a mesh. The OCI Logging service collects logs for later analysis. Service Mesh provides two types of logs: error logs and traffic logs. You can use these logs to generate log-based statistics or to debug 404 and 503 issues. The following example shows a log entry as returned in a log search result:

{
    "results": [
        {
            "data": {
                "datetime": "XXXXXXX",
                "logContent": {
                    "data": {
                        "message": "2022-02-11T17:58:54.435653464+00:00 stderr F I0311 17:58:54.752392       1 httplog.go:90] verb=\"GET\" URI=\"/openapi/v2\" latency=15.083521ms resp=304 UserAgent=\"\" srcIP=\"x.x.x.x:xxxx\": ",
                        "tailed_path": "/var/log/containers/packageserver-aaaa.log"
                    },
                    "id": "7acddd...",
                    "oracle": {
                        "compartmentid": "ocid1.compartment.oc1..aaaaaaaa...",
                        "ingestedtime": "2022-02-11T17:59:03.950Z",
                        "instanceid": "ocid1.instance.oc1.iad.aaaaa...",
                        "loggroupid": "ocid1.loggroup.oc1.iad.amaaaaaa...",
                        "logid": "ocid1.log.oc1.iad.amaaaa...",
                        "tenantid": "ocid1.tenancy.oc1..aaaa..."
                    },
                    "source": "oke-cqcs...",
                    "specversion": "1.0",
                    "subject": "/var/log/containers/packageserver-aaaa.log",
                    "time": "2022-02-11T17:58:54.881Z",
                    "type": "com.oraclecloud.logging.custom.inputsource"
                }
            }
        }
    ]
}
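
When debugging 404 and 503 issues, it can be useful to filter exported log search results by response code. The following is a minimal sketch that assumes the JSON layout of the example above and a message containing `resp=<code>`. Real traffic log messages may use a different layout, and `messages_with_status` is a hypothetical helper name.

```python
import json
import re

# Matches a three-digit response code such as resp=304 in the log message.
RESP_RE = re.compile(r"\bresp=(\d{3})\b")


def messages_with_status(search_result_json, wanted_codes):
    """Return log messages whose response code is in wanted_codes."""
    document = json.loads(search_result_json)
    matches = []
    for result in document.get("results", []):
        log_content = result.get("data", {}).get("logContent", {})
        message = log_content.get("data", {}).get("message", "")
        found = RESP_RE.search(message)
        if found and int(found.group(1)) in wanted_codes:
            matches.append(message)
    return matches
```

Passing `{404, 503}` as `wanted_codes` keeps only the entries for those two response codes.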