Troubleshooting - General

Identify the causes and fixes for general problems with Service Mesh. The following general troubleshooting solutions are available.

Changes Made to Mesh Resources in the OCI Console or OCI CLI Revert to Their Previous State

Issue

Any changes made to mesh resources (for example: ingress gateway, virtual service, virtual deployment, and so on) from the OCI Console or the OCI CLI revert to their previous state based on the update interval set for the operator.

Explanation

Currently, after initial creation of a mesh resource, changes to resources can only be made through the OCI Service Operator for Kubernetes, that is, with kubectl. Based on the operator update interval (for example, every hour), the OCI Service Operator for Kubernetes runs a reconciliation process. Any resources in the service mesh control plane with different settings are reverted to match the settings in the OCI Service Operator for Kubernetes.
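
For example, a change can be made persistent by editing the custom resource in Kubernetes. The following is a minimal sketch; the resource name and namespace are hypothetical, and the plural resource name follows the servicemesh.oci.oracle.com API group used by the operator.

# Hypothetical example: edit the mesh custom resource in the cluster so that
# the OCI Service Operator for Kubernetes propagates the change and preserves
# it through its reconciliation process.
kubectl edit virtualservices.servicemesh.oci.oracle.com my-virtual-service -n my-namespace

# Alternatively, update the resource manifest and reapply it.
kubectl apply -f my-virtual-service.yaml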

Having General Traffic Issues with Service Mesh

Issue

Missing or unexpected service traffic is usually a sign of improper routing settings. The following are the most common reasons why service communication is not working as expected.

Common Reasons

  1. Resources Are Not in the Correct State

    One reason for unexpected traffic to a virtual deployment is that a virtual service route table update caused a virtual deployment to go into a Failed state while the default routing policy is set to UNIFORM. Ensure that all your resources are in an Active state. If any of your required resources are in a Failed or Deleted state, they are not used to build the routing configurations (see the example after this list).

  2. Protocol or Port Mismatch

    The protocol or port configured on the ingress gateway host listeners and virtual deployments doesn't match what is specified in the ingress gateway route table and virtual service route table.

  3. DNS Host Mismatch

    The DNS hostname of the virtual deployment must match the Kubernetes service.

  4. Host Header Mismatch

    Your internal and external service callers are not using the host headers specified by the Kubernetes service. Remember that if you are not using a standard port for the protocol (<host>:<port>), the same values must also be specified in the hosts of a virtual service or ingress gateway host listener.

    This rule also extends to direct use of IP addresses. If you want to use an IP address, then the IP address must be specified as the host of the ingress gateway host listener or virtual service.

  5. Missing Access Policy

    An access policy that allows traffic to or from a service is missing.
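
To check resource states (reason 1), you can inspect the mesh custom resources in the cluster. The following is a minimal sketch; the resource names and namespace are hypothetical, and the plural resource names follow the servicemesh.oci.oracle.com API group used by the operator.

# List mesh resources in a namespace and review their status columns.
kubectl get virtualservices.servicemesh.oci.oracle.com,virtualdeployments.servicemesh.oci.oracle.com -n bookinfo

# Describe a specific resource to see its conditions and any failure messages.
kubectl describe virtualdeployments.servicemesh.oci.oracle.com reviews-v1 -n bookinfo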

SSL Related Reasons

  1. Server Name Mismatch

    To initiate a TLS handshake, your internal and external service communication must use the hostnames specified in a virtual service or an ingress gateway.

  2. Using Expired Certificate Authority

    Service Mesh checks that users do not provide already expired certificates or certificate authorities as part of Service Mesh resource creation. However, the customer is responsible for rotating the certificate authority before expiration so that certificates are renewed.
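
As a quick check before rotating, you can confirm a certificate authority's expiration date locally. The following is a minimal sketch, assuming the CA certificate is available as a local PEM file named ca-cert.pem.

# Print the expiration date of the CA certificate.
openssl x509 -noout -enddate -in ca-cert.pem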

Troubleshooting Ingress Gateway Deployments

Issue

The IngressGatewayDeployment resource creates dependent resources like Deployment, Service, and Horizontal Pod Autoscaler. The Service created by IngressGatewayDeployment can in turn create a LoadBalancer resource. If any of these dependent resources fail to create, the IngressGatewayDeployment resource doesn’t become active. To remediate some common issues, review the following:

Solution

If the deployment produces an error similar to the following, the service of type LoadBalancer created by the IngressGatewayDeployment failed to create a public load balancer in a private subnet.

Warning SyncLoadBalancerFailed 3m2s (x10 over 48m) service-controller (combined from similar events): Error syncing load balancer: failed to ensure load balancer: creating load balancer: Service error:InvalidParameter. Private subnet with id <subnet-ocid> is not allowed in a public loadbalancer.. http status code: 400. Opc request id: <opc-request-id>

To use a private or internal load balancer, do the following.

  • Remove the service section from the IngressGatewayDeployment resource.
  • Create a Service with the correct annotations that points to the ingress gateway pods.

Your updated resources look similar to the following examples.

IngressGatewayDeployment without service

apiVersion: servicemesh.oci.oracle.com/v1beta1
kind: IngressGatewayDeployment
metadata:
  name: bookinfo-ingress-gateway-deployment
  namespace: bookinfo
spec:
  ingressGateway:
    ref:
      name: bookinfo-ingress-gateway
  deployment:
    autoscaling:
      minPods: 1
      maxPods: 1
  ports:
    - protocol: TCP
      port: 9080
      serviceport: 80

Service to create an internal load balancer.

apiVersion: v1
kind: Service
metadata:
  name: bookinfo-ingress
  namespace: bookinfo
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"
spec:
  ports:
  - port: 80
    targetPort: 9080
    name: http
  selector:
    servicemesh.oci.oracle.com/ingress-gateway-deployment: bookinfo-ingress-gateway-deployment
  type: LoadBalancer

Horizontal Pod Autoscaler (HPA) Does Not Scrape Metrics

Issue

The Horizontal Pod Autoscaler (HPA) does not scrape metrics.

Solution

When an application pod is set up with Service Mesh, the Service Mesh proxy container is injected into the pod. Along with the proxy container, an init container is also injected, which does a one-time initialization required to enable the proxy.

Because of the presence of the init container in the pod, the metrics-server is unable to scrape metrics from the pod in some scenarios. Refer to the following table.

metrics-server Version | HPA API Version     | Able to Scrape Metrics
-----------------------|---------------------|-----------------------
v0.6.x                 | autoscaling/v2beta2 | No
v0.6.x                 | autoscaling/v1      | Yes
v0.4.x                 | Any                 | No
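
For example, with metrics-server v0.6.x, an HPA defined through the autoscaling/v1 API version can still obtain metrics. The following is a minimal sketch; the deployment name, namespace, and scaling thresholds are hypothetical.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: pets-v1-hpa
  namespace: pets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pets-v1
  minReplicas: 1
  maxReplicas: 3
  # Scale out when average CPU utilization across pods exceeds 70 percent.
  targetCPUUtilizationPercentage: 70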

Virtual Deployment Pods Receive No Traffic

Issue

My virtual deployment pods receive no traffic.

Solution

By default, the routing policy for a virtual service is DENY. Therefore, do one of the following:

  • Change the routing policy to UNIFORM.
  • Create a virtual service route table to route traffic to your virtual deployment, as shown in the following example.
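
The following is a minimal sketch of a virtual service route table that routes all traffic to a single virtual deployment. The names, compartment OCID, and exact field layout are illustrative; check the VirtualServiceRouteTable specification in the Service Mesh documentation for the full schema.

apiVersion: servicemesh.oci.oracle.com/v1beta1
kind: VirtualServiceRouteTable
metadata:
  name: pets-route-table
  namespace: pets
spec:
  compartmentId: ocid1.compartment.oc1..exampleuniqueID
  virtualService:
    ref:
      name: pets
  routeRules:
    - httpRoute:
        destinations:
          # Send 100% of traffic to the pets-v1 virtual deployment.
          - virtualDeployment:
              ref:
                name: pets-v1
            weight: 100
        path: /
        pathType: PREFIX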

Troubleshoot Traffic Issues with Proxy config_dump

Issue

You’re experiencing one of the following traffic issues.

  • A service isn’t receiving any traffic.
  • Secure communication isn’t happening between services.
  • Traffic splitting isn’t happening across versions.
  • A/B deployment testing or canary deployments fail.

Solution

To troubleshoot, get the config_dump file for the affected pod. You can infer more information by comparing the source and destination pod config_dump files. To get the file, perform the following steps.

  • Open a shell in the oci-sm-proxy container.
    $ kubectl exec -i -t -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- /bin/bash
  • Inside the container, access the config_dump file.
    $ curl localhost:9901/config_dump
  • Exit from the container.
    $ exit
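
Because the config_dump output is verbose, filtering it can help. The following is a sketch that saves the dump locally and lists its top-level configuration sections with jq; it assumes jq is installed on your workstation. The /listeners and /clusters endpoints are standard Envoy admin endpoints served on the same port.

# Save the config dump locally and list its top-level sections.
kubectl exec -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- curl -s localhost:9901/config_dump > config_dump.json
jq '.configs[]."@type"' config_dump.json

# Inspect the currently configured listeners and clusters.
kubectl exec -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- curl -s localhost:9901/listeners
kubectl exec -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- curl -s localhost:9901/clusters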

Analyze Traffic Between Service Versions with Prometheus

Issue

You want to determine whether traffic was sent to a particular version of a service in the last five minutes. Prometheus metrics are key to answering this question.

Solution

To view service traffic in the last five minutes, perform the following steps.

Note

The following examples assume a service named "pets" is deployed and has multiple versions.
  • Open the Prometheus Dashboard in the browser using port-forwarding:

    kubectl port-forward PROMETHEUS_POD_NAME -n PROMETHEUS_NAMESPACE PROMETHEUS_CONTAINER_PORT:9090
  • To view Prometheus metrics, visit http://localhost:9090/graph in the browser.
  • To view the total count of all requests sent to the pets-v1 service, enter the following query in the Prometheus search field and click Execute.
    envoy_cluster_external_upstream_rq_completed{virtual_deployment_name="pets-v1"}
  • To view the rate of requests over the past 5 minutes to the pets-v1 service, enter the following query in the Prometheus search field and click Execute.

    rate(envoy_cluster_external_upstream_rq_completed{virtual_deployment_name="pets-v1"}[5m])
  • Fetch the total number of requests sent to all pods with curl:

    curl 'localhost:9090/api/v1/query?query=envoy_cluster_external_upstream_rq_completed' | jq
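
To compare how traffic is split across versions, you can aggregate the same rate query by the version label. The following is a sketch that assumes the versions share the pets- name prefix and expose the same virtual_deployment_name label used in the queries above.

sum by (virtual_deployment_name) (
  rate(envoy_cluster_external_upstream_rq_completed{virtual_deployment_name=~"pets-.*"}[5m])
)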