GPU Inference

Learn how to use model deployments to perform inference on GPU instances. For compute-intensive models, GPUs offer greater performance than CPUs.

GPUs suit machine learning inference use cases that rely on deep learning models, such as large language models, speech recognition, and image recognition. Their parallel processing capabilities and high memory bandwidth let them handle large amounts of data and complex calculations much faster than traditional CPUs. The result is much shorter inference times and improved performance.

Data Science model deployment supports deploying models on GPU shapes by using several methods:

Fully Managed: The service uses the default inference server to deploy the model on the selected GPU shape. You have control of how to use the GPU by using the score.py file.
Container based: Use your own custom container for the inference server. Use the GPU from both the container and the score.py file.
NVIDIA Triton Inference Server: The Triton Inference Server is designed from the ground up for efficient and easy GPU use. You can use a Triton container when creating a model deployment.

Prepare the Model Artifact

Model deployment requires a model artifact that is stored in the model catalog and is in an active state. The score.py file of the model artifact must expose a GPU device variable. The load_model function must move the model to the selected GPU device. The predict function must then move the input tensors to the GPU device. After inference is computed on the GPU device, the predict function must move the output tensor back to the CPU before returning.

Tip

Consider using the ADS SDK to automatically generate a model artifact.

The ADS SDK can automatically generate a score.py file that supports GPUs for the PyTorch and TensorFlow frameworks.

The following score.py example uses an available GPU device to perform inference on a ResNet152 PyTorch model:

import base64
import io
import os
from random import randint

import numpy as np
import torch
from PIL import Image
from torchvision import models


# Disable Pillow's size check so that very large images can be opened
Image.MAX_IMAGE_PIXELS = None

model_name = 'PyTorch_ResNet152.pth'


# Get an available GPU device, falling back to the CPU when none is present
def get_torch_device():
    num_devices = torch.cuda.device_count()
    if num_devices == 0:
        return "cpu"
    if num_devices == 1:
        return "cuda:0"
    # With several cards, pick one at random so replicas spread across cards
    return f"cuda:{randint(0, num_devices - 1)}"


# Select the device once so that the logged device is the one actually used
device = torch.device(get_torch_device())
print("Device selected for inference", device)


def load_model(model_file_name=model_name):
    """
    Loads model from the serialized format

    Returns
    -------
    model:  Pytorch model instance
    """
    print(f"Device {device}")
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    if model_file_name in contents:
        # Assumption: the .pth file holds a state_dict for torchvision's ResNet152
        model = models.resnet152()
        model.load_state_dict(
            torch.load(os.path.join(model_dir, model_file_name), map_location=device)
        )
        # Move the model to the selected device and switch to inference mode
        model = model.to(device)
        model.eval()
        print(f"model loaded to {next(model.parameters()).device}")
        return model
    else:
        raise FileNotFoundError(f'{model_file_name} is not found in model directory {model_dir}.')
 
def predict(data, model=load_model()):
    """
    Returns prediction given the model and data to predict
  
    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format in json
  
    Returns
    -------
    predictions: Output from scoring server
        Format: {'prediction':output from model.predict method}
  
    """
  
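    # Decode the base64 payload and force three RGB channels, which the model expects
    img_bytes = io.BytesIO(base64.b64decode(data.encode('utf-8')))
    image = Image.open(img_bytes).convert("RGB").resize((224, 224))
    arr = np.array(image)
    # Reorder HWC to CHW, add a batch dimension, and move the input to the device
    X = torch.FloatTensor(np.transpose(arr, axes=(2, 0, 1))).unsqueeze(0)
    X = X.to(device)
    with torch.no_grad():
        # Run inference on the device, then move the output back to the CPU
        Y = model(X).to("cpu")
        pred = torch.nn.functional.softmax(Y[0], dim=0).argmax().item()
    return {'prediction': pred}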
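
Because the predict function decodes its input with base64.b64decode, the client has to send the image as a base64-encoded string. The following is a minimal sketch of preparing such a payload. The helper name encode_image_payload is hypothetical and is not part of the service or the ADS SDK, and the sample bytes are a stand-in for real image data.

```python
import base64


def encode_image_payload(image_bytes: bytes) -> str:
    """Return raw image bytes as a base64 text string.

    Hypothetical helper: it mirrors the decoding step in predict(),
    which calls base64.b64decode(data.encode('utf-8')).
    """
    return base64.b64encode(image_bytes).decode("utf-8")


# Stand-in bytes; in practice, read them from an image file opened in binary mode.
sample_bytes = b"\x89PNG\r\n\x1a\n example image bytes"
payload = encode_image_payload(sample_bytes)
```

The resulting string is what predict receives as its data argument.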

Conda Environment with Model Runtime Dependencies

The runtime.yaml file in the model artifact must include a conda environment that includes the GPU dependencies.

In the following example, the runtime.yaml file instructs the model deployment to pull a published conda environment from the Object Storage path defined by the INFERENCE_ENV_PATH environment variable:

MODEL_ARTIFACT_VERSION: '3.0'
MODEL_DEPLOYMENT:
  INFERENCE_CONDA_ENV:
    INFERENCE_ENV_PATH: oci://service-conda-packs@id19sfcrra6z/service_pack/gpu/PyTorch_1.10_for_GPU_on_Python_3.8/1.0/pytorch110_p38_gpu_v1
    INFERENCE_ENV_SLUG: pytorch_pack_v1
    INFERENCE_ENV_TYPE: data_science
    INFERENCE_PYTHON_VERSION: 3.8

Create Model Deployment

After a GPU model artifact is created, you create a model deployment and select one of the supported GPU shapes.

Model deployment supports bring your own container as a runtime dependency if you use your own inference server. You must select one of the GPU shapes when creating the model deployment.

Model Replicas

When using a service managed inference server, model deployments load several model replicas onto the available GPU cards to achieve better throughput. The number of model replicas is calculated based on:

The size of the model.
Memory available on the GPU card.
Number of GPU cards.
Logical CPU cores available on the Compute shape.

For example, suppose the model takes 2 GB of memory and you select the VM.GPU2.1 shape, which has one GPU card with 16 GB of GPU memory. The model deployment allocates a percentage (around 70%) of GPU memory to loading models and reserves the remaining memory for runtime computation during inference. The model deployment therefore loads five replicas on the one GPU card (16 GB × 0.7 / 2 GB per model = 5.6, rounded down to 5). If two cards are available, a total of 10 model replicas are loaded, with five on each card.
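
The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the 0.7 fraction is the approximate figure quoted above, and the sketch ignores the logical CPU core count that the service also takes into account.

```python
import math


def replicas_per_card(gpu_memory_gb: float, model_size_gb: float,
                      usable_fraction: float = 0.7) -> int:
    """Estimate how many model replicas fit on one GPU card.

    usable_fraction is the approximate share of GPU memory set aside
    for loading models (around 70% in the example above).
    """
    return math.floor(gpu_memory_gb * usable_fraction / model_size_gb)


def total_replicas(num_cards: int, gpu_memory_gb: float, model_size_gb: float,
                   usable_fraction: float = 0.7) -> int:
    """Estimate the replica count across all cards of the same size."""
    return num_cards * replicas_per_card(gpu_memory_gb, model_size_gb, usable_fraction)


one_card = total_replicas(1, 16, 2)   # 16 * 0.7 / 2 = 5.6, floored to 5
two_cards = total_replicas(2, 16, 2)  # 5 replicas on each of 2 cards
```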

With automatically generated score.py files, the distribution of model replicas across GPU cards is based on a random algorithm, which statistically places a nearly equal number of replicas on each card. You can change the number of replicas by using the WEB_CONCURRENCY application environment variable.
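
To see why random placement gives a nearly equal split only in a statistical sense, consider the following small simulation. It assumes that each replica picks a card uniformly at random, as the randint call in the earlier score.py example does. The function name place_replicas is hypothetical, and the exact algorithm in automatically generated files may differ.

```python
import random
from collections import Counter


def place_replicas(num_replicas: int, num_cards: int, seed: int = 0) -> Counter:
    """Assign each replica to a uniformly random card and count replicas per card."""
    rng = random.Random(seed)  # seeded so that the simulation is repeatable
    return Counter(rng.randrange(num_cards) for _ in range(num_replicas))


# Map of card index to the number of replicas placed on that card
counts = place_replicas(num_replicas=1000, num_cards=2)
print(dict(counts))
```

With many replicas the counts per card tend to be close to each other. With only a handful of replicas, an uneven split is quite possible.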

Using Triton for GPU Inference

NVIDIA Triton Inference Server streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU or CPU-based infrastructure.

Set up an NVIDIA Triton Inference Server to use with model deployments.

To enable GPU inference, the config.pbtxt file must contain KIND_GPU in the instance group. You can also change the instance group configuration to specify the number of replicas of a model that are loaded on the GPU device.
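
As an illustration, the following sketch renders an instance_group fragment of the kind that config.pbtxt expects. The helper render_instance_group is hypothetical. The field names count, kind, and gpus follow the Triton model configuration format, while the values shown are example choices rather than recommendations.

```python
def render_instance_group(count: int, gpus=None) -> str:
    """Render an instance_group block for a Triton config.pbtxt file.

    count is the number of model instances per GPU; gpus optionally
    restricts the instances to the listed GPU device indexes.
    """
    lines = ["instance_group [", "  {", f"    count: {count}", "    kind: KIND_GPU"]
    if gpus is not None:
        gpu_list = ", ".join(str(g) for g in gpus)
        lines.append(f"    gpus: [ {gpu_list} ]")
    lines += ["  }", "]"]
    return "\n".join(lines)


# Example: two instances of the model on GPU 0
fragment = render_instance_group(count=2, gpus=[0])
print(fragment)
```

The printed text can be pasted into the model's config.pbtxt file alongside the rest of the model configuration.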