GPU Inference
Learn how to use model deployments to perform inference on GPU instances. GPUs offer greater performance benefits with compute-intensive models than CPUs.
GPUs are well suited to some machine learning inference use cases, mostly deep learning models such as large language models, speech recognition, and image recognition. GPUs use parallel processing capabilities and high memory bandwidth, which enable them to handle large amounts of data and complex calculations much faster than traditional CPUs. The result is much reduced inference time and improved performance.
Data Science model deployment supports deploying models on GPU shapes by using several methods:
- Fully managed: The service uses the default inference server to deploy the model on the selected GPU shape. You control how the GPU is used through the score.py file.
- Container based: Use your own custom container for the inference server. The GPU is available from both the container and the score.py file.
- NVIDIA Triton Inference Server: The Triton Inference Server provides excellent GPU usage, and is built with ease of GPU use from the ground up. You can use a Triton container when creating a model deployment.
Prepare the Model Artifact
Model deployment requires a model artifact stored in the model catalog, and the model must be in an active state. The score.py file of the model artifact must expose a GPU device variable. The load_model function must move the model to the selected GPU device. Then, the predict function must move the input tensors to the GPU device. After inference is computed on the GPU device, predict must move the output tensor back to the CPU before returning from the function.
Consider using the ADS SDK to automatically generate a model artifact. The ADS SDK can automatically generate a GPU-enabled score.py file for the PyTorch and TensorFlow frameworks.
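As a minimal sketch, preparing a PyTorch model artifact with the ADS SDK might look like the following. The artifact directory and conda environment slug are illustrative, and the exact PyTorchModel API can vary between ADS versions:

import torchvision
from ads.model.framework.pytorch_model import PyTorchModel

# Instantiate the model to serialize; ADS generates score.py and runtime.yaml
# in the artifact directory (illustrative path).
model = torchvision.models.resnet152(pretrained=True)
pytorch_model = PyTorchModel(estimator=model, artifact_dir="./resnet152_artifact")
pytorch_model.prepare(
    inference_conda_env="pytorch110_p38_gpu_v1",  # GPU conda environment slug (illustrative)
    force_overwrite=True,
)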
The following score.py example uses an available GPU device to perform inference on a ResNet-152 PyTorch model:
import numpy as np
import os
import torch
import torchvision
from torch import nn
import io
from PIL import Image
import base64
from random import randint

Image.MAX_IMAGE_PIXELS = None

model_name = 'PyTorch_ResNet152.pth'

# get an available GPU device
def get_torch_device():
    num_devices = torch.cuda.device_count()
    if num_devices == 0:
        return "cpu"
    if num_devices == 1:
        return "cuda:0"
    else:
        return f"cuda:{randint(0, num_devices-1)}"

print("Device selected for inference", get_torch_device())
device = torch.device(get_torch_device())

def load_model(model_file_name=model_name):
    """
    Loads model from the serialized format

    Returns
    -------
    model: PyTorch model instance
    """
    print(f"Device {device}")
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    if model_file_name in contents:
        # instantiate the architecture, then load the serialized weights
        model = torchvision.models.resnet152()
        model.load_state_dict(
            torch.load(os.path.join(model_dir, model_file_name), map_location=device)
        )
        # move the model to the selected GPU device
        model = model.to(device)
        model.eval()
        print(f"Model loaded on {next(model.parameters()).device}")
        return model
    else:
        raise FileNotFoundError(f'{model_file_name} is not found in model directory {model_dir}.')

def predict(data, model=load_model()):
    """
    Returns prediction given the model and data to predict

    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format in json

    Returns
    -------
    predictions: Output from scoring server
        Format: {'prediction': output from model.predict method}
    """
    # decode the base64-encoded image and build the input tensor
    img_bytes = io.BytesIO(base64.b64decode(data.encode('utf-8')))
    image = Image.open(img_bytes).resize((224, 224))
    arr = np.array(image)
    X = torch.FloatTensor(np.transpose(arr, axes=(2, 0, 1))).unsqueeze(0)
    # move the input tensor to the GPU device
    X = X.to(device)
    with torch.no_grad():
        # run inference on the GPU, then move the output back to the CPU
        Y = model(X).to("cpu")
    pred = torch.nn.functional.softmax(Y[0], dim=0).argmax().item()
    return {'prediction': pred}
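For a quick local check of these handlers, you can base64 encode an image and call predict directly. This is a minimal sketch; the image path is illustrative and the image is assumed to be RGB:

import base64

with open("sample_image.jpg", "rb") as f:
    payload = base64.b64encode(f.read()).decode("utf-8")

# calls load_model() once through the default argument, then runs inference
print(predict(payload))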
Conda Environment with Model Runtime Dependencies
The runtime.yaml file in the model artifact must include a conda environment that includes the GPU dependencies.
In the following example, the runtime.yaml file instructs the model deployment to pull a published conda environment from the Object Storage path defined by the INFERENCE_ENV_PATH environment variable:
MODEL_ARTIFACT_VERSION: '3.0'
MODEL_DEPLOYMENT:
  INFERENCE_CONDA_ENV:
    INFERENCE_ENV_PATH: oci://service-conda-packs@id19sfcrra6z/service_pack/gpu/PyTorch_1.10_for_GPU_on_Python_3.8/1.0/pytorch110_p38_gpu_v1
    INFERENCE_ENV_SLUG: pytorch_pack_v1
    INFERENCE_ENV_TYPE: data_science
    INFERENCE_PYTHON_VERSION: 3.8
Create Model Deployment
After a GPU model artifact is created, you create a model deployment and select one of the supported GPU shapes.
Model deployment supports bring your own container as a runtime dependency if you use your own inference server. You must select one of the GPU shapes when creating the model deployment.
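Continuing the earlier ADS sketch, a deployment on a GPU shape might be created as follows. The display names and shape are illustrative, and the deploy parameters shown are assumptions that can differ between ADS versions:

# Minimal sketch: save the prepared artifact to the model catalog, then deploy
# it on a GPU shape (shape name is illustrative).
pytorch_model.save(display_name="resnet152-gpu")
deployment = pytorch_model.deploy(
    display_name="resnet152-gpu-deployment",
    deployment_instance_shape="VM.GPU3.1",  # one of the supported GPU shapes
)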
Model Replicas
When using a service-managed inference server, model deployments load several model replicas onto the available GPU cards to achieve better throughput. The number of model replicas is calculated based on:
- The size of the model.
- The memory available on the GPU card.
- The number of GPU cards.
- The logical CPU cores available on the Compute shape.
For example, suppose the model takes 2 GB of memory and the VM.GPU2.1 shape is selected, which has one GPU card with 16 GB of GPU memory. The model deployment allocates a percentage (around 70%) of the GPU memory to load models; the remaining memory is reserved for runtime computation during inference. The model deployment loads five replicas on one GPU card (16 GB * 0.7 / 2 GB model size in memory). If two cards are available, then a total of 10 model replicas are loaded, with five models on each card.
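The following snippet is an illustrative restatement of that calculation; the 70% fraction is approximate, not a configurable setting:

# Illustrative replica estimate matching the example above.
gpu_memory_gb = 16       # per-card GPU memory on VM.GPU2.1
usable_fraction = 0.7    # approximate share of GPU memory used for model weights
model_size_gb = 2
gpu_card_count = 1

replicas_per_card = int(gpu_memory_gb * usable_fraction / model_size_gb)  # 5
total_replicas = replicas_per_card * gpu_card_count                       # 5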
With automatically generated score.py files, the distribution of model replicas to GPU cards is based on a random algorithm, which statistically places a nearly equal number of replicas on each card. However, you can change the number of replicas by using the WEB_CONCURRENCY application environment variable.
Using Triton for GPU Inference
NVIDIA Triton Inference Server streamlines and standardizes AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU or CPU-based infrastructure.
Set up an NVIDIA Triton Inference Server to use with model deployments.
To enable GPU inference, the config.pbtxt file must contain KIND_GPU in its instance group settings. The instance group configuration can also specify the number of model replicas to load on the GPU device.
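For example, an instance_group entry in config.pbtxt that loads two replicas of the model on GPU 0 might look like the following sketch; the model name and backend are illustrative:
name: "resnet152"
backend: "pytorch"
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]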