Large Model Support

Data Science Model Deployment and Model Catalog services now support large model deployments.

Large model artifacts can be stored in the Model Catalog service and used to create model deployments. The endpoint mapping feature lets you integrate inference containers, such as Text Generation Inference (TGI), even if they don't comply with the standard API contracts for the /predict and /health endpoints.
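For example, when deploying with a custom container, the endpoint mapping can be expressed in the container's environment configuration. The following is a minimal sketch with the OCI Python SDK; the image path and ports are placeholders, and the MODEL_DEPLOY_PREDICT_ENDPOINT and MODEL_DEPLOY_HEALTH_ENDPOINT variables are the ones commonly used in OCI BYOC examples, so verify them against the current documentation before relying on them.

from oci.data_science.models import OcirModelDeploymentEnvironmentConfigurationDetails

# BYOC container configuration for a TGI image whose inference route is
# /generate rather than /predict. The environment variables map the
# container's routes onto the endpoints Model Deployment expects.
container_config = OcirModelDeploymentEnvironmentConfigurationDetails(
    image="<region>.ocir.io/<tenancy-namespace>/text-generation-inference:latest",  # placeholder image
    server_port=8080,
    health_check_port=8080,
    environment_variables={
        "MODEL_DEPLOY_PREDICT_ENDPOINT": "/generate",
        "MODEL_DEPLOY_HEALTH_ENDPOINT": "/health",
    },
)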

Creating a Model Deployment for Large Models

Model deployment supports Bring Your Own Container (BYOC): build a custom container and use it as the runtime dependency when you create a model deployment. With custom containers, you can package system and language dependencies, install and configure inference servers, and set up different language runtimes, all within the defined boundaries of the interface that the model deployment resource uses to run the containers. Because containers are portable between environments, BYOC also makes it straightforward to migrate and deploy applications to OCI.
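A minimal sketch of wiring a BYOC container into a model deployment with the OCI Python SDK follows. The OCIDs and the GPU shape name are placeholders, and container_config refers to the endpoint-mapping object built in the earlier sketch; adapt all of these to your tenancy.

import oci
from oci.data_science import DataScienceClient
from oci.data_science.models import (
    CreateModelDeploymentDetails,
    FixedSizeScalingPolicy,
    InstanceConfiguration,
    ModelConfigurationDetails,
    SingleModelDeploymentConfigurationDetails,
)

client = DataScienceClient(oci.config.from_file())

# Point the deployment at a cataloged (large) model and a GPU shape.
model_config = ModelConfigurationDetails(
    model_id="<large-model-ocid>",
    instance_configuration=InstanceConfiguration(instance_shape_name="VM.GPU.A10.1"),
    scaling_policy=FixedSizeScalingPolicy(instance_count=1),
)

# Combine the model configuration with the BYOC container configuration
# (container_config is the endpoint-mapping object from the earlier sketch).
deployment_config = SingleModelDeploymentConfigurationDetails(
    model_configuration_details=model_config,
    environment_configuration_details=container_config,
)

deployment = client.create_model_deployment(
    CreateModelDeploymentDetails(
        display_name="large-model-byoc",
        project_id="<project-ocid>",
        compartment_id="<compartment-ocid>",
        model_deployment_configuration_details=deployment_config,
    )
)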

Changes to the Model Catalog

Create a model and save it to the model catalog by using the ADS SDK, the OCI Python SDK, or the Console. For more information, see Creating and Saving a Model to the Model Catalog and Large Model Artifacts. Large model cataloging uses the same export feature to save models in the model catalog, so the user experience is no different from the documented behavior.
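For instance, with the ADS SDK a large artifact is saved the same way as a small one; the sketch below assumes a bucket_uri staging location in Object Storage for the export, my_estimator stands in for any trained model object, and the other values are placeholders.

import tempfile
from ads.model.generic_model import GenericModel

# Wrap a trained estimator, prepare the artifact locally, then save it to
# the model catalog. For large artifacts, ADS exports through the Object
# Storage location given by bucket_uri.
model = GenericModel(estimator=my_estimator, artifact_dir=tempfile.mkdtemp())
model.prepare(
    inference_conda_env="<inference-conda-env-path>",
    force_overwrite=True,
)

model_id = model.save(
    display_name="my-large-model",
    bucket_uri="oci://<bucket-name>@<namespace>/model-artifacts/",
    remove_existing_artifact=True,
)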

Deployment of Large Models

Model Deployment is designed to support an array of machine learning inference frameworks, catering to the diverse needs of large model deployments. Among these, OCI supports Text Generation Inference (TGI), the NVIDIA Triton Inference Server, and virtual large language models (vLLM) for Large Language Models (LLMs), so you can select the framework that best fits your deployment requirements.

TGI's integration with OCI supports customized container use, enabling precise environment setups tailored to specific model behaviors and dependencies.

For models that require intensive computational resources, especially deep learning workloads, the NVIDIA Triton Inference Server offers a streamlined path on OCI. It helps with the efficient management of GPU resources and supports a broad spectrum of machine learning frameworks such as TensorFlow, PyTorch, and ONNX.

OCI's handling of vLLM and NVIDIA Triton with TensorRT-LLM provides specialized optimizations for large language models. These frameworks benefit from advanced optimization techniques, such as layer fusion and precision calibration, which are crucial for the very large computational demands of large-scale language processing tasks. By deploying these frameworks on OCI, you get high-throughput, low-latency inference, which suits applications that require real-time language understanding and generation. A short invocation sketch appears next, and more information on the deployment of each option follows:
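The sketch below invokes a deployed endpoint with a signed HTTP request, using the OCI Python SDK signer together with the requests library. The endpoint URI and the JSON payload shape (an inputs string plus parameters, as TGI-style servers accept) are assumptions to adapt to your deployment and framework.

import oci
import requests

# Sign requests with the credentials from the default OCI config file.
config = oci.config.from_file()
signer = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

# Placeholder invoke URI for the model deployment's /predict endpoint.
endpoint = "https://modeldeployment.<region>.oci.customer-oci.com/<deployment-ocid>/predict"

# TGI-style generation request; vLLM or Triton containers expect their own schemas.
payload = {"inputs": "What is machine learning?", "parameters": {"max_new_tokens": 64}}

response = requests.post(endpoint, json=payload, auth=signer)
print(response.json())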
