Model Deployments BYOC
Troubleshoot BYOC model deployments.
Unable to Access the Container Image
When you create, update, or activate a model deployment, Data Science verifies that an authorized path to access the container image exists within the tenancy. If the verification fails, the cause could be missing resource principal policies, an incorrect image path, or an image that doesn't exist. Ensure that the policies, path, and image specified are correct and try again.
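As a quick local check, pulling the image from OCI Registry confirms that the path is correct and the image exists; it doesn't validate the resource principal policies that the service uses at deployment time. The following sketch assumes Docker is installed and you're already logged in to the registry; the image path is a placeholder.

```python
# Sanity-check the image path before creating the deployment (local check only;
# it does not verify the resource principal policies used by the service).
import subprocess
import sys

IMAGE = "<region>.ocir.io/<namespace>/<repository>/<image>:<tag>"  # placeholder path

result = subprocess.run(["docker", "pull", IMAGE], capture_output=True, text=True)
if result.returncode != 0:
    print("Pull failed - check the image path, tag, and registry credentials:")
    print(result.stderr)
    sys.exit(1)
print("Image found:", IMAGE)
```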
Container Image Download Timeout
Each model deployment resource pulls the specified container image from OCI Registry to the deployment Compute instance, where it's then run as a container for inferencing. The image download must complete within 20 minutes, so the image size must be within 16 GB. If the image is too large, or a temporary service downtime occurs in the registry, the operation might time out. If the image is larger than this limit, consider removing any unnecessary dependencies to reduce the size, and then try the deployment creation again.
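To catch an oversized image before pushing it, you can compare the local image size against the 16 GB limit; a minimal sketch, assuming the image is built locally (the tag is a placeholder):

```python
# Check the local image size against the 16 GB model deployment limit.
import json
import subprocess

IMAGE = "my-inference-image:latest"  # placeholder local tag
LIMIT_BYTES = 16 * 1024**3           # 16 GB

info = json.loads(subprocess.check_output(["docker", "image", "inspect", IMAGE]))
size_bytes = info[0]["Size"]
print(f"Image size: {size_bytes / 1024**3:.2f} GB")
if size_bytes > LIMIT_BYTES:
    print("Image exceeds 16 GB - remove unnecessary dependencies and rebuild.")
```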
Container Run Timeout
When deploying a model, the container image is transferred from the tenancy to the Data Science service tenancy and used to run the model as a container for inferencing. The container has a defined timeout of 10 minutes to run, so it's crucial to ensure that the inference serving container starts within this time frame.
Before deployment, it's important to validate the container locally, and test that both the /predict and /health calls are successful.
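A minimal local smoke test might look like the following sketch; the port (8080) and the request payload are assumptions and should match what the serving container actually exposes.

```python
# Smoke test a locally running BYOC container, for example started with:
#   docker run --rm -p 8080:8080 my-inference-image:latest
import requests

BASE = "http://localhost:8080"  # assumed port

health = requests.get(f"{BASE}/health", timeout=10)
print("health:", health.status_code)  # expect 200

payload = {"data": [[1.0, 2.0, 3.0]]}  # hypothetical model input
predict = requests.post(f"{BASE}/predict", json=payload, timeout=30)
print("predict:", predict.status_code, predict.text)  # expect 200 and a prediction
```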
During deployment, it's also crucial to validate that no errors occur while the container runs, or during the predict and health check calls. Also, ensure that egress is enabled during model deployment resource creation if the inferencing logic running inside the container needs to access the internet. Failing to do so can result in a model bootstrap failure. To test this scenario, try disabling the internet during local testing.
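One way to reproduce the no-egress scenario locally is to start the container with Docker's --network none option (loopback only, no internet access) and confirm that it still bootstraps and answers the health check from inside the container. The image name, wait time, and port below are assumptions.

```python
# Start the container without network access and probe /health from inside it.
import subprocess
import time

IMAGE = "my-inference-image:latest"  # placeholder local tag

container_id = subprocess.check_output(
    ["docker", "run", "-d", "--network", "none", IMAGE], text=True
).strip()
try:
    time.sleep(30)  # give the web server time to bootstrap
    # curl is expected inside the image (it's also required for the Docker HEALTHCHECK)
    subprocess.run(
        ["docker", "exec", container_id, "curl", "-sf", "http://localhost:8080/health"],
        check=True,
    )
    print("Container bootstraps and serves /health without internet access.")
finally:
    subprocess.run(["docker", "rm", "-f", container_id], stdout=subprocess.DEVNULL)
```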
Ensure that enough memory is allocated for loading the model and running inference to avoid out-of-memory issues.
Review the BYOC best practices and Test the Container for more information.
Unable to Start the Container
There can be several reasons for a container failing to start. To address this, it's best to identify and fix the failure during the local testing phase. Following are some possible reasons and fixes:
- The container image must have the curl package installed for the Docker HEALTHCHECK policy to succeed. If this package is missing, the container fails to start.
- The Docker CMD or Entrypoint command line parameters must be provided, either through the API or the Dockerfile, to bootstrap the web server (see the sketch after this list). If these parameters are invalid, the container fails to start.
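For illustration only, the following is a minimal sketch of a web server that a Docker CMD could bootstrap, exposing the /health and /predict routes. The framework (Flask), file name, port, and model artifact name are assumptions, not service requirements; adjust them to your own container.

```python
# app.py - illustrative inference server that a Docker CMD could bootstrap,
# for example: CMD ["python", "app.py"]
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

MODEL_DIR = "/opt/ds/model/deployed_model"  # artifact mount point inside the container
_model = None                               # loaded lazily so /health responds quickly


def get_model():
    global _model
    if _model is None:
        # model.pkl is a hypothetical artifact name - use whatever your artifact contains
        with open(f"{MODEL_DIR}/model.pkl", "rb") as f:
            _model = pickle.load(f)
    return _model


@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok"}), 200


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    prediction = get_model().predict(payload["data"])  # hypothetical sklearn-style model
    return jsonify({"prediction": [float(p) for p in prediction]}), 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # assumed port
```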
Unable to Access the Model
During bootstrap, the deployment's Compute instance unzips the model artifact and mounts the files to the /opt/ds/model/deployed_model directory inside the running container in read-only mode.
Any files zipped from this path are used in the scoring logic. Zipping a set of files (including the ML model and scoring logic) directly, or zipping a folder that contains those files, results in a different location path to the ML model inside the container.
Ensure that the correct path is used when loading the model in the scoring logic.
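One way to keep the scoring logic robust to either artifact layout is to resolve the model file at startup instead of hard-coding a nested path; a minimal sketch, assuming a hypothetical model.pkl file:

```python
# Locate the model file under the read-only artifact mount, whether the artifact
# was zipped as a set of files (files sit directly under the mount point) or as
# a folder (files sit one level deeper).
from pathlib import Path

MOUNT = Path("/opt/ds/model/deployed_model")


def find_model_file(name: str = "model.pkl") -> Path:
    direct = MOUNT / name
    if direct.exists():
        return direct                    # files were zipped directly
    matches = list(MOUNT.rglob(name))    # a folder was zipped; search subdirectories
    if matches:
        return matches[0]
    raise FileNotFoundError(f"{name} not found under {MOUNT}")
```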