Supporting Large Artifacts in the Model Catalog

The maximum size of a model artifact is 6 GB.

The Console option for uploading models supports model artifacts only up to 100 MB in size. To upload larger model artifacts, the following examples use Python and ADS. Large model artifacts are supported by copying the artifact from an Object Storage bucket to the model catalog service bucket.

Preliminary Steps for Using ADS

First, import the required libraries and define the utility methods that the example needs:

import os
import warnings

import numpy as np
import oci
import pandas as pd

from numpy import array
from sklearn.datasets import make_classification

import ads
from ads.common.model_metadata import UseCaseType
from ads.model.generic_model import GenericModel

# Authenticate with resource principals from within a notebook session.
ads.set_auth("resource_principal")
warnings.filterwarnings('ignore')

# Alternatively, authenticate with an API key:
# ads.set_auth("api_key")
# ads.set_debug_mode(False)
# auth = {"config": oci.config.from_file(os.path.join("~/.oci", "config"))}

class Size:
    # Row counts that, with 200 features, produce CSV files of roughly
    # 20 MB, 200 MB, and 2 GB.
    MB_20 = 6000
    MB_200 = 60000
    MB_2000 = 600000

def generate_large_csv(size: int = Size.MB_20, file_path: str = "./large_csv_file.csv"):
    # Generate a synthetic classification dataset and write it to a CSV file.
    X_big, y_big = make_classification(n_samples=size, n_features=200)
    df_big = pd.concat([pd.DataFrame(X_big), pd.DataFrame(y_big)], axis=1)
    df_big.to_csv(file_path)

Next, create a dummy model to use in this example and populate its artifact directory with a large CSV file. This example uses a 20 MB file, although artifacts up to 6 GB work.

class Square:
    # A toy model that squares its input.
    def predict(self, x):
        x_array = np.array(x)
        return (x_array * x_array).tolist()

model = Square()

# A small, arbitrary sample input used to document the model's input schema.
X = [[1, 2], [3, 4]]

artifact_dir = "./large_artifact/"

generic_model = GenericModel(
    estimator=model,
    artifact_dir=artifact_dir
)
generic_model.prepare(
    inference_conda_env="dataexpl_p37_cpu_v3",
    training_conda_env="dataexpl_p37_cpu_v3",
    use_case_type=UseCaseType.MULTINOMIAL_CLASSIFICATION,
    X_sample=X,
    y_sample=array(X) ** 2,
    force_overwrite=True
)

# Add a large CSV file to the artifact directory to grow the artifact size.
generate_large_csv(Size.MB_20, file_path=os.path.join(artifact_dir, "large_csv_file.csv"))

Saving a Large Model to the Model Catalog

You must have an Object Storage bucket to support models larger than 2 GB. You can create a bucket in the Console, as described in the following steps, or with the OCI API, as shown in the sketch after the steps.

Create an Object Storage bucket from the Console:

  1. Sign in to the Console.
  2. Open the navigation menu and click Storage. Under Object Storage & Archive Storage, click Buckets.
  3. Under List Scope, choose a Compartment.
  4. Click Create Bucket.

    Complete the following fields in the Create Bucket form.

    • Bucket Name: Enter a bucket name (for example, my-bucket-name).
    • Default Storage Tier: Select Standard.

      Don't select the following options:

      • Enable Auto-Tiering
      • Enable Object Versioning
      • Emit Object Events
      • Uncommitted Multipart Uploads Cleanup
    • Encryption: Select Encrypt using Oracle managed keys
  5. Click Create. The bucket is created.
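
If you prefer the OCI API, the following is a minimal sketch that creates a bucket with the OCI Python SDK. It assumes resource principal authentication (as in the ADS example above) and uses a placeholder for the compartment OCID:

import oci

# Authenticate with resource principals from within a notebook session.
signer = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

# The Object Storage namespace of the tenancy.
namespace = object_storage.get_namespace().data

# <compartment_ocid> is a placeholder; replace it with the target compartment OCID.
bucket_details = oci.object_storage.models.CreateBucketDetails(
    name="my-bucket-name",
    compartment_id="<compartment_ocid>",
    storage_tier="Standard",   # the Standard default storage tier
    versioning="Disabled",     # Object Versioning stays disabled
)
object_storage.create_bucket(namespace, bucket_details)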

Construct the Bucket URI

The bucket URI isn't listed on the bucket details page in the Console, so you must construct the URI yourself:

  1. Use the following template to create the bucket URI:

    oci://<bucket_name>@<namespace>/<objects_folder>/.

    Replace <bucket_name> with the name of the bucket you created (for example, my-bucket-name). For <namespace>, use the Object Storage namespace of your tenancy (for example, my-tenancy). For <objects_folder>, use a folder name such as my-object-folder.

    With the provided data, the bucket_uri would be: oci://my-bucket-name@my-tenancy/my-object-folder/ (see the short Python sketch after these steps).

  2. To upload large model artifacts, you must add two extra parameters to the GenericModel.save(...) method:
    • bucket_uri: (str, optional) Defaults to None.

      The Object Storage URI where the model artifact is temporarily copied to.

      The bucket_uri is only necessary for uploading large artifacts greater than 2 GB in size, though the method works with small artifacts as well. For example:

      oci://<bucket_name>@<namespace>/prefix/.

    • remove_existing_artifact: (bool, optional) Defaults to True.

      Whether the artifacts copied to the Object Storage bucket (bucket_uri) are removed after a successful upload.

  3. The save method copies the model artifact from the notebook session to the bucket (bucket_uri).
  4. Next, it copies the artifact from the bucket (bucket_uri) to the model catalog service bucket.

    If the artifact size is greater than 2 GB and bucket_uri isn't provided, an error occurs.

    By default, the remove_existing_artifact attribute is set to True. The artifact is automatically removed from the bucket (bucket_uri) after a successful upload to the service bucket. If you don't want to remove the artifact from the bucket, set remove_existing_artifact=False.
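
As a sketch of step 1, the bucket URI can be assembled from its parts in Python, using the example values above:

bucket_name = "my-bucket-name"
namespace = "my-tenancy"
objects_folder = "my-object-folder"

# Assemble the URI from its parts.
bucket_uri = f"oci://{bucket_name}@{namespace}/{objects_folder}/"
print(bucket_uri)  # oci://my-bucket-name@my-tenancy/my-object-folder/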

To summarize, the process is:

  1. Prepare model artifacts.
  2. Save base information about the model to the model catalog.
  3. Upload the model artifacts to an Object Storage bucket (bucket_uri).
  4. Upload model artifacts from a bucket to the model catalog service bucket.
  5. Remove temporary artifacts from a bucket based on the remove_existing_artifact parameter:
    large_model_id = generic_model.save(
        display_name='Generic Model With Large Artifact',
        bucket_uri="oci://<bucket_name>@<namespace>/<objects_folder>/",  # replace with your bucket URI
        remove_existing_artifact=True
    )
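
The save() call returns the OCID of the saved model; the loading example in the next section passes this large_model_id to GenericModel.from_model_catalog(...).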

Loading a Large Model from the Model Catalog

To load models larger than 2 GB, add two extra parameters to the GenericModel.from_model_catalog(...) method:

  • bucket_uri: (str, optional) Defaults to None.

    The Object Storage URI where the model artifacts are temporarily copied to. The bucket_uri is only necessary for downloading large artifacts greater than 2 GB in size, though the method works with small artifacts as well. For example: oci://<bucket_name>@<namespace>/prefix/.

  • remove_existing_artifact: (bool, optional) Defaults to True.

    Whether the temporary artifacts copied to the Object Storage bucket (bucket_uri) are removed after a successful download.

To summarize, the process is:

  1. Download the model artifacts from the model catalog service bucket to your Object Storage bucket (bucket_uri).
  2. Download the model artifacts from the bucket to the notebook session.
  3. Remove the temporary artifacts from the bucket based on the remove_existing_artifact parameter.
  4. Load the base information about the model from the model catalog:

    large_model = GenericModel.from_model_catalog(
        large_model_id,
        "model.pkl",
        "./downloaded_large_artifact/",
        bucket_uri="oci://<bucket_name>@<namespace>/<objects_folder>/",  # replace with your bucket URI
        force_overwrite=True,
        remove_existing_artifact=True
    )
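
To confirm that the reloaded model works, you can run a quick local check. This sketch assumes the Square model from this example; verify() runs a prediction locally against the downloaded artifact:

# Run a local prediction through the reloaded artifact's score.py.
large_model.verify([[2, 3]])
# For the Square model in this example, the prediction is [[4, 9]].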