Preparing Model Metadata

Model metadata is optional though recommended.

Model Provenance Metadata

You can document the model provenance. This is optional. The following are the supported model provenance metadata fields:

git_branch: Branch of the Git repository.

git_commit: Commit ID.

repository_url: URL of the remote Git repository.

script_dir: Local path to the artifact directory.

training_id: OCID of the resource used to train the model (notebook session or job run).

You can use these environment variables when you save a model with the OCI SDK:

  • NB_SESSION_OCID

Example

from oci.data_science.models import CreateModelProvenanceDetails

provenance_details = CreateModelProvenanceDetails(
    repository_url="EXAMPLE-repositoryUrl-Value",
    git_branch="EXAMPLE-gitBranch-Value",
    git_commit="EXAMPLE-gitCommit-Value",
    script_dir="EXAMPLE-scriptDir-Value",
    # OCID of the job run or notebook session on which this model was trained
    training_id="<<notebook session or job run OCID>>"
)
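
For example, when you save the model from within a notebook session, you can read the session OCID from the NB_SESSION_OCID environment variable instead of hard-coding it. This is a minimal sketch using only the standard library:

import os

# In a notebook session, NB_SESSION_OCID holds the OCID of the current session,
# so it can be used directly as the training_id value:
training_id = os.environ.get("NB_SESSION_OCID")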

Model Taxonomy Metadata

You can document the model taxonomy. This is optional.

The metadata fields associated with model taxonomy let you describe the machine learning use case and framework behind the model. For defined metadata, the allowed values for the use case type and framework are fixed; for custom metadata, the allowed category values are fixed.

Preset Model Taxonomy

The following table lists the supported model taxonomy metadata:

UseCaseType: Describes the machine learning use case associated with the model, using one of the listed values:

  • binary_classification
  • regression
  • multinomial_classification
  • clustering
  • recommender
  • dimensionality_reduction/representation
  • time_series_forecasting
  • anomaly_detection
  • topic_modeling
  • ner
  • sentiment_analysis
  • image_classification
  • object_localization
  • other

Framework: The machine learning framework associated with the model, using one of the listed values:

  • scikit-learn
  • xgboost
  • tensorflow
  • pytorch
  • mxnet
  • keras
  • lightGBM
  • pymc3
  • pyOD
  • spacy
  • prophet
  • sktime
  • statsmodels
  • cuml
  • oracle_automl
  • h2o
  • transformers
  • nltk
  • emcee
  • pystan
  • bert
  • gensim
  • flair
  • word2vec
  • ensemble (more than one library)
  • other

FrameworkVersion: The machine learning framework version. This is a free text value. For example, PyTorch 1.9.

Algorithm: The algorithm or model instance class. This is a free text value. For example, CART algorithm.

Hyperparameters: The hyperparameters of the model object. This is in JSON format.

ArtifactTestResults: The JSON output of the artifact tests run on the client side.

Example

This example shows how to document the model taxonomy by capturing each key-value pair in a list of Metadata() objects:

from oci.data_science.models import Metadata

# Create the list of defined metadata around model taxonomy:
defined_metadata_list = [
    Metadata(key="UseCaseType", value="image_classification"),
    Metadata(key="Framework", value="keras"),
    Metadata(key="FrameworkVersion", value="0.2.0"),
    Metadata(key="Algorithm", value="ResNet"),
    Metadata(key="hyperparameters", value="{\"max_depth\":\"5\",\"learning_rate\":\"0.08\",\"objective\":\"gradient descent\"}")
]
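
Because the hyperparameters value must be a JSON string, you may find it easier to build it with json.dumps instead of escaping quotes by hand. A small convenience sketch, equivalent to the escaped string above:

import json

from oci.data_science.models import Metadata

# Build the JSON string programmatically rather than escaping it manually.
hyperparameters = {"max_depth": "5", "learning_rate": "0.08", "objective": "gradient descent"}
hyperparameters_metadata = Metadata(key="hyperparameters", value=json.dumps(hyperparameters))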

Custom Model Taxonomy

You can add your own custom metadata to document your model. The maximum allowed file size for the combined defined and custom metadata is 32000 bytes.

Each custom metadata entry has these four attributes:

key (Required): The key and label of your custom metadata.

value (Required): The value attached to the key.

category (Optional): The category of the metadata. Select one of these five values:

  • Performance
  • Training Profile
  • Training and Validation Datasets
  • Training Environment
  • other

The category attribute is useful for filtering custom metadata, which is handy when a model has a large number of custom metadata entries.

description (Optional): A description of the custom metadata.

Example

This example shows how you can add custom metadata to capture the model accuracy, the environment, and the source of the training data:

# Adding your own custom metadata:
custom_metadata_list = [
    Metadata(key="Image Accuracy Limit", value="70-90%", category="Performance",
             description="Performance accuracy accepted"),
    Metadata(key="Pre-trained environment",
             value="https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/",
             category="Training environment", description="Environment link for pre-trained model"),
    Metadata(key="Image Sourcing", value="https://lionbridge.ai/services/image-data/", category="other",
             description="Source for image training data")
]
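
Once both lists are built, they can be attached to the model when it's created with the OCI SDK. The following is a minimal sketch; the OCIDs, display name, and authentication setup are placeholders to replace with your own values:

import oci
from oci.data_science.models import CreateModelDetails

# Assumes API-key authentication from the default OCI config file; adjust as needed.
config = oci.config.from_file()
data_science_client = oci.data_science.DataScienceClient(config)

create_model_details = CreateModelDetails(
    compartment_id="<compartment-OCID>",
    project_id="<project-OCID>",
    display_name="my-model",
    defined_metadata_list=defined_metadata_list,
    custom_metadata_list=custom_metadata_list,
)
model = data_science_client.create_model(create_model_details).data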

Model Data Schemas Definition

You can document the model input and output data schemas. The input data schema definition provides the blueprint of the data parameter of the score.py file predict() function. You can think of the input data schema as the definition of the input feature vector that your model requires to make successful predictions. The output schema definition documents what the predict() function returns.
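
For context, this is a minimal, hypothetical score.py sketch; the model file name and the shape of the returned dictionary are illustrative only. The input data schema documents the data argument of predict(), and the output data schema documents its return value:

# score.py (illustrative sketch only)
import pickle

def load_model():
    # Load and return the serialized model from the artifact directory.
    with open("model.pkl", "rb") as f:
        return pickle.load(f)

def predict(data, model=load_model()):
    # `data` is the input feature vector described by the input data schema.
    # The returned dictionary is what the output data schema describes.
    return {"prediction": model.predict(data).tolist()}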

Important

The maximum allowed file size for the combined input and output schemas is 32000 bytes.

The schema definitions for both the input feature vector and the model predictions are used for documentation purposes. This guideline applies to tabular datasets only.

The schema of the model input feature vector and output predictions is a JSON object. The object has a top-level key called schema whose value is a list, and the schema definition of each column is a separate entry in that list.

Tip

You can use ADS to automatically extract the schema definition from a given training dataset.
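
A minimal sketch of that workflow, assuming a recent ADS release and a fitted scikit-learn estimator; the class name, attribute names, and conda environment slug are assumptions that may differ in your environment:

from ads.model.framework.sklearn_model import SklearnModel

# clf is a fitted scikit-learn estimator; X_train/y_train are pandas objects.
# ADS infers the input and output schemas from the small samples passed in.
sklearn_model = SklearnModel(estimator=clf, artifact_dir="./model_artifact")
sklearn_model.prepare(
    inference_conda_env="generalml_p38_cpu_v1",  # example slug, adjust as needed
    X_sample=X_train.head(5),
    y_sample=y_train.head(5),
)
print(sklearn_model.schema_input)   # schema inferred from X_sample
print(sklearn_model.schema_output)  # schema inferred from y_sample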

For each column, the schema can be fully defined by assigning values to all these attributes:

name (STRING, Required): The name of the column.

description (STRING, Optional): The description of the column.

required (BOOL, Required): Whether or not the column is a required input feature to make a model prediction.

dtype (STRING, Required): The data type of the column.

domain (OBJECT, Optional): The range of allowed values that the feature can take.

The domain field is a dictionary containing the following keys:

domain.constraints (LIST, Optional): A list of predicates that constrain the range of allowed values for the feature. You can input a language-specific string expression template, which can be evaluated by the language interpreter and compiler. With Python, the string format is expected to follow string.Template. Constraints are expressed as a list of expressions, for example constraints=[Expression('$x > 5')], and you can apply more than one constraint.

Example of an expression:

  schema:
        - description: Id
          domain:
            constraints: []
            stats:
              25%: 365.75
              50%: 730.5
              75%: 1095.25
              count: 1460.0
              max: 1460.0
              mean: 730.5
              min: 1.0
              std: 421.6100093688479
            values: Discrete numbers
          name: Id
          required: false
          type: int64
        - description: MSSubClass
          domain:
            constraints: []
            stats:
              25%: 20.0
              50%: 50.0
              75%: 70.0
              count: 1460.0
              max: 190.0
              mean: 56.897260273972606
              min: 20.0
              std: 42.300570993810425
            values: Discrete numbers
          name: MSSubClass
          required: false
          type: int64
        - description: MSZoning
          domain:
            constraints:
            - expression: '$x in ["RL", "RM", "C (all)", "FV", "RH"]'
            stats:
              count: 1460
              unique: 5
            values: Category
          name: MSZoning
          required: false
          type: category
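
To illustrate how such a constraint expression can be evaluated, here is a small, purely illustrative sketch using only the Python standard library; it substitutes a candidate value into the $x template and evaluates the resulting predicate:

from string import Template

expression = '$x in ["RL", "RM", "C (all)", "FV", "RH"]'

# Substitute the candidate value into the template, then evaluate the predicate.
candidate = "RM"
predicate = Template(expression).substitute(x=repr(candidate))
print(eval(predicate))  # True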
domain.stats (OBJECT, Optional): A dictionary of summary statistics describing the feature.

For float64 and int64 types:

  • X% (where X is a percentile value between 1 and 99; more than one percentile value can be captured)
  • count
  • max
  • mean
  • median
  • min
  • std

For category types:

  • count
  • unique
  • mode

In ADS, the statistics are automatically generated based on the feature_stat in feature types.
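
The numeric statistics shown in the examples above follow the shape of the pandas describe() output, so a dictionary suitable for domain.stats can be produced like this (a small sketch assuming a pandas DataFrame named df):

import pandas as pd

df = pd.DataFrame({"MSSubClass": [20, 50, 70, 190, 20, 60]})

# describe() returns count, mean, std, min, the 25/50/75 percentiles, and max.
stats = df["MSSubClass"].describe().to_dict()
print(stats)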

domain.values (STRING, Optional): Represents the semantic type of the column. Supported values include:

  • discrete numbers
  • numbers
  • Category
  • free text

domain.name (STRING, Optional): Name of the attribute.

domain.dtype (STRING, Required): The Pandas data type of the data. For example:

  • int64
  • float
  • category
  • datetime

domain.feature_type (STRING, Required): The feature type of the data. For example:

  • Category
  • Integer
  • LatLong

Example of an Input Data Schema

schema:
- description: Description of the column
  domain:
    constraints:
    - expression: '($x > 10 and $x < 100) or ($x < -1 and $x > -500)' # The user can input a language-specific string expression template, which can be evaluated by the language interpreter/compiler. With Python, the string format is expected to follow the string.Template format.
      language: python
    stats:  # This section is a flexible key-value pair. The stats depend on what the user wants to save. By default, the stats are automatically generated based on the `feature_stat` in feature types.
      mean: 20
      median: 21
      min: 5
    values: numbers # The key idea is to communicate the domain of acceptable values, for example rational numbers, discrete numbers, a list of values, and so on.
  name: MSZoning # Name of the attribute
  required: false # Whether the column is required (false means the column is nullable)

Example of an Output Data Schema

{
  "predictionschema": [
    {
      "description": "Category of SR",
      "domain": {
        "constraints": [],
        "stats": [],
        "values": "Free text"
      },
      "name": "category",
      "required": true,
      "type": "category"
    }
  ]
}
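
When the model is created with the OCI SDK, the input and output schemas are passed as JSON strings. A hedged sketch reusing the output schema above; the input_schema and output_schema field names are assumed from the CreateModelDetails model, and the OCIDs are placeholders:

import json
from oci.data_science.models import CreateModelDetails

output_schema = {
    "predictionschema": [
        {
            "description": "Category of SR",
            "domain": {"constraints": [], "stats": [], "values": "Free text"},
            "name": "category",
            "required": True,
            "type": "category",
        }
    ]
}

create_model_details = CreateModelDetails(
    compartment_id="<compartment-OCID>",
    project_id="<project-OCID>",
    display_name="my-model",
    output_schema=json.dumps(output_schema),
    # input_schema=json.dumps(input_schema),  # built the same way
)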

Model Introspection Testing

  1. Copy the artifact_introspection_test folder into the top-level directory of your model artifact.
  2. Install a Python version greater than 3.5.
  3. Install the pyyaml and requests Python libraries. This installation is only required once.
  4. Go to your artifact directory and install the artifact introspection tests.
    python3 -m pip install --user -r artifact_introspection_test/requirements.txt
  5. Set the artifact path and run the introspection test.
    python3 artifact_introspection_test/model_artifact_validate.py --artifact <path-to-artifact-directory>

    The introspection tests generate local test_json_output.json and test_json_output.html files. This is an example of the introspection test results in JSON format:

    {
        "score_py": {
            "category": "Mandatory Files Check",
            "description": "Check that the file \"score.py\" exists and is in the top level directory of the artifact directory",
            "error_msg": "File score.py is not present.",
            "success": true
        },
        "runtime_yaml": {
            "category": "Mandatory Files Check",
            "description": "Check that the file \"runtime.yaml\" exists and is in the top level directory of the artifact directory",
            "error_msg": "File runtime.yaml is not present.",
            "success": true
        },
        "score_syntax": {
            "category": "score.py",
            "description": "Check for Python syntax errors",
            "error_msg": "Syntax error in score.py: ",
            "success": true
        },
        "score_load_model": {
            "category": "score.py",
            "description": "Check that load_model() is defined",
            "error_msg": "Function load_model is not present in score.py.",
            "success": true
        },
        "score_predict": {
            "category": "score.py",
            "description": "Check that predict() is defined",
            "error_msg": "Function predict is not present in score.py.",
            "success": true
        },
        "score_predict_data": {
            "category": "score.py",
            "description": "Check that the only required argument for predict() is named \"data\"",
            "error_msg": "Function predict in score.py should have argument named \"data\".",
            "success": true
        },
        "score_predict_arg": {
            "category": "score.py",
            "description": "Check that all other arguments in predict() are optional and have default values",
            "error_msg": "All other arguments in predict function in score.py should have default values.",
            "success": true
        },
        "runtime_version": {
            "category": "runtime.yaml",
            "description": "Check that field MODEL_ARTIFACT_VERSION is set to 3.0",
            "error_msg": "In runtime.yaml field MODEL_ARTIFACT_VERSION should be set to 3.0",
            "success": true
        },
        "runtime_env_type": {
            "category": "conda_env",
            "description": "Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE is set to a value in (published, data_science)",
            "error_msg": "In runtime.yaml field MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE should be set to a value in (published, data_science)",
            "success": true,
            "value": "published"
        },
        "runtime_env_slug": {
            "category": "conda_env",
            "description": "Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_slug is set",
            "error_msg": "In runtime.yaml field MODEL_DEPLOYMENT.INFERENCE_ENV_slug should be set.",
            "success": true,
            "value": "mlgpuv1"
        },
        "runtime_env_path": {
            "category": "conda_env",
            "description": "Check that field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is set",
            "error_msg": "In runtime.yaml field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH should be set.",
            "success": true,
            "value": "oci://service_conda_packs@ociodscdev/service_pack/gpu/General Machine Learning for GPUs/1.0/mlgpuv1"
        },
        "runtime_path_exist": {
            "category": "conda_env",
            "description": "If MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE is data_science and MODEL_DEPLOYMENT.INFERENCE_ENV_slug is set, check that the file path in MODEL_DEPLOYMENT.INFERENCE_ENV_PATH is correct.",
            "error_msg": "In runtime.yaml field MODEL_DEPLOYMENT.INFERENCE_ENV_PATH doesn't exist.",
            "success": true
        },
        "runtime_slug_exist": {
            "category": "conda_env",
            "description": "If MODEL_DEPLOYMENT.INFERENCE_ENV_TYPE is data_science, check that the slug listed in MODEL_DEPLOYMENT.INFERENCE_ENV_slug exists.",
            "error_msg": "In runtime.yaml the value of the fileld INFERENCE_ENV_slug doesn't exist in the given bucket."
        }
    }
  6. Repeat steps 4 and 5 until there are no errors.

Using ADS for Introspection Testing

You can invoke introspection manually by calling the .introspect() method on the ModelArtifact object.

rf_model_artifact.introspect()
rf_model_artifact.metadata_taxonomy['ArtifactTestResults']

The result of model introspection is automatically saved to the taxonomy metadata and model artifacts. Model introspection is automatically triggered when the .prepare() method is invoked to prepare the model artifact.

The .save() method doesn't perform model introspection because this is normally done during the model artifact preparation stage. However, setting ignore_introspection to False causes model introspection to be performed during the save operation.
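
For example, a minimal sketch; the arguments other than ignore_introspection are placeholders and depend on your ADS version:

# Save the model to the model catalog and run introspection as part of the save.
mc_model = rf_model_artifact.save(
    project_id="<project-OCID>",
    compartment_id="<compartment-OCID>",
    display_name="RF model",
    ignore_introspection=False,
)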