A Data Scientist's Guide to OCI

Oracle Cloud Infrastructure (OCI) provides a family of artificial intelligence (AI) and machine learning (ML) services. This guide gives data scientists an introductory tour of those services using the ML life cycle as its framework.

The Machine Learning Life Cycle

Building a machine learning (ML) model is an iterative process. Many of the necessary steps are repeated and modified until data scientists are satisfied with the model's performance. This process requires extensive data exploration, visualization, and experimentation.

The OCI Data Science service supports data scientists throughout the full ML life cycle, helping them rapidly build, train, deploy, and manage ML models. Data Science users work in a familiar JupyterLab notebook interface, where they write Python code and have access to open source libraries.

Tip

Watch a brief overview video of the Data Science service.
Prepare the Infrastructure and Workspace
Step 1: Set up OCI
Tip

The quickest way to configure your tenancy for data science is to use OCI Resource Manager, which handles your prerequisites with just a few clicks. See Using Resource Manager to Configure Your Tenancy for Data Science.

Before you can get started with data and modeling, you need to ensure that your OCI tenancy is properly configured with the following resources. For a tutorial on setting up a tenancy for Data Science, see Manually Configuring a Data Science Tenancy.

  • Compartments – A logical container for organizing OCI resources. Read more at Learn Best Practices for Setting Up Your Tenancy.
  • User groups – A group of users, including data scientists.
  • Dynamic groups – A special type of group that contains resources (such as data science notebook sessions, job runs, and model deployments) that match rules that you define. These matching rules allow group membership to change dynamically as resources that match those rules are created or deleted. These resources can make API calls to services according to policies written for the dynamic group. For example, using the resource principal of a Data Science notebook session, you could call the Object Storage API to read data from a bucket, as the sketch after this list shows.
  • Policies – Define what principals, such as users and resources, have access to in OCI. Access is granted at the group and compartment level. You can write a policy that gives a group a specific type of access within a specific compartment.
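
For example, a minimal sketch of the dynamic-group scenario above, run inside a notebook session; the namespace, bucket, and object names are placeholders:

    import io

    import oci
    import pandas as pd

    # Authenticate as the notebook session itself (its resource principal),
    # rather than with a user's API key.
    signer = oci.auth.signers.get_resource_principals_signer()
    object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

    # Read an object into a pandas data frame; access is governed by the
    # policies written for the dynamic group.
    resp = object_storage.get_object("<namespace>", "<bucket>", "train.csv")
    df = pd.read_csv(io.BytesIO(resp.data.content))
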
Step 2: Set up the Data Science environment
  1. Create a data science project in your compartment.

    Projects are containers that enable data science teams to organize their work. They represent collaborative workspaces for organizing notebook sessions and models.

    You can also use the create_project method in the Oracle Accelerated Data Science (ADS) SDK (see the sketch after these steps).

  2. Create a notebook session in the project and specify the compartment.

    Notebook sessions are JupyterLab interfaces where you can work in an interactive coding environment to build and train models. Environments come with preinstalled open source libraries and the ability to add others.

    Notebook sessions run in fully managed infrastructure. When you create a notebook session, you can select CPUs or GPUs, the compute shape, and the amount of storage without any manual provisioning. Every time you reactivate a notebook session, you have the opportunity to modify these options. You can also let the service manage networking for your notebook session.

    You can also use the create_notebook_session method in the ADS SDK.

  3. In the notebook session, install or create a Conda environment. Conda is an open source environment and package management system that you can use to quickly install, run, and update packages and their dependencies. You can isolate different software configurations, switch environments, and publish environments to make your research reproducible.
    Tip

    The fastest way to get started in a notebook session is to choose an existing Data Science Conda environment. The OCI Data Science team manages these environments. Each environment focuses either on providing specific tools and frameworks for ML work or on providing a comprehensive environment for solving a business use case. Each Data Science environment comes with its own set of notebook examples, which help you get started with the libraries installed in the environment.
  4. After installing a Conda environment in your notebook session, access your data and start the ML life cycle.
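
As an illustration of steps 1 and 2, the following minimal sketch uses the OCI Python SDK rather than ADS; all OCIDs and display names are placeholders, and the model attribute names should be checked against your SDK version:

    import oci

    config = oci.config.from_file()  # reads ~/.oci/config
    ds_client = oci.data_science.DataScienceClient(config)

    # Step 1: create a project in a compartment.
    project = ds_client.create_project(
        oci.data_science.models.CreateProjectDetails(
            compartment_id="<compartment_ocid>",
            display_name="churn-prediction",
        )
    ).data

    # Step 2: create a notebook session in the project, letting the
    # service manage networking.
    notebook = ds_client.create_notebook_session(
        oci.data_science.models.CreateNotebookSessionDetails(
            compartment_id="<compartment_ocid>",
            project_id=project.id,
            display_name="churn-notebook",
            notebook_session_config_details=oci.data_science.models.NotebookSessionConfigDetails(
                shape="VM.Standard2.1",
                block_storage_size_in_gbs=100,
            ),
        )
    ).data
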
Access and Collect Data

All ML models start with data. Data scientists who use OCI Data Science can access and use data sources in any cloud or on-premises environment, which allows for more data features and better models. See the complete list of the data sources and formats supported by the ADS SDK.

When you use the Data Science service, we recommend storing data in the notebook session for quick access. From your notebook session, you can access data in Object Storage, in local files, and in the other sources supported by the ADS SDK.
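
For example, a sketch of reading a CSV file from Object Storage with pandas, assuming the ocifs integration that ships in the Data Science Conda environments; the URI is a placeholder:

    import pandas as pd

    # pandas resolves oci:// URIs through ocifs; pass an OCI config file (or
    # rely on the notebook session's resource principal) via storage_options.
    df = pd.read_csv(
        "oci://<bucket>@<namespace>/sales.csv",
        storage_options={"config": "~/.oci/config"},
    )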

Security

You can use the OCI Vault service to centrally manage the encryption keys that protect your data and the credentials that you use to securely access resources. You can use the vault.ipynb example notebook to learn how to use vaults with Data Science.

For more information, see the ADS SDK's Vault documentation.
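
For example, a minimal sketch of retrieving a stored credential with the OCI Python SDK; the secret OCID is a placeholder:

    import base64

    import oci

    config = oci.config.from_file()
    secrets_client = oci.secrets.SecretsClient(config)

    # Secret contents are returned base64-encoded.
    bundle = secrets_client.get_secret_bundle(secret_id="<secret_ocid>")
    content = bundle.data.secret_bundle_content.content
    password = base64.b64decode(content).decode("utf-8")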

Prepare and Explore Data

Data can be prepared, transformed, and manipulated with the built-in functions of the ADS SDK. Underlying an ADSDataset object is a pandas data frame. Any operation that can be performed on a pandas data frame can also be applied to an ADS dataset.

Note

All ADS datasets are immutable and any transforms that are applied result in a new dataset.

Prepare

Your data might be incomplete, inconsistent, or contain errors. You can use the ADS SDK to detect and correct these issues while you prepare the data.

Note

You can use feature types to separate how data is represented physically from what the data measures. You can create and assign many feature types to data. Read a blog post that explains how feature types improve your workflow.

Transform

You can use the following methods in the ADS SDK to automatically transform a dataset:

  • suggest_recommendations() displays issues and recommends changes and code to fix the issues
  • auto_transform() returns a transformed dataset with all recommendations and optimizations applied automatically
  • visualize_transforms() visualizes the transformation that has been performed on a dataset
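
A sketch of these three methods, assuming the ADS DatasetFactory API; the file path and target column are placeholders:

    from ads.dataset.factory import DatasetFactory

    ds = DatasetFactory.open("data.csv", target="churn")

    # Show detected issues along with the recommended fixes and code.
    ds.suggest_recommendations()

    # Apply all recommendations and optimizations in one step.
    transformed_ds = ds.auto_transform()

    # Review the transformations that were applied.
    transformed_ds.visualize_transforms()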

OCI Data Science also supports open source data manipulation tools such as pandas, Dask, and NumPy.

Tip

After all data transformations are complete, you can split the data into a train and test set or a train, test, and validation set.
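
For example, a minimal two-way and three-way split using scikit-learn on a synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)

    # Train and test: hold out 20% for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train, validation, and test: split the training portion again.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=42
    )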

Visualize and Explore

Visualization is one of the initial steps used to derive value from data. It lets analysts efficiently gain insights from the data and guides exploratory data analysis. The ADS SDK includes a smart visualization tool that automatically detects the data type and renders plots that optimally represent the characteristics of the data.

You can also use the ADS SDK call() method to plot data using your preferred libraries and packages, such as seaborn, Matplotlib, Plotly, Bokeh, and Geographic Information System (GIS). See the ADS SDK examples.

Train a Model

Modeling builds the best mathematical representation of the relationship among data points. Models are artifacts created by the training process, which captures this relationship or pattern.

After you train the model, you evaluate it and then deploy it.

You can train a model either by using Automated Machine Learning (AutoML) or from an open source library. You can train using the following methods:

  • Notebooks: Write and run Python code by using libraries in the JupyterLab interface
  • Conda environments: Use the ADS SDK, AutoML, or Machine Learning Explainability (MLX) to train
  • Jobs: Run ML or data science tasks outside of your notebook sessions in JupyterLab

AutoML

Building a successful ML model requires many iterations and much experimentation; a model with an optimal set of hyperparameters is rarely achieved in the first iteration. AutoML automates four steps in the ML modeling process:

  1. Algorithm selection identifies the best algorithms for the data and problem and is faster than an exhaustive search.
  2. Adaptive sampling identifies the right sample size and adjusts for unbalanced data.
  3. Feature selection identifies the optimal feature subset and reduces the number of features.
  4. Model tuning automatically tunes hyperparameters for the best model accuracy.

For more information, see the ADS SDK AutoML pipeline documentation.
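
A sketch of an AutoML run, assuming the ads.automl pipeline API; the file path and target column are placeholders:

    from ads.automl.driver import AutoML
    from ads.automl.provider import OracleAutoMLProvider
    from ads.dataset.factory import DatasetFactory

    ds = DatasetFactory.open("data.csv", target="churn")
    train, test = ds.train_test_split(test_size=0.15)

    # Runs the four steps above and returns the tuned model along with a
    # baseline model for comparison.
    oracle_automl = AutoML(train, provider=OracleAutoMLProvider())
    model, baseline = oracle_automl.train()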

Evaluate and Validate a Model

After training a model, you can see how it performs against a series of benchmarks. Use evaluation functions to convert the output of your test data into an interpretable, standardized series of scores and charts.

Automated Evaluation Using the ADS SDK

Automated evaluation generates a comprehensive suite of metrics and visualizations to measure model performance against new data and to compare model candidates. ADS offers a collection of tools, metrics, and charts for this comparison. The evaluators are as follows:

  • Binary classifier is used for models in which the output is binary, for example, Yes or No, Up or Down, 1 or 0. These models are a special case of multiclass classification, so they have metrics catered specifically to the binary case.
  • Multiclass classifier is used for models in which the output is discrete. These models have a specialized set of charts and metrics for their evaluation.
  • Regression is used for models in which the output is continuous, for example, price, height, sales, or length. These models have their own specific metrics that help to benchmark the model.
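
A sketch of automated evaluation, assuming the ADS evaluator API; train, test, model, and baseline are assumed to come from a training run such as the AutoML sketch above:

    from ads.evaluations.evaluator import ADSEvaluator

    evaluator = ADSEvaluator(test, models=[model, baseline], training_data=train)

    # Render the standardized scores and charts in the notebook.
    evaluator.show_in_notebook()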

Validation, Explanations, and Interpretation

Machine learning explainability (MLX) is the process of explaining and interpreting ML and deep learning models. Explainability is the ability to explain the reasons behind a model's prediction. Interpretability is the level at which a human can understand that explanation. MLX can help you to perform the following tasks:

  • Better understand and interpret the model's behavior
  • Debug and improve the quality of the model
  • Increase trust in the model and confidence in deploying the model

Read more about model explainability to familiarize yourself with global explainers, local explainers, and WhatIf explainers.
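
A sketch of computing a global explanation, assuming the MLX classes documented for earlier ADS releases; test and model are assumed to come from the earlier training and evaluation sketches:

    from ads.explanations.explainer import ADSExplainer
    from ads.explanations.mlx_global_explainer import MLXGlobalExplainer

    # Build a global explainer for the trained model.
    explainer = ADSExplainer(test, model)
    global_explanation = explainer.global_explanation(provider=MLXGlobalExplainer())

    # Rank features by their overall effect on the model's predictions.
    importances = global_explanation.compute_feature_importance()
    importances.show_in_notebook()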

Deploy a Model

After the model training and evaluation processes are complete, the best candidate models are saved so they can be deployed. Read about model deployments and their key components.

Tip

The ADS SDK has a set of classes that push a model to production in a few steps. For more information, see Model Serialization.
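
For example, a minimal sketch using the ADS model serialization classes with a scikit-learn estimator; the Conda environment slug is a placeholder:

    import tempfile

    from ads.model.framework.sklearn_model import SklearnModel
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Train a quick example model.
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = RandomForestClassifier().fit(X_train, y_train)

    sklearn_model = SklearnModel(estimator=clf, artifact_dir=tempfile.mkdtemp())

    # Generate score.py and runtime.yaml in the artifact directory.
    sklearn_model.prepare(
        inference_conda_env="<conda_environment_slug>",
        X_sample=X_test,
        y_sample=y_test,
    )

    # Push the artifact to the model catalog; returns the model OCID.
    model_id = sklearn_model.save(display_name="churn-model")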

Introduction to the Model Catalog

Before you can deploy a model, you need to save the model in the model catalog. The model catalog is a centralized and managed repository of model artifacts. Models stored in the catalog can be shared across members of a team and they can be loaded back into a notebook session. A model artifact is an archive file that contains the following files and data:

  • score.py: A Python script that contains your custom logic for loading serialized model objects into memory and that defines an inference endpoint (predict()); a minimal sketch of this file follows the list
  • runtime.yaml: The runtime environment of the model, which provides the necessary Conda environment reference for model deployment purposes
  • Any additional files that are necessary to run your model in your artifact
    Important

    Any code used for inference must be archived at the same level as score.py or at a lower level. If any required files are present in folder levels higher than the score.py file, they are ignored, which could result in deployment failure.
  • Metadata about the provenance of the model, including any Git-related information
  • The script or notebook used to push the model to the catalog
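
A minimal sketch of the score.py described above, assuming a pickled scikit-learn model in the artifact:

    # score.py
    import os
    import pickle

    MODEL_FILE = "model.pkl"

    def load_model():
        """Deserialize the model artifact into memory (called once)."""
        model_dir = os.path.dirname(os.path.realpath(__file__))
        with open(os.path.join(model_dir, MODEL_FILE), "rb") as f:
            return pickle.load(f)

    def predict(data, model=load_model()):
        """Inference endpoint: return predictions for the request payload."""
        # `data` is the JSON body of the /predict request; adapt the
        # conversion to match the input schema of your model.
        return {"prediction": model.predict(data).tolist()}
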
Tip

Various model catalog examples and templates, including the score.py files, are provided in the GitHub repo.

Prepare Model Metadata and Documentation

Model metadata is optional but recommended. See Preparing Model Metadata and Working with Metadata. Metadata includes the following information:

  • Model input and output schemas: A description of the features that are necessary to make a successful model prediction
  • Provenance: Documentation that helps you improve the model's reproducibility and auditability
  • Taxonomy: A description of the model that you're saving to the model catalog
  • Model introspection tests: A series of tests and checks run on a model artifact to test all aspects of the operational health of the model
Tip

The ADS SDK automatically populates the provenance and taxonomy when you save a model with ADS.

Save the Model to the Catalog

You can save a model to the catalog by using the ADS SDK, the OCI Python SDK, or the Console. For details, see Saving Models to the Model Catalog.

Note

Model artifacts stored in the catalog are immutable by design to prevent unwanted changes and ensure that any model in production can be tracked to the exact artifact used. You can't change a saved model.

Deploy the Model

The most common way to deploy models to production is as HTTP endpoints to serve predictions in real time. The Data Science service manages model deployments as resources and handles all infrastructure operations, including compute provisioning and load balancing. You can deploy a model by using the ADS SDK or the Console.
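
For example, a sketch assuming the ADS deploy() method on the serialized model from the Model Serialization sketch earlier; the shape is a placeholder:

    # Provisions managed compute and a load balancer behind an HTTPS endpoint.
    deployment = sklearn_model.deploy(
        display_name="churn-model-deployment",
        deployment_instance_shape="VM.Standard2.1",
        deployment_instance_count=1,
    )
    print(deployment.url)  # the endpoint that serves predictions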

Tip

You can also deploy models as a function. Functions are highly scalable, on-demand, serverless architectures in OCI. For details, see this blog post.

Invoke the Model

After a model is deployed and active, its endpoint can successfully receive requests made by clients. Invoking a model deployment means that you can pass feature vectors or data samples to the predict endpoint. The model then returns predictions for those data samples. For more information, see Invoking a Model Deployment and then read about editing, deactivating, and otherwise managing a deployed model.
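
For example, a sketch of calling the predict endpoint directly with a signed HTTP request; the deployment URI is a placeholder, and the payload shape depends on your score.py:

    import oci
    import requests

    config = oci.config.from_file()
    signer = oci.signer.Signer(
        tenancy=config["tenancy"],
        user=config["user"],
        fingerprint=config["fingerprint"],
        private_key_file_location=config["key_file"],
    )

    uri = "https://modeldeployment.<region>.oci.customer-oci.com/<deployment_ocid>/predict"
    payload = {"data": [[5.1, 3.5, 1.4, 0.2]]}  # one feature vector

    response = requests.post(uri, json=payload, auth=signer)
    print(response.json())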

Manage the ML Life Cycle (MLOps)

MLOps is the standardization, streamlining, and automation of ML life cycle management. ML assets are treated like other software assets within an iterative continuous-integration (CI) and continuous-deployment (CD) environment.

In DevOps, CI refers to the validation and integration of updated code into a central repository, and CD refers to the redeployment of those changes into production. In MLOps, CI refers to the validation and integration of new data and ML models, and CD refers to releasing that model into production.

Continuous training is unique to MLOps and refers to the automatic retraining of ML models for redeployment. If a model isn't updated, its predictions grow less and less accurate over time; automation lets you retrain the model on new data as quickly as possible.

Jobs

Data Science jobs enable you to define and run a repeatable ML task on a fully managed infrastructure. Using jobs, you can perform the following tasks:

  • Run ML or data science tasks outside of a notebook session
  • Operationalize discrete data science and ML tasks as reusable, runnable operations
  • Automate MLOps or the CI/CD pipeline
  • Run batch jobs or workloads triggered by events or actions
  • Perform batch, mini batch, or distributed batch job inference
  • In a JupyterLab notebook session, create long-running tasks or computation-intensive tasks in a Data Science job to keep the notebook free for you to continue your work
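
A sketch of defining and running a job, assuming the ads.jobs API; the OCIDs, shape, Conda slug, and script name are placeholders:

    from ads.jobs import DataScienceJob, Job, ScriptRuntime

    job = (
        Job(name="nightly-retraining")
        .with_infrastructure(
            DataScienceJob()
            .with_compartment_id("<compartment_ocid>")
            .with_project_id("<project_ocid>")
            .with_shape_name("VM.Standard2.1")
            .with_block_storage_size(50)
        )
        .with_runtime(
            ScriptRuntime()
            .with_source("retrain.py")
            .with_service_conda("<conda_environment_slug>")
        )
    )

    job.create()     # register the job with the Data Science service
    run = job.run()  # start a run on fully managed infrastructure
    run.watch()      # stream the run's logs into the notebook
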
Note

Try out a tutorial about scheduling job runs.

Monitoring

Monitoring and logging are the last steps in the job life cycle. They provide insights into a job's performance and metrics, in addition to a record that you can refer to later for each job run. For more information about monitoring, alarms, and metrics, see About notebook session metrics.

  • Monitoring consists of metrics and alarms, and it enables you to check the health, capacity, and performance of cloud resources. You can then use this data to determine when to create more instances to manage increased load, troubleshoot issues with an instance, or better understand system behavior.
  • Alarms are triggered when a metric breaches a set threshold.
  • Metrics track CPU or GPU utilization, the percentage of available job run container memory usage, container network traffic, and container disk utilization. When these numbers reach a certain threshold, you can scale up resources, such as block storage and compute shape, to accommodate the workload.
  • Events let you subscribe to changes in resources, such as job and job run events, and respond to them by using functions, notifications, or streams. See Creating Automation Using Events.

Logging

You can use service logs or custom logs with job runs. A job run emits service logs to the OCI Logging service. With custom logs, you can specify which log events are collected in a particular context and the location where the logs are stored. You can use the Logging service to enable, manage, and browse job run logs for your jobs. For complete information, see Logging and About Logs.

Note

Integrating job resources with the Logging service is optional but recommended, both for debugging potential issues and for monitoring the progress of your job artifacts as they run.
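
For example, extending the jobs sketch above, a sketch of attaching logs to a job's infrastructure, assuming the ads.jobs API; the OCIDs are placeholders:

    from ads.jobs import DataScienceJob

    # With a log group (and optionally a specific log) attached, each job
    # run emits its stdout and stderr to the Logging service.
    infrastructure = (
        DataScienceJob()
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_shape_name("VM.Standard2.1")
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
    )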

Full List of ML and AI Services

Although this guide focuses on the OCI Data Science service, you can use other ML and AI services with Data Science as a way to consume the services or as part of broader ML projects.

OCI's ML Services

OCI machine learning services are used primarily by data scientists to build, train, deploy, and manage machine learning models. Data Science provides curated environments so data scientists can access the open source tools they need to solve business problems faster.

  • Data Science makes it possible to build, train, and manage ML models using open source Python, with added capabilities for automated ML (AutoML), model evaluation, and model explanation.
  • Data Labeling provides labeled datasets to more accurately train AI and ML models. Users can assemble data, create and browse datasets, and apply labels to data records through user interfaces and public APIs. The labeled datasets can be exported and used for model development. When you build ML models that work on images, text, or speech, you need labeled data that can be used to train the models.
  • Data Flow provides a scalable environment for developers and data scientists to run Apache Spark applications in batch execution, at scale. You can run applications written in any Spark language to perform various data preparation tasks.
  • Machine Learning in Oracle Database supports data exploration and preparation as well as building and deploying ML models using SQL, R, Python, REST, AutoML, and no-code interfaces. It includes more than 30 in-database algorithms that produce models in Oracle Database for immediate use in applications. Build models quickly by simplifying and automating key elements of the ML process.
OCI's AI Services

OCI's AI services contain prebuilt ML models for specific uses. Some of the AI services are pretrained, and some you can train with your own data. To use them, you simply call the API for the service and pass in data to be processed; the service returns a result. There's no infrastructure to manage.

  • Digital Assistant offers prebuilt skills and templates to create conversational experiences for business applications and customers through text, chat, and voice interfaces.
  • Language makes it possible to perform sophisticated text analysis at scale. The Language service includes pretrained models for sentiment analysis, key phrase extraction, text classification, named entity recognition, and more.
  • Speech uses automatic speech recognition (ASR) to convert speech to text. Built on the same AI models used for Digital Assistant, the service gives developers time-tested acoustic and language models that deliver highly accurate transcription for audio or video files across many languages.
  • Vision applies computer vision to analyze image-based content. Developers can easily integrate pretrained models into their applications with APIs or custom-train models to meet their specific use cases. These models can be used to detect visual anomalies in manufacturing, extract text from documents to automate business workflows, and tag items in images to count products or shipments.
  • Anomaly Detection enables developers to more easily build business-specific anomaly detection models that flag critical incidents, resulting in faster time to detection and resolution. Specialized APIs and automated model selection simplify training and deploying anomaly detection models to applications and operations.