A Data Scientist's Guide to OCI
Oracle Cloud Infrastructure (OCI) provides a family of artificial intelligence (AI) and machine learning (ML) services. This guide gives data scientists an introductory tour of those services using the ML life cycle as its framework.
The Machine Learning Life Cycle
Building a machine learning (ML) model is an iterative process. Many of the necessary steps are repeated and modified until data scientists are satisfied with the model's performance. This process requires much data exploration, visualization, and experimentation.
The OCI Data Science service supports data scientists throughout the full ML life cycle, helping them rapidly build, train, deploy, and manage ML models. Data Science users work in a familiar JupyterLab notebook interface, where they write Python code and have access to open source libraries.
As a prerequisite to using Data Science for the ML life cycle, you need to prepare your OCI environment and workspace. Then you perform the following tasks:
You can also explore all of our ML and AI offerings and visit additional resources.
The quickest way to configure your tenancy for data science is to use OCI Resource Manager, which handles your prerequisites with just a few clicks. See Using Resource Manager to Configure Your Tenancy for Data Science.
Before you can get started with data and modeling, you need to ensure that your OCI tenancy is properly configured with the following resources. For a tutorial on setting up a tenancy for Data Science, see Manually Configuring a Data Science Tenancy.
- Compartments – A logical container for organizing OCI resources. Read more at Learn Best Practices for Setting Up Your Tenancy.
- User groups – A group of users, including data scientists.
- Dynamic groups – A special type of group that contains resources (such as data science notebook sessions, job runs, and model deployments) that match rules that you define. These matching rules allow group membership to change dynamically as resources that match those rules are created or deleted. These resources can make API calls to services according to policies written for the dynamic group. For example, using the resource principal of a Data Science notebook session, you could call the Object Storage API to read data from a bucket.
- Policies – Define what principals, such as users and resources, have access to in OCI. Access is granted at the group and compartment level. You can write a policy that gives a group a specific type of access within a specific compartment.
- Create a data science project in your compartment.
Projects are containers that enable data science teams to organize their work. They represent collaborative workspaces for organizing notebook sessions and models.
You can also use the `create_project` method in the Oracle Accelerated Data Science (ADS) SDK (see the sketch after this list).
- Create a notebook session in the project and specify the compartment.
Notebook sessions are JupyterLab interfaces where you can work in an interactive coding environment to build and train models. Environments come with preinstalled open source libraries and the ability to add others.
Notebook sessions run in fully managed infrastructure. When you create a notebook session, you can select CPUs or GPUs, the compute shape, and the amount of storage without any manual provisioning. Every time you reactivate a notebook session, you have the opportunity to modify these options. You can also let the service manage networking for your notebook session.
You can also use the `create_notebook_session` method in the ADS SDK.
- In the notebook session, install or create a Conda environment. Conda is an open source environment and package management system that you can use to quickly install, run, and update packages and their dependencies. You can isolate different software configurations, switch environments, and publish environments to make your research reproducible.
Tip: The fastest way to get started in a notebook session is to choose an existing Data Science Conda environment. The OCI Data Science team manages these environments. Each environment focuses on providing specific tools and a framework for ML work, or on providing a comprehensive environment to solve business use cases. Each Data Science environment comes with its own set of notebook examples, which help you get started with the libraries installed in the environment.
- After installing a Conda environment in your notebook session, access your data and start the ML life cycle.
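Here is a minimal sketch of creating a project and a notebook session with the OCI Python SDK, assuming a valid ~/.oci/config file; all OCIDs, display names, and the compute shape are placeholders.

```python
# Minimal sketch using the OCI Python SDK; the ADS SDK offers equivalent
# convenience methods. All OCIDs and names below are placeholders.
import oci

config = oci.config.from_file()  # reads the DEFAULT profile from ~/.oci/config
ds_client = oci.data_science.DataScienceClient(config)

# Create a project to organize notebook sessions and models.
project = ds_client.create_project(
    oci.data_science.models.CreateProjectDetails(
        compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
        display_name="churn-prediction",
    )
).data

# Create a notebook session in that project; shape and storage are illustrative.
notebook = ds_client.create_notebook_session(
    oci.data_science.models.CreateNotebookSessionDetails(
        compartment_id=project.compartment_id,
        project_id=project.id,
        display_name="exploration",
        notebook_session_configuration_details=(
            oci.data_science.models.NotebookSessionConfigurationDetails(
                shape="VM.Standard2.4",
                block_storage_size_in_gbs=100,
                subnet_id="ocid1.subnet.oc1..example",  # omit for service-managed networking
            )
        ),
    )
).data
print(notebook.lifecycle_state)
```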
All ML models start with data. Data scientists who use OCI Data Science can access and use data sources in any cloud or on-premises environment, which allows for more data features and better models. See the complete list of the data sources and formats supported by the ADS SDK.
When you use the Data Science service, we recommend storing data in the notebook session for quick access. From your notebook session, you can access data from the following sources:
- Object Storage: To retrieve data, you must first set up a connection to Object Storage. After this setup, you can use the OCI Python SDK in a notebook session to retrieve the data, or use the ADS SDK to pull data from Object Storage (see the sketch after this list).
- Local storage: To load a dataframe from a local source using the ADS SDK, use functions from pandas directly.
- HTTP and HTTPS endpoints: To load a dataframe from a remote web server source, use pandas directly.
- Databases: You can connect to Autonomous Data Warehouse from a notebook session. The `autonomous_database.ipynb` example notebook interactively illustrates this type of connection.
- Streaming data sources: The `kafka-python` client library is available in notebook sessions. The Python client library for the Apache Kafka distributed stream processing system lets data scientists connect to the Streaming service using its Kafka-compatible API. We provide the `streaming.ipynb` notebook example in the notebook session environment.
- Reference libraries: To open a dataset from reference libraries, use `DatasetBrowser`. To see supported libraries, use `DatasetBrowser.list()`.
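As a hedged example of the Object Storage route, the following sketch pulls a CSV object into pandas with the OCI Python SDK; the bucket and object names are placeholders.

```python
# Minimal sketch of reading a CSV from Object Storage into a pandas dataframe.
import io

import oci
import pandas as pd

config = oci.config.from_file()
os_client = oci.object_storage.ObjectStorageClient(config)

namespace = os_client.get_namespace().data  # the tenancy's Object Storage namespace
obj = os_client.get_object(namespace, "my-bucket", "training/data.csv")  # placeholders

df = pd.read_csv(io.BytesIO(obj.data.content))
print(df.shape)
```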
Security
You can use the OCI Vault service to centrally manage the encryption keys that protect your data and the credentials that you use to securely access resources. You can use the `vault.ipynb` example notebook to learn how to use vaults with Data Science.
For more information, see the ADS SDK's Vault documentation.
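As a hedged illustration, the following sketch reads a secret (such as a database password) with the OCI Python SDK; the secret OCID is a placeholder.

```python
# Minimal sketch of retrieving a secret from an OCI vault.
import base64

import oci

config = oci.config.from_file()
secrets_client = oci.secrets.SecretsClient(config)

bundle = secrets_client.get_secret_bundle(
    secret_id="ocid1.vaultsecret.oc1..example"  # placeholder OCID
).data

# Secret content is returned base64-encoded.
password = base64.b64decode(bundle.secret_bundle_content.content).decode("utf-8")
```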
Data can be prepared, transformed, and manipulated with the built-in functions of the ADS SDK. Underlying an `ADSDataset` object is a pandas data frame. Any operation that can be performed on a pandas data frame can also be applied to an ADS dataset. All ADS datasets are immutable; any transform that is applied results in a new dataset.
Prepare
Your data might be incomplete, inconsistent, or contain errors. You can use the ADS SDK to perform the following tasks:
- Combine and clean data by using row and column operations.
- Impute data by finding and replacing null values.
- Encode categories.
You can use feature types to separate how data is represented physically from what the data measures. You can create and assign many feature types to data. Read a blog post that explains how feature types improve your workflow.
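The same preparation tasks can also be expressed directly in pandas. The following is a minimal sketch; the file and column names are illustrative.

```python
# Minimal pandas sketch of the preparation tasks above: cleaning,
# imputing nulls, and encoding categories.
import pandas as pd

df = pd.read_csv("customers.csv")  # placeholder dataset

# Combine and clean: drop duplicate rows and an unused column.
df = df.drop_duplicates().drop(columns=["internal_id"])

# Impute: replace null ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Encode: one-hot encode a categorical column.
df = pd.get_dummies(df, columns=["plan_type"])
```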
Transform
You can use the following methods in the ADS SDK to automatically transform a dataset:
- `suggest_recommendations()` displays issues and recommends changes and code to fix the issues.
- `auto_transform()` returns a transformed dataset with all recommendations and optimizations applied automatically.
- `visualize_transforms()` visualizes the transformations that have been performed on a dataset.
OCI Data Science also supports open source data manipulation tools such as pandas, Dask, and NumPy.
After all data transformations are complete, you can split the data into a train and test set or a train, test, and validation set.
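A minimal sketch of such a split using scikit-learn (the ADS SDK also provides its own split helpers); the file name and proportions are illustrative.

```python
# Minimal sketch of an 80/10/10 train/test/validation split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")  # placeholder dataset

train, holdout = train_test_split(df, test_size=0.2, random_state=42)
test, validation = train_test_split(holdout, test_size=0.5, random_state=42)
```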
Visualize and Explore
Visualization is one of the initial steps used to derive value from data. It lets analysts efficiently gain insights from the data and guides exploratory data analysis. The ADS SDK includes a smart visualization tool that automatically detects the data type and renders plots that optimally represent the characteristics of the data. The following are some automatic visualization methods:
- `show_in_notebook()` provides a comprehensive preview of a dataset's basic information.
- `show_corr()` includes the correlation ratio, the Pearson method, and the Cramér V method.
- `plot()` uses automatic plotting to explore the relationship between two columns.
- `feature_plot()` uses custom plotting and visualizations based on feature types.
You can also use the ADS SDK `call()` method to plot data using your preferred libraries and packages, such as seaborn, Matplotlib, Plotly, Bokeh, and Geographic Information System (GIS) tools. See the ADS SDK examples.
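For example, here is a minimal sketch of exploring the relationship between two columns with seaborn and Matplotlib along the preferred-library route described above; the dataset and column names are illustrative.

```python
# Minimal sketch of a bivariate exploration plot.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("customers.csv")  # placeholder dataset

sns.scatterplot(data=df, x="age", y="monthly_spend", hue="plan_type")
plt.title("Spend by age and plan type")
plt.show()
```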
Modeling builds the best mathematical representation of the relationship among data points. Models are artifacts, created by the training process, that capture this relationship or pattern.
After you train the model, you evaluate it and then deploy it.
You can train a model either by using Automated Machine Learning (AutoML) or by using an open source library. You can train using the following methods:
- Notebooks: Write and run Python code by using libraries in the JupyterLab interface
- Conda environments: Use the ADS SDK, AutoML, or Machine Learning Explainability (MLX) to train
- Jobs: Run ML or data science tasks outside of your notebook sessions in JupyterLab
AutoML
Building a successful ML model requires many iterations and much experimentation; a model rarely achieves optimal performance with its first set of hyperparameters. AutoML automates four steps in the ML modeling process:
- Algorithm selection identifies the best algorithms for the data and problem and is faster than an exhaustive search.
- Adaptive sampling identifies the right sample size and adjusts for unbalanced data.
- Feature selection identifies the optimal feature subset and reduces the number of features.
- Model tuning automatically tunes hyperparameters for the best model accuracy.
For more information, see the ADS SDK AutoML pipeline documentation.
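The following sketch is based on the ADS AutoML pipeline documentation; the exact module paths and availability depend on your ADS version and Conda environment, and the dataset path and target column are placeholders.

```python
# Hedged sketch of the ADS AutoML pipeline; APIs vary across ADS versions.
from ads.automl.driver import AutoML
from ads.automl.provider import OracleAutoMLProvider
from ads.dataset.factory import DatasetFactory

# Open a dataset and declare the prediction target (placeholders).
ds = DatasetFactory.open("customers.csv", target="churn")
train, test = ds.train_test_split(test_size=0.2)

# Run algorithm selection, adaptive sampling, feature selection, and tuning.
automl = AutoML(train, provider=OracleAutoMLProvider())
model, baseline = automl.train()
```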
After training a model, you can see how it performs against a series of benchmarks. Use evaluation functions to convert the output of your test data into an interpretable, standardized series of scores and charts.
Automated Evaluation Using the ADS SDK
Automated evaluation generates a comprehensive suite of metrics and visualizations to measure model performance against new data and to compare model candidates. ADS offers a collection of tools, metrics, and charts for comparing several models against one another (see the sketch after this list). The evaluators are as follows:
- Binary classifier is used for models in which the output is binary, for example, Yes or No, Up or Down, 1 or 0. These models are a special case of multiclass classification, so they have specifically catered metrics.
- Multiclass classifier is used for models in which the output is discrete. These models have a specialized set of charts and metrics for their evaluation.
- Regression is used for models in which the output is continuous, for example, price, height, sales, or length. These models have their own specific metrics that help to benchmark the model.
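Continuing the hedged AutoML sketch above (it assumes the `train`, `test`, `model`, and `baseline` objects created there), an ADS evaluator can score the candidates side by side; the class path follows the ADS evaluation docs but may vary by version.

```python
# Hedged sketch of automated evaluation with ADS; assumes objects from the
# AutoML sketch above.
from ads.evaluations.evaluator import ADSEvaluator

evaluator = ADSEvaluator(test, models=[model, baseline], training_data=train)

# Render the standardized scores and charts for each model candidate.
evaluator.show_in_notebook()
print(evaluator.metrics)
```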
Validation, Explanations, and Interpretation
Machine learning explainability (MLX) is the process of explaining and interpreting ML and deep learning models. Explainability is the ability to explain the reasons behind a model's prediction. Interpretability is the level at which a human can understand that explanation. MLX can help you to perform the following tasks:
- Better understand and interpret the model's behavior
- Debug and improve the quality of the model
- Increase trust in the model and confidence in deploying the model
Read more about model explainability to familiarize yourself with global explainers, local explainers, and WhatIf explainers.
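The MLX APIs themselves vary by ADS version, so as an analogous open source illustration, the following sketch computes a global explanation with scikit-learn's permutation importance, which ranks features by how much shuffling each one degrades model accuracy.

```python
# Illustrative global explanation via permutation importance (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature and measure the drop in held-out accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```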
After the model training and evaluation processes are complete, the best candidate models are saved so they can be deployed. Read about model deployments and their key components.
The ADS SDK has a set of classes that push a model to production in a few steps. For more information, see Model Serialization.
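The following hedged sketch follows the class and method names in the ADS Model Serialization documentation; the Conda environment slug and authentication mode are assumptions that depend on your tenancy and environment.

```python
# Hedged sketch of pushing a scikit-learn model to production with ADS.
import tempfile

import ads
from ads.model.framework.sklearn_model import SklearnModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

ads.set_auth("resource_principal")  # assumed auth mode inside a notebook session

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000).fit(X, y)

artifact_dir = tempfile.mkdtemp()
sklearn_model = SklearnModel(estimator=estimator, artifact_dir=artifact_dir)

# prepare() generates score.py and runtime.yaml in the artifact directory.
sklearn_model.prepare(
    inference_conda_env="generalml_p38_cpu_v1",  # placeholder conda slug
    X_sample=X[:5],
    y_sample=y[:5],
)
sklearn_model.save(display_name="iris-model")       # push to the model catalog
sklearn_model.deploy(display_name="iris-model-v1")  # create an HTTP endpoint
print(sklearn_model.predict(X[:2]))
```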
Introduction to the Model Catalog
Before you can deploy a model, you need to save the model in the model catalog. The model catalog is a centralized and managed repository of model artifacts. Models stored in the catalog can be shared across members of a team and they can be loaded back into a notebook session. A model artifact is an archive file that contains the following files and data:
- `score.py`: A Python script that contains your custom logic for loading serialized model objects into memory and that defines an inference endpoint, `predict()`. (A minimal sketch follows this list.)
- `runtime.yaml`: The runtime environment of the model, which provides the necessary Conda environment reference for model deployment purposes.
- Any additional files that are necessary to run your model in your artifact.
Important: Any code used for inference must be archived at the same level as `score.py` or at a lower level. If any required files are present in folder levels higher than the `score.py` file, they are ignored, which could result in deployment failure.
- Metadata about the provenance of the model, including any Git-related information.
- The script or notebook used to push the model to the catalog.
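Here is a minimal sketch of what a `score.py` might contain, assuming a scikit-learn model serialized with joblib as `model.joblib`; the templates that ADS generates differ in detail.

```python
# Minimal score.py sketch: load a serialized model and serve predictions.
import os

import joblib

model = None

def load_model():
    """Load the serialized model from the artifact directory into memory."""
    global model
    if model is None:
        model_dir = os.path.dirname(os.path.realpath(__file__))
        model = joblib.load(os.path.join(model_dir, "model.joblib"))  # assumed file name
    return model

def predict(data, model=None):
    """Inference endpoint: return predictions for the incoming data samples."""
    model = model or load_model()
    return {"prediction": model.predict(data).tolist()}
```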
Various model catalog examples and templates, including `score.py` files, are provided in the GitHub repo.
Prepare Model Metadata and Documentation
Model metadata is optional but recommended. See Preparing Model Metadata and Working with Metadata. Metadata includes the following information:
- Model input and output schemas: A description of the features that are necessary to make a successful model prediction
- Provenance: Documentation that helps you improve the model's reproducibility and auditability
- Taxonomy: A description of the model that you're saving to the model catalog
- Model introspection tests: A series of tests and checks run on a model artifact to test all aspects of the operational health of the model
The ADS SDK automatically populates the provenance and taxonomy when you save a model with ADS.
Save the Model to the Catalog
You can save a model to the catalog by using the ADS SDK, the OCI Python SDK, or the Console. For details, see Saving Models to the Model Catalog.
Model artifacts stored in the catalog are immutable by design to prevent unwanted changes and ensure that any model in production can be tracked to the exact artifact used. You can't change a saved model.
Deploy the Model
The most common way to deploy models to production is as HTTP endpoints to serve predictions in real time. The Data Science service manages model deployments as resources and handles all infrastructure operations, including compute provisioning and load balancing. You can deploy a model by using the ADS SDK or the Console.
You can also deploy a model as a function. Functions are highly scalable, on-demand, serverless architectures in OCI. For details, see this blog post.
Invoke the Model
After a model is deployed and active, its endpoint can successfully receive requests made by clients. Invoking a model deployment means that you can pass feature vectors or data samples to the predict endpoint. The model then returns predictions for those data samples. For more information, see Invoking a Model Deployment and then read about editing, deactivating, and otherwise managing a deployed model.
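A minimal sketch of calling the predict endpoint with signed requests, assuming API-key authentication from ~/.oci/config; the endpoint URI and payload are placeholders.

```python
# Minimal sketch of invoking a model deployment's predict endpoint.
import oci
import requests

config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
)

endpoint = (
    "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/"
    "ocid1.datasciencemodeldeployment.oc1..example/predict"  # placeholder URI
)

# Pass a feature vector (placeholder values) and read back the prediction.
response = requests.post(endpoint, json=[[5.1, 3.5, 1.4, 0.2]], auth=auth)
print(response.json())
```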
MLOps is the standardization, streamlining, and automation of ML life cycle management. ML assets are treated like other software assets within an iterative continuous integration (CI) and continuous deployment (CD) environment.
In DevOps, CI refers to the validation and integration of updated code into a central repository, and CD refers to the redeployment of those changes into production. In MLOps, CI refers to the validation and integration of new data and ML models, and CD refers to releasing that model into production.
Continuous training is unique to MLOps and refers to the automatic retraining of ML models for redeployment. If a model isn't updated, its predictions become increasingly less accurate, so you can use automation to retrain the model on new data as quickly as possible.
Jobs
Data Science jobs enable you to define and run a repeatable ML task on fully managed infrastructure. Using jobs, you can perform the following tasks (see the sketch after this list):
- Run ML or data science tasks outside of a notebook session
- Operationalize discrete data science and ML tasks as reusable, runnable operations
- Automate MLOps or the CI/CD pipeline
- Run batch jobs or workloads triggered by events or actions
- Perform batch, mini batch, or distributed batch job inference
- In a JupyterLab notebook session, create long-running tasks or computation-intensive tasks in a Data Science job to keep the notebook free for you to continue your work
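The following hedged sketch defines and runs a job with the ADS SDK's jobs API; the OCIDs, shape, script name, and Conda slug are placeholders, and builder method names may vary by ADS version.

```python
# Hedged sketch of defining and running a Data Science job with ADS.
from ads.jobs import DataScienceJob, Job, ScriptRuntime

job = (
    Job(name="nightly-retraining")
    .with_infrastructure(
        DataScienceJob()
        .with_compartment_id("ocid1.compartment.oc1..example")      # placeholder
        .with_project_id("ocid1.datascienceproject.oc1..example")   # placeholder
        .with_shape_name("VM.Standard2.4")
    )
    .with_runtime(
        ScriptRuntime()
        .with_source("train.py")                     # placeholder training script
        .with_service_conda("generalml_p38_cpu_v1")  # placeholder conda slug
    )
)

job.create()     # register the job definition
run = job.run()  # start a job run on managed infrastructure
run.watch()      # stream logs until the run completes
```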
Monitoring
Monitoring and logging are the last steps in the job life cycle. They provide insights into a job's performance and metrics, in addition to a record that you can refer to later for each job run. For more information about monitoring, alarms, and metrics, see About notebook session metrics.
- Monitoring consists of metrics and alarms, and it enables you to check the health, capacity, and performance of cloud resources. You can then use this data to determine when to create more instances to manage increased load, troubleshoot issues with an instance, or better understand system behavior.
- Alarms get triggered when a metric breaches set thresholds.
- Metrics track CPU or GPU utilization, the percentage of available job run container memory used, container network traffic, and container disk utilization. When these numbers reach a certain threshold, you can scale up resources, such as block storage and compute shape, to accommodate the workload (see the sketch after this list).
- Events let you subscribe to changes in resources, such as job and job run events, and respond to them by using functions, notifications, or streams. See Creating Automation Using Events.
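As a hedged illustration, the following sketch queries a job run metric through the OCI Monitoring API; the metric namespace and query are assumptions (check the metrics reference for your service), and the compartment OCID is a placeholder.

```python
# Hedged sketch of querying job run metrics with the OCI Monitoring API.
from datetime import datetime, timedelta

import oci

config = oci.config.from_file()
monitoring = oci.monitoring.MonitoringClient(config)

response = monitoring.summarize_metrics_data(
    compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
    summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace="oci_datascience_jobruns",  # assumed metric namespace
        query="CpuUtilization[1m].mean()",    # assumed metric name
        start_time=datetime.utcnow() - timedelta(hours=1),
        end_time=datetime.utcnow(),
    ),
)
for metric in response.data:
    print(metric.name, metric.aggregated_datapoints[-1].value)
```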
Logging
You can use service logs or custom logs with job runs. A job run emits service logs to the OCI Logging service. With custom logs, you can specify which log events are collected in a particular context and the location where the logs are stored. You can use the Logging service to enable, manage, and browse job run logs for your jobs. For complete information, see Logging and About Logs.
Integrating jobs resources with the Logging service is optional but recommended, both for debugging potential issues and for monitoring the progress of your job artifacts.
Full List of ML and AI Services
Although this guide focuses on the OCI Data Science service, you can use other ML and AI services with Data Science as a way to consume the services or as part of broader ML projects.
OCI machine learning services are used primarily by data scientists to build, train, deploy, and manage machine learning models. Data Science provides curated environments so data scientists can access the open source tools they need to solve business problems faster.
- Data Science makes it possible to build, train, and manage ML models using open source Python, with added capabilities for automated ML (AutoML), model evaluation, and model explanation.
- Data Labeling provides labeled datasets to more accurately train AI and ML models. Users can assemble data, create and browse datasets, and apply labels to data records through user interfaces and public APIs. The labeled datasets can be exported and used for model development. When you build ML models that work on images, text, or speech, you need labeled data that can be used to train the models.
- Data Flow provides a scalable environment for developers and data scientists to run Apache Spark applications in batch execution, at scale. You can run applications written in any Spark language to perform various data preparation tasks.
- Machine Learning in Oracle Database supports data exploration and preparation as well as building and deploying ML models using SQL, R, Python, REST, AutoML, and no-code interfaces. It includes more than 30 in-database algorithms that produce models in Oracle Database for immediate use in applications. Build models quickly by simplifying and automating key elements of the ML process.
OCI's AI services contain prebuilt ML models for specific uses. Some of the AI services are pretrained, and some you can train with your own data. To use them, you simply call the API for the service and pass in data to be processed; the service returns a result. There's no infrastructure to manage.
- Digital Assistant offers prebuilt skills and templates to create conversational experiences for business applications and customers through text, chat, and voice interfaces.
- Language makes it possible to perform sophisticated text analysis at scale. The Language service includes pretrained models for sentiment analysis, key phrase extraction, text classification, named entity recognition, and more.
- Speech uses automatic speech recognition (ASR) to convert speech to text. Because Speech is built on the same AI models used for Digital Assistant, developers can use time-tested acoustic and language models to provide highly accurate transcription for audio or video files across many languages.
- Vision applies computer vision to analyze image-based content. Developers can easily integrate pretrained models into their applications with APIs or custom-train models to meet their specific use cases. These models can be used to detect visual anomalies in manufacturing, extract text from documents to automate business workflows, and tag items in images to count products or shipments.
- Anomaly Detection enables developers to more easily build business-specific anomaly detection models that flag critical incidents, resulting in faster time to detection and resolution. Specialized APIs and automated model selection simplify training and deploying anomaly detection models to applications and operations.