Getting Started

ML Insights helps users throughout their models' lifecycle, from conceptualisation all the way to post-production monitoring. Insights does this through a component-based architecture, which makes the library easy to use, highly customisable, and reusable. All of the components are also interface-based, making them easy to extend.

This section introduces the components available to the user, the responsibility each component serves, and the basics of how to use it. It provides a general overview; for more details, please see the sub-documents.

Builder Object

At the very top layer of the framework is the builder object. Its responsibility is to accumulate the different components from the user and to run validation logic, both to check that all mandatory components are provided and to perform component-level validation.

At a minimum, the builder expects a reader component or a pre-built dataframe object to ingest the data to be evaluated. It also expects (as of the current version) the schema of the data. Each component described here can be passed to the builder object. After all the mandatory and optional components have been provided, the build API can be called. If there are no errors, the API returns the runner object.

The builder object expects only component interfaces to be passed in. Hence, any custom implementation can be passed safely to the builder as long as it implements the interface correctly.
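
The accumulate-then-validate flow described above can be illustrated with a small, self-contained sketch. It is not the ML Insights API: SketchBuilder, SketchRunner, ListReader, and the with_* method names are hypothetical stand-ins chosen for illustration, and a plain list of dictionaries stands in for a dataframe.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Reader(Protocol):
    """Hypothetical reader interface: anything with a read() method."""

    def read(self) -> list[dict[str, Any]]: ...


class ListReader:
    """A custom implementation that satisfies the Reader interface."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def read(self) -> list[dict[str, Any]]:
        return list(self._rows)


@dataclass
class SketchRunner:
    """Stand-in for the runner object that build() returns."""

    schema: dict[str, str]
    reader: Optional[Reader]
    data_frame: Optional[list[dict[str, Any]]]


class SketchBuilder:
    """Accumulates components, then validates them in build()."""

    def __init__(self) -> None:
        self._schema = None
        self._reader = None
        self._data_frame = None

    def with_schema(self, schema: dict[str, str]) -> SketchBuilder:
        self._schema = schema
        return self

    def with_reader(self, reader: Reader) -> SketchBuilder:
        self._reader = reader
        return self

    def with_data_frame(self, rows: list[dict[str, Any]]) -> SketchBuilder:
        self._data_frame = rows
        return self

    def build(self) -> SketchRunner:
        errors = []
        # Mandatory-component checks.
        if self._reader is None and self._data_frame is None:
            errors.append("a reader or a pre-built data frame is required")
        if not self._schema:
            errors.append("a schema is required")
        # Component-level check: the reader must honour the interface.
        if self._reader is not None and not callable(
            getattr(self._reader, "read", None)
        ):
            errors.append("reader does not implement read()")
        if errors:
            raise ValueError("; ".join(errors))
        return SketchRunner(self._schema, self._reader, self._data_frame)


# A custom reader is accepted because it implements read().
runner = (
    SketchBuilder()
    .with_schema({"age": "int"})
    .with_reader(ListReader([{"age": 31}]))
    .build()
)
```

In this sketch, calling build() without a schema or without any data input raises a ValueError that lists every missing piece at once, rather than stopping at the first problem.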

Runner Object

The runner component represents the core of the framework and is responsible for holding the execution contracts of all its child components. While each child component can be passed as an interface with any implementation logic, the order in which the components run, when they run, and when they are initialised or destroyed are all handled by the runner component. In essence, the runner object controls the entire life cycle of each child component passed through the builder.

The runner object also handles other responsibilities, like thread pool management and execution engine abstraction. Details are covered in the Dive Deep Section.
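
As a rough illustration of what it means for the runner to own the life cycle, the sketch below drives toy components through initialise, run, and destroy in a fixed order. The method names and the run_all function are assumptions made for this example, not the framework's contract, and thread pool management and engine abstraction are deliberately left out.

```python
from typing import Protocol


class Component(Protocol):
    """Hypothetical life-cycle contract; the method names are assumptions."""

    def initialise(self, log: list) -> None: ...

    def run(self, log: list) -> None: ...

    def destroy(self, log: list) -> None: ...


class RecordingComponent:
    """Toy component that records each life-cycle call it receives."""

    def __init__(self, name: str) -> None:
        self.name = name

    def initialise(self, log: list) -> None:
        log.append(f"init:{self.name}")

    def run(self, log: list) -> None:
        log.append(f"run:{self.name}")

    def destroy(self, log: list) -> None:
        log.append(f"destroy:{self.name}")


def run_all(components: list) -> list:
    """Drive every component through init, run, and destroy in a fixed order."""
    log: list = []
    started = []
    try:
        for component in components:
            component.initialise(log)
            started.append(component)
        for component in started:
            component.run(log)
    finally:
        # Tear down in reverse order, even if a component raised.
        for component in reversed(started):
            component.destroy(log)
    return log


events = run_all([RecordingComponent("reader"), RecordingComponent("metric")])
```

The components only implement behaviour; the caller decides ordering and teardown, which is the separation the runner is described as providing.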

Schema

The schema defines the structure and metadata of the input data, including the data type, column type, and column mappings. As of the current version, this is mandatory information that has to be provided to the framework. The framework uses this information as the ground truth: any deviation in the actual data is treated as an anomaly, and the framework will usually ignore all such anomalies in the data.
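
One plausible reading of "schema as ground truth" is sketched below: values that do not match the declared type of a column are skipped rather than coerced. The function name and the use of Python types as the schema representation are assumptions for illustration; the framework's actual schema objects and anomaly handling may differ.

```python
from typing import Any


def conforming_values(
    rows: list[dict[str, Any]], column: str, expected_type: type
) -> list[Any]:
    """Return the values of `column` whose type matches the declared one.

    Missing values and values of another type are treated as anomalies
    and left out of the result.
    """
    return [
        row[column]
        for row in rows
        if isinstance(row.get(column), expected_type)
    ]


# Hypothetical schema: column name mapped to the declared Python type.
schema = {"age": int}
rows = [{"age": 31}, {"age": "n/a"}, {}, {"age": 45}]
ages = conforming_values(rows, "age", schema["age"])
```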

Note

In upcoming versions, the schema becomes an optional parameter, and users can choose to use other strategies such as automatic schema inference.

Reader Component

The first component we will look at is the Reader component. The reader allows raw data to be ingested into the framework. This component is primarily responsible for understanding different data formats (for example, jsonl, csv) and how to read them properly. In essence, given a set of valid file locations, each representing a file of a specific type, the reader can decode the contents and load them into memory.

Additionally, the Data Source component is an optional subcomponent that is usually used alongside the reader. It is responsible for fetching the list of file locations to read from.

Data Source Component

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read. For example, if Insights needs to fetch data from OCI Object Storage, an ObjectStorageFileSearchDataSource is used, which returns a list of objects in a specific bucket.

The end result of the component is a list of URLs.
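
The division of labour between a data source (find the locations) and a reader (decode them) can be sketched with local files. LocalFileSearchDataSource, JsonlReader, and their method names are hypothetical and are not the library's classes; a temporary directory stands in for an Object Storage bucket, and local paths stand in for URLs.

```python
import json
import tempfile
from pathlib import Path


class LocalFileSearchDataSource:
    """Hypothetical data source: lists files under a directory by suffix."""

    def __init__(self, root: str, suffix: str) -> None:
        self.root = Path(root)
        self.suffix = suffix

    def get_locations(self) -> list:
        # Only locations are returned; nothing is opened or decoded here.
        return sorted(str(p) for p in self.root.rglob(f"*{self.suffix}"))


class JsonlReader:
    """Hypothetical reader: decodes JSON Lines files into row dictionaries."""

    def read(self, locations: list) -> list:
        rows = []
        for location in locations:
            with open(location, encoding="utf-8") as handle:
                rows.extend(json.loads(line) for line in handle if line.strip())
        return rows


def demo() -> tuple:
    """Create sample files, then list and read only the .jsonl ones."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, "part-0.jsonl").write_text(
            '{"age": 31}\n{"age": 45}\n', encoding="utf-8"
        )
        Path(root, "notes.txt").write_text("not data", encoding="utf-8")
        locations = LocalFileSearchDataSource(root, ".jsonl").get_locations()
        return len(locations), JsonlReader().read(locations)


location_count, loaded_rows = demo()
```

Keeping the two roles separate means the same reader can be reused with a different data source, as long as that source returns locations the reader can open.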

Transformer Component

Transformers are optional components that can be used to modify, normalize, or extend the input data frame. Typically, transformers are used for data formatting or normalization before the data frame is sent on for metric computation. Some example use cases for transformers are:

  • Adding a new column to the input data frame based on existing columns. This can also be used to convert unstructured data to structured data.

  • Verifying that specific columns are present in the data frame.

Multiple transformers can be chained to operate on the input data frame and produce a final data frame. An advanced customer can write their own transformer by extending the transformer interface and writing custom logic to transform their own data.
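
The two use cases and the chaining behaviour can be sketched as follows. The Transformer protocol, its transform method, and both example classes are assumptions for illustration rather than the library's transformer interface, and rows are plain dictionaries instead of a data frame.

```python
from typing import Protocol


class Transformer(Protocol):
    """Hypothetical transformer interface: rows in, rows out."""

    def transform(self, rows: list) -> list: ...


class AddDebtRatio:
    """Adds a column derived from two existing columns."""

    def transform(self, rows: list) -> list:
        # Assumes income is non-zero; a real transformer would guard this.
        return [{**row, "debt_ratio": row["debt"] / row["income"]} for row in rows]


class RequireColumns:
    """Verifies that specific columns are present in every row."""

    def __init__(self, *columns: str) -> None:
        self.columns = columns

    def transform(self, rows: list) -> list:
        for index, row in enumerate(rows):
            missing = [name for name in self.columns if name not in row]
            if missing:
                raise ValueError(f"row {index} is missing columns: {missing}")
        return rows


def apply_chain(rows: list, transformers: list) -> list:
    """Feed the output of each transformer into the next one."""
    for transformer in transformers:
        rows = transformer.transform(rows)
    return rows


final_rows = apply_chain(
    [{"debt": 50.0, "income": 200.0}],
    [RequireColumns("debt", "income"), AddDebtRatio(), RequireColumns("debt_ratio")],
)
```

Placing the presence check before the derived column keeps the failure message specific: a missing input column is reported by name instead of surfacing as a KeyError inside the arithmetic.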

Metric Component

Metrics are the core construct of the framework. This component is responsible for calculating all statistical metrics and algorithms. Metric components work based on the type of the features (for example, input feature or output feature), their data type (for example, int, float, or string), and additional context (for example, whether a previous computation is available to compare against). ML Insights provides commonly used metrics out of the box for different ML observability use cases.
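
To make the idea of metrics depending on data type concrete, here is a minimal sketch that picks a different summary for numeric and string features. The function name, the type labels, and the chosen statistics are illustrative assumptions; they are not the metrics that ML Insights ships.

```python
from collections import Counter
from statistics import mean


def compute_metrics(values: list, data_type: str) -> dict:
    """Compute a small summary whose shape depends on the declared type."""
    if not values:
        return {"count": 0}
    if data_type in ("int", "float"):
        return {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
    if data_type == "string":
        counts = Counter(values)
        return {
            "count": len(values),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0],
        }
    raise ValueError(f"unsupported data type: {data_type}")


numeric_summary = compute_metrics([1, 2, 3], "int")
string_summary = compute_metrics(["a", "b", "a"], "string")
```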

Post Processor Component

The Post Processor is a flexible component that can extend the framework in different ways. The most common use case for post processors is to provide additional integration points. For example, post processors can be used to write metric set output to a storage system. However, given their open-ended nature, they can be used for any scenario, such as notification or alerting.

In essence, post processors are a set of actions that rely on the metric set output of the framework. There is no limitation on the kind of action these components can take, so they can call any third-party service or process the metric set output. Post processors don’t have access to the raw data, so they cannot manipulate it in any way (including writing it to any storage).
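
A sketch of this contract: each post processor below receives only the computed metric output, never the raw rows. The execute method, both classes, and the dictionary used as metric output are hypothetical choices for illustration and do not reflect the library's post processor interface.

```python
import json
import tempfile
from pathlib import Path


class JsonFileWriter:
    """Writes the metric output to a JSON file (a storage integration)."""

    def __init__(self, path) -> None:
        self.path = Path(path)

    def execute(self, metric_output: dict) -> None:
        self.path.write_text(
            json.dumps(metric_output, sort_keys=True), encoding="utf-8"
        )


class ThresholdNotifier:
    """Collects a message when a metric exceeds a limit (an alerting hook)."""

    def __init__(self, metric: str, limit: float) -> None:
        self.metric = metric
        self.limit = limit
        self.messages = []

    def execute(self, metric_output: dict) -> None:
        value = metric_output.get(self.metric)
        if value is not None and value > self.limit:
            self.messages.append(f"{self.metric}={value} exceeds {self.limit}")


def run_post_processors(metric_output: dict, post_processors: list) -> None:
    """Hand the same metric output to every post processor in turn."""
    for post_processor in post_processors:
        post_processor.execute(metric_output)


with tempfile.TemporaryDirectory() as root:
    notifier = ThresholdNotifier("missing_ratio", 0.1)
    writer = JsonFileWriter(Path(root, "metrics.json"))
    run_post_processors({"missing_ratio": 0.25}, [writer, notifier])
    written = json.loads(writer.path.read_text(encoding="utf-8"))
```

A real notifier would call a messaging or monitoring service instead of appending to a list; the list keeps the sketch free of external dependencies.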

Tests/Test Suites

The Insights Tests/Test Suites feature enables comprehensive validation of customers’ machine learning models and data. It provides a set of tests and test suites for various types of use cases, such as:

  • Data Integrity

  • Data Quality

  • Model Performance (Classification, Regression)

  • Drift

  • Correlation, etc

Developers can author tests using either ML Insights Configuration (JSON) or Python-based APIs. Test results can be consumed to send alerts via OCI Monitoring, allowing users to do continuous ML monitoring.
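
The general idea of configuration-driven tests can be sketched as below. The JSON layout, the field names, and the PASSED/FAILED/ERROR statuses are invented for this example and are not the ML Insights configuration format or test API; the sketch only shows threshold tests being evaluated against previously computed metric values.

```python
import json
import operator

# Hypothetical test configuration; not the ML Insights JSON schema.
CONFIG = json.loads(
    """
    {
      "tests": [
        {"name": "few_missing_values", "metric": "missing_ratio",
         "operator": "<=", "threshold": 0.05},
        {"name": "accuracy_floor", "metric": "accuracy",
         "operator": ">=", "threshold": 0.9},
        {"name": "drift_ceiling", "metric": "drift_score",
         "operator": "<", "threshold": 0.2}
      ]
    }
    """
)

OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def run_tests(config: dict, metrics: dict) -> list:
    """Evaluate each configured threshold test against computed metrics."""
    results = []
    for test in config["tests"]:
        value = metrics.get(test["metric"])
        if value is None:
            status = "ERROR"  # The metric was never computed.
        elif OPERATORS[test["operator"]](value, test["threshold"]):
            status = "PASSED"
        else:
            status = "FAILED"
        results.append({"name": test["name"], "status": status})
    return results


test_results = run_tests(CONFIG, {"missing_ratio": 0.02, "accuracy": 0.81})
```

Results in this shape could then be forwarded to an alerting system; only the failed or errored entries would normally trigger a notification.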

Note

This feature is available in Insights versions >= 1.1.0.