Low Code Setup with Config Reader

Along with the library-provided APIs, ML Insights can be set up and customized by authoring and passing a JSON configuration.

This document shows how to use the Insights configuration reader to load the builder and compute the profile.

Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader

Initialize the Insights builder by specifying the location of the monitor config JSON file (under config_location).

insights_builder = InsightsConfigReader(config_location="monitor_config.json").get_builder()

Run this builder object to get the run result, which yields the profile.

run_result = insights_builder.build().run()
profile = run_result.profile
print(profile.to_pandas())

How to set up the configuration for low code ML Insights

Sample Config File:

ml-insights-config.json

{
  "input_schema": {
    "Age": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    },
    "EnvironmentSatisfaction": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    }
  },
  "engine_detail": {
    "engine_name": "native"
  },
  "reader": {
    "type": "JsonlNativeDataReader",
    "params": {
      "url_path": "oci://<bucket>/*.jsonl",
      "storage_options": {
        "config": "~/.oci/config"
      },
      "orient": "records",
      "lines": "True"
    }
  },
  "dataset_metrics": [
    {
      "type": "RowCount"
    }
  ],
  "feature_metrics": {
    "Age": [
      {
        "type": "Min"
      },
      {
        "type": "Max"
      }
    ],
    "EnvironmentSatisfaction": [
      {
        "type": "Mode"
      },
      {
        "type": "Count"
      }
    ]
  },
  "transformers": [
    {
      "type": "ConditionalFeatureTransformer",
      "params": {
        "conditional_features": [
          {
            "feature_name": "Young",
            "data_type": "integer",
            "variable_type": "ordinal",
            "expression": "df.Age < 30"
          }
        ]
      }
    }
  ],
  "post_processors": [
    {
      "type": "ObjectStorageWriterPostProcessor",
      "params" : {
        "bucket_name": "test-bucket",
        "object_name": "config.json",
        "prefix": "mlinsights"
      }
    },
    {
      "type": "LocalWriterPostProcessor",
      "params" : {
        "file_name": "config.json" ,
        "file_location": "data"
      }
    }
  ],
  "tags": {
    "tag": "value"
  },
  "test_config": {
    "feature_metric_tests": [
      {
        "feature_name": "LotArea",
        "tests": [
          {
            "test_name": "TestGreaterThan",
            "metric_key": "Min",
            "threshold_value": 4.5,
            "tags": {
              "key_1": "value_1"
            }
          }
        ]
      }
    ],
    "dataset_metric_tests": [
      {
        "test_name": "TestGreaterThan",
        "metric_key": "RowCount",
        "threshold_value": 40,
        "tags": {
          "subtype": "falls-xgb"
        }
      }
    ]
  }
}
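
A config file has to be valid JSON, so it can be worth checking a file before handing it to InsightsConfigReader. The following is a minimal sketch that uses only the Python standard library. The helper name check_config is hypothetical and not part of ML Insights, and it only checks that the text parses and that the two sections this document marks as required (input_schema and reader) are present.

```python
import json

# Sections this document describes as required; everything else is optional.
REQUIRED_SECTIONS = ("input_schema", "reader")


def check_config(text):
    """Parse config text and return a list of problems (empty if none found).

    Hypothetical helper, not part of the ML Insights library.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    return [
        f"missing required section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in config
    ]


if __name__ == "__main__":
    # This sample deliberately omits the "reader" section.
    sample = '{"input_schema": {}, "engine_detail": {"engine_name": "native"}}'
    for problem in check_config(sample):
        print(problem)
```

To check a file on disk, read its contents first (for example with pathlib.Path("monitor_config.json").read_text()) and pass the text to check_config.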

Insight Components

Input Schema

This is a required component and must be defined in the config.

The input schema maps each feature to its data type, variable type, and column type.

Description

Key	Value	Example
feature_name	Object with the keys data_type, variable_type, and column_type	"Age": { "data_type": "integer", "variable_type": "continuous", "column_type": "input" }

Data Type (Required)

A data type must be provided for each feature of the input dataset. It represents the type of the feature's values.

Supported data_type - “integer”, “float”, “string”, “boolean”, “text”, “object”

Variable Type (Required)

A variable type must be provided for each feature of the input dataset. It represents the kind of statistical random variable the feature corresponds to.

Supported variable_type - “continuous”, “discrete”, “nominal”, “ordinal”, “binary”, “text”, “object”

Column Type (Optional - Default value “input”)

Insights supports performance metrics for regression and classification models. In addition to these, Insights also supports multivariate metrics like Feature Importance. These metrics require the prediction columns or target columns (ground truth) to be in the input dataset. To make it easier to configure the metrics, Insights allows users to configure the prediction or target columns using the feature schema.

Supported column_type - “input”, “prediction”, “target”, “prediction_score”

Example

{
    "input_schema": {
        "sepal length (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        },
        "sepal width (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        }
    }
}
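
When the schema is generated from code rather than written by hand, the supported values listed above can be checked before the JSON is written. This is a minimal sketch under that assumption. The helper feature_entry and the "species" feature are hypothetical and only illustrate the layout; neither comes from the ML Insights library.

```python
import json

# Allowed values as listed in this document.
DATA_TYPES = {"integer", "float", "string", "boolean", "text", "object"}
VARIABLE_TYPES = {
    "continuous", "discrete", "nominal", "ordinal", "binary", "text", "object",
}
COLUMN_TYPES = {"input", "prediction", "target", "prediction_score"}


def feature_entry(data_type, variable_type, column_type="input"):
    """Build one input_schema entry, rejecting values not listed above.

    Hypothetical helper, not part of the ML Insights library.
    """
    if data_type not in DATA_TYPES:
        raise ValueError(f"unsupported data_type: {data_type!r}")
    if variable_type not in VARIABLE_TYPES:
        raise ValueError(f"unsupported variable_type: {variable_type!r}")
    if column_type not in COLUMN_TYPES:
        raise ValueError(f"unsupported column_type: {column_type!r}")
    return {
        "data_type": data_type,
        "variable_type": variable_type,
        "column_type": column_type,
    }


schema = {
    "input_schema": {
        "sepal length (cm)": feature_entry("float", "continuous"),
        # "species" is an illustrative ground-truth column, marked as target.
        "species": feature_entry("string", "nominal", column_type="target"),
    }
}
print(json.dumps(schema, indent=2))
```

Omitting column_type in the call mirrors the documented default of "input".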

Reader

This is a required component and must be defined in the config.

Reader allows for the ingestion of raw data into the framework.

Description

Field Name	Description	Example
type	Type of data reader to be used	"type": "JsonlDaskDataReader"
params	Reader params (required); data_source (optional)	"params": { "url_path": "oci://<bucket>/*.jsonl", "storage_options": { "config": "~/.oci/config" }, "orient": "records", "lines": "True" }
		"data_source": { "type": "LocalDatePrefixDataSource", "params": { "base_location": "mlm_demo", "file_type": "csv", "date_range": {"start": "2023-04-04", "end": "2023-04-05"} } }

Example using data source for determining the data location

{
    "reader": {
        "type": "CSVNativeDataReader",
        "params": {
          "data_source": {
            "type": "LocalDatePrefixDataSource",
            "params": {
              "base_location":"mlm_demo",
              "file_type": "csv",
              "date_range": {"start": "2023-04-04", "end": "2023-04-05"}
            }
          }
        }
    }
}

Example without using data_source

{
    "reader": {
        "type": "JsonlNativeDataReader",
        "params": {
          "url_path": "oci://<bucket>/*.jsonl",
          "storage_options": {
            "config": "~/.oci/config"
          },
          "orient": "records",
          "lines": "True"
        }
    }
}

Supported Readers

CSVDaskDataReader
JsonlDaskDataReader
CSVNativeDataReader
JsonlNativeDataReader
NestedJsonNativeDataReader
NestedJsonDaskDataReader

The location of the files to be read can be defined either directly in the reader params or by specifying a data source in the reader.
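
The two ways of pointing a reader at its data can be sketched as a small builder that produces either layout. The helper reader_section is hypothetical and not part of ML Insights. It also assumes that a reader uses either a url_path or a data_source but not both, which the examples above suggest but this document does not state outright.

```python
import json


def reader_section(reader_type, url_path=None, data_source=None, **extra_params):
    """Build a "reader" section from either a url_path or a data_source.

    Hypothetical helper; it mirrors the two examples in this document.
    Assumption: exactly one of url_path and data_source is given.
    """
    if (url_path is None) == (data_source is None):
        raise ValueError("provide exactly one of url_path or data_source")
    params = dict(extra_params)
    if url_path is not None:
        params["url_path"] = url_path
    else:
        params["data_source"] = data_source
    return {"reader": {"type": reader_type, "params": params}}


# Reader that resolves its input files through a data source.
local_csv = reader_section(
    "CSVNativeDataReader",
    data_source={
        "type": "LocalDatePrefixDataSource",
        "params": {
            "base_location": "mlm_demo",
            "file_type": "csv",
            "date_range": {"start": "2023-04-04", "end": "2023-04-05"},
        },
    },
)
print(json.dumps(local_csv, indent=2))
```

Extra keyword arguments such as orient="records" are passed through into params unchanged, so the url_path example above can be produced the same way.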

Data Source

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

Supported Data Sources

FileUrlDataSource
LocalDatePrefixDataSource
LocalFileDataSource
OCIDatePrefixDataSource
OCIObjectStorageDataSource

Feature Metrics

In this section, list the metrics to compute for each feature.

Description

Key	Value
feature_name	metric list

Supported Feature Metrics

# Data quality metrics
Count
DistinctCount
DuplicateCount
FrequencyDistribution
Max
Mean
Min
Mode
ProbabilityDistribution
Range
Skewness
StandardDeviation
Sum
IQR
Kurtosis
TopKFrequentElements
TypeMetric
Variance

# Data Integrity
IsConstantFeature
IsQuasiConstantFeature
Quartiles

# Drift Metrics
KullbackLeibler
KolmogorovSmirnov
ChiSquare
JensenShannon
PopulationStabilityIndex

Example

"feature_metrics": {
    "sepal length (cm)": [
        {"type": "Sum"},{"type": "Quartiles"}
    ],
    "sepal width (cm)": [
        {"type": "Min"},{"type": "DistinctCount"}
    ],
    "petal length (cm)": [
        {"type": "Count"},{"type": "Mean"}
    ],
    "petal width (cm)": [
        {"type": "IsQuasiConstantFeature"},{"type": "Kurtosis"}
    ]
}
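
Since each metric is written as an object with a single "type" key, the section can be generated from a plain mapping of feature names to metric names. The sketch below assumes the feature_metrics key used in the sample config file. The helper feature_metrics_section is hypothetical and does not check names against the supported list.

```python
import json


def feature_metrics_section(metrics_by_feature):
    """Expand {feature: [metric names]} into the "feature_metrics" layout.

    Hypothetical helper, not part of the ML Insights library.
    """
    return {
        "feature_metrics": {
            feature: [{"type": name} for name in names]
            for feature, names in metrics_by_feature.items()
        }
    }


section = feature_metrics_section({
    "sepal length (cm)": ["Sum", "Quartiles"],
    "sepal width (cm)": ["Min", "DistinctCount"],
})
print(json.dumps(section, indent=2))
```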

Dataset Metrics

Description

List of metrics to be calculated on the data set.

Example

"dataset_metrics": [
    {
        "type": "RowCount"
    }
]

Supported Dataset Metrics

# Data Quality Metrics
RowCount
CramersVCorrelation
PearsonCorrelation

# Regression Metrics
MeanAbsoluteError
MeanSquaredError
R2Score
RootMeanSquaredError
MeanSquaredLogError
MeanAbsolutePercentageError
MaxError

# Classification metrics
AccuracyScore
PrecisionScore
RecallScore
FBetaScore
FalsePositiveRate
FalseNegativeRate
Specificity
ConfusionMatrix
LogLoss
ROCCurve
ROCAreaUnderCurve
PrecisionRecallCurve
PrecisionRecallAreaUnderCurve

# Conflict Metrics
ConflictPrediction
ConflictLabel

Post Processor

Post processor components are responsible for running any action after the entire data set is processed and all the metrics are calculated.

Description

Field Name	Description	Example
type	Type of post processor	"type": "ObjectStorageWriterPostProcessor"
params	For LocalWriterPostProcessor: file_name, file_location. For ObjectStorageWriterPostProcessor: bucket_name, object_name, prefix.	"params": { "bucket_name": "test-bucket", "object_name": "config.json", "prefix": "mlinsights" }

Example

"post_processors": [
  {
    "type": "ObjectStorageWriterPostProcessor",
    "params" : {
      "bucket_name": "test-bucket",
      "object_name": "config.json",
      "prefix": "mlinsights"
    }
  },
  {
    "type": "LocalWriterPostProcessor",
    "params" : {
      "file_name": "config.json" ,
      "file_location": "data"
    }
  }
]

Supported Post Processors

LocalWriterPostProcessor
ObjectStorageWriterPostProcessor

Transformer

The transformer component provides an easy way to do simple in-memory transformations on the input data.

The config lists the transformers used to add conditional features or otherwise transform the data before the Insights run.

Description

Field Name	Description	Example
type	Type of transformer	"type": "ConditionalFeatureTransformer"
params	conditional_features: list of conditional features	"params": { "conditional_features": [ { "feature_name": "Young", "data_type": "integer", "variable_type": "ordinal", "expression": "df.Age < 30" } ] }

Conditional Features

Field Name	Value	Remarks
expression	Python expression written with pandas Series functions. Only Series-level functions and the symbol 'df' are supported.	The expression must return a valid output. For example: "expression": "df.Age < 30"
feature_name	Any name that suits your feature
data_type	The data type of the feature
variable_type	The variable type of the feature

Example

"transformers": [
  {
    "type": "ConditionalFeatureTransformer",
    "params": {
      "conditional_features": [
        {
          "feature_name": "Young",
          "data_type": "integer",
          "variable_type": "continuous",
          "expression": "df.Age < 30"
        }
      ]
    }
  }
]

Engine

The engine is the underlying framework used to run the computations. By default, the execution engine is native, which is based on native Python and pandas.

Description

Field Name	Description	Example
engine_name	Type of engine	"engine_name": "native"

Example

"engine_detail": {
    "engine_name": "native"
}

Supported Engines

native
dask

Tags

Note: tags are a library-internal concept and should not be confused with OCI resource tags.

Tags are user-provided key-value pairs. Applications consuming the library can provide tags to be associated with a profile. For example, when running a baseline or evaluation run, the application can store the monitor ID:
<monitor_id: value>

Example

"tags": {
    "tenancy": "tenancy-xyz",
    "monitor_id": "ocid-********"
}

Tests/Test Suites

For detailed documentation, refer to the section Test/Test Suites Component.