Low Code Setup with Config Reader

In addition to the library-provided APIs, ML Insights can be set up and customized by authoring a JSON configuration and passing it to the library.

This document shows how to use the Insights configuration reader to load the builder and compute the profile.

Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader

Initialize the Insights builder by specifying the location of the monitor config JSON file (under config_location).

insights_builder = InsightsConfigReader(config_location="monitor_config.json").get_builder()

Run this builder object to get the run result, which yields the profile.

run_result = insights_builder.build().run()
profile = run_result.profile
print(profile.to_pandas())
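
Before handing the file to InsightsConfigReader, it can be convenient to confirm that it exists and parses as JSON. The following is a minimal sketch using only the Python standard library. The helper name check_config_file is hypothetical and not part of ML Insights, and it checks JSON syntax only, not whether the components are valid.

```python
import json
from pathlib import Path


def check_config_file(config_location: str) -> dict:
    """Return the parsed config, or raise ValueError with a readable message.

    Hypothetical helper for illustration; not part of ML Insights.
    """
    path = Path(config_location)
    if not path.is_file():
        raise ValueError(f"Config file not found: {path}")
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        # The decoder error message carries the line and column of the problem.
        raise ValueError(f"{path} is not valid JSON: {err}") from err
    if not isinstance(config, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return config
```

Calling check_config_file("monitor_config.json") before constructing the reader reports a missing comma or stray brace together with its line and column.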

How to set up configuration for low-code ML Insights

Sample Config File:

ml-insights-config.json
{
  "input_schema": {
    "Age": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    },
    "EnvironmentSatisfaction": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    }
  },
  "engine_detail": {
    "engine_name": "native"
  },
  "reader": {
    "type": "JsonlNativeDataReader",
    "params": {
      "url_path": "oci://<bucket>/*.jsonl",
      "storage_options": {
        "config": "~/.oci/config"
      },
      "orient": "records",
      "lines": "True"
    }
  },
  "dataset_metrics": [
    {
      "type": "RowCount"
    }
  ],
  "feature_metrics": {
    "Age": [
      {
        "type": "Min"
      },
      {
        "type": "Max"
      }
    ],
    "EnvironmentSatisfaction": [
      {
        "type": "Mode"
      },
      {
        "type": "Count"
      }
    ]
  },
  "transformers": [
    {
      "type": "ConditionalFeatureTransformer",
      "params": {
        "conditional_features": [
          {
            "feature_name": "Young",
            "data_type": "integer",
            "variable_type": "ordinal",
            "expression": "df.Age < 30"
          }
        ]
      }
    }
  ],
  "post_processors": [
    {
      "type": "ObjectStorageWriterPostProcessor",
      "params" : {
        "bucket_name": "test-bucket",
        "object_name": "config.json",
        "prefix": "mlinsights"
      }
    },
    {
      "type": "LocalWriterPostProcessor",
      "params" : {
        "file_name": "config.json" ,
        "file_location": "data"
      }
    }
  ],
  "tags": {
    "tag": "value"
  },
  "test_config": {
    "feature_metric_tests": [
      {
        "feature_name": "LotArea",
        "tests": [
          {
            "test_name": "TestGreaterThan",
            "metric_key": "Min",
            "threshold_value": 4.5,
            "tags": {
              "key_1": "value_1"
            }
          }
        ]
      }
    ],
    "dataset_metric_tests": [
      {
        "test_name": "TestGreaterThan",
        "metric_key": "RowCount",
        "threshold_value": 40,
        "tags": {
          "subtype": "falls-xgb"
        }
      }
    ]
  }
}
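
Writing the file by hand makes it easy to leave in a stray brace or a missing comma. One alternative is to assemble the configuration as a Python dictionary and serialize it with json.dump, which always emits syntactically valid JSON. The sketch below uses only keys and values taken from the sample above. The helper name write_config and the choice of which components to include are illustrative assumptions, not a recommendation from the library.

```python
import json


def write_config(config: dict, path: str) -> None:
    """Serialize a config dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)


# A reduced version of the sample config, expressed as a Python dictionary.
config = {
    "input_schema": {
        "Age": {
            "data_type": "integer",
            "variable_type": "continuous",
            "column_type": "input",
        }
    },
    "engine_detail": {"engine_name": "native"},
    "reader": {
        "type": "JsonlNativeDataReader",
        "params": {
            "url_path": "oci://<bucket>/*.jsonl",
            "storage_options": {"config": "~/.oci/config"},
            "orient": "records",
            "lines": "True",
        },
    },
    "dataset_metrics": [{"type": "RowCount"}],
    "feature_metrics": {"Age": [{"type": "Min"}, {"type": "Max"}]},
}
```

Calling write_config(config, "monitor_config.json") then produces a file that can be passed as config_location.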

Insight Components

Input Schema

This is a required component and must be defined in the config.

The input schema maps each feature to its data type, variable type, and column type.

Description

Key: feature_name
Value: an object of key-value pairs specifying data_type, variable_type, and column_type
Example: "Age": { "data_type": "integer", "variable_type": "continuous", "column_type": "input" }

  • Data Type (Required)

A data type must be provided for each feature of the input dataset. It represents the type of the feature's values.

Supported data_type - “integer”, “float”, “string”, “boolean”, “text”, “object”

  • Variable Type (Required)

A variable type must be provided for each feature of the input dataset. It represents the type of statistical random variable the feature corresponds to.

Supported variable_type - “continuous”, “discrete”, “nominal”, “ordinal”, “binary”, “text”, “object”

  • Column Type (Optional - Default value “input”)

Insights supports performance metrics for regression and classification models. In addition to these, Insights also supports multivariate metrics like Feature Importance. These metrics require the prediction columns or target columns (ground truth) to be in the input dataset. To make it easier to configure the metrics, Insights allows users to configure the prediction or target columns using the feature schema.

Supported column_type - “input”, “prediction”, “target”, “prediction_score”

Example
{
    "input_schema": {
        "sepal length (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        },
        "sepal width (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        }
    }
}
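
The supported values listed above lend themselves to a quick check before a run. The following sketch validates an input_schema dictionary against those lists and applies the documented default column_type of "input". The function name check_input_schema is hypothetical, and the library may perform its own, different validation.

```python
# Supported values as listed in this section.
SUPPORTED_DATA_TYPES = {"integer", "float", "string", "boolean", "text", "object"}
SUPPORTED_VARIABLE_TYPES = {
    "continuous", "discrete", "nominal", "ordinal", "binary", "text", "object",
}
SUPPORTED_COLUMN_TYPES = {"input", "prediction", "target", "prediction_score"}


def check_input_schema(input_schema: dict) -> list:
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper for illustration; not part of ML Insights.
    """
    problems = []
    for feature, spec in input_schema.items():
        data_type = spec.get("data_type")
        variable_type = spec.get("variable_type")
        column_type = spec.get("column_type", "input")  # documented default
        if data_type not in SUPPORTED_DATA_TYPES:
            problems.append(f"{feature}: unsupported data_type {data_type!r}")
        if variable_type not in SUPPORTED_VARIABLE_TYPES:
            problems.append(f"{feature}: unsupported variable_type {variable_type!r}")
        if column_type not in SUPPORTED_COLUMN_TYPES:
            problems.append(f"{feature}: unsupported column_type {column_type!r}")
    return problems
```

A feature that omits data_type or variable_type is reported as well, because both are required.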

Reader

This is a required component and must be defined in the config.

Reader allows for the ingestion of raw data into the framework.

Description

Field Name: type
Description: type of data reader to be used
Example: "type": "JsonlDaskDataReader"

Field Name: params
Description: reader params (required); data_source (optional)
Example: "params": { "url_path": "oci://<bucket>/*.jsonl", "storage_options": { "config": "~/.oci/config" }, "orient": "records", "lines": "True" }
"data_source": { "type": "LocalDatePrefixDataSource", "params": { "base_location": "mlm_demo", "file_type": "csv", "date_range": {"start": "2023-04-04", "end": "2023-04-05"} } }

Example using data source for determining the data location
{
    "reader": {
        "type": "CSVNativeDataReader",
        "params": {
          "data_source": {
            "type": "LocalDatePrefixDataSource",
            "params": {
              "base_location":"mlm_demo",
              "file_type": "csv",
              "date_range": {"start": "2023-04-04", "end": "2023-04-05"}
            }
          }
        }
    }
}
Example without using data_source
{
    "reader": {
        "type": "JsonlNativeDataReader",
        "params": {
          "url_path": "oci://<bucket>/*.jsonl",
          "storage_options": {
            "config": "~/.oci/config"
          },
          "orient": "records",
          "lines": "True"
        }
    }
}
Supported Readers
CSVDaskDataReader
JsonlDaskDataReader
CSVNativeDataReader
JsonlNativeDataReader
NestedJsonNativeDataReader
NestedJsonDaskDataReader

The location of the files to be read can be defined either directly in the reader params (for example, url_path) or by specifying a data source in the reader.
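
The two ways of locating data can be sketched as a small helper that builds the reader entry. The helper name make_reader is hypothetical, and requiring exactly one of url_path or data_source is an assumption of this sketch rather than a documented rule. The data_source is placed under params, following the layout of the example above.

```python
from typing import Optional


def make_reader(reader_type: str,
                url_path: Optional[str] = None,
                data_source: Optional[dict] = None,
                **extra_params) -> dict:
    """Build a reader entry from either a url_path or a data_source.

    Hypothetical helper. Requiring exactly one of the two is an
    assumption of this sketch, not a documented rule.
    """
    if (url_path is None) == (data_source is None):
        raise ValueError("Provide exactly one of url_path or data_source")
    params = dict(extra_params)  # e.g. storage_options, orient, lines
    if url_path is not None:
        params["url_path"] = url_path
    else:
        params["data_source"] = data_source
    return {"type": reader_type, "params": params}
```

For example, make_reader("CSVNativeDataReader", data_source={"type": "LocalDatePrefixDataSource", "params": {...}}) yields the same shape as the data source example, with the inner params filled in as shown there.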

Data Source

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

Supported Data Sources
FileUrlDataSource
LocalDatePrefixDataSource
LocalFileDataSource
OCIDatePrefixDataSource
OCIObjectStorageDataSource

Feature Metrics

In this section, list the metrics to compute for each feature.

Description

Key: feature_name
Value: metric list

Supported Feature Metrics
# Data quality metrics
Count
DistinctCount
DuplicateCount
FrequencyDistribution
Max
Mean
Min
Mode
ProbabilityDistribution
Range
Skewness
StandardDeviation
Sum
IQR
Kurtosis
TopKFrequentElements
TypeMetric
Variance

# Data Integrity
IsConstantFeature
IsQuasiConstantFeature
Quartiles

# Drift Metrics
KullbackLeibler
KolmogorovSmirnov
ChiSquare
JensenShannon
PopulationStabilityIndex

Example
"feature_metrics": {
    "sepal length (cm)" : [
    {"type": "Sum"},{"type": "Quartiles"}
    ],
    "sepal width (cm)": [
        {"type": "Min"},{"type": "DistinctCount"}
    ],
    "petal length (cm)": [
        {"type": "Count"},{"type": "Mean"}
    ],
    "petal width (cm)": [
        {"type": "IsQuasiConstantFeature"},{"type": "Kurtosis"}
    ]
}
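
When many features share the same metrics, the list-of-objects form can be generated rather than typed out. The sketch below expands a compact mapping of feature names to metric names into that form. The function name build_feature_metrics is hypothetical, and the sketch covers only the "type" field shown in the examples.

```python
def build_feature_metrics(metrics_by_feature: dict) -> dict:
    """Expand {"feature": ["Min", "Max"]} into the list-of-objects form.

    Hypothetical helper for illustration; not part of ML Insights.
    """
    return {
        feature: [{"type": name} for name in metric_names]
        for feature, metric_names in metrics_by_feature.items()
    }


feature_metrics = build_feature_metrics({
    "sepal length (cm)": ["Sum", "Quartiles"],
    "sepal width (cm)": ["Min", "DistinctCount"],
})
```

The resulting dictionary can be assigned to the feature metrics key of the configuration before it is serialized to JSON.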

Dataset Metrics

Description

List of metrics to be calculated on the data set.

Example
"dataset_metrics": [
    {
        "type": "RowCount"
    }
]
Supported Data Set Metrics
# Data Quality Metrics
RowCount
CramersVCorrelation
PearsonCorrelation

# Regression Metrics
MeanAbsoluteError
MeanSquaredError
R2Score
RootMeanSquaredError
MeanSquaredLogError
MeanAbsolutePercentageError
MaxError

# Classification metrics
AccuracyScore
PrecisionScore
RecallScore
FBetaScore
FalsePositiveRate
FalseNegativeRate
Specificity
ConfusionMatrix
LogLoss
ROCCurve
ROCAreaUnderCurve
PrecisionRecallCurve
PrecisionRecallAreaUnderCurve

# Conflict Metrics
ConflictPrediction
ConflictLabel

Post Processor

Post processor components are responsible for running any action after the entire data set is processed and all the metrics are calculated.

Description

Field Name: type
Description: type of post processor
Example: "type": "ObjectStorageWriterPostProcessor"

Field Name: params
Description: For LocalWriterPostProcessor: file_name, file_location. For ObjectStorageWriterPostProcessor: bucket_name, object_name, prefix.
Example: "params": { "bucket_name": "test-bucket", "object_name": "config.json", "prefix": "mlinsights" }

Example
  "post_processors": [
  {
    "type": "ObjectStorageWriterPostProcessor",
    "params" : {
      "bucket_name": "test-bucket",
      "object_name": "config.json",
      "prefix": "mlinsights"
    }
  },
  {
    "type": "LocalWriterPostProcessor",
    "params" : {
      "file_name": "config.json" ,
      "file_location": "data"
    }
  }
  ]
Supported Post Processors
LocalWriterPostProcessor
ObjectStorageWriterPostProcessor

Transformer

The transformer component provides an easy way to do simple in-memory transformations on the input data.

The transformers entry lists the transformers to apply before the Insights run, for example to add a conditional feature or otherwise transform the data.

Description

Field Name: type
Description: type of transformer
Example: "type": "ConditionalFeatureTransformer"

Field Name: params
Description: conditional_features - list of conditional features
Example: "params": { "conditional_features": [ { "feature_name": "Young", "data_type": "integer", "variable_type": "ordinal", "expression": "df.Age < 30" } ] }

Conditional Features

Field Name: expression
Value: A Python expression written using pandas Series-based functions. Only pandas Series-level functions and the symbol 'df' are supported in the expression.
Remarks: The expression must return a valid output. For example: "expression": "df.Age < 30"

Field Name: feature_name
Value: <any name that suits your feature>

Field Name: data_type
Value: The data type of the feature.

Field Name: variable_type
Value: The variable type of the feature.

Example
"transformers": [
{
  "type": "ConditionalFeatureTransformer",
  "params": {
    "conditional_features": [
      {
        "feature_name": "Young",
        "data_type": "integer",
        "variable_type": "continuous",
        "expression": "df.Age < 30"
      }
    ]
  }
}
]
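
Since an expression may reference only the symbol 'df', a quick syntactic check can catch expressions written against some other variable. The following sketch uses the standard ast module to confirm that an expression parses and that 'df' is the only bare name in it. The function name uses_only_df is hypothetical. The check does not verify that only Series-level functions are used, and it makes no claim about how the library itself evaluates the expression.

```python
import ast


def uses_only_df(expression: str) -> bool:
    """Return True if the expression parses and its only bare name is 'df'.

    Hypothetical helper for illustration; not part of ML Insights.
    """
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    # Attribute access such as df.Age contributes only the base name 'df'.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return names == {"df"}
```

With this check, "df.Age < 30" passes, while an expression written against a different variable, such as "row['Age'] < 30", is rejected.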

Engine

The engine is the underlying framework used to run the computations. By default, the execution engine is native, which is based on native Python and pandas.

Description

Field Name: engine_name
Description: type of engine
Example: "engine_name": "native"

Example
"engine_detail": {
    "engine_name": "native"
}
Supported Engines
native
dask

Tags

Note: this is a library-internal concept and should not be confused with OCI resource tags.

  • Tags are user-provided key-value pairs.

  • Applications consuming the library can provide tags to be associated with a profile. For example, when running the baseline or evaluation run, we can store:

    <monitor_id: value>

Example
"tags": {
    "tenancy": "tenancy-xyz",
    "monitor_id": "ocid-********"
}

Tests/Test Suites

For detailed documentation, please refer to section: Test/Test Suites Component