Application Configuration
=========================

The ML Monitoring Application is set up and customized through a JSON configuration file.
Save the configuration to an Object Storage location and pass that location in the CONFIG_FILE variable of RUNTIME_PARAMETER when starting a job run.

This document describes the application components and shows how to create an application configuration.
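
As a minimal sketch (not part of the application itself), a configuration can be assembled as a Python dictionary and serialized with the standard ``json`` module. The helper name ``render_config`` and the sample values are illustrative assumptions; only the keys come from this document.

.. code-block:: python

    import json

    # Illustrative values only; replace the placeholders before use.
    config = {
        "monitor_id": "speech_model_monitor",
        "storage_details": {
            "storage_type": "OciObjectStorage",
            "params": {"namespace": "<namespace>", "bucket_name": "<bucket_name>"},
        },
        "input_schema": {
            "Age": {
                "data_type": "integer",
                "variable_type": "continuous",
                "column_type": "input",
            }
        },
    }


    def render_config(config: dict) -> str:
        """Serialize the configuration to indented JSON text."""
        return json.dumps(config, indent=2)


    config_text = render_config(config)

The resulting text can be saved to a file, uploaded to Object Storage, and referenced through CONFIG_FILE.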

Sample Config File
----------------------


.. collapse:: ml-monitoring-configuration.json

    .. code-block:: json

        {
          "monitor_id": "<monitor_id>",
          "storage_details": {
            "storage_type": "OciObjectStorage",
            "params": {
              "namespace": "<namespace>",
              "bucket_name": "<bucket_name>",
              "object_prefix": "<prefix>"
            }
          },
          "input_schema": {
            "Age": {
              "data_type": "integer",
              "variable_type": "continuous",
              "column_type": "input"
            },
            "EnvironmentSatisfaction": {
              "data_type": "integer",
              "variable_type": "continuous",
              "column_type": "input"
            }
          },
          "baseline_reader": {
            "type": "CSVDaskDataReader",
            "params": {
              "file_path": "oci://<path>"
            }
          },
          "prediction_reader": {
            "type": "CSVDaskDataReader",
            "params": {
              "data_source": {
                "type": "ObjectStorageFileSearchDataSource",
                "params": {
                  "file_path": ["oci://<path>"],
                  "filter_arg": [
                    {
                      "partition_based_date_range": {
                        "start": "2023-06-26",
                        "end": "2023-06-27",
                        "data_format": ".d{4}-d{2}-d{2}."
                      }
                    }
                  ]
                }
              }
            }
          },
          "dataset_metrics": [
            {
              "type": "RowCount"
            }
          ],
          "feature_metrics": {
            "Age": [
              {
                "type": "Min"
              },
              {
                "type": "Max"
              }
            ],
            "EnvironmentSatisfaction": [
              {
                "type": "Mode"
              },
              {
                "type": "Count"
              }
            ]
          },
          "transformers": [
            {
              "type": "ConditionalFeatureTransformer",
              "params": {
                "conditional_features": [
                  {
                    "feature_name": "Young",
                    "data_type": "integer",
                    "variable_type": "ordinal",
                    "expression": "df.Age < 30"
                  }
                ]
              }
            }
          ],
          "post_processors": [
            {
              "type": "SaveMetricOutputAsJsonPostProcessor",
              "params": {
                "file_name": "<file_name>",
                "test_results_file_name": "<test_result_file_name>",
                "file_location_expression": "<expression>",
                "date_range": {
                  "start": "2023-08-01",
                  "end": "2023-08-05"
                },
                "can_overwrite_profile_json": false,
                "can_overwrite_test_results_json": false,
                "namespace": "<namespace>",
                "bucket_name": "<bucket_name>"
              }
            },
            {
              "type": "OCIMonitoringApplicationPostProcessor",
              "params": {
                "compartment_id": "<COMPARTMENT_ID>",
                "namespace": "<NAMESPACE>",
                "date_range": {
                  "start": "2023-08-01",
                  "end": "2023-08-05"
                },
                "dimensions": {
                  "key1": "value1",
                  "key2": "value2"
                }
              }
            }
          ],
          "tags": {
            "tag": "value"
          },
          "test_config": {
            "tags": {
              "key_1": "these tags are sent in test results"
            },
            "feature_metric_tests": [
              {
                "feature_name": "Age",
                "tests": [
                  {
                    "test_name": "TestGreaterThan",
                    "metric_key": "Min",
                    "threshold_value": 17
                  },
                  {
                    "test_name": "TestIsComplete"
                  }
                ]
              }
            ],
            "dataset_metric_tests": [
              {
                  "test_name": "TestGreaterThan",
                  "metric_key": "RowCount",
                  "threshold_value": 40,
                  "tags": {
                    "subtype": "falls-xgb"
                  }
                }
            ]
          }
        }

|

ML Monitoring Application Components
------------------------------------

Monitor ID
~~~~~~~~~~~~~~~~

**This is a required component and must be defined in the configuration.**

A user-provided ID that uniquely identifies a monitor configuration.

A monitor_id must follow these rules:

* The length is a minimum of 8 characters and a maximum of 48 characters.
* Valid characters are letters (upper or lowercase), numbers, hyphens, underscores, and periods.
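
The rules above can be checked before a job run with a short regular expression. This is a sketch under stated assumptions: the function name ``is_valid_monitor_id`` is hypothetical, and "letters" is taken to mean ASCII letters, which the rules do not state explicitly.

.. code-block:: python

    import re

    # 8 to 48 characters drawn from ASCII letters, digits, periods,
    # underscores, and hyphens (the ASCII restriction is an assumption).
    MONITOR_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{8,48}")


    def is_valid_monitor_id(monitor_id: str) -> bool:
        """Return True when the whole string satisfies the monitor_id rules."""
        return MONITOR_ID_PATTERN.fullmatch(monitor_id) is not None

For example, ``is_valid_monitor_id("speech_model_monitor")`` passes, while an ID shorter than 8 characters or one containing a space does not.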

Description
^^^^^^^^^^^

.. list-table::
   :widths: 25 25 50
   :header-rows: 1

   * - Key
     - Value
     - Example
   * - monitor_id
     - user defined string
     -  "monitor_id": "speech_model_monitor"

Example
^^^^^^^^^^^^^

.. code-block:: json

    {"monitor_id": "speech_model_monitor"}


Storage Details
~~~~~~~~~~~~~~~~

**This is a required component and must be defined in the configuration.**

Specifies the storage type and location used to retrieve the baseline profile (for a prediction run) and to persist the internal state of a run.

Description
^^^^^^^^^^^^^

.. list-table::
    :widths: 100 100 100
    :header-rows: 1

    * - Field Name
      - Description
      - Example
    * - storage_type
      - type of storage to be used for storing the internal state
      - "storage_type": "OciObjectStorage"
    * - params
      - params (required)
      -
        "params": {
          "namespace": "<namespace>",
          "bucket_name": "<bucket_name>",
          "object_prefix": "<prefix>"
        }

Supported Storage Details
^^^^^^^^^^^^^^^^^^^^^^^^^^

* OciObjectStorage

    Required Parameters
        * namespace - namespace of the bucket
        * bucket_name - bucket name

    Optional Parameters
        * object_prefix - the prefix for creating the directory for saving the internal state of the runs

Example
^^^^^^^^^^^^^
.. collapse:: storage_details

    .. code-block:: json

        "storage_details": {
          "storage_type": "OciObjectStorage",
          "params": {
            "namespace": "<namespace>",
            "bucket_name": "<bucket_name>",
            "object_prefix": "<object_prefix>"
          }
        }


Input Schema
~~~~~~~~~~~~~~~~

**This is a required component and must be defined in the configuration.**

The input schema maps each feature to its data type, variable type, and column type.

Description
^^^^^^^^^^^

.. list-table::
   :widths: 25 25 50
   :header-rows: 1

   * - Key
     - Value
     - Example
   * - feature_name
     - object with the keys data_type, variable_type, and column_type
     -  "Age": {
        "data_type": "integer",
        "variable_type": "continuous",
        "column_type": "input"
        }

* **Data Type (Required)**
    Specify a data type for each feature of the input dataset. It represents the type of the feature's values.

    *Supported data_type* - "integer", "float", "string", "boolean", "text", "object"

* **Variable Type (Required)**
    Specify a variable type for each feature of the input dataset. It represents the type of statistical random variable the feature is.

    *Supported variable_type* - "continuous", "discrete", "nominal", "ordinal", "binary", "text", "object"

* **Column Type (Optional - Default value "input")**
    Insights supports performance metrics for regression and classification models. Insights also supports multivariate metrics like Feature Importance. These metrics require the prediction columns or target columns (ground truth) to be in the input dataset. To make it easier to configure the metrics, Insights lets users configure the prediction or target columns using the feature schema.

    *Supported column_type* - "input", "prediction", "target", "prediction_score"
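
The supported values listed above lend themselves to a simple pre-flight check. The following is a minimal sketch, not the application's own validation; the function name ``schema_errors`` and the error messages are assumptions made for illustration.

.. code-block:: python

    SUPPORTED_DATA_TYPES = {"integer", "float", "string", "boolean", "text", "object"}
    SUPPORTED_VARIABLE_TYPES = {
        "continuous", "discrete", "nominal", "ordinal", "binary", "text", "object",
    }
    SUPPORTED_COLUMN_TYPES = {"input", "prediction", "target", "prediction_score"}


    def schema_errors(input_schema: dict) -> list:
        """Return one message per unsupported or missing schema value."""
        errors = []
        for feature, spec in input_schema.items():
            if spec.get("data_type") not in SUPPORTED_DATA_TYPES:
                errors.append(f"{feature}: unsupported or missing data_type")
            if spec.get("variable_type") not in SUPPORTED_VARIABLE_TYPES:
                errors.append(f"{feature}: unsupported or missing variable_type")
            # column_type is optional and defaults to "input".
            if spec.get("column_type", "input") not in SUPPORTED_COLUMN_TYPES:
                errors.append(f"{feature}: unsupported column_type")
        return errors

An empty list means every feature uses supported values.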


Example
^^^^^^^^^^^^^

.. code-block:: json

    {
        "input_schema": {
            "sepal length (cm)": {
              "data_type": "float",
              "variable_type": "continuous",
              "column_type": "input"
            },
            "sepal width (cm)": {
              "data_type": "float",
              "variable_type": "continuous",
              "column_type": "input"
            }
        }
    }


Baseline Reader
~~~~~~~~~~~~~~~~~

**If the action type is RUN_BASELINE, this is a required component and must be defined in the configuration.**

The baseline_reader component ingests raw data into the framework for a baseline run.

Description
^^^^^^^^^^^^^

.. list-table::
    :widths: 100 100 100
    :header-rows: 1

    * - Field Name
      - Description
      - Example
    * - type
      - type of reader to be used
      - "type": "JsonlDaskDataReader"
    * - params
      - reader params

        data_source (optional)
      -
        "params": {
              "file_path": "oci://<path>.csv"
            }

        "data_source": {
            "type": "ObjectStorageFileSearchDataSource",
            "params": {
              "file_path": [
                "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
              ],
              "filter_arg": [
                {
                  "partition_based_date_range": {
                    "start": "2023-06-26",
                    "end": "2023-06-27",
                    "data_format": ".d{4}-d{2}-d{2}."
                  }
                }
              ]
            }
          }

Example using data source for determining the data location
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: json

    "baseline_reader": {
      "type": "CSVDaskDataReader",
      "params": {
        "data_source": {
          "type": "ObjectStorageFileSearchDataSource",
          "params": {
            "file_path": [
              "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
            ],
            "filter_arg": [
              {
                "partition_based_date_range": {
                  "start": "2023-06-26",
                  "end": "2023-06-27",
                  "data_format": ".d{4}-d{2}-d{2}."
                }
              }
            ]
          }
        }
      }
    }

Example without using data_source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: json

    {
        "baseline_reader": {
            "type": "CSVDaskDataReader",
            "params": {
              "file_path": "oci://<path>.csv"
            }
          }
    }


**Supported Readers**

.. collapse:: Supported Readers

    .. code-block:: python

        CSVDaskDataReader
        JsonlDaskDataReader
        NestedJsonDaskDataReader


Use reader parameters to define the location of the files to be read or specify a data source in the reader.


**Data Source**

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

.. collapse:: Supported Data Sources

    .. code-block:: python

        OCIObjectStorageDataSource
        OCIDatePrefixDataSource
        ObjectStorageFileSearchDataSource
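
The two ways of locating data shown in the examples above (a direct ``file_path`` or a nested ``data_source``) can be generated with small helpers. This is a sketch only: ``build_date_filter`` and ``build_reader`` are hypothetical names, and both the "exactly one location style" rule and the start/end sanity check are assumptions of this sketch rather than documented application behavior.

.. code-block:: python

    from datetime import date


    def build_date_filter(start: str, end: str, data_format: str) -> dict:
        """Build a partition_based_date_range filter from ISO date strings."""
        # Sanity check (an assumption of this sketch): start must not follow end.
        if date.fromisoformat(start) > date.fromisoformat(end):
            raise ValueError("start must not be later than end")
        return {
            "partition_based_date_range": {
                "start": start,
                "end": end,
                "data_format": data_format,
            }
        }


    def build_reader(reader_type: str, file_path=None, data_source=None) -> dict:
        """Return a reader entry that uses either file_path or data_source."""
        # Assumption: exactly one of the two location styles is given.
        if (file_path is None) == (data_source is None):
            raise ValueError("provide exactly one of file_path or data_source")
        if file_path is not None:
            params = {"file_path": file_path}
        else:
            params = {"data_source": data_source}
        return {"type": reader_type, "params": params}


    data_source = {
        "type": "ObjectStorageFileSearchDataSource",
        "params": {
            "file_path": ["oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"],
            "filter_arg": [
                build_date_filter("2023-06-26", "2023-06-27", ".d{4}-d{2}-d{2}.")
            ],
        },
    }
    baseline_reader = build_reader("CSVDaskDataReader", data_source=data_source)

The returned dictionary has the same shape as the ``baseline_reader`` examples above and can be placed in the configuration under that key.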


Prediction Reader
~~~~~~~~~~~~~~~~~

**If the action type is RUN_PREDICTION, this is a required component and must be defined in the configuration.**

The prediction_reader component ingests raw data into the framework for a prediction run.

Description
^^^^^^^^^^^^^

.. list-table::
    :widths: 100 100 100
    :header-rows: 1

    * - Field Name
      - Description
      - Example
    * - type
      - type of reader to be used
      - "type": "JsonlDaskDataReader"
    * - params
      - reader params (required)

        data_source (optional)
      -
        "params": {
              "file_path": "oci://<path>.csv"
            }

        "data_source": {
            "type": "ObjectStorageFileSearchDataSource",
            "params": {
              "file_path": [
                "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
              ],
              "filter_arg": [
                {
                  "partition_based_date_range": {
                    "start": "2023-06-26",
                    "end": "2023-06-27",
                    "data_format": ".d{4}-d{2}-d{2}."
                  }
                }
              ]
            }
          }

Example using data source for determining the data location
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: json

    "prediction_reader": {
      "type": "CSVDaskDataReader",
      "params": {
        "data_source": {
          "type": "ObjectStorageFileSearchDataSource",
          "params": {
            "file_path": [
              "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
            ],
            "filter_arg": [
              {
                "partition_based_date_range": {
                  "start": "2023-06-26",
                  "end": "2023-06-27",
                  "data_format": ".d{4}-d{2}-d{2}."
                }
              }
            ]
          }
        }
      }
    }

Example without using data_source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: json

    {
        "prediction_reader": {
            "type": "CSVDaskDataReader",
            "params": {
              "file_path": "oci://<path>.csv"
            }
          }
    }

**Supported Readers**

.. collapse:: Supported Readers

    .. code-block:: python

        CSVDaskDataReader
        JsonlDaskDataReader
        NestedJsonDaskDataReader


Use reader parameters to define the location of the files to be read or specify a data source in the reader.


**Data Source**

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

.. collapse:: Supported Data Sources

    .. code-block:: python

        OCIObjectStorageDataSource
        OCIDatePrefixDataSource
        ObjectStorageFileSearchDataSource
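
Because the required reader depends on the action type, it can help to check a configuration for missing components before starting a job run. The sketch below encodes only the requirements stated in this document; the function name ``missing_components`` is hypothetical and the check is not the application's own validation.

.. code-block:: python

    # Components this document marks as always required.
    REQUIRED_ALWAYS = ("monitor_id", "storage_details", "input_schema")

    # Reader required for each action type, as stated in the reader sections.
    READER_FOR_ACTION = {
        "RUN_BASELINE": "baseline_reader",
        "RUN_PREDICTION": "prediction_reader",
    }


    def missing_components(config: dict, action_type: str) -> list:
        """List required top-level keys that are absent from the configuration."""
        required = list(REQUIRED_ALWAYS)
        reader_key = READER_FOR_ACTION.get(action_type)
        if reader_key is not None:
            required.append(reader_key)
        return [key for key in required if key not in config]

For a configuration that defines only ``monitor_id``, ``storage_details``, ``input_schema``, and ``baseline_reader``, the call with ``"RUN_PREDICTION"`` reports ``prediction_reader`` as missing.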


Feature Metrics
~~~~~~~~~~~~~~~~~~

In this section, you add the metrics needed for each feature.

Description
^^^^^^^^^^^^^

.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Key
     - Value
   * - feature_name
     - metric list


**Supported Feature Metrics**

For more details on each metric, see the `metrics reference <https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/mlm_insights.core.metrics.html>`_.

.. collapse:: Supported Feature Metrics

    .. code-block:: python

        # Data quality metrics
        Count
        DistinctCount
        DuplicateCount
        FrequencyDistribution
        Max
        Mean
        Min
        Mode
        ProbabilityDistribution
        Range
        Skewness
        StandardDeviation
        Sum
        IQR
        Kurtosis
        TopKFrequentElements
        TypeMetric
        Variance
        IsPositive
        IsNegative
        IsNonZero
        Percentiles

        # Data Integrity
        IsConstantFeature
        IsQuasiConstantFeature
        Quartiles

        # Drift Metrics
        KullbackLeibler
        KolmogorovSmirnov
        ChiSquare
        JensenShannon
        PopulationStabilityIndex

        # Bias and Fairness
        ClassImbalance

        # Date Time Metrics
        DateTimeMin
        DateTimeMax
        DateTimeDuration

|

Example
^^^^^^^^^^^^^

.. code-block:: json

    "feature_metrics": {
        "sepal length (cm)" : [
        {"type": "Sum"},{"type": "Quartiles"}
        ],
        "sepal width (cm)": [
            {"type": "Min"},{"type": "DistinctCount"}
        ],
        "petal length (cm)": [
            {"type": "Count"},{"type": "Mean"}
        ],
        "petal width (cm)": [
            {"type": "IsQuasiConstantFeature"},{"type": "Kurtosis"}
        ]
    }
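
When many features share the same pattern, the ``feature_metrics`` mapping can be generated rather than written by hand. This is a minimal sketch; the helper name ``build_feature_metrics`` is hypothetical, and it covers only metrics configured by ``type`` alone, as in the example above.

.. code-block:: python

    def build_feature_metrics(metrics_by_feature: dict) -> dict:
        """Expand {feature: [metric names]} into the feature_metrics layout."""
        return {
            feature: [{"type": name} for name in metric_names]
            for feature, metric_names in metrics_by_feature.items()
        }


    feature_metrics = build_feature_metrics(
        {
            "sepal length (cm)": ["Sum", "Quartiles"],
            "sepal width (cm)": ["Min", "DistinctCount"],
        }
    )

The result can be assigned to the ``feature_metrics`` key of the configuration before it is serialized to JSON.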

Dataset Metrics
~~~~~~~~~~~~~~~~

Description
^^^^^^^^^^^^^

The list of metrics to be calculated on the data set.

Example
^^^^^^^^
.. code-block:: json

    "dataset_metrics": [
        {
        "type": "RowCount"
        }
    ]

**Supported Data Set Metrics**

For more details on each metric, see the `metrics reference <https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/mlm_insights.core.metrics.html>`_.

.. collapse:: Supported Data Set Metrics

    .. code-block:: python

        # Data Quality Metrics
        CramersVCorrelation
        PearsonCorrelation
        CorrelationRatio

        # Regression Metrics
        RowCount
        MeanAbsoluteError
        MeanSquaredError
        R2Score
        RootMeanSquaredError
        MeanSquaredLogError
        MeanAbsolutePercentageError
        MaxError

        # Classification metrics
        AccuracyScore
        PrecisionScore
        RecallScore
        FBetaScore
        FalsePositiveRate
        FalseNegativeRate
        Specificity
        ConfusionMatrix
        LogLoss
        ROCCurve
        ROCAreaUnderCurve
        PrecisionRecallCurve
        PrecisionRecallAreaUnderCurve

        # Conflict Metrics
        ConflictPrediction
        ConflictLabel

|

Post Processor
~~~~~~~~~~~~~~~~~

Post processor components are responsible for running any action after the entire data set is processed and all the metrics are calculated.

Description
^^^^^^^^^^^^^
.. list-table::
    :widths: 100 100 100 100
    :header-rows: 1

    * - Field Name
      - Description
      - Example 1
      - Example 2
    * - type
      - type of post processor
      - "type": "SaveMetricOutputAsJsonPostProcessor"
      - "type": "OCIMonitoringApplicationPostProcessor"
    * - params
      - post processor params (required)

      - "params": {
        "file_name": "profile.json",
			"test_results_file_name": "test_result.json",
			"file_location_expression": "bug-bash/mlm/profile-$start_$end.json",
			"date_range": {
			  "start": "2023-08-01",
			  "end": "2023-08-05"
			},
			"can_overwrite_profile_json": false,
			"can_overwrite_test_results_json": false,
			"namespace": "<namespace>",
			"bucket_name": "<bucket_name>"
        }

      - "params": {
			"compartment_id": "<COMPARTMENT_ID>",
			"namespace": "<NAMESPACE>",
			"date_range": {
			  "start": "2023-08-01",
			  "end": "2023-08-05"
			},
			"dimensions": {
			  "key1": "value1",
			  "key2": "value2"
			}
		  }

Example
^^^^^^^^

.. code-block:: json

    "post_processors": [
      {
        "type": "SaveMetricOutputAsJsonPostProcessor",
        "params": {
          "file_name": "profile.json",
          "test_results_file_name": "test_result.json",
          "file_location_expression": "bug-bash/mlm/profile-$start_$end.json",
          "date_range": {
            "start": "2023-08-01",
            "end": "2023-08-05"
          },
          "can_overwrite_profile_json": false,
          "can_overwrite_test_results_json": false,
          "namespace": "<namespace>",
          "bucket_name": "<bucket_name>"
        }
      }
    ]

Supported Post Processor
^^^^^^^^^^^^^^^^^^^^^^^^^

* SaveMetricOutputAsJsonPostProcessor
    Stores the metric output in JSON format in the user-provided Object Storage location.

    Required Parameters
        * **bucket_name** - The name of the OCI Object Storage bucket.
        * **namespace** - The OCI Object Storage namespace.

    Optional Parameters
        * **file_location_expression** - An expression for the object location within the bucket, which is resolved using the date_range argument.
            * If file_location_expression isn't provided and no date_range is provided in the runtime parameters, the application generates the object location as '<location>/MLM/<monitorId>/<action_type>/file_name.json'.
            * If file_location_expression isn't provided and a date_range is provided, the application generates the object location as '<location>/MLM/<monitorId>/<action_type>/$start-$end/'.
        * **file_name** - The file name of the profile object. The default value is 'profile.json'.
        * **can_overwrite_profile_json** - A boolean that controls whether an existing profile file is overwritten. By default, the profile file isn't overwritten.
        * **test_results_file_name** - The file name of the test results object. The default value is 'test_result.json'.
        * **can_overwrite_test_results_json** - A boolean that controls whether an existing test results file is overwritten. By default, the test results file isn't overwritten.
        * **date_range** - A dictionary containing the optional date range which is configured in the file location. It can be overwritten by passing START and END DATE in RUNTIME_PARAMETER.

**For example**

.. code-block:: json

    "post_processors": [
      {
        "type": "SaveMetricOutputAsJsonPostProcessor",
        "params": {
          "file_name": "profile.json",
          "test_results_file_name": "test_result.json",
          "file_location_expression": "/usecase/$start_$end",
          "date_range": {
            "start": "2023-08-01",
            "end": "2023-08-05"
          },
          "can_overwrite_profile_json": true,
          "can_overwrite_test_results_json": false,
          "namespace": "<namespace>",
          "bucket_name": "<bucket_name>"
        }
      }
    ]

In the example above, the profile JSON is stored at ``/usecase/2023-08-01_2023-08-05/profile.json``
and the test results are stored at ``/usecase/2023-08-01_2023-08-05/test_result.json``.
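
The substitution described above can be reproduced with plain string replacement. This is an illustrative sketch of the documented result only, not the application's implementation; the function name ``resolve_location`` is hypothetical.

.. code-block:: python

    def resolve_location(expression: str, start: str, end: str, file_name: str) -> str:
        """Fill the $start and $end placeholders, then append the file name."""
        # Plain replacement is used on purpose: string.Template would treat
        # "$start_" as a single placeholder name, because underscores are
        # valid identifier characters.
        folder = expression.replace("$start", start).replace("$end", end)
        return f"{folder.rstrip('/')}/{file_name}"


    profile_path = resolve_location(
        "/usecase/$start_$end", "2023-08-01", "2023-08-05", "profile.json"
    )

With the values from the example, ``profile_path`` matches the profile location given above.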


* OCIMonitoringApplicationPostProcessor
    Pushes the ML Insights test suite results to the OCI Monitoring service in the user-provided compartment.

    Required Parameters
        * **compartment_id** - The OCID of the compartment to use for metrics.

    Optional Parameters
        * **dimensions** - Additional dimensions for the metrics (default is empty).
        * **namespace** - The namespace for OCI Monitoring (default is 'ml_monitoring').
        * **date_range** - A dictionary containing the optional date range which is configured in the file location. It can be overwritten by passing START and END DATE in RUNTIME_PARAMETER.

**For example**

.. code-block:: json

    "post_processors": [
      {
        "type": "OCIMonitoringApplicationPostProcessor",
        "params": {
          "compartment_id": "<COMPARTMENT_ID>",
          "namespace": "<NAMESPACE>",
          "date_range": {
            "start": "2023-08-01",
            "end": "2023-08-05"
          },
          "dimensions": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      }
    ]

In the example above, the ML Insights test suite results are pushed to the OCI Monitoring service in the compartment identified by compartment_id.
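
The required parameters listed for the two post processors can be checked with a small lookup table. This is a sketch under the requirements stated in this section; the function name ``missing_post_processor_params`` is hypothetical, and unknown post processor types are simply treated as having no required parameters.

.. code-block:: python

    # Required parameters per post processor type, as listed in this section.
    REQUIRED_PARAMS = {
        "SaveMetricOutputAsJsonPostProcessor": {"bucket_name", "namespace"},
        "OCIMonitoringApplicationPostProcessor": {"compartment_id"},
    }


    def missing_post_processor_params(post_processor: dict) -> list:
        """Return the required parameter names absent from one entry."""
        required = REQUIRED_PARAMS.get(post_processor["type"], set())
        provided = set(post_processor.get("params", {}))
        return sorted(required - provided)

Running the helper over each entry of ``post_processors`` gives an early warning before the job run starts.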


Transformer
~~~~~~~~~~~~~

The transformer component provides an easy way to do in-memory transformations on the input data.

The transformers key lists the transformers used to add conditional features or transform the data before the Insights run.

Description
^^^^^^^^^^^^

.. list-table::
    :widths: 100 100 100
    :header-rows: 1

    * - Field Name
      - Description
      - Example
    * - type
      - type of transformer
      - "type": "ConditionalFeatureTransformer"
    * - params
      - conditional_features - List of conditional features
      - "params": {
            "conditional_features": [
                {
                    "feature_name": "Young",
                    "data_type": "integer",
                    "variable_type": "ordinal",
                    "expression": "df.Age < 30"
                }
            ]
        }

Conditional Features
~~~~~~~~~~~~~~~~~~~~~

.. list-table::
    :widths: 100 100 100
    :header-rows: 1

    * - Field Name
      - Value
      - Remarks
    * - expression
      - A Python expression written with pandas Series-level functions. Only pandas Series-level functions and the symbol 'df' are supported in the expression.
      - The expression must return a valid output. For example: "expression": "df.Age < 30"
    * - feature_name
      - <any name that suits your feature>
      -
    * - data_type
      - The data type of the feature.
      -
    * - variable_type
      - The variable type of the feature.
      -

Example
^^^^^^^^
.. code-block:: json

    "transformers": [
    {
      "type": "ConditionalFeatureTransformer",
      "params": {
        "conditional_features": [
          {
            "feature_name": "Young",
            "data_type": "integer",
            "variable_type": "ordinal",
            "expression": "df.Age < 30"
          }
        ]
      }
    }
    ]
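
To preview what an expression such as ``df.Age < 30`` from the Conditional Features table produces, the same comparison can be run on a small pandas DataFrame. This sketch assumes pandas is installed and uses made-up sample ages; the cast to integer mirrors the ``"data_type": "integer"`` setting and is an assumption about the outcome, not a description of how the application evaluates expressions.

.. code-block:: python

    import pandas as pd

    # Made-up sample data for illustration only.
    df = pd.DataFrame({"Age": [25, 41, 29]})

    # Series-level comparison: yields one boolean per row.
    is_young = df.Age < 30

    # Store the result as an integer column named after the feature.
    df["Young"] = is_young.astype(int)

Rows with an age below 30 receive the value 1 in the new ``Young`` column and all other rows receive 0.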

Tags
~~~~~~~

Note: tags are an application-internal concept, not to be confused with OCI resource tags.

* Tags are user-provided key-value pairs.
* Users can provide tags to be associated with a profile. For example, when running a baseline or prediction run, you can store:
    <"tenancy": "tenancy-xyz">

Example
^^^^^^^^

.. code-block:: json

    "tags": {
        "tenancy": "tenancy-xyz"
    }

Test Config
~~~~~~~~~~~~

For detailed documentation, see the `Test Config <test_config.html>`_ section.