mlm_insights.core.metrics.data_quality package

Submodules

mlm_insights.core.metrics.data_quality.cramers_v_correlation module

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CorrelationSummary(cramers_v_correlation: float, p_value: float = nan)

Bases: object

cramers_v_correlation: float
p_value: float = nan
class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This metric computes the Cramer's V correlation matrix and p-value matrix for the user-provided feature inputs.
It is a dataset-level metric which can process categorical data types.
This is an approximate multivariate metric.
Internally, it uses a sketch data structure with a default K value of 1024.
Cramer's V measure of association is used as the correlation metric between n categorical features.
This metric handles NaN values and can be used for feature importance.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
# valid_corresponding_column_values == array([ True, False,  True, False,  True])

Applying valid_corresponding_column_values over column a and column b:
a = a[valid_corresponding_column_values]  # array([1., 8., 9.])
b = b[valid_corresponding_column_values]  # array([5., 7., 10.])

It ranges from 0 to 1 where:
  • 0 indicates no association between the two variables.

  • 1 indicates a perfect association between the two variables.

Cramer's V is computed by taking the square root of the chi-squared statistic divided by the product of the sample size and the minimum table dimension minus one, i.e. sqrt(chi2 / (n * (min(rows, cols) - 1))).
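
For reference, this formula can be sketched with plain pandas and SciPy. This is illustrative only and is not the sketch-based implementation used by the metric; the helper name cramers_v is hypothetical:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series):
    # Observed contingency table of the two categorical columns
    contingency = pd.crosstab(a, b)
    chi2, p_value, _, _ = chi2_contingency(contingency)
    n = contingency.to_numpy().sum()          # sample size
    min_dim = min(contingency.shape) - 1      # minimum dimension minus 1
    return float(np.sqrt(chi2 / (n * min_dim))), float(p_value)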

Configuration

lg_max_k: int, default=10
  • Maximum size, in log2, of k. The value must be between 7 and 21, inclusive

ignore_invalid_data_types: bool, default=True
  • Flag for ignoring invalid data types

  • If set to True, non-categorical features are ignored; otherwise the metric throws an error. For example, Cramer's V only handles categorical data types, so all non-categorical data types are dropped

feature_list: List[str]
  • List of feature names for which pairwise correlations are computed; the number of features supported is between 2 and 50, inclusive (see the configuration sketch below)
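
A hypothetical configuration sketch for the parameters above, assuming the configuration keys match the parameter names and accept plain values in the same way FEATURE_LIST does in the Examples section further below:

from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

# Assumed keys: 'lg_max_k' and 'ignore_invalid_data_types' mirror the parameter
# names documented above; they are not verified constants of the library.
metric_metadata = MetricMetadata(
    klass=CramersVCorrelation,
    config={
        FEATURE_LIST: ['transport', 'gender'],  # between 2 and 50 categorical features
        'lg_max_k': 12,                         # sketch size of 2**12 = 4096
        'ignore_invalid_data_types': True       # drop non-categorical features
    })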

Returns

feature_list: List[str]
  • list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix

p_values: numpy.typing.NDArray[np.float64]
  • The p-value is the probability of observing an association at least as strong as the one in the sample data when the null hypothesis (no association) is true. A low p-value leads you to reject the null hypothesis; a typical threshold for rejection is a p-value of 0.05 (see the short example below)
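
For instance, the returned p-value matrix can be thresholded directly; the values here are taken from the example output shown later on this page, and the 0.05 cut-off is the typical one mentioned above:

import numpy as np

# p-value matrix as returned under 'p_values'
p_value_matrix = np.array([[0.00815097, 0.40465279],
                           [0.40465279, 0.01265042]])
significant = p_value_matrix < 0.05  # boolean mask of feature pairs with a significant association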

Limitations

Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 categorical features is supported for computation

Exceptions

  • InvalidParameterException - raised when a column name is not present in the provided dataset

  • MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is breached

  • ValueError - raised when the compared columns have no corresponding data to compare (all values are NaN)

  • TypeError - raised when feature_list is not passed as a list

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation
import pandas as pd

def main():
    input_schema = {
        'transport': FeatureType(data_type=DataType.STRING,
                                 variable_type=VariableType.NOMINAL,
                                 column_type=ColumnType.INPUT),
        'gender': FeatureType(data_type=DataType.STRING,
                              variable_type=VariableType.NOMINAL,
                              column_type=ColumnType.INPUT)
    }

    data_frame = pd.DataFrame({'transport': ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'],
                               'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F']})
    feature1: str = 'transport'
    feature2: str = 'gender'
    correlation_metrics = [
        MetricMetadata(klass=CramersVCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    cramers_actual_value = dataset_metrics.get_result()['value']
    cramers_correlation_matrix = cramers_actual_value['matrix']
    p_value_matrix = cramers_actual_value['p_values']

    feature_map = {value: index for index, value in enumerate(cramers_actual_value[FEATURE_LIST])}

    cramers_v_value_for_feature1_feature2 = round(
        cramers_correlation_matrix[feature_map[feature1]][feature_map[feature2]], 4)
    p_value_for_feature1_feature2 = round(
        p_value_matrix[feature_map[feature1]][feature_map[feature2]], 4)

    Returns the metric result as:
        {
            'value': {
                'matrix': array([[1.        , 0.64549722],
                                 [0.64549722, 1.        ]]),
                'p_values': array([[0.00815097, 0.40465279],
                                   [0.40465279, 0.01265042]]),
                'feature_list': ['transport', 'gender']
            }
        }
compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the CramersVCorrelation using dataset

Parameters

dataset : pd.DataFrame - DataFrame object for either the entire dataset or a partition on which the Metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CramersVCorrelation

Create a CramersVCorrelation data quality metric using the configuration and kwargs

Parameters

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains the list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CramersVCorrelation

Create a new instance from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

Serializable

New instance of Serializable

feature_list: List[str]
feature_pair_mapping: Dict[str, CramersVCorrelationState]
get_result(**kwargs: Any) Dict[str, Any]

Returns CramersVCorrelation data quality metric

Returns

Json object: CramersVCorrelation of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns CramersVCorrelation Metric and P_values in Standard format.

Returns

StandardMetricResult: CramersVCorrelation Metric and P_values in standard format.

merge(other: CramersVCorrelation, **kwargs: Any) CramersVCorrelation

Merge two CramersVCorrelation instances into one, without mutating either. Updates the sketch with the new partition's pair values from column1 and column2.

Parameters

other : CramersVCorrelation

Other CramersVCorrelation that needs to be merged.

Returns

CramersVCorrelation

A new instance of CramersVCorrelation

serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState(sketch: _datasketches.frequent_strings_sketch, total_count: int = 0, feature1: str = '', feature2: str = '')

Bases: object

feature1: str = ''
feature2: str = ''
sketch: frequent_strings_sketch
total_count: int = 0

mlm_insights.core.metrics.data_quality.pearson_correlation module

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This metric computes the Pearson's Correlation Coefficient matrix for the user-provided feature inputs.
Pearson's correlation coefficient takes values between -1 and 1.
It is a dataset-level metric which can process numeric data types.
This is an exact multivariate metric.
This metric handles NaN values.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
# valid_corresponding_column_values == array([ True, False,  True, False,  True])

Applying valid_corresponding_column_values over column a and column b:
a = a[valid_corresponding_column_values]  # array([1., 8., 9.])
b = b[valid_corresponding_column_values]  # array([5., 7., 10.])

This metric can be used for feature importance. It ranges from -1 to 1 where:

  • -1 indicates a perfect negative linear relationship between variables

  • 0 indicates no linear relationship between variables

  • 1 indicates a perfect positive linear relationship between variables

Pearson's correlation is computed as the covariance of the two variables divided by the product of their standard deviations.
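
For reference, a minimal sketch of this definition using NumPy. This is illustrative only, not the metric's partition-wise implementation, and pearson_r is a hypothetical helper:

import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    # Sample covariance of x and y divided by the product of their standard deviations
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return float(cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1)))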

Configuration

ignore_invalid_data_types: bool, default=True
  • Flag for ignoring invalid data types

  • If set to True, non-numeric features are ignored; otherwise the metric throws an error. For example, Pearson only handles numerical data types, so all non-numerical data types are dropped

feature_list: List[str]
  • List of feature names for which pairwise correlations are computed; the number of features supported is between 2 and 50, inclusive

Returns

feature_list: List[str]
  • list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix

Limitations

Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 numerical features is supported for computation

Exceptions

  • InvalidParameterException - raised when a column name is not present in the provided dataset

  • MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is breached

  • TypeError - raised when feature_list is not passed as a list

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.pearson_correlation import PearsonCorrelation
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(data_type=DataType.INTEGER,
                                   variable_type=VariableType.CONTINUOUS),
        'house_price': FeatureType(data_type=DataType.INTEGER,
                                   variable_type=VariableType.CONTINUOUS),
    }

    data_frame = pd.DataFrame({'house_price': [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7],
                               'square_feet': [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]})
    feature1: str = 'house_price'
    feature2: str = 'square_feet'
    correlation_metrics = [
        MetricMetadata(klass=PearsonCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    Returns the metric result as:
        {
            'value': {
                'matrix': array([[1.        , 0.64549722],
                                 [0.64549722, 1.        ]]),
                'feature_list': ['house_price', 'square_feet']
            }
        }
compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the PearsonCorrelation metric using dataset

Parameters

dataset : pd.DataFrame - DataFrame object for either the entire dataset or a partition on which the Metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) PearsonCorrelation

Create a PearsonCorrelation data quality metric using the configuration and kwargs

Parameters

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains the list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) PearsonCorrelation

Create a new instance from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

Serializable

New instance of Serializable

feature_list: List[str]
feature_pair_mapping: Dict[str, PearsonCorrelationState]
get_required_shareable_feature_components(**kwargs: Any) Dict[str, List[SFCMetaData]]

Returns the Shareable Feature Components for 2 input features

get_result(**kwargs: Any) Dict[str, Any]

Returns Pearson's Correlation 2-D matrix for the set of features, using the DescriptiveStatisticsSFC

Returns

Json object: Pearson’s Correlation 2-D matrix for n features

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Pearson’s Correlation Metric and P_values in Standard format.

Returns

StandardMetricResult: Pearson’s Correlation Metric and P_values in standard format.

merge(other: PearsonCorrelation, **kwargs: Any) PearsonCorrelation

Merge two PearsonCorrelation instances into one, without mutating either:

  1. Calculate cumulative_col12_count
  2. Calculate the combined mean for feature columns column1 and column2
  3. Calculate the numerator of the covariance of column1 and column2
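
A minimal sketch of this combine step for a single feature pair, assuming the usual parallel co-moment formula; the field names only mirror PearsonCorrelationState, and this is not the library's exact implementation:

from dataclasses import dataclass

@dataclass
class PairState:
    count: int         # cumulative partition count
    mean_x: float      # column1 mean
    mean_y: float      # column2 mean
    co_moment: float   # numerator of the covariance of column1 and column2

def combine(a: PairState, b: PairState) -> PairState:
    n = a.count + b.count                                    # 1. cumulative count
    mean_x = (a.count * a.mean_x + b.count * b.mean_x) / n   # 2. combined means
    mean_y = (a.count * a.mean_y + b.count * b.mean_y) / n
    # 3. combined covariance numerator (parallel co-moment formula)
    co_moment = (a.co_moment + b.co_moment
                 + (a.mean_x - b.mean_x) * (a.mean_y - b.mean_y) * a.count * b.count / n)
    return PairState(n, mean_x, mean_y, co_moment)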

Parameters

other : PearsonCorrelation

Other PearsonCorrelation that needs to be merged.

Returns

PearsonCorrelation

A new instance of PearsonCorrelation

serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState(cumulative_partition_count: int = 0, column1_mean: float = nan, column2_mean: float = nan, covariance_col1_col2: float = nan, feature1: str = '', feature2: str = '')

Bases: object

column1_mean: float = nan
column2_mean: float = nan
covariance_col1_col2: float = nan
cumulative_partition_count: int = 0
feature1: str = ''
feature2: str = ''

mlm_insights.core.metrics.data_quality.correlation_ratio module

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState] = <factory>, categorical_features: ~typing.List[str] = <factory>, numerical_features: ~typing.List[str] = <factory>)

Bases: DatasetMetricBase, Serializable

This dataset-level metric computes the correlation matrix for user-provided categorical and numerical features.
This is an approximate multivariate metric.
The Correlation Ratio is used as the correlation metric between n categorical and m numerical features.
This metric handles NaN values.

NaN handling Example

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])
valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
# valid_corresponding_column_values == array([ True, False,  True, False,  True])

Applying valid_corresponding_column_values over column a and column b:
a = a[valid_corresponding_column_values]  # array([1., 8., 9.])
b = b[valid_corresponding_column_values]  # array([5., 7., 10.])

It ranges from 0 to 1 where:
  • 0 indicates no dispersion among the means of the different categories

  • 1 indicates that all dispersion is across the category means, with no dispersion within the individual categories

  • NaN when all data points of the complete population take the same value

Correlation ratio (η) is a measure of the relationship between statistical dispersion within individual categories and dispersion across the whole population or sample.
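
As a reference, the definition can be sketched directly with pandas and NumPy. This is illustrative only, not the metric's SFC-based implementation, and correlation_ratio is a hypothetical helper:

import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    overall_mean = values.mean()
    # Dispersion of the category means around the overall mean, weighted by category size
    between = sum(len(group) * (group.mean() - overall_mean) ** 2
                  for _, group in values.groupby(categories))
    # Total dispersion of all values around the overall mean
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total else float('nan')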

Configuration

feature_list: List[str]
  • List of feature names for which pairwise correlations are computed; the number of features supported is between 2 and 50, inclusive

Returns

matrix: numpy.typing.NDArray[np.float64]
  • correlation matrix

categorical_features: List[str]
  • list of user provided categorical feature inputs

numerical_features: List[str]
  • list of user provided numerical feature inputs

Limitations

Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 features, including both categorical and numerical features, is supported

Exceptions

  • InvalidParameterException - raised when a column name is not present in the provided dataset

  • MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is breached, or when at least 1 numerical and 1 categorical feature column name are not provided

  • ValueError - raised when the compared columns have no corresponding data to compare (all values are NaN)

  • TypeError - raised when feature_list is not passed as a list

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST, CATEGORICAL_FEATURES, NUMERICAL_FEATURES
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.data_quality.correlation_ratio import CorrelationRatio
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        "Pclass": FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
        "age": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
    }

    data_frame = pd.DataFrame({'Pclass': [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
                               'age': [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]})
    feature1: str = 'Pclass'
    feature2: str = 'age'
    correlation_metrics = [
        MetricMetadata(klass=CorrelationRatio, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)
    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()
    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    sfc_registry = {}
    for feature in profile.features.values():
        sfc_registry[feature.get_name()] = feature.sfc_registry

    correlation_ratio_actual_value = dataset_metrics.get_result(sfc_registry=sfc_registry)['value']
    correlation_matrix = correlation_ratio_actual_value['matrix']
    assert correlation_matrix is not None

    categorical_feature_map = {value: index for index, value in
                               enumerate(correlation_ratio_actual_value[CATEGORICAL_FEATURES])}
    numerical_feature_map = {value: index for index, value in
                             enumerate(correlation_ratio_actual_value[NUMERICAL_FEATURES])}

    correlation_ratio_value = round(
        correlation_ratio_actual_value['matrix'][categorical_feature_map[feature1]][
            numerical_feature_map[feature2]], 4)

Returns the metric result as:
    {
        'value': {
            'matrix': array([[0.50199]]),
            'categorical_features': ['Pclass'],
            'numerical_features': ['age']
        }
    }
categorical_features: List[str]
compute(dataset: DataFrame, **kwargs: Any) None

Update the state of the CorrelationRatio using dataset

Parameters

dataset : pd.DataFrame - DataFrame object for either the entire dataset or a partition on which the Metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CorrelationRatio

Create a CorrelationRatio data quality metric using the configuration and kwargs

Parameters

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features: Contains the list of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CorrelationRatio

Create a new instance from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

Serializable

New instance of Serializable

feature_pair_mapping: Dict[str, CorrelationRatioState]
get_required_shareable_feature_components(**kwargs: Any) Dict[str, List[SFCMetaData]]

Returns the Shareable Feature Components that a Metric requires to compute its state and values. Metrics which do not require SFCs need not override this property.

Returns

Dict with feature_name as key and a List of SFCMetadata as value. Each SFCMetadata must contain the klass attribute, which points to the SFC class.

get_result(**kwargs: Any) Dict[str, Any]

Returns CorrelationRatio data quality metric

Returns

Json object: CorrelationRatio of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns CorrelationRatio Metric in Standard format.

Returns

StandardMetricResult: CorrelationRatio Metric in standard format.

merge(other: CorrelationRatio, **kwargs: Any) CorrelationRatio

Merge two CorrelationRatio into one, without mutating the others.

Parameters

other : CorrelationRatio

Other CorrelationRatio that needs to be merged.

Returns

CorrelationRatio

A new instance of CorrelationRatio

numerical_features: List[str]
serialize(**kwargs: Any) bytes

Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails(total_sum: float = 0.0, total_count: int = 0)

Bases: object

total_count: int = 0
total_sum: float = 0.0
class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState(category_details: Dict[Union[int, str, float], mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails] = <factory>, categorical_feature: str = '', numerical_feature: str = '')

Bases: object

categorical_feature: str = ''
category_details: Dict[int | str | float, CorrelationRatioDetails]
numerical_feature: str = ''

Module contents