mlm_insights.core.metrics.data_quality package¶
Submodules¶
mlm_insights.core.metrics.data_quality.cramers_v_correlation module¶
- class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CorrelationSummary(cramers_v_correlation: float, p_value: float = nan)¶
Bases:
object
- cramers_v_correlation: float¶
- p_value: float = nan¶
- class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)¶
Bases:
DatasetMetricBase, Serializable
This metric computes the Cramer's V correlation matrix and p-value matrix for the user-provided feature inputs. It is a dataset-level metric which can process categorical data types. This is an approximate multivariate metric; internally, it uses a sketch data structure with a default K value of 1024. Cramer's V measure of association is used as the correlation metric between n categorical features. This metric handles NaN values and is used for feature importance.
NaN handling example:
    a = [1, 2, 8, np.nan, 9]
    b = [5, np.nan, 7, np.nan, 10]
    valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
    # valid_corresponding_column_values = [True, False, True, False, True]
    # Applying valid_corresponding_column_values over column_a and column_b:
    a = a[valid_corresponding_column_values]
    b = b[valid_corresponding_column_values]
    # a = [1, 8, 9]
    # b = [5, 7, 10]
- It ranges from 0 to 1 where:
0 indicates no association between the two variables.
1 indicates a perfect association between the two variables.
Cramer's V is computed as the square root of the chi-squared statistic divided by the product of the sample size and the minimum dimension minus one: V = sqrt(χ² / (n × (min(r, c) − 1))), where r and c are the number of categories of the two features.
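For illustration, the same quantity can be computed exactly for two categorical columns with pandas and SciPy. This is a sketch of the formula only, not the metric's sketch-based implementation, and the helper name cramers_v is hypothetical:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(a: pd.Series, b: pd.Series) -> float:
        # Contingency table of the two categorical columns
        table = pd.crosstab(a, b)
        chi2, p_value, _, _ = chi2_contingency(table)
        n = table.to_numpy().sum()          # sample size
        min_dim = min(table.shape) - 1      # minimum dimension minus 1
        return float(np.sqrt(chi2 / (n * min_dim)))

    transport = pd.Series(['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'])
    gender = pd.Series(['M', 'M', 'F', 'F', 'M', 'M', 'F'])
    print(round(cramers_v(transport, gender), 4))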
Configuration¶
- lg_max_k: int, default=10
Maximum size, in log2, of k. The value must be between 7 and 21, inclusive
- ignore_invalid_data_types: bool, default=True
Flag for ignoring invalid data types
If set to True, non-categorical features are ignored; otherwise, the metric throws an error. For example, Cramer's V only deals with categorical data types, so all non-categorical data types are dropped.
- feature_list: List[str]
List of feature names for which the correlation is computed between each provided feature pair; the number of features supported is between 2 and 50, inclusive.
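For illustration, a minimal configuration sketch combining the options above. The string keys here are assumed to mirror the documented parameter names; in practice the library's constants (such as FEATURE_LIST from mlm_insights.constants.definitions, used in the Examples section) can be used for the feature list key:

    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation

    # Assumed string keys mirroring the documented parameter names
    cramers_metric = MetricMetadata(
        klass=CramersVCorrelation,
        config={
            'feature_list': ['transport', 'gender'],   # 2 to 50 categorical features
            'lg_max_k': 10,                            # sketch size in log2, between 7 and 21
            'ignore_invalid_data_types': True          # drop non-categorical features instead of raising
        }
    )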
Returns¶
- feature_list: List[str]
list of user provided feature inputs
- matrix: numpy.typing.NDArray[np.float64]
correlation matrix
- p_values: numpy.typing.NDArray[np.float64]
The p-value is the probability of observing a correlation at least as strong as the one in the sample data when the null hypothesis (no association) is in fact true. A low p-value leads you to reject the null hypothesis; a typical threshold for rejection is a p-value of 0.05.
Limitations¶
Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 categorical features is supported for computation.
Exceptions¶
InvalidParameterException - raised when a column name is not present in the provided dataset
MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is exceeded
ValueError - raised when the compared columns have no corresponding data to compare (all values are NaN)
TypeError - raised when feature_list is not passed as a list
Examples
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.definitions import FEATURE_LIST
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation
    import pandas as pd

    def main():
        input_schema = {
            'transport': FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL, column_type=ColumnType.INPUT),
            'gender': FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL, column_type=ColumnType.INPUT)
        }
        data_frame = pd.DataFrame({'transport': ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'],
                                   'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F']})
        feature1: str = 'transport'
        feature2: str = 'gender'
        correlation_metrics = [
            MetricMetadata(klass=CramersVCorrelation, config={FEATURE_LIST: [feature1, feature2]})
        ]
        metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)
        runner = InsightsBuilder(). \
            with_input_schema(input_schema). \
            with_data_frame(data_frame=data_frame). \
            with_metrics(metrics=metric_details). \
            with_engine(engine=EngineDetail(engine_name="native")). \
            build()
        run_result = runner.run()
        profile = run_result.profile
        dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
        assert dataset_metrics is not None
        cramers_actual_value = dataset_metrics.get_result()['value']
        cramers_correlation_matrix = cramers_actual_value['matrix']
        p_value_matrix = cramers_actual_value['p_values']
        feature_map = {value: index for index, value in enumerate(cramers_actual_value[FEATURE_LIST])}
        cramers_v_value_for_feature1_feature2 = round(
            cramers_correlation_matrix[feature_map[feature1]][feature_map[feature2]], 4)
        p_value_for_feature1_feature2 = round(
            p_value_matrix[feature_map[feature1]][feature_map[feature2]], 4)

Returns the metric result as:

    {
        'value': {
            'matrix': array([[1.        , 0.64549722],
                             [0.64549722, 1.        ]]),
            'p_values': array([[0.00815097, 0.40465279],
                               [0.40465279, 0.01265042]]),
            'feature_list': ['transport', 'gender']
        }
    }
- compute(dataset: DataFrame, **kwargs: Any) None ¶
Update the state of the CramersVCorrelation using dataset
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CramersVCorrelation ¶
Create a CramersVCorrelation data quality metric using the configuration and kwargs
Parameters¶
config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
features: Contains list of input feature column names
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CramersVCorrelation ¶
Create a new instance from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- Serializable
New instance of Serializable
- feature_list: List[str]¶
- feature_pair_mapping: Dict[str, CramersVCorrelationState]¶
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns CramersVCorrelation data quality metric
Returns¶
Json object: CramersVCorrelation of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns CramersVCorrelation Metric and P_values in Standard format.
Returns¶
StandardMetricResult: CramersVCorrelation Metric and P_values in standard format.
- merge(other: CramersVCorrelation, **kwargs: Any) CramersVCorrelation ¶
Merge two CramersVCorrelation metrics into one, without mutating the other. The sketch is updated with the new partition's pair values from column1 and column2 (an illustrative merge sketch appears after the Returns block below).
Parameters¶
- other : CramersVCorrelation
The other CramersVCorrelation that needs to be merged.
Returns¶
- CramersVCorrelation
A new instance of CramersVCorrelation
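For intuition about the merge step, per-partition frequent-strings sketches can be combined with the Apache DataSketches library. This is an illustrative sketch only; the pipe-joined encoding of the two feature values is hypothetical, and this is not the metric's actual internal code:

    from datasketches import frequent_items_error_type, frequent_strings_sketch

    lg_max_k = 10  # matches the lg_max_k configuration documented above

    # Sketch of (column1, column2) value pairs seen in one partition
    partition_1 = frequent_strings_sketch(lg_max_k)
    for pair in ['bus|M', 'bus|M', 'train|F']:
        partition_1.update(pair)

    # Sketch of the pairs seen in another partition
    partition_2 = frequent_strings_sketch(lg_max_k)
    for pair in ['walk|F', 'walk|M', 'car|M', 'car|F']:
        partition_2.update(pair)

    # merge folds partition_2's counts into partition_1 without rescanning the data
    partition_1.merge(partition_2)
    for item, estimate, lower, upper in partition_1.get_frequent_items(
            frequent_items_error_type.NO_FALSE_POSITIVES):
        print(item, estimate)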
- class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState(sketch: _datasketches.frequent_strings_sketch, total_count: int = 0, feature1: str = '', feature2: str = '')¶
Bases:
object
- feature1: str = ''¶
- feature2: str = ''¶
- sketch: frequent_strings_sketch¶
- total_count: int = 0¶
mlm_insights.core.metrics.data_quality.pearson_correlation module¶
- class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState] = <factory>, feature_list: ~typing.List[str] = <factory>)¶
Bases:
DatasetMetricBase, Serializable
This metric computes the Pearson's correlation coefficient matrix for the user-provided feature inputs. Pearson's correlation coefficient has a value between -1 and 1. It is a dataset-level metric which can process numeric data types. This is an exact multivariate metric. This metric handles NaN values.
NaN handling example:
    a = [1, 2, 8, np.nan, 9]
    b = [5, np.nan, 7, np.nan, 10]
    valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
    # valid_corresponding_column_values = [True, False, True, False, True]
    # Applying valid_corresponding_column_values over column_a and column_b:
    a = a[valid_corresponding_column_values]
    b = b[valid_corresponding_column_values]
    # a = [1, 8, 9]
    # b = [5, 7, 10]
Used for feature importance. It ranges from -1 to 1 where:
-1 indicates a perfect negative linear relationship between variables
0 indicates no linear relationship between variables
1 indicates a perfect positive linear relationship between variables
Pearson's correlation is computed from the covariance of the two variables divided by the product of their standard deviations: r = cov(X, Y) / (σ_X × σ_Y).
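As a worked illustration of the formula (not the metric's partitioned, mergeable implementation), the coefficient can be computed directly with NumPy; the helper name pearson_r is hypothetical:

    import numpy as np

    def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
        # Covariance of x and y divided by the product of their standard deviations
        x_centered = x - x.mean()
        y_centered = y - y.mean()
        covariance = (x_centered * y_centered).mean()
        return float(covariance / (x.std() * y.std()))

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    print(pearson_r(x, y))  # ≈ 1.0, since y is a perfect linear function of x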
Configuration¶
- ignore_invalid_data_types: bool, default=True
Flag for ignoring invalid data types
If set to True, non-numeric features are ignored; otherwise, the metric throws an error. For example, Pearson's correlation only deals with numerical data types, so all non-numerical data types are dropped.
- feature_list: List[str]
List of feature names for which the correlation is computed between each provided feature pair; the number of features supported is between 2 and 50, inclusive.
Returns¶
- feature_list: List[str]
list of user provided feature inputs
- matrix: numpy.typing.NDArray[np.float64]
correlation matrix
Limitations¶
Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 numerical features is supported for computation.
Exceptions¶
InvalidParameterException - raised when a column name is not present in the provided dataset
MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is exceeded
TypeError - raised when feature_list is not passed as a list
Examples
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.definitions import FEATURE_LIST
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    from mlm_insights.core.metrics.data_quality.pearson_correlation import PearsonCorrelation
    import pandas as pd

    def main():
        input_schema = {
            'square_feet': FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS),
            'house_price': FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS),
        }
        data_frame = pd.DataFrame({'house_price': [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7],
                                   'square_feet': [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]})
        feature1: str = 'house_price'
        feature2: str = 'square_feet'
        correlation_metrics = [
            MetricMetadata(klass=PearsonCorrelation, config={FEATURE_LIST: [feature1, feature2]})
        ]
        metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)
        runner = InsightsBuilder(). \
            with_input_schema(input_schema). \
            with_data_frame(data_frame=data_frame). \
            with_metrics(metrics=metric_details). \
            with_engine(engine=EngineDetail(engine_name="native")). \
            build()
        run_result = runner.run()
        profile = run_result.profile
        dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
        assert dataset_metrics is not None

Returns the metric result as:

    {
        'value': {
            'matrix': array([[1.        , 0.64549722],
                             [0.64549722, 1.        ]]),
            'feature_list': ['house_price', 'square_feet']
        }
    }
- compute(dataset: DataFrame, **kwargs: Any) None ¶
Update the state of the PearsonCorrelation metric using dataset
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) PearsonCorrelation ¶
Create a PearsonCorrelation data quality metric using the configuration and kwargs
Parameters¶
config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
features: Contains list of input feature column names
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) PearsonCorrelation ¶
Create a new instance from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- Serializable
New instance of Serializable
- feature_list: List[str]¶
- feature_pair_mapping: Dict[str, PearsonCorrelationState]¶
Returns the Shareable Feature Components for 2 input features
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns Pearson’s Correlation 2-D matrix for set of features, using the DescriptiveStatisticsSFC
Returns¶
Json object: Pearson’s Correlation 2-D matrix for n features
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns Pearson’s Correlation Metric and P_values in Standard format.
Returns¶
StandardMetricResult: Pearson’s Correlation Metric and P_values in standard format.
- merge(other: PearsonCorrelation, **kwargs: Any) PearsonCorrelation ¶
Merge two PearsonCorrelation metrics into one, without mutating the other:
1. Calculate cumulative_col12_count
2. Calculate the combined mean for feature column1 and column2
3. Calculate the numerator of the covariance of column1 and column2
An illustrative sketch of this pairwise combination appears after the Returns block below.
Parameters¶
- other : PearsonCorrelation
The other PearsonCorrelation that needs to be merged.
Returns¶
- PearsonCorrelation
A new instance of PearsonCorrelation
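For intuition, the three merge steps above can be sketched as the standard streaming combination of per-partition counts, means, and covariance numerators. The field names here are hypothetical, and this is not the library's actual implementation:

    from dataclasses import dataclass

    @dataclass
    class PartitionState:
        # Hypothetical per-partition state for one feature pair: row count,
        # per-column means, and the co-moment sum((x - mean_x) * (y - mean_y))
        count: int
        mean_x: float
        mean_y: float
        co_moment: float

    def merge_states(a: PartitionState, b: PartitionState) -> PartitionState:
        # 1. Cumulative count over both partitions
        count = a.count + b.count
        # 2. Combined means, weighted by each partition's count
        mean_x = (a.count * a.mean_x + b.count * b.mean_x) / count
        mean_y = (a.count * a.mean_y + b.count * b.mean_y) / count
        # 3. Combined covariance numerator (pairwise co-moment update)
        delta_x = b.mean_x - a.mean_x
        delta_y = b.mean_y - a.mean_y
        co_moment = a.co_moment + b.co_moment + delta_x * delta_y * a.count * b.count / count
        return PartitionState(count, mean_x, mean_y, co_moment)

Dividing the merged co_moment by the merged count and by the product of the two columns' standard deviations then yields the same Pearson coefficient as a single pass over the full dataset.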
- class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState(cumulative_partition_count: int = 0, column1_mean: float = nan, column2_mean: float = nan, covariance_col1_col2: float = nan, feature1: str = '', feature2: str = '')¶
Bases:
object
- column1_mean: float = nan¶
- column2_mean: float = nan¶
- covariance_col1_col2: float = nan¶
- cumulative_partition_count: int = 0¶
- feature1: str = ''¶
- feature2: str = ''¶
mlm_insights.core.metrics.data_quality.correlation_ratio module¶
- class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, feature_pair_mapping: ~typing.Dict[str, ~mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState] = <factory>, categorical_features: ~typing.List[str] = <factory>, numerical_features: ~typing.List[str] = <factory>)¶
Bases:
DatasetMetricBase, Serializable
This dataset-level metric computes the correlation matrix for user-provided categorical and numerical features. This is an approximate multivariate metric. The correlation ratio is used as the correlation metric between n categorical and m numerical features. This metric handles NaN values.
NaN handling example:
    a = [1, 2, 8, np.nan, 9]
    b = [5, np.nan, 7, np.nan, 10]
    valid_corresponding_column_values = pd.core.nanops.notna(a) & pd.core.nanops.notna(b)
    # valid_corresponding_column_values = [True, False, True, False, True]
    # Applying valid_corresponding_column_values over column_a and column_b:
    a = a[valid_corresponding_column_values]
    b = b[valid_corresponding_column_values]
    # a = [1, 8, 9]
    # b = [5, 7, 10]
- It ranges from 0 to 1 where:
0 indicates no dispersion among the means of the different categories
1 indicates that all dispersion is between categories, with no dispersion within the respective categories
NaN when all data points of the complete population take the same value
Correlation ratio (η) is a measure of the relationship between statistical dispersion within individual categories and dispersion across the whole population or sample.
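As a worked illustration of the definition, η² = Σ_c n_c (ȳ_c − ȳ)² / Σ_i (y_i − ȳ)², which can be computed from per-category counts and sums (mirroring the total_sum and total_count fields of CorrelationRatioState below). The helper name correlation_ratio is hypothetical, and this is not the metric's partitioned implementation:

    import numpy as np
    import pandas as pd

    def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
        overall_mean = values.mean()
        # Weighted dispersion of the category means around the overall mean
        between = sum(
            len(group) * (group.mean() - overall_mean) ** 2
            for _, group in values.groupby(categories)
        )
        # Total dispersion of all values around the overall mean
        total = ((values - overall_mean) ** 2).sum()
        return float(np.sqrt(between / total)) if total > 0 else float('nan')

    pclass = pd.Series([3, 3, 2, 3, 3, 3, 3, 2, 3, 3])
    age = pd.Series([34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21])
    print(round(correlation_ratio(pclass, age), 4))  # ≈ 0.502, matching the example result below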
Configuration¶
- feature_list: List[str]
List of feature names for which the correlation is computed between each provided feature pair; the number of features supported is between 2 and 50, inclusive.
Returns¶
- matrix: numpy.typing.NDArray[np.float64]
correlation matrix
- categorical_features: List[str]
list of user provided categorical feature inputs
- numerical_features: List[str]
list of user provided numerical feature inputs
Limitations¶
Currently, a maximum of MAX_FEATURE_THRESHOLD_DEFAULT = 50 features is supported, including both categorical and numerical features.
Exceptions¶
InvalidParameterException - raised when a column name is not present in the provided dataset
MissingRequiredParameterException - raised when MAX_FEATURE_THRESHOLD_DEFAULT is exceeded, or when at least one numerical and one categorical feature column name are not provided
ValueError - raised when the compared columns have no corresponding data to compare (all values are NaN)
TypeError - raised when feature_list is not passed as a list
Examples
    import pandas as pd
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.definitions import FEATURE_LIST, CATEGORICAL_FEATURES, NUMERICAL_FEATURES
    from mlm_insights.constants.types import FeatureType, DataType, VariableType
    from mlm_insights.core.metrics.data_quality.correlation_ratio import CorrelationRatio
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata

    def main():
        input_schema = {
            "Pclass": FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
            "age": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
        }
        data_frame = pd.DataFrame({'Pclass': [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
                                   'age': [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]})
        feature1: str = 'Pclass'
        feature2: str = 'age'
        correlation_metrics = [
            MetricMetadata(klass=CorrelationRatio, config={FEATURE_LIST: [feature1, feature2]})
        ]
        metric_details = MetricDetail(univariate_metric={}, dataset_metrics=correlation_metrics)
        runner = InsightsBuilder(). \
            with_input_schema(input_schema). \
            with_data_frame(data_frame=data_frame). \
            with_metrics(metrics=metric_details). \
            with_engine(engine=EngineDetail(engine_name="native")). \
            build()
        run_result = runner.run()
        profile = run_result.profile
        dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
        assert dataset_metrics is not None
        sfc_registry = {}
        for feature in profile.features.values():
            sfc_registry[feature.get_name()] = feature.sfc_registry
        correlation_ratio_actual_value = dataset_metrics.get_result(sfc_registry=sfc_registry)['value']
        correlation_matrix = correlation_ratio_actual_value['matrix']
        assert correlation_matrix is not None
        categorical_feature_map = {value: index for index, value in
                                   enumerate(correlation_ratio_actual_value[CATEGORICAL_FEATURES])}
        numerical_feature_map = {value: index for index, value in
                                 enumerate(correlation_ratio_actual_value[NUMERICAL_FEATURES])}
        correlation_ratio_value = round(
            correlation_ratio_actual_value['matrix'][categorical_feature_map[feature1]][
                numerical_feature_map[feature2]], 4)

Returns the metric result as:

    {
        'value': {
            'matrix': array([[0.50199]]),
            'categorical_features': ['Pclass'],
            'numerical_features': ['age']
        }
    }
- categorical_features: List[str]¶
- compute(dataset: DataFrame, **kwargs: Any) None ¶
Update the state of the CorrelationRatio using dataset
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CorrelationRatio ¶
Create a CorrelationRatio data quality metric using the configuration and kwargs
Parameters¶
config : Metric configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
features: Contains list of input feature column names
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CorrelationRatio ¶
Create a new instance from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- Serializable
New instance of Serializable
- feature_pair_mapping: Dict[str, CorrelationRatioState]¶
Returns the Shareable Feature Components that a Metric requires to compute its state and values. Metrics which do not require SFC need not override this property.
Returns¶
Dict with feature_name as key and a List of SFCMetadata as value. Each SFCMetadata must contain the klass attribute, which points to the SFC class.
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns CorrelationRatio data quality metric
Returns¶
Json object: CorrelationRatio of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns CorrelationRatio Metric in Standard format.
Returns¶
StandardMetricResult: CorrelationRatio Metric in standard format.
- merge(other: CorrelationRatio, **kwargs: Any) CorrelationRatio ¶
Merge two CorrelationRatio metrics into one, without mutating the other.
Parameters¶
- other : CorrelationRatio
The other CorrelationRatio that needs to be merged.
Returns¶
- CorrelationRatio
A new instance of CorrelationRatio
- numerical_features: List[str]¶
- class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails(total_sum: float = 0.0, total_count: int = 0)¶
Bases:
object
- total_count: int = 0¶
- total_sum: float = 0.0¶
- class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState(category_details: Dict[Union[int, str, float], mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails] = <factory>, categorical_feature: str = '', numerical_feature: str = '')¶
Bases:
object
- categorical_feature: str = ''¶
- category_details: Dict[int | str | float, CorrelationRatioDetails]¶
- numerical_feature: str = ''¶