mlm_insights.core.metrics.drift_metrics package¶
Submodules¶
mlm_insights.core.metrics.drift_metrics.chi_square module¶
- class mlm_insights.core.metrics.drift_metrics.chi_square.ChiSquare(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, epsilon_value: float = 0.0001, _max_size_k: int = 7)¶
Bases:
MetricBase
Data Drift Metric to compute the Chi-square goodness of fit test.
The chi-square test checks the null hypothesis that the categorical data has the given frequencies.
It can process only categorical data types (nominal, ordinal, binary).
It is an approximate metric.
This is used for Model Drift computation, taking into consideration reference and current profiles.
Configuration¶
- epsilon_value: float, default = 0.0001
Small value used to replace zeros in an array: any element less than or equal to epsilon_value is set to epsilon_value. Certain drift algorithms require this so that the computed value is not invalid. For example, a zero denominator would otherwise cause a division-by-zero error.
- _max_size_k: int, default = 7
Maximum size, in log2, of k. The value must be between 7 and 21, inclusive
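The epsilon replacement described for epsilon_value can be pictured with a minimal sketch. The helper name replace_small_values is hypothetical and is not part of mlm_insights; the library's internal implementation may differ.

```python
from typing import List


def replace_small_values(values: List[float], epsilon_value: float = 0.0001) -> List[float]:
    """Hypothetical helper: return a copy in which every element that is
    less than or equal to epsilon_value is set to epsilon_value."""
    return [epsilon_value if v <= epsilon_value else v for v in values]


expected_frequencies = [0.5, 0.0, 0.5]
safe_frequencies = replace_small_values(expected_frequencies)

# Dividing by the smoothed values no longer raises ZeroDivisionError.
ratios = [1.0 / v for v in safe_frequencies]
```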
Returns¶
- algorithm: string: Drift Algorithm Name
“Chi Squared Goodness of Fit Test”
- test_statistic: float: Test Statistic
The chi-squared test statistic
- p_value: float: p value
The P-value is the area under the density curve of this chi-square distribution to the right of the value of the test statistic.
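To make the test statistic concrete, here is a minimal sketch of a chi-square goodness-of-fit statistic computed from raw category values. It assumes the reference frequencies are scaled to the size of the current sample and that expected counts are floored at epsilon_value; the function name is hypothetical, and the library itself works from Frequent Items sketches rather than raw values.

```python
from collections import Counter
from typing import Iterable


def chi_square_statistic(reference: Iterable[str], current: Iterable[str],
                         epsilon_value: float = 0.0001) -> float:
    """Hypothetical helper: sum of (observed - expected)^2 / expected."""
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    ref_total = sum(ref_counts.values())
    cur_total = sum(cur_counts.values())
    if ref_total == 0 or cur_total == 0:
        raise ValueError("both samples must be non-empty")

    statistic = 0.0
    for category in sorted(set(ref_counts) | set(cur_counts)):
        observed = cur_counts.get(category, 0)
        # Expected count: reference proportion scaled to the current sample size.
        expected = ref_counts.get(category, 0) / ref_total * cur_total
        # Assumption: floor the expected count so the division is always defined.
        expected = max(expected, epsilon_value)
        statistic += (observed - expected) ** 2 / expected
    return statistic


statistic = chi_square_statistic(
    ["bus", "bus", "train", "car"],
    ["bus", "train", "train", "car"],
)
```

The p-value would then be the right-tail area of a chi-square distribution with (number of categories - 1) degrees of freedom, evaluated at this statistic. Note that a category absent from the reference sample has its expected count floored at epsilon_value, so it contributes a very large term.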
Examples
import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.chi_square import ChiSquare
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'mode_of_transport': FeatureType(
        data_type=DataType.TEXT,
        variable_type=VariableType.NOMINAL,
        column_type=ColumnType.INPUT)
}

def get_metrics():
    uni_variate_metrics = {
        "mode_of_transport": [MetricMetadata(klass=ChiSquare)]
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

def do_run(data_frame):
    # The parentheses allow the builder chain to span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile

def main():
    reference_data_frame = pd.DataFrame({'mode_of_transport': ['bus', 'bus', 'train', 'walk', 'bus', 'car']})
    target_data_frame = pd.DataFrame({'mode_of_transport': ['bus', 'bus', 'bus', 'cycle', 'bus', 'car']})

    # do a reference run
    reference_profile = do_run(data_frame=reference_data_frame)
    # do a target run
    target_profile = do_run(data_frame=target_data_frame)

    # compute drift of the target profile against the reference profile
    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['mode_of_transport']["ChiSquare"])

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'ChiSquare',
    'metric_description': 'Data Drift Metric to compute Chi-square goodness of fit test',
    'variable_count': 3,
    'variable_names': ['algorithm', 'test_statistic', 'p_value'],
    'variable_types': [TEXT, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': ['ChiSquare', 0.5, 0.5],
    'metadata': {},
    'error': None
}
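The standard metric result pairs variable_names with metric_data by position. A small sketch of turning it into a name-to-value mapping follows; the helper metric_values is hypothetical (not an mlm_insights function), and the dictionary simply repeats the placeholder values from the example output above.

```python
from typing import Any, Dict


def metric_values(standard_result: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map each variable name to its value."""
    if standard_result.get("error") is not None:
        raise ValueError(f"metric failed: {standard_result['error']}")
    return dict(zip(standard_result["variable_names"],
                    standard_result["metric_data"]))


# Placeholder values copied from the documented example output.
result = {
    "metric_name": "ChiSquare",
    "variable_count": 3,
    "variable_names": ["algorithm", "test_statistic", "p_value"],
    "metric_data": ["ChiSquare", 0.5, 0.5],
    "metadata": {},
    "error": None,
}

values = metric_values(result)
print(values["test_statistic"], values["p_value"])
```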
- classmethod create(config: Dict[str, ConfigParameter] | None = None) ChiSquare ¶
Factory Method to create an object.
Returns¶
ChiSquare: An instance of ChiSquare.
- epsilon_value: float = 0.0001¶
Returns a list of Shareable Feature Components (SFCs) containing a single SFC, the Frequent Items SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. Frequent Items SFC
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
mlm_insights.core.metrics.drift_metrics.drift_metrics_helper module¶
- mlm_insights.core.metrics.drift_metrics.drift_metrics_helper.get_quantiles_sfcs(metric_metadata: MetricMetadata, metric: MetricBase, kwargs: Any) Tuple[Any, Any, float, float] ¶
- mlm_insights.core.metrics.drift_metrics.drift_metrics_helper.validate_metric_can_be_computed(current_profile: Profile, reference_profile: Profile, feature_name: str, metric_metadata: MetricMetadata) None ¶
mlm_insights.core.metrics.drift_metrics.jensen_shannon module¶
- class mlm_insights.core.metrics.drift_metrics.jensen_shannon.JensenShannon(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')¶
Bases:
MetricBase
Data Drift Metric to compute the Jensen Shannon distance between 2 probability distributions.
This is the square root of the Jensen-Shannon divergence.
It can process only numerical data types (int, float).
It is an approximate metric.
This is used for Model Drift computation, taking into consideration reference and current profiles.
Configuration¶
- bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats
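For orientation, the Sturges rule picks a bin count from the sample size alone. The sketch below uses the textbook formula ceil(log2(n)) + 1 and builds equal-width edges, which corresponds to the "list of floats" form of the setting. Both function names are hypothetical, and this is not necessarily how the library derives its bins.

```python
import math
from typing import List


def sturges_bin_count(sample_size: int) -> int:
    """Number of bins suggested by the Sturges rule: ceil(log2(n)) + 1."""
    if sample_size < 1:
        raise ValueError("sample_size must be positive")
    return math.ceil(math.log2(sample_size)) + 1


def equal_width_edges(low: float, high: float, bin_count: int) -> List[float]:
    """Return bin_count + 1 equally spaced edges covering [low, high]."""
    if bin_count < 1 or high <= low:
        raise ValueError("need bin_count >= 1 and high > low")
    width = (high - low) / bin_count
    return [low + i * width for i in range(bin_count)] + [high]


bin_count = sturges_bin_count(100)
edges = equal_width_edges(0.0, 3.0, 3)
```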
Returns¶
- algorithm: string: Drift Algorithm Name
“Jensen Shannon Distance”
- drift_score: float: Drift Score
The Jensen-Shannon distance between the 2 probability distributions
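As a sketch of the quantity being reported, the Jensen-Shannon distance can be computed from two histograms over the same bins. The function names are hypothetical, and the base-2 logarithm is an assumption chosen so that the distance falls in [0, 1]; the library's choice of base is not stated here.

```python
import math
from typing import List, Sequence


def _kl_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p_counts: Sequence[float],
                            q_counts: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence of two histograms."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("histograms must not be empty")
    p: List[float] = [x / p_total for x in p_counts]
    q: List[float] = [x / q_total for x in q_counts]
    # The mixture m is positive wherever p or q is, so the ratios are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
    return math.sqrt(max(divergence, 0.0))


distance = jensen_shannon_distance([4, 0, 0], [2, 0, 2])
```

Unlike the Kullback-Leibler divergence, this distance is symmetric in its two arguments and needs no epsilon smoothing, because the mixture distribution is never zero where either input is positive.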
Examples
import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.jensen_shannon import JensenShannon
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}

def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=JensenShannon)]
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

def do_run(data_frame):
    # The parentheses allow the builder chain to span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile

def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # do a reference run
    reference_profile = do_run(data_frame=reference_data_frame)
    # do a target run
    target_profile = do_run(data_frame=target_data_frame)

    # compute drift of the target profile against the reference profile
    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["JensenShannon"])

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'JensenShannon',
    'metric_description': 'Data Drift Metric to compute Jensen Shannon distance between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['JensenShannon', 0.5],
    'metadata': {},
    'error': None
}
- bins: str | int | List[float] = 'sturges'¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) JensenShannon ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns list of SFCs required to compute JS metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: JensenShannon, **kwargs: Any) JensenShannon ¶
Merge two JensenShannon metrics into one, without mutating either.
Parameters¶
- other_metric: JensenShannon
The other JensenShannon metric to be merged.
Returns¶
- TypeMetric
A new instance of JensenShannon
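The merge contract (return a new instance and leave both inputs untouched) can be illustrated with a generic sketch. HistogramState is a hypothetical stand-in written for this illustration; it is not an mlm_insights class and says nothing about what state JensenShannon actually holds.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class HistogramState:
    """Hypothetical mergeable state, used only to illustrate the contract."""
    counts: Counter = field(default_factory=Counter)

    def update(self, values: Iterable[int]) -> None:
        self.counts.update(values)

    def merge(self, other: "HistogramState") -> "HistogramState":
        # Counter addition builds a new Counter, so neither operand changes.
        return HistogramState(counts=self.counts + other.counts)


first = HistogramState()
first.update([1, 1, 2])
second = HistogramState()
second.update([1, 3])

merged = first.merge(second)
```

Returning a fresh object means partial results computed on separate data partitions can be combined in any order without one combination corrupting another.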
mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov module¶
- class mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov.KolmogorovSmirnov(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 100, _kll_k: int = 500)¶
Bases:
MetricBase
Performs the two-sample Kolmogorov-Smirnov test for goodness of fit.
The asymptotic Kolmogorov-Smirnov distribution is used to compute an approximate p-value.
The Kolmogorov-Smirnov test is a nonparametric statistical test that identifies whether 2 probability distributions differ, or whether the two data samples come from the same distribution.
The test statistic for the 2-sample test is the greatest distance between the CDFs (Cumulative Distribution Functions) of the two samples.
Null Hypothesis: samples are drawn from the same distribution.
It can process only numerical data types (int, float).
It is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 500.
This is used for Model Drift computation, taking into consideration reference and current profiles.
Configuration¶
- bins: Union[str, int, List[float]], default=100
One of the following values:
- Number of bins. Default is 100
- Binning algorithm, for example 'sturges'
- Bins: List of floats
- _kll_k: int, default=500
Buffer size for the KLL sketch
Returns¶
- algorithm: string: Drift Algorithm Name
“Kolmogorov Smirnov”
- test_statistic: float: Test Statistic
The KS test statistic
- p_value: float: p value
The probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true
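The two returned values can be sketched on raw samples as follows. This is an illustration under stated assumptions: it uses exact empirical CDFs rather than the KLL sketch the library relies on, the function name is hypothetical, and the p-value uses the classical asymptotic Kolmogorov series truncated at 100 terms, which is only a rough guide for small samples.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_two_sample(reference: Sequence[float],
                  current: Sequence[float]) -> Tuple[float, float]:
    """Return (test_statistic, approximate p_value) for two samples."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the greatest distance.
    statistic = 0.0
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / n
        cdf_cur = bisect_right(cur, x) / m
        statistic = max(statistic, abs(cdf_ref - cdf_cur))

    # Asymptotic Kolmogorov distribution: 2 * sum (-1)^(j-1) exp(-2 j^2 lam^2).
    lam = math.sqrt(n * m / (n + m)) * statistic
    if lam == 0.0:
        return statistic, 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, 101))
    p_value = min(max(2.0 * series, 0.0), 1.0)
    return statistic, p_value


statistic, p_value = ks_two_sample([10, 10, 10, 10], [20, 21.2, 10, 11.3])
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution; the threshold for flagging drift is a choice left to the caller.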
Examples
import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov import KolmogorovSmirnov
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}

def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=KolmogorovSmirnov)]
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

def do_run(data_frame):
    # The parentheses allow the builder chain to span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile

def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # do a reference run
    reference_profile = do_run(data_frame=reference_data_frame)
    # do a target run
    target_profile = do_run(data_frame=target_data_frame)

    # compute drift of the target profile against the reference profile
    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["KolmogorovSmirnov"])

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'KolmogorovSmirnov',
    'metric_description': 'Data Drift Metric to compute two-sample Kolmogorov-Smirnov test for goodness of fit',
    'variable_count': 3,
    'variable_names': ['algorithm', 'test_statistic', 'p_value'],
    'variable_types': [TEXT, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': ['KolmogorovSmirnov', 0.5, 0.5],
    'metadata': {},
    'error': None
}
- bins: str | int | List[float] = 100¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) KolmogorovSmirnov ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns list of SFCs required to compute KS metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: KolmogorovSmirnov, **kwargs: Any) KolmogorovSmirnov ¶
Merge two KolmogorovSmirnov metrics into one, without mutating either.
Parameters¶
- other_metric: KolmogorovSmirnov
The other KolmogorovSmirnov metric to be merged.
Returns¶
- TypeMetric
A new instance of KolmogorovSmirnov
mlm_insights.core.metrics.drift_metrics.kullback_leibler module¶
- class mlm_insights.core.metrics.drift_metrics.kullback_leibler.KullbackLeibler(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')¶
Bases:
MetricBase
Metric to compute the Kullback-Leibler divergence between 2 probability distributions.
It is an approximate metric.
It can process only numerical data types (int, float).
This is used for Model Drift computation, taking into consideration reference and current profiles.
Configuration¶
- bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats
Returns¶
- algorithm: string: Drift Algorithm Name
“Kullback Leibler Divergence”
- drift_score: float: Drift Score
The KL divergence between the 2 probability distributions
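A minimal sketch of the divergence on two binned distributions is given below. It makes several assumptions that are not confirmed by this page: the current distribution is taken as P and the reference as Q, natural logarithms are used, and probabilities are floored at a small epsilon (without renormalizing) so that empty bins do not produce a division by zero. The function name is hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(current_probs: Sequence[float],
                  reference_probs: Sequence[float],
                  epsilon_value: float = 0.0001) -> float:
    """Sum of p * ln(p / q) over bins, with both sides floored at epsilon."""
    if len(current_probs) != len(reference_probs):
        raise ValueError("distributions must share the same bins")
    total = 0.0
    for p, q in zip(current_probs, reference_probs):
        p = max(p, epsilon_value)  # assumption: floor instead of skipping
        q = max(q, epsilon_value)
        total += p * math.log(p / q)
    return total


drift_score = kl_divergence([0.75, 0.25], [0.5, 0.5])
reverse_score = kl_divergence([0.5, 0.5], [0.75, 0.25])
```

The two scores differ because the Kullback-Leibler divergence is not symmetric, so which profile plays the reference role matters.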
Examples
import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.kullback_leibler import KullbackLeibler
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}

def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=KullbackLeibler)]
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

def do_run(data_frame):
    # The parentheses allow the builder chain to span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile

def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # do a reference run
    reference_profile = do_run(data_frame=reference_data_frame)
    # do a target run
    target_profile = do_run(data_frame=target_data_frame)

    # compute drift of the target profile against the reference profile
    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["KullbackLeibler"])

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'KullbackLeibler',
    'metric_description': 'Data Drift Metric to compute Kullback-Leibler divergence between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['KullbackLeibler', 0.5],
    'metadata': {},
    'error': None
}
- bins: str | int | List[float] = 'sturges'¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) KullbackLeibler ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns list of SFCs required to compute KL metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: KullbackLeibler, **kwargs: Any) KullbackLeibler ¶
Merge two KullbackLeibler metrics into one, without mutating either.
Parameters¶
- other_metric: KullbackLeibler
The other KullbackLeibler metric to be merged.
Returns¶
- TypeMetric
A new instance of KullbackLeibler
mlm_insights.core.metrics.drift_metrics.population_stability_index module¶
- class mlm_insights.core.metrics.drift_metrics.population_stability_index.PopulationStabilityIndex(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges', _kll_k: int = 500)¶
Bases:
MetricBase
Data Drift Metric to compute the Population Stability Index (PSI) distance between 2 probability distributions.
It can process only numerical data types (int, float).
It is an approximate metric.
This is used for Model Drift computation, taking into consideration reference and current profiles.
Configuration¶
- bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins
- Binning algorithm. Default is Sturges
- Bins: List of floats
Returns¶
- algorithm: string: Drift Algorithm Name
“Population Stability Index”
- drift_score: float: Drift Score
The PSI distance of one probability distribution from a reference probability distribution
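The sketch below shows one common way to compute PSI from raw values and an explicit list of bin edges (the "list of floats" form of bins). It assumes the usual formula, the sum over bins of (current - reference) * ln(current / reference), with empty bins floored at a small epsilon. Function names and the chosen edges are hypothetical, and the library computes its histograms from sketches rather than raw values.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float],
                               edges: Sequence[float],
                               epsilon_value: float = 0.0001) -> float:
    """PSI between two samples binned on the same edges."""
    score = 0.0
    for ref, cur in zip(bin_fractions(reference, edges),
                        bin_fractions(current, edges)):
        ref = max(ref, epsilon_value)  # assumption: floor empty bins
        cur = max(cur, epsilon_value)
        score += (cur - ref) * math.log(cur / ref)
    return score


edges = [10.0, 15.0, 20.0, 25.0]
current_fractions = bin_fractions([20, 21.2, 10, 11.3], edges)
drift_score = population_stability_index([10, 10, 10, 10],
                                         [20, 21.2, 10, 11.3], edges)
```

Every term of the sum is non-negative, so the score is zero only when the binned distributions match, and swapping the two samples gives the same value.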
Examples
import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.drift_metrics.population_stability_index import PopulationStabilityIndex
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

input_schema = {
    'square_feet': FeatureType(
        data_type=DataType.FLOAT,
        variable_type=VariableType.CONTINUOUS,
        column_type=ColumnType.INPUT)
}

def get_metrics():
    uni_variate_metrics = {
        "square_feet": [MetricMetadata(klass=PopulationStabilityIndex)]
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

def do_run(data_frame):
    # The parentheses allow the builder chain to span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=get_metrics())
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    return runner.run().profile

def main():
    reference_data_frame = pd.DataFrame({'square_feet': [10, 10, 10, 10]})
    target_data_frame = pd.DataFrame({'square_feet': [20, 21.2, 10, 11.3]})

    # do a reference run
    reference_profile = do_run(data_frame=reference_data_frame)
    # do a target run
    target_profile = do_run(data_frame=target_data_frame)

    # compute drift of the target profile against the reference profile
    profile_json = target_profile.to_json(reference_profile=reference_profile)
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["PopulationStabilityIndex"])

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'PopulationStabilityIndex',
    'metric_description': 'Data Drift Metric to compute Population Stability Index(PSI) distance between 2 probability distributions',
    'variable_count': 2,
    'variable_names': ['algorithm', 'drift_score'],
    'variable_types': [TEXT, CONTINUOUS],
    'variable_dtypes': [STRING, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': ['PopulationStabilityIndex', 0.5],
    'metadata': {},
    'error': None
}
- bins: str | int | List[float] = 'sturges'¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) PopulationStabilityIndex ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns list of SFCs required to compute PSI metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric. Shareable Feature Component(s) can be accessed using kwargs.
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: PopulationStabilityIndex, **kwargs: Any) PopulationStabilityIndex ¶
Merge two PopulationStabilityIndex metrics into one, without mutating either.
Parameters¶
- other_metric: PopulationStabilityIndex
The other PopulationStabilityIndex metric to be merged.
Returns¶
- TypeMetric
A new instance of PopulationStabilityIndex