mlm_insights.core.metrics package¶
Subpackages¶
- mlm_insights.core.metrics.classification_metrics package
- Submodules
- mlm_insights.core.metrics.classification_metrics.accuracy_score module
- mlm_insights.core.metrics.classification_metrics.common module
- mlm_insights.core.metrics.classification_metrics.confusion_matrix module
- mlm_insights.core.metrics.classification_metrics.false_negative_rate module
- mlm_insights.core.metrics.classification_metrics.false_positive_rate module
- mlm_insights.core.metrics.classification_metrics.fbeta_score module
- mlm_insights.core.metrics.classification_metrics.log_loss module
- mlm_insights.core.metrics.classification_metrics.precision_recall_auc module
- mlm_insights.core.metrics.classification_metrics.precision_recall_curve module
- mlm_insights.core.metrics.classification_metrics.precision_score module
- mlm_insights.core.metrics.classification_metrics.recall_score module
- mlm_insights.core.metrics.classification_metrics.roc module
- mlm_insights.core.metrics.classification_metrics.roc_auc module
- mlm_insights.core.metrics.classification_metrics.specificity module
- Module contents
- mlm_insights.core.metrics.drift_metrics package
- Submodules
- mlm_insights.core.metrics.drift_metrics.chi_square module
- mlm_insights.core.metrics.drift_metrics.drift_metrics_helper module
- mlm_insights.core.metrics.drift_metrics.jensen_shannon module
- mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov module
- mlm_insights.core.metrics.drift_metrics.kullback_leibler module
- mlm_insights.core.metrics.drift_metrics.population_stability_index module
- mlm_insights.core.metrics.interfaces package
- Submodules
- mlm_insights.core.metrics.interfaces.dataset_metric_base module
DatasetMetricBase
DatasetMetricBase.compute()
DatasetMetricBase.config
DatasetMetricBase.create()
DatasetMetricBase.do_deserialize()
DatasetMetricBase.do_serialize()
DatasetMetricBase.get_config()
DatasetMetricBase.get_name()
DatasetMetricBase.get_required_shareable_dataset_components()
DatasetMetricBase.get_required_shareable_feature_components()
DatasetMetricBase.get_result()
DatasetMetricBase.get_standard_metric_result()
DatasetMetricBase.merge()
DatasetMetricBase.set_config()
- mlm_insights.core.metrics.interfaces.metric_base module
MetricBase
MetricBase.compute()
MetricBase.config
MetricBase.create()
MetricBase.do_deserialize()
MetricBase.do_serialize()
MetricBase.get_config()
MetricBase.get_name()
MetricBase.get_required_shareable_feature_components()
MetricBase.get_result()
MetricBase.get_standard_metric_result()
MetricBase.get_supported_variable_types()
MetricBase.merge()
MetricBase.set_config()
- mlm_insights.core.metrics.regression_metrics package
- Submodules
- mlm_insights.core.metrics.regression_metrics.max_error module
- mlm_insights.core.metrics.regression_metrics.mean_absolute_error module
- mlm_insights.core.metrics.regression_metrics.mean_absolute_percentage_error module
MeanAbsolutePercentageError
MeanAbsolutePercentageError.compute()
MeanAbsolutePercentageError.create()
MeanAbsolutePercentageError.get_result()
MeanAbsolutePercentageError.get_standard_metric_result()
MeanAbsolutePercentageError.merge()
MeanAbsolutePercentageError.prediction_column
MeanAbsolutePercentageError.sum_of_relative_error
MeanAbsolutePercentageError.target_column
MeanAbsolutePercentageError.total_count
- mlm_insights.core.metrics.regression_metrics.mean_squared_error module
- mlm_insights.core.metrics.regression_metrics.mean_squared_log_error module
MeanSquaredLogError
MeanSquaredLogError.compute()
MeanSquaredLogError.create()
MeanSquaredLogError.get_result()
MeanSquaredLogError.get_standard_metric_result()
MeanSquaredLogError.merge()
MeanSquaredLogError.prediction_column
MeanSquaredLogError.sum_of_squared_log
MeanSquaredLogError.target_column
MeanSquaredLogError.total_count
- mlm_insights.core.metrics.regression_metrics.r2_score module
- mlm_insights.core.metrics.regression_metrics.root_mean_squared_error module
RootMeanSquaredError
RootMeanSquaredError.compute()
RootMeanSquaredError.create()
RootMeanSquaredError.get_result()
RootMeanSquaredError.get_standard_metric_result()
RootMeanSquaredError.merge()
RootMeanSquaredError.prediction_column
RootMeanSquaredError.prediction_score_column
RootMeanSquaredError.sum_of_squared_residuals
RootMeanSquaredError.target_column
RootMeanSquaredError.total_count
- mlm_insights.core.metrics.data_quality package
- Submodules
- mlm_insights.core.metrics.data_quality.cramers_v_correlation module
CorrelationSummary
CramersVCorrelation
CramersVCorrelation.compute()
CramersVCorrelation.create()
CramersVCorrelation.deserialize()
CramersVCorrelation.feature_list
CramersVCorrelation.feature_pair_mapping
CramersVCorrelation.get_result()
CramersVCorrelation.get_standard_metric_result()
CramersVCorrelation.merge()
CramersVCorrelation.serialize()
CramersVCorrelationState
- mlm_insights.core.metrics.data_quality.pearson_correlation module
PearsonCorrelation
PearsonCorrelation.compute()
PearsonCorrelation.create()
PearsonCorrelation.deserialize()
PearsonCorrelation.feature_list
PearsonCorrelation.feature_pair_mapping
PearsonCorrelation.get_required_shareable_feature_components()
PearsonCorrelation.get_result()
PearsonCorrelation.get_standard_metric_result()
PearsonCorrelation.merge()
PearsonCorrelation.serialize()
PearsonCorrelationState
- mlm_insights.core.metrics.data_quality.correlation_ratio module
CorrelationRatio
CorrelationRatio.categorical_features
CorrelationRatio.compute()
CorrelationRatio.create()
CorrelationRatio.deserialize()
CorrelationRatio.feature_pair_mapping
CorrelationRatio.get_required_shareable_feature_components()
CorrelationRatio.get_result()
CorrelationRatio.get_standard_metric_result()
CorrelationRatio.merge()
CorrelationRatio.numerical_features
CorrelationRatio.serialize()
CorrelationRatioDetails
CorrelationRatioState
- Module contents
- mlm_insights.core.metrics.conflict_metrics package
Submodules¶
mlm_insights.core.metrics.count module¶
- class mlm_insights.core.metrics.count.Count(config: Dict[str, ConfigParameter] = <factory>, missing_count: int = 0, total_count: int = 0)¶
Bases: MetricBase
Feature metric that computes the total row count, the missing count, and the missing count percentage.
It accounts for NaN values when computing the total count.
It is an exact univariate metric that can process any column type and all data types.
Configuration¶
None
Returns¶
- total_count: int
Number of records processed for the feature.
- missing_count: int
Number of records that have missing values.
- missing_count_percentage: float
The percentage of missing records in the data.
Examples
    import pandas as pd
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.count import Count
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata


    def main():
        input_schema = {
            'square_feet': FeatureType(
                data_type=DataType.FLOAT,
                variable_type=VariableType.CONTINUOUS,
                column_type=ColumnType.INPUT)
        }
        data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23, None]})
        metric_details = MetricDetail(
            univariate_metric={"square_feet": [MetricMetadata(klass=Count)]},
            dataset_metrics=[])

        # Parentheses let the builder chain span several lines.
        runner = (InsightsBuilder()
                  .with_input_schema(input_schema)
                  .with_data_frame(data_frame=data_frame)
                  .with_metrics(metrics=metric_details)
                  .with_engine(engine=EngineDetail(engine_name="native"))
                  .build())

        profile_json = runner.run().profile.to_json()
        feature_metrics = profile_json['feature_metrics']
        print(feature_metrics['square_feet']["Count"])
        # {'total_count': 6, 'missing_count': 1, 'missing_count_percentage': 16.666666666666664}


    if __name__ == "__main__":
        main()

Returns the standard metric result as:

    {
        'metric_name': 'Count',
        'metric_description': 'Feature metric that returns total count, missing count and missing count percentage',
        'variable_count': 3,
        'variable_names': ['total_count', 'missing_count', 'missing_count_percentage'],
        'variable_types': [CONTINUOUS, CONTINUOUS, CONTINUOUS],
        'variable_dtypes': [INTEGER, INTEGER, FLOAT],
        'variable_dimensions': [0, 0, 0],
        'metric_data': [0, 0, 0.0],
        'metadata': {},
        'error': None
    }
- compute(column: Series, **kwargs: Any) None ¶
Computes the count of missing records for the passed in dataset, as well as the total number of processed records. In case of a partitioned dataset, computes the count of missing records for the specific partition
Parameters¶
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Count ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- Count
An Instance of Count.
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the total count, count of missing data, and the percentage of missing data for the feature
Returns¶
- total_count: int
Total number of records processed in the data.
- missing_count: int
Number of records in the data that have missing values.
- missing_count_percentage: float
Percentage of missing records (the number of missing records for the feature divided by the total number of records, expressed as a percentage).
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns Standard Metric for Count.
Returns¶
StandardMetricResult: Count Metric in standard format.
- merge(other_metric: Count, **kwargs: Any) Count ¶
Merges two Count metrics into one without mutating either of them.
Parameters¶
- other_metric: Count
Other Count metric that needs to be merged.
Returns¶
- Count
A new instance of Count containing missing_count and total_count after merging.
- missing_count: int = 0¶
- total_count: int = 0¶
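The counting and merging behaviour described above can be pictured with a small standalone sketch. `CountSketch` and its methods are hypothetical names used only for illustration; this is not the library's `Count` class, and it assumes that every row contributes to `total_count` while `None`/NaN rows also contribute to `missing_count`, which is what the example output above suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountSketch:
    """Hypothetical stand-in for illustration; not the library's Count."""
    total_count: int = 0
    missing_count: int = 0

    @classmethod
    def from_values(cls, values):
        # v != v is true only for NaN, so this catches both None and NaN.
        missing = sum(1 for v in values if v is None or v != v)
        return cls(total_count=len(values), missing_count=missing)

    def merge(self, other):
        # Build a new object so neither input is mutated.
        return CountSketch(
            total_count=self.total_count + other.total_count,
            missing_count=self.missing_count + other.missing_count,
        )

    def result(self):
        if self.total_count:
            percentage = self.missing_count / self.total_count * 100
        else:
            percentage = 0.0  # assumption: empty input reports 0%
        return {
            "total_count": self.total_count,
            "missing_count": self.missing_count,
            "missing_count_percentage": percentage,
        }


# Two partitions of the example column, counted separately and then merged.
part_a = CountSketch.from_values([11.23, 23.45, 11.23])
part_b = CountSketch.from_values([45.56, 11.23, None])
merged = part_a.merge(part_b)
print(merged.result())
```

Because only two integer counters are carried, merging partitions is a plain addition, and the percentage is derived at result time rather than stored.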
mlm_insights.core.metrics.dataset_metric_registry module¶
- class mlm_insights.core.metrics.dataset_metric_registry.DatasetMetricRegistry¶
Bases:
object
- add_metric(dataset_metric_metadata: MetricMetadata, **kwargs: Any) DatasetMetricRegistry ¶
- static create_from_metrics_map(dataset_metrics_map: Dict[str, DatasetMetricBase]) DatasetMetricRegistry ¶
Factory method that creates a Dataset Metric Registry directly from a dataset metric map.
Parameters¶
- dataset_metrics_map: Dict[str, DatasetMetricBase]
Dictionary that maps each metric hash (key) to its DatasetMetricBase (value).
- classmethod deserialize(metric_registry_message: MetricRegistryMessage) DatasetMetricRegistry ¶
- get_dataset_metrics() Any ¶
- get_dataset_metrics_map() Dict[str, DatasetMetricBase] ¶
- get_metric(dataset_metric_metadata: MetricMetadata) DatasetMetricBase ¶
- serialize() MetricRegistryMessage ¶
mlm_insights.core.metrics.dataset_summary module¶
- class mlm_insights.core.metrics.dataset_summary.DatasetSummary¶
Bases:
object
- compute(input_schema: Dict[str, FeatureType]) None ¶
- classmethod deserialize(serialized: DatasetSummaryMessage) DatasetSummary ¶
- serialize() DatasetSummaryMessage ¶
mlm_insights.core.metrics.distinct_count module¶
- class mlm_insights.core.metrics.distinct_count.DistinctCount(config: Dict[str, ConfigParameter] = <factory>)¶
Bases: MetricBase
Feature metric that computes the distinct count of the elements present in the column.
It is an approximate univariate metric that can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 4096.
NaN values are not considered during the computation.
Returns¶
- distinct_count: int
the distinct count of the data.
Example
    import pandas as pd
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType
    from mlm_insights.core.metrics.distinct_count import DistinctCount
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata

    df = pd.DataFrame({"Age": [1, 4, 6, 1]})
    metric_details = MetricDetail(
        univariate_metric={"Age": [MetricMetadata(klass=DistinctCount)]},
        dataset_metrics=[])
    input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=df)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics["Age"])

Output:

    {
        "DistinctCount": {
            "metric_name": "DistinctCount",
            "metric_description": "Approximate Distinct Count",
            "variable_count": 1,
            "variable_names": ["distinct_count"],
            "variable_types": ["CONTINUOUS"],
            "variable_dtypes": ["FLOAT"],
            "variable_dimensions": [0],
            "metric_data": [3],
            "metadata": {}
        }
    }
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCount ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
int: distinct count of the data.
Returns a list of Shareable Feature Components containing one SFC, the DistinctCountSFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DistinctCountSFC
mlm_insights.core.metrics.duplicate_count module¶
- class mlm_insights.core.metrics.duplicate_count.DuplicateCount(config: Dict[str, ConfigParameter] = <factory>)¶
Bases: MetricBase
Feature metric that computes the duplicate count and the duplicate count percentage of the elements present in the column.
It is an approximate univariate metric that can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 1024.
NaN values are not considered during the computation.
Configuration¶
None
Returns¶
- count: int
Number of duplicate items in the feature data
- percentage: float
The percentage of duplicate records in the data
Examples
    import pandas as pd
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.duplicate_count import DuplicateCount
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata


    def main():
        input_schema = {
            'square_feet': FeatureType(
                data_type=DataType.FLOAT,
                variable_type=VariableType.CONTINUOUS,
                column_type=ColumnType.INPUT)
        }
        data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
        metric_details = MetricDetail(
            univariate_metric={"square_feet": [MetricMetadata(klass=DuplicateCount)]},
            dataset_metrics=[])

        # Parentheses let the builder chain span several lines.
        runner = (InsightsBuilder()
                  .with_input_schema(input_schema)
                  .with_data_frame(data_frame=data_frame)
                  .with_metrics(metrics=metric_details)
                  .with_engine(engine=EngineDetail(engine_name="native"))
                  .build())

        profile_json = runner.run().profile.to_json()
        feature_metrics = profile_json['feature_metrics']
        print(feature_metrics['square_feet']["DuplicateCount"])
        # {'count': 2, 'percentage': 40.0}


    if __name__ == "__main__":
        main()

Returns the standard metric result as:

    {
        'metric_name': 'DuplicateCount',
        'metric_description': 'Feature Metric to compute duplicate count and duplicate count percentage',
        'variable_count': 2,
        'variable_names': ['count', 'percentage'],
        'variable_types': [CONTINUOUS, CONTINUOUS],
        'variable_dtypes': [INTEGER, FLOAT],
        'variable_dimensions': [0, 0],
        'metric_data': [23, 15.5],
        'metadata': {},
        'error': None
    }
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DuplicateCount ¶
Factory Method to create an object.
Returns¶
Object: number of items that are duplicate of another item in the data and percentage of duplicate count out of the total count.
Returns a list of Shareable Feature Components containing 1 SFC that is Frequent Items SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. Frequent Items SFC
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the number of items that are duplicates of another item in the data, and the duplicate count as a percentage of the total count.
Returns¶
Object: the number of items that are duplicates of another item in the data, and the duplicate count as a percentage of the total count.
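As an exact point of comparison for the approximate metric, the sketch below computes the same two quantities with a plain set. The helper name `duplicate_stats` is hypothetical, and the definition used here (non-missing rows minus distinct values) is an assumption inferred from the example above, where five values with three distinct entries yield a count of 2 and a percentage of 40.0.

```python
def duplicate_stats(values):
    """Exact duplicate count and percentage (illustrative, not the library code)."""
    # Drop missing entries: None, and NaN (the only value for which v != v).
    present = [v for v in values if v is not None and v == v]
    total = len(present)
    # Every occurrence beyond the first of a value counts as a duplicate.
    duplicates = total - len(set(present))
    percentage = 100.0 * duplicates / total if total else 0.0
    return {"count": duplicates, "percentage": percentage}


stats = duplicate_stats([11.23, 23.45, 11.23, 45.56, 11.23])
print(stats)
```

On large or high-cardinality columns the sketch-based metric may return slightly different numbers, since it trades exactness for bounded memory.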
mlm_insights.core.metrics.framework_metrics_enum module¶
- class mlm_insights.core.metrics.framework_metrics_enum.FrameworkMetrics(value)¶
Bases:
Enum
Defines all Insights-provided metrics.
- AccuracyScore = <class 'mlm_insights.core.metrics.classification_metrics.accuracy_score.AccuracyScore'>¶
- ChiSquare = <class 'mlm_insights.core.metrics.drift_metrics.chi_square.ChiSquare'>¶
- ClassImbalance = <class 'mlm_insights.core.metrics.bias_and_fairness.class_imbalance.ClassImbalance'>¶
- ConflictLabel = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_label.ConflictLabel'>¶
- ConflictPrediction = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_prediction.ConflictPrediction'>¶
- ConfusionMatrix = <class 'mlm_insights.core.metrics.classification_metrics.confusion_matrix.ConfusionMatrix'>¶
- CorrelationRatio = <class 'mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio'>¶
- Count = <class 'mlm_insights.core.metrics.count.Count'>¶
- CramersVCorrelation = <class 'mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation'>¶
- DateTimeDuration = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_duration.DateTimeDuration'>¶
- DateTimeMax = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_max.DateTimeMax'>¶
- DateTimeMin = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_min.DateTimeMin'>¶
- DistinctCount = <class 'mlm_insights.core.metrics.distinct_count.DistinctCount'>¶
- DuplicateCount = <class 'mlm_insights.core.metrics.duplicate_count.DuplicateCount'>¶
- FBetaScore = <class 'mlm_insights.core.metrics.classification_metrics.fbeta_score.FBetaScore'>¶
- FalseNegativeRate = <class 'mlm_insights.core.metrics.classification_metrics.false_negative_rate.FalseNegativeRate'>¶
- FalsePositiveRate = <class 'mlm_insights.core.metrics.classification_metrics.false_positive_rate.FalsePositiveRate'>¶
- FrequencyDistribution = <class 'mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution'>¶
- IQR = <class 'mlm_insights.core.metrics.iqr.IQR'>¶
- IsConstantFeature = <class 'mlm_insights.core.metrics.is_constant_feature.IsConstantFeature'>¶
- IsNegative = <class 'mlm_insights.core.metrics.is_negative.IsNegative'>¶
- IsNonZero = <class 'mlm_insights.core.metrics.is_non_zero.IsNonZero'>¶
- IsPositive = <class 'mlm_insights.core.metrics.is_positive.IsPositive'>¶
- IsQuasiConstantFeature = <class 'mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature'>¶
- JensenShannon = <class 'mlm_insights.core.metrics.drift_metrics.jensen_shannon.JensenShannon'>¶
- KolmogorovSmirnov = <class 'mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov.KolmogorovSmirnov'>¶
- KullbackLeibler = <class 'mlm_insights.core.metrics.drift_metrics.kullback_leibler.KullbackLeibler'>¶
- Kurtosis = <class 'mlm_insights.core.metrics.kurtosis.Kurtosis'>¶
- LogLoss = <class 'mlm_insights.core.metrics.classification_metrics.log_loss.LogLoss'>¶
- Max = <class 'mlm_insights.core.metrics.max.Max'>¶
- MaxError = <class 'mlm_insights.core.metrics.regression_metrics.max_error.MaxError'>¶
- Mean = <class 'mlm_insights.core.metrics.mean.Mean'>¶
- MeanAbsoluteError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_error.MeanAbsoluteError'>¶
- MeanAbsolutePercentageError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_percentage_error.MeanAbsolutePercentageError'>¶
- MeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_error.MeanSquaredError'>¶
- MeanSquaredLogError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_log_error.MeanSquaredLogError'>¶
- Min = <class 'mlm_insights.core.metrics.min.Min'>¶
- Mode = <class 'mlm_insights.core.metrics.mode.Mode'>¶
- PearsonCorrelation = <class 'mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation'>¶
- Percentiles = <class 'mlm_insights.core.metrics.percentiles.Percentiles'>¶
- PopulationStabilityIndex = <class 'mlm_insights.core.metrics.drift_metrics.population_stability_index.PopulationStabilityIndex'>¶
- PrecisionRecallAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_auc.PrecisionRecallAreaUnderCurve'>¶
- PrecisionRecallCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_curve.PrecisionRecallCurve'>¶
- PrecisionScore = <class 'mlm_insights.core.metrics.classification_metrics.precision_score.PrecisionScore'>¶
- ProbabilityDistribution = <class 'mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution'>¶
- Quartiles = <class 'mlm_insights.core.metrics.quartiles.Quartiles'>¶
- R2Score = <class 'mlm_insights.core.metrics.regression_metrics.r2_score.R2Score'>¶
- ROCAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc_auc.ROCAreaUnderCurve'>¶
- ROCCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc.ROCCurve'>¶
- Range = <class 'mlm_insights.core.metrics.range.Range'>¶
- RecallScore = <class 'mlm_insights.core.metrics.classification_metrics.recall_score.RecallScore'>¶
- RootMeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.root_mean_squared_error.RootMeanSquaredError'>¶
- RowCount = <class 'mlm_insights.core.metrics.rows_count.RowCount'>¶
- Skewness = <class 'mlm_insights.core.metrics.skewness.Skewness'>¶
- Specificity = <class 'mlm_insights.core.metrics.classification_metrics.specificity.Specificity'>¶
- StandardDeviation = <class 'mlm_insights.core.metrics.standard_deviation.StandardDeviation'>¶
- Sum = <class 'mlm_insights.core.metrics.sum.Sum'>¶
- TopKFrequentElements = <class 'mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements'>¶
- TypeMetric = <class 'mlm_insights.core.metrics.type_metric.TypeMetric'>¶
- Variance = <class 'mlm_insights.core.metrics.variance.Variance'>¶
mlm_insights.core.metrics.frequency_distribution module¶
- class mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution(config: Dict[str, ConfigParameter] = <factory>, bins: str | int | List[float] = 'sturges')¶
Bases: MetricBase
Frequency Distribution
This metric calculates the frequency distribution of a single data column.
This is a feature-level metric that can process any column type, but only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.
Configuration¶
- bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins
- Binning algorithm (default is Sturges)
- Bins: list of floats
Returns¶
- bins: List[float]
Bin edges of the data.
- frequency: List[int]
Frequency of the data in each bin.
Example
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.frequency_distribution import FrequencyDistribution
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    import pandas as pd


    def main():
        input_schema = {
            'square_feet': FeatureType(
                data_type=DataType.FLOAT,
                variable_type=VariableType.CONTINUOUS,
                column_type=ColumnType.INPUT)
        }
        data_frame = pd.DataFrame({'square_feet': [1, 1, 2, 3, 4, 5, 7, 10, 11, 20]})
        metric_details = MetricDetail(
            univariate_metric={"square_feet": [MetricMetadata(klass=FrequencyDistribution)]},
            dataset_metrics=[])

        # Parentheses let the builder chain span several lines.
        runner = (InsightsBuilder()
                  .with_input_schema(input_schema)
                  .with_data_frame(data_frame=data_frame)
                  .with_metrics(metrics=metric_details)
                  .with_engine(engine=EngineDetail(engine_name="native"))
                  .build())

        profile_json = runner.run().profile.to_json()
        feature_metrics = profile_json['feature_metrics']
        print(feature_metrics['square_feet']["FrequencyDistribution"])
        # {'bins': [1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], 'frequency': [5, 2, 2, 0, 1]}


    if __name__ == "__main__":
        main()

Returns the standard metric result as:

    {
        "metric_name": "FrequencyDistribution",
        "metric_description": "Feature Metric to compute Frequency distribution",
        "variable_count": 2,
        "variable_names": ['bins', 'frequency'],
        "variable_types": ["CONTINUOUS", "CONTINUOUS"],
        "variable_dtypes": ["FLOAT", "FLOAT"],
        "variable_dimensions": [1, 1],
        "metric_data": [[1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], [5, 2, 2, 0, 1]],
        "metadata": {},
        "error": null
    }
- bins: str | int | List[float] = 'sturges'¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequencyDistribution ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- FrequencyDistribution
An Instance of FrequencyDistribution.
Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.
Returns¶
- List[SFCMetaData]
List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the FrequencyDistribution of data.
Returns¶
- Dict
The frequency distribution of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: FrequencyDistribution, **kwargs: Any) FrequencyDistribution ¶
Merges two Frequency Distribution metrics into one without mutating either of them.
Parameters¶
- other_metric: FrequencyDistribution
Other Frequency Distribution metric that needs to be merged.
Returns¶
- FrequencyDistribution
A new instance of Frequency Distribution metric after merging.
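To make the default `'sturges'` binning concrete, here is a minimal exact histogram in plain Python. It is only a sketch under stated assumptions: `sturges_histogram` is a hypothetical helper, it uses the textbook form of Sturges' rule (ceil(log2(n) + 1) equal-width bins), and it works on the raw values rather than on the sketch data structure the metric uses internally, so results on large inputs may differ from the library's.

```python
import math


def sturges_histogram(values):
    """Equal-width histogram with the bin count chosen by Sturges' rule."""
    n = len(values)
    lo, hi = min(values), max(values)
    if lo == hi:
        # Constant data: a single degenerate bin holds everything.
        return [lo, hi], [n]
    num_bins = math.ceil(math.log2(n) + 1)  # Sturges' rule
    width = (hi - lo) / num_bins
    # num_bins + 1 edges; the last edge is pinned to the maximum.
    edges = [lo + i * width for i in range(num_bins)] + [hi]
    counts = [0] * num_bins
    for v in values:
        # Bins are half-open [left, right), except the last one, which
        # is closed so that the maximum value is counted.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = sturges_histogram([1, 1, 2, 3, 4, 5, 7, 10, 11, 20])
print(edges)
print(counts)
```

For the ten values from the example above, the rule gives five bins of width 3.8, which lines up with the six bin edges and five frequencies shown in the example output.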
mlm_insights.core.metrics.iqr module¶
- class mlm_insights.core.metrics.iqr.IQR(config: Dict[str, ConfigParameter] = <factory>)¶
Bases: MetricBase
Inter Quartile Range
This metric calculates the inter quartile range of a single numerical data column, namely Q3 - Q1.
Each quartile represents the ((n + 1)/4)th term of the overall dataset.
This is a feature-level metric that can process any column type, but only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.
Configuration¶
None
Returns¶
- iqr: float
the IQR of the data (Q3 - Q1).
Examples
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
    from mlm_insights.core.metrics.iqr import IQR
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    import pandas as pd


    def main():
        input_schema = {
            'square_feet': FeatureType(
                data_type=DataType.FLOAT,
                variable_type=VariableType.CONTINUOUS,
                column_type=ColumnType.INPUT)
        }
        data_frame = pd.DataFrame({'square_feet': [-1, -2, -3, -4]})
        metric_details = MetricDetail(
            univariate_metric={"square_feet": [MetricMetadata(klass=IQR)]},
            dataset_metrics=[])

        # Parentheses let the builder chain span several lines.
        runner = (InsightsBuilder()
                  .with_input_schema(input_schema)
                  .with_data_frame(data_frame=data_frame)
                  .with_metrics(metrics=metric_details)
                  .with_engine(engine=EngineDetail(engine_name="native"))
                  .build())

        profile_json = runner.run().profile.to_json()
        feature_metrics = profile_json['feature_metrics']
        print(feature_metrics['square_feet']["IQR"])
        # {'value': 2}


    if __name__ == "__main__":
        main()

Returns the standard metric result as:

    {
        'metric_name': 'IQR',
        'metric_description': 'Feature Metric to compute IQR',
        'variable_count': 1,
        'variable_names': ['i_q_r'],
        'variable_types': [CONTINUOUS],
        'variable_dtypes': [FLOAT],
        'variable_dimensions': [0],
        'metric_data': [2],
        'metadata': {},
        'error': None
    }
- classmethod create(config: Dict[str, ConfigParameter] | None = None) IQR ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of IQR.
Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC
mlm_insights.core.metrics.is_constant_feature module¶
- class mlm_insights.core.metrics.is_constant_feature.IsConstantFeature(config: Dict[str, ConfigParameter] = <factory>)¶
Bases: MetricBase
The Constant Feature metric computes whether all the values are the same.
This metric returns Constant as True when all the values within the feature are the same.
This is a univariate, feature-level metric that can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.
Configuration¶
None
Returns¶
- is_constant: boolean
If all values are same
Example
    import pandas as pd
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType
    from mlm_insights.core.metrics.is_constant_feature import IsConstantFeature
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata

    df = pd.DataFrame({"Age": [1, 1, 1, 1]})
    metric_details = MetricDetail(
        univariate_metric={"Age": [MetricMetadata(klass=IsConstantFeature)]},
        dataset_metrics=[])
    input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=df)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics["Age"])

Output:

    {
        "IsConstantFeature": {
            "metric_name": "IsConstantFeature",
            "metric_description": "Feature Metric to compute if all values are same",
            "variable_count": 1,
            "variable_names": ["is_constant"],
            "variable_types": ["BINARY"],
            "variable_dtypes": ["BOOLEAN"],
            "variable_dimensions": [0],
            "metric_data": [true],
            "metadata": {},
            "error": null
        }
    }
- classmethod create(config: Dict[str, ConfigParameter] | None = None) IsConstantFeature ¶
Factory Method to create an object.
Returns¶
An Instance of IsConstantFeature Univariate Metric.
Returns list of SFCs required to compute IsConstantFeature Univariate Metric.
Returns¶
List: list of SFCs
mlm_insights.core.metrics.is_quasi_constant_feature module¶
- class mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature(config: Dict[str, ConfigParameter] = <factory>, quasi_constant_threshold: float = 0.99)¶
Bases: MetricBase
The Quasi Constant metric computes whether all the values are almost the same.
This metric returns Quasi Constant as True when one single value occurs at a frequency at or above the Quasi Constant Threshold.
This is a univariate, feature-level metric that can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.
Configuration¶
- quasi_constant_threshold: float, default=0.99
Defines the Quasi Constant Threshold value. If the share of the count held by the most frequent value is >= this threshold, the feature is a Quasi Constant Feature.
Returns¶
- is_quasi_constant: boolean
If all values are almost same
Example
    from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
    from mlm_insights.builder.insights_builder import InsightsBuilder
    from mlm_insights.constants.types import FeatureType, DataType, VariableType
    from mlm_insights.core.metrics.is_quasi_constant_feature import IsQuasiConstantFeature
    from mlm_insights.core.metrics.metric_metadata import MetricMetadata
    import pandas as pd

    df = pd.DataFrame({"Age": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]})
    metric_details = MetricDetail(
        univariate_metric={"Age": [MetricMetadata(klass=IsQuasiConstantFeature,
                                                  config={"quasi_constant_threshold": 0.8})]},
        dataset_metrics=[])
    input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=df)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics["Age"])

Output:

    {
        "IsQuasiConstantFeature": {
            "metric_name": "IsQuasiConstantFeature",
            "metric_description": "Feature Metric to compute if all values are almost same",
            "variable_count": 1,
            "variable_names": ["is_quasi_constant"],
            "variable_types": ["BINARY"],
            "variable_dtypes": ["BOOLEAN"],
            "variable_dimensions": [0],
            "metric_data": [true],
            "metadata": {},
            "error": null
        }
    }
- classmethod create(config: Dict[str, ConfigParameter] | None = None) IsQuasiConstantFeature ¶
Factory Method to create an object.
Returns¶
An Instance of IsQuasiConstantFeature Univariate Metric. The configuration will be available in config.
Returns list of SFCs required to compute IsQuasiConstantFeature Univariate Metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns IsQuasiConstantFeature Univariate Metric for the data using the FrequentItemsSFC.
Returns¶
boolean: IsQuasiConstantFeature Univariate Metric of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns IsQuasiConstantFeature Metric in Standard format.
Returns¶
StandardMetricResult: IsQuasiConstantFeature Metric in Standard format.
- merge(other_metric: IsQuasiConstantFeature, **kwargs: Any) IsQuasiConstantFeature ¶
Merges two IsQuasiConstantFeature metrics into one, without mutating either of them.
Parameters¶
- other_metric: IsQuasiConstantFeature
The other IsQuasiConstantFeature metric to be merged.
Returns¶
- IsQuasiConstantFeature
A new instance of IsQuasiConstantFeature after merging.
- quasi_constant_threshold: float = 0.99¶
mlm_insights.core.metrics.kurtosis module¶
- class mlm_insights.core.metrics.kurtosis.Kurtosis(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Kurtosis of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Mathematically: central_moments[i] = sum{(x - mean)^i} / N
Kurtosis is based on the 4th Central Moment.
Configuration¶
None
Returns¶
- kurtosis: float
Kurtosis of the data. Returns None if no data is present.
Examples
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.kurtosis import Kurtosis
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Kurtosis)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Kurtosis"])
    # {'value': -0.4098628688922519}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Kurtosis',
    'metric_description': 'Feature Metric to compute Kurtosis',
    'variable_count': 1,
    'variable_names': ['kurtosis'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [-0.4098628688922519],
    'metadata': {},
    'error': None
}
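The central-moment formula can be checked by hand against the value printed in the example. The sketch below is plain Python and does not use the library; the function names are hypothetical. It assumes the excess (Fisher) definition, that is, the 4th central moment divided by the squared 2nd central moment, minus 3, which is the definition that reproduces the example's printed value for this data.

```python
def central_moment(values, order):
    """i-th central moment: sum((x - mean)^i) / N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def excess_kurtosis(values):
    """4th central moment over squared variance, minus 3 (Fisher definition)."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    return m4 / (m2 ** 2) - 3.0


square_feet = [11.23, 23.45, 11.23, 45.56, 11.23]
print(round(excess_kurtosis(square_feet), 6))  # -0.409863
```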
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Kurtosis ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Kurtosis.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
mlm_insights.core.metrics.max module¶
- class mlm_insights.core.metrics.max.Max(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Max of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Configuration¶
None
Returns¶
- max: float
Maximum of the data. Returns None if no data is present.
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.max import Max
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(
    univariate_metric={"Age": [MetricMetadata(klass=Max)]},
    dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}
# Parentheses let the builder chain span several lines.
runner = (InsightsBuilder()
          .with_input_schema(input_schema)
          .with_data_frame(data_frame=df)
          .with_metrics(metrics=metric_details)
          .with_engine(engine=EngineDetail(engine_name="native"))
          .build())
profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "Max": {
        "metric_name": "Max",
        "metric_description": "Feature Metric to compute maximum value",
        "variable_count": 1,
        "variable_names": ["maximum"],
        "variable_types": ["CONTINUOUS"],
        "variable_dtypes": ["FLOAT"],
        "variable_dimensions": [0],
        "metric_data": [6.0],
        "metadata": {},
        "error": null
    }
}
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Max ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
mlm_insights.core.metrics.mean module¶
- class mlm_insights.core.metrics.mean.Mean(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Mean of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Configuration¶
None
Returns¶
- mean: float
Mean of the data. Returns None if no data is present.
Examples
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Mean)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Mean"])
    # {'value': 20.54}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Mean',
    'metric_description': 'Feature Metric to compute Mean',
    'variable_count': 1,
    'variable_names': ['mean'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [20.54],
    'metadata': {},
    'error': None
}
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Mean ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Mean.
Returns list of SFCs required to compute mean metric.
Returns¶
List: list of SFCs
mlm_insights.core.metrics.metric_metadata module¶
- class mlm_insights.core.metrics.metric_metadata.MetricMetadata(klass: ~typing.Type[~typing.Any], config: ~typing.Dict[str, ~typing.Any] = <factory>)¶
Bases: object
Represents dataset metric metadata used to define and configure a metric
- config: Dict[str, Any]¶
- get_key() str ¶
Returns the key which uniquely identifies this metric. Since a metric can be added only once to a feature/profile, the key contains only the name, which uniquely identifies the metric.
- klass: Type[Any]¶
mlm_insights.core.metrics.metric_registry module¶
- class mlm_insights.core.metrics.metric_registry.MetricRegistry¶
Bases: object
- add_metric(metric_metadata: MetricMetadata, **kwargs: Any) MetricRegistry ¶
- static create_from_metrics_map(metrics_map: Dict[str, MetricBase]) MetricRegistry ¶
Factory method to create Metric Registry using Metric Map. Use this method to create metric registry directly from the metric map.
Parameters¶
- metrics_map: Dict[str, MetricBase]
Dictionary of metrics, with the hash as the key and the MetricBase as the value.
- classmethod deserialize(metric_registry_message: MetricRegistryMessage) MetricRegistry ¶
- get_metric(metric_metadata: MetricMetadata) MetricBase ¶
- get_metrics() Any ¶
- get_metrics_map() Dict[str, MetricBase] ¶
- serialize() MetricRegistryMessage ¶
mlm_insights.core.metrics.min module¶
- class mlm_insights.core.metrics.min.Min(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Min of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Configuration¶
None
Returns¶
- min: float
Minimum of the data
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.min import Min

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(
    univariate_metric={"Age": [MetricMetadata(klass=Min)]},
    dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}
# Parentheses let the builder chain span several lines.
runner = (InsightsBuilder()
          .with_input_schema(input_schema)
          .with_data_frame(data_frame=df)
          .with_metrics(metrics=metric_details)
          .with_engine(engine=EngineDetail(engine_name="native"))
          .build())
profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "Min": {
        "metric_name": "Min",
        "metric_description": "Feature Metric to compute minimum value",
        "variable_count": 1,
        "variable_names": ["minimum"],
        "variable_types": ["CONTINUOUS"],
        "variable_dtypes": ["FLOAT"],
        "variable_dimensions": [0],
        "metric_data": [1.0],
        "metadata": {},
        "error": null
    }
}
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Min ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Min.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
mlm_insights.core.metrics.mode module¶
- class mlm_insights.core.metrics.mode.Mode(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10)¶
Bases: MetricBase
This metric calculates the Mode for the given data column. It returns the most frequently occurring item as the mode. In bi-modal or multi-modal cases, two modes are returned.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to calculate the most frequently occurring item(s).
This metric handles NaN values by dropping them from the given data column.
The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact, and hence the exact mode will be returned. Otherwise, items are returned with their estimated frequencies and the mode will be approximate.
NOTE: If the metric result doesn’t contain any output, increase the maxMapSize by providing a higher value for ‘lg_max_k’.
For more details, see: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
Configuration¶
- lg_max_k: int, default=10
Base-2 log of the max map size (max map size = 2^lg_max_k). With the default lg_max_k of 10, the map size used by the Frequent Items Sketch is 2^10 = 1024.
Returns¶
- mode: List[String]
The mode of the given data column. In bi-modal or multi-modal cases, two modes are returned.
Exceptions¶
InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21
Examples
# To declare Mode metric, without any config
metric_details = MetricDetail(
    univariate_metric={"feature_name": [MetricMetadata(klass=Mode)]},
    dataset_metrics=[])

# To declare Mode metric, along with config options
metric_details = MetricDetail(
    univariate_metric={"feature_name": [MetricMetadata(klass=Mode, config={"lg_max_k": 12})]},
    dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'Mode',
    'metric_description': 'Mode',
    'variable_count': 1,
    'variable_names': ['mode'],
    'variable_types': [NOMINAL],
    'variable_dtypes': [STRING],
    'variable_dimensions': [1],
    'metric_data': [['1', '3']],
    'metadata': {},
    'error': None
}
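To get a feel for how lg_max_k relates to the point at which results stop being exact, the arithmetic from the description can be written out. This is only a sketch of the stated relationships (map size = 2^lg_max_k, exact below 0.75 * map size, valid range 7 to 21); the function name `frequent_items_capacity` is hypothetical, and `ValueError` stands in for the library's InvalidParameterException.

```python
def frequent_items_capacity(lg_max_k=10):
    """Return (max_map_size, exact_limit) for a given lg_max_k."""
    if not 7 <= lg_max_k <= 21:
        # Stand-in for InvalidParameterException in this sketch.
        raise ValueError("lg_max_k must be between 7 and 21")
    max_map_size = 2 ** lg_max_k
    # Frequencies are exact while fewer than this many distinct items are seen.
    exact_limit = 0.75 * max_map_size
    return max_map_size, exact_limit


print(frequent_items_capacity())    # (1024, 768.0)
print(frequent_items_capacity(12))  # (4096, 3072.0)
```

So with the default configuration, a column with fewer than 768 distinct values yields an exact mode; a column with more distinct values may warrant a larger lg_max_k.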
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Mode ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Mode.
Returns list of SFCs required to compute Mode metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns Mode metric for the data using the FrequentItemsSFC.
Returns¶
Dict[str, Any]: Mode metric of the data.
mlm_insights.core.metrics.probablity_distribution module¶
- class mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')¶
Bases: MetricBase
This metric calculates the Probability Distribution of a single data column. The Probability Distribution of a Random Variable (X) shows how the probabilities of the events are distributed over different values of the Random Variable.
This is a feature level metric which can process only numerical data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200.
This metric handles NaN values by dropping them from the given data column.
Configuration¶
- bins: Union[str, int, List[float]], default=’sturges’
One of the following values:
- Number of bins (int)
- Binning algorithm (str). The default is Sturges, which is also the only algorithm supported as of now. Other algorithms will be supported in the near future.
- Bins: List of floats
Returns¶
- bins: List[float]
bins of the data.
- density: List[float]
Density/probabilities of occurrence for the respective bins
Example
# To declare ProbabilityDistribution metric, without any config
metric_details = MetricDetail(
    univariate_metric={"feature_name": [MetricMetadata(klass=ProbabilityDistribution)]},
    dataset_metrics=[])

# To declare ProbabilityDistribution metric, along with config options
metric_details = MetricDetail(
    univariate_metric={"feature_name": [MetricMetadata(klass=ProbabilityDistribution, config={"bins": 10})]},
    dataset_metrics=[])

Returns the standard metric result as:
{
    "metric_name": "ProbabilityDistribution",
    "metric_description": "Feature Metric to compute probability density",
    "variable_count": 2,
    "variable_names": ['bins', 'density'],
    "variable_types": [CONTINUOUS, CONTINUOUS],
    "variable_dtypes": [FLOAT, FLOAT],
    "variable_dimensions": [1, 1],
    "metric_data": [[1.0, 1.5, 2.0], [0.5, 0.5]],
    "metadata": {},
    "error": null
}
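The relationship between bins and density in the result can be illustrated with exact counts. The sketch below is not the library's implementation, which works from a KLL sketch: it assumes the textbook form of Sturges' rule (ceil(log2(n)) + 1 bins) and assumes that each density entry is the fraction of values falling in the corresponding bin, with the last bin closed on the right. Both function names are hypothetical.

```python
import math


def sturges_bin_count(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def bin_probabilities(values, edges):
    """Fraction of values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        if x < edges[0] or x > edges[-1]:
            continue  # values outside the bin range are ignored
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


print(sturges_bin_count(8))                              # 4
print(bin_probabilities([1, 1, 2, 2], [1.0, 1.5, 2.0]))  # [0.5, 0.5]
```

Note that three bin edges describe two bins, which is why the bins list in the standard result is one element longer than the density list.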
- bins: str | int | List[float] = 'sturges'¶
- classmethod create(config: Dict[str, ConfigParameter] | None = None) ProbabilityDistribution ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- ProbabilityDistribution
An Instance of ProbabilityDistribution.
Returns list of SFCs required to compute PDF metric.
Returns¶
- List[SFCMetaData]
list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns Probability Distribution for the data using the QuantilesSFC.
Returns¶
- Dict
Probability Distribution for the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns Standard Metric for Probability Distribution.
Returns¶
StandardMetricResult: Probability Distribution Metric in standard format.
- merge(other_metric: ProbabilityDistribution, **kwargs: Any) ProbabilityDistribution ¶
Merges two Probability Distribution metrics into one, without mutating either of them.
Parameters¶
- other_metric: ProbabilityDistribution
The other Probability Distribution metric to be merged.
Returns¶
- ProbabilityDistribution
A new instance of Probability Distribution after merging.
mlm_insights.core.metrics.quartiles module¶
- class mlm_insights.core.metrics.quartiles.Quartiles(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Quartiles (Q1, Q2, Q3) of the given data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200.
This metric handles NaN values by dropping them from the given data column.
Configuration¶
None
Returns¶
Q1: Lower quartile, the value halfway between the lowest value and the median.
Q2: Second quartile (also known as the median), the middle value of the sorted data.
Q3: Upper quartile, the value halfway between the median and the highest value.
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.quartiles import Quartiles
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [3, 5, 1, 7, 8, 4, 9]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Quartiles)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Quartiles"])
    # {'q1': 3.0, 'q2': 5.0, 'q3': 8.0}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Quartiles',
    'metric_description': 'Feature Metric to compute Quartiles (Q1, Q2, Q3)',
    'variable_count': 3,
    'variable_names': ['q1', 'q2', 'q3'],
    'variable_types': [CONTINUOUS, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [FLOAT, FLOAT, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': [3.0, 5.0, 8.0],
    'metadata': {},
    'error': None
}
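As a cross-check on the example values, exact quartiles for the same data can be computed with the standard library. This sketch uses `statistics.quantiles` with its default 'exclusive' method, which is one of several common quartile conventions; it is not the KLL-based computation the metric performs, so results on larger data may differ slightly from the metric's approximate output.

```python
import statistics

square_feet = [3, 5, 1, 7, 8, 4, 9]

# n=4 splits the data into four equal-probability groups and returns
# the three cut points between them: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(square_feet, n=4)
print(q1, q2, q3)  # 3.0 5.0 8.0
```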
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Quartiles ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Quartiles.
Returns list of SFCs required to compute quartiles metric.
Returns¶
List: list of SFCs
mlm_insights.core.metrics.range module¶
- class mlm_insights.core.metrics.range.Range(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Range of a single numerical data column. Range is the difference between the largest and smallest values.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Returns¶
float: Range of the data.
Example
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.range import Range

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(
    univariate_metric={"Age": [MetricMetadata(klass=Range)]},
    dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}
# Parentheses let the builder chain span several lines.
runner = (InsightsBuilder()
          .with_input_schema(input_schema)
          .with_data_frame(data_frame=df)
          .with_metrics(metrics=metric_details)
          .with_engine(engine=EngineDetail(engine_name="native"))
          .build())
profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "Range": {
        "metric_name": "Range",
        "metric_description": "Feature Metric to compute range value",
        "variable_count": 1,
        "variable_names": ["range"],
        "variable_types": ["CONTINUOUS"],
        "variable_dtypes": ["FLOAT"],
        "variable_dimensions": [0],
        "metric_data": [5.0],
        "metadata": {},
        "error": null
    }
}
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Range ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
mlm_insights.core.metrics.rows_count module¶
- class mlm_insights.core.metrics.rows_count.RowCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, row_count: int = 0)¶
Bases: DatasetMetricBase
This metric calculates the total row count of the Dataset.
This Dataset level metric is an exact metric.
This metric doesn’t handle NaN values. If certain rows have NaN values, there is no impact on the RowCount.
Configuration¶
None
Returns¶
integer: total row count of the dataset
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.rows_count import RowCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={},
        dataset_metrics=[MetricMetadata(klass=RowCount)])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    dataset_metrics = profile_json['dataset_metrics']
    print(dataset_metrics["RowCount"])
    # {'value': 5}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'RowCount',
    'metric_description': 'Dataset-level Metric to compute the total row count of the dataset',
    'variable_count': 1,
    'variable_names': ['rows_count'],
    'variable_types': [DISCRETE],
    'variable_dtypes': [INTEGER],
    'variable_dimensions': [0],
    'metric_data': [5],
    'metadata': {},
    'error': None
}
- compute(dataset: DataFrame, **kwargs: Any) None ¶
Calculates the metric value(s) from the passed DataFrame and sets the internal state with the value(s). When a metric is being computed for a partitioned data set, this method is invoked for each partition. Write the logic required to derive the metric value for that specific partition in this method.
Parameters¶
- dataset: pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) RowCount ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns the computed value of the metric
Returns¶
Dict[str, Any]: Dictionary with key as string and value as any metric property.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- merge(other_metric: RowCount, **kwargs: Any) RowCount ¶
Merges the other metric with the current metric and returns a new instance of the metric. Use this method to merge the states of the two metrics to produce a statistically correct state.
Note: do not mutate the current metric; create a new instance instead.
Parameters¶
- other_metric: DatasetMetricBase
The second metric which the current metric is being merged with.
Returns¶
DatasetMetricBase: New, merged DatasetMetricBase instance
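The per-partition compute followed by a non-mutating merge can be sketched in a few lines. `RowCountSketch` below is a hypothetical stand-in and not the library class: unlike `RowCount.compute`, which returns None and updates internal state, this illustration returns a new frozen object from every operation so that the no-mutation rule for merge is easy to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowCountSketch:
    """Hypothetical stand-in for a row-count metric; not the library class."""
    row_count: int = 0

    def compute(self, partition):
        # Return a new state that includes this partition's rows.
        return RowCountSketch(self.row_count + len(partition))

    def merge(self, other):
        # Build a new instance; neither operand is modified.
        return RowCountSketch(self.row_count + other.row_count)


part_a = RowCountSketch().compute([11.23, 23.45, 11.23])
part_b = RowCountSketch().compute([45.56, 11.23])
merged = part_a.merge(part_b)
print(merged.row_count)                    # 5
print(part_a.row_count, part_b.row_count)  # 3 2
```

Because row counts simply add, merging per-partition results gives the same total as counting the whole dataset at once.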
- row_count: int = 0¶
mlm_insights.core.metrics.serializer module¶
- mlm_insights.core.metrics.serializer.do_metric_deserialize(klass: Any, metric_message: MetricMessage) Any ¶
- mlm_insights.core.metrics.serializer.do_metric_serialize(metric: Any) MetricMessage ¶
- mlm_insights.core.metrics.serializer.get_metric_class(metric_name: str, klass: Any | None = None) Any ¶
mlm_insights.core.metrics.skewness module¶
- class mlm_insights.core.metrics.skewness.Skewness(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Skewness of a single numerical data column. Skewness measures the asymmetry of a data set.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Distribution of data on the basis of the skewness value:
- Skewness = 0: the distribution is roughly symmetric, as for a normal distribution.
- Skewness > 0: the right tail of the distribution is longer.
- Skewness < 0: the left tail of the distribution is longer.
Configuration¶
None
Returns¶
float: Skewness of the data
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.skewness import Skewness
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Skewness)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Skewness"])
    # {'value': 1.1088349707251306}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Skewness',
    'metric_description': 'Feature Metric to compute Skewness',
    'variable_count': 1,
    'variable_names': ['skewness'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [1.1088349707251306],
    'metadata': {},
    'error': None
}
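The example value can be reproduced by hand. The sketch below assumes the population (biased) moment definition of skewness, the 3rd central moment divided by the 2nd central moment raised to the power 1.5, which matches the printed value for this data; the function name is hypothetical and the code does not use the library.

```python
def skewness(values):
    """Population skewness: m3 / m2^1.5, using central moments divided by N."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# One large value (45.56) stretches the right tail, so skewness is positive.
print(round(skewness([11.23, 23.45, 11.23, 45.56, 11.23]), 6))  # 1.108835
# A sample that is symmetric around its mean has zero skewness.
print(skewness([1, 2, 3, 4, 5]))  # 0.0
```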
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Skewness ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of Skewness.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
mlm_insights.core.metrics.standard_deviation module¶
- class mlm_insights.core.metrics.standard_deviation.StandardDeviation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases: MetricBase
This metric calculates the Standard Deviation of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Configuration¶
None
Returns¶
float: Standard Deviation of the feature
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.standard_deviation import StandardDeviation
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=StandardDeviation)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines.
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["StandardDeviation"])
    # {'value': 13.375326538070016}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'StandardDeviation',
    'metric_description': 'Feature Metric to compute Standard Deviation',
    'variable_count': 1,
    'variable_names': ['standard_deviation'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [13.375326538070016],
    'metadata': {},
    'error': None
}
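The example value corresponds to the population standard deviation (dividing by N) rather than the sample standard deviation (dividing by N - 1). This can be checked with the standard library; the snippet is a verification aid for this one example and does not use the library.

```python
import statistics

square_feet = [11.23, 23.45, 11.23, 45.56, 11.23]

population_sd = statistics.pstdev(square_feet)  # divides by N
sample_sd = statistics.stdev(square_feet)       # divides by N - 1

print(round(population_sd, 6))    # 13.375327
print(sample_sd > population_sd)  # True
```

When comparing against values from other tools, it is worth checking which of the two conventions they use, since the difference is noticeable on small samples.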
- classmethod create(config: Dict[str, ConfigParameter] | None = None) StandardDeviation ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns list of SFCs required to compute Standard Deviation metric.
Returns¶
List: list of SFCs
mlm_insights.core.metrics.sum module¶
- class mlm_insights.core.metrics.sum.Sum(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, sum: float = 0.0)¶
Bases: MetricBase
This metric calculates the Sum of a single numerical data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Returns¶
float: Sum of the feature data
Examples
# To declare Sum metric:
metric_details = MetricDetail(
    univariate_metric={"feature_name": [MetricMetadata(klass=Sum)]},
    dataset_metrics=[])
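The NaN handling described above, combined with per-partition computation, can be illustrated in plain Python. This is a sketch only: the helper `nan_safe_sum` is hypothetical, and it simply shows that dropping NaN values and adding per-partition sums gives the total for the whole column.

```python
import math


def nan_safe_sum(values):
    """Sum the values, skipping NaN entries."""
    return math.fsum(x for x in values if not math.isnan(x))


partition_a = [1.0, float("nan"), 2.5]
partition_b = [4.0, 0.5]

# Merging two partial sums is just addition.
total = nan_safe_sum(partition_a) + nan_safe_sum(partition_b)
print(total)  # 8.0
```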
- compute(column: Series, **kwargs: Any) None ¶
Computes the sum for the passed in dataset. In case of a partitioned dataset, computes the sum for the specific partition
Parameters¶
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Sum ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns sum of input data.
Returns¶
Dict[str, Any]: Result containing the sum of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns Standard Metric for sum.
Returns¶
StandardMetricResult: Sum Metric in standard format.
- classmethod get_supported_variable_types() List[VariableType] ¶
Method to retrieve the list of Feature Variable type supported for the metric
Returns¶
List of Feature Variable type supported by the Sum metric
- merge(other_metric: Sum, **kwargs: Any) Sum ¶
Merge two Sum metrics into one, without mutating either of them.
Parameters¶
- other_metric: Sum
The other Sum metric to be merged.
Returns¶
- Sum
A new instance of Sum metric after merging.
- sum: float = 0.0¶
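The NaN handling and the partition-wise merge described for Sum can be illustrated without the library. The following is a minimal sketch, not the mlm_insights implementation: `partial_sum` and `merge_sums` are hypothetical helper names, and plain Python lists stand in for the pd.Series partitions.

```python
import math

def partial_sum(values):
    """Sum one partition after dropping None and NaN values (hypothetical helper)."""
    return sum(v for v in values if v is not None and not math.isnan(v))

def merge_sums(left, right):
    """Combine two partial sums into a new value; neither input is changed."""
    return left + right

# Two partitions of the same feature; the NaN is dropped before summing.
partition_a = [11.23, 23.45, float("nan")]
partition_b = [11.23, 45.56, 11.23]

merged = merge_sums(partial_sum(partition_a), partial_sum(partition_b))
whole = partial_sum(partition_a + partition_b)

# The merged partial sums agree with the sum over the full column,
# up to floating-point rounding.
print(math.isclose(merged, whole))
```

Because addition is associative up to rounding, computing each partition separately and merging gives the same result as a single pass over the whole column.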
mlm_insights.core.metrics.top_k_frequent_elements module¶
- class mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10, k: int = 10)¶
Bases:
MetricBase
This metric calculates the Top K frequent elements for the given data column. It returns the most frequent items (aka heavy hitters) and also returns the frequency of occurrence for an item i.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to return estimated frequency of the items.
This metric handles NaN values by dropping them from the given data column.
The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact. Otherwise, items are returned with their estimated frequencies.
NOTE: If the metric result does not contain any output, increase the maxMapSize by providing a higher value for ‘lg_max_k’. Results may also be returned with a large difference between the upper and lower bounds, which means they are approximate. Increasing the maxMapSize gives a better result in that case as well.
Please refer here for more details: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
Configuration¶
- k: int, default=10
k value for how many top items to be returned by the metric
- lg_max_k: int, default=10
Base-2 logarithm of the max map size (max map size = 2^lg_max_k). With the default lg_max_k of 10, the map size used by the Frequent Items Sketch is 2^10 = 1024.
Returns¶
- categories: String
The different categories (item name)
- estimate: int
The estimated frequency for the given category (item name)
- lower_bound: int
The lower bound for the frequency of the given category. The true frequency is always guaranteed to lie between the lower bound and the upper bound.
- upper_bound: int
The upper bound for frequency of the given category.
Exceptions¶
InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21
Examples
# To declare TopKFrequentElements metric, without any config
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=TopKFrequentElements)]}, dataset_metrics=[])

# To declare TopKFrequentElements metric, along with config options
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=TopKFrequentElements, config={"k": 20, "lg_max_k": 12})]}, dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'TopKFrequentElements',
    'metric_description': 'Top K Frequent Elements',
    'variable_count': 4,
    'variable_names': ['categories', 'estimate', 'lower_bound', 'upper_bound'],
    'variable_types': [NOMINAL, CONTINUOUS, CONTINUOUS, CONTINUOUS],
    'variable_dtypes': [STRING, INTEGER, INTEGER, INTEGER],
    'variable_dimensions': [1, 1, 1, 1],
    'metric_data': [['3', '1', '2'], [5, 4, 3], [5, 4, 3], [5, 4, 3]],
    'metadata': {},
    'error': None
}
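The relationship between lg_max_k, the max map size and the 0.75 exactness threshold is simple arithmetic, sketched below. `frequent_items_limits` is a hypothetical helper written for illustration only and is not part of mlm_insights. It raises a plain ValueError where the library documents InvalidParameterException.

```python
def frequent_items_limits(lg_max_k=10):
    """Return (max_map_size, exact_threshold) for a given lg_max_k.

    Hypothetical helper: mirrors the documented rules, not library code.
    """
    # Documented valid range for lg_max_k is 7 to 21 inclusive.
    if not 7 <= lg_max_k <= 21:
        raise ValueError("lg_max_k must be between 7 and 21")
    max_map_size = 2 ** lg_max_k            # always a power of 2
    exact_threshold = 0.75 * max_map_size   # fewer distinct items than this: exact counts
    return max_map_size, exact_threshold

print(frequent_items_limits())    # default lg_max_k=10
print(frequent_items_limits(12))  # value used in the config example
```

With the default lg_max_k of 10 the map holds 1024 entries, so frequencies are exact for columns with fewer than 768 distinct values. Raising lg_max_k to 12 moves that threshold to 3072.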
- classmethod create(config: Dict[str, ConfigParameter] | None = None) TopKFrequentElements ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
An Instance of TopKFrequentElements.
Returns list of SFCs required to compute Top K Frequent Elements Metric.
Returns¶
List: list of SFCs
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns Top K Frequent Elements Metric for the data using the FrequentItemsSFC.
Returns¶
Dict[str, Any]: Top K Frequent Elements Metric of the data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
Returns Top K Frequent Elements Metric in Standard format.
Returns¶
StandardMetricResult: Top K Frequent Elements Metric in Standard format.
- k: int = 10¶
- merge(other_metric: TopKFrequentElements, **kwargs: Any) TopKFrequentElements ¶
Merge two Top K Frequent Elements metrics into one, without mutating either of them.
Parameters¶
- other_metric: TopKFrequentElements
The other TopKFrequentElements metric to be merged.
Returns¶
- TopKFrequentElements
A new instance of TopKFrequentElements after merging.
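The idea of merging per-partition frequency information and then reading off the top k items can be sketched with exact counting. This is only an illustration under stated assumptions: it uses collections.Counter rather than the Frequent Items Sketch, so every count is exact, and `count_items`, `merge_counts` and `top_k` are hypothetical helper names.

```python
from collections import Counter

def count_items(values):
    """Exact frequency counts for one partition, skipping missing values."""
    return Counter(v for v in values if v is not None)

def merge_counts(left, right):
    """Return a new Counter with summed counts; neither input is mutated."""
    return left + right

def top_k(counts, k=10):
    """Return the k most frequent (item, count) pairs."""
    return counts.most_common(k)

part_a = count_items(["3", "1", "3", "2", "3"])
part_b = count_items(["1", "3", "1", "2", "3", "1", "2"])

merged = merge_counts(part_a, part_b)
print(top_k(merged, k=3))
```

With exact counts the estimate, lower bound and upper bound coincide, which is the situation shown in the sample metric_data in the Examples, where all three lists are [5, 4, 3].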
mlm_insights.core.metrics.type_metric module¶
- class mlm_insights.core.metrics.type_metric.TypeMetric(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, string_type_count: int = 0, integral_type_count: int = 0, fractional_type_count: int = 0, boolean_type_count: int = 0)¶
Bases:
MetricBase
This metric calculates the count of data types for feature values. For a given feature, it returns how many strings, integers, floats and booleans are present in the feature data.
This is a feature level metric which can process a feature having any data type.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Configuration¶
None
Returns¶
string_type_count: Count of number of feature values of type string
integral_type_count: Count of number of feature values of type integer
fractional_type_count: Count of number of feature values of type float
boolean_type_count: Count of number of feature values of type boolean
Examples
import pandas as pd
import numpy as np
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
# TypeMetric lives in the type_metric module, not duplicate_count
from mlm_insights.core.metrics.type_metric import TypeMetric
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [0, 1, 2.0, 3, 4.4, True, False, 5, np.nan, 6.0, 7, None]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=TypeMetric)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["TypeMetric"])
    # {'string_type_count': 0, 'integral_type_count': 5, 'fractional_type_count': 3, 'boolean_type_count': 2}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'TypeMetric',
    'metric_description': 'Feature Metric to compute count of data types for feature values',
    'variable_count': 4,
    'variable_names': ['string_type_count', 'integral_type_count', 'fractional_type_count', 'boolean_type_count'],
    'variable_types': [DISCRETE, DISCRETE, DISCRETE, DISCRETE],
    'variable_dtypes': [INTEGER, INTEGER, INTEGER, INTEGER],
    'variable_dimensions': [0, 0, 0, 0],
    'metric_data': [0, 5, 3, 2],
    'metadata': {},
    'error': None
}
- boolean_type_count: int = 0¶
- compute(column: Series, **kwargs: Any) None ¶
Computes TypeMetric for the passed in dataset.
In case of a partitioned dataset, the TypeMetric for the specific partition is computed.
Parameters¶
- column: pd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) TypeMetric ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
- fractional_type_count: int = 0¶
- get_result(**kwargs: Any) Dict[str, Any] ¶
Returns Map containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count of input data.
Returns¶
- Map: Map containing string_type_count, integral_type_count, fractional_type_count,
boolean_type_count of input data.
- get_standard_metric_result(**kwargs: Any) StandardMetricResult ¶
This method returns metric output in standard format.
Returns¶
StandardMetricResult
- integral_type_count: int = 0¶
- merge(other_metric: TypeMetric, **kwargs: Any) TypeMetric ¶
Merge two TypeMetric instances into one, without mutating either of them.
Parameters¶
- other_metric: TypeMetric
The other TypeMetric to be merged.
Returns¶
- TypeMetric
A new instance of TypeMetric containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count after merging.
- string_type_count: int = 0¶
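The counting and merging behaviour documented for TypeMetric can be approximated in plain Python. The sketch below is an illustration only, not the library's implementation: `count_types` and `merge_type_counts` are hypothetical helpers, and a list stands in for the pd.Series. Note that bool must be tested before int, because bool is a subclass of int in Python.

```python
import math

KEYS = ("string_type_count", "integral_type_count",
        "fractional_type_count", "boolean_type_count")

def count_types(values):
    """Count strings, integers, floats and booleans, dropping None and NaN."""
    counts = dict.fromkeys(KEYS, 0)
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue  # missing values are dropped
        if isinstance(v, bool):        # check bool before int
            counts["boolean_type_count"] += 1
        elif isinstance(v, int):
            counts["integral_type_count"] += 1
        elif isinstance(v, float):
            counts["fractional_type_count"] += 1
        elif isinstance(v, str):
            counts["string_type_count"] += 1
    return counts

def merge_type_counts(left, right):
    """Return a new dict of summed counts; neither input is mutated."""
    return {key: left[key] + right[key] for key in KEYS}

values = [0, 1, 2.0, 3, 4.4, True, False, 5, float("nan"), 6.0, 7, None]
whole = count_types(values)
merged = merge_type_counts(count_types(values[:6]), count_types(values[6:]))
print(whole)
print(merged == whole)
```

On the values from the Examples this yields 0 strings, 5 integers, 3 floats and 2 booleans, the same counts as the documented output, and counting two partitions separately and then merging gives the same totals.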
mlm_insights.core.metrics.variance module¶
- class mlm_insights.core.metrics.variance.Variance(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)¶
Bases:
MetricBase
This metric calculates Variance of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column.
Configuration¶
None
Returns¶
float: Variance of the feature
Examples
import pandas as pd
from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.variance import Variance
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Variance)]},
        dataset_metrics=[])
    # Parentheses let the builder chain span several lines
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Variance"])
    # {'value': 178.89936000000003}

if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Variance',
    'metric_description': 'Feature Metric to compute Variance',
    'variable_count': 1,
    'variable_names': ['variance'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [178.89936000000003],
    'metadata': {},
    'error': None
}
- classmethod create(config: Dict[str, ConfigParameter] | None = None) Variance ¶
Factory Method to create an object. The configuration will be available in config.
Returns¶
- MetricBase
An Instance of MetricBase.
Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.
Returns¶
List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC
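The sample values in the Examples for Variance and StandardDeviation can be cross-checked with the standard library. The sketch below assumes the population form of the variance (dividing by n rather than n - 1). That assumption is inferred from the documented sample numbers only, not from the library source.

```python
import math
import statistics

# Same feature values as in the documented examples.
square_feet = [11.23, 23.45, 11.23, 45.56, 11.23]

# Population variance: mean of squared deviations from the mean.
variance = statistics.pvariance(square_feet)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 5))
print(round(std_dev, 5))
```

The computed values agree with the documented 178.89936 and 13.3753265 to within floating-point rounding, whereas the sample variance (statistics.variance) gives a different, larger value for the same data.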