mlm_insights.core.sdcs package¶
Subpackages¶
- mlm_insights.core.sdcs.confusion_matrix package
- Submodules
- mlm_insights.core.sdcs.confusion_matrix.multi_class_confusion_matrix module
ConfusionMatrixCounts
MultiClassConfusionMatrix
MultiClassConfusionMatrix.class_map
MultiClassConfusionMatrix.get_average_by_weights()
MultiClassConfusionMatrix.get_classes()
MultiClassConfusionMatrix.get_counts()
MultiClassConfusionMatrix.get_encoded_value()
MultiClassConfusionMatrix.get_false_negative_counts()
MultiClassConfusionMatrix.get_false_positive_counts()
MultiClassConfusionMatrix.get_number_of_classes()
MultiClassConfusionMatrix.get_precision_scores_per_class()
MultiClassConfusionMatrix.get_recall_scores_per_class()
MultiClassConfusionMatrix.get_true_negative_counts()
MultiClassConfusionMatrix.get_true_positive_counts()
MultiClassConfusionMatrix.is_binary_classes()
MultiClassConfusionMatrix.matrix
- mlm_insights.core.sdcs.confusion_matrix.multi_threshold_metric_summary module
- mlm_insights.core.sdcs.interfaces package
Submodules¶
mlm_insights.core.sdcs.confusion_matrix_component module¶
- class mlm_insights.core.sdcs.confusion_matrix_component.ConfusionMatrixComponent(target_column: str = 'y_true', prediction_column: str = 'y_predict', class_map: ~typing.Dict[str, int] = <factory>, matrix: ~typing.List[~typing.List[int]] = <factory>, max_label_threshold: int = 256)¶
Bases:
ShareableDatasetComponent
ConfusionMatrixComponent stores the confusion matrix between the Target column and the Prediction column, along with all labels. This can be used for both binary and multiclass classification.
- class_map: Dict[str, int]¶
- compute(dataset: DataFrame, **kwargs: Any) None ¶
- This method computes the confusion matrix and the list of unique classes for a partition.
1. Get a sorted list of unique classes from the target and prediction columns.
2. Check whether the number of unique classes exceeds the max threshold. This ensures that the resulting profile does not exceed a certain size.
3. Create a map of class to encoded integer value.
4. Iterate the target and prediction columns to fill the cell values in the n*n matrix.
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
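The compute steps above can be sketched as follows. This is an illustrative outline, not the library's implementation; plain lists stand in for the DataFrame columns, and the function name is invented for the example.

```python
def build_confusion_matrix(y_true, y_pred, max_label_threshold=256):
    # 1. Sorted unique classes from the target and prediction columns
    classes = sorted(set(y_true) | set(y_pred))
    # 2. Guard the profile size against high-cardinality label columns
    if len(classes) > max_label_threshold:
        raise ValueError("number of unique classes exceeds max_label_threshold")
    # 3. Map each class label to an encoded integer (its matrix index)
    class_map = {label: index for index, label in enumerate(classes)}
    # 4. Fill the n*n matrix cell by cell from paired target/prediction values
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for target, prediction in zip(y_true, y_pred):
        matrix[class_map[target]][class_map[prediction]] += 1
    return class_map, matrix
```

Rows index the target (true) class and columns the predicted class, matching the class_map/matrix attributes documented above.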
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) ShareableDatasetComponent ¶
Create a ConfusionMatrixComponent using the configuration and kwargs
Parameters¶
config : Configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
features_metadata: Contains the input schema for each feature
- get_confusion_between(category_a: str, category_b: str) int ¶
Get the confusion between two categories; if either of them does not exist in the matrix, 0 is returned.
Parameters¶
- category_a : str
First category
- category_b : str
Second category
Returns¶
Confusion between two categories
- get_confusion_matrix() MultiClassConfusionMatrix ¶
Get the Confusion Matrix along with all labels.
Returns¶
The Confusion Matrix and the list of all labels, as a MultiClassConfusionMatrix.
- matrix: List[List[int]]¶
- max_label_threshold: int = 256¶
- merge(other: ConfusionMatrixComponent, **kwargs: Any) ConfusionMatrixComponent ¶
Merge two ConfusionMatrixComponent into one, without mutating either. 1. Get the unique classes from both partitions. 2. Recreate the class map based on the merged classes. 3. Iterate over all N*N class pairs, get the confusion between each pair of classes, and add them to create the new N*N matrix.
Parameters¶
- other : ConfusionMatrixComponent
Other ConfusionMatrixComponent that needs to be merged.
Returns¶
- ConfusionMatrixComponent
A new instance of ConfusionMatrixComponent
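The merge steps can be sketched as follows. This is a hedged illustration operating on plain (class_map, matrix) pairs rather than the actual component class; the function and helper names are invented for the example.

```python
def merge_confusion_matrices(class_map_a, matrix_a, class_map_b, matrix_b):
    # 1. Unique classes from both partitions
    classes = sorted(set(class_map_a) | set(class_map_b))
    # 2. Recreate the class map based on the merged classes
    merged_map = {label: i for i, label in enumerate(classes)}
    n = len(classes)
    merged = [[0] * n for _ in range(n)]

    def cell(class_map, matrix, a, b):
        # Confusion between two categories; 0 if either is absent
        if a not in class_map or b not in class_map:
            return 0
        return matrix[class_map[a]][class_map[b]]

    # 3. Sum the confusion between every pair of classes from both inputs
    for a in classes:
        for b in classes:
            merged[merged_map[a]][merged_map[b]] = (
                cell(class_map_a, matrix_a, a, b)
                + cell(class_map_b, matrix_b, a, b)
            )
    return merged_map, merged
```

Neither input is mutated; a fresh map and matrix are built, mirroring the documented contract.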
- prediction_column: str = 'y_predict'¶
- target_column: str = 'y_true'¶
mlm_insights.core.sdcs.framework_sdc module¶
Bases:
Enum
Enum to store all framework-specific SharableFeatureComponent classes
mlm_insights.core.sdcs.multi_threshold_confusion_matrix module¶
- class mlm_insights.core.sdcs.multi_threshold_confusion_matrix.MultiThresholdConfusionMatrix(positive_sketch: kll_doubles_sketch, negative_sketch: kll_doubles_sketch, target_column: str = 'y_true', prediction_score_column: str = 'y_score', positive_label: str | int = 1, threshold_count: int = 100)¶
Bases:
ShareableDatasetComponent
,Serializable
MultiThresholdConfusionMatrix stores the frequency distribution of prediction score for both positive and negative labels.
These frequencies are then used to compute TP, TN, FP, and FN at different thresholds.
- compute(dataset: DataFrame, **kwargs: Any) None ¶
- This method computes the frequency distribution for both positive and negative labels using KLL sketches.
1. Iterate over all instances.
2. If the target value is the positive label, update the positive KLL sketch with the prediction score.
3. Otherwise, update the negative KLL sketch with the prediction score.
Parameters¶
- dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
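The routing of scores into the two distributions can be illustrated as below. The library accumulates into KLL sketches (datasketches); plain lists stand in here only to keep the sketch self-contained, and the function name is invented.

```python
def split_scores(y_true, y_score, positive_label=1):
    positive, negative = [], []
    for target, score in zip(y_true, y_score):
        if target == positive_label:
            positive.append(score)   # would update the positive KLL sketch
        else:
            negative.append(score)   # would update the negative KLL sketch
    return positive, negative
```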
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Create a MultiThresholdConfusionMatrix using the configuration and kwargs
Parameters¶
- config : Configuration as dictionary
Required configurations: positive_label: Positive Label
- kwargs: Key-value pairs for dynamic arguments. The current kwargs contain:
features_metadata: Contains the input schema for each feature
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Create a new instance of MultiThresholdConfusionMatrix from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- MultiThresholdConfusionMatrix
New instance of MultiThresholdConfusionMatrix
- get_multi_threshold_metric_summary(**kwargs: Any) MultiThresholdMetricSummary ¶
Returns the MultiThresholdMetricSummary.
Returns¶
- MultiThresholdMetricSummary
Metrics at different thresholds.
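How TP, FP, TN, and FN fall out of the two score distributions at a given threshold can be sketched as below. With KLL sketches these counts would come from rank queries; sorted lists and bisect stand in here, and the function name is invented for the example.

```python
import bisect

def counts_at_threshold(positive_scores, negative_scores, threshold):
    pos = sorted(positive_scores)
    neg = sorted(negative_scores)
    # Positives scored at or above the threshold are true positives
    tp = len(pos) - bisect.bisect_left(pos, threshold)
    fn = len(pos) - tp   # positives scored below the threshold
    # Negatives scored at or above the threshold are false positives
    fp = len(neg) - bisect.bisect_left(neg, threshold)
    tn = len(neg) - fp
    return tp, fp, tn, fn
```

Sweeping the threshold over threshold_count evenly spaced values yields the per-threshold metric summary described above.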
- get_total_count() int ¶
- merge(other_metric: MultiThresholdConfusionMatrix, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Merge two MultiThresholdConfusionMatrix into one, without mutating either. This is done by merging both the positive and negative sketches.
Parameters¶
- other_metric : MultiThresholdConfusionMatrix
Other MultiThresholdConfusionMatrix that needs to be merged.
Returns¶
- MultiThresholdConfusionMatrix
A new instance of MultiThresholdConfusionMatrix
- negative_sketch: kll_doubles_sketch¶
- positive_label: str | int = 1¶
- positive_sketch: kll_doubles_sketch¶
- prediction_score_column: str = 'y_score'¶
- serialize(**kwargs: Any) bytes ¶
Serialize the MultiThresholdConfusionMatrix to bytes.
Returns¶
- bytes
Serialized output as bytes.
- target_column: str = 'y_true'¶
- threshold_count: int = 100¶
mlm_insights.core.sdcs.sdc_registry module¶
- class mlm_insights.core.sdcs.sdc_registry.SDCMetaData(klass: ~typing.Type[~mlm_insights.core.sdcs.interfaces.shareable_dataset_component.ShareableDatasetComponent], config: ~typing.Dict[str, ~typing.Any] = <factory>)¶
Bases:
object
SDCMetaData to store class type and config of ShareableDatasetComponent
- config: Dict[str, Any]¶
- get_hash() str ¶
Get the hash of the SDCMetaData. The hash value is derived from the md5 hash of SDCMetaData.config.
Returns¶
- str
The calculated hash of the SDCMetaData.
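A plausible sketch of get_hash is shown below: an md5 digest derived from the config dict. The exact canonicalization the library uses is not documented here; sorted-key JSON is an assumption made for determinism, and the function name is invented.

```python
import hashlib
import json

def config_hash(config):
    # Canonicalize the config so equal dicts always hash identically
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```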
- klass: Type[ShareableDatasetComponent]¶
- class mlm_insights.core.sdcs.sdc_registry.SDCRegistry¶
Bases:
object
- add_sdc(sdc_metadata: SDCMetaData, **kwargs: Any) SDCRegistry ¶
- static create_from_sdc_map(sdc_map: Dict[str, ShareableDatasetComponent]) SDCRegistry ¶
Factory method to create an SDC Registry using an SDC Map. Use this method to create the SDC registry directly from the SDC map.
Parameters¶
- sdc_map : Dict[str, ShareableDatasetComponent]
Dictionary with the hash as key and ShareableDatasetComponent as value.
- static create_from_sdc_meta(sdc_metas: List[SDCMetaData]) SDCRegistry ¶
Factory method to create an SDC Registry from a list of SDC metadata. For each SDCMetaData, a hash is computed and a new SDC instance is created. If two metadata entries are identical, only one key is stored in the registry.
Parameters¶
- sdc_metas : List[SDCMetaData]
List of SDCMetaData
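The de-duplication described above can be sketched as follows: each metadata entry is keyed by its hash, so two identical entries collapse to a single registry slot. `make_hash` and `create_component` are hypothetical stand-ins for the registry's internals.

```python
def build_registry(sdc_metas, make_hash, create_component):
    registry = {}
    for meta in sdc_metas:
        key = make_hash(meta)
        if key not in registry:
            # Identical metadata hashes to the same key, so only the
            # first occurrence creates a component instance
            registry[key] = create_component(meta)
    return registry
```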
- classmethod deserialize(sdc_registry_message: SDCRegistryMessage) SDCRegistry ¶
Deserialize the Protobuf message to an SDCRegistry
Returns¶
SDCRegistry
- get_sdc(sdc_meta: SDCMetaData) ShareableDatasetComponent ¶
Get the ShareableDatasetComponent from the SDCMetaData.
Parameters¶
sdc_meta : SDCMetaData
Returns¶
ShareableDatasetComponent
Raises¶
- KeyError
If the SDCMetaData is not found in the Registry, a KeyError is raised.
- get_sdc_map() Dict[str, ShareableDatasetComponent] ¶
- get_sdcs() Any ¶
mlm_insights.core.sdcs.cpc_sdc module¶
- class mlm_insights.core.sdcs.cpc_sdc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)¶
Bases:
object
- estimate: int¶
- lower_bound: float¶
- upper_bound: float¶
- class mlm_insights.core.sdcs.cpc_sdc.CompressedProbabilityCountingSDC(sketch: cpc_sketch, log_k: int = 11, column_list: List[str] | None = None)¶
Bases:
ShareableDatasetComponent
,Serializable
- Provides estimates in a single pass for:
the cardinality of the input (with associated lower and upper bounds)
Reference: https://datasketches.apache.org/docs/CPC/CPC.html CompressedProbabilityCountingSDC contains only one state, i.e. sketch: datasketches.cpc_sketch.
- Note:
Use the create method instead of the constructor
- column_list: List[str] | None = None¶
- compute(dataset: DataFrame, **kwargs: Any) None ¶
This method computes the CPC sketch for the dataset. 1. Get the dataframe of the dataset based on the column list. 2. Iterate the dataframe and update the CPC sketch.
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CompressedProbabilityCountingSDC ¶
- Factory method to create a CompressedProbabilityCountingSDC. Supported configurable parameters:
DEFAULT_LOG_K: K-value used to initialize the cpc_sketch
Returns¶
- CompressedProbabilityCountingSDC
An Instance of CompressedProbabilityCountingSDC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSDC ¶
Create a new instance of CompressedProbabilityCountingSDC from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- CompressedProbabilityCountingSDC
New instance of CompressedProbabilityCountingSDC
- get_cardinality() CardinalityItem ¶
Returns the cardinality of the input data, with associated lower and upper bounds.
Returns¶
CardinalityItem
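The semantics of get_cardinality can be illustrated with an exact stand-in. A CPC sketch answers the same question approximately in bounded memory; a set gives the exact answer here, so the lower and upper bounds collapse to the estimate. The function name is invented for the example, and CardinalityItem is re-declared locally to keep the sketch self-contained.

```python
from dataclasses import dataclass

@dataclass
class CardinalityItem:
    estimate: int
    lower_bound: float
    upper_bound: float

def exact_cardinality(values):
    # Exact distinct count; a CPC sketch would return an estimate
    # with non-trivial lower/upper bounds instead
    n = len(set(values))
    return CardinalityItem(estimate=n, lower_bound=float(n), upper_bound=float(n))
```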
- log_k: int = 11¶
- merge(other: CompressedProbabilityCountingSDC, **kwargs: Any) CompressedProbabilityCountingSDC ¶
Merge two CompressedProbabilityCountingSDC into one with the help of a CPC union. The CPC union is updated with both CompressedProbabilityCountingSDC instances to return a merged CompressedProbabilityCountingSDC.
Parameters¶
- other : CompressedProbabilityCountingSDC
Other CompressedProbabilityCountingSDC that needs to be merged.
Returns¶
- CompressedProbabilityCountingSDC
A new instance of CompressedProbabilityCountingSDC after merging.
- serialize(**kwargs: Any) bytes ¶
Serialize the CompressedProbabilityCountingSDC to bytes. Since it has only one state, i.e. the cpc_sketch, the default serialization of datasketches is used.
Returns¶
- bytes
Serialized output as bytes.
- sketch: cpc_sketch¶