mlm_insights.core.sdcs package¶
Subpackages¶
- mlm_insights.core.sdcs.confusion_matrix package
- Submodules
- mlm_insights.core.sdcs.confusion_matrix.multi_class_confusion_matrix module
ConfusionMatrixCounts
MultiClassConfusionMatrix
MultiClassConfusionMatrix.class_map
MultiClassConfusionMatrix.get_average_by_weights()
MultiClassConfusionMatrix.get_classes()
MultiClassConfusionMatrix.get_counts()
MultiClassConfusionMatrix.get_encoded_value()
MultiClassConfusionMatrix.get_false_negative_counts()
MultiClassConfusionMatrix.get_false_positive_counts()
MultiClassConfusionMatrix.get_number_of_classes()
MultiClassConfusionMatrix.get_precision_scores_per_class()
MultiClassConfusionMatrix.get_recall_scores_per_class()
MultiClassConfusionMatrix.get_true_negative_counts()
MultiClassConfusionMatrix.get_true_positive_counts()
MultiClassConfusionMatrix.is_binary_classes()
MultiClassConfusionMatrix.matrix
- mlm_insights.core.sdcs.confusion_matrix.multi_threshold_metric_summary module
- mlm_insights.core.sdcs.interfaces package
Submodules¶
mlm_insights.core.sdcs.confusion_matrix_component module¶
- class mlm_insights.core.sdcs.confusion_matrix_component.ConfusionMatrixComponent(target_column: str = 'y_true', prediction_column: str = 'y_predict', class_map: ~typing.Dict[str, int] = <factory>, matrix: ~typing.List[~typing.List[int]] = <factory>, max_label_threshold: int = 256)¶
Bases:
ShareableDatasetComponent
ConfusionMatrixComponent stores the confusion matrix between the Target column and the Prediction column, along with all labels. This can be used for both binary and multiclass classification.
- class_map: Dict[str, int]¶
- compute(dataset: DataFrame, **kwargs: Any) None ¶
- This method computes the confusion matrix and the list of unique classes for a partition.
1. Get a sorted list of unique classes from the target and prediction columns.
2. Check whether the number of unique classes exceeds the max threshold. This ensures that the resulting profile does not exceed a certain size.
3. Create a map of class to encoded integer value.
4. Iterate the target and prediction columns to fill the cell values in the n*n matrix.
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
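The compute steps above can be sketched as follows. This is an illustrative outline, not the library's implementation; plain lists stand in for the DataFrame columns, and the function name is invented for the example.

```python
def build_confusion_matrix(y_true, y_pred, max_label_threshold=256):
    # 1. Sorted unique classes from the target and prediction columns
    classes = sorted(set(y_true) | set(y_pred))
    # 2. Guard the profile size against high-cardinality label columns
    if len(classes) > max_label_threshold:
        raise ValueError("number of unique classes exceeds max_label_threshold")
    # 3. Map each class label to an encoded integer (its matrix index)
    class_map = {label: index for index, label in enumerate(classes)}
    # 4. Fill the n*n matrix cell by cell from paired target/prediction values
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for target, prediction in zip(y_true, y_pred):
        matrix[class_map[target]][class_map[prediction]] += 1
    return class_map, matrix
```

Rows index the target (true) class and columns the predicted class, matching the class_map/matrix attributes documented above.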
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) ShareableDatasetComponent ¶
Create a ConfusionMatrixComponent using the configuration and kwargs
Parameters¶
config : Configuration
kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
features_metadata: Contains the input schema for each feature
- get_confusion_between(category_a: str, category_b: str) int ¶
Get the confusion between two categories; if either of them does not exist in the matrix, 0 is returned.
Parameters¶
- category_a : str
First category
- category_b : str
Second category
Returns¶
Confusion between two categories
- get_confusion_matrix() MultiClassConfusionMatrix ¶
Get the Confusion Matrix along with all labels.
Returns¶
The Confusion Matrix and the list of all labels, as a MultiClassConfusionMatrix.
- matrix: List[List[int]]¶
- max_label_threshold: int = 256¶
- merge(other: ConfusionMatrixComponent, **kwargs: Any) ConfusionMatrixComponent ¶
Merge two ConfusionMatrixComponent into one, without mutating either. 1. Get the unique classes from both partitions. 2. Recreate the class map based on the merged classes. 3. Iterate over all N*N class pairs, get the confusion between each pair of classes, and add them to create the new N*N matrix.
Parameters¶
- other : ConfusionMatrixComponent
Other ConfusionMatrixComponent that needs to be merged.
Returns¶
- ConfusionMatrixComponent
A new instance of ConfusionMatrixComponent
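The merge steps can be sketched as follows. This is a hedged illustration operating on plain (class_map, matrix) pairs rather than the actual component class; the function and helper names are invented for the example.

```python
def merge_confusion_matrices(class_map_a, matrix_a, class_map_b, matrix_b):
    # 1. Unique classes from both partitions
    classes = sorted(set(class_map_a) | set(class_map_b))
    # 2. Recreate the class map based on the merged classes
    merged_map = {label: i for i, label in enumerate(classes)}
    n = len(classes)
    merged = [[0] * n for _ in range(n)]

    def cell(class_map, matrix, a, b):
        # Confusion between two categories; 0 if either is absent
        if a not in class_map or b not in class_map:
            return 0
        return matrix[class_map[a]][class_map[b]]

    # 3. Sum the confusion between every pair of classes from both inputs
    for a in classes:
        for b in classes:
            merged[merged_map[a]][merged_map[b]] = (
                cell(class_map_a, matrix_a, a, b)
                + cell(class_map_b, matrix_b, a, b)
            )
    return merged_map, merged
```

Neither input is mutated; a fresh map and matrix are built, mirroring the documented contract.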
- prediction_column: str = 'y_predict'¶
- target_column: str = 'y_true'¶
mlm_insights.core.sdcs.framework_sdc module¶
Bases:
Enum
Enum to store all framework-specific SharableFeatureComponent classes
mlm_insights.core.sdcs.multi_threshold_confusion_matrix module¶
- class mlm_insights.core.sdcs.multi_threshold_confusion_matrix.MultiThresholdConfusionMatrix(positive_sketch: kll_doubles_sketch, negative_sketch: kll_doubles_sketch, target_column: str = 'y_true', prediction_score_column: str = 'y_score', positive_label: str | int = 1, threshold_count: int = 100)¶
Bases:
ShareableDatasetComponent
,Serializable
MultiThresholdConfusionMatrix stores the frequency distribution of prediction score for both positive and negative labels.
These frequencies are then used to compute TP, TN, FP, and FN at different thresholds.
- compute(dataset: DataFrame, **kwargs: Any) None ¶
- This method computes the frequency distribution for both positive and negative labels using KLL sketches.
1. Iterate over all instances.
2. If the target value is the positive label, update the positive KLL sketch with the prediction score.
3. Otherwise, update the negative KLL sketch with the prediction score.
Parameters¶
- dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
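The routing of scores into the two distributions can be illustrated as below. The library accumulates into KLL sketches (datasketches); plain lists stand in here only to keep the sketch self-contained, and the function name is invented.

```python
def split_scores(y_true, y_score, positive_label=1):
    positive, negative = [], []
    for target, score in zip(y_true, y_score):
        if target == positive_label:
            positive.append(score)   # would update the positive KLL sketch
        else:
            negative.append(score)   # would update the negative KLL sketch
    return positive, negative
```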
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Create a MultiThresholdConfusionMatrix using the configuration and kwargs
Parameters¶
- config : Configuration as dictionary
Required configurations: positive_label: Positive Label
- kwargs: Key-value pairs for dynamic arguments. The current kwargs contain:
features_metadata: Contains the input schema for each feature
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Create a new instance of MultiThresholdConfusionMatrix from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- MultiThresholdConfusionMatrix
New instance of MultiThresholdConfusionMatrix
- get_multi_threshold_metric_summary(**kwargs: Any) MultiThresholdMetricSummary ¶
Returns the MultiThresholdMetricSummary.
Returns¶
- MultiThresholdMetricSummary
Metrics at different thresholds.
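How TP, FP, TN, and FN fall out of the two score distributions at a given threshold can be sketched as below. With KLL sketches these counts would come from rank queries; sorted lists and bisect stand in here, and the function name is invented for the example.

```python
import bisect

def counts_at_threshold(positive_scores, negative_scores, threshold):
    pos = sorted(positive_scores)
    neg = sorted(negative_scores)
    # Positives scored at or above the threshold are true positives
    tp = len(pos) - bisect.bisect_left(pos, threshold)
    fn = len(pos) - tp   # positives scored below the threshold
    # Negatives scored at or above the threshold are false positives
    fp = len(neg) - bisect.bisect_left(neg, threshold)
    tn = len(neg) - fp
    return tp, fp, tn, fn
```

Sweeping the threshold over threshold_count evenly spaced values yields the per-threshold metric summary described above.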
- get_total_count() int ¶
- merge(other_metric: MultiThresholdConfusionMatrix, **kwargs: Any) MultiThresholdConfusionMatrix ¶
Merge two MultiThresholdConfusionMatrix into one, without mutating either. This is done by merging both the positive and negative sketches.
Parameters¶
- other_metric : MultiThresholdConfusionMatrix
Other MultiThresholdConfusionMatrix that needs to be merged.
Returns¶
- MultiThresholdConfusionMatrix
A new instance of MultiThresholdConfusionMatrix
- negative_sketch: kll_doubles_sketch¶
- positive_label: str | int = 1¶
- positive_sketch: kll_doubles_sketch¶
- prediction_score_column: str = 'y_score'¶
- serialize(**kwargs: Any) bytes ¶
Serialize the MultiThresholdConfusionMatrix to bytes.
Returns¶
- bytes
Serialized output as bytes.
- target_column: str = 'y_true'¶
- threshold_count: int = 100¶
mlm_insights.core.sdcs.sdc_registry module¶
- class mlm_insights.core.sdcs.sdc_registry.SDCMetaData(klass: ~typing.Type[~mlm_insights.core.sdcs.interfaces.shareable_dataset_component.ShareableDatasetComponent], config: ~typing.Dict[str, ~typing.Any] = <factory>)¶
Bases:
object
SDCMetaData to store class type and config of ShareableDatasetComponent
- config: Dict[str, Any]¶
- get_hash() str ¶
Get the hash of the SDCMetaData. The hash value is derived from the md5 hash of SDCMetaData.config.
Returns¶
- str
The calculated hash of the SDCMetaData.
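A plausible sketch of get_hash is shown below: an md5 digest derived from the config dict. The exact canonicalization the library uses is not documented here; sorted-key JSON is an assumption made for determinism, and the function name is invented.

```python
import hashlib
import json

def config_hash(config):
    # Canonicalize the config so equal dicts always hash identically
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```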
- klass: Type[ShareableDatasetComponent]¶
- class mlm_insights.core.sdcs.sdc_registry.SDCRegistry¶
Bases:
object
- add_sdc(sdc_metadata: SDCMetaData, **kwargs: Any) SDCRegistry ¶
- static create_from_sdc_map(sdc_map: Dict[str, ShareableDatasetComponent]) SDCRegistry ¶
Factory method to create an SDC Registry using an SDC Map. Use this method to create the SDC registry directly from the SDC map.
Parameters¶
- sdc_map : Dict[str, ShareableDatasetComponent]
Dictionary with the hash as key and ShareableDatasetComponent as value.
- static create_from_sdc_meta(sdc_metas: List[SDCMetaData]) SDCRegistry ¶
Factory method to create an SDC Registry from a list of SDC metadata. For each SDCMetaData, a hash is computed and a new SDC instance is created. If two metadata entries are identical, only one key is stored in the registry.
Parameters¶
- sdc_metas : List[SDCMetaData]
List of SDCMetaData
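The de-duplication described above can be sketched as follows: each metadata entry is keyed by its hash, so two identical entries collapse to a single registry slot. `make_hash` and `create_component` are hypothetical stand-ins for the registry's internals.

```python
def build_registry(sdc_metas, make_hash, create_component):
    registry = {}
    for meta in sdc_metas:
        key = make_hash(meta)
        if key not in registry:
            # Identical metadata hashes to the same key, so only the
            # first occurrence creates a component instance
            registry[key] = create_component(meta)
    return registry
```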
- classmethod deserialize(sdc_registry_message: SDCRegistryMessage) SDCRegistry ¶
Deserialize the Protobuf message to an SDCRegistry
Returns¶
SDCRegistry
- get_sdc(sdc_meta: SDCMetaData) ShareableDatasetComponent ¶
Get the ShareableDatasetComponent from the SDCMetaData.
Parameters¶
sdc_meta : SDCMetaData
Returns¶
ShareableDatasetComponent
Raises¶
- KeyError
If the SDCMetaData is not found in the Registry, a KeyError is raised.
- get_sdc_map() Dict[str, ShareableDatasetComponent] ¶
- get_sdcs() Any ¶
mlm_insights.core.sdcs.cpc_sdc module¶
- class mlm_insights.core.sdcs.cpc_sdc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)¶
Bases:
object
- estimate: int¶
- lower_bound: float¶
- upper_bound: float¶
- class mlm_insights.core.sdcs.cpc_sdc.CompressedProbabilityCountingSDC(sketch: cpc_sketch, log_k: int = 11, column_list: List[str] | None = None)¶
Bases:
ShareableDatasetComponent
,Serializable
- Provides estimates in a single pass for:
the cardinality of the input (with associated lower and upper bounds)
Reference: https://datasketches.apache.org/docs/CPC/CPC.html CompressedProbabilityCountingSDC contains only one state, i.e. sketch: datasketches.cpc_sketch.
- Note:
Use the create method instead of the constructor
- column_list: List[str] | None = None¶
- compute(dataset: DataFrame, **kwargs: Any) None ¶
This method computes the CPC sketch for the dataset. 1. Get the dataframe of the dataset based on the column list. 2. Iterate the dataframe and update the CPC sketch.
Parameters¶
dataset : pd.DataFrame
DataFrame object for either the entire dataset or a partition on which a Metric is being computed.
- classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CompressedProbabilityCountingSDC ¶
- Factory method to create a CompressedProbabilityCountingSDC. Supported configurable parameters:
DEFAULT_LOG_K: K-value used to initialize the cpc_sketch
Returns¶
- CompressedProbabilityCountingSDC
An Instance of CompressedProbabilityCountingSDC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSDC ¶
Create a new instance of CompressedProbabilityCountingSDC from serialized bytes.
Parameters¶
- serialized_bytes : bytes
Serialized bytes as input.
Returns¶
- CompressedProbabilityCountingSDC
New instance of CompressedProbabilityCountingSDC
- get_cardinality() CardinalityItem ¶
Returns the cardinality of the input data, with associated lower and upper bounds.
Returns¶
CardinalityItem
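The semantics of get_cardinality can be illustrated with an exact stand-in. A CPC sketch answers the same question approximately in bounded memory; a set gives the exact answer here, so the lower and upper bounds collapse to the estimate. The function name is invented for the example, and CardinalityItem is re-declared locally to keep the sketch self-contained.

```python
from dataclasses import dataclass

@dataclass
class CardinalityItem:
    estimate: int
    lower_bound: float
    upper_bound: float

def exact_cardinality(values):
    # Exact distinct count; a CPC sketch would return an estimate
    # with non-trivial lower/upper bounds instead
    n = len(set(values))
    return CardinalityItem(estimate=n, lower_bound=float(n), upper_bound=float(n))
```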
- log_k: int = 11¶
- merge(other: CompressedProbabilityCountingSDC, **kwargs: Any) CompressedProbabilityCountingSDC ¶
Merge two CompressedProbabilityCountingSDC into one with the help of a CPC union. The CPC union is updated with both CompressedProbabilityCountingSDC instances to return a merged CompressedProbabilityCountingSDC.
Parameters¶
- other : CompressedProbabilityCountingSDC
Other CompressedProbabilityCountingSDC that needs to be merged.
Returns¶
- CompressedProbabilityCountingSDC
A new instance of CompressedProbabilityCountingSDC after merging.
- serialize(**kwargs: Any) bytes ¶
Serialize the CompressedProbabilityCountingSDC to bytes. Since it has only one state, i.e. the cpc_sketch, the default serialization of datasketches is used.
Returns¶
- bytes
Serialized output as bytes.
- sketch: cpc_sketch¶