mlm_insights.core.sdcs package

Subpackages

Submodules

mlm_insights.core.sdcs.confusion_matrix_component module

class mlm_insights.core.sdcs.confusion_matrix_component.ConfusionMatrixComponent(target_column: str = 'y_true', prediction_column: str = 'y_predict', class_map: ~typing.Dict[str, int] = <factory>, matrix: ~typing.List[~typing.List[int]] = <factory>, max_label_threshold: int = 256)

Bases: ShareableDatasetComponent

ConfusionMatrixComponent stores the confusion matrix between the target column and the prediction column, along with all labels. It can be used for both binary and multiclass classification.

class_map: Dict[str, int]
compute(dataset: DataFrame, **kwargs: Any) None
This method computes the confusion matrix and the list of unique classes for a partition.
  1. Get a sorted list of unique classes from the target and prediction columns.

  2. Check whether the number of unique classes exceeds the max threshold. This ensures that the resulting profile doesn’t exceed a certain size.

  3. Create a map of class: encoded integer value.

  4. Iterate over the target and prediction columns to fill the cell values in the n*n matrix.

Parameters

dataset : pd.DataFrame

DataFrame object for either the entire dataset or a partition on which a Metric is being computed
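The four steps listed for compute can be sketched in plain Python. This is an illustrative stand-in, not the mlm_insights implementation: the function name build_confusion_matrix, the use of ValueError when the threshold is exceeded, and the choice of rows for targets and columns for predictions are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def build_confusion_matrix(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    max_label_threshold: int = 256,
) -> Tuple[Dict[str, int], List[List[int]]]:
    """Hypothetical helper mirroring the documented steps."""
    # Step 1: sorted unique classes seen in either column.
    classes = sorted(set(y_true) | set(y_pred))

    # Step 2: refuse to build an oversized matrix.
    # Raising ValueError is an assumption; the library may react differently.
    if len(classes) > max_label_threshold:
        raise ValueError(f"{len(classes)} classes exceed {max_label_threshold}")

    # Step 3: map each class to its encoded integer (row/column index).
    class_map = {label: index for index, label in enumerate(classes)}

    # Step 4: count (target, prediction) pairs; rows are targets (assumed).
    size = len(classes)
    matrix = [[0] * size for _ in range(size)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[class_map[actual]][class_map[predicted]] += 1
    return class_map, matrix


class_map, matrix = build_confusion_matrix(
    ["cat", "dog", "cat", "bird"],
    ["cat", "cat", "cat", "dog"],
)
```

The threshold check runs before the matrix is allocated, so an unexpectedly high-cardinality column fails early instead of producing an n*n structure of unbounded size.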

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) ShareableDatasetComponent

Create a ConfusionMatrixComponent using the configuration and kwargs

Parameters

config : Configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

  • features_metadata: Contains input schema for each feature

get_confusion_between(category_a: str, category_b: str) int

Get the confusion between two categories. If either of them does not exist in the matrix, 0 is returned.

Parameters

category_a : str

First category

category_b : str

Second category

Returns

Confusion between two categories

get_confusion_matrix() MultiClassConfusionMatrix

Get the Confusion Matrix along with all labels.

Returns

Tuple of Confusion Matrix and list of all labels.

matrix: List[List[int]]
max_label_threshold: int = 256
merge(other: ConfusionMatrixComponent, **kwargs: Any) ConfusionMatrixComponent

Merge two ConfusionMatrixComponent instances into one, without mutating either of them.
  1. Get the unique classes from both partitions.

  2. Recreate the class map based on the merged classes.

  3. Iterate over all N*N class pairs, get the confusion between each pair from both components, and add them to create the new N*N matrix.

Parameters

other : ConfusionMatrixComponent

Other ConfusionMatrixComponent that needs to be merged.

Returns

ConfusionMatrixComponent

A new instance of ConfusionMatrixComponent
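A minimal sketch of the merge steps, operating on plain (class_map, matrix) tuples rather than on real ConfusionMatrixComponent objects. The names Component, confusion_between and merge_components are hypothetical; the lookup returns 0 for unknown labels, matching the behaviour described for get_confusion_between.

```python
from typing import Dict, List, Tuple

Component = Tuple[Dict[str, int], List[List[int]]]  # (class_map, matrix)


def confusion_between(component: Component, a: str, b: str) -> int:
    """Return the cell for (a, b), or 0 if either label is unknown."""
    class_map, matrix = component
    if a not in class_map or b not in class_map:
        return 0
    return matrix[class_map[a]][class_map[b]]


def merge_components(left: Component, right: Component) -> Component:
    # Step 1: union of the classes seen in either partition.
    classes = sorted(set(left[0]) | set(right[0]))
    # Step 2: rebuild the class map over the merged classes.
    class_map = {label: index for index, label in enumerate(classes)}
    # Step 3: sum the per-partition counts for every pair of classes.
    matrix = [
        [confusion_between(left, a, b) + confusion_between(right, a, b) for b in classes]
        for a in classes
    ]
    return class_map, matrix  # new objects; the inputs are left untouched


part_1: Component = ({"cat": 0, "dog": 1}, [[5, 2], [1, 7]])
part_2: Component = ({"bird": 0, "cat": 1}, [[3, 1], [0, 4]])
merged_map, merged_matrix = merge_components(part_1, part_2)
```

Because a missing label contributes 0, the two partitions do not need to share the same label set or the same integer encoding before they are merged.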

prediction_column: str = 'y_predict'
target_column: str = 'y_true'

mlm_insights.core.sdcs.framework_sdc module

class mlm_insights.core.sdcs.framework_sdc.FrameworkShareableDatasetComponent(value)

Bases: Enum

Enum to store all framework-specific ShareableDatasetComponent classes

CompressedProbabilityCountingSDC = <class 'mlm_insights.core.sdcs.cpc_sdc.CompressedProbabilityCountingSDC'>
ConfusionMatrixComponent = <class 'mlm_insights.core.sdcs.confusion_matrix_component.ConfusionMatrixComponent'>
MultiThresholdConfusionMatrix = <class 'mlm_insights.core.sdcs.multi_threshold_confusion_matrix.MultiThresholdConfusionMatrix'>

mlm_insights.core.sdcs.multi_threshold_confusion_matrix module

class mlm_insights.core.sdcs.multi_threshold_confusion_matrix.MultiThresholdConfusionMatrix(positive_sketch: kll_doubles_sketch, negative_sketch: kll_doubles_sketch, target_column: str = 'y_true', prediction_score_column: str = 'y_score', positive_label: str | int = 1, threshold_count: int = 100)

Bases: ShareableDatasetComponent, Serializable

MultiThresholdConfusionMatrix stores the frequency distribution of prediction score for both positive and negative labels.

These frequencies are then used to compute TP, TN, FP, FN at different thresholds.

compute(dataset: DataFrame, **kwargs: Any) None
This method computes the frequency distribution for both positive and negative labels using KLL sketches.
  1. Iterate over all instances.

  2. If the target value is the positive label, update the positive KLL sketch with the prediction score.

  3. Otherwise, update the negative KLL sketch with the prediction score.

Parameters

dataset : pd.DataFrame
  • DataFrame object for either the entire dataset or a partition on which a Metric is being computed
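The idea behind this component can be illustrated without a KLL sketch: keep the prediction scores of positive and negative instances separately, then count how many fall on each side of a threshold. The sketch below uses exact sorted lists as a stand-in for kll_doubles_sketch, so it shows the logic only, not the approximate data structure the class actually holds. The helper names and the convention that a score greater than or equal to the threshold counts as a positive prediction are assumptions.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple, Union

Label = Union[str, int]


def split_scores(
    targets: Sequence[Label],
    scores: Sequence[float],
    positive_label: Label = 1,
) -> Tuple[List[float], List[float]]:
    """Route each score to the positive or negative distribution."""
    positives: List[float] = []
    negatives: List[float] = []
    for target, score in zip(targets, scores):
        (positives if target == positive_label else negatives).append(score)
    return sorted(positives), sorted(negatives)


def counts_at(
    threshold: float, positives: List[float], negatives: List[float]
) -> Dict[str, int]:
    # Scores >= threshold are treated as positive predictions (assumed).
    fn = bisect_left(positives, threshold)  # positives scored below threshold
    tn = bisect_left(negatives, threshold)  # negatives scored below threshold
    return {
        "tp": len(positives) - fn,
        "fn": fn,
        "fp": len(negatives) - tn,
        "tn": tn,
    }


positives, negatives = split_scores([1, 0, 1, 0, 1], [0.9, 0.4, 0.35, 0.1, 0.8])
at_half = counts_at(0.5, positives, negatives)
```

Calling counts_at once per threshold yields the TP, TN, FP and FN values at each of them; a quantile sketch answers the same rank queries approximately in bounded memory.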

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) MultiThresholdConfusionMatrix

Create a MultiThresholdConfusionMatrix using the configuration and kwargs

Parameters

config : Configuration as dictionary

Required Configurations:
  • positive_label: Positive Label

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:
  • features_metadata: Contains input schema for each feature

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) MultiThresholdConfusionMatrix

Create a new instance of MultiThresholdConfusionMatrix from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

MultiThresholdConfusionMatrix

New instance of MultiThresholdConfusionMatrix

get_multi_threshold_metric_summary(**kwargs: Any) MultiThresholdMetricSummary

Return MultiThresholdMetricSummary

Returns

MultiThresholdMetricSummary

Metrics at different thresholds.

get_total_count() int
merge(other_metric: MultiThresholdConfusionMatrix, **kwargs: Any) MultiThresholdConfusionMatrix

Merge two MultiThresholdConfusionMatrix instances into one, without mutating either of them. This is done by merging the sketches for both the positive and negative labels.

Parameters

other_metric : MultiThresholdConfusionMatrix

Other MultiThresholdConfusionMatrix that needs to be merged.

Returns

MultiThresholdConfusionMatrix

A new instance of MultiThresholdConfusionMatrix

negative_sketch: kll_doubles_sketch
positive_label: str | int = 1
positive_sketch: kll_doubles_sketch
prediction_score_column: str = 'y_score'
serialize(**kwargs: Any) bytes

Serialize the MultiThresholdConfusionMatrix to bytes.

Returns

bytes

Serialized output as bytes.

target_column: str = 'y_true'
threshold_count: int = 100

mlm_insights.core.sdcs.sdc_registry module

class mlm_insights.core.sdcs.sdc_registry.SDCMetaData(klass: ~typing.Type[~mlm_insights.core.sdcs.interfaces.shareable_dataset_component.ShareableDatasetComponent], config: ~typing.Dict[str, ~typing.Any] = <factory>)

Bases: object

SDCMetaData to store class type and config of ShareableDatasetComponent

config: Dict[str, Any]
get_hash() str

Get the hash of the SDCMetaData. The hash value is derived from the MD5 hash of SDCMetaData.config.

Returns

str

The calculated hash of the SDCMetaData.
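As a rough illustration of a config-derived MD5 hash, the sketch below hashes a key-sorted JSON encoding of a configuration dictionary. The function name config_hash and the JSON encoding are assumptions; how the library actually serializes the config before hashing is not stated here.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Hypothetical helper: MD5 hex digest of a canonical config encoding."""
    # Sorting keys makes the digest independent of insertion order.
    # JSON is an assumed encoding, not necessarily the library's.
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


hash_a = config_hash({"target_column": "y_true", "threshold_count": 100})
hash_b = config_hash({"threshold_count": 100, "target_column": "y_true"})
hash_c = config_hash({"target_column": "y_true", "threshold_count": 50})
```

Here hash_a and hash_b are equal because only the key order differs, while hash_c differs because a value changed.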

klass: Type[ShareableDatasetComponent]
class mlm_insights.core.sdcs.sdc_registry.SDCRegistry

Bases: object

add_sdc(sdc_metadata: SDCMetaData, **kwargs: Any) SDCRegistry
static create_from_sdc_map(sdc_map: Dict[str, ShareableDatasetComponent]) SDCRegistry

Factory method to create an SDC Registry using an SDC map. Use this method to create the SDC registry directly from the SDC map.

Parameters

sdc_map : Dict[str, ShareableDatasetComponent]

Dictionary with the hash as the key and the ShareableDatasetComponent as the value.

static create_from_sdc_meta(sdc_metas: List[SDCMetaData]) SDCRegistry

Factory method to create an SDC Registry using a list of SDC metadata. For each SDC metadata, a hash is created and a new instance of the SDC is created. If two metadata entries are the same, only one key is stored.

Parameters

sdc_metas : List[SDCMetaData]

List of SDCMetaData

classmethod deserialize(sdc_registry_message: SDCRegistryMessage) SDCRegistry

Deserialize the Protobuf message into an SDCRegistry

Returns

SDCRegistry

get_sdc(sdc_meta: SDCMetaData) ShareableDatasetComponent

Get the ShareableDatasetComponent from the SDCMetaData.

Parameters

sdc_meta : SDCMetaData

Returns

ShareableDatasetComponent

Raises

KeyError

Raised if the SDCMetaData is not found in the registry.
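The registry behaviour described above (one entry per distinct metadata hash, and a KeyError for unknown metadata) can be sketched with small stand-in classes. Meta, Registry and DummyComponent are hypothetical names for this example and do not reproduce the real SDCMetaData or SDCRegistry; the hash encoding is likewise an assumption.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DummyComponent:  # hypothetical component used only for this sketch
    column: str = "y_true"


@dataclass
class Meta:  # hypothetical stand-in for SDCMetaData
    klass: type
    config: Dict[str, Any] = field(default_factory=dict)

    def get_hash(self) -> str:
        # Assumed encoding: MD5 over the key-sorted JSON form of config.
        payload = json.dumps(self.config, sort_keys=True, default=str)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()


class Registry:  # hypothetical stand-in for SDCRegistry
    def __init__(self) -> None:
        self.sdc_map: Dict[str, Any] = {}

    @staticmethod
    def create_from_meta(metas: List[Meta]) -> "Registry":
        registry = Registry()
        for meta in metas:
            key = meta.get_hash()
            if key not in registry.sdc_map:  # identical metadata share one entry
                registry.sdc_map[key] = meta.klass(**meta.config)
        return registry

    def get(self, meta: Meta) -> Any:
        return self.sdc_map[meta.get_hash()]  # raises KeyError when absent


registry = Registry.create_from_meta([
    Meta(DummyComponent, {"column": "a"}),
    Meta(DummyComponent, {"column": "a"}),  # duplicate of the first entry
    Meta(DummyComponent, {"column": "b"}),
])
```

Keying the map by hash is what lets several metrics that request the same component configuration share a single computed instance.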

get_sdc_map() Dict[str, ShareableDatasetComponent]
get_sdcs() Any
serialize() SDCRegistryMessage

Serialize the SDCRegistry to a Protobuf message.

Returns

SDCRegistryMessage

mlm_insights.core.sdcs.cpc_sdc module

class mlm_insights.core.sdcs.cpc_sdc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)

Bases: object

estimate: int
lower_bound: float
upper_bound: float
class mlm_insights.core.sdcs.cpc_sdc.CompressedProbabilityCountingSDC(sketch: cpc_sketch, log_k: int = 11, column_list: List[str] | None = None)

Bases: ShareableDatasetComponent, Serializable

Provides estimates in a single pass for:
  • cardinality (with associated lower and upper bounds)

Reference: https://datasketches.apache.org/docs/CPC/CPC.html

CompressedProbabilityCountingSDC contains only one piece of state, i.e. sketch: datasketches.cpc_sketch.

Note:

Use the create method instead of the constructor.

column_list: List[str] = None
compute(dataset: DataFrame, **kwargs: Any) None

This method computes the CPC sketch for the dataset.
  1. Get the DataFrame of the dataset based on the column list.

  2. Iterate over the DataFrame and update the CPC sketch.

Parameters

dataset : pd.DataFrame

DataFrame object for either the entire dataset or a partition on which a Metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) CompressedProbabilityCountingSDC
Factory method to create a CompressedProbabilityCountingSDC. Supported configurable parameters:

  • DEFAULT_LOG_K: K-value to initialize cpc_sketch

Returns

CompressedProbabilityCountingSDC

An Instance of CompressedProbabilityCountingSDC.

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSDC

Create a new instance of CompressedProbabilityCountingSDC from serialized bytes.

Parameters

serialized_bytes : bytes

Serialized bytes as input.

Returns

CompressedProbabilityCountingSDC

New instance of CompressedProbabilityCountingSDC

get_cardinality() CardinalityItem

Returns the cardinality of input data, with associated lower and upper bound.

Returns

CardinalityItem

log_k: int = 11
merge(other: CompressedProbabilityCountingSDC, **kwargs: Any) CompressedProbabilityCountingSDC

Merge two CompressedProbabilityCountingSDC instances into one with the help of a CPC union. The CPC union is updated with both CompressedProbabilityCountingSDC instances and returns a merged CompressedProbabilityCountingSDC.

Parameters

other : CompressedProbabilityCountingSDC

Other CompressedProbabilityCountingSDC that needs to be merged.

Returns

CompressedProbabilityCountingSDC

A new instance of CompressedProbabilityCountingSDC after merging.
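The compute and merge contract can be illustrated with an exact, set-based stand-in. This is not a CPC sketch: a real sketch returns an approximate estimate with lower and upper bounds and uses bounded memory, whereas the hypothetical ExactCardinality class below stores every distinct value. It only shows why merging per-partition summaries counts overlapping values once.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class ExactCardinality:  # hypothetical stand-in for a CPC-backed component
    seen: Set[str] = field(default_factory=set)

    def compute(self, values: Iterable[object]) -> None:
        # Update the summary with every value of the partition.
        for value in values:
            self.seen.add(str(value))

    def merge(self, other: "ExactCardinality") -> "ExactCardinality":
        # Union of both summaries; neither input is mutated.
        return ExactCardinality(self.seen | other.seen)

    def estimate(self) -> int:
        return len(self.seen)


partition_1 = ExactCardinality()
partition_1.compute(["x", "y", "z"])
partition_2 = ExactCardinality()
partition_2.compute(["y", "z", "w"])
merged = partition_1.merge(partition_2)
```

The merged summary reports four distinct values rather than six, which is the property a union-based merge is meant to preserve across partitions.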

serialize(**kwargs: Any) bytes

Serialize the CompressedProbabilityCountingSDC to bytes. Since it has only one piece of state, i.e. cpc_sketch, the default serialization of datasketches is used.

Returns

bytes

Serialized output as bytes.

sketch: cpc_sketch