mlm_insights.core.sfcs package¶
Subpackages¶
Submodules¶
mlm_insights.core.sfcs.cpc_sfc module¶
- class mlm_insights.core.sfcs.cpc_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)¶
Bases:
object
- estimate: int¶
- lower_bound: float¶
- upper_bound: float¶
- class mlm_insights.core.sfcs.cpc_sfc.CompressedProbabilityCountingSFC(sketch: cpc_sketch, log_k: int = 11)¶
Bases:
ShareableFeatureComponent
,Serializable
Provides a single-pass cardinality estimate, with associated lower and upper bounds.
Reference: https://datasketches.apache.org/docs/CPC/CPC.html
CompressedProbabilityCountingSFC contains only one state, i.e., sketch: datasketches.cpc_sketch.
- Note:
Use the create method instead of the constructor.
- compute(column: Series, **kwargs: Any) None ¶
Update the state of the CompressedProbabilityCountingSFC using input series.
Parameters¶
- columnpd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) CompressedProbabilityCountingSFC ¶
Factory method to create a CompressedProbabilityCountingSFC. Supported configurable parameters:
DEFAULT_LOG_K: K-value to initialize cpc_sketch, default = 11
Returns¶
- CompressedProbabilityCountingSFC
An Instance of CompressedProbabilityCountingSFC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) CompressedProbabilityCountingSFC ¶
Create a new instance of CompressedProbabilityCountingSFC from serialized bytes.
Parameters¶
- serialized_bytesbytes
Serialized bytes as input.
Returns¶
- CompressedProbabilityCountingSFC
New instance of CompressedProbabilityCountingSFC.
- get_cardinality() CardinalityItem ¶
Returns the cardinality of input data, with associated lower and upper bound.
Returns¶
CardinalityItem
- log_k: int = 11¶
- merge(other: CompressedProbabilityCountingSFC, **kwargs: Any) CompressedProbabilityCountingSFC ¶
Merge two CompressedProbabilityCountingSFC instances into one with the help of a CPC union. The CPC union is updated with both CompressedProbabilityCountingSFC instances and returns a merged CompressedProbabilityCountingSFC.
Parameters¶
- otherCompressedProbabilityCountingSFC
Other CompressedProbabilityCountingSFC to be merged.
Returns¶
- CompressedProbabilityCountingSFC
A new instance of CompressedProbabilityCountingSFC after merging.
- serialize(**kwargs: Any) bytes ¶
Serialize the CompressedProbabilityCountingSFC to bytes. Since it has only one state, i.e., cpc_sketch, the default serialization of datasketches is used.
Returns¶
- bytes
Serialized bytes of the CompressedProbabilityCountingSFC.
- sketch: cpc_sketch¶
mlm_insights.core.sfcs.descriptive_statistics_sfc module¶
- class mlm_insights.core.sfcs.descriptive_statistics_sfc.DescriptiveStatisticsSFC(total_count: int, mean: float, minimum: float, maximum: float, central_moments: List[float])¶
Bases:
ShareableFeatureComponent
DescriptiveStatisticsSFC calculates a few descriptive statistics of the data.
It contains the following states:
total_count: the size of the data
mean: the statistical mean of the data
minimum: the minimum element of the data
maximum: the maximum element of the data
central_moments: the central moments of the data, up to order MAXIMUM_MOMENT_ORDER
Mathematically: central_moments[i] = sum((x - mean)^i) / N, where N is total_count
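The formula above can be sketched in plain Python. This is an illustration of the definition only, not the library's implementation; the function name `central_moments` and the choice to start the list at order 0 are assumptions made for the example.

```python
from typing import List


def central_moments(values: List[float], max_order: int) -> List[float]:
    """Return [m_0, ..., m_max_order] with m_i = sum((x - mean)^i) / N.

    Hypothetical helper: whether the library's list starts at order 0
    is an assumption of this sketch.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    mean = sum(values) / n
    return [
        sum((x - mean) ** order for x in values) / n
        for order in range(max_order + 1)
    ]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
moments = central_moments(data, max_order=4)
# moments[2] is the population variance of the data.
```

The moment of order 1 is always zero (up to rounding), and the moment of order 2 is the population variance.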
- central_moments: List[float]¶
- compute(column: Series, **kwargs: Any) None ¶
Update the state of the DescriptiveStatisticsSFC using input series.
Parameters¶
- columnpd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DescriptiveStatisticsSFC ¶
Factory method to create a DescriptiveStatisticsSFC. No config parameter is supported.
Returns¶
- DescriptiveStatisticsSFC
An instance of DescriptiveStatisticsSFC.
- get_central_moments(order: int) float | None ¶
Get the central moment of order K.
Parameters¶
- orderint
Order of the moment; must be less than or equal to MAXIMUM_MOMENT_ORDER.
Returns¶
float : Central moment of order K.
- get_kurtosis() float | None ¶
Get the Excess Kurtosis of data
Returns¶
float : Excess Kurtosis of the data
- get_maximum() float | None ¶
Get the Maximum of the data
Returns¶
float : Maximum value of the data.
- get_minimum() float | None ¶
Get the Minimum of the data
Returns¶
float : Minimum value of the data.
- get_standard_deviation() float | None ¶
Get the Standard Deviation of data
Returns¶
float : Standard Deviation of the data
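Standard deviation and excess kurtosis can both be derived from the stored central moments. The sketch below assumes the population definitions (dividing by N) and returns None for constant data; whether the library uses these conventions is not stated in this reference, and the function names are hypothetical.

```python
import math
from typing import Optional


def standard_deviation(m2: float) -> float:
    """Population standard deviation from the second central moment."""
    return math.sqrt(m2)


def excess_kurtosis(m2: float, m4: float) -> Optional[float]:
    """Excess kurtosis m4 / m2^2 - 3; None when the variance is zero.

    Returning None for constant data is an assumption of this sketch.
    """
    if m2 == 0:
        return None
    return m4 / (m2 * m2) - 3.0


# Second and fourth central moments of [2, 4, 4, 4, 5, 5, 7, 9].
std = standard_deviation(4.0)
kurt = excess_kurtosis(4.0, 44.5)
```

Subtracting 3 makes the value zero for a normal distribution, which is what distinguishes excess kurtosis from plain kurtosis.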
- maximum: float¶
- mean: float¶
- merge(other: DescriptiveStatisticsSFC, **kwargs: Any) DescriptiveStatisticsSFC ¶
Merge two DescriptiveStatisticsSFC instances into one, without mutating either input.
Parameters¶
- otherDescriptiveStatisticsSFC
Other DescriptiveStatisticsSFC to be merged.
Returns¶
- DescriptiveStatisticsSFC
A new instance of DescriptiveStatisticsSFC after merging.
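One way such a merge can work without revisiting the raw data is the standard pairwise update for count, mean, and the sum of squared deviations. The sketch below is an assumption about the general technique, not the library's code: the `Stats` class and its fields are hypothetical, and only the second moment is combined, whereas the SFC keeps moments up to MAXIMUM_MOMENT_ORDER.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stats:
    """Hypothetical stand-in for the SFC state (second moment only)."""
    count: int
    mean: float
    minimum: float
    maximum: float
    m2_sum: float  # sum of squared deviations from the mean


def from_values(values: List[float]) -> Stats:
    n = len(values)
    mean = sum(values) / n
    m2_sum = sum((x - mean) ** 2 for x in values)
    return Stats(n, mean, min(values), max(values), m2_sum)


def merge(a: Stats, b: Stats) -> Stats:
    """Return a new Stats for the combined data; inputs are not mutated."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    # Pairwise update: cross term accounts for the shift between the two means.
    m2_sum = a.m2_sum + b.m2_sum + delta * delta * a.count * b.count / n
    return Stats(n, mean, min(a.minimum, b.minimum), max(a.maximum, b.maximum), m2_sum)


left = from_values([2.0, 4.0, 4.0, 4.0])
right = from_values([5.0, 5.0, 7.0, 9.0])
merged = merge(left, right)
whole = from_values([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Here `merged` matches `whole`, which is what lets per-partition results be combined into a single result.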
- minimum: float¶
- total_count: int¶
mlm_insights.core.sfcs.distinct_count_sfc module¶
- class mlm_insights.core.sfcs.distinct_count_sfc.CardinalityItem(estimate: int, lower_bound: float, upper_bound: float)¶
Bases:
object
- estimate: int¶
- lower_bound: float¶
- upper_bound: float¶
- class mlm_insights.core.sfcs.distinct_count_sfc.DistinctCountSFC(sketch: hll_sketch)¶
Bases:
ShareableFeatureComponent
,Serializable
Provides estimates in a single pass for identifying cardinality estimate (with associated lower and upper bound)
Reference: https://datasketches.apache.org/docs/HLL/HLL.html
DistinctCountSFC contains only one state i.e., sketch: datasketches.hll_sketch.
Note: Use create method instead of constructor
- compute(column: Series, **kwargs: Any) None ¶
Update the state of the DistinctCountSFC using input series.
Parameters¶
- columnpd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCountSFC ¶
Factory method to create a DistinctCountSFC. Supported configurable parameters:
DEFAULT_LOG_K: K-value to initialize hll_sketch, default = 12
Additionally, the HLL Target for the resulting sketch is set to HLL_4 and is non-configurable
Returns¶
- DistinctCountSFC
An Instance of DistinctCountSFC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) DistinctCountSFC ¶
Create a new instance of DistinctCountSFC from serialized bytes.
Parameters¶
- serialized_bytesbytes
Serialized bytes as input.
Returns¶
- DistinctCountSFC
New instance of DistinctCountSFC
- get_cardinality() CardinalityItem ¶
Returns the cardinality of input data, with associated lower and upper bound.
Returns¶
CardinalityItem
- merge(other: DistinctCountSFC, **kwargs: Any) DistinctCountSFC ¶
Merge two DistinctCountSFC into one with the help of an HLL union. The HLL union is updated with both the DistinctCountSFC instances to return a merged DistinctCountSFC
Parameters¶
- otherDistinctCountSFC
Other DistinctCountSFC to be merged.
Returns¶
- DistinctCountSFC
A new instance of DistinctCountSFC after merging.
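The reason a union is needed, rather than adding two estimates, can be shown with exact Python sets standing in for the HLL sketch. This is a conceptual illustration only: a real sketch stores a compact probabilistic summary instead of the values, and `merge_distinct` is a hypothetical name.

```python
from typing import Set


def merge_distinct(a: Set[str], b: Set[str]) -> Set[str]:
    """Return the union as a new set; neither input is modified."""
    return a | b


day_1 = {"alice", "bob", "carol"}
day_2 = {"bob", "carol", "dave"}

merged = merge_distinct(day_1, day_2)
naive_sum = len(day_1) + len(day_2)
# len(merged) counts each distinct value once; naive_sum double-counts
# the values present in both inputs.
```

Values that appear in both inputs must be counted once, so distinct counts are combined through a union and not through addition.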
- serialize(**kwargs: Any) bytes ¶
Serialize the DistinctCountSFC to bytes. Since it has only one state, i.e., hll_sketch, the default serialization of datasketches is used.
Returns¶
- bytes
Serialized bytes of the DistinctCountSFC.
- sketch: hll_sketch¶
mlm_insights.core.sfcs.framework_sfc module¶
Bases:
Enum
Enum to store all framework-specific ShareableFeatureComponent
mlm_insights.core.sfcs.frequent_items_sfc module¶
- class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemEstimate(value: str, estimate: int, lower_bound: int, upper_bound: int)¶
Bases:
object
- estimate: int¶
- lower_bound: int¶
- upper_bound: int¶
- value: str¶
- class mlm_insights.core.sfcs.frequent_items_sfc.FrequentItemsSFC(sketch: frequent_strings_sketch)¶
Bases:
ShareableFeatureComponent
,Serializable
Provides estimates in a single pass for:
identifying frequent items (aka heavy hitters) and
answering point queries (approximately how many times an item appears in a stream/dataset)
Reference: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html
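To give a feel for how a bounded number of counters can answer point queries with lower and upper bounds, here is a small Misra-Gries summary. It is a simpler relative of the DataSketches algorithm and is not what the library runs; the class and its attribute names are hypothetical.

```python
from typing import Dict, Tuple


class MisraGries:
    """Illustrative heavy-hitters summary with a fixed counter budget."""

    def __init__(self, max_counters: int) -> None:
        if max_counters < 1:
            raise ValueError("max_counters must be at least 1")
        self.max_counters = max_counters
        self.counters: Dict[str, int] = {}
        self.decrements = 0  # number of "decrement all" rounds so far
        self.total = 0       # number of items seen

    def update(self, item: str) -> None:
        self.total += 1
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.max_counters:
            self.counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            self.decrements += 1
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def bounds(self, item: str) -> Tuple[int, int]:
        """Return (lower_bound, upper_bound) on the item's true frequency."""
        lower = self.counters.get(item, 0)
        return lower, lower + self.decrements


summary = MisraGries(max_counters=2)
for token in ["a", "b", "a", "c", "a", "b", "a", "d", "a"]:
    summary.update(token)

a_bounds = summary.bounds("a")
b_bounds = summary.bounds("b")
```

In this stream "a" occurs 5 times and "b" twice; both true counts fall inside the reported bounds. A larger counter budget triggers fewer decrement rounds and therefore tighter bounds, which mirrors the space/error trade-off of the real sketch.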
- compute(column: Series, **kwargs: Any) None ¶
Update the state of the FrequentItemsSFC using input series.
Parameters¶
- columnpd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequentItemsSFC ¶
Factory method to create the SFC. Use the create method instead of the constructor. Supported configurable parameters:
CONFIG_MAX_SIZE_KEY: Maximum size of counters, default = 7. Tune this parameter to control both the space usage and the error (a larger size corresponds to more space and less error).
Returns¶
- FrequentItemsSFC
An Instance of FrequentItemsSFC.
- classmethod deserialize(serialized_sketch: bytes, **kwargs: Any) FrequentItemsSFC ¶
Create a new instance from serialized bytes.
Parameters¶
- serialized_sketchbytes
Serialized bytes as input.
Returns¶
- FrequentItemsSFC
New instance of FrequentItemsSFC
- get_frequency_estimate(item: Any) FrequentItemEstimate ¶
Get a frequency estimate of a specific item, i.e., approximately how many times the item appeared in the stream/dataset.
Parameters¶
- item: Any
Item value to get the frequency estimate for
Returns¶
FrequentItemEstimate
- get_frequent_items_estimates() List[FrequentItemEstimate] ¶
Get a list of all the frequent item estimates from the processed data stream/data set
Returns¶
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_frequent_items_estimates_no_false_negatives() List[FrequentItemEstimate] ¶
Get a list of all the frequent item estimates using the no-false-negatives error type of the Frequent Items sketch.
Returns¶
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_top_k_elements(k: int) List[FrequentItemEstimate] ¶
Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise ‘k’ frequent items are returned.
Parameters¶
- k: int
Count of how many top frequently occurring items to return.
Returns¶
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
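The truncation behavior described above can be sketched as a sort followed by a slice. The `ItemEstimate` dataclass mirrors the fields of FrequentItemEstimate for illustration; ordering by `estimate` is an assumption, since the reference does not say which field the library ranks on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ItemEstimate:
    """Hypothetical mirror of FrequentItemEstimate for this example."""
    value: str
    estimate: int
    lower_bound: int
    upper_bound: int


def top_k(items: List[ItemEstimate], k: int) -> List[ItemEstimate]:
    """Return at most k items, highest estimate first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(items, key=lambda item: item.estimate, reverse=True)
    return ranked[:k]  # slicing past the end simply returns everything


captured = [
    ItemEstimate("bob", 12, 10, 12),
    ItemEstimate("alice", 40, 38, 40),
    ItemEstimate("carol", 7, 5, 7),
]

first_two = top_k(captured, 2)
all_items = top_k(captured, 10)
```

Asking for more items than were captured returns every captured item and does not raise an error.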
- get_top_k_elements_using_no_false_negatives(k: int) List[FrequentItemEstimate] ¶
Get a list of the top ‘k’ frequent items (aka heavy hitters). When ‘k’ exceeds the number of frequent items captured by the SFC, all captured frequent items are returned; otherwise ‘k’ frequent items are returned. Here, the top ‘k’ elements are calculated from the no-false-negatives result of the Frequent Items sketch. This is useful when the sketch returns an empty list for the no-false-positives case.
Parameters¶
- k: int
Count of how many top frequently occurring items to return.
Returns¶
- List[FrequentItemEstimate]
List of FrequentItemEstimate which includes the value, estimate and lower/upper bounds.
- get_total_count() int ¶
Returns the total count of input data.
Returns¶
- int :
total count of items in the data.
- merge(other: FrequentItemsSFC, **kwargs: Any) FrequentItemsSFC ¶
Merge two SFCs to produce a correct union, without mutating either input.
Parameters¶
- otherFrequentItemsSFC
Other FrequentItemsSFC to be merged.
Returns¶
- FrequentItemsSFC
A new instance of FrequentItemsSFC after merging.
- serialize(**kwargs: Any) bytes ¶
Serialize the FrequentItemsSFC to bytes. This allows the SFC to be persisted in a Profile.
Returns¶
- bytes
Serialized bytes of the FrequentItemsSFC.
- sketch: frequent_strings_sketch¶
mlm_insights.core.sfcs.quantiles_sfc module¶
- class mlm_insights.core.sfcs.quantiles_sfc.QuantilesSFC(kll_sketch: kll_doubles_sketch)¶
Bases:
ShareableFeatureComponent
,Serializable
QuantilesSFC uses a streaming quantiles algorithm. It can be used to find quantiles, ranks, PMF, and CDF.
QuantilesSFC contains only one state, i.e., kll_sketch: datasketches.kll_doubles_sketch.
Reference: https://datasketches.apache.org/docs/KLL/KLLSketch.html
- Note:
Use create method instead of constructor
- compute(column: Series, **kwargs: Any) None ¶
Update the state of the QuantilesSFC using input series.
Parameters¶
- columnpd.Series
Input column.
- classmethod create(config: Dict[str, ConfigParameter] | None = None) QuantilesSFC ¶
Factory method to create a QuantilesSFC. Supported configurable parameters:
KLL_K: K-value to initialize kll_doubles_sketch, default = 200
Returns¶
- QuantilesSFC
An Instance of QuantilesSFC.
- classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) QuantilesSFC ¶
Create a new instance of QuantilesSFC from serialized bytes.
Parameters¶
- serialized_bytesbytes
Serialized bytes as input.
Returns¶
- QuantilesSFC
New instance of QuantilesSFC
- get_quantile(rank: float) float ¶
Returns an approximation to the data value associated with the given normalized rank in a hypothetical sorted version of the input data.
Returns¶
- float
Quantile of the data of given rank
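A normalized rank is a fraction between 0 and 1 of the way through the sorted data. The sketch below computes the exact answer on a small list to show what the approximate result is estimating; it is not the KLL algorithm, and the index convention used here is an assumption that may differ from the sketch's own.

```python
from typing import List


def exact_quantile(values: List[float], rank: float) -> float:
    """Return the value at the given normalized rank of the sorted data."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must be within [0, 1]")
    ordered = sorted(values)
    # Clamp so that rank == 1.0 maps to the largest element.
    index = min(int(rank * len(ordered)), len(ordered) - 1)
    return ordered[index]


data = [9.0, 2.0, 7.0, 4.0, 5.0, 4.0, 4.0, 5.0]
median = exact_quantile(data, 0.5)
smallest = exact_quantile(data, 0.0)
largest = exact_quantile(data, 1.0)
```

The sketch-based method trades this exactness for bounded memory, so its result may differ slightly from the exact value on large inputs.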
- kll_sketch: kll_doubles_sketch¶
- merge(other: QuantilesSFC, **kwargs: Any) QuantilesSFC ¶
Merge two QuantilesSFC instances into one, without mutating either input.
Parameters¶
- otherQuantilesSFC
Other QuantilesSFC to be merged.
Returns¶
- QuantilesSFC
A new instance of QuantilesSFC after merging.
mlm_insights.core.sfcs.sfc_merge_exception module¶
- exception mlm_insights.core.sfcs.sfc_merge_exception.SFCMergeException(message: str)¶
Bases:
Exception
Exception raised when merging two ShareableFeatureComponent instances fails
- Attributes:
message – explanation of the error
mlm_insights.core.sfcs.sfc_registry module¶
- class mlm_insights.core.sfcs.sfc_registry.SFCMetaData(klass: ~typing.Type[~mlm_insights.core.sfcs.interfaces.shareable_feature_component.ShareableFeatureComponent], config: ~typing.Dict[str, ~typing.Any] = <factory>)¶
Bases:
object
SFCMetaData to store class type and config of ShareableFeatureComponent
- config: Dict[str, Any]¶
- get_hash() str ¶
Get the hash of the SFCMetaData. The hash value is derived from the MD5 hash of SFCMetaData.config.
Returns¶
str: The calculated hash of the SFCMetaData.
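A config-derived MD5 hash can be sketched with the standard library as follows. How the library canonicalizes the config before hashing is not documented here, so the JSON encoding with sorted keys and the name `config_hash` are assumptions of this example.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Return a hex MD5 digest of a config dictionary.

    Assumption: keys are sorted so that insertion order does not
    change the digest; non-JSON values fall back to str().
    """
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


hash_a = config_hash({"KLL_K": 200, "name": "quantiles"})
hash_b = config_hash({"name": "quantiles", "KLL_K": 200})
hash_c = config_hash({"KLL_K": 400, "name": "quantiles"})
```

Equal configs produce equal digests regardless of key order, while a changed parameter value produces a different digest.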
- klass: Type[ShareableFeatureComponent]¶
- class mlm_insights.core.sfcs.sfc_registry.SFCRegistry¶
Bases:
object
- add_sfc(sfc_metadata: SFCMetaData) SFCRegistry ¶
Add ShareableFeatureComponent to the SFCRegistry
Parameters¶
sfc_metadata : SFCMetaData
Returns¶
SFCRegistry
- static create_from_sfc_map(sfc_map: Dict[str, ShareableFeatureComponent]) SFCRegistry ¶
Factory method to create an SFC registry from an SFC map. Use this method to create the SFC registry directly from the SFC map.
Parameters¶
- sfc_mapDict[str, ShareableFeatureComponent]
Dictionary with the hash as the key and the ShareableFeatureComponent as the value.
- static create_from_sfc_meta(sfc_metas: List[SFCMetaData]) SFCRegistry ¶
Factory method to create an SFC registry from a list of SFC metadata. For each SFC metadata, a hash is computed and a new SFC instance is created. If two metadata entries are identical, only one key is stored.
Parameters¶
- sfc_metasList[SFCMetaData]
List of SFCMetaData
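The deduplication described above amounts to keying instances by a hash and skipping keys that already exist. In the sketch below the `Meta` tuple, `meta_key`, and `build_registry` are hypothetical names, the built-in `list` and `dict` types stand in for SFC classes, and including the class name in the key is an assumption of the example.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple

# (factory, config) pair standing in for SFCMetaData in this sketch.
Meta = Tuple[Callable[[], Any], Dict[str, Any]]


def meta_key(factory: Callable[[], Any], config: Dict[str, Any]) -> str:
    """Hypothetical key: MD5 of the factory name plus the sorted config."""
    payload = json.dumps(
        {"klass": factory.__qualname__, "config": config},
        sort_keys=True,
        default=str,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def build_registry(metas: List[Meta]) -> Dict[str, Any]:
    """Create one instance per distinct metadata entry."""
    registry: Dict[str, Any] = {}
    for factory, config in metas:
        key = meta_key(factory, config)
        if key not in registry:  # identical metadata shares one instance
            registry[key] = factory()
    return registry


metas: List[Meta] = [(list, {"k": 1}), (list, {"k": 1}), (dict, {})]
registry = build_registry(metas)
```

Sharing one instance per distinct metadata entry means each component is computed once even when several consumers request it.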
- classmethod deserialize(sfc_registry_message: SFCRegistryMessage) SFCRegistry ¶
Deserialize the Protocol Buffers message to an SFCRegistry
Returns¶
SFCRegistry
- get_sfc(sfc_meta: SFCMetaData) ShareableFeatureComponent ¶
Get the ShareableFeatureComponent from the SFCMetaData.
Parameters¶
sfc_meta : SFCMetaData
Returns¶
ShareableFeatureComponent
Raises¶
- KeyError
Raised if the SFCMetaData is not found in the registry.
- get_sfc_map() Dict[str, ShareableFeatureComponent] ¶
Get the ShareableFeatureComponent mapping of SFCMetaData.
Returns¶
Dict[str, ShareableFeatureComponent]: