Transformer Component

The transformer component provides an easy way to apply simple in-memory transformations to the input data. Some examples of how transformers can be used are:

  • To normalise an input feature.

  • To change the scale of the selected items in a feature.

  • To modify or append columns based on existing columns.

  • To convert the data type of a column.

Transformers work as a chain and so take in a list as input. The order of the list is important as the framework runs the input data sequentially through the list of transformers and sends the final output of the chain to metrics and other components.
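The effect of ordering can be pictured with plain pandas. The following is a minimal sketch, not the framework's implementation: ScaleColumn, AddOffset, and run_chain are hypothetical names, and the only assumption about a transformer is that it exposes a transform method that takes and returns a dataframe.

```python
from typing import Any, List

import pandas as pd


class ScaleColumn:
    """Hypothetical transformer: multiplies one column by a constant factor."""

    def __init__(self, column: str, factor: float) -> None:
        self.column = column
        self.factor = factor

    def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
        out = data_frame.copy()
        out[self.column] = out[self.column] * self.factor
        return out


class AddOffset:
    """Hypothetical transformer: adds a constant offset to one column."""

    def __init__(self, column: str, offset: float) -> None:
        self.column = column
        self.offset = offset

    def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
        out = data_frame.copy()
        out[self.column] = out[self.column] + self.offset
        return out


def run_chain(data_frame: pd.DataFrame, transformers: List[Any]) -> pd.DataFrame:
    # Hypothetical helper: feed the output of each step into the next one.
    for transformer in transformers:
        data_frame = transformer.transform(data_frame)
    return data_frame


df = pd.DataFrame({"x": [1.0, 2.0]})

# Same two steps, different order, different result.
scale_then_shift = run_chain(df, [ScaleColumn("x", 10.0), AddOffset("x", 1.0)])
shift_then_scale = run_chain(df, [AddOffset("x", 1.0), ScaleColumn("x", 10.0)])
```

Because each step only sees the output of the previous one, swapping two steps generally changes the dataframe that reaches the metrics.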

Warning

The order in which you pass your transformers must be correct. The final dataframe must contain a column for every feature defined in the schema provided to ML Insights.

As with other components, transformer interfaces can be extended by the user with custom logic.

How do they work

Transformers expect a dataframe as input and return a dataframe as output. The first transformer in the chain has access to the full dataframe created by the reader (based on the input data and the input schema).

The interface method to transform is as follows:

def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
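As an illustration, a row-level transformer with this method signature might look like the sketch below. It is a standalone class written for this example: TotalPriceTransformer and the price, quantity, and total columns are hypothetical, and the framework base class that a real custom transformer would extend is not shown in this section.

```python
from typing import Any

import pandas as pd


class TotalPriceTransformer:
    """Hypothetical row-level transformer (standalone, no framework base class)."""

    def transform(self, data_frame: pd.DataFrame, **kwargs: Any) -> pd.DataFrame:
        # Work on a copy so the caller's dataframe is left untouched.
        out = data_frame.copy()
        # Convert the type: prices arrive as strings in this made-up input.
        out["price"] = out["price"].astype(float)
        # Append a column computed from other columns of the same row.
        out["total"] = out["price"] * out["quantity"]
        return out


sample = pd.DataFrame({"price": ["2.5", "4.0"], "quantity": [2, 3]})
result = TotalPriceTransformer().transform(sample)
```

The original columns are kept and only a new one is appended, so every feature from the input is still present in the output.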

As mentioned above, be careful not to unintentionally change the columns so that they differ from the input schema. For example, don’t drop a feature column. However, if the change is intended, you can provide a modified schema through the transformer’s interface:

def get_output_schema(self, input_schema: pa.Schema, **kwargs: Any) -> pa.Schema:

Warning

Transformers are meant to operate on one row at a time. Don’t create a transformation that requires backward seek, forward seek, or access to the entire data set, such as a group by.
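One way to see the difference is to compare results when the same data is processed whole and in pieces. The sketch below uses plain pandas; splitting the data into chunks is an assumption made for illustration only, since this section does not describe how the framework partitions data. The functions row_level and dataset_level are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})
chunks = [df.iloc[:2], df.iloc[2:]]


def row_level(frame: pd.DataFrame) -> pd.DataFrame:
    # Uses only values from the same row.
    return frame.assign(value=frame["value"] * 2)


def dataset_level(frame: pd.DataFrame) -> pd.DataFrame:
    # Depends on the mean of every row the function can see.
    return frame.assign(value=frame["value"] - frame["value"].mean())


row_whole = row_level(df)
row_chunked = pd.concat([row_level(chunk) for chunk in chunks])

dataset_whole = dataset_level(df)
dataset_chunked = pd.concat([dataset_level(chunk) for chunk in chunks])

# The row-level result is the same either way; the dataset-level one is not.
row_matches = row_whole.equals(row_chunked)
dataset_matches = dataset_whole.equals(dataset_chunked)
```

The row-level function gives identical output for the whole frame and for the chunks, while the mean-centering function does not, because each chunk has a different mean.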

Conditional Feature

One of the important transformers is the conditional feature. The conditional feature lets you write Python expressions to transform the data without the need to write custom transformer classes. The conditional feature has many use cases. For example:

  • Transform unstructured data to structured data.

  • Create a composite feature by applying some logic to several columns (of the same row).

  • Create variations of a single feature, for example, normalization.

How does it work

Here is an example to demonstrate how the conditional feature works. Let’s assume we have data with the following input features:

  • Gender

  • JobFunction

  • CommuteLength

These features are present in the input data. While we can create many metrics for them, like sum or mean, users often have additional use cases to better understand their data. For example, a user might want a metric on how many female software developers are present in the data, or on how many employees have a very long commute time. One option is for the user to run their own ETL and add this logic to the original input data. However, this can be time-consuming, and the user might need to create additional infrastructure or set-ups.

ML Insights offers a simple solution: you can write Python expressions (with row-level operations) to create additional features in memory. You can then define any available metric on them, based on the created feature’s data type, variable type, and column type.

Note

Keeping true to the design principle of transformers, conditional features in no way alter the original data, nor do they persist any copies of it in external storage. All conditional features are generated in memory and hence are transient.

Keeping this in mind, let’s see how we can define new conditional features and pass them on to the builder.

  1. First, import the required classes
    from mlm_insights.core.transformers.conditional_feature_transformer import ConditionalFeatureMetadata, ConditionalFeatureTransformer
    from mlm_insights.core.transformers.expression_evaluator import Expression, ExpressionType
    
  2. Next, define the feature types (as we would have done for a feature coming through input data)
    transformers = []
    feature_female_sde = FeatureMetadata(feature_name="FemaleSoftwareDeveloper",
                                   feature_type=FeatureType(data_type=DataType.INTEGER,
                                                            variable_type=VariableType.CONTINUOUS))
    
    feature_long_travel_time = FeatureMetadata(feature_name="LongTravelTime",
                                                feature_type=FeatureType(data_type=DataType.INTEGER,
                                                                         variable_type=VariableType.CONTINUOUS))
    
  3. Construct the conditional feature objects with proper logic expressions
    conditional_features = [
    ConditionalFeatureMetadata(
        expression=Expression(value="(df['Gender'] == 'Female') & (df['JobFunction'] == 'Software Developer')",
                              type=ExpressionType.python),
        feature_metadata=feature_female_sde),
    ConditionalFeatureMetadata(expression=Expression(value="df['CommuteLength'] > 5",
                                                     type=ExpressionType.python),
                               feature_metadata=feature_long_travel_time)]
    
    transformers.append(ConditionalFeatureTransformer.create(
        config={
            'conditional_features_metadata_config': conditional_features}))
    
  4. Finally, pass the list of transformers, which now contains the conditional feature transformer, to the builder object
    InsightsBuilder().with_transformers(transformers=transformers)
    

What we did here is create two conditional features:

  • FemaleSoftwareDeveloper - We used the logic Gender = ‘Female’ and JobFunction = ‘Software Developer’ to create a new feature that is 1 if the row represents a female software developer and 0 otherwise.

  • LongTravelTime - We used the logic CommuteLength > 5 to identify employees with a long commute time.

We can now add any metric, such as count, on these features to gain more insight into them.
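To see what the two expressions compute, they can be run directly in pandas on a small sample. This sketch does not use ML Insights: the three sample rows are invented, and the cast from boolean to integer is an assumption about how the expression result maps onto the INTEGER feature type.

```python
import pandas as pd

# Invented sample rows with the three input features from the example.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "JobFunction": ["Software Developer", "Software Developer", "Analyst"],
    "CommuteLength": [3, 7, 12],
})

# The same expressions that were passed as strings to Expression(value=...).
female_sde = (df['Gender'] == 'Female') & (df['JobFunction'] == 'Software Developer')
long_travel = df['CommuteLength'] > 5

# Assumption: booleans are represented as 1 and 0 in the new features.
features = pd.DataFrame({
    "FemaleSoftwareDeveloper": female_sde.astype(int),
    "LongTravelTime": long_travel.astype(int),
})

# A simple aggregate over each new feature, computed by hand.
female_sde_count = int(features["FemaleSoftwareDeveloper"].sum())
long_travel_count = int(features["LongTravelTime"].sum())
```

Both expressions use only values from the same row, which keeps them within the row-level constraint that applies to all transformers.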