Quick Start

This section gives a quick introduction to building a classifier for the Iris dataset with the Oracle AutoMLx tool. Iris is a multi-class classification dataset; more details can be found at Iris dataset. We demonstrate the preliminary steps required to train a model with the Oracle AutoMLx tool, and then explain the tuned model.

Load dataset

We start by reading in the dataset from Scikit-learn.

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> df = pd.DataFrame(data['data'], columns=data['feature_names'])
>>> y = pd.Series(data['target'])

This toy dataset only contains numerical data. The features (df) and the targets (y) are already held in separate objects, so we now split both into a training set (70%) and a test set (30%). The training set will be used to create a Machine Learning model using AutoMLx, and the test set will be used to evaluate the model’s performance on unseen data.

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(df,
...                                                     y,
...                                                     train_size=0.7,
...                                                     random_state=0)
>>> X_train.shape, X_test.shape
((105, 4), (45, 4))
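To make the 70/30 arithmetic concrete, here is a minimal pure-Python sketch of a shuffled split over row indices. The helper name shuffle_split is hypothetical, and the sketch only mirrors the idea behind train_test_split; it will not reproduce the exact rows that scikit-learn selects for random_state=0.

```python
import random

def shuffle_split(n_samples, train_fraction=0.7, seed=0):
    """Return disjoint lists of training and test row indices (illustrative helper)."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Number of training rows: 70% of 150 rows is 105.
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffle_split(150)
print(len(train_idx), len(test_idx))  # prints: 105 45
```

The two index lists never overlap, which is what keeps the test set unseen during training.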

Set the AutoMLx engine

AutoMLx offers the init() function, which initializes the parallel engine. By default, the AutoMLx pipeline uses the dask parallel engine. The engine can also be set to local, which uses Python’s multiprocessing library for parallelism instead.

>>> import automl
>>> from automl import init
>>> init(engine='local')
[2023-01-12 05:48:31,814] [automl.xengine] Local ProcessPool execution (n_jobs=36)

Train a model using AutoMLx

The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular, it can find a tuned model for any supervised prediction task, e.g. classification or regression, where the target can be binary, categorical or real-valued.

AutoMLx consists of five main modules:
  1. Preprocessing : Clean, impute, engineer, and normalize features.

  2. Algorithm Selection : Identify the right classification algorithm for a given dataset.

  3. Adaptive Sampling : Select a subset of the data samples for the model to be trained on.

  4. Feature Selection : Select a subset of the data features, based on the previously selected model.

  5. Hyperparameter Tuning : Find the right model parameters that maximize score for the given dataset.

All these pieces are readily combined into a simple AutoMLx pipeline which automates the entire Machine Learning process with minimal user input/interaction.
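As an illustration of what the last module, Hyperparameter Tuning, is doing conceptually, the sketch below runs an exhaustive search over one hyperparameter (k) of a toy nearest-neighbour classifier and keeps the value with the best validation accuracy. Everything in it is an assumption made for illustration: the data, the function names and the grid search itself are hypothetical and do not describe the search algorithm that AutoMLx actually uses.

```python
from collections import Counter

# Hypothetical 1-D toy data: (feature value, class label).
# The point (1.1, 1) is a deliberately noisy training label.
train = [(0.8, 0), (1.0, 0), (1.2, 0), (1.1, 1), (2.9, 1), (3.0, 1), (3.2, 1)]
valid = [(0.9, 0), (1.1, 0), (2.8, 1), (3.1, 1)]

def knn_predict(x, data, k):
    """Predict the majority label among the k points of `data` closest to x."""
    neighbours = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(samples, reference, k):
    """Fraction of `samples` classified correctly using `reference` as neighbours."""
    hits = sum(knn_predict(x, reference, k) == label for x, label in samples)
    return hits / len(samples)

def tune_k(candidates):
    """Score every candidate k on the validation set and return the best one."""
    scores = {k: accuracy(valid, train, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

best_k, scores = tune_k([1, 3, 7])
print(best_k, scores)  # prints: 3 {1: 0.75, 3: 1.0, 7: 0.5}
```

Here k=1 overfits the noisy label and k=7 ignores the local structure, so the search settles on k=3.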

The AutoMLx API is quite simple to work with. We create a Pipeline instance. Next, the training data is passed to the fit() function which executes the previously mentioned steps.

>>> est = automl.Pipeline(task='classification')
>>> est.fit(X_train, y_train)
    Pipeline()

A model is then generated (est) and can be used for prediction tasks. Here, we use the macro-averaged F1 score (f1_score with average='macro') to evaluate the performance of this model on unseen data (X_test).

>>> from sklearn.metrics import f1_score
>>> y_pred = est.predict(X_test)
>>> score_default = f1_score(y_test, y_pred, average='macro')
>>> print(f'Score on test data : {score_default}')
Score on test data : 0.975983436853002
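For readers unfamiliar with the macro average, the following pure-Python sketch computes it by hand: an F1 score is computed for each class separately, and the per-class scores are then averaged with equal weight. The function name macro_f1 and the small label lists are hypothetical and unrelated to the Iris results above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Per-class F1 here is 1.0, 2/3 and 0.8, so the macro average is about 0.8222.
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
print(round(score, 4))  # prints: 0.8222
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro score as much as a frequent one would.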

The Pipeline can also fit regression, forecasting and anomaly detection models. Please check out the rest of the documentation for more details about advanced configuration parameters.

Explain a classifier

For a variety of decision-making tasks, getting only a prediction as model output is not sufficient. A user may wish to know why the model outputs that prediction, or which data features are relevant for that prediction. For that purpose, the Oracle AutoMLx solution defines the MLExplainer object, which can compute a variety of model explanations for any AutoMLx-trained pipeline or scikit-learn-like model. MLExplainer takes as arguments the trained model, the training data and labels, as well as the task.

>>> explainer = automl.MLExplainer(est,
...                                X_train,
...                                y_train,
...                                task="classification")

Let’s explain the model’s performance (relative to the provided train labels) using Global Feature Importance. This technique measures how much the model’s performance would change if a given feature were dropped from the dataset, without retraining the model. This notion of feature importance considers each feature independently from all other features.

The explain_model() method computes such feature importances. It also provides 95% confidence intervals for each feature importance attribution.

>>> result_explain_model_default = explainer.explain_model()
>>> result_explain_model_default.to_dataframe()
    feature                 attribution     upper_bound     lower_bound
0   petal width (cm)        0.350644        0.416850        0.284437
1   petal length (cm)       0.272190        0.309005        0.235374
2   sepal length (cm)       0.000000        0.000000        0.000000
3   sepal width (cm)        0.000000        0.000000        0.000000
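One common way to estimate how much a model relies on a feature without retraining it is permutation importance: shuffle one feature column, re-score the fixed model, and record the drop in score. The pure-Python sketch below illustrates that idea on a toy model. It is an assumption that this resembles what explain_model() does internally; the toy model, data and function names are hypothetical, and no confidence intervals are computed.

```python
import random

def model(row):
    """Toy fixed classifier: looks only at feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the labels
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
        [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(rows, labels)
print(importances)
```

The feature the model ignores gets an importance of exactly 0, which matches the pattern in the table above where the two sepal features receive a zero attribution.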

The Oracle AutoMLx solution offers advanced configuration options and allows one to change the effect of feature interactions and interaction evaluations. It also provides other model and prediction explanation techniques.

Please check out the MLExplainer documentation for more details.