AutoML

The AutoMLx Python package automatically creates, optimizes, and explains machine learning pipelines and models. The AutoML pipeline provides a tuned ML pipeline that finds the best model for a given training dataset and prediction task. AutoML has a simple pipeline-level Python API that quickly jump-starts the data science process with an accurate, tuned model. AutoML supports the following tasks:

  1. Supervised classification or regression on tabular datasets, where the target is a binary or multi-class value (classification) or a real-valued column (regression).

  2. Supervised classification for Image and Text datasets.

  3. Unsupervised anomaly detection, where the target or the labels are not provided.

  4. Univariate and multivariate time series forecasting.

The AutoML pipeline consists of five major stages: preprocessing, algorithm selection, adaptive sampling, feature selection, and model tuning.

These pieces are readily combined into a simple AutoML pipeline which automatically optimizes the whole pipeline with limited user input/interaction.

Pipeline

Pipeline(task='classification', dataset_format='pandas', score_metric=None, random_state=7, n_algos_tuned=1, model_list=None, preprocessing=True, search_space=None, max_tuning_trials=None, search_strategy='HyperGD', **kwargs)

Create AutoMLPipeline based on task and dataset type

Parameters
  • task ( str , default='classification' ) – Machine learning task, supported: classification, regression, anomaly_detection, forecasting

  • dataset_format ( str , default='pandas' ) – Determine the type of input/output dataset. Defaults to pandas

  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None: it is determined automatically depending on the task. Default score metrics:

      classification, binary – neg_log_loss

      classification, multiclass – neg_log_loss

      regression – neg_mean_squared_error

      forecasting – neg_sym_mean_abs_percent_error

      anomaly_detection – unsupervised_unify95

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss

      continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error

      binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      More information on these scoring metrics can be found under Classification metrics. Note: Scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").

      continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error

      More information on these scoring metrics can be found under Regression metrics.
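As an illustration of the callable, tuple, and list forms, the sketch below defines a scorer with the documented signature score_func(model, X, y). The MajorityModel class and the names accuracy_score_func and my_accuracy are hypothetical stand-ins; only the signature and the (str, callable) tuple shape come from the description above.

```python
from collections import Counter


class MajorityModel:
    """Hypothetical stand-in model that always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy_score_func(model, X, y):
    """Score function with the documented signature score_func(model, X, y)."""
    predictions = model.predict(X)
    correct = sum(1 for p, t in zip(predictions, y) if p == t)
    return correct / len(y)


# Tuple form: (name of the metric, callable with the signature above).
named_metric = ("my_accuracy", accuracy_score_func)

# List form: the first entry is the one the pipeline optimizes for.
score_metric = [named_metric, "neg_log_loss"]

X = [[0], [1], [2], [3]]
y = ["a", "a", "a", "b"]
model = MajorityModel().fit(X, y)
score = accuracy_score_func(model, X, y)
print(score)  # 0.75
```

Here the list would make the pipeline optimize for my_accuracy while also computing neg_log_loss for every candidate.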

  • random_state ( int , default=7 ) – Random seed used by AutoML.

  • n_algos_tuned ( int , default=1 ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection, set n_algos_tuned = len(model_list).

  • model_list ( List [ Model | str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. By default, all supported built-in models for the given task are used. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for regression and classification must implement the scikit-learn-style fit and predict methods. Classification models must also support predict_proba. Anomaly detection models must follow the pyod interface. Supported built-in models per task:

    classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier

    regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor

    anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder

    forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster

  • preprocessing ( bool , default=True ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.

    • If False, the user must cleanse (and normalize if desired) the dataset before passing data to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).
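To make the preprocessing steps concrete, here is a minimal pure-Python sketch of the behavior described for preprocessing=True. It is not the AutoML implementation: the preprocess function, the use of None for missing values, and the alphabetical label encoding are assumptions made for illustration.

```python
from statistics import mean, mode, pstdev


def preprocess(columns, max_missing_fraction=0.2):
    """Illustrative re-creation of the described steps (not AutoML's code)."""
    result = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing_fraction:
            continue  # feature ignored: more than 20 percent missing
        present = [v for v in values if v is not None]
        if all(isinstance(v, (int, float)) for v in present):
            # Numeric feature: impute the mean, then scale to mean 0, variance 1.
            fill = mean(present)
            filled = [fill if v is None else v for v in values]
            mu, sigma = mean(filled), pstdev(filled)  # population std, as StandardScaler uses
            result[name] = [(v - mu) / sigma if sigma else 0.0 for v in filled]
        else:
            # Categorical feature: impute the mode, then label encode.
            fill = mode(present)
            filled = [fill if v is None else v for v in values]
            codes = {label: i for i, label in enumerate(sorted(set(filled)))}
            result[name] = [codes[v] for v in filled]
    return result


columns = {
    "age": [20, 30, None, 40, 50],          # 20% missing: kept
    "city": ["x", "y", "x", None, "x"],     # 20% missing: kept
    "notes": [None, None, "a", "b", "c"],   # 40% missing: ignored
}
clean = preprocess(columns)
print(clean["age"])   # [-1.5, -0.5, 0.0, 0.5, 1.5]
print(clean["city"])  # [0, 1, 0, 0, 0]
```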

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str); each value is a dictionary that maps hyperparameter names to their search space. Each hyperparameter entry must have two keys: (1) ‘range’, which is a list containing the range, and (2) ‘type’, which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for LogisticRegression:

    search_space =  {
        'LogisticRegression' : {
            'C': {
                'range': [0.03125, 512],
                'type': 'continuous'
            },
            'solver': {
                'range': ['newton-cg', 'lbfgs',
                          'liblinear', 'sag'],
                'type': 'categorical'
            },
            'class_weight': {
                'range': [None, 'balanced'],
                'type': 'categorical'
            }
        }
    }
    
    • To disable Model Tune for all models, set search_space = {}.

    • If a key value is an empty dictionary, then Model Tune is disabled for that key.

    • If None, the default search space defined inside AutoML is used.
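The structure rules above can be checked before a search space is handed to the pipeline. The helper below is a hypothetical sketch (validate_search_space is not part of the package); it only encodes the rules stated in this section: each hyperparameter needs 'range' and 'type', and an empty dictionary for an algorithm is valid because it disables tuning for that algorithm.

```python
VALID_TYPES = {"continuous", "discrete", "categorical"}


def validate_search_space(search_space):
    """Return a list of problems found in a search_space dictionary."""
    problems = []
    for algorithm, params in search_space.items():
        # An empty dict is fine: Model Tune is disabled for that algorithm.
        for name, spec in params.items():
            missing = {"range", "type"} - set(spec)
            if missing:
                problems.append(f"{algorithm}.{name}: missing {sorted(missing)}")
                continue
            if not isinstance(spec["range"], list):
                problems.append(f"{algorithm}.{name}: 'range' must be a list")
            if spec["type"] not in VALID_TYPES:
                problems.append(f"{algorithm}.{name}: unknown type {spec['type']!r}")
    return problems


search_space = {
    "LogisticRegression": {
        "C": {"range": [0.03125, 512], "type": "continuous"},
        "class_weight": {"range": [None, "balanced"], "type": "categorical"},
    },
    "SVC": {},  # tuning disabled for SVC only
}
bad_space = {"LogisticRegression": {"C": {"range": [0.03125, 512], "type": "float"}}}

print(validate_search_space(search_space))  # []
print(validate_search_space(bad_space))
```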

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials, may be exceeded slightly.

    • If None: AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.

    • If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
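The arithmetic behind the three forms of max_tuning_trials can be sketched as follows. The max_total_trials helper is hypothetical and only restates the rules above; the result is an approximate ceiling, since the limit may be exceeded slightly.

```python
def max_total_trials(max_tuning_trials, tuned_algorithms):
    """Approximate upper bound on HPO trials, or None when AutoML decides itself."""
    if max_tuning_trials is None:
        return None
    if isinstance(max_tuning_trials, int):
        # The integer applies to each tuned algorithm separately.
        return max_tuning_trials * len(tuned_algorithms)
    # Dict form: algorithms missing from the dict default to None (no fixed limit).
    limits = [max_tuning_trials.get(name) for name in tuned_algorithms]
    if any(limit is None for limit in limits):
        return None
    return sum(limits)


tuned = ["LogisticRegression", "RandomForestClassifier"]
print(max_total_trials(50, tuned))  # 100
print(max_total_trials({"LogisticRegression": 100, "RandomForestClassifier": 200}, tuned))  # 300
print(max_total_trials({"LogisticRegression": 100}, tuned))  # None
```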

  • search_strategy ( str , default='HyperGD' ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii

  • kwargs ( Any ) –

    Optional arguments. You can find a list of arguments related to each task in their configure method:

    • automlx.express.classifier.AutoClassifier.configure for ‘classification’

    • automlx.express.regressor.AutoRegressor.configure for ‘regression’

    • automlx.express.anomaly_detector.AutoAnomalyDetector.configure for ‘anomaly_detection’

    • automlx.express.forecaster.AutoForecaster.configure for ‘forecasting’

Raises

AutoMLxValueError – If the given task is not supported or the provided dataset format is not supported.

Returns

An AutoMLPipeline for the given task:

  • automlx.express.classifier.AutoClassifier for ‘classification’

  • automlx.express.regressor.AutoRegressor for ‘regression’

  • automlx.express.anomaly_detector.AutoAnomalyDetector for ‘anomaly_detection’

  • automlx.express.forecaster.Forecaster for ‘forecasting’

Return type

AutoMLPipeline
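A minimal sketch of how the documented parameters might be assembled for a classification pipeline. The keyword names and values come from the signature and lists above. The import path is an assumption: the sketch presumes the package is importable as automlx and exposes Pipeline at the top level, and it omits any engine or environment setup an installation may require. The build_pipeline helper is hypothetical and is not called here.

```python
# Keyword arguments taken from the Pipeline signature documented above.
pipeline_kwargs = {
    "task": "classification",
    "score_metric": "roc_auc",
    "n_algos_tuned": 2,
    "model_list": ["LogisticRegression", "RandomForestClassifier", "XGBClassifier"],
    "max_tuning_trials": 50,
    "random_state": 7,
}


def build_pipeline(kwargs):
    """Create the pipeline; requires the AutoMLx package to be installed."""
    # Assumption: top-level import name and attribute; adjust to your installation.
    import automlx

    return automlx.Pipeline(**kwargs)


# Algorithm selection stays active because fewer algorithms are tuned than listed.
print(pipeline_kwargs["n_algos_tuned"] < len(pipeline_kwargs["model_list"]))  # True
```

With these settings, algorithm selection would pick two of the three listed models for tuning, and the returned object would be the AutoClassifier described in the next section.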

AutoClassifier

class AutoClassifier

Classifier AutoMLPipeline

classes_

Holds the label for each class (for task=classification only, otherwise it is set to None ).

Type

List[Any]

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]
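The rule that a raw feature counts as selected when at least one of its engineered features is selected can be sketched as below. The engineered_to_raw mapping and the feature names are hypothetical; the pipeline tracks this relationship internally.

```python
def raw_selected_features(engineered_to_raw, selected_engineered):
    """Raw features with at least one selected engineered feature, in first-seen order."""
    selected = []
    for name in selected_engineered:
        raw = engineered_to_raw[name]
        if raw not in selected:
            selected.append(raw)
    return selected


# Hypothetical mapping from engineered feature names to the raw column they came from.
engineered_to_raw = {
    "date_year": "date",
    "date_month": "date",
    "age": "age",
    "income": "income",
}
selected_features_names = ["date_month", "age"]
raw_names = raw_selected_features(engineered_to_raw, selected_features_names)
print(raw_names)  # ['date', 'age']
```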

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list
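Because selected_rows_ is a flat list without CV and a list of lists with CV, code that consumes it may want to normalize both shapes first. A small sketch, using the example values above (rows_per_fold is a hypothetical helper):

```python
def rows_per_fold(selected_rows):
    """Normalize selected_rows_ to a list with one index list per fold."""
    if selected_rows and isinstance(selected_rows[0], list):
        return selected_rows   # CV: already one list per fold
    return [selected_rows]     # no CV: wrap the single flat list


no_cv = [0, 1, 5]
cv3 = [[0, 1], [0, 5], [1, 5]]

print(rows_per_fold(no_cv))                   # [[0, 1, 5]]
print([len(f) for f in rows_per_fold(cv3)])   # [2, 2, 2]
print(sorted(set().union(*rows_per_fold(cv3))))  # [0, 1, 5]
```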

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model.

Type

numpy.ndarray of shape (n_features,)

threshold_tuning_score_

The validation score of the pipelines after applying threshold tuning. The scoring metric used to select this threshold can be found in threshold_tuning_scorer_ . It is None when the task is not classification or threshold_tuning is False.

Type

List[Dict[ str , float ]]

threshold_tuning_scorer_

The scoring metric used to select threshold during threshold tuning. It is None when the task is not classification or threshold_tuning is False.

Type

Metric

configure(self, score_metric=None, random_state=None, n_algos_tuned=None, model_list=None, adaptive_sampling=None, min_features=None, optimization=None, preprocessing=None, search_space=None, min_class_instances=None, max_tuning_trials=None, search_strategy=None, threshold_tuning=None)

Configure the AutoClassifier

If an argument is None, its current value is left unchanged; if the value was never set, the default is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None: it is determined automatically depending on the task. Default score metrics: binary: neg_log_loss, multiclass: neg_log_loss

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y).

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      More information on these scoring metrics can be found under Classification metrics. Note: Scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection, set n_algos_tuned = len(model_list).

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. By default, all supported built-in models for the given task are used. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for classification must implement the scikit-learn-style fit, predict, and predict_proba methods. Supported built-in models per task:

    classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier

  • adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set): True

  • min_features ( int , float , list or None , default=None ) –

    Minimum number of features to keep. Acceptable values:

    • If int, 0 < min_features <= n_features

    • If float, 0 < min_features <= 1.0

    • If list, names of features to keep, for example ['a', 'b'] means keep features ‘a’ and ‘b’

    • To disable feature selection, set min_features = 1.0

    Default value (if not previously set): 1
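Note that the integer 1 (the default: keep at least one feature) and the float 1.0 (keep all features, which disables feature selection) mean different things. The sketch below makes the distinction explicit. resolve_min_features is a hypothetical helper, and rounding fractional counts up is an assumption rather than documented behavior.

```python
import math


def resolve_min_features(min_features, feature_names):
    """Translate min_features into a minimum feature count (illustrative only)."""
    n_features = len(feature_names)
    if isinstance(min_features, bool):
        raise TypeError("min_features must be an int, float, or list")
    if isinstance(min_features, int):
        if not 0 < min_features <= n_features:
            raise ValueError("int min_features must satisfy 0 < value <= n_features")
        return min_features
    if isinstance(min_features, float):
        if not 0 < min_features <= 1.0:
            raise ValueError("float min_features must satisfy 0 < value <= 1.0")
        # Assumption: a fraction of the features, rounded up.
        return math.ceil(min_features * n_features)
    if isinstance(min_features, list):
        unknown = [name for name in min_features if name not in feature_names]
        if unknown:
            raise ValueError(f"unknown features: {unknown}")
        return len(min_features)
    raise TypeError("min_features must be an int, float, or list")


names = ["a", "b", "c", "d"]
print(resolve_min_features(1, names))           # 1  (keep at least one feature)
print(resolve_min_features(1.0, names))         # 4  (keep everything)
print(resolve_min_features(0.5, names))         # 2
print(resolve_min_features(["a", "b"], names))  # 2
```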

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Default value (if not previously set): 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.

    • If False, the user must cleanse (and normalize if desired) the dataset before passing data to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str); each value is a dictionary that maps hyperparameter names to their search space. Each hyperparameter entry must have two keys: (1) ‘range’, which is a list containing the range, and (2) ‘type’, which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for LogisticRegression:

    search_space =  {
        'LogisticRegression' : {
            'C': {
                'range': [0.03125, 512],
                'type': 'continuous'
            },
            'solver': {
                'range': ['newton-cg', 'lbfgs',
                          'liblinear', 'sag'],
                'type': 'categorical'
            },
            'class_weight': {
                'range': [None, 'balanced'],
                'type': 'categorical'
            }
        }
    }
    
    • To disable Model Tune for all models, set search_space = {}.

    • If a key value is an empty dictionary, then Model Tune is disabled for that key.

    • If None, the default search space defined inside AutoML is used.
    

  • min_class_instances ( int or None , default=None ) – The minimum number of instances all classes must have when doing classification. If any class has fewer than this number of instances, training is stopped. This argument may take any value of 2 or higher. Default value (if not previously set): 5

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials, may be exceeded slightly.

    • If None: AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.

    • If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii Default value (if not previously set): 'HyperGD'

  • threshold_tuning ( bool or None , default=None ) –

    Determines whether or not AutoML optimizes the prediction threshold. Threshold tuning is only used in classification tasks. However, unlike classic threshold tuning, AutoML uses a novel technique that increases or decreases the model’s prediction probabilities for a given class, thereby keeping the prediction threshold fixed at 0.5 for binary classification and allowing the method to generalize to multi-class classification problems.

    • If True, the prediction threshold will be optimized based on the provided score metric. Threshold tuning allows users to post-process classification model predictions to optimize for their custom metric. Threshold tuning will not be exported to ONNX models, therefore the ONNX model quality may be lower than the original model.

    • If False, threshold tuning is not applied.

    Default value (if not previously set): False
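The exact technique is not specified here. As a rough intuition only, the sketch below shows one generic way to shift predictions by rescaling class probabilities while the decision rule itself (pick the most probable class, which for two classes is the 0.5 cut-off) stays fixed. The functions and the class_weights values are hypothetical and should not be read as AutoML's algorithm.

```python
def reweight_probabilities(probabilities, class_weights):
    """Scale each class probability by a weight and renormalize each row."""
    adjusted = []
    for row in probabilities:
        scaled = [p * w for p, w in zip(row, class_weights)]
        total = sum(scaled)
        adjusted.append([s / total for s in scaled])
    return adjusted


def predict_labels(probabilities):
    """Pick the most probable class; for two classes this is the 0.5 cut-off."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


probabilities = [[0.7, 0.3], [0.55, 0.45], [0.2, 0.8]]
class_weights = [1.0, 2.0]  # hypothetical weights that favor class 1

before = predict_labels(probabilities)
after = predict_labels(reweight_probabilities(probabilities, class_weights))
print(before)  # [0, 0, 1]
print(after)   # [0, 1, 1]
```

Because the adjustment acts on per-class probabilities rather than on a single cut-off, the same idea carries over to more than two classes.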

train(self, X, y, X_valid=None, y_valid=None, cv='auto', col_types=None, time_budget=-1)

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ‘auto’: uses 5 folds if the number of instances is below 1M; otherwise cross-validation is disabled

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
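The cv options above can be summarized in a small resolver. This is a hypothetical sketch of the documented rules (resolve_cv is not a package function); it returns the number of folds, or None when no cross-validation takes place.

```python
def resolve_cv(cv, n_instances):
    """Return the number of CV folds, or None when cross-validation is not used."""
    if cv is None:
        return None  # X_valid and y_valid are used for validation
    if isinstance(cv, str):
        if cv != "auto":
            raise ValueError("the only supported string is 'auto'")
        return 5 if n_instances < 1_000_000 else None
    if isinstance(cv, int):
        return cv
    # Otherwise: an iterable yielding (train, test) index arrays.
    return len(list(cv))


print(resolve_cv("auto", 10_000))     # 5
print(resolve_cv("auto", 2_000_000))  # None
print(resolve_cv(3, 100))             # 3

custom_splits = [([0, 1], [2]), ([1, 2], [0])]
print(resolve_cv(custom_splits, 3))   # 2
```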

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning

    • -1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
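A per-step budget is easy to mistype, so it can help to check the step names first. The validator below is a hypothetical sketch based only on the step names listed above; it returns None for the unconstrained case.

```python
STEP_NAMES = {
    "ModelSelection",
    "ModelTune",
    "FeatureSelection",
    "AdaptiveSampling",
    "ThresholdTuning",
}


def validate_time_budget(time_budget):
    """Normalize a time budget; None means unconstrained (best-effort mode)."""
    if isinstance(time_budget, dict):
        unknown = set(time_budget) - STEP_NAMES
        if unknown:
            raise ValueError(f"unknown step names: {sorted(unknown)}")
        return {step: float(seconds) for step, seconds in time_budget.items()}
    if time_budget == -1:
        return None
    return float(time_budget)


print(validate_time_budget(-1))    # None
print(validate_time_budget(600))   # 600.0
print(validate_time_budget({"ModelSelection": 60, "ModelTune": 300}))
```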

Returns

self

Return type

AutoMLPipeline

fit(self, X, y, X_valid=None, y_valid=None, cv='auto', col_types=None, time_budget=-1)

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ‘auto’: uses 5 folds if the number of instances is below 1M; otherwise cross-validation is disabled

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning

    • -1 for unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline

refit(self, X, y, X_valid=None, y_valid=None)

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

predict_proba(self, X)

Probability estimates.

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises

AutoMLxRuntimeError – If there are no predictions after calling the model on the given dataset

Returns

y_pred – The predicted probabilities.

Return type

numpy.ndarray of shape = (n_samples, n_classes)
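To turn a probability matrix of shape (n_samples, n_classes) into labels, each row's most probable column can be mapped through classes_. The sketch below uses plain lists instead of a numpy array and assumes that column i corresponds to classes_[i], which follows the scikit-learn convention but is not stated explicitly here. The class names are made up.

```python
def labels_from_probabilities(probabilities, classes):
    """Map each row of class probabilities to the label of its largest entry."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


classes = ["cat", "dog", "fish"]                 # stand-in for classes_
y_proba = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]     # stand-in for predict_proba(X)
labels = labels_from_probabilities(y_proba, classes)
print(labels)  # ['dog', 'cat']
```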

AutoRegressor

class AutoRegressor

Regressor AutoMLPipeline

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model.

Type

numpy.ndarray of shape (n_features,)

configure(self, score_metric=None, random_state=None, n_algos_tuned=None, model_list=None, adaptive_sampling=None, min_features=None, optimization=None, preprocessing=None, search_space=None, max_tuning_trials=None, search_strategy=None)

Configure the AutoRegressor

If an argument is None, its current value is left unchanged; if the value was never set, the default is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None: it is determined automatically depending on the task. Default score metric: neg_mean_squared_error

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y).

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error

      More information on these scoring metrics can be found under Regression metrics.
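The neg_ prefix follows the scikit-learn convention of negating error metrics so that a larger score is always better. The sketch below writes such a scorer by hand with the documented score_func(model, X, y) signature. ConstantModel is a hypothetical stand-in, and this function only mirrors the idea of the built-in neg_mean_squared_error metric.

```python
class ConstantModel:
    """Hypothetical stand-in regressor that always predicts one value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


def neg_mean_squared_error(model, X, y):
    """Negated MSE: values closer to zero (that is, larger) are better."""
    predictions = model.predict(X)
    return -sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)


X = [[0], [1], [2]]
y = [1, 2, 3]
candidates = [ConstantModel(0), ConstantModel(2)]

# Selecting by the maximum score picks the model with the smallest error.
best = max(candidates, key=lambda m: neg_mean_squared_error(m, X, y))
print(best.value)  # 2
```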

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection, set n_algos_tuned = len(model_list).

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. By default, all supported built-in models for the given task are used. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for regression must implement the scikit-learn-style fit and predict methods. Supported built-in models per task:

    regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor

  • adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set): True

  • min_features ( int , float , list or None , default=None ) –

    Minimum number of features to keep. Acceptable values:

    • If int, 0 < min_features <= n_features

    • If float, 0 < min_features <= 1.0

    • If list, names of features to keep, for example ['a', 'b'] means keep features ‘a’ and ‘b’

    • To disable feature selection set min_features = 1.0

    Default value (if not previously set): 1

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Defaults to 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.

    • If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError . AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. Key values must have two parameters: (1) ‘range’ which is a list containing the range and (2) ‘type’ which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for AdaBoostRegressor:

    search_space =  {
        'AdaBoostRegressor' : {
            'learning_rate': {
                'range': [0.05, 1],
                'type': 'continuous'
            },
            'n_estimators': {
                'range': [10, 50],
                'type': 'discrete'
            },
        }
    }
    
    • To disable Model Tune for all models set search_space = {}

    • If a key value is an empty dictionary, then Model Tune is disabled for that key.

    • If None , default search space defined inside AutoML is used.
    

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials, may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : by passing a dictionary you can specify this parameter per algorithm. e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii Default value (if not previously set): 'HyperGD'
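
The callable, tuple and list forms of score_metric described above can be sketched as follows. This is a minimal illustration under stated assumptions, not AutoMLx code: MeanModel is a hypothetical stand-in for a fitted model, neg_mean_abs_error is a hand-written scorer that only follows the documented score_func(model, X, y) signature, and plain lists stand in for pandas objects so the snippet runs on its own.

```python
from statistics import mean


class MeanModel:
    """Hypothetical stand-in for a fitted regressor: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_abs_error(model, X, y):
    """Scorer with the documented signature score_func(model, X, y).

    Returns the negated mean absolute error, so larger values are better.
    """
    predictions = model.predict(X)
    return -mean(abs(p - t) for p, t in zip(predictions, y))


# Tuple form: (name, callable).
named_metric = ("my_neg_mae", neg_mean_abs_error)
# List form: the first entry is the metric the pipeline would optimize.
score_metric = [named_metric, "r2"]

model = MeanModel().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
score = neg_mean_abs_error(model, [[3], [4]], [2.0, 4.0])
```

A list such as score_metric above could then be passed as the score_metric argument; the name my_neg_mae is illustrative only.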

train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline
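
The rules for the cv argument listed above can be summarized in a small sketch. The helper resolve_cv is hypothetical and is not part of AutoMLx; it only restates the documented cases (None, 'auto', an integer, or an iterable of splits) and returns the number of folds, or None when cross-validation is off.

```python
def resolve_cv(cv, n_rows):
    """Illustrative resolution of the cv argument; not the library's code.

    Returns the number of folds, or None when cross-validation is disabled
    and X_valid / y_valid are used for validation instead.
    """
    if cv is None:
        return None
    if isinstance(cv, str):
        if cv != "auto":
            raise ValueError("the only supported string value is 'auto'")
        # Documented rule: 5 folds below one million rows, no CV otherwise.
        return 5 if n_rows < 1_000_000 else None
    if isinstance(cv, int):
        return cv
    # Otherwise treat cv as an iterable of (train_indices, test_indices) pairs.
    return len(list(cv))


folds_small = resolve_cv("auto", n_rows=10_000)
folds_large = resolve_cv("auto", n_rows=2_000_000)
```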

fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline
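
The three shapes accepted by time_budget (a single number of seconds, a per-step dictionary, or -1) can be illustrated with a short sketch. The helper describe_time_budget is hypothetical; the step names are the ones listed above, and rejecting unknown step names is an assumption made for this example rather than documented behavior.

```python
KNOWN_STEPS = {
    "ModelSelection",
    "ModelTune",
    "FeatureSelection",
    "AdaptiveSampling",
    "ThresholdTuning",
}


def describe_time_budget(time_budget):
    """Normalize a time_budget value into a dict of seconds (illustrative only)."""
    if isinstance(time_budget, dict):
        unknown = set(time_budget) - KNOWN_STEPS
        if unknown:
            # Assumption: misspelled step names are treated as errors here.
            raise ValueError(f"unknown step names: {sorted(unknown)}")
        return {step: float(seconds) for step, seconds in time_budget.items()}
    if time_budget == -1:
        # Unconstrained: optimization continues until convergence.
        return {"total": None}
    if isinstance(time_budget, (int, float)) and time_budget > 0:
        return {"total": float(time_budget)}
    raise ValueError("time_budget must be -1, a positive number, or a dict")


per_step = describe_time_budget({"ModelSelection": 30, "ModelTune": 90})
```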

refit ( self , X , y , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

AutoAnomalyDetector

class AutoAnomalyDetector

Anomaly Detection AutoMLPipeline

classes_

Holds the label for each class (for task=classification only, otherwise it is set to None ).

Type

List[Any]

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of original feature names selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_ ; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]
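
The relationship between selected_features_names_ and selected_features_names_raw_ can be sketched as below. The mapping engineered_to_raw and the feature names are hypothetical and chosen for illustration; the code only encodes the documented rule that a raw feature counts as selected when at least one feature engineered from it is selected.

```python
def selected_raw_features(engineered_to_raw, selected_engineered):
    """Raw features with at least one selected engineered feature, in first-seen order."""
    selected = []
    for name in selected_engineered:
        raw = engineered_to_raw[name]
        if raw not in selected:
            selected.append(raw)
    return selected


# Hypothetical mapping from engineered feature names to their raw source column.
engineered_to_raw = {
    "date_year": "date",
    "date_month": "date",
    "age": "age",
    "city_enc": "city",
}
raw = selected_raw_features(engineered_to_raw, ["date_month", "date_year", "age"])
```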

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list
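
Because selected_rows_ is a flat list without CV and a list of lists with CV, code that consumes it may need to handle both shapes. The helper below is a hypothetical sketch that reuses the two example values from the description above.

```python
def unique_selected_rows(selected_rows):
    """Flatten selected_rows_ (flat list, or one list per CV fold) into sorted unique indices."""
    if selected_rows and isinstance(selected_rows[0], (list, tuple)):
        flat = [index for fold in selected_rows for index in fold]
    else:
        flat = list(selected_rows)
    return sorted(set(flat))


no_cv = [0, 1, 5]
cv3 = [[0, 1], [0, 5], [1, 5]]
rows_no_cv = unique_selected_rows(no_cv)
rows_cv3 = unique_selected_rows(cv3)
```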

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model

Type

numpy.ndarray of shape (n_features,)

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )

Configure the AutoAnomalyDetector

If an argument is set to None, its previously configured value is left unchanged; if it has never been set, the default value is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metric: unsupervised_unify95

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y).

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

    unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection set n_algos_tuned = len(model_list) .

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for anomaly detection must follow the pyod interface. (by default, all supported built-in models for a given task are used) Supported built-in models per task:

    anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Default value (if not previously set): 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.

    • If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError . AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. Key values must have two parameters: (1) ‘range’ which is a list containing the range and (2) ‘type’ which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for IsolationForestOD:

    search_space =  {
        'IsolationForestOD' : {
            'n_estimators': {
                'range': [10, 50],
                'type': 'discrete'
            },
            'max_features': {
                'range': [0.5, 0.7],
                'type': 'continuous'
            },
            'max_samples': {
                'range': [5, 10],
                'type': 'discrete'
            }
        }
    }
    
    • To disable Model Tune for all models set search_space = {}

    • If a key value is an empty dictionary, then Model Tune is disabled for that key.

    • If None , default search space defined inside AutoML is used.
    

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials, may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : by passing a dictionary you can specify this parameter per algorithm. e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii Default value (if not previously set): 'HyperGD'
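
Since search_space is a nested dictionary, a quick structural check before passing it to configure can catch typos early. The function check_search_space below is a hypothetical helper, not an AutoMLx API; it only verifies the documented shape (each hyperparameter has a 'range' list and a 'type' that is one of 'continuous', 'discrete' or 'categorical') and is applied to the IsolationForestOD example from above.

```python
ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def check_search_space(search_space):
    """Return a list of problems found in a search_space dict (empty if none)."""
    problems = []
    for algorithm, params in search_space.items():
        # An empty dict is valid: it disables tuning for that algorithm.
        for name, spec in params.items():
            missing = {"range", "type"} - set(spec)
            if missing:
                problems.append(f"{algorithm}.{name}: missing {sorted(missing)}")
                continue
            if spec["type"] not in ALLOWED_TYPES:
                problems.append(f"{algorithm}.{name}: unknown type {spec['type']!r}")
            if not isinstance(spec["range"], list) or not spec["range"]:
                problems.append(f"{algorithm}.{name}: 'range' must be a non-empty list")
    return problems


search_space = {
    "IsolationForestOD": {
        "n_estimators": {"range": [10, 50], "type": "discrete"},
        "max_features": {"range": [0.5, 0.7], "type": "continuous"},
        "max_samples": {"range": [5, 10], "type": "discrete"},
    }
}
problems = check_search_space(search_space)
```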

train ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )

Automatically identifies the most relevant model and hyperparameters for the given set of features ( X ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

  • contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).

Raises

AutoMLxValueError – If contamination has been provided for unsupervised AD

Returns

self

Return type

AutoMLPipeline
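
The constraints on contamination can be expressed as a small pre-check. This is a sketch under stated assumptions: check_contamination is a hypothetical helper, the built-in ValueError stands in for AutoMLxValueError, and treating the bounds 0.0 and 0.5 as inclusive is an assumption because the text above does not say.

```python
def check_contamination(contamination, y_valid=None):
    """Validate contamination against the documented rules (illustrative only)."""
    if contamination is None:
        # Unsupervised anomaly detection: nothing to check.
        return
    if y_valid is None:
        # Documented: contamination is only meant for the supervised case.
        raise ValueError("contamination should only be set when y_valid is provided")
    if not 0.0 <= contamination <= 0.5:
        # Assumption: both bounds are treated as inclusive here.
        raise ValueError("contamination must be between 0.0 and 0.5")


check_contamination(None)
check_contamination(0.1, y_valid=[0, 0, 1])
```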

fit ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

  • contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).

Returns

self

Return type

AutoMLPipeline

refit ( self , X , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

predict_proba ( self , X )

Probability estimates.

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet.

  • AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset.

Returns

y_pred – The predicted probabilities.

Return type

numpy.ndarray of shape = (n_samples, n_classes)

AutoForecaster

class AutoForecaster

Forecasting AutoMLPipeline

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

time_series_period

The seasonality period to force-fit the time series at regardless of whether it is detected in the data.

Type

int or None

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None , time_series_period = None )

Configure the AutoForecaster

If an argument is set to None, its previously configured value is left unchanged; if it has never been set, the default value is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metric: neg_sym_mean_abs_percent_error

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y).

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

    continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error

  • random_state ( int , or None , default=None ) – Random seed used by AutoML. Suggested default: 7

  • n_algos_tuned ( int , or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection set n_algos_tuned = len(model_list) .

    Suggested default: 1

  • model_list ( List [ str ] , or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name (by default, all supported built-in models for a given task are used).

    • All models except VARMAX and DynFactor are applicable when there is a single timeseries in y.

    • If you have multiple timeseries in y that you want to predict as a system, then the multivariate forecasting models VARMAX and DynFactor may be utilized.

    • When you have features or exogenous regressors that you know in advance for your forecast period, pass them into X.

    Supported built-in models per task:

    forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster

  • optimization ( int , or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Suggested default: 3

  • preprocessing ( bool , or None , default=None ) – Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users. Most of the preprocessing can not be turned off for the forecasting task. Suggested default: True

  • search_space ( dict , or None , default=None ) –

    This parameter defines the search space for model tuning. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. Key values must have two parameters: (1) ‘range’ which is a list containing the range and (2) ‘type’ which is one of ‘continuous’, ‘discrete’, ‘categorical’. For example, if the user wishes to provide a custom tune search space for ETSForecaster:

    search_space =  {
        'ETSForecaster' : {
            'error': {
                'range': ['add', 'mul'],
                'type': 'categorical'
            },
            'damped_trend': {
                'range': [True, False],
                'type': 'categorical'
            },
        }
    }
    
    • To disable model tuning for all models set search_space = {}

    • If a key value is an empty dictionary, then model tuning is disabled for that key.

    • If None , default search space defined inside AutoML is used.
    

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials, may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : by passing a dictionary you can specify this parameter per algorithm. e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str ) – The search strategy used in model tuning. Valid search_strategy values: HyperGD, BruteForceSampler, CmaEsSampler, GridSampler, IntersectionSearchSpace, MOTPESampler, NSGAIISampler, PartialFixedSampler, QMCSampler, RandomSampler, TPESampler, intersection_search_space, nsgaii Suggested default: 'HyperGD'

  • time_series_period ( int or None , default=None ) – The seasonality period to force-fit the time series at regardless of whether it is detected in the data. If None, AutoML guesses the seasonality by inspecting the training data. However, users can use this to set it manually instead.
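
The interaction between max_tuning_trials and the number of tuned algorithms can be worked through with a short sketch. The helper total_trial_cap is hypothetical and is not an AutoMLx function; it restates the documented rules (an integer applies per algorithm, a dictionary sets per-algorithm limits, and missing or None entries mean no fixed limit). The documented limit may be exceeded slightly in practice, which this arithmetic does not model.

```python
def total_trial_cap(max_tuning_trials, tuned_algorithms):
    """Upper bound on HPO trials across the tuned algorithms, or None if unbounded."""
    if max_tuning_trials is None:
        return None
    if isinstance(max_tuning_trials, int):
        # An integer is a per-algorithm limit.
        return max_tuning_trials * len(tuned_algorithms)
    caps = [max_tuning_trials.get(name) for name in tuned_algorithms]
    if any(cap is None for cap in caps):
        # A missing entry defaults to None, so the total has no fixed bound.
        return None
    return sum(caps)


tuned = ["ETSForecaster", "ThetaForecaster"]
cap_int = total_trial_cap(50, tuned)
cap_dict = total_trial_cap({"ETSForecaster": 100, "ThetaForecaster": 200}, tuned)
cap_partial = total_trial_cap({"ETSForecaster": 100}, tuned)
```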

fit ( self , y , X = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • y ( pandas.DataFrame ) – Training dataset target.

  • X ( pandas.DataFrame or None , default=None ) – A dataframe of explanatory variables that support the target timeseries in y. These must be known in advance for the forecast period and the training period.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline
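
The model_list guidance given for configure depends on how many timeseries y contains when it is passed to fit. The sketch below is a hypothetical helper, not library code: it splits the documented built-in model names into the two groups named in that guidance. Returning only the multivariate models for several series assumes the series are to be predicted as a system.

```python
UNIVARIATE_MODELS = [
    "NaiveForecaster",
    "ThetaForecaster",
    "ExpSmoothForecaster",
    "ETSForecaster",
    "STLwESForecaster",
    "STLwARIMAForecaster",
    "SARIMAXForecaster",
]
MULTIVARIATE_MODELS = ["VARMAXForecaster", "DynFactorForecaster"]


def candidate_models(n_series):
    """Pick candidate model names from the number of timeseries (columns) in y."""
    if n_series < 1:
        raise ValueError("y must contain at least one timeseries")
    if n_series == 1:
        return list(UNIVARIATE_MODELS)
    # Assumption: several series are to be forecast jointly as a system.
    return list(MULTIVARIATE_MODELS)


single = candidate_models(1)
system = candidate_models(3)
```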

predict ( self , X )

Predict labels for features (X).

Parameters

X ( pandas.DataFrame ) – A dataframe of explanatory variables that support the target timeseries in y

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet

  • AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset

  • AutoMLxRuntimeError – If result of time series numerical inverse transform is None

Returns

y_pred – A data frame containing the predicted values.

Return type

pandas.DataFrame

forecast ( self , periods , alpha = 0.05 , X = None )

Forecast with the selected model.

When X is given, it is a dataframe of explanatory variables that support the forecast for periods timestamps beginning from the last index in y. The index of X must continue from the index that was used in fit.

Parameters
  • periods ( int ) – The number of time steps to forecast from the end of the sample.

  • alpha ( float , default=0.05 ) – A significance level. To receive a prediction interval of 95% alpha must be set to 0.05.

  • X ( pandas.DataFrame , or None , default=None ) – A dataframe of explanatory variables that support forecast for period number of timestamps. Columns must match the ones used in fit .

Returns

summary_frame – A dataframe with three columns listing prediction, ci_lower and ci_upper for the given confidence interval (ci) provided by level of alpha. Note: ci columns are excluded for models that don’t support intervals.

Return type

pandas.DataFrame

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet.

  • AutoMLxValueError – If explanatory variables are not provided, complete, or length of explanatory variables not equal to requested periods.
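
The relationship between alpha and the interval width, and the requirement that X covers exactly the requested periods, can be sketched as a pre-check. The helper check_forecast_inputs and its parameters X_rows and used_exogenous are hypothetical names introduced for this example, and the built-in ValueError stands in for AutoMLxValueError.

```python
def check_forecast_inputs(periods, alpha=0.05, X_rows=None, used_exogenous=False):
    """Return the nominal interval coverage; raise ValueError on inconsistent inputs."""
    if periods < 1:
        raise ValueError("periods must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if used_exogenous:
        if X_rows is None:
            raise ValueError("X is required when explanatory variables were used in fit")
        if X_rows != periods:
            raise ValueError("X must have exactly one row per forecast period")
    # alpha = 0.05 corresponds to a 95% prediction interval.
    return 1.0 - alpha


coverage = check_forecast_inputs(periods=12, alpha=0.05)
```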

score ( self , X , y )

Score of this pipeline for a given set of features ( X ) and labels ( y ). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.

Parameters
  • X ( pd.DataFrame ) – Training dataset features

  • y ( pd.DataFrame , pd.Series ) – Training dataset target

Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

Returns

score – Score of self.predict(X) with respect to y .

Return type

float

transform ( self , X , y )

Apply automatic preprocessing to a given set of features ( X ) and labels ( y ).

Parameters
Raises

AutoMLxNotFittedError – If the pipeline is not fitted.

Returns

Transformed dataset features, transformed dataset timeseries

Return type

(pd.DataFrame or None, pd.DataFrame or pd.Series or None)

plot_forecast ( self , summary_frame , show_y = True , show_pi = True , additional_frames = None )

Plot the forecasts.

Parameters
  • summary_frame ( pd.DataFrame ) – A dataframe containing columns mean, pi_lower (optional) and pi_upper (optional)

  • show_y ( bool , default=True ) – If True, plots training series y

  • show_pi ( bool , default=True ) – If True, plots Prediction Intervals (PI) when available

  • additional_frames ( dictionary of pd.DataFrame , optional ) – Plots the dataframes to the same axes, e.g., additional_frames = dict(label1=dataframe1, label2=dataframe2)

Return type

A plotly figure.

Raises

AutoMLxValueError – If summary dataframe column names are incorrect.