AutoML
The AutoMLx Python package automatically creates, optimizes, and explains machine learning pipelines and models. The AutoML pipeline provides a tuned ML pipeline that finds the best model for a given training dataset and prediction task. AutoML has a simple pipeline-level Python API that quickly jump-starts the data science process with an accurate, tuned model. AutoML supports the following tasks:
Supervised classification or regression on tabular datasets, where the target is a binary or multi-class value (classification) or a real-valued column (regression).
Supervised classification for image and text datasets.
Unsupervised anomaly detection, where the target or the labels are not provided.
Univariate and multivariate (single or multiple targets) time series forecasting.
Recommendation, based on interaction data between users and items.
The AutoML pipeline consists of five major stages: preprocessing, algorithm selection, adaptive sampling, feature selection, and model tuning.
These stages are combined into a single AutoML pipeline that automatically optimizes the whole workflow with limited user input.
Pipeline
- Pipeline ( task = 'classification' , dataset_format = 'pandas' , score_metric = None , random_state = 7 , n_algos_tuned = 1 , model_list = None , preprocessing = True , search_space = None , max_tuning_trials = None , search_strategy = 'HyperGD' , ** kwargs )
-
Create an AutoMLPipeline based on the task and dataset type.
- Parameters :
-
-
task ( str , default='classification' ) – Machine learning task, supported: classification, regression, anomaly_detection, forecasting, recommendation
-
dataset_format ( str , default='pandas' ) – Determine the type of input/output dataset. Defaults to pandas
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: it will be determined automatically depending on the task. Default score metrics:
classification: binary: neg_log_loss, multiclass: neg_log_loss
regression: neg_mean_squared_error
forecasting: neg_sym_mean_abs_percent_error
anomaly_detection: unsupervised_unify95
recommendation: hit_rate
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
- More information on scoring metrics can be found here: Classification metrics. Note: Scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
- More information on scoring metrics can be found here :
recommendation – hit_rate, hits, precision, recall, map, ndcg, auc
* More information on scoring metrics can be found here in the documentation of the AutoRecommender class.
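The callable, tuple, and list forms of score_metric described above can be sketched as follows. This is a minimal illustration, not AutoMLx code: ConstantModel, macro_recall, and the metric name "my_macro_recall" are hypothetical, and the macro recall is computed by hand in the spirit of the recall_macro note rather than by calling scikit-learn.

```python
from typing import Any, Callable, List, Sequence, Tuple


def macro_recall(y_true: Sequence[Any], y_pred: Sequence[Any]) -> float:
    """Unweighted mean of the per-class recall (the 'macro' average)."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions where this class is the true label.
        positions = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in positions if y_pred[i] == label)
        recalls.append(hits / len(positions))
    return sum(recalls) / len(recalls)


def score_func(model: Any, X: Any, y: Sequence[Any]) -> float:
    """Callable form: a score function with signature score_func(model, X, y)."""
    return macro_recall(list(y), list(model.predict(X)))


# Tuple form: (name of the metric, callable with the same signature).
named_metric: Tuple[str, Callable[..., float]] = ("my_macro_recall", score_func)

# List form: the first entry is the one the pipeline optimizes for.
score_metric: List[Any] = [named_metric, "accuracy"]


class ConstantModel:
    """Hypothetical stand-in model that always predicts the same label."""

    def __init__(self, label: Any) -> None:
        self.label = label

    def predict(self, X: Sequence[Any]) -> list:
        return [self.label for _ in X]


X_demo = [[0], [1], [2], [3]]
y_demo = ["a", "a", "b", "b"]
# Class "a" is fully recalled, class "b" is never predicted.
demo_score = score_func(ConstantModel("a"), X_demo, y_demo)
print(demo_score)
```

The resulting score_metric list is what would be passed as the score_metric argument; the stand-in model only exists to exercise the callable.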
-
-
random_state ( int , default=7 ) – Random seed used by AutoML.
-
n_algos_tuned ( int , default=1 ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
-
model_list ( List [ Model | str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for regression and classification must implement the scikit-learn-style fit and predict methods. Classification models must also support predict_proba. Anomaly detection models must follow the pyod interface. By default, all supported built-in models for a given task are used. Supported built-in models per task:
classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier
regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor
anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder
forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster, ExtraTreesForecaster, XGBForecaster, LGBMForecaster
recommendation – AlsRecommender, ItemKNNRecommender, BprRecommender, TRexxRecommender
-
preprocessing ( bool , default=True ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and normalize if desired) the dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError. AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).
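For users who set preprocessing=False and clean the data themselves, the numeric part of the steps described above can be approximated in plain Python. This is only a sketch under stated assumptions: the helper name preprocess_numeric is hypothetical, missing values are represented as None, imputation is assumed to happen before scaling, and the scaling mimics StandardScaler by using the population standard deviation. It is not the AutoML preprocessor itself.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Optional

# Features with more than 20 percent missing values are dropped.
MISSING_THRESHOLD = 0.20


def preprocess_numeric(
    columns: Dict[str, List[Optional[float]]]
) -> Dict[str, List[float]]:
    """Drop sparse columns, impute the mean, then scale to mean 0 and variance 1."""
    result: Dict[str, List[float]] = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > MISSING_THRESHOLD:
            continue  # too many missing values: ignore this feature
        observed = [v for v in values if v is not None]
        mean = fmean(observed)
        # Mean imputation leaves the column mean unchanged.
        filled = [mean if v is None else v for v in values]
        std = pstdev(filled)  # population standard deviation
        scale = std if std > 0 else 1.0  # constant columns are only centered
        result[name] = [(v - mean) / scale for v in filled]
    return result


demo = preprocess_numeric(
    {
        "age": [1.0, None, 3.0, 2.0, 2.0],             # 20% missing: kept
        "mostly_missing": [None, None, 1.0, None, 2.0],  # 60% missing: dropped
    }
)
print(demo)
```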
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:
-
Type 1: each hyperparameter maps to a dictionary with two entries: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for LogisticRegression:
    search_space = {
        'LogisticRegression': {
            'C': {'range': [0.03125, 512], 'type': 'continuous'},
            'solver': {
                'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                'type': 'categorical'
            },
            'class_weight': {
                'range': [None, 'balanced'], 'type': 'categorical'
            }
        }
    }
-
Type 2: Fixed values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for LogisticRegression:
    search_space = {
        'LogisticRegression': {
            'C': 0.03125, 'solver': 'newton-cg'
        }
    }
-
Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.
-
Type 4: Mixed configuration, where some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'LogisticRegression': {
            'C': 0.03125, 'solver': 'newton-cg',
            'class_weight': {
                'range': [None, 'balanced'], 'type': 'categorical'
            }
        }
    }
-
To disable Model Tune for all models, set search_space = {}.
-
If None, the default search space defined inside AutoML is used.
-
If all the hyperparameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.
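The search_space layouts described above can be written as a plain Python dictionary. The sketch below builds a mixed (Type 4) entry and an empty (Type 3) entry, and uses a hypothetical helper, split_fixed_and_tuned, to show which hyperparameters would be fixed and which would be tuned. The helper is illustrative only and is not part of AutoMLx.

```python
from typing import Any, Dict, Tuple

VALID_TYPES = {"continuous", "discrete", "categorical"}

search_space: Dict[str, Dict[str, Any]] = {
    # Type 4 (mixed): C and solver are fixed, class_weight is tuned.
    "LogisticRegression": {
        "C": 0.03125,
        "solver": "newton-cg",
        "class_weight": {"range": [None, "balanced"], "type": "categorical"},
    },
    # Type 3: an empty dictionary disables Model Tune for this model.
    "GaussianNB": {},
}


def split_fixed_and_tuned(
    model_space: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate fixed hyperparameter values from 'range'/'type' search specs."""
    fixed: Dict[str, Any] = {}
    tuned: Dict[str, Any] = {}
    for name, spec in model_space.items():
        if isinstance(spec, dict) and set(spec) == {"range", "type"}:
            if spec["type"] not in VALID_TYPES:
                raise ValueError(f"unknown search type for {name!r}: {spec['type']!r}")
            tuned[name] = spec
        else:
            fixed[name] = spec
    return fixed, tuned


fixed_lr, tuned_lr = split_fixed_and_tuned(search_space["LogisticRegression"])
print(fixed_lr)        # hyperparameters that keep a single value
print(list(tuned_lr))  # hyperparameters that still have a search space
```

Keeping fixed values and search specs in the same dictionary means a model can move from fully tuned to fully fixed one hyperparameter at a time.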
-
max_tuning_trials ( int , dict or None , default=None ) – The maximum number of HPO trials, may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.
-
If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
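The arithmetic behind max_tuning_trials can be illustrated with two small helpers. Both function names are hypothetical and exist only for this sketch. A cap of None stands for "AutoML decides automatically", so the total is reported as None whenever any tuned algorithm has no explicit cap.

```python
from typing import Dict, List, Optional, Union

TrialSetting = Union[int, Dict[str, int], None]


def trial_caps(
    max_tuning_trials: TrialSetting, tuned_algorithms: List[str]
) -> Dict[str, Optional[int]]:
    """Resolve the per-algorithm cap for each tuned algorithm."""
    if max_tuning_trials is None:
        return {name: None for name in tuned_algorithms}
    if isinstance(max_tuning_trials, dict):
        # Algorithms missing from the dictionary default to None.
        return {name: max_tuning_trials.get(name) for name in tuned_algorithms}
    return {name: int(max_tuning_trials) for name in tuned_algorithms}


def total_trial_cap(caps: Dict[str, Optional[int]]) -> Optional[int]:
    """Sum the caps, or return None if any algorithm has no explicit cap."""
    if any(cap is None for cap in caps.values()):
        return None
    return sum(cap for cap in caps.values() if cap is not None)


algos = ["LogisticRegression", "RandomForestClassifier"]
int_total = total_trial_cap(trial_caps(50, algos))
dict_caps = trial_caps({"LogisticRegression": 100}, algos)
print(int_total, dict_caps)
```

Because the documented limit "may be exceeded slightly", these totals are best read as approximate upper bounds.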
search_strategy ( str , default='HyperGD' ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD
-
kwargs ( Any ) –
Optional arguments. You can find a list of arguments related to each task in the corresponding configure method:
-
automlx.AutoClassifier.configure for ‘classification’
-
automlx._interface.regressor.AutoRegressor.configure for ‘regression’
-
automlx._interface.anomaly_detector.AutoAnomalyDetector.configure for ‘anomaly_detection’
-
automlx._interface.forecaster.AutoForecaster.configure for ‘forecasting’
-
-
- Raises :
-
AutoMLxValueError – If the given task is not supported or the provided dataset format is not supported.
- Returns :
-
An AutoMLPipeline for the given task:
-
automlx._interface.classifier.AutoClassifier for ‘classification’
-
automlx._interface.regressor.AutoRegressor for ‘regression’
-
automlx._interface.anomaly_detector.AutoAnomalyDetector for ‘anomaly_detection’
-
automlx._interface.forecaster.Forecaster for ‘forecasting’
-
automlx.express.recommender.AutoRecommender for ‘recommendation’
-
- Return type :
-
AutoMLPipeline
AutoClassifier
- class AutoClassifier
-
Classifier AutoMLPipeline
- classes_
-
Holds the label for each class (for task=classification only, otherwise it is set to None).
- Type :
-
List[Any]
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute is a list of lists corresponding to the indices selected in each fold. For example, in the case of no CV, this attribute looks like [0, 1, 5], indicating that indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like [[0, 1], [0, 5], [1, 5]], indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None, given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as inter_model_parallelism * intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- threshold_tuning_score_
-
The validation score of the pipelines after applying threshold tuning. The scoring metric used to select this threshold can be found in threshold_tuning_scorer_ . It is None when the task is not classification or threshold_tuning is False.
- threshold_tuning_scorer_
-
The scoring metric used to select threshold during threshold tuning. It is None when the task is not classification or threshold_tuning is False.
- Type :
-
Metric
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , min_class_instances = None , max_tuning_trials = None , search_strategy = None , threshold_tuning = None )
-
Configure the AutoClassifier.
If an argument is set to None, its current value is left unchanged; if it was never set, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If None: it will be determined automatically depending on the task. Default score metrics: binary: neg_log_loss, multiclass: neg_log_loss
-
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature score_func(model, X, y).
-
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
- More information on scoring metrics can be found here: Classification metrics. Note: Scoring variations like recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").
-
-
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for classification must implement the scikit-learn-style fit, predict, and predict_proba methods. By default, all supported built-in models for a given task are used. Supported built-in models per task:
classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier
-
adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set):
True
-
min_features ( int , float , list or None , default=None ) –
Minimum number of features to keep. Acceptable values:
-
If int, 0 < min_features <= n_features
-
If float, 0 < min_features <= 1.0
-
If list, names of features to keep, for example ['a', 'b'] means keep features ‘a’ and ‘b’
-
To disable feature selection, set min_features = 1.0
Default value (if not previously set): 1
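The accepted forms of min_features differ in a subtle way: the integer 1 keeps at least one feature, while the float 1.0 keeps all of them and so disables feature selection. The hypothetical helper below validates each form against the documented ranges and returns the implied minimum number of features. Rounding a fractional count up is an assumption of this sketch, not documented behaviour.

```python
import math
from typing import List, Sequence, Union


def min_feature_count(
    min_features: Union[int, float, List[str]], feature_names: Sequence[str]
) -> int:
    """Return the minimum number of features implied by min_features."""
    n_features = len(feature_names)
    if isinstance(min_features, bool):
        raise TypeError("min_features must be an int, float, or list of names")
    if isinstance(min_features, int):
        if not 0 < min_features <= n_features:
            raise ValueError("int min_features must satisfy 0 < value <= n_features")
        return min_features
    if isinstance(min_features, float):
        if not 0 < min_features <= 1.0:
            raise ValueError("float min_features must satisfy 0 < value <= 1.0")
        # Assumption: a fractional feature count is rounded up.
        return math.ceil(min_features * n_features)
    names = list(min_features)
    unknown = [name for name in names if name not in feature_names]
    if unknown:
        raise ValueError(f"unknown feature names: {unknown}")
    return len(names)


columns = ["a", "b", "c", "d"]
print(min_feature_count(1, columns))           # int: at least one feature
print(min_feature_count(1.0, columns))         # float: every feature is kept
print(min_feature_count(["a", "b"], columns))  # list: the named features
```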
-
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Default value (if not previously set): 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and normalize if desired) the dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError. AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).
Default value (if not previously set): True
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:
-
Type 1: each hyperparameter maps to a dictionary with two entries: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for LogisticRegression:
    search_space = {
        'LogisticRegression': {
            'C': {'range': [0.03125, 512], 'type': 'continuous'},
            'solver': {
                'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                'type': 'categorical'
            },
            'class_weight': {
                'range': [None, 'balanced'], 'type': 'categorical'
            }
        }
    }
-
Type 2: Fixed values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for LogisticRegression:
    search_space = {
        'LogisticRegression': {
            'C': 0.03125, 'solver': 'newton-cg'
        }
    }
-
Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.
-
Type 4: Mixed configuration, where some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'LogisticRegression': {
            'C': 0.03125, 'solver': 'newton-cg',
            'class_weight': {
                'range': [None, 'balanced'], 'type': 'categorical'
            }
        }
    }
-
To disable Model Tune for all models, set search_space = {}.
-
If None, the default search space defined inside AutoML is used.
-
If all the hyperparameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.
-
min_class_instances ( int or None , default=None ) – The minimum number of instances every class must have when doing classification. If any class has fewer than this number of instances, training is stopped. This argument may take any value of 2 or higher. Default value (if not previously set):
5
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials, may be exceeded slightly.
-
If None: AutoML automatically determines when enough HPO trials have been completed.
-
If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials are performed in total.
-
If a dict: by passing a dictionary you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD Default value (if not previously set):
'HyperGD'
-
threshold_tuning ( bool or None , default=None ) –
Determine whether or not AutoML optimizes the prediction threshold. Threshold tuning is only used in classification tasks. However, unlike classic threshold tuning, AutoML uses a novel technique that increases or decreases the model’s prediction probabilities for a given class, thereby keeping the prediction probability fixed to 0.5 for binary classification and allowing the method to generalize to multi-class classification problems.
-
If True, the prediction threshold will be optimized based on the provided score metric. Threshold tuning allows users to post-process classification model predictions to optimize for their custom metric. Threshold tuning will not be exported to ONNX models; therefore, the ONNX model quality may be lower than that of the original model.
-
If False, threshold tuning is not applied.
Default value (if not previously set): False
-
-
- Raises :
-
AutoMLxValueError – If min_class_instances is less than 2.
- fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model, and hyperparameters for the given training data (X) and target (y). The final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
'auto': uses 5 folds if the number of instances < 1M, disables cv-folds otherwise
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float:
-
Time budget in seconds.
- If Dict[str, float]:
-
Time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
-
-1 for an unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
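The rule for cv='auto' in fit can be expressed as a small function. The sketch below uses a hypothetical helper, resolve_cv_folds, that only covers the None, 'auto', and integer forms (an iterable of splits is not handled). It returns the number of folds, or None when cross-validation is disabled and the X_valid/y_valid pair is used instead. Requiring at least 2 folds follows the usual K-fold constraint and is an assumption here.

```python
from typing import Optional, Union

AUTO_CV_FOLDS = 5
AUTO_CV_MAX_INSTANCES = 1_000_000  # 'auto' disables CV at 1M instances or more


def resolve_cv_folds(cv: Union[int, str, None], n_instances: int) -> Optional[int]:
    """Return the number of CV folds, or None if cross-validation is disabled."""
    if cv is None:
        return None  # validation relies on X_valid and y_valid
    if cv == "auto":
        return AUTO_CV_FOLDS if n_instances < AUTO_CV_MAX_INSTANCES else None
    if isinstance(cv, int) and not isinstance(cv, bool) and cv >= 2:
        return cv
    raise ValueError(f"unsupported cv value in this sketch: {cv!r}")


print(resolve_cv_folds("auto", 50_000))     # small dataset: 5 folds
print(resolve_cv_folds("auto", 2_000_000))  # large dataset: CV disabled
print(resolve_cv_folds(3, 50_000))          # explicit fold count
```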
- predict ( self , X )
-
Predict labels for features (X).
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
-
AutoMLxRuntimeError – If there are no predictions after calling the selected model over the given dataset
-
- Returns :
-
y_pred – The predicted values.
- Return type :
-
numpy.ndarray of shape (n_samples,)
- predict_proba ( self , X )
-
Probability estimates.
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset
- Returns :
-
y_pred – The predicted probabilities.
- Return type :
-
numpy.ndarray of shape = (n_samples, n_classes)
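The relationship between the (n_samples, n_classes) output of predict_proba and class labels can be sketched as an argmax over each row. This assumes that the probability columns follow the order of classes_ (the usual scikit-learn convention, not stated here) and that threshold tuning is disabled. The helper name is hypothetical, and plain lists stand in for the numpy array.

```python
from typing import Any, List, Sequence


def labels_from_probabilities(
    proba_rows: Sequence[Sequence[float]], classes: Sequence[Any]
) -> List[Any]:
    """Pick, for each sample, the class with the highest probability."""
    labels = []
    for row in proba_rows:
        if len(row) != len(classes):
            raise ValueError("each row needs one probability per class")
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


classes_demo = ["no", "yes"]              # stands in for classes_
proba_demo = [[0.2, 0.8], [0.9, 0.1]]     # stands in for predict_proba(X)
labels_demo = labels_from_probabilities(proba_demo, classes_demo)
print(labels_demo)
```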
- score ( self , X , y )
-
Score of this pipeline for a given set of features (X) and labels (y). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- Returns :
-
score – Score of self.predict(X) with respect to y.
- Return type :
- transform ( self , X , y = None )
-
Apply automatic preprocessing to a given set of features (X) and labels (y).
- Parameters :
-
-
X ( pandas.DataFrame ) – Dataset features
-
y ( pandas.DataFrame , pandas.Series or None , default=None ) – Dataset target
-
- Returns :
-
-
X ( pandas.DataFrame ) – Transformed dataset features
-
y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- refit ( self , X , y , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are re-used. fit must have been called before calling this method. If a validation set is provided, it will be concatenated with the training set before doing the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant model and hyperparameters for the given set of features (X) and target (y). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
'auto': uses 5 folds if the number of instances < 1M, disables cv-folds otherwise
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with a col_type of ‘image’ should be columns containing images in PIL format. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float:
-
Time budget in seconds.
- If Dict[str, float]:
-
Time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
-
-1 for an unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
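The two forms of time_budget accepted by train and fit can be prepared as shown below. check_time_budget is a hypothetical helper that only rejects misspelled step names before the value is handed to the pipeline. How steps that are absent from the dictionary are budgeted is not covered by this sketch.

```python
from typing import Dict, Union

# Step names listed in the time_budget description above.
STEP_NAMES = (
    "ModelSelection",
    "ModelTune",
    "FeatureSelection",
    "AdaptiveSampling",
    "ThresholdTuning",
)

TimeBudget = Union[float, Dict[str, float]]


def check_time_budget(time_budget: TimeBudget) -> TimeBudget:
    """Validate step names for the dict form; pass the float form through."""
    if isinstance(time_budget, dict):
        unknown = sorted(set(time_budget) - set(STEP_NAMES))
        if unknown:
            raise ValueError(f"unknown step names: {unknown}")
        return {step: float(seconds) for step, seconds in time_budget.items()}
    return float(time_budget)


total_budget = check_time_budget(600)  # 600 seconds overall
per_step_budget = check_time_budget({"ModelSelection": 60, "ModelTune": 300})
unconstrained = check_time_budget(-1)  # -1: best effort mode
print(total_budget, per_step_budget, unconstrained)
```

Either validated value would then be passed as the time_budget argument of train or fit.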
AutoRegressor
- class AutoRegressor
-
Regressor AutoMLPipeline
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute is a list of lists corresponding to the indices selected in each fold. For example, in the case of no CV, this attribute looks like [0, 1, 5], indicating that indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like [[0, 1], [0, 5], [1, 5]], indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None, given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as inter_model_parallelism * intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )
-
Configure the AutoRegressor
If an argument is set to None, its current value is not changed; if it has not been set previously, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
- If None: the metric is determined automatically depending on the task. Default score metric: neg_mean_squared_error
- If a list: should be a list of str, callable or tuple. The first score metric in the list is the one the pipeline optimizes.
- If a callable: a score function (or loss function) with signature score_func(model, X, y).
- If a tuple: should be a tuple of two values with types (str, callable). The string is the name of the scoring metric, and the callable should have the same signature as above.
- If a string: the scoring metric is inferred from the string. Supported metrics per task:
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
- More information on scoring metrics can be found here :
-
-
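A custom metric passed as a callable, or as a (str, callable) tuple, has to follow the score_func(model, X, y) signature described for score_metric. The sketch below is illustrative only: MeanModel is a hypothetical stand-in with a scikit-learn-style predict method, and plain lists replace the pandas objects AutoML would pass. It assumes, as the neg_ prefix of the built-in metric names suggests, that higher scores are better.

```python
from typing import List, Sequence


class MeanModel:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "MeanModel":
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [self.mean_ for _ in X]


def neg_mean_abs_error(model, X, y) -> float:
    """Score function with the signature score_func(model, X, y).

    Returns the negated mean absolute error, so larger is better.
    """
    predictions = model.predict(X)
    return -sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 6.0]
model = MeanModel().fit(X, y)  # the training mean is 3.0
score = neg_mean_abs_error(model, X, y)

# The same callable can be given a name by wrapping it in a (str, callable) tuple.
named_metric = ("my_neg_mae", neg_mean_abs_error)
```

The name my_neg_mae is made up for this example; either neg_mean_abs_error or named_metric could then be supplied as score_metric.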
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. By default, all supported built-in models for the given task are used. Custom models must have their hyperparameter configuration spaces defined in search_space, and custom models for regression must implement the scikit-learn-style fit and predict methods. Supported built-in models per task:
regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor
-
adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set):
True
-
min_features ( int , float , list or None , default=None ) –
Minimum number of features to keep. Acceptable values:
-
If int, 0 < min_features <= n_features
-
If float, 0 < min_features <= 1.0
-
If list, names of features to keep; for example, ['a', 'b'] means keep features 'a' and 'b'
-
To disable feature selection, set min_features = 1.0
Default value (if not previously set): 1
-
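The int and float forms of min_features are easy to confuse: the integer 1 and the float 1.0 mean different things. The following sketch checks the acceptable values listed for min_features. It is not AutoMLx code: min_feature_count is a hypothetical helper, and treating a float as a fraction of the features (rounded up) is an assumption based on the stated range and on min_features = 1.0 disabling feature selection.

```python
import math
from typing import Sequence, Union


def min_feature_count(
    min_features: Union[int, float, Sequence[str]],
    feature_names: Sequence[str],
) -> int:
    """Translate min_features into a minimum number of features to keep."""
    n_features = len(feature_names)
    if isinstance(min_features, bool):
        raise TypeError("min_features must be an int, float or list, not bool")
    if isinstance(min_features, int):
        if not 0 < min_features <= n_features:
            raise ValueError("int min_features must satisfy 0 < value <= n_features")
        return min_features
    if isinstance(min_features, float):
        if not 0 < min_features <= 1.0:
            raise ValueError("float min_features must satisfy 0 < value <= 1.0")
        # Assumption: a float is a fraction of the features, rounded up.
        return math.ceil(min_features * n_features)
    unknown = [name for name in min_features if name not in feature_names]
    if unknown:
        raise ValueError(f"unknown feature names: {unknown}")
    return len(set(min_features))


names = ["a", "b", "c", "d"]
at_least_one = min_feature_count(1, names)    # int 1: keep at least one feature
keep_all = min_feature_count(1.0, names)      # float 1.0: keep every feature
keep_half = min_feature_count(0.5, names)
keep_named = min_feature_count(["a", "b"], names)
```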
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Defaults to 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded, and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and, if desired, normalize) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).
Default value (if not previously set): True
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. It is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:
- Type 1: Each hyperparameter's search space must have two parameters: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, to provide a custom search space for AdaBoostRegressor:
    search_space = {
        'AdaBoostRegressor': {
            'learning_rate': {'range': [0.05, 1], 'type': 'continuous'},
            'n_estimators': {'range': [10, 50], 'type': 'discrete'},
        }
    }
- Type 2: Fixed values, where hyperparameters are set to a single value. For example, to fix the hyperparameters of AdaBoostRegressor:
    search_space = {
        'AdaBoostRegressor': {'learning_rate': 0.984, 'n_estimators': 30}
    }
- Type 3: If the search space of a model is an empty dictionary, Model Tune is disabled for that model.
- Type 4: Mixed configuration, where some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'AdaBoostRegressor': {
            'learning_rate': 0.984,
            'n_estimators': {'range': [10, 50], 'type': 'discrete'},
        }
    }
- To disable Model Tune for all models, set search_space = {}
- If None, the default search space defined inside AutoML is used.
- If all the hyperparameters are fixed for a model, the HyperParameterOptimization step is skipped for that model. Otherwise, the remaining non-fixed parameters are tuned.
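To see which hyperparameters of a mixed search space would still be tuned, the entries can be split by their form. This is a minimal sketch under the rules stated for search_space, not the library's own logic: split_fixed_and_tuned is a hypothetical helper, and it treats any dict with both 'range' and 'type' as a search range and everything else as a fixed value.

```python
from typing import Any, Dict, List, Tuple


def split_fixed_and_tuned(model_space: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Split one model's search space into fixed and tuned hyperparameter names."""
    fixed, tuned = [], []
    for name, value in model_space.items():
        if isinstance(value, dict) and {"range", "type"} <= set(value):
            tuned.append(name)
        else:
            fixed.append(name)
    return fixed, tuned


search_space = {
    "AdaBoostRegressor": {
        "learning_rate": 0.984,                                   # fixed
        "n_estimators": {"range": [10, 50], "type": "discrete"},  # tuned
    },
    "LinearSVR": {},  # empty dictionary: tuning disabled for this model
}

fixed, tuned = split_fixed_and_tuned(search_space["AdaBoostRegressor"])

# A model has something left to tune only if at least one range entry remains.
tuning_needed = {
    name: bool(split_fixed_and_tuned(space)[1])
    for name, space in search_space.items()
}
```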
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; this limit may be exceeded slightly.
- If None: AutoML automatically determines when enough HPO trials have been completed.
- If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
- If a dict: specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Algorithms missing from the dictionary default to None.
Default value (if not previously set): None
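The per-algorithm semantics of max_tuning_trials can be made concrete with a small calculation. The sketch below is illustrative and not part of AutoMLx; trial_limits and total_trial_bound are hypothetical helpers, and the resulting bound is nominal because the limit may be exceeded slightly.

```python
from typing import Dict, List, Optional, Union

MaxTrials = Union[int, Dict[str, Optional[int]], None]


def trial_limits(max_tuning_trials: MaxTrials, tuned_algos: List[str]) -> Dict[str, Optional[int]]:
    """Per-algorithm trial limit; None means AutoML decides when to stop."""
    if isinstance(max_tuning_trials, dict):
        # Algorithms missing from the dict fall back to None.
        return {algo: max_tuning_trials.get(algo) for algo in tuned_algos}
    return {algo: max_tuning_trials for algo in tuned_algos}


def total_trial_bound(limits: Dict[str, Optional[int]]) -> Optional[int]:
    """Nominal bound on total trials, or None if any algorithm is unbounded."""
    if any(limit is None for limit in limits.values()):
        return None
    return sum(limits.values())


algos = ["LGBMRegressor", "XGBRegressor"]  # corresponds to n_algos_tuned == 2
int_limits = trial_limits(50, algos)
dict_limits = trial_limits({"LGBMRegressor": 100}, algos)
int_total = total_trial_bound(int_limits)
dict_total = total_trial_bound(dict_limits)
```

With an integer limit of 50 and two tuned algorithms, the nominal total is 2 * 50 = 100 trials; with the dictionary, XGBRegressor is left to AutoML, so no total bound exists.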
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD Default value (if not previously set):
'HyperGD'
-
- fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X) and target (y). The final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
'auto': uses 5 folds if the number of instances is < 1M; otherwise disables CV folds
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
- -1 for an unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
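The iterable form of cv and the dictionary form of time_budget described for fit can be prepared ahead of the call. The sketch below builds both without calling AutoML. contiguous_folds is a hypothetical helper, plain index lists stand in for the index arrays, and whether steps may be omitted from the time_budget dictionary is an assumption.

```python
from typing import Dict, Iterator, List, Tuple


def contiguous_folds(n_rows: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, non-shuffled folds."""
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size


# Per-step budget in seconds; the step names come from the time_budget description.
time_budget: Dict[str, float] = {"ModelSelection": 60.0, "ModelTune": 300.0}

splits = list(contiguous_folds(n_rows=7, n_folds=3))

# Hypothetical call, assuming `pipeline`, `X` and `y` already exist:
# pipeline.fit(X, y, cv=splits, time_budget=time_budget)
```

Each row index appears in exactly one test fold, which is the property a custom split iterable would normally be expected to keep.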
- predict ( self , X )
-
Predict labels for features (X).
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
-
AutoMLxRuntimeError – If there are no predictions after calling the selected model on the given dataset
-
- Returns :
-
y_pred – The predicted values.
- Return type :
-
numpy.ndarray of shape (n_samples,)
- score ( self , X , y )
-
Score of this pipeline for a given set of features (X) and labels (y). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- Returns :
-
score – Score of self.predict(X) with respect to y.
- Return type :
- transform ( self , X , y = None )
-
Apply automatic preprocessing to a given set of features (X) and labels (y).
- Parameters :
-
-
X ( pandas.DataFrame ) – Dataset features
-
y ( pandas.DataFrame , pandas.Series or None , default=None ) – Dataset target
-
- Returns :
-
-
X ( pandas.DataFrame ) – Transformed dataset features
-
y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- refit ( self , X , y , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are reused. fit must have been called before calling this method. If a validation set is provided, it is concatenated with the training set before the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant model and hyperparameters for the given set of features (X) and target (y). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
'auto': uses 5 folds if the number of instances is < 1M; otherwise disables CV folds
-
integer: specifies the number of folds in a (Stratified)KFold
-
iterable: yields (train, test) splits as arrays of indices
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
- -1 for an unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
AutoAnomalyDetector
- class AutoAnomalyDetector
-
Anomaly Detection AutoMLPipeline
- classes_
-
Holds the label for each class (for task=classification only; otherwise it is set to None).
- Type :
-
List[Any]
- selected_features_names_
-
Names of the engineered features selected by the AutoML pipeline.
- Type :
-
List[ str ]
- selected_features_names_raw_
-
Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, this corresponds to selected_features_names_; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.
- Type :
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- selected_rows_
-
List of indices in the original training dataset provided to AutoML, corresponding to the rows sampled during Adaptive Sampling. When cross-validation (CV) is used, this attribute is a list of lists, one per fold. For example, without CV the attribute looks like [0, 1, 5], indicating that indices 0, 1, and 5 were selected during adaptive sampling. With CV=3, it looks like [[0, 1], [0, 5], [1, 5]], indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type :
- selected_valid_rows_
-
List of indices in the original validation dataset (if CV==None) provided to AutoML, corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the value is always None, because Adaptive Sampling does not sample the validation set when CV is enabled.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- n_jobs_
-
Parallelism used internally by AutoML, calculated as inter_model_parallelism * intra_model_parallelism.
- Type :
- feature_importances_
-
Importance of each feature in the dataset for the selected model
- Type :
-
numpy.ndarray of shape (n_features,)
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )
-
Configure the AutoAnomalyDetector
If an argument is set to None, its current value is not changed; if it has not been set previously, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
- If None: the metric is determined automatically depending on the task. Default score metric: unsupervised_unify95
- If a list: should be a list of str, callable or tuple. The first score metric in the list is the one the pipeline optimizes.
- If a callable: a score function (or loss function) with signature score_func(model, X, y).
- If a tuple: should be a tuple of two values with types (str, callable). The string is the name of the scoring metric, and the callable should have the same signature as above.
- If a string: the scoring metric is inferred from the string. Supported metrics per task:
unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss
-
random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set):
7
-
n_algos_tuned ( int or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set): 1
-
-
model_list ( List [ str | Any ] or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. By default, all supported built-in models for the given task are used. Custom models must have their hyperparameter configuration spaces defined in search_space, and custom models for anomaly detection must follow the pyod interface. Supported built-in models per task:
anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder
-
optimization ( int or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Default value (if not previously set): 3
-
-
preprocessing ( bool or None , default=None ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded, and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.
-
If False, the user must cleanse (and, if desired, normalize) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError. AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).
Default value (if not previously set): True
-
-
search_space ( dict or None , default=None ) –
This parameter defines the Model Tuning search space. It is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:
- Type 1: Each hyperparameter's search space must have two parameters: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, to provide a custom search space for IsolationForestOD:
    search_space = {
        'IsolationForestOD': {
            'n_estimators': {'range': [10, 50], 'type': 'discrete'},
            'max_features': {'range': [0.5, 0.7], 'type': 'continuous'},
            'max_samples': {'range': [5, 10], 'type': 'discrete'},
        }
    }
- Type 2: Fixed values, where hyperparameters are set to a single value. For example, to fix the hyperparameters of IsolationForestOD:
    search_space = {
        'IsolationForestOD': {'n_estimators': 10, 'max_features': 0.5, 'max_samples': 10}
    }
- Type 3: If the search space of a model is an empty dictionary, Model Tune is disabled for that model.
- Type 4: Mixed configuration, where some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'IsolationForestOD': {
            'n_estimators': {'range': [10, 50], 'type': 'discrete'},
            'max_features': 0.5,
            'max_samples': 10,
        }
    }
- To disable Model Tune for all models, set search_space = {}
- If None, the default search space defined inside AutoML is used.
- If all the hyperparameters are fixed for a model, the HyperParameterOptimization step is skipped for that model. Otherwise, the remaining non-fixed parameters are tuned.
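A hand-written search space is easy to get subtly wrong, for example by misspelling the 'type' value. The check below is a minimal sketch based only on the rules stated for search_space. search_space_problems is a hypothetical helper rather than an AutoMLx function, and it assumes that every dict-valued entry is meant to be a search range.

```python
from typing import Any, Dict, List

ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def search_space_problems(search_space: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in range-style entries."""
    problems = []
    for model, params in search_space.items():
        for name, spec in params.items():
            if not isinstance(spec, dict):
                continue  # fixed value: nothing to check
            missing = {"range", "type"} - set(spec)
            if missing:
                problems.append(f"{model}.{name}: missing {sorted(missing)}")
            elif spec["type"] not in ALLOWED_TYPES:
                problems.append(f"{model}.{name}: unknown type {spec['type']!r}")
            elif not isinstance(spec["range"], list) or not spec["range"]:
                problems.append(f"{model}.{name}: 'range' must be a non-empty list")
    return problems


good = {
    "IsolationForestOD": {
        "n_estimators": {"range": [10, 50], "type": "discrete"},
        "max_features": 0.5,
    }
}
bad = {"IsolationForestOD": {"n_estimators": {"range": [10, 50], "type": "integer"}}}

good_report = search_space_problems(good)
bad_report = search_space_problems(bad)
```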
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; this limit may be exceeded slightly.
- If None: AutoML automatically determines when enough HPO trials have been completed.
- If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
- If a dict: specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Algorithms missing from the dictionary default to None.
Default value (if not previously set): None
-
search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD Default value (if not previously set):
'HyperGD'
-
- fit ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X). The final model fit is conducted on the full dataset.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune
- -1 for an unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
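The rules for the contamination parameter of fit can be checked before calling it. The function below is a hedged sketch rather than the library's validation: check_contamination is a hypothetical name, it raises the built-in ValueError instead of an AutoMLx exception, and treating the 0.0 and 0.5 bounds as inclusive is an assumption.

```python
from typing import Optional


def check_contamination(contamination: Optional[float], has_y_valid: bool) -> None:
    """Raise ValueError if contamination conflicts with the documented rules."""
    if contamination is None:
        return  # unsupervised anomaly detection: nothing to check
    if not has_y_valid:
        raise ValueError("contamination requires y_valid (supervised anomaly detection)")
    if not 0.0 <= contamination <= 0.5:
        raise ValueError("contamination must be between 0.0 and 0.5")


check_contamination(None, has_y_valid=False)  # unsupervised: accepted
check_contamination(0.1, has_y_valid=True)    # supervised, 10% anomalies: accepted

try:
    check_contamination(0.1, has_y_valid=False)
    unsupervised_error = ""
except ValueError as exc:
    unsupervised_error = str(exc)
```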
- predict ( self , X )
-
Predict labels for features (X).
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
-
AutoMLxRuntimeError – If there are no predictions after calling the selected model on the given dataset
-
- Returns :
-
y_pred – The predicted values.
- Return type :
-
numpy.ndarray of shape (n_samples,)
- predict_proba ( self , X )
-
Probability estimates.
- Parameters :
-
X ( pandas.DataFrame ) – Prediction dataset features
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet.
-
AutoMLxRuntimeError – If there are no predictions after calling the model on the given dataset.
-
- Returns :
-
y_pred – The predicted probabilities.
- Return type :
-
numpy.ndarray of shape = (n_samples, n_classes)
- score ( self , X , y )
-
Score of this pipeline for a given set of features (X) and labels (y). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- Returns :
-
score – Score of self.predict(X) with respect to y.
- Return type :
- transform ( self , X , y = None )
-
Apply automatic preprocessing to a given set of features (X) and labels (y).
- Parameters :
-
-
X ( pandas.DataFrame ) – Dataset features
-
y ( pandas.DataFrame , pandas.Series or None , default=None ) – Dataset target
-
- Returns :
-
-
X ( pandas.DataFrame ) – Transformed dataset features
-
y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- refit ( self , X , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tune are reused. fit must have been called before calling this method. If a validation set is provided, it is concatenated with the training set before the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- train ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )
-
Automatically identifies the most relevant model and hyperparameters for the given set of features (X) and target (y). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'. If not None, it manually specifies the type of every dataset feature.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune
- -1 for an unconstrained time budget: best-effort mode is enabled and optimization continues until convergence.
-
-
contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).
-
- Raises :
-
AutoMLxValueError – If contamination has been provided for unsupervised AD
- Returns :
-
self
- Return type :
-
AutoMLPipeline
AutoForecaster
- class AutoForecaster
-
Forecasting AutoMLPipeline
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best model.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None , time_series_period = None )
-
Configure the AutoForecaster
If an argument is set to None, its current value is not changed; if it has not been set previously, the default value is used.
- Parameters :
-
-
score_metric ( str , callable , tuple , list or None , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
- If None: the metric is determined automatically depending on the task. Default score metric: neg_sym_mean_abs_percent_error
- If a list: should be a list of str, callable or tuple. The first score metric in the list is the one the pipeline optimizes.
- If a callable: a score function (or loss function) with signature score_func(model, X, y).
- If a tuple: should be a tuple of two values with types (str, callable). The string is the name of the scoring metric, and the callable should have the same signature as above.
- If a string: the scoring metric is inferred from the string. Supported metrics per task:
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
-
-
random_state ( int , or None , default=None ) – Random seed used by AutoML. Suggested default:
7
-
n_algos_tuned ( int , or None , default=None ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection, set n_algos_tuned = len(model_list).
Suggested default: 1
-
-
model_list ( List [ str ] , or None , default=None ) –
Models that will be evaluated by the Pipeline. Users can specify built-in models by name (by default, all supported built-in models for a given task are used).
-
All models except the VARMAX and DynFactor models are applicable when there is a single time series in y.
- If you have multiple time series in y that you want to predict as a system, the multi-target forecasting models VARMAX and DynFactor may be used.
- When you have features or exogenous regressors that are known in advance for your forecast period, pass them into X.
Supported built-in models per task:
forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster, ExtraTreesForecaster, XGBForecaster, LGBMForecaster
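The guidance on single versus multiple target series can be turned into a small filter for building model_list. This is an illustrative sketch only: candidate_models is a hypothetical helper, and returning just the two multi-target models for a system of series reflects the models named above, not a restriction confirmed by the library.

```python
from typing import List

FORECASTING_MODELS = [
    "NaiveForecaster", "ThetaForecaster", "ExpSmoothForecaster", "ETSForecaster",
    "STLwESForecaster", "STLwARIMAForecaster", "SARIMAXForecaster",
    "VARMAXForecaster", "DynFactorForecaster", "ExtraTreesForecaster",
    "XGBForecaster", "LGBMForecaster",
]
MULTI_TARGET_MODELS = {"VARMAXForecaster", "DynFactorForecaster"}


def candidate_models(n_target_series: int) -> List[str]:
    """Built-in model names to consider for the given number of series in y."""
    if n_target_series < 1:
        raise ValueError("y must contain at least one time series")
    if n_target_series == 1:
        # Single series: everything except the multi-target models applies.
        return [m for m in FORECASTING_MODELS if m not in MULTI_TARGET_MODELS]
    # Several series forecast as a system: the models named for that case.
    return [m for m in FORECASTING_MODELS if m in MULTI_TARGET_MODELS]


single = candidate_models(1)
system = candidate_models(3)
```

Either list could then be passed as model_list when configuring the forecaster.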
-
-
optimization ( int , or None , default=None ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 3: Optimized for speed and accuracy
-
Level 10: Optimized for speed
Suggested default:
3
-
-
preprocessing ( bool , or None , default=None ) – Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users. Most of the preprocessing cannot be turned off for the forecasting task. Suggested default: True
-
search_space ( dict , or None , default=None ) –
This parameter defines the search space for model tuning. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. We support 4 types of Key values:
-
- Type 1: the search space key values must have two parameters: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tuning search space for ETSForecaster:
    search_space = {
        'ETSForecaster': {
            'error': {'range': ['add', 'mul'], 'type': 'categorical'},
            'damped_trend': {'range': [True, False], 'type': 'categorical'}
        }
    }
-
-
- Type 2: fixed key values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for ETSForecaster:
    search_space = {
        'ETSForecaster': {
            'error': 'add',
            'damped_trend': True
        }
    }
-
- Type 3: if the search space of a model is an empty dictionary, then Model Tune is disabled for that model.
- Type 4: if a key value contains a mixed configuration, some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'ETSForecaster': {
            'error': 'add',
            'damped_trend': {'range': [True, False], 'type': 'categorical'}
        }
    }
- To disable Model Tune for all models, set search_space = {}.
- If None, the default search space defined inside AutoML is used.
- If all the hyperparameters are fixed for a model, the hyperparameter optimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.
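The four kinds of search space entries can be told apart mechanically. The following is a minimal sketch, assuming that any value which is a dict containing both 'range' and 'type' is a tunable entry and that every other value is fixed. The helper split_search_space is hypothetical and is not part of AutoMLx; it only illustrates how a dictionary would be read under the rules above.

```python
# Hypothetical helper, not an AutoMLx function: classifies each
# hyperparameter of a search_space dictionary as fixed or tuned.

def split_search_space(search_space):
    """Return {model: {"fixed": {...}, "tuned": {...}, "tuning_enabled": bool}}."""
    summary = {}
    for model, params in search_space.items():
        fixed, tuned = {}, {}
        for name, spec in params.items():
            # Type 1 entries are dicts carrying both 'range' and 'type'.
            if isinstance(spec, dict) and {"range", "type"}.issubset(spec):
                tuned[name] = spec
            else:
                # Anything else is treated as a fixed value (Type 2).
                fixed[name] = spec
        summary[model] = {
            "fixed": fixed,
            "tuned": tuned,
            # Nothing left to tune: empty dict (Type 3) or all values fixed.
            "tuning_enabled": bool(tuned),
        }
    return summary


search_space = {
    # Type 4: one fixed hyperparameter, one with a search space.
    "ETSForecaster": {
        "error": "add",
        "damped_trend": {"range": [True, False], "type": "categorical"},
    },
    # Type 3: empty dictionary, tuning disabled for this model.
    "ThetaForecaster": {},
}

summary = split_search_space(search_space)
print(summary["ETSForecaster"]["fixed"])
print(summary["ThetaForecaster"]["tuning_enabled"])
```

A dictionary built this way would be passed as the search_space argument; the sketch itself never calls the library.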
-
max_tuning_trials ( int , dict or None , default=None ) –
The maximum number of HPO trials; it may be exceeded slightly.
- If None: AutoML automatically determines when enough HPO trials have been completed.
- If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
- If a dict: the parameter is specified per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}. Missing values in the dictionary default to None.
Default value (if not previously set):
None
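To make the three accepted forms concrete, here is a small sketch of how the nominal total number of trials follows from max_tuning_trials. The function total_trial_cap is a hypothetical illustration, not an AutoMLx API, and the value it returns is only nominal because the documented limit may be exceeded slightly.

```python
def total_trial_cap(max_tuning_trials, tuned_algos):
    """Nominal upper bound on HPO trials over all tuned algorithms.

    Returns None when no fixed bound applies. Hypothetical helper that
    mirrors the documented rules; it is not part of AutoMLx.
    """
    if max_tuning_trials is None:
        # AutoML decides on its own when enough trials have been run.
        return None
    if isinstance(max_tuning_trials, int):
        # The integer applies to each tuned algorithm separately.
        return max_tuning_trials * len(tuned_algos)
    # dict form: algorithms missing from the dict default to None.
    caps = [max_tuning_trials.get(name) for name in tuned_algos]
    if any(cap is None for cap in caps):
        return None
    return sum(caps)


algos = ["ETSForecaster", "ThetaForecaster"]
print(total_trial_cap(50, algos))
print(total_trial_cap({"ETSForecaster": 100}, algos))
print(total_trial_cap({"ETSForecaster": 100, "ThetaForecaster": 200}, algos))
```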
-
search_strategy ( str ) – The search strategy used in model tuning. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD Suggested default:
'HyperGD'
-
time_series_period ( int or None , default=None ) – The seasonality period to force-fit the time series at, regardless of whether it is detected in the data. If None, AutoML guesses the seasonality by inspecting the training data. Users can set this parameter to override that guess.
-
- fit ( self , y , X = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant features, model and hyperparameters for the given training data (X) and target (y). The final model fit is conducted on the full dataset.
- Parameters :
-
-
y ( pandas.DataFrame ) – Training dataset target.
-
X ( pandas.DataFrame or None , default=None ) – A dataframe of explanatory variables that support the target timeseries in y. These must be known in advance for both the forecast period and the training period.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
- None: uses X_valid and y_valid for validation
- 'auto': uses 5 folds if the number of instances < 1M, disables cv folds otherwise
- integer: specifies the number of folds in a (Stratified)KFold
- iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
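The rules for cv can be read as a small decision procedure. The sketch below encodes them as stated in the list above; resolve_cv and its is_classifier flag are hypothetical names introduced for illustration, and the returned strings are descriptions rather than AutoMLx objects.

```python
def resolve_cv(cv, n_instances, is_classifier=False):
    """Describe the validation scheme implied by `cv` (illustrative only).

    `is_classifier` stands for "a classifier whose y is binary or
    multiclass"; it is an assumption of this sketch.
    """
    if cv is None:
        return "holdout (X_valid, y_valid)"
    if isinstance(cv, str) and cv == "auto":
        if n_instances >= 1_000_000:
            return "cv disabled"
        cv = 5  # documented default below one million instances
    if isinstance(cv, int):
        splitter = "StratifiedKFold" if is_classifier else "KFold"
        return f"{splitter} with {cv} folds"
    # Any other iterable is expected to yield (train, test) index arrays.
    return "user-provided (train, test) index splits"


print(resolve_cv("auto", n_instances=10_000))
print(resolve_cv("auto", n_instances=2_000_000))
print(resolve_cv(3, n_instances=500, is_classifier=True))
```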
-
col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'.
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune
- -1: unconstrained time budget; best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- predict ( self , X )
-
Predict the target for the time steps in X. For a simpler API to predict the target only for time steps in the future, use forecast.
- Parameters :
-
X ( pandas.DataFrame ) – A dataframe of explanatory variables that support the target timeseries in y. Predictions will be given for the time steps in X.
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
-
AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset
-
AutoMLxRuntimeError – If the result of the time series numerical inverse transform is None
-
- Returns :
-
y_pred – A data frame containing the predicted values.
- Return type :
- forecast ( self , periods , alpha = 0.05 , X = None )
-
Forecast future values of the target.
- Parameters :
-
-
periods ( int ) – The number of time steps to forecast from the end of the sample.
-
alpha ( float , default=0.05 ) – The significance level. For a 95% prediction interval, set alpha to 0.05.
-
X ( pandas.DataFrame , or None , default=None ) – A dataframe of explanatory variables that support the forecast for periods number of timestamps. The index should begin immediately after the last index in y (as provided to fit). The columns must match the ones used in fit.
-
- Returns :
-
A dataframe with three columns, prediction, ci_lower and ci_upper, where the bounds are those of the confidence interval (CI) at the level given by alpha. Note: the CI columns are excluded for models that do not support intervals.
- Return type :
-
pandas.DataFrame
- Raises :
-
-
AutoMLxNotFittedError – If the pipeline is not fitted yet.
-
AutoMLxValueError – If the explanatory variables are not provided, are not complete, or their length does not equal the requested number of periods.
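Two details of forecast are easy to get wrong: the index of X must continue directly from the last index in y, and alpha is a significance level rather than a coverage. The sketch below illustrates both with the standard library only. It assumes a daily series, and future_index and interval_coverage are hypothetical helpers, not AutoMLx functions.

```python
from datetime import date, timedelta


def future_index(last_timestamp, periods, step=timedelta(days=1)):
    """Timestamps of the `periods` steps that follow `last_timestamp`.

    The daily `step` is an assumption of this sketch; use the actual
    frequency of the training series.
    """
    return [last_timestamp + step * (i + 1) for i in range(periods)]


def interval_coverage(alpha):
    """Nominal coverage of the interval for a given significance level."""
    return 1.0 - alpha


# If the last index in y was 2024-01-31, a 3-period forecast with
# exogenous variables needs X rows for these three dates.
idx = future_index(date(2024, 1, 31), periods=3)
print(idx)
print(interval_coverage(0.05))
```

The length of the generated index equals periods, which matches the requirement that X has exactly as many rows as requested periods.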
-
- plot_forecast ( self , predictions , show_y = True , show_pi = True , additional_frames = None )
-
Plot the forecasts.
- Parameters :
-
-
predictions ( pd.DataFrame ) – A dataframe containing columns mean, pi_lower (optional) and pi_upper (optional)
-
show_y ( bool , default=True ) – If True, plots training series y
-
show_pi ( bool , default=True ) – If True, plots Prediction Intervals (PI) when available
-
additional_frames ( dictionary of pd.DataFrame , optional ) – Plots the dataframes on the same axes, e.g., additional_frames = {'label1': dataframe1, 'label2': dataframe2}
-
- Return type :
-
A plotly figure.
- Raises :
-
AutoMLxValueError – If predictions column names are incorrect.
- score ( self , X , y )
-
Score of this pipeline for a given set of features (X) and labels (y). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.
- Parameters :
-
-
X ( pd.DataFrame ) – Training dataset features
-
y ( pd.DataFrame , pd.Series ) – Training dataset target
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted yet
- Returns :
-
score – Score of self.predict(X) with respect to y.
- Return type :
- transform ( self , X , y )
-
Apply automatic preprocessing to a given set of features (X) and labels (y).
- Parameters :
-
-
X ( pandas.DataFrame or None ) – Dataset features
-
y ( pandas.DataFrame , pandas.Series or None ) – Dataset timeseries
-
- Raises :
-
AutoMLxNotFittedError – If the pipeline is not fitted.
- Returns :
-
Transformed dataset features, transformed dataset timeseries
- Return type :
-
(pd.DataFrame or None, pd.DataFrame or pd.Series or None)
- refit ( self , X , y , X_valid = None , y_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection and Model Tune are re-used. fit must have been called before calling this method. If a validation set is provided, it will be concatenated with the training set before doing the refit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
- train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )
-
Automatically identifies the most relevant model and hyperparameters for the given set of features (X) and target (y). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target.
-
X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target
-
cv ( int , str or None , default='auto' ) –
Determines the cross-validation split. Possible inputs for cv are:
- None: uses X_valid and y_valid for validation
- 'auto': uses 5 folds if the number of instances < 1M, disables cv folds otherwise
- integer: specifies the number of folds in a (Stratified)KFold
- iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
-
col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: 'categorical', 'numerical', 'text', 'datetime', and 'timedelta'. If not None, it manually specifies the type of every dataset feature.
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune, FeatureSelection, AdaptiveSampling, ThresholdTuning
- -1: unconstrained time budget; best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
AutoMLPipeline
AutoRecommender
- class AutoRecommender
-
Recommender System AutoMLPipeline
- ranked_models_
-
List of model names ranked in order of their quality from the last fit call.
- Type :
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type :
- pipelines_
-
Sorted list of pipelines (length equal to n_algos_tuned), with the 0th element being the best pipeline.
- Type :
- completed_trials_summary_
-
All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.
- Type :
- completed_trials_detailed_
-
A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.
- Type :
- configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )
-
Configure the AutoRecommender
If an argument is set to None, its current value is left unchanged; if it was never set, the default value is used.
- Parameters :
-
score_metric –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
- If None: it will be determined automatically depending on the task. Default score metric (if not previously set): recommendation: hit_rate
- If a string: automatically infers the scoring metric from the string.
- Available score metrics:
-
- hits:
Computes the number of relevant recommendations made at K, i.e., the top K recommendations made by the model that matched an actual interaction of a user.
- hit_rate:
Computes Hit Rate at K as the number of users to which at least one relevant item was correctly recommended, divided by the total number of users.
- precision:
Computes Precision at K, a measure of how many of the top K recommended items are in the set of true relevant items for all users, without taking the order of the items into account.
\[precision@K={\frac{1}{U}}{\sum_{i=1}^{U}{\sum_{j=1}^{K}{\frac{rel_i(r_j)}{K}}}}\]
- recall:
Computes Recall at K, a measure of the fraction of relevant items recommended in the top K out of all relevant items, without taking the order of the items into account.
\[recall@K={\frac{1}{U}}{\sum_{i=1}^{U}{\sum_{j=1}^{K}{\frac{rel_i(r_j)}{Q_i}}}}\]
- map:
Computes Mean Average Precision at K as the sum of the average precision of every user, divided by the number of users. The MAP is meant to calculate average precision for the relevant items in the test set, so it is normalized by the cutoff K, or by the number of interactions for users with fewer than K interactions in the test set. The average precision can be defined as the sum, over every k with 1 <= k <= K, of the precision at k multiplied by the delta recall.
\[MAP@K={\frac{1}{U}} {\sum_{i=1}^{U} {\frac{AP@K(i)}{\min(Q_i, K)}}}\]
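The formulas above can be checked on a toy example. The sketch below is a plain-Python reading of them, not the AutoMLx implementation. It assumes that U is the number of users, that rel_i(r_j) is 1 when the j-th recommendation for user i is relevant and 0 otherwise, that Q_i is the number of relevant items of user i, and that AP@K(i) is the unnormalized sum of the precision at every rank that holds a relevant item. The function name ranking_metrics and its input layout are hypothetical.

```python
def ranking_metrics(recommended, relevant, k):
    """Compute hits, hit_rate, precision, recall and map at K.

    recommended: {user: ranked list of item ids}
    relevant:    {user: set of held-out relevant item ids}, non-empty dict
    """
    n_users = len(relevant)
    hits = users_hit = 0
    precision = recall = mean_ap = 0.0
    for user, truth in relevant.items():
        if not truth:
            continue  # a user without relevant items contributes zero
        top_k = recommended.get(user, [])[:k]
        # rel[j] is 1 when the item at rank j + 1 is relevant, else 0.
        rel = [1 if item in truth else 0 for item in top_k]
        user_hits = sum(rel)
        hits += user_hits
        users_hit += 1 if user_hits > 0 else 0
        precision += user_hits / k
        recall += user_hits / len(truth)
        # Sum the precision at each rank holding a relevant item,
        # then normalize by min(Q_i, K).
        running, ap = 0, 0.0
        for rank, r in enumerate(rel, start=1):
            running += r
            if r:
                ap += running / rank
        mean_ap += ap / min(len(truth), k)
    return {
        "hits": hits,
        "hit_rate": users_hit / n_users,
        "precision": precision / n_users,
        "recall": recall / n_users,
        "map": mean_ap / n_users,
    }


recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "c"}, "u2": {"x"}}
metrics = ranking_metrics(recommended, relevant, k=3)
print(metrics)
```

In this example only u1 receives relevant items (ranks 1 and 3), so hit_rate is 0.5 and precision is (2/3) / 2 = 1/3.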
-
- random_state int or None, default=None
-
Random seed used by AutoML. Default value (if not previously set):
7
- n_algos_tuned int or None, default=None
-
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
To disable algorithm selection, set n_algos_tuned = len(model_list).
Default value (if not previously set):
1
-
- model_list List[str] or None, default=None
-
Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Supported built-in models per task:
recommendation – AlsRecommender, ItemKNNRecommender, BprRecommender, TRexxRecommender
- Available models:
- AlsRecommender: Alternating Least Squares (ALS) is a collaborative filtering recommendation algorithm based on matrix factorization. reference:
- BprRecommender: Bayesian Personalized Ranking (BPR) computes each user's item rankings using a maximum posterior estimator. reference:
- ItemKNNRecommender: ItemKNN is a model that internally computes an item-item similarity matrix based on observed co-interactions from users. To produce recommendations, it uses the user interaction history and combines the item vectors of each interacted item to find similar items. If weights were originally put on interactions, they are scaled with the item reciprocal ranks. reference: https://dl.acm.org/doi/10.1145/963770.963776
- TRexxRecommender: T-Rexx is a deep learning model that provides sequence-aware recommendations. It extracts users' preferences using a multi-head self-attention mechanism. Learned user and item embeddings are combined into predictions via a sampled softmax. It is a hybrid between the SDM model and the SASRec model. reference:
-
-
- optimization int or None, default=None
-
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
- Level 0: Optimized for reproducibility (controls most randomness)
- Level 3: Optimized for speed and accuracy
- Level 10: Optimized for speed
Default value (if not previously set):
3
-
- preprocessing bool or None, default=None
-
Not supported for AutoRecommender. Has no effect on this class.
- search_space dict or None, default=None
-
This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. We support 4 types of Key values:
-
- Type 1: the search space key values must have two parameters: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tuning search space for TRexxRecommender:
    search_space = {
        'TRexxRecommender': {
            'dropout_rate': {'range': [0.03125, 0.5], 'type': 'continuous'},
            'optimizer_name': {'range': ['lazyadam', 'adam'], 'type': 'categorical'},
            'dnn_activation': {'range': ['tanh', 'relu'], 'type': 'categorical'}
        }
    }
-
-
- Type 2: fixed key values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for TRexxRecommender:
    search_space = {
        'TRexxRecommender': {
            'dropout_rate': 0.5,
            'optimizer_name': 'adam'
        }
    }
-
- Type 3: if the search space of a model is an empty dictionary, then Model Tune is disabled for that model.
- Type 4: if a key value contains a mixed configuration, some hyperparameters are fixed and others have a search space. For example:
    search_space = {
        'TRexxRecommender': {
            'dropout_rate': 0.5,
            'optimizer_name': 'adam',
            'dnn_activation': {'range': ['tanh', 'relu'], 'type': 'categorical'}
        }
    }
- To disable Model Tune for all models, set search_space = {}.
- If None, the default search space defined inside AutoML is used.
- If all the hyperparameters are fixed for a model, the tuning step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.
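Because a malformed entry (for example a missing 'type' key) is easy to write by hand, it can help to check a search space before passing it on. The following is a minimal sketch of such a check. validate_search_space and ALLOWED_TYPES are hypothetical names, and the rules are only those stated above; the library may accept or reject inputs differently.

```python
ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def validate_search_space(search_space):
    """Return a list of readable problems; empty when none are found."""
    problems = []
    for model, params in search_space.items():
        if not isinstance(params, dict):
            problems.append(f"{model}: expected a dict of hyperparameters")
            continue
        for name, spec in params.items():
            if not isinstance(spec, dict):
                continue  # fixed value (Type 2), nothing to check
            missing = {"range", "type"} - set(spec)
            if missing:
                problems.append(f"{model}.{name}: missing {sorted(missing)}")
            elif spec["type"] not in ALLOWED_TYPES:
                problems.append(f"{model}.{name}: unknown type {spec['type']!r}")
            elif not isinstance(spec["range"], list) or not spec["range"]:
                problems.append(f"{model}.{name}: 'range' must be a non-empty list")
    return problems


good = {
    "TRexxRecommender": {
        "dropout_rate": 0.5,
        "dnn_activation": {"range": ["tanh", "relu"], "type": "categorical"},
    }
}
bad = {"TRexxRecommender": {"dropout_rate": {"range": [0.03125, 0.5]}}}

print(validate_search_space(good))
print(validate_search_space(bad))
```

One limitation of this sketch: a fixed hyperparameter whose value is itself a dictionary would be reported as a malformed Type 1 entry.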
- max_tuning_trials int, dict or None, default=None
-
The maximum number of HPO trials; it may be exceeded slightly.
- If None: AutoML automatically determines when enough HPO trials have been completed.
- If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2, then up to 2 * max_tuning_trials trials are performed in total.
- If a dict: the parameter is specified per algorithm, e.g., {'AlsRecommender': 100, 'ItemKNNRecommender': 200}. Missing values in the dictionary default to None.
Default value (if not previously set): None
- search_strategy str or None, default=None
-
The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD Default value (if not previously set):
'HyperGD'
- Raises :
-
AutoMLxValueError – If preprocessing arg is given a value
- fit ( self , data , col_types , data_valid = None , time_budget = - 1 )
-
Automatically identifies the optimal model and hyperparameters for the given training data (data). The final model fit is conducted on the full dataset.
- Parameters :
-
-
data ( pandas.DataFrame ) – Training dataset.
-
col_types ( dict or list ) – Dict or list with string values indicating the type of the features of the dataset or their role.
- For a dict:
    - Mandatory values that correspond to column name keys:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
      Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject"
          }
      where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.
    - Additional columns can be added to indicate their type. Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject",
              "rating": "numerical"
          }
- For a list:
    - Mandatory values:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
    - The values of the list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns, and every other column expects a type value. The length of the list must equal the number of columns of the training data DataFrame.
      Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed with a dataframe that has the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.
- Supported types are:
    - "categorical": for columns to be interpreted as categoricals regardless of their data type.
    - "numerical": for int, float and double types.
    - "text": for str values that consist of multiple words.
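The dict and list forms of col_types carry the same information. The sketch below converts either form into a column-to-role mapping and checks that the two mandatory roles are present. col_types_to_dict is a hypothetical helper, and the requirement that each mandatory role appears exactly once is an assumption of this sketch rather than a documented rule.

```python
MANDATORY_ROLES = ("recommendation_subject", "recommendation")


def col_types_to_dict(columns, col_types):
    """Pair each column with its declared type or role.

    columns:   column names of the training dataframe, in order
    col_types: dict keyed by column name, or list aligned with `columns`
    """
    if isinstance(col_types, dict):
        mapping = dict(col_types)
    else:
        if len(col_types) != len(columns):
            raise ValueError("col_types list must have one entry per column")
        mapping = dict(zip(columns, col_types))
    roles = list(mapping.values())
    for role in MANDATORY_ROLES:
        # Assumption: each mandatory role marks exactly one column.
        if roles.count(role) != 1:
            raise ValueError(f"exactly one column must be marked {role!r}")
    return mapping


columns = ["user_id", "movie_id", "rating"]
as_list = ["recommendation_subject", "recommendation", "numerical"]
as_dict = {"movie_id": "recommendation", "user_id": "recommendation_subject"}

print(col_types_to_dict(columns, as_list))
print(col_types_to_dict(columns, as_dict))
```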
-
data_valid ( pandas.DataFrame or None , default=None ) – Validation dataset.
-
time_budget ( Dict [ str , float ] , float , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune
- -1: unconstrained time budget; best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
automlx._interface.automl_pipeline.AutoMLPipeline
- predict ( self , subjects , new_data = None , n_recommendations = 10 , repeat_recommendations = False )
-
Predict labels for given subjects.
- Parameters :
-
-
subjects ( pandas.DataFrame ) – Ids of the subjects to generate predictions for.
-
new_data ( pandas.DataFrame ) – Additional new context to be considered in the subject predictions.
-
n_recommendations ( int ) – For each subject id, n_recommendations are predicted.
-
repeat_recommendations ( bool ) – Enables/disables the repetition of predictions.
-
- Returns :
-
A data frame containing two columns: the subject ids along with the predicted subject recommendations.
- Return type :
- recommend ( self , subjects , new_data = None , n_recommendations = 10 , repeat_recommendations = False )
-
Recommend labels for given subjects.
- Parameters :
-
-
subjects ( pandas.DataFrame ) – Ids of the subjects to generate recommendations for.
-
new_data ( pandas.DataFrame ) – Additional new context to be considered in the subject recommendations.
-
n_recommendations ( int ) – For each subject id, n_recommendations are predicted.
-
repeat_recommendations ( bool ) – Enables/disables the repetition of recommendations.
-
- Raises :
-
AutoMLxNotImplementedError – If new_data is provided, or if repeat_recommendations is passed as True.
- Returns :
-
A data frame containing two columns: the subject ids along with the predicted subject recommendations.
- Return type :
- score ( self , data , score_metric = None , n_recommendations = 10 )
-
Score of this pipeline for a given dataset (data).
- train_test_split ( data , col_types )
-
Split the given dataset in two by using the leave-one-last-split approach.
For each recommendation subject, the split places a fraction of its last interactions, in chronological order, in the test set and leaves the remaining interactions in the train set.
- Parameters :
-
-
data ( pandas.DataFrame ) – Dataset to split in AutoRecommender pipeline.
-
col_types ( dict or list ) – Dict or list with string values indicating the type of the features of the dataset or their role.
- For a dict:
    - Mandatory values that correspond to column name keys:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
      Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject"
          }
      where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.
    - Additional columns can be added to indicate their type. Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject",
              "rating": "numerical"
          }
- For a list:
    - Mandatory values:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
    - The values of the list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns, and every other column expects a type value. The length of the list must equal the number of columns of the training data DataFrame.
      Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed with a dataframe that has the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.
- Supported types are:
    - "categorical": for columns to be interpreted as categoricals regardless of their data type.
    - "numerical": for int, float and double types.
    - "text": for str values that consist of multiple words.
-
- Raises :
-
AutoMLxValueError – If timestamp is passed as None
- Returns :
-
Two dataframes, train and test, indexed by timestamp.
- Return type :
-
Tuple[pd.DataFrame, pd.DataFrame]
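The leave-one-last idea can be illustrated without the library. The sketch below works on plain (subject, item, timestamp) tuples and moves the n_last most recent interactions of each subject to the test set. It is a simplified stand-in, not the AutoMLx implementation: the function leave_last_split, the tuple layout and the fixed count n_last (instead of a fraction) are all assumptions.

```python
from collections import defaultdict


def leave_last_split(interactions, n_last=1):
    """Split (subject, item, timestamp) tuples into train and test lists.

    The `n_last` most recent interactions of every subject go to the
    test list; all earlier ones stay in the train list.
    """
    by_subject = defaultdict(list)
    for row in interactions:
        by_subject[row[0]].append(row)
    train, test = [], []
    for rows in by_subject.values():
        rows.sort(key=lambda r: r[2])  # chronological order
        train.extend(rows[:-n_last])
        test.extend(rows[-n_last:])
    return train, test


interactions = [
    ("u1", "a", 1),
    ("u1", "b", 3),
    ("u1", "c", 2),
    ("u2", "d", 5),
]
train, test = leave_last_split(interactions)
print(train)
print(test)
```

A subject with a single interaction, such as u2 here, ends up only in the test list, which is one reason a real implementation may treat such subjects specially.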
- refit ( self , data , data_valid = None )
-
Refit a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection and Model Tune are re-used. fit must have been called before calling this method. If a validation set is provided, it will be concatenated with the training set before doing the refit.
- Parameters :
-
-
data ( pandas.DataFrame ) – Training dataset.
-
data_valid ( pandas.DataFrame or None , default=None ) – Validation dataset
-
- Returns :
-
self
- Return type :
-
automlx._interface.automl_pipeline.AutoMLPipeline
- train ( self , data , col_types , data_valid = None , time_budget = - 1 )
-
Automatically identifies the optimal model and hyperparameters for the given dataset (data). Does not conduct the final model fit. If the latter is desired, use fit.
- Parameters :
-
-
data ( pandas.DataFrame ) – Training dataset.
-
col_types ( dict or list ) – Dict or list with string values indicating the type of the features of the dataset or their role.
- For a dict:
    - Mandatory values that correspond to column name keys:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
      Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject"
          }
      where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.
    - Additional columns can be added to indicate their type. Example:
          col_types = {
              "movie_id": "recommendation",
              "user_id": "recommendation_subject",
              "rating": "numerical"
          }
- For a list:
    - Mandatory values:
        - "recommendation_subject": indicates the column with the ids of the subjects that receive recommendations.
        - "recommendation": indicates the column with the ids of the recommendations.
    - The values of the list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns, and every other column expects a type value. The length of the list must equal the number of columns of the training data DataFrame.
      Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed with a dataframe that has the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.
- Supported types are:
    - "categorical": for columns to be interpreted as categoricals regardless of their data type.
    - "numerical": for int, float and double types.
    - "text": for str values that consist of multiple words.
-
data_valid ( pandas.DataFrame or None , default=None ) – Validation dataset.
-
time_budget ( Dict [ str , float ] , float or None , default=-1 ) –
- If float: time budget in seconds.
- If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection, ModelTune
- -1: unconstrained time budget; best effort mode is enabled and optimization continues until convergence.
-
-
- Returns :
-
self
- Return type :
-
automlx._interface.automl_pipeline.AutoMLPipeline