AutoML
The AutoMLx Python package automatically creates, optimizes, and explains machine learning pipelines and models. The AutoML pipeline provides a tuned ML pipeline that best models the given training dataset and prediction task. AutoML has a simple pipeline-level Python API that quickly jump-starts the data science process with an accurate, tuned model. AutoML supports the following tasks:
Supervised classification or regression on a tabular dataset, where the target is a binary or multi-class value (classification) or a real-valued column in a table (regression).
Unsupervised anomaly detection, where the target or the labels are not provided.
Univariate and multivariate time series forecasting.
The AutoML pipeline consists of five major stages: preprocessing, algorithm selection, adaptive sampling, feature selection, and model tuning.
These stages are combined into a single AutoML pipeline that automatically optimizes the whole pipeline with limited user input or interaction.
Pipeline
- class automl.Pipeline ( task = 'classification' , score_metric = None , random_state = 7 , n_algos_tuned = 1 , model_list = [] , adaptive_sampling = True , min_features = 1 , optimization = 3 , preprocessing = True , search_space = None , time_series_period = None , min_class_instances = 5 , max_tuning_trials = None , threshold_tuning = False )
-
Automatic Machine Learning Pipeline object that uses metalearning to quickly identify the most relevant features, model, and hyperparameters for a given training dataset.
Warning
The following interfaces are deprecated and will be replaced by
completed_trials_summary_
and completed_trials_detailed_
in version 23.3.0:
-
model_selection_trials_
-
adaptive_sampling_trials_
-
feature_selection_trials_
-
tuning_trials_
-
all_trials_
-
all_trials_extra_scores_
Warning
Optimization levels 1 and 2 are deprecated and will behave the same as level 3 in version 23.3.0.
Warning
The
selected_features_
attribute is deprecated and will be removed in version 23.3.0. To inspect selected features, use selected_features_names_
and selected_features_names_raw_
instead.
- Parameters
-
-
task (str, optional ) – Machine learning task. Supported values: classification, regression, anomaly_detection, forecasting. Defaults to
'classification'
. -
score_metric (str, callable, tuple, list, optional ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If
None
: it will be determined automatically depending on the task. Default score metrics : binary: neg_log_loss, multiclass: neg_log_loss, continuous: neg_mean_squared_error, continuous_forecast: neg_sym_mean_abs_percent_error, unsupervised: unsupervised_n-1_experts -
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature
score_func(model, X, y)
. -
If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.
-
If a string: automatically infers the scoring metric from the string:
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, recall_macro, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
unsupervised – unsupervised_n-1_experts, unsupervised_unify95, unsupervised_unify95_log_loss
More information on scoring metrics can be found here: Classification metrics, Regression metrics.
Note: Scoring variations like
recall_macro
are equivalent to sklearn.metrics.recall_score(..., average="macro").
Defaults to
None
. -
-
random_state (int, optional ) – Random seed used by AutoML. Defaults to
7
. -
model_list (List[str], optional ) –
Models that will be evaluated by the Pipeline. (By default, all supported models for a given task are used, other than TabNetClassifier, AdaBoostClassifier, KNeighborsClassifier, and LinearSVC. To enable one of these models, add them to the model list.) Supported models per task:
-
classification:
'AdaBoostClassifier'
,'DecisionTreeClassifier'
,'ExtraTreesClassifier'
,'TorchMLPClassifier'
,'KNeighborsClassifier'
,'LGBMClassifier'
,'LinearSVC'
,'LogisticRegression'
,'RandomForestClassifier'
,'SVC'
,'XGBClassifier'
,'GaussianNB'
,'CatBoostClassifier'
,'TabNetClassifier'
-
regression:
'AdaBoostRegressor'
,'DecisionTreeRegressor'
,'ExtraTreesRegressor'
,'TorchMLPRegressor'
,'KNeighborsRegressor'
,'LGBMRegressor'
,'LinearSVR'
,'LinearRegression'
,'RandomForestRegressor'
,'SVR'
,'XGBRegressor'
-
anomaly_detection:
'IsolationForestOD'
,'SubspaceOD'
,'HistogramOD'
,'ClusteringLocalFactorOD'
,'PrincipalCompOD'
,'MinCovOD'
,'AutoEncoder'
,'KNearestNeighborsOD'
,'OneClassSVMOD'
-
forecasting:
-
'NaiveForecaster'
- Naive and Seasonal Naive method -
'ThetaForecaster'
- Equivalent to Simple Exponential Smoothing (SES) with drift -
'ExpSmoothForecaster'
- Holt-Winters’ damped method -
'STLwESForecaster'
- Seasonal Trend LOESS (locally weighted smoothing) with Exponential Smoothing substructure -
'STLwARIMAForecaster'
- Seasonal Trend LOESS (locally weighted smoothing) with ARIMA substructure -
'SARIMAXForecaster'
- Seasonal Autoregressive Integrated Moving Average with Exogenous Variables -
'ETSForecaster'
- Error, Trend, Seasonality (ETS) Statespace Exponential Smoothing -
'ProphetForecaster'
(optional) - Facebook Prophet with Exogenous Variables. (Available only if installed locally with pip install prophet) -
'VARMAXForecaster'
- Vector AutoRegressive Moving Average with Exogenous Variables -
'DynFactorForecaster'
- Dynamic Factor Models in state-space form with Exogenous Variables
-
-
-
n_algos_tuned (int, optional ) –
Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.
-
To disable algorithm selection set
n_algos_tuned = len(model_list)
.
Defaults to
1
. -
-
adaptive_sampling (bool, optional ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Defaults to
True
. -
min_features (int, float, list, optional ) –
Minimum number of features to keep. Acceptable values:
-
If int,
0 < min_features <= n_features
, the minimum number of features to keep. -
If float,
0 < min_features <= 1.0
, the minimum fraction of features to keep. -
If list, names of features to keep, for example
['a', 'b']
means keep features 'a'
and 'b'
-
To disable feature selection set
min_features = 1.0
Defaults to
1
. -
-
optimization (int, optional ) –
Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.
-
Level 0: Optimized for reproducibility (controls most randomness)
-
Level 1: Undefined (placeholder for future version)
-
Level 2: Faster than Level 0, more reproducible than Level 3
-
Level 3: Optimized for speed and accuracy
Defaults to
3
. -
-
preprocessing (bool, optional ) –
Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.
-
If True, the auto-preprocessor runs on the dataset to normalize data. Categorical features are one-hot encoded if they contain fewer than 5 unique values; otherwise they are label encoded or, if they contain more than 20 percent unique values, they are ignored. Numeric features are normalized to a mean of 0 and a variance of 1 using
sklearn.preprocessing.StandardScaler
. Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features. Datetime columns are automatically expanded into engineered features including day of week, day of month, etc. -
If False, the user must cleanse (and normalize if desired) the dataset before passing data to AutoML. NaNs are not allowed in the dataset and will produce a
ValueError
. AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).
Defaults to
True
. -
-
search_space (dict, optional ) –
This parameter defines the Model Tuning search space. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm: a dictionary that maps each hyperparameter name to a specification with two entries, (1) ‘range’, a list containing the range, and (2) ‘type’, which is one of ‘continuous’, ‘discrete’, or ‘categorical’. For example, to provide a custom tuning search space for LogisticRegression:
    search_space = {
        'LogisticRegression': {
            'C': {'range': [0.03125, 512], 'type': 'continuous'},
            'solver': {
                'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                'type': 'categorical'
            },
            'class_weight': {
                'range': [None, 'balanced'],
                'type': 'categorical'
            }
        }
    }
-
To disable Model Tune for all models set
search_space = {}
-
If a key value is an empty dictionary, then Model Tune is disabled for that key.
-
If
None
, default search space defined inside AutoML is used.
Defaults to
None
. -
-
time_series_period (int, optional ) – The period of the time series for the forecasting task. Defaults to
None
. -
min_class_instances (int, optional ) – The minimum number of instances every class must have when doing classification. If any class has fewer than this number of instances, training is stopped. This argument may take any value of 2 or higher. Defaults to
5
. -
max_tuning_trials (int, dict,
None
) – The maximum number of tuning trials. This limit may be exceeded slightly.
-
If
None
: AutoML automatically determines when enough tuning trials have been completed. -
If an integer: the maximum number of trials for each algorithm. That is, if
n_algos_tuned == 2
, then up to 2 * max_tuning_trials
are performed in total. -
If a
dict
: by passing a dictionary, you can specify this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200}
. Missing values in the dictionary default to None.
Defaults to
None
. -
-
threshold_tuning (bool, optional ) –
Determines whether or not AutoML optimizes the prediction threshold. Threshold tuning is only used in classification tasks. However, unlike classic threshold tuning, it is applied to both binary and multi-class classification tasks.
-
If True, the prediction threshold is optimized based on the provided score metric. If the score metric is not sensitive to the threshold (for example, negative log loss), then f1 macro is used instead. Threshold tuning allows users to post-process classification model predictions to optimize for their custom metric. Threshold tuning is not exported to ONNX models, so the quality of the ONNX model may be lower than that of the original model.
-
If False, no minimum threshold is required for any class, and the class with the largest raw prediction probability is returned as the prediction.
Defaults to
False
-
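The categorical-feature rules described for the preprocessing parameter above can be sketched as a small decision function. This only illustrates the documented thresholds and is not AutoML's implementation: the name categorical_strategy is hypothetical, and the sketch assumes that "20 percent unique values" means the number of distinct values divided by the number of rows.

```python
from typing import Hashable, Sequence


def categorical_strategy(values: Sequence[Hashable]) -> str:
    """Pick an encoding for one categorical column (illustrative only).

    Assumption: "percent unique values" is distinct values / number of rows.
    """
    n_unique = len(set(values))
    if n_unique < 5:
        return "one-hot"  # fewer than 5 unique values
    if n_unique / len(values) > 0.20:
        return "ignore"  # more than 20 percent unique values
    return "label"  # everything else is label encoded


print(categorical_strategy(["red", "green", "red", "blue"]))   # one-hot
print(categorical_strategy([f"id_{i}" for i in range(50)]))     # ignore
print(categorical_strategy([f"c{i % 8}" for i in range(100)]))  # label
```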
-
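As a rough sketch of how the constructor arguments above fit together, the snippet below builds a keyword-argument dictionary that uses the callable and tuple forms of score_metric, a custom search_space, and a per-algorithm max_tuning_trials. The scorer balanced_accuracy and the name pipeline_kwargs are illustrative and not part of AutoML. The final constructor call is left as a comment so that the snippet runs without the package installed.

```python
from typing import Any, Sequence


def balanced_accuracy(model: Any, X: Any, y: Sequence) -> float:
    """Custom metric with the documented signature score_func(model, X, y).

    Returns the mean per-class recall; higher is better.
    """
    y_true = list(y)
    y_pred = list(model.predict(X))
    recalls = []
    for label in set(y_true):
        rows = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in rows if y_pred[i] == label)
        recalls.append(hits / len(rows))
    return sum(recalls) / len(recalls)


pipeline_kwargs = {
    "task": "classification",
    # The first metric in the list is the one the pipeline optimizes for.
    "score_metric": [("balanced_accuracy", balanced_accuracy), "f1_macro"],
    "model_list": ["LogisticRegression"],
    # n_algos_tuned == len(model_list) disables algorithm selection.
    "n_algos_tuned": 1,
    # A float keeps at least this fraction of the features.
    "min_features": 0.5,
    "search_space": {
        "LogisticRegression": {
            "C": {"range": [0.03125, 512], "type": "continuous"},
        }
    },
    # Per-algorithm cap on the number of tuning trials.
    "max_tuning_trials": {"LogisticRegression": 100},
}

# With AutoML installed, the dictionary would be used as:
#   pipeline = automl.Pipeline(**pipeline_kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to log or version the configuration alongside the fitted pipeline.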
- classes_
-
Holds the label for each class (for
task=classification
only; otherwise it is set to None
).
- Type
-
numpy.ndarray of shape (n_classes,)
- selected_features_names_
-
Names of features selected by the AutoML pipeline. The feature names correspond to the features engineered by the preprocessing phase.
- Type
-
List[ str ]
- selected_features_names_raw_
-
Names of features selected by the AutoML pipeline, as found in the input training dataset. A raw feature is considered selected if at least one of the features engineered from it during preprocessing are selected.
- Type
-
List[ str ]
- ranked_models_
-
List of model names ranked in order of their quality from the last
fit
call.
- Type
-
List[ str ]
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.
- Type
- selected_rows_
-
List of indices in the original training dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute is a list of lists corresponding to the indices selected in each fold. For example, with no CV, this attribute looks like:
[0, 1, 5]
, indicating that indices 0, 1, and 5 were selected during adaptive sampling. With CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ]
, indicating that indices 0 and 1 were selected in the first fold, 0 and 5 in the second fold, and 1 and 5 in the third fold.
- Type
- selected_valid_rows_
-
List of indices in the original validation dataset (if
CV==None
) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None, given that Adaptive Sampling does not sample the validation set when CV is enabled.
- Type
- pipelines_
-
Sorted list of pipelines (length equal to
n_algos_tuned
), with the 0th element being the best model.
- Type
- model_selection_trials_
-
ML model choices evaluated by model selection. Each tuple is of the form (algorithm, #samples, #features, mean validation score, hyperparameters, all validation scores, runtime, memory_usage), where the hyperparameters are a dict.
- Type
-
List[ tuple ]
- adaptive_sampling_trials_
-
Sampling choices evaluated by adaptive sampling. Each tuple is of the form (algorithm, #samples, #features, mean validation score, hyperparameters, all validation scores, runtime, memory_usage), where the hyperparameters are a dict.
- Type
-
List[ tuple ]
- feature_selection_trials_
-
Subset/Ranking algorithm choices evaluated by feature selection. Each tuple is of the form (algorithm, #samples, #features, mean validation score, hyperparameters, all validation scores, runtime, memory_usage), where the hyperparameters are a dict.
- Type
-
List[ tuple ]
- tuning_trials_
-
Hyperparameter choices evaluated by model tuning ranked in order of their achieved cross-validation scores. Each tuple is of the form (algorithm, #samples, #features, mean validation score, hyperparameters, all validation scores, runtime, memory_usage), where the hyperparameters are a dict.
- Type
-
List[ tuple ]
- all_trials_
-
All trials performed by the AutoML Pipeline. This includes
model_selection_trials_
, adaptive_sampling_trials_
, feature_selection_trials_
and tuning_trials_
. Each tuple is of the form (algorithm, #samples, #features, mean validation score, hyperparameters, all validation scores, runtime, memory_usage), where the hyperparameters are a dict.
- Type
-
List[ tuple ]
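The trial tuples described above can be easier to work with once their fields are named. The following is a minimal sketch: the Trial namedtuple, the helper best_trial, and the example values are all hypothetical, and it assumes that a higher mean validation score is better (consistent with the neg_* naming of the loss-based metrics). The units of runtime and memory_usage are not stated in this reference, so the sketch does not interpret them.

```python
from collections import namedtuple

# Field order follows the tuple layout documented above.
Trial = namedtuple(
    "Trial",
    [
        "algorithm", "n_samples", "n_features", "mean_validation_score",
        "hyperparameters", "all_validation_scores", "runtime", "memory_usage",
    ],
)

# Hypothetical trial tuples; real values come from a fitted pipeline.
example_trials = [
    ("LogisticRegression", 1000, 20, -0.41, {"C": 1.0}, [-0.40, -0.42], 1.3, 0.05),
    ("RandomForestClassifier", 1000, 20, -0.35, {"n_estimators": 100}, [-0.34, -0.36], 4.1, 0.12),
]


def best_trial(trials):
    """Return the trial with the highest mean validation score."""
    return max((Trial(*t) for t in trials), key=lambda t: t.mean_validation_score)


best = best_trial(example_trials)
print(best.algorithm, best.mean_validation_score)  # RandomForestClassifier -0.35
```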
- all_trials_extra_scores_
-
A DataFrame listing all trials performed by the AutoML Pipeline with the values of all score metrics. Each row is of the form (Algorithm, Hyperparameters, # Samples, # Features, Features, Stage, Scoring Metric, CV Fold ID, Score).
- Type
- n_jobs_
-
Parallelism internally used by AutoML. Calculated as
inter_model_parallelism*intra_model_parallelism
.
- Type
- feature_importances_
-
List of feature importance tuples. Each tuple contains the feature name, its importance, and its standard deviation.
- Type
-
List[ tuple ]
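Since each entry is a (feature name, importance, standard deviation) tuple, ranking features is a plain sort on the second element. The sketch below uses made-up values and a hypothetical helper named top_features.

```python
# Hypothetical (feature name, importance, standard deviation) tuples.
example_importances = [
    ("age", 0.12, 0.01),
    ("income", 0.45, 0.03),
    ("zip_code", 0.02, 0.02),
]


def top_features(importances, k):
    """Return the names of the k features with the largest importance."""
    ranked = sorted(importances, key=lambda item: item[1], reverse=True)
    return [name for name, _importance, _std in ranked[:k]]


print(top_features(example_importances, 2))  # ['income', 'age']
```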
- threshold_tuning_score_
-
The validation score of the pipeline after applying threshold tuning. The scoring metric used to select this threshold can be found in threshold_tuning_scorer_ . It is None when the task is not classification or threshold_tuning is False.
- Type
- threshold_tuning_scorer_
-
The scoring metric used to select threshold during threshold tuning. It is None when the task is not classification or threshold_tuning is False.
- Type
-
callable
- fit ( X = None , y = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = 0 , contamination = None )
-
Automatically identifies the most relevant features, model, and hyperparameters for the given training data (
X
) and target (y
). The final model fit is conducted on the full dataset.
- Parameters
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series , optional ) – Training dataset target. Optional for semi-supervised tasks, e.g., anomaly detection. Note that y is required for forecasting tasks and must be passed as None for unsupervised anomaly detection.
-
X_valid ( pandas.DataFrame , optional ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series , optional ) – Validation dataset target
-
cv ( 'auto' , int , cross-validation generator or an iterable , optional ) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if the number of instances is below 1M; otherwise CV folds are disabled
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and
y
is either binary or multiclass, StratifiedKFold
is used. In all other cases, KFold
is used. -
-
col_types (List[str], optional ) – List of length
X.shape[1]
with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’ and ‘timedelta’ -
time_budget (float, optional ) –
Time budget in seconds.
-
0 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
Defaults to
0
. -
-
contamination (float, optional ) – Fraction of the training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics). Defaults to
None
.
-
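The iterable form of cv described above expects (train, test) index splits. The generator below is a minimal sketch of such an iterable, producing contiguous, non-shuffled folds. The helper name kfold_indices is hypothetical, and the sketch yields plain Python lists; whether AutoML requires NumPy arrays instead is not confirmed by this reference.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_rows: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, non-shuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


for train_idx, test_idx in kfold_indices(n_rows=5, n_folds=2):
    print(train_idx, test_idx)
# [3, 4] [0, 1, 2]
# [0, 1, 2] [3, 4]
```

Materializing the generator with list(...) before passing it on lets the same splits be reused across calls.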
- Returns
-
self
- Return type
- forecast ( periods , alpha = 0.05 , X = None )
-
Forecast with the selected model.
Only out-of-sample forecasts are supported. In-sample fit values should be accessed via the predict interface.
- Parameters
-
-
periods ( int ) – The number of time steps to forecast from the end of the sample.
-
alpha (float, optional ) – Significance level. For a 95% prediction interval, alpha must be set to 0.05. Defaults to
0.05
. -
X (pandas.DataFrame, optional ) – A dataframe of explanatory variables covering the periods time steps to be forecast. Columns must match those used in
fit
.
-
- Returns
-
summary_frame – A dataframe with three columns listing prediction, ci_lower and ci_upper for the confidence interval (ci) determined by alpha. Note: ci columns are excluded for models that don’t support intervals.
- Return type
-
pandas.DataFrame
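The relationship between alpha and the width of the prediction interval is simply coverage = 1 - alpha. A small helper (hypothetical, not part of AutoML) makes the conversion explicit when the desired coverage is what you have in mind:

```python
def alpha_for_coverage(coverage: float) -> float:
    """Convert a desired interval coverage (e.g. 0.95) to a significance level."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    return 1.0 - coverage


print(round(alpha_for_coverage(0.95), 4))  # 0.05
print(round(alpha_for_coverage(0.80), 4))  # 0.2
```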
- predict ( X )
-
Predict labels for features (X).
- Parameters
-
X ( pandas.DataFrame ) – Training dataset features, or Explanatory features if task is ‘forecasting’
- Returns
-
y_pred – The predicted values.
- Return type
-
numpy.ndarray of shape (n_samples,)
- predict_proba ( X )
-
Probability estimates.
More information can be found here: Prediction Probabilities
- Parameters
-
X ( pandas.DataFrame ) – Training dataset features
- Returns
-
y_pred_proba – The predicted probabilities.
- Return type
-
numpy.ndarray of shape = (n_samples, n_classes)
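When threshold_tuning is False, the parameter description states that the class with the largest raw prediction probability is returned as the prediction. The sketch below shows that mapping on plain Python lists. The helper name labels_from_proba is hypothetical, and the sketch assumes that the columns of the probability matrix follow the order of classes_.

```python
from typing import List, Sequence


def labels_from_proba(proba: Sequence[Sequence[float]], classes: Sequence) -> List:
    """Map each row of class probabilities to the most probable class label."""
    labels = []
    for row in proba:
        # Index of the largest probability in this row (first one on ties).
        best_index = max(range(len(row)), key=lambda j: row[j])
        labels.append(classes[best_index])
    return labels


example_proba = [[0.2, 0.8], [0.6, 0.4]]
print(labels_from_proba(example_proba, ["neg", "pos"]))  # ['pos', 'neg']
```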
- print_memory_usage ( in_ipython = None )
-
Prints max memory usage information about the last pipeline run.
- Parameters
-
in_ipython ( bool ) – Set to True if an IPython kernel is being used
- print_profile_summary ( in_ipython = None )
-
Prints profiling information about the last pipeline run.
- Parameters
-
in_ipython ( bool ) – Set to True if an IPython kernel is being used
- print_summary ( in_ipython = None )
-
Prints information about the last
fit
call.
- Parameters
-
in_ipython ( bool ) – Set to True if an IPython kernel is being used
- print_times ( in_ipython = None )
-
Prints timing and speedup information about the last
fit
call.
- Parameters
-
in_ipython ( bool ) – Set to True if an IPython kernel is being used
- print_trials ( max_rows = None , sort_column = 'Mean Validation Score' , in_ipython = None )
-
Prints all trials executed by the AutoML Pipeline in the last
fit
call.
- Parameters
-
-
max_rows ( int ) – Number of trials to print. Pass in None to print all trials
-
sort_column ( str ) – Column to sort results by. Must be one of [‘Algorithm’, ‘#Samples’, ‘#Features’, ‘Mean Validation Score’, ‘Hyperparameters’, ‘All Validation Scores’, ‘CPU Time’]
-
in_ipython ( bool ) – Set to True if an IPython kernel is being used
-
- refit ( X = None , y = None , X_valid = None , y_valid = None , cv = 'auto' )
-
Refits a previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, and Model Tuning are reused.
fit
must have been called before calling this method. If a validation set is provided, it is concatenated with the training set before the refit.
- Parameters
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.
-
y ( pandas.DataFrame , pandas.Series , optional ) – Training dataset target. Optional for semi-supervised tasks, like anomaly detection.
-
X_valid ( pandas.DataFrame , optional ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series , optional ) – Validation dataset target
-
cv ( 'auto' , int , cross-validation generator or an iterable , optional ) –
Determines the cross-validation splitting strategy. Used for ensemble generation. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if the number of instances is below 1M; otherwise CV folds are disabled
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and
y
is either binary or multiclass, StratifiedKFold
is used. In all other cases, KFold
is used. -
-
- Returns
-
self
- Return type
- score ( X , y )
-
Score of this pipeline for a given set of features (
X
) and labels (y
). If inferred_score_metric has multiple score metrics, the first score metric is calculated.
- Parameters
-
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame ) – Training dataset target
-
- Returns
-
score – Score of
self.predict(X)
with respect to y
. - Return type
- to_onnx ( X , y )
-
Serializes an AutoML estimator to the ONNX format. Only one sample from the training or test set is required as input. This sample is used to infer the final types and shapes.
- Parameters
-
-
X ( pandas.DataFrame , numpy.ndarray , scipy.sparse.csr.csr_matrix ) – Sample dataset features
-
y ( pandas.DataFrame , pandas.Series , numpy.ndarray , optional ) – Sample dataset target. Optional for semi-supervised tasks, like anomaly detection.
-
- Returns
-
An ONNX model
- Return type
-
onnx.ModelProto
- train ( X = None , y = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = 0 , contamination = None )
-
Automatically identifies the most relevant features, model and hyperparameters for this given set of features (X) and target (y). Does not conduct final model fit. If the latter is desired, use
fit
.
- Parameters
-
-
X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.
-
y ( pandas.DataFrame , pandas.Series , optional ) – Training dataset target. Optional for semi-supervised tasks, like anomaly detection.
-
X_valid ( pandas.DataFrame , optional ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series , optional ) – Validation dataset target
-
cv ( 'auto' , int , cross-validation generator or an iterable , optional ) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
-
None: uses X_valid and y_valid for validation
-
’auto’: uses 5 folds if the number of instances is below 1M; otherwise CV folds are disabled
-
integer: specifies the number of folds in a (Stratified)KFold,
-
iterable: yields (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and
y
is either binary or multiclass, StratifiedKFold
is used. In all other cases, KFold
is used. -
-
col_types ( list of strings ) – List of length
X.shape[1]
with string values indicating the type of each feature. Supported types are: [‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’]. Note: Datetime support is in an experimental stage. -
time_budget ( float , optional ) –
Time budget in seconds.
-
0 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
(default: 0)
-
-
contamination ( float , None , optional ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics). (default: None)
-
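The non-iterable cv options listed above can be summarized as a small lookup. This sketch only restates the documented rules and is not AutoML's code: folds_for_cv is a hypothetical name, None as a return value means "no cross-validation folds", and iterables and cross-validation generators are deliberately left out.

```python
from typing import Optional, Union


def folds_for_cv(cv: Union[str, int, None], n_rows: int) -> Optional[int]:
    """Return the number of CV folds implied by cv, or None if CV is not used."""
    if cv is None:
        return None  # X_valid and y_valid are used for validation instead
    if cv == "auto":
        # Documented rule: 5 folds below 1M instances, CV folds disabled otherwise.
        return 5 if n_rows < 1_000_000 else None
    if isinstance(cv, int) and not isinstance(cv, bool):
        return cv
    raise TypeError("this sketch only handles None, 'auto', and integers")


print(folds_for_cv("auto", 50_000))     # 5
print(folds_for_cv("auto", 2_000_000))  # None
print(folds_for_cv(3, 50_000))          # 3
```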
- Returns
-
self
- Return type
- transform ( X , y )
-
Applies automatic preprocessing to a given set of features (
X
) and labels (y
).
- Parameters
-
-
X ( pandas.DataFrame ) – Dataset features
-
y ( pandas.DataFrame ) – Dataset target
-
- Returns
-
-
X ( pandas.DataFrame ) – Transformed dataset features
-
y ( pandas.DataFrame ) – Transformed dataset target
-
-
ModelTune
- class automl.ModelTune ( task = 'classification' , score_metric = None , random_state = 7 )
-
Automatic Model Tuning object that uses a highly parallel, scalable, and asynchronous gradient-based hyperparameter optimizer to quickly prune the hyperparameter search space and tune the given model object.
Warning
The model tuning object is deprecated and will be removed in version 23.3.0.
- Parameters
-
-
task ( str , default=classification ) – Machine learning task, supported inputs: classification, regression, anomaly_detection, forecasting
-
score_metric ( str , callable , tuple , list , default=None ) –
One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.
-
If
None
: it will be determined automatically depending on the task. Default score metrics : binary: neg_log_loss, multiclass: neg_log_loss, continuous: neg_mean_squared_error, continuous_forecast: neg_sym_mean_abs_percent_error, unsupervised: unsupervised_n-1_experts -
If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.
-
If a callable: score function (or loss function) with signature
score_func(model, X, y)
. -
If a tuple: should be a tuple with two values with types
(str, callable)
. The string corresponds to the name of the scoring metric, and the callable should have the same signature as above. -
If a str: automatically infers the scoring metric from the string:
binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
multiclass – neg_log_loss, recall_macro, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples
continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error
continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error
unsupervised – unsupervised_n-1_experts, unsupervised_unify95, unsupervised_unify95_log_loss
More information on scoring metrics can be found here: Classification metrics, Regression metrics.
Note: Scoring variations like
recall_macro
are equivalent to sklearn.metrics.recall_score(..., average="macro").
-
-
random_state ( int , default=7 ) – Random seed used by ModelTune.
-
- selected_model_params_
-
Dictionary containing the optimal hyperparameters for the selected_model. Keys are hyperparameter names with their corresponding values.
- Type
- tuning_trials_
-
Hyperparameter choices evaluated by tuning, ranked in order of their achieved cross-validation scores. Each tuple is of the form (float, dict), where the float is the cross-validation score of the corresponding dict, which is a particular set of hyperparameters.
- at_summary
-
Dictionary containing the following stats: all trials, in the form of list of tuples, where each tuple is (score, hyperparameter, EvalResult (This is an internal class that holds all relevant results for each model evaluation)), total runtime of ModelTune, score of default run, default hyperparameter values, scoring metric, shape of train dataset, and shape of validation dataset (if it is available)
- Type
- fit ( model , X , y , X_valid = None , y_valid = None , cv = 5 , param_space = None , col_types = None , contamination = None )
-
Automatically identifies the best model hyperparameters for this given set of features (X) and target (y).
- Parameters
-
-
model ( str or sklearn-like model object ) – Model to tune on the given dataset, where it can be either one of the supported model names or sklearn-like object that implements at least the following methods:
get_params
, set_params
, fit
, predict
and score
. When providing an unsupported custom model, the param_space
argument should not be None
.
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame , pandas.Series ) – Training dataset target
-
X_valid ( pandas.DataFrame ) – Validation dataset features
-
y_valid ( pandas.DataFrame , pandas.Series ) – Validation dataset target
-
cv ( int , cross-validation generator or an iterable , optional ) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
-
None, to use X_valid and y_valid for validation
-
integer, to specify the number of folds in a (Stratified)KFold ,
-
a generator like a
StratifiedKFold
or KFold
, -
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and
y
is either binary or multiclass, StratifiedKFold
is used. In all other cases, KFold
is used. -
-
param_space ( dict ) –
The search ranges for each hyperparameter (default: None, which uses the predefined model-specific hyperparameter search space). Example param_space input for RandomForest algorithm tuning:
    param_space = {
        'n_estimators': {
            'range': [5, 500],
            'type': 'discrete',
        },
        'max_features': {
            'range': [0.01, 0.5],
            'type': 'continuous',
        },
        'class_weight': {
            'range': ['balanced', 'balanced_subsample'],
            'type': 'categorical',
        },
    }
-
col_types ( list ) – List of strings identifying the type of each feature. Supported types are: [‘categorical’, ‘numerical’, ‘datetime’, ‘timedelta’]
-
contamination ( float , optional ) – Fraction of training dataset corresponding to anomalies. contamination has to be between 0 and 0.5. Default value is 0.01.
-
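Because a param_space entry must carry both a ‘range’ list and a ‘type’ drawn from ‘continuous’, ‘discrete’, and ‘categorical’, a quick structural check before calling fit can catch typos early. The validator below is a sketch under those documented constraints only. The names check_param_space and ALLOWED_TYPES are hypothetical, and AutoML may enforce additional rules that are not covered here.

```python
ALLOWED_TYPES = {"continuous", "discrete", "categorical"}


def check_param_space(param_space: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, spec in param_space.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected a dict with 'range' and 'type'")
            continue
        if not isinstance(spec.get("range"), list) or not spec["range"]:
            problems.append(f"{name}: 'range' must be a non-empty list")
        if spec.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: 'type' must be one of {sorted(ALLOWED_TYPES)}")
    return problems


# The RandomForest example from the parameter description above.
example_space = {
    "n_estimators": {"range": [5, 500], "type": "discrete"},
    "max_features": {"range": [0.01, 0.5], "type": "continuous"},
    "class_weight": {"range": ["balanced", "balanced_subsample"], "type": "categorical"},
}

print(check_param_space(example_space))  # []
print(len(check_param_space({"depth": {"range": [], "type": "int"}})))  # 2
```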
- Returns
-
self
- Return type
- predict ( X )
-
Predict labels for features (X).
- Parameters
-
X ( pandas.DataFrame ) – Training dataset features, or Explanatory features if task is ‘forecasting’
- Returns
-
y_pred – The predicted values.
- Return type
-
numpy.ndarray of shape (n_samples,)
- predict_proba ( X )
-
Probability estimates.
More information can be found here: Prediction Probabilities
- Parameters
-
X ( pandas.DataFrame ) – Training dataset features
- Returns
-
y_pred_proba – The predicted probabilities.
- Return type
-
numpy.ndarray of shape = (n_samples, n_classes)
- score ( X , y )
-
Score of this pipeline for a given set of features (
X
) and labels (y
). If inferred_score_metric has multiple score metrics, the first score metric is calculated.
- Parameters
-
-
X ( pandas.DataFrame ) – Training dataset features
-
y ( pandas.DataFrame ) – Training dataset target
-
- Returns
-
score – Score of
self.predict(X)
with respect to y
. - Return type
- transform ( X , y )
-
Applies automatic preprocessing to a given set of features (
X
) and labels (y
).
- Parameters
-
-
X ( pandas.DataFrame ) – Dataset features
-
y ( pandas.DataFrame ) – Dataset target
-
- Returns
-
-
X ( pandas.DataFrame ) – Transformed dataset features
-
y ( pandas.DataFrame ) – Transformed dataset target
-