Considerations for Secure Use

  • On data validation
    • The pipeline consumes input data as Pandas DataFrames. It is advised to validate data types on loading, for example, by holding the known dtypes of your application data in a separate file and passing them to the pandas.read_csv utility through its dtype argument.
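
The dtype validation described above can be sketched as follows. This is a minimal illustration, not part of the library: the helper name load_validated_csv, the JSON schema file, and its column-name-to-dtype layout are all assumptions made for the example.

```python
import json

import pandas as pd


def load_validated_csv(csv_path, schema_path):
    """Load a CSV file, enforcing the dtypes stored in a separate JSON file.

    Hypothetical helper. The schema file is assumed to map column names to
    pandas dtype strings, e.g. {"age": "int64", "income": "float64"}.
    """
    with open(schema_path) as f:
        expected = json.load(f)

    # Read only the header first, so column mismatches are reported
    # before any data record is parsed.
    header = pd.read_csv(csv_path, nrows=0)
    if set(header.columns) != set(expected):
        raise ValueError("CSV columns do not match the expected schema")

    # read_csv raises ValueError when a value cannot be converted
    # to the dtype requested for its column.
    return pd.read_csv(csv_path, dtype=expected)
```

Keeping the schema in its own file lets it be reviewed and access-controlled separately from the data it describes.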

  • On logging
    • By default, logging to console is enabled and logging to file is disabled.

    • If logging to file is desired, initialize with automl.init(file_path=<desired_file_location>). A file handler is attached to the logging module, duplicating console output to a disk-resident file. This file is created and edited with read-write permissions for the POSIX user executing the AutoML application. It is recommended to further harden logfile permissions to 0400 once model training is complete.

    • It is recommended to log at loglevel=logging.INFO. Data column names are assumed not to be sensitive attributes. If this assumption does not hold for your application, an anonymization utility that deidentifies column names with a transform/inverse_transform wrapper may be utilized. This wrapper should envelop both automl.Pipeline.fit and automl.Pipeline.predict calls.

    • In rare instances, on data read failures, logging.INFO may reveal the line number and contents of bad data. Use of the pandas.read_csv utility with the error_bad_lines=False argument is recommended to suppress record information leakage to the logfile. In pandas 1.3 and later, error_bad_lines is deprecated in favor of on_bad_lines="skip".

    • Logging at logging.DEBUG level is not recommended. It is for use only by individuals who also have plaintext privileges to the input data. Debug information may reveal data record attributes.
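
Two of the logging recommendations above can be sketched in a few lines. This is an illustration under stated assumptions: ColumnAnonymizer and harden_logfile are hypothetical names that are not provided by the library, and the positional alias scheme ("col_0", "col_1", ...) is just one possible choice.

```python
import os
import stat

import pandas as pd


class ColumnAnonymizer:
    """Hypothetical helper: swaps real column names for opaque aliases."""

    def fit(self, df):
        # Map each real column name to a positional alias such as "col_0",
        # and keep the reverse mapping for inverse_transform.
        self.forward_ = {name: f"col_{i}" for i, name in enumerate(df.columns)}
        self.inverse_ = {alias: name for name, alias in self.forward_.items()}
        return self

    def transform(self, df):
        # Returns a renamed copy; the original DataFrame is left untouched.
        return df.rename(columns=self.forward_)

    def inverse_transform(self, df):
        return df.rename(columns=self.inverse_)


def harden_logfile(path):
    """Restrict the logfile to owner read-only (0400) after training."""
    os.chmod(path, stat.S_IRUSR)
```

In use, the same fitted anonymizer would transform the DataFrames passed to both the fit and predict calls, so that only the aliases can appear in log output, and inverse_transform would restore the real names on any column-labelled result.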

  • On execution engine
    • The dask (respectively, local “python multiprocessing”) execution backend requires write privileges to the /tmp directory, where it writes socket (respectively, file descriptor) metadata to disk to coordinate communication between the parallel execution threads.

    • The default directory is chosen from a platform-dependent list (e.g., on POSIX systems such as Linux it is the /tmp directory), but the user of the application can control the directory location by setting the TMPDIR, TEMP, or TMP environment variables. This configuration can also be set directly within Python.

    • While this metadata does not contain any sensitive information, the folder in the /tmp directory is created with 0600 permissions using the Python tempfile module. Note that on POSIX, a process that is terminated abruptly with SIGKILL cannot automatically delete any tempfile.NamedTemporaryFile objects it created, so stale files may need to be cleaned up manually.
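
Redirecting the temporary directory from within Python, as mentioned above, might look like the following sketch. The helper name use_private_tmpdir and the "automl-" prefix are assumptions for illustration. Whether the execution backend picks up the new location depends on when it is initialized, so this would need to run before the backend starts.

```python
import os
import tempfile


def use_private_tmpdir(base_dir):
    """Create an owner-only directory under base_dir and make it the default.

    Hypothetical helper, not part of the library.
    """
    # mkdtemp creates a directory that is readable, writable, and
    # searchable only by the creating user.
    private_dir = tempfile.mkdtemp(prefix="automl-", dir=base_dir)

    # Child processes started after this point inherit the variable.
    os.environ["TMPDIR"] = private_dir

    # Override the default already cached by this process, if any.
    tempfile.tempdir = private_dir
    return private_dir
```

Because a process killed with SIGKILL cannot clean up after itself, a dedicated directory like this also makes leftover files easier to locate and remove.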

  • On pipeline serialization for model persistence
    • After training the AutoML pipeline, it may be desirable to persist the pipeline for future use without having to retrain. The pipeline object can be exported as a pickle dump and/or in the ONNX format (except for forecasting models, which can only be pickled).

    • Loading the pickled (or ONNX) pipeline object from disk to memory is equivalent to executing code. Therefore, it is critical to ensure the integrity of the serialized object before loading it. Failure to do so may expose you to arbitrary code execution (ACE).
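
One way to check the integrity of a pickled object before loading it is to store a keyed digest alongside it. The following is a minimal sketch, assuming an HMAC-SHA256 signature kept in a sidecar ".sig" file. The function names, the sidecar convention, and the key handling are assumptions for illustration and are not provided by the library.

```python
import hashlib
import hmac
import pickle


def _sign(payload, key):
    # Keyed SHA-256 digest of the serialized bytes.
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def dump_signed(obj, path, key):
    """Pickle obj to path and write its HMAC to path + '.sig'."""
    payload = pickle.dumps(obj)
    with open(path, "wb") as f:
        f.write(payload)
    with open(path + ".sig", "w") as f:
        f.write(_sign(payload, key))


def load_verified(path, key):
    """Unpickle path only if its HMAC matches the stored signature."""
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".sig") as f:
        expected = f.read().strip()
    # compare_digest avoids leaking information through timing.
    if not hmac.compare_digest(_sign(payload, key), expected):
        raise ValueError("integrity check failed; refusing to unpickle")
    return pickle.loads(payload)
```

The check is only as strong as the protection of the key: it should be stored separately from the serialized pipeline, with access limited to the users allowed to publish models.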

  • Running untrusted AutoML pipeline/models
    • Avoid running untrusted models. If you must, execute them only inside a sandbox, such as that provided by the nsjail utility.

    • If the integrity of the pipeline object is in doubt, consider carefully auditing the pipeline object in a sandbox before using it.

  • On machine learning model security
    • This library provides no protection against security attacks on the internal parameters of the ML model. Such attacks range from Data Poisoning, Membership Inference, Attribute Inference, Adversarial, and Byzantine attacks to Model Extraction.

    • If your application requires protection against such attacks, you will need to implement it separately. For example, to prevent Data Poisoning attacks, you would need to either ensure the integrity of the data input to the pipeline or sanitize each data record.