Considerations for Secure Use

  • On data validation
    • The pipeline consumes data inputs formatted as a Pandas DataFrame for tabular workloads or as HuggingFace datasets for vision and NLP workloads. It is advised to perform data type validation on loading, for example, by holding the known dtypes of your application data in a separate file and passing them to the pandas.read_csv utility when the data is loaded.
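
    • For example, a minimal sketch of dtype enforcement at load time (the schema file name, its JSON layout, and the CSV path are illustrative assumptions, not part of AutoMLx):

        import json
        import pandas as pd

        # Expected column dtypes kept in a separate, trusted file.
        with open("schema.json") as f:
            expected_dtypes = json.load(f)  # e.g. {"age": "int64", "income": "float64"}

        # Enforce the known dtypes at load time; pandas raises if a column
        # cannot be cast to the declared type.
        df = pd.read_csv("train.csv", dtype=expected_dtypes)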

  • On logging
    • By default, logging to the console is enabled and logging to a file is disabled. Initializing the engine explicitly also initializes logging to stdout.

    • If logging to a file is desired, the engine can be initialized with automl.init(logger="/my/logging/directory"). A file handler for the logging module is instantiated, duplicating console output to a disk-resident file. This file is created and edited with read-write permissions for the POSIX user executing the AutoMLx application. It is recommended to further harden logfile permissions to 0400 once model training is complete.
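
    • For example, a sketch of enabling file logging and then hardening the resulting logfile (the logfile name inside the directory is an assumption; list the directory to find the actual file your run produces):

        import os
        import stat

        import automl

        # Duplicate console output to a disk-resident file under this directory.
        automl.init(logger="/my/logging/directory")

        # ... fit and evaluate models ...

        # Once training is complete, harden the logfile to read-only for the
        # owning user (0400). The filename below is a placeholder.
        os.chmod("/my/logging/directory/automl.log", stat.S_IRUSR)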

    • It is recommended to log at loglevel=logging.INFO. Data column names are assumed not to be sensitive attributes. If this assumption does not hold for your application, an anonymization utility that de-identifies column names with a transform/inverse_transform wrapper may be utilized. This wrapper should envelop all AutoMLx function calls that accept data as an input argument; for example, for the AutoML Regressor, this includes both automl.Pipeline.fit and automl.Pipeline.predict.
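
    • A minimal sketch of such a wrapper is shown below; the ColumnAnonymizer class and its method names are hypothetical helpers, not part of AutoMLx:

        import pandas as pd

        class ColumnAnonymizer:
            """Hypothetical helper that replaces column names with opaque tokens
            before data is handed to AutoMLx, and restores them afterwards."""

            def transform(self, df: pd.DataFrame) -> pd.DataFrame:
                self._mapping = {c: f"col_{i}" for i, c in enumerate(df.columns)}
                return df.rename(columns=self._mapping)

            def inverse_transform(self, df: pd.DataFrame) -> pd.DataFrame:
                inverse = {v: k for k, v in self._mapping.items()}
                return df.rename(columns=inverse)

        # Envelop every AutoMLx call that accepts data, e.g.:
        #   anon = ColumnAnonymizer()
        #   pipeline.fit(anon.transform(X_train), y_train)
        #   predictions = pipeline.predict(anon.transform(X_test))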

    • Aggregated dataset statistics are suppressed at loglevel=logging.INFO. To enable them, you may utilize a slightly more permissive loglevel=logging.SENSITIVE_INFO. Dataset characteristics that inform AutoMLx steps (such as the number of records within each target class) are logged at this level.

    • Logging at logging.DEBUG level is not recommended in production deployments. It is for use only by individuals who also have plaintext privileges to the input data. Debug information may reveal data record attributes.

    • If the pandas library is used to load datasets, then the pandas.read_csv utility with the error_bad_lines=False argument is recommended to suppress record-information leakage to the logfile.
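
    • For example (note that pandas 1.3 and later replaced error_bad_lines with the on_bad_lines argument):

        import pandas as pd

        # pandas < 1.3: suppress reporting of malformed records.
        # df = pd.read_csv("train.csv", error_bad_lines=False)

        # pandas >= 1.3: the equivalent option is on_bad_lines="skip".
        df = pd.read_csv("train.csv", on_bad_lines="skip")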

  • On AutoMLx cache store
    • AutoMLx uses a disk-resident directory to cache intermediate results produced by data transformations and model trials. For tabular datasets and tasks such as classification, regression, forecasting, or anomaly detection, no disk-resident cache is used. For image classification, a cache directory is created and destroyed as part of the AutoMLx process.

    • The default directory is chosen from a platform-dependent list (e.g., on POSIX Linux it is the /tmp directory), but the user of the application can control the directory location by setting the TMPDIR, TEMP or TMP environment variables.
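
    • For example, the location can be pinned from within the Python process, provided it is set before AutoMLx (or the tempfile module) first resolves the temporary directory:

        import os

        # Must run before the engine is initialized.
        os.environ["TMPDIR"] = "/my/secure/tmp"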

    • The cache directory is created securely using the tempfile module (with read/write access only for the user creating it). If none of the temporary locations are writable, we default to the __pycache__ directory, which is typically where the Python source code resides. Files and directories created by AutoMLx are always cleaned up when the engine is shut down.

    • CRITICAL: SIGTERM events and other catastrophic system failures may leave these files abandoned in the temporary directory. If this is a privacy/security concern, you may change the default directory with the init_engine(cache_dir="/my/secure/directory") function.

    • Note that the cache_dir path passed in through the argument is not cleaned up by the AutoMLx process; hence it is your responsibility to delete its contents, for example, with a call to shutil.rmtree("/my/secure/directory").
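
    • For example, a sketch combining a custom cache directory with cleanup registered at interpreter exit (the atexit-based cleanup is illustrative; the shell exit trap described next is an alternative):

        import atexit
        import shutil

        import automlx

        cache_dir = "/my/secure/directory"

        # Remove the cache contents when the Python interpreter exits normally.
        atexit.register(shutil.rmtree, cache_dir, ignore_errors=True)

        automlx.init_engine(cache_dir=cache_dir)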

    • Alternatively, an exit trap may be utilized in a shell script; for example, by placing the command trap "rm -rf /my/secure/directory" EXIT in the shell script that executes the Python code. It is important that the trap statement be placed at the beginning of the script, because any command above the trap can exit without being caught by the trap.

  • On Execution backend engine
    • The Python multiprocessing execution backend requires write privileges to the AutoMLx working cache directory, wherein it writes file-descriptor metadata to disk to coordinate communication between the parallel workers. There is no sensitive information in these handles.

    • The Ray execution backend writes logs and execution session metadata to the AutoMLx working cache directory. This information is removed when the process finishes running. To retain this metadata, you may configure its location by providing temp_dir in the ray_setup dict of engine_opts when initializing the AutoMLx engine.
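
    • For example (the parameter names follow the description above; whether ray_setup is passed directly or nested in engine_opts may differ by release, so verify against your installed version):

        import automlx

        # Retain Ray session metadata in a chosen location so it can be
        # inspected, or securely wiped, after the run.
        automlx.init_engine(engine_opts={"ray_setup": {"temp_dir": "/my/ray/metadata"}})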

    • In both instances, the metadata contains only execution records and not sensitive data. The directory that houses execution metadata is created with permissions of 0600 using the Python tempfile module. Note that on POSIX, a process that is terminated abruptly with SIGKILL cannot automatically delete any tempfile.NamedTemporaryFile objects it creates.

    • Ray logging can be further modified by passing the Ray setup dictionary as an argument: automlx.init_engine(ray_setup={"my_kwarg": "kwarg_value"}).

  • On Distributed trials in AutoMLx with Multi-node Ray cluster
    • To set up AutoMLx to utilize a cluster of nodes, with each node computing model trials in a distributed fashion, please refer to the ExecutionEngineSetup notebook.

    • A Multi-node Ray cluster carries two critical security and privacy risks:

    • The first is a privacy risk related to the communication of potentially sensitive datasets to and from the compute nodes in the cluster. This communication can be susceptible to eavesdropping if the communication channel is not encrypted. Scripts and instructions to set up secure TLS-authenticated communication between the cluster nodes are presented as part of the execution engine setup demo notebook. For a detailed look at how Ray is configured with TLS on its gRPC channels, please see https://docs.ray.io/en/latest/ray-core/configure.html

    • The second is a security risk stemming from having too many ports open on the server and compute nodes. Open ports can allow incoming connections to a device or network. The following ports must be opened for the corresponding Ray functionality: Ray Client (default port 10001), Dashboard (default port 8265), Ray GCS server (default port 6379), Serve (default port 8000), Ray Prometheus metrics (default port 8080). To further configure Ray ports, please refer to Ports Configuration. Port specifications can also be passed to the ray_setup argument of automlx.init_engine as a kwarg (keyword argument).
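
    • For example, a sketch that overrides one default port (dashboard_port is a standard Ray keyword shown for illustration; confirm the exact names against the Ports Configuration documentation):

        import automlx

        # Port settings are forwarded to Ray through the ray_setup dictionary.
        automlx.init_engine(ray_setup={"dashboard_port": 8265})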

  • On HuggingFace Datasets Caching
    • Tabular datasets do not utilize disk-based dataset cache stores.

    • Image classification and NLP tasks may utilize a disk-based cache if data is manipulated in the HuggingFace datasets format. In particular, AutoMLx may write partially processed files to the disk-resident HuggingFace cache.

    • You may disable disk caching using automl.init(use_dataset_caching=False), but this may impact performance for image classification.

    • The HuggingFace Datasets cache uses the same directory as the AutoMLx working directory cache, which can be set with init_engine(cache_dir="/my/secure/directory").

    • You may also change the dataset cache location specifically by setting the shell environment variable HF_DATASETS_CACHE to another directory.
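
    • For example:

        import os

        # Redirect only the HuggingFace Datasets cache; this must be set before
        # the datasets library is first imported.
        os.environ["HF_DATASETS_CACHE"] = "/my/secure/hf_cache"

        # Alternatively, disable disk caching entirely (argument name taken from
        # the documentation above), at a possible performance cost for image
        # classification:
        # automl.init(use_dataset_caching=False)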

  • On pipeline serialization for model persistence
    • After training the AutoML pipeline, it may be desirable to persist the pipeline for future use without having to retrain. The pipeline object can be exported as a pickle dump and/or in the ONNX format (except forecasting models, which may only be pickled).

    • Loading the pickled (or ONNX) pipeline object from disk to memory is equivalent to executing code. Therefore, it is critical to ensure the integrity of the serialized object. Failure to do so may expose you to arbitrary code execution (ACE).
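
    • A minimal sketch of one way to protect integrity, using a SHA-256 digest recorded at export time and verified before loading (the digest handling is illustrative, not an AutoMLx feature):

        import hashlib
        import pickle

        def sha256_of(path):
            with open(path, "rb") as f:
                return hashlib.sha256(f.read()).hexdigest()

        # At export time: `pipeline` is a trained AutoMLx pipeline object.
        with open("pipeline.pkl", "wb") as f:
            pickle.dump(pipeline, f)
        trusted_digest = sha256_of("pipeline.pkl")  # record this in a trusted location

        # At load time: refuse to deserialize if the digest does not match.
        if sha256_of("pipeline.pkl") != trusted_digest:
            raise RuntimeError("Serialized pipeline failed the integrity check")
        with open("pipeline.pkl", "rb") as f:
            pipeline = pickle.load(f)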

  • Running untrusted AutoML pipeline/models
    • If you must execute untrusted models, do so inside a sandbox, such as that provided by the nsjail utility.

    • If integrity of the pipeline object is in doubt, you may consider carefully auditing the pipeline object in a sandbox.

    • Note that to guarantee security, untrusted pipelines should always be used within the sandboxed environment.

  • On machine learning model security
    • This library provides no protection against security attacks on the internal parameters of the ML model. Attacks range from Data Poisoning, Membership Inference, Attribute Inference, Adversarial Examples, and Byzantine attacks to Model Extraction.

    • If your application requires protections against such attacks, they will need to be supplemented. For example, to prevent Data Poisoning attacks, you would need to either ensure the integrity of the data input to the pipeline or sanitize each data record.