Set Up Data Flow

Before you can create, manage, and run applications in Data Flow, the tenancy administrator (or any user with elevated privileges to create buckets and modify IAM) must create groups, a compartment, storage buckets, and the associated policies in IAM.

Figure 1. Overview of connections to and from Data Flow
At the top are the groups of users: ETL, Data Engineering, Administration, SQL User, and Data Science. Users in the ETL group load data from various data sources into Data Flow from DIS workspaces. The data is cleaned and prepared by users in the Data Engineering, Administration, and SQL User groups, then sent to disparate databases where Data Scientists can work on it in notebooks and with the model catalog.
These are the steps required to set up Data Flow:
  • Set up identity groups.
  • Set up the compartment and Object Storage buckets.
  • Set up Identity and Access Management (IAM) policies.

Set Up Identity Groups

As a general practice, categorize your Data Flow users into three groups for a clear separation of their use cases and privilege levels.

Create the following three groups in your identity service, and add users to each group (a scripted example follows the group descriptions):

  • dataflow-admins
  • dataflow-data-engineers
  • dataflow-sql-users
dataflow-admins
The users in this group are administrators or super-users of Data Flow. They have privileges to take any action on Data Flow and to set up and manage the resources related to it. They can manage Applications owned by other users and Runs initiated by any user within their tenancy. Users in dataflow-admins do not need administration access to the Spark clusters that Data Flow provisions on demand, because those clusters are fully managed by Data Flow.
dataflow-data-engineers
The users in this group have the privileges to manage and run Data Flow Applications and Runs for their data engineering jobs, for example, running Extract, Transform, Load (ETL) jobs in Data Flow's on-demand serverless Spark clusters. They neither have nor need administration access to the Spark clusters that Data Flow provisions on demand, because those clusters are fully managed by Data Flow.
dataflow-sql-users
The users in this group have the privileges to run interactive SQL queries by connecting to Data Flow Interactive SQL clusters over JDBC or ODBC.
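
If you prefer to script this step, the following is a minimal sketch using the OCI Python SDK (the oci package). It assumes a valid API signing configuration in ~/.oci/config and a user who is allowed to manage groups in the tenancy; the group names and descriptions mirror the recommendations above.

    import oci

    config = oci.config.from_file()                 # reads ~/.oci/config, DEFAULT profile
    identity = oci.identity.IdentityClient(config)
    tenancy_id = config["tenancy"]                  # IAM groups are created at the tenancy (root) level

    # Group names and descriptions recommended in this section.
    groups = {
        "dataflow-admins": "Administrators and super-users of Data Flow",
        "dataflow-data-engineers": "Users who manage and run Data Flow Applications and Runs",
        "dataflow-sql-users": "Users who run interactive SQL against Data Flow Interactive SQL clusters",
    }

    for name, description in groups.items():
        group = identity.create_group(
            oci.identity.models.CreateGroupDetails(
                compartment_id=tenancy_id,
                name=name,
                description=description,
            )
        ).data
        print(f"Created group {group.name} ({group.id})")

You can then add users to each group through the Console, or with the SDK's add_user_to_group operation.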

Set Up the Compartment and Object Storage Buckets

Follow these steps to create a compartment and Object Storage buckets for Data Flow. A scripted sketch of both steps follows at the end of this section.

Data Flow expects four specific storage buckets in Object Storage. We recommend that you create a compartment dedicated to Data Flow in which to organize and isolate your cloud resources. More information on compartments can be found in the IAM documentation.
  1. Create a compartment called dataflow-compartment.
    You can create the compartment using the Console or the API.
  2. Create the following storage buckets in Object Storage under the dataflow-compartment compartment:
    • dataflow-logs
    • dataflow-warehouse
    • managed-table-bucket
    • external-table-bucket
    dataflow-logs

    Data Flow requires a bucket to store the logs (both stdout and stderr) for every Application run. Create it as a standard storage tier bucket. The location of the bucket must follow the pattern: oci://dataflow-logs@<your_object_store_namespace>/.

    dataflow-warehouse

    Data Flow requires a data warehouse for Spark SQL applications. Create it as a standard storage tier bucket. The location of the warehouse must follow the pattern: oci://dataflow-warehouse@<your_object_store_namespace>/.

    managed-table-bucket and external-table-bucket

    For unstructured and semistructured data assets in Object Storage, Data Flow requires a metastore to securely store and retrieve schema definitions. Data Catalog Metastore provides a Hive-compatible metastore as a persistent external metadata repository shared across multiple OCI services. Before creating a metastore in Data Catalog, you must create two buckets in Object Storage to hold the managed and external tables. We recommend that you name those buckets managed-table-bucket and external-table-bucket.
    • managed-table-bucket is used for resources related to managed tables in Data Catalog's Hive-compatible metastore, where the metastore manages the table objects.
    • external-table-bucket is used for resources related to external tables in Data Catalog's Hive-compatible metastore, where you manage the table objects.
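
If you prefer to script these steps, the following is a minimal sketch using the OCI Python SDK (the oci package). It assumes a valid API signing configuration in ~/.oci/config and creates dataflow-compartment directly under the tenancy root; adjust the parent compartment OCID if you want it elsewhere.

    import oci

    config = oci.config.from_file()
    identity = oci.identity.IdentityClient(config)
    object_storage = oci.object_storage.ObjectStorageClient(config)

    # Step 1: create the dedicated compartment (here directly under the tenancy root).
    compartment = identity.create_compartment(
        oci.identity.models.CreateCompartmentDetails(
            compartment_id=config["tenancy"],
            name="dataflow-compartment",
            description="Dedicated compartment for Data Flow resources",
        )
    ).data

    # A new compartment takes a moment to become ACTIVE; wait before creating buckets in it.
    oci.wait_until(identity, identity.get_compartment(compartment.id),
                   "lifecycle_state", "ACTIVE")

    # Step 2: create the four standard storage tier buckets in that compartment.
    namespace = object_storage.get_namespace().data
    for bucket_name in ("dataflow-logs", "dataflow-warehouse",
                        "managed-table-bucket", "external-table-bucket"):
        object_storage.create_bucket(
            namespace,
            oci.object_storage.models.CreateBucketDetails(
                name=bucket_name,
                compartment_id=compartment.id,
                storage_tier="Standard",
            ),
        )
        print(f"Created oci://{bucket_name}@{namespace}/")

The printed locations for dataflow-logs and dataflow-warehouse match the oci://<bucket>@<your_object_store_namespace>/ patterns described above, which is how you reference them when creating Data Flow Applications.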