Before you can create, manage, and run applications in Data Flow, the tenant administrator (or any user with elevated privileges to create buckets and change policies in IAM) must create groups, a compartment, Object Storage buckets, and the associated IAM policies.
Figure 1. Overview of connections to and from Data Flow
These are the steps required to set up Data Flow:
Setting up identity groups.
Setting up the compartment and Object Storage buckets.
Setting up Identity and Access Management (IAM) policies.
Set Up Identity Groups
As a general practice, categorize your Data Flow users into three groups for a clear separation of their use cases and privilege levels. Create the following three groups in your identity service, and add users to each group (a scripted sketch follows the group descriptions below):
dataflow-admins
dataflow-data-engineers
dataflow-sql-users
dataflow-admins
The users in this group are administrators or super-users of Data Flow. They have privileges to take any action on Data Flow and to set up and manage the resources related to it. They manage Applications owned by other users and Runs started by any user within their tenancy. Users in dataflow-admins don't have, nor need, administration access to the Spark clusters provisioned on demand by Data Flow, because those clusters are fully managed by Data Flow.
dataflow-data-engineers
The users in this group have the privilege to manage and run Data Flow Applications and Runs for their data engineering jobs, for example, running Extract, Transform, Load (ETL) jobs in Data Flow's on-demand serverless Spark clusters. The users in this group don't have, nor need, administration access to the Spark clusters provisioned on demand by Data Flow, because those clusters are fully managed by Data Flow.
dataflow-sql-users
The users in this group have the privilege to run interactive SQL queries by connecting to Data Flow Interactive SQL clusters over JDBC or ODBC.
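If you prefer to script this setup rather than use the Console, the sketch below creates the three groups with the OCI Python SDK. It's a minimal sketch, assuming a configured ~/.oci/config profile; the tenancy OCID from that profile is used as the compartment for the groups, and the descriptions are illustrative.

import oci

# Load the default profile from ~/.oci/config; IAM groups live in the
# tenancy (root compartment), so its OCID is used below.
config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)
tenancy_id = config["tenancy"]

# The three groups recommended above; descriptions are illustrative.
groups = {
    "dataflow-admins": "Data Flow administrators",
    "dataflow-data-engineers": "Manage and run Data Flow Applications and Runs",
    "dataflow-sql-users": "Run interactive SQL against Data Flow SQL clusters",
}

for name, description in groups.items():
    group = identity.create_group(
        oci.identity.models.CreateGroupDetails(
            compartment_id=tenancy_id,
            name=name,
            description=description,
        )
    ).data
    print("Created group:", group.name, group.id)

Users can then be added to each group with identity.add_user_to_group(), passing an AddUserToGroupDetails object that carries the user and group OCIDs.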
Setting Up the Compartment and Object Storage Buckets
Follow these steps to create a compartment and Object Storage buckets for Data Flow.
Data Flow expects four specific storage buckets in Object Storage. We recommend that you create a compartment dedicated to Data Flow, named dataflow-compartment, in which to organize and isolate your cloud resources. For more information on compartments, see the IAM documentation.
Create the following storage buckets in Object Storage under the dataflow-compartment compartment (a scripted sketch follows the list):
dataflow-logs
dataflow-warehouse
managed-table-bucket
external-table-bucket
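As a scripted alternative to the Console, the sketch below creates the dataflow-compartment compartment and the four buckets with the OCI Python SDK. It's a minimal sketch, assuming a configured ~/.oci/config profile; the parent compartment (the tenancy root here), the crude wait, and the Standard storage tier are assumptions you may need to adjust.

import time
import oci

config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)
object_storage = oci.object_storage.ObjectStorageClient(config)

# Create the compartment dedicated to Data Flow under the tenancy root.
compartment = identity.create_compartment(
    oci.identity.models.CreateCompartmentDetails(
        compartment_id=config["tenancy"],   # parent compartment (assumption: tenancy root)
        name="dataflow-compartment",
        description="Resources for Data Flow",
    )
).data

# Crude wait: a newly created compartment can take a few seconds to become usable.
time.sleep(10)

# Every Object Storage request needs the tenancy's namespace.
namespace = object_storage.get_namespace().data

# The four buckets Data Flow expects, all in the Standard storage tier.
for bucket_name in ["dataflow-logs", "dataflow-warehouse",
                    "managed-table-bucket", "external-table-bucket"]:
    object_storage.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name=bucket_name,
            compartment_id=compartment.id,
            storage_tier="Standard",
        ),
    )
    print(f"Created bucket oci://{bucket_name}@{namespace}/")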
dataflow-logs
Data Flow requires a bucket to store the logs
(both stdout and stderr) for every Application run. Create it as a standard
storage tier bucket. The location of the bucket must follow the pattern:
oci://dataflow-logs@<your_object_store_namespace>/.
dataflow-warehouse
Data Flow requires a data warehouse for Spark SQL
applications. Create it as a standard storage tier bucket. The location of the
warehouse must follow the pattern:
oci://dataflow-warehouse@<your_object_store_namespace>/.
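To show where these two buckets come into play, the sketch below creates a Data Flow Application that points its logs and Spark SQL warehouse at them, using the OCI Python SDK. It's a minimal sketch: the compartment OCID, namespace, shapes, Spark version, and the dataflow-code bucket holding the script are placeholder assumptions, not values from this guide.

import oci

config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

# Placeholder values -- substitute your own compartment OCID, namespace, and script.
compartment_id = "ocid1.compartment.oc1..example"
namespace = "<your_object_store_namespace>"

app = data_flow.create_application(
    oci.data_flow.models.CreateApplicationDetails(
        compartment_id=compartment_id,
        display_name="example-etl",
        language="PYTHON",
        spark_version="3.2.1",
        file_uri=f"oci://dataflow-code@{namespace}/example_etl.py",  # hypothetical script location
        driver_shape="VM.Standard2.1",
        executor_shape="VM.Standard2.1",
        num_executors=1,
        logs_bucket_uri=f"oci://dataflow-logs@{namespace}/",
        warehouse_bucket_uri=f"oci://dataflow-warehouse@{namespace}/",
    )
).data
print("Created Application:", app.id)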
managed-table-bucket and external-table-bucket
For unstructured and semistructured data assets in Object Storage, Data Flow requires a metastore to securely store and retrieve
schema definitions. Data Catalog Metastore
provides a Hive-compatible Metastore as a persistent external metadata repository
shared across many OCI services.
Before creating a metastore in Data Catalog, you
must create two buckets in Object Storage to
contain the managed and external tables. We recommend that you name those buckets
managed-table-bucket and
external-table-bucket.
The managed-table-bucket is used for resources related to managed tables in Data Catalog's Hive-compatible Metastore, where the Metastore manages the table objects.
The external-table-bucket is used for resources related to external tables in Data Catalog's Hive-compatible Metastore, where you manage the table objects.
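Once the two buckets exist, the metastore itself can be created in Data Catalog. The sketch below does this with the OCI Python SDK; it's a minimal sketch, and the compartment OCID, namespace, metastore name, and the exact CreateMetastoreDetails field names are assumptions to verify against the current SDK.

import oci

config = oci.config.from_file()
data_catalog = oci.data_catalog.DataCatalogClient(config)

# Placeholder values -- substitute your own compartment OCID and namespace.
compartment_id = "ocid1.compartment.oc1..example"
namespace = "<your_object_store_namespace>"

# Point the metastore's default locations at the two buckets created above.
response = data_catalog.create_metastore(
    oci.data_catalog.models.CreateMetastoreDetails(
        display_name="dataflow-metastore",   # hypothetical name
        compartment_id=compartment_id,
        default_managed_table_location=f"oci://managed-table-bucket@{namespace}/",
        default_external_table_location=f"oci://external-table-bucket@{namespace}/",
    )
)

# Metastore creation is asynchronous; track it through the returned work request.
print("Work request:", response.headers.get("opc-work-request-id"))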