A data flow is a logical diagram representing the flow of data from source data assets, such as a database or flat file, to target data assets, such as a data lake or data warehouse.
The flow of data from source to target can undergo a series of transformations to aggregate, cleanse, and shape the data. Data engineers and ETL developers can then analyze the data or gather insights from it to make impactful business decisions.
In this tutorial, you:
Create a project where you can save the data flow.
Add source operators and select the data entities to use in the data flow.
Use shaping operators and apply transformations.
Identify the target data asset for loading the data.
Before You Begin
To ingest and transform data using a data flow, you must have the following:
The PAR_MANAGE permission enabled on the staging bucket. For example, the following policy grants the permission to the workspace:
allow any-user to manage buckets in compartment <compartment-name> where ALL {request.principal.type = 'disworkspace', request.principal.id = '<workspace-ocid>', request.permission = 'PAR_MANAGE'}
1. Creating a Project and a Data Flow
In Oracle Cloud Infrastructure Data Integration, data flows and tasks can only be created in a project or folder.
To create a project and a data flow:
On the workspace Home page, click Projects.
You can also click Open tab (plus icon) in the tab bar, and select Projects.
On the Projects page, click Create project.
On the Create project page, enter DI_Lab for Name, and then click Create.
After you create a project, you can create a data flow to ingest data from two source files containing customer (CUSTOMERS.json) and order (REVENUE.csv) information.
On the DI_Lab project details page, click Data flows in the submenu on the left side.
If you're on the Projects page, click DI_Lab to open the project details page.
Click Create data flow.
The data flow designer opens in a new tab.
In the Properties panel, for Name, enter Load Customers and Revenue Data.
The Identifier value is automatically generated based on the value you enter for the data flow name. You can change the generated value, but after you save the data flow, you aren't allowed to update the identifier.
Click Create.
The designer remains open for you to continue editing.
2. Adding Source Operators
You add source operators to identify the data entities to use for the data flow. A data entity represents a collection of data, such as a database table or a file; in this tutorial, the source data entities are files in an Object Storage bucket.
From the Operators panel, drop a Source operator onto the canvas.
On the canvas, select SOURCE_1, if it isn't already selected.
The Properties panel displays the details for the operator in focus.
In the Details tab of the Properties panel, click Select next to each of the following options to make the selections:
For Data asset, select Data_Lake.
For Connection, select Default connection.
For Schema, select the compartment and then the bucket. For the purposes of this tutorial, Object Storage serves as the source data asset, which is why you select the bucket here.
For Data entity, click Browse by name and then select CUSTOMERS.json. For File type, select JSON.
When you complete the selections for SOURCE_1, the operator name becomes CUSTOMERS_JSON, reflecting the data entity selection. In the Identifier field, rename the source operator to CUSTOMERS.
Repeat steps 1 through 3 to add a second Source operator with the following values:
For Data asset, select Data_Lake.
For Connection, select Default connection.
For Schema, select the compartment and then the bucket, as you did for the first source operator.
For Data entity, select REVENUE.csv, and then select CSV for the file type. Accept the default values for the remaining items.
When you complete the selections for the second source operator, the operator name becomes REVENUE_CSV, reflecting the data entity selection. In the Identifier field, rename the source operator to REVENUE.
(Optional) Click Assign parameter to keep the source details from being bound to the compiled code when you publish the data flow.
(Optional) In the Attributes tab, you can view the data entity's attributes and apply exclude or rename rules to the attributes from their respective Actions menu.
You can also use the filter icon on the Name or Type column to apply one or more filters on the attributes to be excluded.
(Optional) In the Data tab, you can view a sampling of data from the source data entity and apply transformations to the data or select attributes to view a data profile.
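To picture what the two source operators read, here is a minimal local sketch using pandas. This is an illustration only, not part of the tutorial: Data Integration reads the files from the Object Storage bucket itself, and the lines=True orientation for the JSON file is an assumption.

import pandas as pd

# The two sample files configured as sources above; adjust the JSON
# orientation if CUSTOMERS.json is a single array rather than one record per line.
customers = pd.read_json("CUSTOMERS.json", lines=True)
revenue = pd.read_csv("REVENUE.csv")

print(customers.head())  # roughly what the Data tab preview shows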
3. Transforming Data
Using Data Xplorer, you can explore a data sample, review profiling metadata, and apply transformations in the Data tab of the Properties panel. Expression operators are added to the canvas for each transformation applied.
In the Properties panel for FILTER_2, select the Data tab.
All data rows and attributes are displayed. You can use the vertical scrollbar to scroll the rows, and the horizontal scrollbar to scroll the attributes.
In the Search by pattern field, enter STATE*.
The attributes in the table are filtered; only those that match the pattern are displayed.
Click the transformations menu for FILTER_2.CUSTOMERS_JSON.STATE_PROVINCE, and then select Change case.
In the Change case dialog, from the Type menu, select Upper.
Don't select the Keep source attributes checkbox.
Leave the Name as-is.
Click Apply.
An expression operator is added to the data flow. In the Properties panel, the Details tab is now in focus, showing the details of the expression operator.
In the Expressions table, you can see the generated expression, UPPER(EXPRESSION_1.CUSTOMERS_JSON.STATE_PROVINCE).
In the Properties panel for the expression operator, change the name in the Identifier field to CHANGE_CASE.
Click the Data tab, and then use the horizontal scrollbar to scroll to the end.
CHANGE_CASE.STATE_PROVINCE is added to the end of the dataset. You can preview the transformed data for CHANGE_CASE.STATE_PROVINCE in the Data tab.
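For intuition, the Change case transformation is equivalent to the following pandas sketch, reusing the hypothetical customers DataFrame from the earlier snippet:

# Equivalent of the STATE* pattern search: list only the matching columns.
print(customers.filter(regex="^STATE").head())

# Upper-case the STATE_PROVINCE values. Because Keep source attributes was
# left cleared, the original column is replaced rather than kept alongside.
customers["STATE_PROVINCE"] = customers["STATE_PROVINCE"].str.upper()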
From the Operators panel, drop the Expression operator onto the canvas, placing it after CHANGE_CASE.
Connect CHANGE_CASE to the new EXPRESSION_1 operator.
In the Properties panel for EXPRESSION_1, select Add expression in the Expressions table.
In the Add expression panel:
In the Identifier field, rename the expression to FULLNAME.
Keep Data type as VARCHAR.
Set Length to 200.
Under Expression builder, switch from the Incoming list to the Functions list.
In the search by name field, enter CON. Then find CONCAT under String.
Enter CONCAT(CONCAT(EXPRESSION_1.CUSTOMERS_JSON.FIRST_NAME, ' '), EXPRESSION_1.CUSTOMERS_JSON.LAST_NAME)
You can also highlight a function's placeholders and then double-click or drop attributes from the Incoming list to create an expression.
Click Add.
The CONCAT expression is now listed in the Expressions table for the EXPRESSION_1 operator. You can add as many expressions as you want.
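The FULLNAME expression amounts to simple string concatenation. As a local pandas sketch, again assuming the customers DataFrame from the earlier snippets:

# CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME): first name, a space, last name.
customers["FULLNAME"] = customers["FIRST_NAME"] + " " + customers["LAST_NAME"]

# Mirror the declared VARCHAR(200) length by truncating, purely for parity.
customers["FULLNAME"] = customers["FULLNAME"].str.slice(0, 200)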
To save the data flow and continue editing, click Save.
4. Joining Data
After you apply filters and transformations, you can join the source data entities using a unique customer identifier, and then load the data into a target data entity.
To join the data from EXPRESSION_1 with the data from FILTER_1, drop a Join operator from the Operators panel onto the canvas, placing it after EXPRESSION_1 and FILTER_1.
Connect EXPRESSION_1 to JOIN_1. Then connect FILTER_1 to JOIN_1.
With JOIN_1 selected, in the Details tab of the Properties panel, click Create next to Join condition.
In the Create join condition panel, enter CUST in the search by name field.
You want to join the entities using CUST_ID and CUST_KEY.
In the editor, enter JOIN_1_1.CUSTOMERS_JSON.CUST_ID=JOIN_1_2.REVENUE_CSV.CUST_KEY.
Click Create.
To save the data flow and continue editing, click Save.
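The join condition above corresponds to an inner join on the customer identifier. A local pandas sketch, assuming customers holds the transformed customer data and revenue the (filtered) revenue data from the earlier snippets:

# JOIN_1: match customer rows to revenue rows where CUST_ID equals CUST_KEY.
joined = customers.merge(revenue, left_on="CUST_ID", right_on="CUST_KEY", how="inner")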
5. Adding a Target Operator
From the Operators panel, drop a Target operator onto the canvas.
Connect JOIN_1 to TARGET_1.
Select TARGET_1 on the canvas, if it isn't already selected.
In the Details tab of the Properties panel, don't select the Create new data entity checkbox.
Next, click Select for each of the following options to make the selections for a data entity:
For Data asset, select Data_Warehouse.
For Connection, select Default connection.
For Schema, select BETA.
For Data entity, select CUSTOMERS_TARGET.
For Staging location, select the Data Lake data asset, its default connection, and the compartment. Then for Schema, select the Object Storage bucket that you used to import the sample data.
Leave the value for Integration strategy as Insert.
To review the Attributes mapping, click the Map tab.
By default, all attributes are mapped by name.
For example, CUST_ID from JOIN_1 maps to CUST_ID in the target data entity.
To manually map attributes that aren't yet mapped, select the All menu in the Target attributes table, and then select Attributes not mapped.
You can do the same in the Source attributes table (for the incoming fields).
Now drag FULLNAME from the Source attributes table onto FULL_NAME in the Target attributes table.
In the Target attributes table, select Attributes not mapped, and then select All.
All attributes are now mapped.
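Conceptually, the mapping and the Insert strategy amount to renaming columns to match the target and appending rows. A pandas sketch with a placeholder SQLAlchemy connection string (the real load goes through the Data_Warehouse asset and the staging bucket, not through this code):

from sqlalchemy import create_engine

# Placeholder connection; user, password, host, and service name are hypothetical.
engine = create_engine("oracle+oracledb://user:password@host:1521/?service_name=service")

# Map by name, with the one manual mapping: FULLNAME -> FULL_NAME.
target = joined.rename(columns={"FULLNAME": "FULL_NAME"})

# Insert strategy: append rows to the existing BETA.CUSTOMERS_TARGET table.
target.to_sql("CUSTOMERS_TARGET", engine, schema="BETA", if_exists="append", index=False)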
(Optional) Click Actions to review the various mapping options that are available.
(Optional) Click View rules to view the applied Rules.
To save the data flow and exit the designer, click Save and close.