The Data Flow Support feature in ML Pipelines lets
users integrate Data Flow Applications as steps within a
pipeline.
With this new functionality, users can orchestrate the runs of Data Flow
Applications (Apache Spark as a Service) alongside other steps in an ML Pipeline,
streamlining large-scale data processing tasks.
When a pipeline containing a Data Flow step is run, it automatically creates and manages a new run of the Data Flow Application associated with that step. The Data Flow run is treated the same as any other step in the pipeline. When the Data Flow run completes successfully, the pipeline continues, starting subsequent steps as part of its orchestration.
Using Data Flow Applications in ML Pipelines is
straightforward:
1. Add a Data Flow Step
Select the Data Flow step type in your ML
Pipeline.
2. Select a Data Flow Application
Select the Data Flow application you want to run as a step and configure options such as cluster size and environment variables. (A sketch for looking up applications programmatically follows these steps.)
3. Run the Pipeline
Start a run of the pipeline. When the Data Flow step is reached, the associated application runs. When it completes, the results are reflected in the step run, and the pipeline proceeds to the next steps.
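As a companion to step 2, the following minimal sketch, assuming the OCI Python SDK and a valid local configuration file, lists the Data Flow applications in a compartment so you can find the name or OCID of the application to select; the compartment OCID is a placeholder.

# Minimal sketch: list Data Flow applications with the OCI Python SDK.
# Assumes a valid ~/.oci/config profile; "<compartment-ocid>" is a placeholder.
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

for app in df_client.list_applications(compartment_id="<compartment-ocid>").data:
    # Each summary carries the display name and OCID you would select in the console.
    print(app.display_name, app.id)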
This integration simplifies data scientists' workflows by enabling them to handle large
datasets efficiently within the same pipeline, leveraging the scalable compute power of OCI
Data Flow, while maintaining automation through ML
Pipelines.
Policies
Include the following policies for Data Flow integration with pipelines:
Data Flow and Pipelines Integration.
Pipeline Run Access to OCI Services.
(Optional) Custom Networking policies, required only if using custom networking.
See Data Flow Policies for all the prerequisites required to use Data Flow.
Note
When a Data Flow run is triggered by a Pipeline
run, it inherits the resource principal datasciencepipelinerun. Therefore,
granting privileges to datasciencepipelinerun also grants privileges to the
code running inside the Data Flow run started by the
Pipeline run.
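For illustration only, policy statements granting the pipeline run's resource principal access to Data Flow could look like the following sketch; the compartment name is a placeholder, and the authoritative statements are listed in Data Flow Policies.

allow any-user to manage dataflow-run in compartment <compartment-name> where ALL { request.principal.type = 'datasciencepipelinerun' }
allow any-user to read dataflow-application in compartment <compartment-name> where ALL { request.principal.type = 'datasciencepipelinerun' }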
When defining pipeline steps to use Data Flow while Creating a Pipeline, select From Data Flow applications.
Under Select a Data Flow application, select the Data Flow application to use. If the Data Flow application is in a different compartment, select Change compartment.
(Optional) In the Data Flow configuration section, select Configure.
In the Configure your Data Flow configuration panel:
Select the Driver shape and Executor shape.
Enter the Number of executors.
(Optional) Select the log bucket.
(Optional) Add Spark configuration properties.
(Optional) Specify the Warehouse bucket URI.
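As an illustration only (these values are placeholders, not recommendations), a small test configuration might use a flexible VM shape for both the driver and executor, two executors, and a Spark property such as spark.sql.shuffle.partitions tuned to the size of the data; the log bucket and warehouse bucket URI can be left unset for a first run.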
Quick Start Guide
This is a step-by-step guide for creating a pipeline that runs a Data Flow application.
1. Follow the Data Flow Policies documentation. It details the initial setup required before you can use Data Flow.
2. Upload the following sample Python application, hello-world.py, to a bucket (an SDK upload sketch follows the listing):
print("======Start======")
import os
from pyspark.sql import SparkSession
def in_dataflow():
if os.environ.get("HOME") == "/home/dataflow":
return True
return False
def get_spark():
if in_dataflow():
return SparkSession.builder.appName("hello").getOrCreate()
else:
return SparkSession.builder.appName("LocalSparkSession").master("local[*]").getOrCreate()
print("======Opening Session======")
spark = get_spark()
print("======Application Created======")
# Test the connection by creating a simple DataFrame
df = spark.createDataFrame([("Hello",), ("World",)], ["word"])
print("======Data Frame Created======")
# Show the DataFrame's content
df.show()
print("======Done======")
3. Follow the steps in Data Flow Policies to create a Data Flow application using the Python application from step 2 (an illustrative SDK sketch follows).
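The console flow is the reference path; purely as an illustrative sketch, the same application could be created with the OCI Python SDK roughly as follows. All OCIDs, the bucket and namespace, the shapes, and the Spark version are placeholders that must match what is available in your tenancy.

# Minimal sketch: create a Data Flow application that runs hello-world.py.
# OCIDs, bucket/namespace, shapes, and Spark version are placeholders.
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="<compartment-ocid>",
    display_name="hello-world",
    language="PYTHON",
    spark_version="3.2.1",
    file_uri="oci://<bucket-name>@<namespace>/hello-world.py",
    driver_shape="VM.Standard2.1",
    executor_shape="VM.Standard2.1",
    num_executors=1,
)
app = df_client.create_application(details).data
print(app.id)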
4. Test the Data Flow application (an SDK alternative is sketched after these substeps).
On the Application's details page, select Run.
In the Run application panel, apply arguments and parameters, update the resource configuration, or add supported Spark properties, as needed.
Select Run to run the Application.
(Optional) Check the logs. Go to the run details and select Logs.
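As the SDK alternative mentioned in step 4, the following minimal sketch starts a run of the application programmatically; the application and compartment OCIDs are placeholders.

# Minimal sketch: start a run of the Data Flow application with the OCI Python SDK.
# "<application-ocid>" and "<compartment-ocid>" are placeholders.
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    application_id="<application-ocid>",
    compartment_id="<compartment-ocid>",
    display_name="hello-world-test-run",
)
run = df_client.create_run(run_details).data
print(run.id, run.lifecycle_state)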
5. Create the pipeline.
Before creating a pipeline, ensure that you have policies that let the
pipeline run resource use Data Flow and
access the bucket with your hello-world application. For more information,
see Pipeline Policies.
Create a pipeline with a step that uses your hello-world Data Flow application:
Create a pipeline with a name such as Data Flow Step
Demo.
Select Add pipeline
steps.
Give the step a name, for example, Step
1.
To use your Data Flow application, select From Data Flow
applications.
Select the Data Flow
application.
If the Data Flow application
is in a different compartment, select Change
compartment.