Data Flow Integration

The Data Flow Support feature in ML Pipelines lets users integrate Data Flow Applications as steps within a pipeline.

With this new functionality, users can orchestrate the runs of Data Flow Applications (Apache Spark as a Service) alongside other steps in an ML Pipeline, streamlining large-scale data processing tasks.

When a pipeline containing a Data Flow step is run, it automatically creates and manages a new run of the Data Flow Application associated with that step. The Data Flow run is treated the same as any other step in the pipeline. When the Data Flow run completes successfully, the pipeline continues its run, starting later steps as part of the pipeline's orchestration.
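
Because the Data Flow run is managed as an ordinary step, its outcome is reflected in the pipeline run itself, so you can monitor the pipeline run rather than the Data Flow run directly. The following is a minimal sketch using the OCI Python SDK, assuming the SDK is installed and configured; the pipeline run OCID is a placeholder:

    import time

    import oci

    # Placeholder OCID for an existing pipeline run that contains a Data Flow step.
    PIPELINE_RUN_OCID = "ocid1.datasciencepipelinerun.oc1..<unique_id>"

    config = oci.config.from_file()  # Reads ~/.oci/config by default.
    ds_client = oci.data_science.DataScienceClient(config)

    # Poll until the pipeline run finishes; the Data Flow step's success or
    # failure is rolled up into the pipeline run's lifecycle state.
    while True:
        run = ds_client.get_pipeline_run(PIPELINE_RUN_OCID).data
        print(run.display_name, run.lifecycle_state)
        if run.lifecycle_state in ("SUCCEEDED", "FAILED", "CANCELED"):
            break
        time.sleep(60)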

Using Data Flow Applications in ML Pipelines is straightforward:

  1. Add a Data Flow step
    Select the Data Flow step type in your ML Pipeline.
  2. Select a Data Flow application
    Select the Data Flow application you want to run as a step and configure options such as cluster size and environment variables.
  3. Run the pipeline
    Start a run of the pipeline. When the Data Flow step is reached, the associated application runs. When it completes, the results are reflected in the step run and the pipeline proceeds to the next steps.

This integration simplifies data scientists' workflows: they can process large datasets efficiently within a single pipeline, using the scalable compute of OCI Data Flow while keeping the automation that ML Pipelines provides.
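
The three console steps above can also be driven programmatically. The following is a minimal sketch using the OCI Python SDK; it assumes a pipeline containing a Data Flow step already exists, and the project, compartment, and pipeline OCIDs are placeholders:

    import oci

    config = oci.config.from_file()
    ds_client = oci.data_science.DataScienceClient(config)

    # Placeholder OCIDs; replace with values from your tenancy.
    details = oci.data_science.models.CreatePipelineRunDetails(
        project_id="ocid1.datascienceproject.oc1..<unique_id>",
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        pipeline_id="ocid1.datasciencepipeline.oc1..<unique_id>",
        display_name="dataflow-step-demo-run",
    )

    # Starting the pipeline run is enough: when the Data Flow step is reached,
    # the associated Data Flow run is created and managed automatically.
    pipeline_run = ds_client.create_pipeline_run(details).data
    print(pipeline_run.id, pipeline_run.lifecycle_state)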

Policies

Include the following Policies for Data Flow integration with pipelines:
  • Data Flow and Pipelines Integration.
  • Pipeline Run Access to OCI Services.
  • (Optional) Custom Networking policies, required only if you use custom networking.
See Data Flow Policies for all the prerequisites required to use Data Flow.
Note

When a Data Flow run is triggered by a Pipeline run, it inherits the resource principal datasciencepipelinerun. Therefore, granting privileges to datasciencepipelinerun also grants privileges to the code running inside the Data Flow run started by the Pipeline run.
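
The pages linked above contain the authoritative policy statements. For illustration only, statements of the following general shape grant the datasciencepipelinerun resource principal access to Data Flow; take the exact verbs, resource types, and compartment from the linked documentation:

    allow any-user to read dataflow-application in compartment <compartment-name>
        where ALL { request.principal.type = 'datasciencepipelinerun' }
    allow any-user to manage dataflow-run in compartment <compartment-name>
        where ALL { request.principal.type = 'datasciencepipelinerun' }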

Configuring Data Flow with Pipelines

Ensure you have the appropriate Policies applied.

  1. When defining a pipeline step that uses Data Flow while Creating a Pipeline, select From Data Flow applications.
  2. Under Select a dataflow application, select the Data Flow application to use.

    If the Data Flow application is in a different compartment, select Change compartment.

  3. (Optional) In the Data Flow configuration section, select Configure.

    In the Configure your Data Flow configuration panel:

    1. Select the Driver shape and Executor shape.
    2. Enter the Number of executors.
    3. (Optional) Select the log bucket.
    4. (Optional) Add Spark configuration properties.
    5. (Optional) Specify the Warehouse bucket URI.
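
The same settings also exist on the Data Flow application itself; the values entered here apply to the Data Flow runs that the pipeline step starts. For reference, a minimal sketch of how those settings appear when creating a Data Flow application with the OCI Python SDK (bucket names, namespace, OCIDs, shapes, and the Spark version are placeholders or examples):

    import oci

    config = oci.config.from_file()
    df_client = oci.data_flow.DataFlowClient(config)

    details = oci.data_flow.models.CreateApplicationDetails(
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        display_name="hello-world",
        language="PYTHON",
        spark_version="3.2.1",                                  # Example Spark version.
        file_uri="oci://<bucket>@<namespace>/hello-world.py",
        driver_shape="VM.Standard2.1",                          # Driver shape.
        executor_shape="VM.Standard2.1",                        # Executor shape.
        num_executors=1,                                        # Number of executors.
        logs_bucket_uri="oci://<log-bucket>@<namespace>/",      # Optional log bucket.
        warehouse_bucket_uri="oci://<warehouse>@<namespace>/",  # Optional warehouse bucket URI.
        configuration={"spark.driver.memory": "2g"},            # Optional Spark configuration properties.
    )
    app = df_client.create_application(details).data
    print(app.id)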

Quick Start Guide

This is a step-by-step guide for creating a pipeline that includes a Data Flow step.

  1. Follow the Data Flow Policies documentation. It details the initial setup required before you can use Data Flow.
  2. Upload the following sample Python application, hello-world.py, to a bucket (a scripted upload using the OCI Python SDK is sketched after this guide):
    print("======Start======")
    import os
    from pyspark.sql import SparkSession
     
    # Heuristic: inside a Data Flow run, HOME is set to /home/dataflow.
    def in_dataflow():
        if os.environ.get("HOME") == "/home/dataflow":
            return True
        return False
     
    # In Data Flow, getOrCreate() picks up the session the service configures;
    # locally, fall back to a local[*] master for testing.
    def get_spark():
        if in_dataflow():
            return SparkSession.builder.appName("hello").getOrCreate()
        else:
            return SparkSession.builder.appName("LocalSparkSession").master("local[*]").getOrCreate()
     
    print("======Opening Session======")
    spark = get_spark()
    print("======Application Created======")
    # Test the connection by creating a simple DataFrame
    df = spark.createDataFrame([("Hello",), ("World",)], ["word"])
    print("======Data Frame Created======")
    # Show the DataFrame's content
    df.show()
    print("======Done======")
  3. Follow the steps in Data Flow Policies to create a Data Flow application using the Python application in step 2.
  4. Test the Data Flow application (a scripted test run using the OCI Python SDK is sketched after this guide).
    1. On the application's details page, select Run.
    2. In the Run application panel, apply arguments and parameters, update the resource configuration, or add supported Spark properties, as needed.
    3. Select Run to run the application.
    4. (Optional) Check the logs: go to the run details and select Logs.
  5. Create the pipeline.

    Before creating a pipeline, ensure that you have policies that let the pipeline run resource use Data Flow and access the bucket with your hello-world application. For more information, see Pipeline Policies.

    1. Create a pipeline with a step that uses your hello-world Data Flow application:
      1. Create a pipeline with a name such as Data Flow Step Demo.
      2. Select Add pipeline steps.
      3. Give the step a name, for example, Step 1.
      4. To use your Data Flow application, select From Data Flow applications.
      5. Select the Data Flow application.
      6. If the Data Flow application is in a different compartment, select Change compartment.
      7. Select Save to save the step.
      8. (Optional) Define logging.
      9. Select Create to create the pipeline.
    2. Enable pipeline logs:
      1. Go to the pipeline details.
      2. Select the Logs resource.
      3. Enable logs.
    3. Run the pipeline:
      1. Go to the pipeline details.
      2. Select the Pipeline run resource.
      3. Select Start a pipeline run.
      4. Select Start.
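
For reference, the upload in step 2 and the test run in step 4 can also be scripted with the OCI Python SDK. This is a minimal sketch that assumes the bucket already exists and the Data Flow application has been created; the bucket name, namespace, and OCIDs are placeholders:

    import oci

    config = oci.config.from_file()

    # Step 2: upload hello-world.py to an existing Object Storage bucket.
    os_client = oci.object_storage.ObjectStorageClient(config)
    namespace = os_client.get_namespace().data
    with open("hello-world.py", "rb") as f:
        os_client.put_object(namespace, "<bucket-name>", "hello-world.py", f)

    # Step 4: start a test run of the Data Flow application created in step 3.
    df_client = oci.data_flow.DataFlowClient(config)
    run_details = oci.data_flow.models.CreateRunDetails(
        application_id="ocid1.dataflowapplication.oc1..<unique_id>",
        compartment_id="ocid1.compartment.oc1..<unique_id>",
        display_name="hello-world-test-run",
    )
    run = df_client.create_run(run_details).data
    print(run.id, run.lifecycle_state)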