Migrating Spark Applications to Oracle Cloud Infrastructure Data Flow

This tutorial shows you how to migrate your existing Spark applications to Oracle Cloud Infrastructure Data Flow.

Before You Begin

To complete this tutorial successfully, you must have set up your tenancy (see Set Up Your Tenancy) and be able to access Data Flow (see Access Data Flow).

Allowed Spark Variables

Data Flow automatically configures many Spark variables based on factors such as the infrastructure you choose for a run. To ensure proper operation, some Spark variables can't be set or overridden when running jobs. For more information, see Supported Spark Properties in Data Flow.
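If you want to confirm what was configured for a particular run, you can read the effective Spark properties from the session at runtime. This is a minimal Scala sketch under that assumption; the object name is only illustrative:

import org.apache.spark.sql.SparkSession

object ShowEffectiveConfig {
  def main(args: Array[String]): Unit = {
    // Obtain the session that Data Flow created for this run.
    val spark = SparkSession.builder().getOrCreate()

    // Print every effective Spark property so you can see what the
    // service configured (for example, executor memory and core counts).
    spark.conf.getAll.toSeq.sortBy(_._1).foreach { case (key, value) =>
      println(s"$key=$value")
    }
  }
}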

Compatibility Limitations

You can't set environment variables in Data Flow jobs. Instead, you can pass the values as command-line arguments and add them to the environment within the application, as in the sketch below.
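The following is a minimal Scala sketch of this pattern. It reads "--key value" pairs from the application arguments and uses them directly in place of environment variables; the argument names and the path shown are only illustrative:

import org.apache.spark.sql.SparkSession

object ArgsInsteadOfEnv {
  def main(args: Array[String]): Unit = {
    // Parse "--key value" pairs passed as application arguments,
    // for example: --input-path oci://bucket@namespace/input
    val params = args.grouped(2).collect {
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value
    }.toMap

    val spark = SparkSession.builder().getOrCreate()

    // Use the parsed values where you would otherwise have read
    // environment variables.
    val inputPath = params("input-path")
    spark.read.text(inputPath).show(10)
  }
}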

1. Supported Ways to Access the Spark Session

Data Flow creates the Spark session before your Spark application runs. It ensures that your application takes full advantage of all the hardware you configured your run to use.
Important

Don't try to create a Spark session within your application. A session that you create yourself doesn't use the hardware you provisioned for the run, and other unpredictable behavior might result.

The following are supported ways of accessing your Spark session within applications:
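For example, in Scala, the usual pattern is to obtain the session that Data Flow has already created by calling getOrCreate rather than constructing a new session. This is a minimal sketch, and the application name shown is only illustrative:

import org.apache.spark.sql.SparkSession

object GetSessionExample {
  def main(args: Array[String]): Unit = {
    // getOrCreate returns the session that Data Flow created for this run;
    // it does not build a second, unmanaged session.
    val spark = SparkSession
      .builder()
      .appName("example-application") // illustrative name
      .getOrCreate()

    // Use the session as usual.
    spark.range(10).show()
  }
}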

2. Managing Java Dependencies for Apache Spark Applications in Data Flow

In Data Flow, when you run Java or Scala applications that rely on JARs not included with Spark, you must create uber (fat) JARs that bundle the dependencies your code needs. The Data Flow runtime already includes several popular open source libraries that your Java or Scala applications might also use. To avoid runtime conflicts between the library versions that Data Flow provides and the versions in your application, use a process called shading. You might need to recompile your Java or Scala applications with shading rules for them to run correctly in Data Flow; one way to express such rules is sketched after the note below.
Note

Shading is not needed if you are using Spark 3.5.0 or 3.2.1.
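If you build with sbt, one way to express shading rules is with the sbt-assembly plugin. This is a minimal build.sbt sketch, and the relocated package and Spark version shown are examples only, not the actual rules Data Flow requires:

// build.sbt -- minimal sketch of shading with the sbt-assembly plugin.
// The relocated package below is an example only; apply rules that
// match the conflicting libraries in your application.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1")
    .inAll
)

// Exclude Spark itself from the fat JAR; Data Flow provides it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"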

What's Next

Now you can start migrating your Spark applications to run in Data Flow.