Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to The Cloud

Oracle Cloud Infrastructure Data Flow is a fully managed Apache Spark cloud service. It lets you run Spark applications at any scale, and with minimal administrative or set up work. Data Flow is ideal for scheduling reliable long-running batch processing jobs.

You can develop Spark applications without being connected to the cloud. You can quickly develop, test, and iterate them on your laptop computer. When they're ready, you can deploy them to Data Flow without any need to reconfigure them, make code changes, or apply deployment profiles.

Using Spark 3.5.0 or 3.2.1 gives several improvements over earlier versions:
  • Most of the source code and libraries used to run Data Flow are hidden. You no longer need to match the Data Flow SDK versions, and no longer have third-party dependency conflicts with Data Flow.
  • The SDKs are compatible with Spark, so you no longer need to move conflicting third-party dependencies, letting you separate your application from your libraries for faster, less complicated, smaller, and more flexible builds.
  • The new template pom.xml file downloads and builds a near identical copy of Data Flow on your local machine. You can run the step debugger on your local machine to detect and resolve problems before running your Application on Data Flow. You can compile and run against the exact same library versions that Data Flow runs. Oracle can quickly decide if your issue is a problem with Data Flow or your application code.

Before You Begin

Before you begin to develop your applications, you need the following set up and working:

  1. An Oracle Cloud login with the API Key capability enabled. Load your user under Identity / Users, and confirm you can create API Keys.

    User Information tab showing API Keys set to Yes.

  2. An API key registered and deployed to your local environment. See Register an API Key for more information.
  3. A working local installation of Apache Spark 2.4.4, 3.0.2, 3.2.1, or 3.5.0. You can confirm it by running spark-shell in the command line interface.
  4. Apache Maven installed. The instructions and examples use Maven to download the dependencies you need.

1. The Concepts of Developing Locally

Regardless of your development environment, there are three things you need to consider when you develop applications locally:
  1. Customize your local Spark installation with Oracle Cloud Infrastructure library files, so that it resembles Data Flow's runtime environment.
  2. Detect where your code is running.
  3. Configure the Oracle Cloud Infrastructure HDFS client appropriately.
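
The following sketch addresses points 2 and 3 of this list. It's a minimal example, not Data Flow's official detection mechanism: it assumes the convention from Oracle's sample code that HOME is /home/dataflow inside the Data Flow runtime, and it assumes your API key credentials are in the default OCI configuration file (~/.oci/config). The helper names in_dataflow and get_spark_session are illustrative.

# A minimal sketch: assumes HOME=/home/dataflow marks the Data Flow runtime and that
# local API-key credentials live in the default OCI configuration file (~/.oci/config).
import os

import oci
from pyspark import SparkConf
from pyspark.sql import SparkSession


def in_dataflow():
    # Assumption: the Data Flow runtime sets HOME to /home/dataflow.
    return os.environ.get("HOME") == "/home/dataflow"


def get_spark_session(app_name="LocalDevExample"):
    if in_dataflow():
        # On Data Flow the service provides credentials; no extra configuration is needed.
        return SparkSession.builder.appName(app_name).getOrCreate()

    # Running locally: configure the OCI HDFS client from the API key profile.
    oci_config = oci.config.from_file()
    conf = SparkConf()
    conf.set("fs.oci.client.auth.tenantId", oci_config["tenancy"])
    conf.set("fs.oci.client.auth.userId", oci_config["user"])
    conf.set("fs.oci.client.auth.fingerprint", oci_config["fingerprint"])
    conf.set("fs.oci.client.auth.pemfilepath", oci_config["key_file"])
    conf.set(
        "fs.oci.client.hostname",
        "https://objectstorage.{}.oraclecloud.com".format(oci_config["region"]),
    )
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()


spark = get_spark_session()

The fs.oci.* properties set here are the same ones that appear later as "Ignoring non-Spark config property" warnings when you run locally; those warnings are expected.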

2. Creating "Fat JARs" for Java Applications

Java and Scala applications usually need their dependencies bundled into a single JAR file, known as a "Fat JAR".

If you use Maven, you can do this using the Shade plugin. The following examples are from Maven pom.xml files. You can use them as a starting point for your project. When you build your application, the dependencies are automatically downloaded and inserted into your runtime environment.
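
As an illustration, a minimal shade-plugin configuration might look like the following sketch. The plugin version and the signature-file excludes are typical starting values, not settings required by Data Flow; adjust them for your project.

<!-- A minimal sketch of a maven-shade-plugin configuration; version and filters are illustrative. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <!-- Strip signature files that would otherwise invalidate the shaded JAR. -->
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>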

Note

If using Spark 3.5.0 or 3.2.1, this chapter doesn't apply. Instead, follow chapter 2. Managing Java Dependencies for Apache Spark Applications in Data Flow.

3. Testing Your Application Locally

Before deploying your application, you can test it locally to be sure it works. There are three techniques you can use; choose the one that works best for you. These examples assume that your application artifact is named application.jar (for Java) or application.py (for Python).

With Spark 3.5.0 and 3.2.1, there are some improvements over previous versions:
  • Data Flow hides most of the source code and libraries it uses to run, so Data Flow SDK versions no longer need matching and third-party dependency conflicts with Data Flow shouldn't occur.
  • Spark has been upgraded so that the OCI SDKs are now compatible with it. This means that conflicting third-party dependencies no longer need to be moved, so the application and its libraries can be separated for faster, simpler, smaller, and more flexible builds.
  • The new template pom.xml file downloads and builds an almost identical copy of Data Flow on a developer's local machine. This means that:
    • Developers can run the step debugger on their local machine to quickly detect and resolve problems before running on Data Flow.
    • Developers can compile and run against the exact same library versions that Data Flow runs. So the Data Flow team can quickly decide if an issue is a problem with Data Flow or the application code.

Method 1: Run from your IDE

If you developed in an IDE like Eclipse, you needn't do anything more than click Run and choose the appropriate main class.

When you run, it's normal to see Spark produce warning messages in the console, which let you know that Spark is being invoked.

Method 2: Run PySpark from the Command Line

In a command window, run:
python3 application.py
You see output similar to the following:
$ python3 example.py
Warning: Ignoring non-Spark config property: fs.oci.client.hostname
Warning: Ignoring non-Spark config property: fs.oci.client.auth.fingerprint
Warning: Ignoring non-Spark config property: fs.oci.client.auth.tenantId
Warning: Ignoring non-Spark config property: fs.oci.client.auth.pemfilepath
Warning: Ignoring non-Spark config property: fs.oci.client.auth.userId
20/08/01 06:52:00 WARN Utils: Your hostname resolves to a loopback address: 127.0.0.1; using 192.168.1.41 instead (on interface en0)
20/08/01 06:52:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/01 06:52:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Warnings about non-Spark configuration properties are normal if you configure the Oracle Cloud Infrastructure HDFS driver based on your configuration profile.
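
The script you run this way can be as small as the following sketch, which only creates a session and a tiny DataFrame; that's enough to confirm that your local Spark installation and configuration load cleanly. The file name and application name are illustrative.

# example.py: a minimal local smoke test; the names and data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LocalSmokeTest").getOrCreate()
print("Spark version:", spark.version)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()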

Method 3: Use Spark-Submit

The spark-submit utility is included with your Spark distribution. Use this method when, for example, a PySpark application requires extra JAR files.

Use this example command to run a Java application using spark-submit:
spark-submit --class example.Example example.jar
Tip

Because you need to provide the main class name to Data Flow, this code is a good way to confirm that you're using the correct class name. Remember that class names are case-sensitive.
In this example, you use spark-submit to run a PySpark application that requires Oracle JDBC JAR files:
spark-submit \
	--jars java/oraclepki-18.3.jar,java/ojdbc8-18.3.jar,java/osdt_cert-18.3.jar,java/ucp-18.3.jar,java/osdt_core-18.3.jar \
	example.py
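
For context, the kind of PySpark code that needs those Oracle JDBC JARs is a JDBC read such as the sketch below. The connection URL, driver class, credentials, and table name are placeholders, not values from this guide.

# A sketch of a JDBC read that depends on the Oracle JDBC driver JARs passed to spark-submit.
# The URL, table, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcExample").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "SCHEMA_NAME.TABLE_NAME")
    .option("user", "example_user")
    .option("password", "example_password")
    .load()
)
df.show()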

4. Deploy the Application

After you have developed your application, you can run it in Data Flow.
  1. Copy the application artifact (jar file, Python script, or SQL script) to Oracle Cloud Infrastructure Object Storage.
  2. If your Java application has dependencies not provided by Data Flow, remember to copy the assembly jar file.
  3. Create a Data Flow Application that references this artifact within Oracle Cloud Infrastructure Object Storage.

After step 3, you can run the Application as many times as you want. For more information, the Getting Started with Oracle Cloud Infrastructure Data Flow tutorial takes you through this process step by step.
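
If you prefer to script these steps, a sketch using the OCI Python SDK might look like the following. The bucket name, compartment OCID, shapes, and Spark version are placeholders you'd replace with your own values.

# A sketch of scripting the deployment with the OCI Python SDK; every identifier below
# (bucket, compartment OCID, shapes, Spark version) is a placeholder.
import oci

config = oci.config.from_file()

# 1. Copy the application artifact to Object Storage.
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data
with open("application.py", "rb") as artifact:
    object_storage.put_object(namespace, "my-dataflow-bucket", "application.py", artifact)

# 2. Create a Data Flow Application that references the uploaded artifact.
data_flow = oci.data_flow.DataFlowClient(config)
details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",
    display_name="my-local-dev-app",
    driver_shape="VM.Standard2.1",
    executor_shape="VM.Standard2.1",
    num_executors=1,
    spark_version="3.5.0",
    language="PYTHON",
    file_uri="oci://my-dataflow-bucket@{}/application.py".format(namespace),
)
application = data_flow.create_application(details).data
print("Created Data Flow Application:", application.id)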

What's Next

Now you know how to develop your applications locally and deploy them to Data Flow.