Importing an Apache Spark Application to the Oracle Cloud

Spark applications need to be hosted in Oracle Cloud Infrastructure Object Storage before you can run them.

You can upload your application to any bucket. The user running the application must have read access to all assets (including all related compartments, buckets and files) for the application to launch successfully.

Develop Data Flow-compatible Spark Applications

Data Flow supports running ordinary Spark applications and has no special design-time requirements.

We recommend that you develop your Spark application using Spark local mode on your laptop or similar environment. When development is complete, upload the application to Oracle Cloud Infrastructure Object Storage, and run it at scale using Data Flow.

Best Practices for Bundling Applications

Technology | Notes
Java or Scala Applications | For the best reliability, upload applications as Uber JARs or Assembly JARs, with all dependencies included, to Object Storage. Use tools such as the Maven Assembly Plugin (Java) or sbt-assembly (Scala) to build appropriate JARs (see the sketch after this table).
SQL Applications | Upload all your SQL files (.sql) to Object Storage.
Python Applications | Build applications with the default libraries and upload the Python file to Object Storage. To include any third-party libraries or packages, see Spark-Submit Functionality in Data Flow.

Do not provide your application package in a zipped format such as .zip or .gzip.
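To make the Java and Scala guidance concrete, the following is a minimal sketch of an sbt build that produces an Assembly JAR. It assumes the sbt-assembly plugin; the project name, Scala version, Spark version, and plugin version shown are illustrative and are not requirements of Data Flow.

// project/plugins.sbt: register the sbt-assembly plugin (version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

// build.sbt: minimal assembly build for a Spark application (names and versions are illustrative)
name := "logcrunch"
version := "1.0"
scalaVersion := "2.12.15"

// Spark is supplied by the runtime, so mark it "provided" to keep it out of the Assembly JAR.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"

// Discard duplicate META-INF entries that often conflict when dependencies are merged.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}

Running sbt assembly then produces a single JAR (for example, target/scala-2.12/logcrunch-assembly-1.0.jar) that you can upload to Object Storage.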

Once your application is uploaded to Oracle Cloud Infrastructure Object Storage, you refer to it using a URI of the form:
oci://<bucket>@<tenancy>/<applicationfile>

For example, suppose a developer at examplecorp writes a Java or Scala Spark application called logcrunch.jar and uploads it to a bucket called production_code. You can always determine the correct tenancy by clicking the user profile icon at the top right of the Console.

The correct URI becomes:
oci://production_code@examplecorp/logcrunch.jar
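As a small illustration of the format, this Scala sketch assembles the URI from its parts; the helper function is hypothetical and is not part of any Data Flow API.

// Hypothetical helper that builds an Object Storage URI from its parts.
def applicationUri(bucket: String, tenancy: String, applicationFile: String): String =
  s"oci://$bucket@$tenancy/$applicationFile"

// applicationUri("production_code", "examplecorp", "logcrunch.jar")
// returns "oci://production_code@examplecorp/logcrunch.jar"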

Load Data into the Oracle Cloud

Data Flow is optimized to manage data in Oracle Cloud Infrastructure Object Storage. Managing data in Object Storage maximizes performance and lets the application access data on behalf of the user running it. However, Data Flow can read data from other data sources supported by Spark, including relational databases (RDBMS), Oracle Autonomous Data Warehouse (ADW), NoSQL stores, and more. Data Flow can also communicate with on-premises systems using the Private Endpoint feature together with an existing FastConnect configuration.
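For example, here is a minimal Scala sketch of a Spark application that reads CSV data from Object Storage and writes Parquet back. The bucket, tenancy, and object paths are hypothetical, and the oci:// scheme is assumed to be resolved by the Object Storage connector that Data Flow provides at runtime.

import org.apache.spark.sql.SparkSession

object LoadDataExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-data-example").getOrCreate()

    // Hypothetical input and output locations in Object Storage.
    val input  = "oci://raw_data@examplecorp/logs/2023/*.csv"
    val output = "oci://processed_data@examplecorp/logs_parquet"

    // Read the raw CSV data, then write it back to Object Storage as Parquet.
    val logs = spark.read.option("header", "true").csv(input)
    logs.write.mode("overwrite").parquet(output)

    spark.stop()
  }
}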

Loading Data
Approach | Tools
Native web UI | The Oracle Cloud Infrastructure Console lets you manage storage buckets and upload files, including directory trees.
Third-party tools | Consider using the REST APIs and the Command Line Interface (CLI). For transferring large amounts of data, consider these third-party tools: