Importing an Apache Spark Application to the Oracle Cloud

Spark applications need to be hosted in Oracle Cloud Infrastructure Object Storage before you can run them.

You can upload your application to any bucket. The user running the application must have read access to all assets (including all related compartments, buckets and files) for the application to launch successfully.

Develop Data Flow-compatible Spark Applications

Data Flow supports running ordinary Spark applications and has no special design-time requirements.

We recommend that you develop your Spark application using Spark local mode on your laptop or similar environment. When development is complete, upload the application to Oracle Cloud Infrastructure Object Storage, and run it at scale using Data Flow.

Best Practices for Bundling Applications

Technology | Notes
Java or Scala Applications | For the best reliability, upload applications as Uber JARs or Assembly JARs, with all dependencies included, to Object Storage. Use tools such as the Maven Assembly Plugin (Java) or sbt-assembly (Scala) to build appropriate JARs (see the sketch after this table).
SQL Applications | Upload all your SQL files (.sql) to Object Storage.
Python Applications | Build applications with the default libraries and upload the Python file to Object Storage. To include any third-party libraries or packages, see Spark-Submit Functionality in Data Flow.

Do not provide your application package in a zipped format such as .zip or .gzip.
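To make the Java and Scala guidance concrete, the following is a minimal sketch of an sbt build that produces an Assembly JAR. It assumes the sbt-assembly plugin; the project name, Scala version, Spark version, and plugin version shown are illustrative and are not requirements of Data Flow.

// project/plugins.sbt: register the sbt-assembly plugin (version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

// build.sbt: minimal assembly build for a Spark application (names and versions are illustrative)
name := "logcrunch"
version := "1.0"
scalaVersion := "2.12.15"

// Spark is supplied by the runtime, so mark it "provided" to keep it out of the Assembly JAR.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"

// Discard duplicate META-INF entries that often conflict when dependencies are merged.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}

Running sbt assembly then produces a single JAR (for example, target/scala-2.12/logcrunch-assembly-1.0.jar) that you can upload to Object Storage.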

Once your application is uploaded to Oracle Cloud Infrastructure Object Storage, you refer to it using a URI of the form:
oci://<bucket>@<tenancy>/<applicationfile>

For example, suppose a developer at examplecorp writes a Java or Scala Spark application called logcrunch.jar and uploads it to a bucket called production_code. You can always determine the correct tenancy by clicking the user profile icon at the top right of the Console.

The correct URI becomes:
oci://production_code@examplecorp/logcrunch.jar
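As a small illustration of the format, this Scala sketch assembles the URI from its parts; the helper function is hypothetical and is not part of any Data Flow API.

// Hypothetical helper that builds an Object Storage URI from its parts.
def applicationUri(bucket: String, tenancy: String, applicationFile: String): String =
  s"oci://$bucket@$tenancy/$applicationFile"

// applicationUri("production_code", "examplecorp", "logcrunch.jar")
// returns "oci://production_code@examplecorp/logcrunch.jar"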

Load Data into the Oracle Cloud

Data Flow is optimized to manage data in Oracle Cloud Infrastructure Object Storage. Managing data in Object Storage maximizes performance and lets the application access data on behalf of the user running it. However, Data Flow can read data from other data sources supported by Spark, including relational databases (RDBMS), Oracle Autonomous Data Warehouse (ADW), NoSQL stores, and more. Data Flow can also communicate with on-premises systems using the Private Endpoint feature together with an existing FastConnect configuration.
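For example, here is a minimal Scala sketch of a Spark application that reads CSV data from Object Storage and writes Parquet back. The bucket, tenancy, and object paths are hypothetical, and the oci:// scheme is assumed to be resolved by the Object Storage connector that Data Flow provides at runtime.

import org.apache.spark.sql.SparkSession

object LoadDataExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-data-example").getOrCreate()

    // Hypothetical input and output locations in Object Storage.
    val input  = "oci://raw_data@examplecorp/logs/2023/*.csv"
    val output = "oci://processed_data@examplecorp/logs_parquet"

    // Read the raw CSV data, then write it back to Object Storage as Parquet.
    val logs = spark.read.option("header", "true").csv(input)
    logs.write.mode("overwrite").parquet(output)

    spark.stop()
  }
}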

Loading Data
Approach | Tools
Native web UI | The Oracle Cloud Infrastructure Console lets you manage storage buckets and upload files, including directory trees.
Third-party tools | Consider using the REST APIs and the Command Line Interface (CLI). For transferring large amounts of data, consider these third-party tools: