Getting Started with Spark-Submit and CLI
A tutorial to help you get started running a Spark application in Data Flow using spark-submit with the execute string in the CLI.
Follow the existing tutorial for Getting Started with Oracle Cloud Infrastructure Data Flow, but use CLI to run spark-submit commands.
Before You Begin
Complete the prerequisites and set up authentication before you can use spark-submit commands in Data Flow with the CLI.
Prerequisites to Use Spark-submit with CLI
Authentication to Use Spark-submit with CLI
Set up authentication to use spark-submit with the CLI.
Run:

$ oci session authenticate

- Select the intended region from the provided list of regions.
- Switch to the newly opened browser window to log in.
- When the browser authentication process completes, enter the name of the profile you would like to create, for example, oci-cli.
- The configuration is written to ~/.oci/config.

Try out your newly created session credentials with the following example command:

$ oci iam region list --config-file ~/.oci/config --profile <profile_name> --auth security_token

The session profile is stored in the ~/.oci/config file. Use the profile name to run the tutorial.
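Session tokens created this way are short-lived (typically about an hour). If a later spark-submit command fails with an authentication error, the token can usually be renewed without repeating the browser login; the command below is a sketch and assumes your CLI version provides the session refresh subcommand.

```shell
# Refresh the short-lived security token for the session profile.
# <profile_name> is the profile you entered during 'oci session authenticate'.
oci session refresh --profile <profile_name>
```
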
1. Create the Java Application Using Spark-Submit and CLI
Use spark-submit and the CLI to complete the tutorial.
- Set up your tenancy.
If you don't have a bucket in Object Storage where you can save your
input and results, you must create a bucket with a suitable folder
structure. In this example, the folder structure is output/tutorial1.
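If the bucket does not exist yet, it can also be created from the same CLI session instead of the Console; a minimal sketch, reusing the profile and compartment OCID from the earlier steps (the bucket name is a placeholder). Note that Object Storage folders are just object-name prefixes, so output/tutorial1 is created automatically when the run writes its first object.

```shell
# Create the Object Storage bucket that will hold the tutorial input and results.
oci --profile <profile-name> --auth security_token os bucket create \
  --name <bucket-name> \
  --compartment-id <compartment-id>
```
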
Run this code:
oci --profile <profile-name> --auth security_token data-flow run submit \
  --compartment-id <compartment-id> \
  --display-name Tutorial_1_ETL_Java \
  --execute '--class convert.Convert --files oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/tutorial1'

If you have run this tutorial before, delete the contents of the output directory, oci://<bucket-name>@<namespace-name>/output/tutorial1, to prevent the tutorial failing.
Note
To find the compartment-id, from the navigation menu, click Identity, and then click Compartments. The compartments available to you are listed, including the OCID of each.
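Both housekeeping steps, emptying the output folder before a re-run and looking up the compartment OCID, can also be done from the CLI; a hedged sketch using the Object Storage and IAM services (bucket and profile names are placeholders from the steps above):

```shell
# Delete any previous results under the output/tutorial1 prefix
# (--force skips the interactive confirmation).
oci --profile <profile-name> --auth security_token os object bulk-delete \
  --bucket-name <bucket-name> \
  --prefix output/tutorial1/ \
  --force

# List all compartments available to you to find the OCID
# to pass as --compartment-id.
oci --profile <profile-name> --auth security_token iam compartment list --all
```
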
2. Machine Learning with PySpark
Use spark-submit and the CLI to carry out machine learning with PySpark.
- Before attempting this exercise, complete 1. Create the Java Application Using Spark-Submit and CLI. Its results are used in this exercise.
Run the following code:
oci --profile <profile-name> --auth security_token data-flow run submit \
  --compartment-id <compartment-id> \
  --display-name Tutorial_3_PySpark_ML \
  --execute 'oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_pyspark_ml.py oci://<your_bucket>@<namespace-name>/output/tutorial1'
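The data-flow run submit command returns a JSON response that includes the OCID of the new run; you can use it to check progress until the run reaches a terminal state such as SUCCEEDED. A sketch, assuming <run-ocid> is taken from the previous command's output:

```shell
# Check the lifecycle state and details of the submitted run.
oci --profile <profile-name> --auth security_token data-flow run get \
  --run-id <run-ocid>
```

When the run succeeds, the results are written to oci://<your_bucket>@<namespace-name>/output/tutorial1.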
Use Spark-submit and the CLI in other situations.
You can use spark-submit from the CLI to create and run Java, Python, or SQL applications with Data Flow, and explore the results. Data Flow handles all the details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.