Getting Started with Spark-Submit and SDK

A tutorial to help you get started using the Java SDK to run a Spark application in Data Flow with spark-submit and the execute string.

Get started with spark-submit in Data Flow using the Java SDK. Follow the existing tutorial for Getting Started with Oracle Cloud Infrastructure Data Flow, but use the Java SDK to run the spark-submit commands.

Before You Begin

Complete these prerequisites before you can use spark-submit commands in Data Flow with the Java SDK.

  1. Set up your tenancy.
  2. Set up keys and configure your environment (a sample ~/.oci/config file follows this list).
  3. Create a Maven project and add the Oracle Cloud Infrastructure Java SDK dependency:
    <dependency>
      <groupId>com.oracle.oci.sdk</groupId>
      <artifactId>oci-java-sdk-dataflow</artifactId>
      <version>${oci-java-sdk-version}</version>
    </dependency>
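
The ConfigFileReader.parseDefault() call used in the examples that follow reads the configuration file created in step 2, ~/.oci/config by default. A minimal sketch of that file is shown below; the OCIDs, fingerprint, and key path are placeholders that you replace with your own values:

    [DEFAULT]
    user=ocid1.user.oc1..<unique_ID>
    fingerprint=<key_fingerprint>
    key_file=~/.oci/oci_api_key.pem
    tenancy=ocid1.tenancy.oc1..<unique_ID>
    region=us-phoenix-1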

1. ETL with Java

Use spark-submit and the Java SDK to carry out ETL with Java.

Using spark-submit and the Java SDK, complete the exercise, ETL with Java, from the Getting Started with Oracle Cloud Infrastructure Data Flow tutorial.
  1. Set up your tenancy.
  2. If you don't have a bucket in Object Storage where you can save your input and results, you must create a bucket with a suitable folder structure. In this example, the folder structure is /output/.
  3. Run this code:
    import java.io.IOException;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import com.oracle.bmc.ConfigFileReader;
    import com.oracle.bmc.Region;
    import com.oracle.bmc.auth.AuthenticationDetailsProvider;
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.dataflow.DataFlowClient;
    import com.oracle.bmc.dataflow.model.CreateRunDetails;
    import com.oracle.bmc.dataflow.requests.CreateRunRequest;
    import com.oracle.bmc.dataflow.responses.CreateRunResponse;

    public class ETLWithJavaExample {

      private static Logger logger = LoggerFactory.getLogger(ETLWithJavaExample.class);
      String compartmentId = "<compartment-id>"; // change to your compartment OCID

      public static void main(String[] args) {
        System.out.println("ETL with JAVA Tutorial");
        new ETLWithJavaExample().createRun();
      }

      public void createRun() {

        ConfigFileReader.ConfigFile configFile = null;
        // Authentication using the config from the ~/.oci/config file
        try {
          configFile = ConfigFileReader.parseDefault();
        } catch (IOException ie) {
          logger.error("Need to fix the config for Authentication ", ie);
          return;
        }

        try {
          AuthenticationDetailsProvider provider =
              new ConfigFileAuthenticationDetailsProvider(configFile);

          // Create a Data Flow client
          DataFlowClient client = new DataFlowClient(provider);
          client.setRegion(Region.US_PHOENIX_1);

          // Build the spark-submit compatible execute string:
          // --class and --files options, the application JAR, and its two arguments
          String executeString = "--class convert.Convert "
              + "--files oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv "
              + "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar "
              + "kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/optimized_listings";

          // Create the run details and create the run
          CreateRunDetails runDetails = CreateRunDetails.builder()
              .compartmentId(compartmentId).displayName("Tutorial_1_ETL_with_JAVA").execute(executeString)
              .build();

          CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
          CreateRunResponse response = client.createRun(runRequest);

          logger.info("Successful run creation for ETL_with_JAVA with OpcRequestID: " + response.getOpcRequestId()
              + " and Run ID: " + response.getRun().getId());

        } catch (Exception e) {
          logger.error("Exception creating run for ETL_with_JAVA ", e);
        }

      }
    }
    If you have run this tutorial before, delete the contents of the output directory, oci://<bucket-name>@<namespace-name>/output/optimized_listings, to prevent the tutorial from failing.
    Note

    To find the compartment-id, from the navigation menu, click Identity and click Compartments. The compartments available to you are listed, including the OCID of each.
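
After createRun returns, Data Flow queues and processes the run asynchronously. If you want to check on the run from the same program, you can poll it by ID. The following is a minimal sketch, not part of the original tutorial code, placed inside the same try block after the createRun call, and assuming the GetRunRequest and GetRunResponse classes from the same oci-java-sdk-dataflow package:

    // Sketch: poll the run created above by its ID (assumes GetRunRequest and
    // GetRunResponse from com.oracle.bmc.dataflow.requests and .responses).
    GetRunRequest getRunRequest = GetRunRequest.builder()
        .runId(response.getRun().getId())
        .build();
    GetRunResponse getRunResponse = client.getRun(getRunRequest);
    logger.info("Run lifecycle state: " + getRunResponse.getRun().getLifecycleState());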

2. Machine Learning with PySpark

Use spark-submit and the Java SDK to carry out machine learning with PySpark.

Using spark-submit and the Java SDK, complete the exercise, 3. Machine Learning with PySpark, from the Getting Started with Oracle Cloud Infrastructure Data Flow tutorial.
  1. Complete exercise 1. ETL with Java, before attempting this exercise. Its results are used in this exercise.
  2. Run the following code:
    import java.io.IOException;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import com.oracle.bmc.ConfigFileReader;
    import com.oracle.bmc.Region;
    import com.oracle.bmc.auth.AuthenticationDetailsProvider;
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.dataflow.DataFlowClient;
    import com.oracle.bmc.dataflow.model.CreateRunDetails;
    import com.oracle.bmc.dataflow.requests.CreateRunRequest;
    import com.oracle.bmc.dataflow.responses.CreateRunResponse;

    public class PySParkMLExample {

      private static Logger logger = LoggerFactory.getLogger(PySParkMLExample.class);
      String compartmentId = "<compartment-id>"; // change to your compartment OCID

      public static void main(String[] args) {
        System.out.println("ML_PySpark Tutorial");
        new PySParkMLExample().createRun();
      }

      public void createRun() {

        ConfigFileReader.ConfigFile configFile = null;
        // Authentication using the config from the ~/.oci/config file
        try {
          configFile = ConfigFileReader.parseDefault();
        } catch (IOException ie) {
          logger.error("Need to fix the config for Authentication ", ie);
          return;
        }

        try {
          AuthenticationDetailsProvider provider =
              new ConfigFileAuthenticationDetailsProvider(configFile);

          // Create a Data Flow client
          DataFlowClient client = new DataFlowClient(provider);
          client.setRegion(Region.US_PHOENIX_1);

          // The execute string points at the PySpark script and passes the
          // output of exercise 1 as its argument
          String executeString = "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py oci://<bucket-name>@<namespace-name>/output/optimized_listings";

          // Create the run details and create the run
          CreateRunDetails runDetails = CreateRunDetails.builder()
              .compartmentId(compartmentId).displayName("Tutorial_3_ML_PySpark").execute(executeString)
              .build();

          CreateRunRequest runRequest = CreateRunRequest.builder().createRunDetails(runDetails).build();
          CreateRunResponse response = client.createRun(runRequest);

          logger.info("Successful run creation for ML_PySpark with OpcRequestID: " + response.getOpcRequestId()
              + " and Run ID: " + response.getRun().getId());

        } catch (Exception e) {
          logger.error("Exception creating run for ML_PySpark ", e);
        }

      }
    }
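
As in exercise 1, the execute string accepts spark-submit compatible options ahead of the application URI. As an illustration only, a run that also sets a Spark property and ships an extra Python dependency archive could use an execute string like the following; the deps.zip archive is a hypothetical placeholder and is not part of the lab content:

    String executeString = "--conf spark.sql.shuffle.partitions=16 "
        + "--py-files oci://<bucket-name>@<namespace-name>/deps.zip " // hypothetical dependency archive
        + "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py "
        + "oci://<bucket-name>@<namespace-name>/output/optimized_listings";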

What's Next

Use spark-submit and the Java SDK in other situations.

You can use spark-submit and the Java SDK to create and run Java, Python, or SQL applications with Data Flow, and explore the results. Data Flow handles all details of deployment, tear down, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.