Creating a Spark-Submit Data Flow Application

Create a Spark-Submit Application in Data Flow.

  • Upload your Spark-submit files to Oracle Cloud Infrastructure Object Storage. See Set Up Object Store for details.
    1. Open the navigation menu, and click Analytics and AI. Under Data Lake, click Data Flow.
    2. In the left-side menu, click Applications.
    3. Under List scope, select the compartment that you want to create the application in.
    4. On the Applications page, click Create application.
    5. In the Create application panel, enter a name for the application and an optional description that can help you search for it.
    6. Under Resource configuration, provide the following values. To help calculate the number of resources that you need, see Sizing the Data Flow Application.
      1. Select the Spark version.
      2. (Optional) Select a pool.
      3. For Driver shape, select the type of cluster node to use to host the Spark driver.
      4. (Optional) If you selected a flexible shape for the driver, customize the number of OCPUs and the amount of memory.
      5. For Executor shape, select the type of cluster node to use to host each Spark executor.
      6. (Optional) If you selected a flexible shape for the executor, customize the number of OCPUs and the amount of memory.
      7. (Optional) To enable use of Spark dynamic allocation (autoscaling), select Enable autoscaling.
      8. Enter the number of executors that you need. If you enabled autoscaling, enter a minimum and maximum number of executors.
    7. Under Application configuration, provide the following values.
      1. (Optional) If the application is for Spark streaming, select Spark Streaming.
      2. Select Use Spark-Submit Options. The supported spark-submit options are:
        • --py-files
        • --files
        • --jars
        • --class
        • --conf An arbitrary Spark configuration property in key=value format. If a value contains spaces, wrap it in quotes, "key=value" (a quoted example follows this list). Pass multiple configurations as separate arguments, for example,
           --conf <key1>=<value1> --conf <key2>=<value2>
        • application-jar The path to a bundled JAR including your application and all its dependencies.
        • application-arguments The arguments passed to the main method of your main class.
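        For example, to pass a --conf value that contains spaces together with a second configuration property, you might enter something like the following (the property values here are only illustrative):
           --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Dcustom.setting=true" --conf spark.sql.shuffle.partitions=64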
      3. In the Spark-Submit options text box, enter the options in the format:
         --py-files oci://<bucket_name>@<objectstore_namespace>/<file_name>.py oci://<bucket_name>@<objectstore_namespace>/<dependencies_file_name>.zip
         --files oci://<bucket_name>@<objectstore_namespace>/<file_name>.json
         --jars oci://<bucket_name>@<objectstore_namespace>/<file_name>.jar
         --conf spark.sql.crossJoin.enabled=true
         oci://<bucket_name>@<objectstore_namespace>/<file_name>.py oci://<argument2_path_to_input> oci://<argument3_path_to_output>
        For example, to use Spark Oracle Datasource, use the following option:
        --conf spark.oracle.datasource.enable=true
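        As a fuller illustration, assuming a hypothetical bucket named my-bucket in the Object Storage namespace mytenancy, a complete set of Spark-Submit options for a Python application might look like this (all bucket, namespace, and file names are placeholders):
         --py-files oci://my-bucket@mytenancy/dependencies.zip
         --files oci://my-bucket@mytenancy/config.json
         --jars oci://my-bucket@mytenancy/ojdbc8.jar
         --conf spark.sql.crossJoin.enabled=true
         oci://my-bucket@mytenancy/my_script.py oci://my-bucket@mytenancy/input oci://my-bucket@mytenancy/output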
        Important

        Data Flow doesn't support URIs beginning with local:// or hdfs://. The URI must start with oci://, so all the files (including the main application) must be in Oracle Cloud Infrastructure Object Storage, and you must use the fully qualified domain name (FQDN) for each file.
      4. (Optional) If you have an archive.zip file, upload archive.zip to Oracle Cloud Infrastructure Object Storage and populate Archive URI with the path to it. There are two ways to do this:
        • Select the file from the Object Storage file name list. Click Change compartment if the bucket is in a different compartment.
        • Click Enter the file path manually and enter the file name and the path to it using this format:
           oci://<bucket_name>@<namespace_name>/<file_name>
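          For example, if the archive is stored in a hypothetical bucket named my-bucket in the namespace mytenancy, the path would be:
           oci://my-bucket@mytenancy/archive.zip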
      5. Under Application log location, specify where you want the application logs to be ingested for Oracle Cloud Infrastructure Logging, in one of the following ways:
        • Select the dataflow-logs bucket from the Object Storage file name list. Click Change compartment if the bucket is in a different compartment.
        • Click Enter the bucket path manually and enter the bucket path using this format:
           oci://dataflow-logs@<namespace_name>
      6. (Optional) Select the Metastore from the list. If the metastore is in a different compartment, click Change compartment first, and select a different compartment, then select the Metastore from the list. The Default managed table location is automatically populated based on your metastore.
    8. (Optional) To add tags to the application, select a tag namespace (for defined tags), then specify a tag key and value. Add more tags as needed. For more information about tagging, see Overview of Tagging.
    9. (Optional) Click Show advanced options, and provide the following values.
      1. (Optional) Select Use resource principal auth to enable a faster start, or if you expect the Run to last more than 24 hours. You must have Resource Principal Policies set up.
      2. Check Enable Delta Lake to use Delta Lake.
        1. Select the Delta Lake version. The value you choose is reflected in the Spark configuration properties Key/Value pair.
        2. Select the logs group.
      3. (Optional) Click Enable Spark Oracle data source to use Spark Oracle Datasource.
      4. (Optional) In the Logs section, select the logs groups and the application logs for Oracle Cloud Infrastructure Logging. If the logs groups are in a different compartment, click Change compartment.
      5. Add Spark Configuration Properties. Enter a Key and Value pair.
      6. Click + Another property to add another configuration property.
      7. Repeat the previous two steps until you've added all the configuration properties.
      8. Override the default value for the warehouse bucket by populating Warehouse Bucket URI in the format:
        oci://<warehouse-name>@<tenancy>
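        For example, with a hypothetical warehouse bucket named my-warehouse in the tenancy namespace mytenancy:
        oci://my-warehouse@mytenancy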
      9. For Choose network access, select one of the following options:
        • If you're Attaching a Private Endpoint to Data Flow, click the Secure Access to Private Subnet radio button. Select the private endpoint from the resulting list.
          Note

          You can't use an IP address to connect to the private endpoint; you must use the FQDN.
        • If you're not using a private endpoint, click the Internet Access (No Subnet) radio button.
      10. For Max run duration in minutes, enter a value between 60 (1 hour) and 10080 (7 days). If you don't enter a value, the submitted run continues until it succeeds, fails, is canceled, or reaches its default maximum duration (24 hours).
    10. Click Create to create the Application, or click Save as stack to create it later.
      To change the values for Name and File URL in the future, see Editing an Application.
  • Use the create command and required parameters to create an application:

    oci data-flow application create [OPTIONS]
    For a complete list of flags and variable options for CLI commands, see the CLI Command Reference.
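    For example, a Spark-Submit application might be created with a command along the following lines. The compartment OCID, shapes, bucket, namespace, and file names are placeholders, and the parameters shown are a sketch; confirm the exact parameter names and supported values against the CLI Command Reference before running the command.

    oci data-flow application create \
      --compartment-id ocid1.compartment.oc1..<unique_id> \
      --display-name "spark-submit-example" \
      --spark-version 3.2.1 \
      --language PYTHON \
      --driver-shape VM.Standard.E4.Flex \
      --driver-shape-config '{"ocpus": 1, "memoryInGBs": 16}' \
      --executor-shape VM.Standard.E4.Flex \
      --executor-shape-config '{"ocpus": 2, "memoryInGBs": 32}' \
      --num-executors 2 \
      --execute "--conf spark.sql.crossJoin.enabled=true oci://my-bucket@mytenancy/my_script.py oci://my-bucket@mytenancy/input oci://my-bucket@mytenancy/output"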
  • Run the CreateApplication operation to create an application.