Supported Spark Properties in Data Flow

For every run of a Data Flow application, you can add Spark Properties in the Spark Configuration Properties field.

For more information on these properties, see the Spark Configuration Guide.
Important

When you're running in Data Flow, don't change the value of spark.master. If you do, the job doesn't use all the resources you provisioned.

Data Flow Proprietary Spark Configuration List

Spark configurations proprietary to Data Flow and how to use them.

Data Flow Spark Configuration List

Each entry lists the Spark configuration property, its usage description, and the Spark versions it applies to.

dataflow.auth

Setting the configuration value to 'resource_principal' enables resource principal authentication for the Data Flow run. This configuration is required for runs intended to run longer than 24 hours. Before enabling resource principal, set up the appropriate policy.

Applicable Spark Versions: All

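For example, to enable resource principal authentication for a long-running job, add the property in the Spark Configuration Properties field in the same key = value form used by the other entries on this page:

dataflow.auth = resource_principal
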
spark.dataflow.acquireQuotaTimeout

Data Flow gives you the option to submit jobs when you don't have enough resources to run them. The jobs are held in an internal queue and are released when resources become available. Data Flow keeps checking until the timeout value that you've set has elapsed. Set the spark.dataflow.acquireQuotaTimeout property to specify this timeout value, under Advanced options when creating or running an application. For example:

spark.dataflow.acquireQuotaTimeout = 1h
spark.dataflow.acquireQuotaTimeout = 30m
spark.dataflow.acquireQuotaTimeout = 45min

Use h to represent timeout hours, and m or min to represent timeout minutes.

Note: If spark.dataflow.acquireQuotaTimeout isn't set, a run is only accepted if the required resources are available.

Applicable Spark Versions: All

spark.archives#conda

The spark.archives configuration serves exactly the same purpose as its open source counterpart. When using Conda as the package manager to submit PySpark jobs in OCI Data Flow, attach #conda to the artifact package entries so that Data Flow extracts the artifacts into a proper directory. For example:

oci://<bucket-name>@<namespace-name>/<path>/artifact.tar.gz#conda

For more information, see Integrating Conda Pack with Data Flow.

Applicable Spark Versions: 3.2.1 or later

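As a minimal sketch, assuming the archive is referenced through the spark.archives property in the Spark Configuration Properties field (the bucket, namespace, and path are placeholders, as above):

spark.archives = oci://<bucket-name>@<namespace-name>/<path>/artifact.tar.gz#conda
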
spark.dataflow.streaming.restartPolicy.restartPeriod

Note: Applicable to Data Flow Streaming type runs only.

This property specifies a minimum delay between restarts for a Streaming application. The default value is 3 minutes, to prevent transient issues from causing many restarts in a short period.

Applicable Spark Versions: 3.0.2, 3.2.1 or later

spark.dataflow.streaming.restartPolicy.maxConsecutiveFailures

Note: Applicable to Data Flow Streaming type runs only.

This property specifies the maximum number of consecutive failures that can occur before Data Flow stops restarting a failed Streaming application. The default value for this is 10.

Applicable Spark Versions: 3.0.2, 3.2.1 or later

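A hypothetical sketch of tuning both restart-policy properties in the Spark Configuration Properties field. The values shown, and the duration format for restartPeriod, are illustrative assumptions; this page documents only the defaults (3 minutes, and 10 consecutive failures):

spark.dataflow.streaming.restartPolicy.restartPeriod = 5m
spark.dataflow.streaming.restartPolicy.maxConsecutiveFailures = 5
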
spark.sql.streaming.graceful.shutdown.timeout

Note: Applicable to Data Flow Streaming type runs only.

Data Flow Streaming runs use the shutdown duration to preserve checkpoint data so that they restart correctly from the prior state. This configuration specifies the maximum time a Data Flow Streaming run can spend gracefully preserving the checkpoint state before it's forced to shut down. The default is 30 minutes.

Applicable Spark Versions: 3.0.2, 3.2.1 or later

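A hypothetical example; the duration format shown is an assumption, since this page states only the 30-minute default:

spark.sql.streaming.graceful.shutdown.timeout = 45m
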
spark.oracle.datasource.enabled

Spark Oracle Datasource is an extension of the Spark JDBC datasource. It simplifies the connection to Oracle databases from Spark. In addition to all the options provided by Spark's JDBC datasource, Spark Oracle Datasource simplifies connecting to Oracle databases from Spark by providing:

  • Automatic download of the wallet from the Autonomous Database, which means there's no need to download the wallet and keep it in Object Storage or Vault.
  • Automatic distribution of the wallet bundle from Object Storage to the driver and executors without any customized code from users.
  • JDBC driver JAR files, removing the need to download them and include them in the archive.zip file. The JDBC driver version is 21.3.0.0.

To enable Spark Oracle Datasource, set the spark.oracle.datasource.enabled configuration to true:

spark.oracle.datasource.enabled = true

For more information, see Spark Oracle Datasource.

Applicable Spark Versions: 3.0.2 or later

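As an illustration, here's a minimal PySpark sketch of reading an Autonomous Database table once the datasource is enabled for the run. The format name ("oracle") and the option keys (adbId, dbtable, user, password) are assumptions to be confirmed against the Spark Oracle Datasource documentation referenced above; the OCID, table name, and credentials are placeholders.

# Minimal sketch: read a table through the Spark Oracle Datasource,
# assuming spark.oracle.datasource.enabled = true is set for the run.
# The format name and option keys are assumptions; see the Spark Oracle
# Datasource documentation for the authoritative list.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-datasource-example").getOrCreate()

df = (
    spark.read.format("oracle")
    .option("adbId", "ocid1.autonomousdatabase.oc1..<placeholder>")  # Autonomous Database OCID (placeholder)
    .option("dbtable", "ADMIN.SALES")                                # schema.table to read (placeholder)
    .option("user", "ADMIN")                                         # database user (placeholder)
    .option("password", "<password>")                                # database password (placeholder)
    .load()
)
df.show()
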
spark.scheduler.minRegisteredResourcesRatio

Default: 1.0

Note: Specified as a double between 0.0 and 1.0.

The minimum ratio of registered resources to total expected resources to wait for before scheduling a run at the Job layer. Adjusting this parameter involves a trade-off between faster job startup and ensuring adequate resource availability.

For example, a value of 0.8 means the run waits for 80% of the expected resources to register before it's scheduled.

Applicable Spark Versions: All

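For example, to schedule the run once 80% of the expected resources have registered:

spark.scheduler.minRegisteredResourcesRatio = 0.8
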
spark.dataflow.overAllocationRatio

Default: 1.0

Note: Specified as a double larger than, or equal to, 1.0.

The ratio of excess resources to create, to avoid job failure resulting from the failure to create a small portion of the instances. The extra instances are billed only during the creation phase and are released after the job starts.

For example, a value of 1.1 means that 10% more resources are created to accommodate the expected resources for customer jobs.

Applicable Spark Versions: All

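For example, to create 10% extra instances during startup, as described above:

spark.dataflow.overAllocationRatio = 1.1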