Supported Spark Properties in Data Flow
For every run of a Data Flow application, you can add Spark Properties in the Spark Configuration Properties field.
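The same properties can also be supplied programmatically when a run is created. The following is a minimal sketch using the OCI Python SDK, assuming the SDK is installed and an API-key configuration file exists; the OCIDs and the two property values are placeholders (both properties are described in the table below).

```python
# Minimal sketch: passing Spark configuration properties when creating a
# Data Flow run with the OCI Python SDK. The OCIDs below are placeholders.
import oci

config = oci.config.from_file()  # reads ~/.oci/config by default
client = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",          # placeholder
    application_id="ocid1.dataflowapplication.oc1..example",  # placeholder
    display_name="example-run",
    # Key-value pairs, as you would enter them in the
    # Spark Configuration Properties field:
    configuration={
        "dataflow.auth": "resource_principal",       # for runs longer than 24 hours
        "spark.dataflow.acquireQuotaTimeout": "1h",  # queue the run for up to an hour
    },
)

run = client.create_run(run_details).data
print(run.id, run.lifecycle_state)
```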
When you're running in Data Flow, don't change the value of spark.master. If you do, the job doesn't use all the resources you provisioned.

Data Flow Proprietary Spark Configuration List
Spark configurations proprietary to Data Flow and how to use them.
| Spark Configuration | Usage Description | Applicable Spark Versions |
|---|---|---|
| dataflow.auth | Setting the configuration value to 'resource_principal' enables resource principal authentication for the Data Flow run. This configuration is required for runs intended to run longer than 24 hours. Before enabling resource principal, set up the appropriate policy. | All |
| spark.dataflow.acquireQuotaTimeout | Data Flow gives you the option to submit jobs when you don't have enough resources to run them. The jobs are held in an internal queue and are released when resources become available; Data Flow keeps checking until the timeout value that you set expires. Set the spark.dataflow.acquireQuotaTimeout property to specify this timeout value, under Advanced options when creating or running an application. For example: spark.dataflow.acquireQuotaTimeout=1h or spark.dataflow.acquireQuotaTimeout=30m. Use h to represent timeout hours, and m or min to represent timeout minutes. Note: If this property isn't set, a run is accepted only if the required resources are available at submission. | All |
| spark.archives#conda | The spark.archives configuration serves the same function as its open source counterpart. When using Conda as the package manager to submit PySpark jobs in OCI Data Flow, attach #conda to the artifact package entries so that Data Flow extracts the artifacts into a proper directory (see the example after this table). For more information, see Integrating Conda Pack with Data Flow. | 3.2.1 or later |
| spark.dataflow.streaming.restartPolicy.restartPeriod | Note: Applicable to Data Flow Streaming type runs only. This property specifies a minimum delay between restarts for a Streaming application. The default value is 3 minutes, to prevent transient issues from causing many restarts in a short period (see the property sketch after this table). | 3.0.2, 3.2.1 or later |
| spark.dataflow.streaming.restartPolicy.maxConsecutiveFailures | Note: Applicable to Data Flow Streaming type runs only. This property specifies the maximum number of consecutive failures that can occur before Data Flow stops restarting a failed Streaming application. The default value is 10. | 3.0.2, 3.2.1 or later |
| spark.sql.streaming.graceful.shutdown.timeout | Note: Applicable to Data Flow Streaming type runs only. Data Flow streaming runs use the shutdown duration to preserve checkpoint data so that they can restart correctly from the prior state. The configuration specifies the maximum time streaming runs can spend gracefully preserving the checkpoint state before being forced to shut down. The default is 30 minutes. | 3.0.2, 3.2.1 or later |
| spark.oracle.datasource.enabled | Spark Oracle Datasource is an extension of the Spark JDBC datasource. It simplifies connecting to Oracle databases from Spark, in addition to supporting all the options provided by Spark's JDBC datasource (see the PySpark sketch after this table). For more information, see Spark Oracle Datasource. | 3.0.2 or later |
| spark.scheduler.minRegisteredResourcesRatio | Default: 1.0. Specified as a double between 0.0 and 1.0. The minimum ratio of registered resources to total expected resources to wait for before scheduling a run at the Job layer. Adjusting this parameter involves a trade-off between faster job startup and ensuring adequate resource availability. For example, a value of 0.8 means the run waits for 80% of the expected resources to register before it's scheduled. | All |
| spark.dataflow.overAllocationRatio | Default: 1.0. Specified as a double greater than or equal to 1.0. The ratio of excess resource creation, used to avoid job failure when a small share of the instances fail to be created. Extra instances are billed only during the creation phase and are ended after the job starts. For example, a value of 1.1 means that 10% more resources are created to accommodate the expected resources for customer jobs. | All |
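As a sketch of the spark.archives#conda entry described in the table, assuming a conda-pack archive has already been uploaded to Object Storage (the bucket, namespace, and file name are placeholders), the property might look like this in the Spark Configuration Properties field:

```
spark.archives=oci://mybucket@mynamespace/envs/conda_env.tar.gz#conda
```

The #conda suffix is what signals Data Flow to extract the archive into the directory it expects for a Conda environment.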
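To illustrate what spark.oracle.datasource.enabled turns on, the following is a PySpark sketch of reading a table through the Spark Oracle Datasource. The OCID, table name, and credentials are placeholders, and the option names shown (adbId, dbtable, user, password) should be checked against the Spark Oracle Datasource documentation for your Spark version.

```python
# Sketch: reading from an Autonomous Database with the Spark Oracle Datasource
# (requires spark.oracle.datasource.enabled=true on the run). All identifiers
# and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-datasource-example").getOrCreate()

df = (
    spark.read.format("oracle")
    .option("adbId", "ocid1.autonomousdatabase.oc1..example")  # placeholder OCID
    .option("dbtable", "ADMIN.SALES")                          # placeholder table
    .option("user", "ADMIN")
    .option("password", "example-password")
    .load()
)
df.show()
```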
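For the three streaming properties, the entries might look like the following. The property names are taken from the table, but the duration formats shown are assumptions based on the documented defaults:

```
spark.dataflow.streaming.restartPolicy.restartPeriod=3min
spark.dataflow.streaming.restartPolicy.maxConsecutiveFailures=10
spark.sql.streaming.graceful.shutdown.timeout=30min
```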