Spark Oracle Datasource

Spark Oracle Datasource is an extension of the JDBC datasource provided by Spark.

Spark Oracle Datasource simplifies connecting to Oracle databases from Spark. In addition to all the options provided by Spark's JDBC datasource, it provides:
  • Automatic download of the wallet from the autonomous database, so there is no need to download the wallet and keep it in Object Storage or Vault.
  • Automatic distribution of the wallet bundle from Object Storage to the driver and executors, without any custom code from users.
  • Bundled JDBC driver JAR files, so there is no need to download them and include them in your archive.zip file. The JDBC driver is version 21.3.0.0.

Use a Spark Oracle Datasource

There are two ways to use this data source in Data Flow.

  • In the Advanced Options section when creating, editing, or running an application, include the key:
    spark.oracle.datasource.enabled
    with the value: true. For more information, see the Create Applications section.
  • Use the Oracle Spark datasource format. For example, in Scala:
    val df = spark.read
      .format("oracle")
      .option("adbId", "autonomous_database_ocid")
      .option("dbtable", "schema.tablename")
      .option("user", "username")
      .option("password", "password")
      .load()
    More examples in other languages are available in the Spark Oracle Datasource Examples section.
The following three properties are available with the Oracle datasource, in addition to the properties provided by Spark's JDBC datasource:

Oracle Datasource Properties

  • walletUri
    Default setting: none.
    Description: An Object Storage or HDFS-compatible URL that points to the ZIP file of the Oracle Wallet needed for mTLS connections to an Oracle database. For more information on using the Oracle Wallet, see View TNS Names and Connection Strings for an Autonomous Database Instance.
    Scope: Read/write.
  • connectionId
    Default setting: optional with adbId, where it defaults to <database_name>_medium from tnsnames.ora; required with the walletUri option.
    Description: The connection identifier alias from the tnsnames.ora file, which is part of the Oracle wallet. For more information, see the Overview of Local Naming Parameters and the Glossary in the Oracle Database Net Services Reference.
    Scope: Read/write.
  • adbId
    Default setting: none.
    Description: The Oracle Autonomous Database OCID. For more information, see the Overview of Autonomous Database.
    Scope: Read/write.
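As an illustration of the wallet-based alternative to adbId, the following minimal sketch gathers the options for a connection that uses walletUri together with connectionId. The WalletOptions object and its build method are hypothetical helpers written for this example and are not part of Spark Oracle Datasource. The wallet URL and alias shown in the usage comment are placeholders.

```scala
// Hypothetical helper (not part of Spark Oracle Datasource): collects the
// options for a wallet-based connection into a Map that can be passed to
// DataFrameReader.options(...).
object WalletOptions {
  def build(
      walletUri: String,    // Object Storage or HDFS-compatible URL of the wallet ZIP
      connectionId: String, // alias from tnsnames.ora inside the wallet
      dbtable: String,
      user: String,
      password: String
  ): Map[String, String] = Map(
    "walletUri"    -> walletUri,
    "connectionId" -> connectionId,
    "dbtable"      -> dbtable,
    "user"         -> user,
    "password"     -> password
  )
}

// Usage inside a Spark application (assumes a SparkSession named `spark`
// and placeholder values throughout):
//   val df = spark.read
//     .format("oracle")
//     .options(WalletOptions.build(
//       "oci://bucket@namespace/wallet.zip",
//       "database_name_medium",
//       "schema.tablename",
//       "username",
//       "password"))
//     .load()
```

Keeping the options in a plain Map makes it easy to inspect or log them (minus the password) before the read is issued.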
Note

The following limitations apply to the options:
  • adbId and walletUri cannot be used together.
  • connectionId must be provided with walletUri, but is optional with adbId.
  • adbId is not supported for databases with SCAN (Single Client Access Name).
  • adbId is not supported for Autonomous Dedicated Infrastructure or Exadata infrastructure.
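The first two limitations can be checked before a read or write is attempted. The following is a minimal sketch of such a check, not the datasource's own validation logic. The OracleOptionCheck object and its error messages are assumptions made for this example, and the option keys are compared exactly as spelled in this document.

```scala
// Hypothetical pre-flight check (not part of Spark Oracle Datasource).
object OracleOptionCheck {
  // Returns Left(error messages) if the options break one of the documented
  // rules, otherwise Right(options) unchanged.
  def validate(opts: Map[String, String]): Either[List[String], Map[String, String]] = {
    val hasAdbId  = opts.contains("adbId")
    val hasWallet = opts.contains("walletUri")
    val hasConnId = opts.contains("connectionId")

    val errors = List(
      // Rule: adbId and walletUri cannot be used together.
      if (hasAdbId && hasWallet) Some("adbId and walletUri cannot be used together") else None,
      // Rule: connectionId is required with walletUri (optional with adbId).
      if (hasWallet && !hasConnId) Some("connectionId must be provided with walletUri") else None
    ).flatten

    if (errors.isEmpty) Right(opts) else Left(errors)
  }
}
```

For example, OracleOptionCheck.validate(Map("walletUri" -> "u")) returns a Left containing the connectionId message, while a map that holds only adbId and dbtable passes through as a Right. The SCAN and infrastructure limitations depend on the database itself and cannot be detected from the options alone.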
You can use Spark Oracle Datasource in Data Flow with Spark 3.0.2 and higher versions.
To use Spark Oracle Datasource with Spark Submit, set the following option:
--conf spark.oracle.datasource.enabled=true
Only the following databases are supported with adbId:
  • Autonomous Data Warehouse Shared Infrastructure (ADW-S)
    Note

    If you have this database in a VCN private subnet, use a Private Network to allowlist the FQDN of the autonomous database's private endpoint.
  • Autonomous Transaction Processing Shared Infrastructure (ATP-S)
  • Autonomous JSON Database Shared Infrastructure (AJD-S)
The following databases can be used with the walletUri option:
  • Autonomous Data Warehouse Shared Infrastructure (ADW-S)
  • Autonomous Data Warehouse Dedicated Infrastructure (ADW-D), including Exadata infrastructure
  • Autonomous Transaction Processing Shared Infrastructure (ATP-S)
  • Autonomous Transaction Processing Dedicated Infrastructure (ATP-D)
  • Autonomous JSON Database Shared Infrastructure (AJD-S)
  • Autonomous JSON Database Dedicated Infrastructure (AJD-D)
  • On-premises Oracle database that can be accessed from Data Flow's network, through either FastConnect or site-to-site VPN.