Best Practices for Building Python Spark Streaming Applications

Tips and best practices for configuring Python build projects in Data Flow.

Python Spark applications are more complex than Java applications because dependencies are required at the same time for both the JVM and Python runtimes, and each runtime has its own project and package management system. To help with packaging dependencies from the different runtimes, Data Flow has a dependency packager tool. The tool packages all dependencies into a single archive that you upload to Object Storage. Data Flow provides the dependencies from that archive to the Spark application.

Storing the archive in Oracle Cloud Infrastructure Object Storage ensures availability, gives reproducible builds (an artifact repository is dynamic, so resolving the same dependencies later might produce a different dependency tree), and avoids repeatedly downloading the same dependencies from external sources.

More information on how to set up and use the dependency packager is available in the section on Spark-Submit Functionality in Data Flow.

When building archive.zip for your application, list the required Java libraries in packages.txt, and the dependency packager packages them together with their dependencies.
Provide a dependency using any of the following options:
  • Use the --packages option or the spark.jars.packages Spark configuration (see the sketch after this list). An application running in a private endpoint must allow traffic from the private subnet to the public internet so that the package can download.
  • Provide the Object Storage locations of the JARs in --jars or spark.jars as a comma-separated list.
  • Use python or structured_streaming_java_dependencies_for_python to create archive.zip.
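For the first two options, a minimal PySpark sketch of setting the configuration when the session is built; the package coordinates match the example below, and the Object Storage paths are placeholders:

from pyspark.sql import SparkSession

# Option 1: resolve the Kafka package from Maven coordinates
# (requires outbound internet access from the subnet).
# Option 2 (commented out): point spark.jars at JARs already
# uploaded to Object Storage, as a comma-separated list.
spark = (
    SparkSession.builder
    .appName("DependencyOptionsExample")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1")
    # .config("spark.jars",
    #         "oci://<bucket>@<namespace>/jars/first.jar,"
    #         "oci://<bucket>@<namespace>/jars/second.jar")
    .getOrCreate()
)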
For example, to include spark-sql-kafka-0-10_2.12, add it to packages.txt:
org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
Run this command:
docker run --pull always --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
The command produces an archive.zip file:
  adding: java/ (stored 0%)
  adding: java/org.lz4_lz4-java-1.7.1.jar (deflated 3%)
  adding: java/org.slf4j_slf4j-api-1.7.30.jar (deflated 11%)
  adding: java/org.xerial.snappy_snappy-java-1.1.8.2.jar (deflated 1%)
  adding: java/org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.0.2.jar (deflated 8%)
  adding: java/org.apache.spark_spark-sql-kafka-0-10_2.12-3.0.2.jar (deflated 5%)
  adding: java/com.github.luben_zstd-jni-1.4.4-3.jar (deflated 1%)
  adding: java/org.spark-project.spark_unused-1.0.0.jar (deflated 42%)
  adding: java/org.apache.kafka_kafka-clients-2.4.1.jar (deflated 8%)
  adding: java/org.apache.commons_commons-pool2-2.6.2.jar (deflated 9%)
  adding: version.txt (deflated 59%)
archive.zip is generated!
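When Data Flow provides the archive to the Run, the Kafka source becomes available from Python. A minimal structured streaming sketch; the broker address, topic, and checkpoint location are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamingExample").getOrCreate()

# Read from Kafka using the spark-sql-kafka package from archive.zip.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-host:9092")  # placeholder broker
    .option("subscribe", "input-topic")                     # placeholder topic
    .load()
)

# Kafka keys and values arrive as binary; cast them to strings before use.
query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "oci://<bucket>@<namespace>/checkpoints/")  # placeholder
    .start()
)
query.awaitTermination()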
Note

In Spark 3.2.1, most of the source code and libraries used to run Data Flow are hidden. You no longer have to match the Data Flow SDK versions, and no longer have third-party dependency conflicts with Data Flow. See the Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to The Cloud tutorial for more information. For the required library versions, see Migrate Data Flow to Spark 3.2.1. Resolve the streaming dependency using any of the options listed earlier. Examples of structured streaming Java dependencies for Python are available on GitHub.

You might need to shade some Java libraries.
If you use Spark 2.4.4 or Spark 3.0.2, create a separate Maven project that builds a fat JAR, so that all the Java dependencies and other tweaks, such as shading, are in one place. Include the fat JAR as a custom JAR using the dependency packager. For example, shade oci-java-sdk-addons-sasl: the Oracle Cloud Infrastructure SDK is compiled against later versions of some third-party libraries, so runtime failures might occur without shading.

An example Maven project:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>com.example</groupId>
  <artifactId>SaslFat</artifactId>
  <version>1.0-SNAPSHOT</version>
 
  <properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <spark.version>3.0.2</spark.version>
  </properties>
 
  <dependencies>
    <dependency>
      <groupId>com.oracle.oci.sdk</groupId>
      <artifactId>oci-java-sdk-addons-sasl</artifactId>
      <optional>false</optional>
      <version>1.36.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>3.11.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>
 
  <build>
    <plugins>
 
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <configuration>
          <createDependencyReducedPom>false</createDependencyReducedPom>
          <transformers>
            <transformer
              implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            <transformer
              implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/>
          </transformers>
          <relocations>
            <relocation>
              <pattern>com.google.</pattern>
              <shadedPattern>com.shaded.google.</shadedPattern>
            </relocation>
            <relocation>
              <pattern>com.oracle.bmc.</pattern>
              <shadedPattern>com.shaded.oracle.bmc.</shadedPattern>
              <excludes>
                <exclude>com.oracle.bmc.auth.sasl.*</exclude>
              </excludes>
            </relocation>
          </relocations>
          <!-- exclude signed Manifests -->
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <artifactSet>
            <excludes>
              <exclude>${project.groupId}:${project.artifactId}</exclude>
            </excludes>
          </artifactSet>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
 
    </plugins>
  </build>
</project>
Place SaslFat-1.0-SNAPSHOT.jar in the working directory of the dependency packager and run the command:
docker run --pull always --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
SaslFat-1.0-SNAPSHOT.jar is packaged into archive.zip as a Java dependency:
  adding: java/ (stored 0%)
  adding: java/SaslFat-1.0-SNAPSHOT.jar (deflated 8%)
  adding: version.txt (deflated 59%)
archive.zip is generated!
Alternatively, you can manually create an archive.zip that contains a java folder with SaslFat-1.0-SNAPSHOT.jar in it.
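A minimal sketch of building the archive by hand with Python's zipfile module; the JAR name matches the example above:

import zipfile

# Data Flow expects Java dependencies under a top-level java/ folder in archive.zip.
with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("SaslFat-1.0-SNAPSHOT.jar", arcname="java/SaslFat-1.0-SNAPSHOT.jar")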
