Best Practices for Building Python Spark Streaming Applications
Tips and best practices for configuring Python build projects in Data Flow.
Python Spark applications are more complex than Java applications because dependencies are required at the same time for both the JVM and Python runtimes, and each runtime has its own project and package management system. To help with packaging dependencies from the different runtimes, Data Flow has a dependency packager tool. The tool packages all dependencies into a single archive that must be uploaded to Object Storage. Data Flow provides the dependencies from that archive to the Spark application.
Storing the archive in Oracle Cloud Infrastructure Object Storage ensures availability and reproducibility (an artifactory is dynamic and so might produce a different dependency tree), and avoids repeatedly downloading the same dependencies from external sources. More information on how to set up and use the dependency packager is available in the section on Spark-Submit Functionality in Data Flow.
- When building archive.zip for your application, list the required Java libraries in `packages.txt`; the dependency packager packages them together with their dependencies.
- Provide a dependency using any of the following options:
  - Use the `--packages` option or the `spark.jars.packages` Spark configuration. An application running in a private endpoint must allow traffic from the private subnet to the public internet so that the packages can be downloaded.
  - Provide the Object Storage location in `--jars` or `spark.jars` as a comma-separated list.
  - Use `structured_streaming_java_dependencies_for_python` to create archive.zip.

For example, to include `spark-sql-kafka-0-10_2.12`, add it in `packages.txt`:

```
org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
```

Run this command:

```
docker run --pull always --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
```

Resulting in an archive.zip file:

```
adding: java/ (stored 0%)
adding: java/org.lz4_lz4-java-1.7.1.jar (deflated 3%)
adding: java/org.slf4j_slf4j-api-1.7.30.jar (deflated 11%)
adding: java/org.xerial.snappy_snappy-java-1.1.8.2.jar (deflated 1%)
adding: java/org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.0.2.jar (deflated 8%)
adding: java/org.apache.spark_spark-sql-kafka-0-10_2.12-3.0.2.jar (deflated 5%)
adding: java/com.github.luben_zstd-jni-1.4.4-3.jar (deflated 1%)
adding: java/org.spark-project.spark_unused-1.0.0.jar (deflated 42%)
adding: java/org.apache.kafka_kafka-clients-2.4.1.jar (deflated 8%)
adding: java/org.apache.commons_commons-pool2-2.6.2.jar (deflated 9%)
adding: version.txt (deflated 59%)
archive.zip is generated!
```
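With archive.zip attached to the application, the packaged Kafka connector can be used from PySpark. A minimal sketch of reading a Kafka topic as a streaming DataFrame, assuming the `spark-sql-kafka-0-10` connector above is on the classpath; the helper name, broker address, and topic parameters are illustrative, not part of any Data Flow API:

```python
def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame of Kafka records, with key and value
    cast from Kafka's binary representation to strings.

    `spark` is an existing SparkSession; in Data Flow the session is
    configured by the service.
    """
    raw = (spark.readStream
           .format("kafka")  # provided by spark-sql-kafka-0-10 in archive.zip
           .option("kafka.bootstrap.servers", bootstrap_servers)
           .option("subscribe", topic)
           .load())
    # Kafka keys and values arrive as binary; cast them before processing.
    return raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```

Without the connector in archive.zip, the `.format("kafka")` call fails at runtime with a "Failed to find data source: kafka" style error, which is usually the first sign that the Java dependency was not packaged.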
Note
In Spark 3.2.1, most of the source code and libraries used to run Data Flow are hidden. You no longer have to match the Data Flow SDK versions, and no longer have third-party dependency conflicts with Data Flow. See the Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to the Cloud tutorial for more information. For the required library versions, see Migrate Data Flow to Spark 3.2.1. Resolve the streaming dependency using the options mentioned earlier. Examples of structured streaming Java dependencies for Python are available on GitHub.
- It might be necessary to shade some Java libraries. If using Spark 2.4.4 or Spark 3.0.2, you might have to shade your libraries. Create a separate Maven project to build a fat JAR that contains all the Java dependencies, and other tweaks such as shading, in one place. Include it as a custom JAR using the dependency packager. For example, shade `oci-java-sdk-addons-sasl`: the Oracle Cloud Infrastructure SDK is compiled against later versions of some third-party libraries, so runtime failures might occur otherwise.

An example Maven project:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>SaslFat</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <spark.version>3.0.2</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>com.oracle.oci.sdk</groupId>
            <artifactId>oci-java-sdk-addons-sasl</artifactId>
            <optional>false</optional>
            <version>1.36.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>3.11.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/>
                    </transformers>
                    <relocations>
                        <relocation>
                            <pattern>com.google.</pattern>
                            <shadedPattern>com.shaded.google.</shadedPattern>
                        </relocation>
                        <relocation>
                            <pattern>com.oracle.bmc.</pattern>
                            <shadedPattern>com.shaded.oracle.bmc.</shadedPattern>
                            <excludes>
                                <exclude>com.oracle.bmc.auth.sasl.*</exclude>
                            </excludes>
                        </relocation>
                    </relocations>
                    <!-- exclude signed Manifests -->
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <artifactSet>
                        <excludes>
                            <exclude>${project.groupId}:${project.artifactId}</exclude>
                        </excludes>
                    </artifactSet>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```

Place `SaslFat-1.0-SNAPSHOT.jar` in the working directory of the dependency packager and run the command:

```
docker run --pull always --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
```

`SaslFat-1.0-SNAPSHOT.jar` is packaged into archive.zip as a Java dependency:
```
adding: java/ (stored 0%)
adding: java/SaslFat-1.0-SNAPSHOT.jar (deflated 8%)
adding: version.txt (deflated 59%)
archive.zip is generated!
```

Or, you can manually create such an archive.zip that contains the `java` folder with `SaslFat-1.0-SNAPSHOT.jar` in it.
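For the manual route, the layout is what matters: the Java dependency must sit under a top-level `java/` folder inside archive.zip, mirroring the structure the dependency packager produces. A small sketch using Python's standard `zipfile` module (the helper name is ours, not part of any Data Flow tooling, and the `version.txt` entry the packager adds is omitted here):

```python
import zipfile
from pathlib import Path

def package_java_dependency(jar_path, archive="archive.zip"):
    """Create an archive.zip with the JAR placed under the java/ folder,
    matching the layout produced by the dependency packager."""
    jar_name = Path(jar_path).name
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Data Flow looks for Java dependencies under java/ in the archive.
        zf.write(jar_path, f"java/{jar_name}")

# Usage: package_java_dependency("SaslFat-1.0-SNAPSHOT.jar")
```

The resulting archive.zip is then uploaded to Object Storage like any packager-built archive.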