Best Practices for Building Java Spark Streaming Applications
Tips and best practices for configuring Java build projects in Data Flow.
Maven is used for the following examples to show the various configuration possibilities.
- Use a consistent version of Spark across your libraries and Data Flow
- The easiest way to do this is define a common variable and reuse it. In the following
example, the common variable is called
spark.version
.<properties> ... <spark.version>3.0.2</spark.version> ... </properties> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.12</artifactId> <version>${spark.version}</version> <scope>provided</scope> </dependency>
Note
Ensure the Scala version inartifactId
is suitable for the version of Spark. In the example,artifactId
is set tospark-core_2.12
for Scala 2.12.
- Include Spark binaries that are already part of the standard Spark bundle with the
scope
ofprovided
, to avoid any code duplication. - Expanding the previous example:
<properties> ... <spark.version>3.0.2</spark.version> ... </properties> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.12</artifactId> <version>${spark.version}</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.12</artifactId> <version>${spark.version}</version> <scope>provided</scope> </dependency> ...
- Package Spark binaries that aren't part of the standard bundle with your Spark application.
- This example includes
spark-sql-kafka-0-10_2.12
inartifactId
.<properties> ... <spark.version>3.0.2</spark.version> ... </properties> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql-kafka-0-10_2.12</artifactId> <version>${spark.version}</version> </dependency> ...
- The Oracle Cloud Infrastructure SDK might be compiled against a different version of third-party libraries, which might result in runtime failures. To avoid such failures, package the Oracle Cloud Infrastructure SDK and third-party libraries with your Spark application and move any newer third-party libraries into a shaded namespace.
- For example:
<dependencies> <dependency> <groupId>com.oracle.oci.sdk</groupId> <artifactId> </artifactId> <optional>false</optional> <version>1.36.1</version> </dependency> <dependency> <groupId>com.google.protobuf</groupId> <artifactId>protobuf-java</artifactId> <version>3.11.1</version> </dependency> </dependencies> ... <build> <plugins> ... <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>3.2.4</version> <configuration> <!-- The final uber jar file name will not have a version component. --> <finalName>${project.artifactId}</finalName> <createDependencyReducedPom>false</createDependencyReducedPom> <transformers> <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/> <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/> </transformers> <relocations> <relocation> <pattern>com.google.</pattern> <shadedPattern>com.shaded.google.</shadedPattern> </relocation> <relocation> <pattern>com.oracle.bmc.</pattern> <shadedPattern>com.shaded.oracle.bmc.</shadedPattern> </relocation> </relocations> <!-- exclude signed Manifests --> <filters> <filter> <artifact>*:*</artifact> <excludes> <exclude>META-INF/*.SF</exclude> <exclude>META-INF/*.DSA</exclude> <exclude>META-INF/*.RSA</exclude> </excludes> </filter> </filters> </configuration> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> </execution> </executions> </plugin> ...
Sample POM.xml File for Java Spark Streaming Application
A sample POM.xml for use when building a Java-based Spark streaming application for use with Data Flow.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>StructuredKafkaWordCount</artifactId>
<version>1.0.0-SNAPSHOT</version>
<properties>
<spark.version>3.2.1</spark.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>com.oracle.oci.sdk</groupId>
<artifactId>oci-java-sdk-addons-sasl</artifactId>
<optional>false</optional>
<version>1.36.1</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.11.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12/3.2.1</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<skipAssembly>false</skipAssembly>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<configuration>
<!-- The final uber jar file name will not have a version component. -->
<finalName>${project.artifactId}</finalName>
<createDependencyReducedPom>false</createDependencyReducedPom>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/>
</transformers>
<relocations>
<relocation>
<pattern>com.google.</pattern>
<shadedPattern>com.shaded.google.</shadedPattern>
</relocation>
<relocation>
<pattern>com.oracle.bmc.</pattern>
<shadedPattern>com.shaded.oracle.bmc.</shadedPattern>
</relocation>
</relocations>
<!-- exclude signed Manifests -->
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>