Using Spark-Submit Options

Use industry-standard spark-submit compatible options to run applications in Data Flow that use third-party dependencies.

PySpark lets you upload Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to the executors in one of the following ways:

  • Setting the configuration spark.submit.pyFiles.
  • Setting the --py-files option in Spark scripts.

You can use spark-submit compatible options for either approach, as sketched below.
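
For example, the two approaches look similar to the following (the archive name dependencies.zip and the script name app.py are hypothetical placeholders; substitute your own files):

spark-submit --py-files dependencies.zip app.py
spark-submit --conf spark.submit.pyFiles=dependencies.zip app.py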

However, these approaches don't let you add packages built as Wheel files, so they don't let you include dependencies with native code. For those dependencies, if you're using Spark 3.2.1 or later, first create a Conda pack and pass it to Data Flow by setting the spark.archives configuration or by providing --archives as a spark-submit compatible option.

You can create Conda packs for Spark 3.2.1 dependencies by following the instructions at https://conda.github.io/conda-pack/spark.html and https://docs.conda.io/projects/conda-build/en/stable/user-guide/wheel-files.html.
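
As a minimal sketch of building a Conda pack, assuming a hypothetical environment name, Python version, and example package (adapt these to your application), the steps look similar to:

conda create -y -n pyspark_conda_env python=3.8 numpy
conda activate pyspark_conda_env
pip install conda-pack
conda pack -f -o pyspark_conda_env.tar.gz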

For Spark 3.2.1 or later, when you use spark.archives or --archives to provide the Conda packs, they're unpacked under the /opt/spark/work-dir directory by default when the spark-submit command is similar to:

spark-submit --archives pyspark_conda_env.tar.gz app.py

If the spark-submit command is similar to:

spark-submit --archives pyspark_conda_env.tar.gz#your_environment app.py

then it's unpacked under the /opt/spark/work-dir/<your_environment> directory.
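
As an equivalent sketch using the spark.archives configuration instead of the --archives option (the archive name and the your_environment alias are hypothetical placeholders), the command looks similar to:

spark-submit --conf spark.archives=pyspark_conda_env.tar.gz#your_environment app.py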

See Spark-Submit Functionality in Data Flow for more examples on the spark-submit approach.