Using Spark-Submit Options
Use industry-standard, spark-submit compatible options to run applications with third-party dependencies in Data Flow.
PySpark lets you upload Python files (.py), zipped Python packages (.zip), and Egg files (.egg) to the executors in one of the following ways:
- Setting the configuration spark.submit.pyFiles.
- Setting the --py-files option in Spark scripts.
You can use Spark-Submit compatible options for each of these options.
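For example, both approaches can be expressed on the spark-submit command line. The file names deps.zip and app.py below are placeholders for your own artifacts, not names used by Data Flow:
spark-submit --py-files deps.zip app.py
spark-submit --conf spark.submit.pyFiles=deps.zip app.py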
However, these approaches don't let you add packages built as Wheels, so they don't let you include dependencies with native code. For those dependencies, if you're using Spark 3.2.1 or later, first create a Conda pack and pass that Conda pack to Data Flow by adding a configuration for spark.archives or by providing --archives as a Spark-Submit compatible option.
You can create Conda packs for Spark 3.2.1 dependencies by following the instructions at https://conda.github.io/conda-pack/spark.html and https://docs.conda.io/projects/conda-build/en/stable/user-guide/wheel-files.html.
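As a rough sketch of what those instructions cover, the commands below build an environment and pack it into an archive. The environment name pyspark_conda_env and the packages listed are illustrative only; pick a Python version and packages that match your Data Flow Spark version:
# create an environment containing your dependencies plus conda-pack itself
conda create -y -n pyspark_conda_env -c conda-forge numpy conda-pack
conda activate pyspark_conda_env
# pack the active environment into an archive you can pass to Data Flow
conda pack -f -o pyspark_conda_env.tar.gz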
For Spark 3.2.1 or later, when you use spark.archives or --archives to provide the Conda packs, they're unpacked under the /opt/spark/work-dir directory by default when the spark-submit command is similar to:
spark-submit --archives pyspark_conda_env.tar.gz app.py
If the spark-submit command is instead similar to:
spark-submit --archives pyspark_conda_env.tar.gz#your_environment app.py
then it's unpacked under the /opt/spark/work-dir/<your_environment> directory.
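Following the conda-pack Spark instructions linked above, you typically also point the Python interpreter at the unpacked environment so the executors use the packed dependencies. Whether your job needs this, and the exact path, depend on your setup; the environment name and relative path below are an illustration, not Data Flow-specific requirements:
spark-submit --conf spark.pyspark.python=./your_environment/bin/python --archives pyspark_conda_env.tar.gz#your_environment app.py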
See Spark-Submit Functionality in Data Flow for more examples of the spark-submit approach.