Follow these steps to integrate Conda pack with Data Flow.
Conda is one of the most widely used Python package management systems. With conda-pack, PySpark users can ship third-party Python packages by bundling them in a Conda environment. If you're using Data Flow with Spark 3.2.1, you can integrate it with Conda pack.
- Generate your environment's conda pack tar.gz file by installing and using Conda Pack for Python 3.8.13. You must use Python 3.8.13, as it's the supported version with Spark 3.2.1. For more information on supported versions, see the Before you Begin with Data Flow section.
For example, these steps create a sample conda pack file with Python 3.8 and NumPy:
- Log in to a docker container with the image oraclelinux:7-slim, or use an Oracle Linux 7 machine.
docker run -it --entrypoint /bin/bash oraclelinux:7-slim
- Download and run the Anaconda Linux installer.
curl -O https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
chmod u+x Anaconda3-2022.05-Linux-x86_64.sh
./Anaconda3-2022.05-Linux-x86_64.sh
- Create a Python 3.8 environment.
source ~/.bashrc
conda create -n mypython3.8 python=3.8
conda activate mypython3.8
- Install NumPy and package the environment with conda-pack.
pip install numpy
conda pack -f -o mypython3.8.tar.gz
- Copy the tar.gz file from the docker container to your local machine.
docker cp <container_id>:/mypython3.8.tar.gz .
- Upload your local tar.gz file to Object Storage (one way to do this is sketched below). Make a note of the URI to the file. It's similar to:
oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz
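For example, one way to perform the upload is with the OCI Python SDK. This is a minimal sketch, assuming the SDK is installed and configured through ~/.oci/config; the bucket name and object path are placeholders you must replace:

import oci

# Load the default OCI configuration and create an Object Storage client.
config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

# Upload the conda pack archive; replace the bucket name and object path.
with open("mypython3.8.tar.gz", "rb") as f:
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="<bucket-name>",
        object_name="<path>/conda_env.tar.gz",
        put_object_body=f,
    )

You can also upload the file through the Console or the OCI CLI; any method that places the archive in a bucket your Data Flow Application can read works.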
- In your Applications and Runs to be created or updated, set spark.archives to:
oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz#conda
The #conda fragment tells Data Flow to set conda as the effective environment name at /opt/spark/work-dir/conda/ and to use the Python version given at /opt/spark/work-dir/conda/bin/python3 for the driver and executor pods.
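In Data Flow, spark.archives is normally set in the Application's or Run's Spark configuration properties rather than in application code. Purely for illustration, the equivalent setting expressed with a SparkSession builder looks like the following sketch, where the bucket, namespace, and path are placeholders:

from pyspark.sql import SparkSession

# Illustrative only: in Data Flow, set spark.archives on the Application or Run.
# The #conda fragment names the directory the archive is unpacked into on the pods.
spark = (
    SparkSession.builder
    .config(
        "spark.archives",
        "oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz#conda",
    )
    .getOrCreate()
)

# The driver and executors then use the Python at /opt/spark/work-dir/conda/bin/python3.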
- (Optional) Alternatively, you can use your own environment name, but that requires setting PYSPARK_PYTHON in your code, as in the sketch following this step. For more information, see Using Conda with Python Packaging.
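As a rough sketch of that alternative, assume the archive was attached as oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz#myenv, where myenv is an illustrative environment name; PYSPARK_PYTHON must then point at the Python interpreter inside the unpacked environment before the SparkSession is created:

import os
from pyspark.sql import SparkSession

# "myenv" is a hypothetical name taken from the #myenv fragment in spark.archives.
# Following the path convention above, the archive is unpacked under /opt/spark/work-dir/<name>/.
os.environ["PYSPARK_PYTHON"] = "/opt/spark/work-dir/myenv/bin/python3"

spark = SparkSession.builder.getOrCreate()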