Providing a Dependency Archive

Your Java or Scala applications might need extra JAR files that you can't, or don't want to, bundle in a Fat JAR. Or you might want to include native code or other assets to make available within the Spark runtime.

When the spark-submit options aren't enough, Data Flow lets you provide a ZIP archive (archive.zip) along with your application to bundle third-party dependencies. The ZIP archive can be created using a Docker-based tool. The archive.zip is installed on all Spark nodes before the application runs. If you construct the archive.zip correctly, the Python libraries are added to the runtime, and the JAR files are added to the Spark classpath. The libraries that are added are isolated to one Run, so they don't interfere with other concurrent Runs or with later Runs. Only one archive can be provided per Run.

Anything in the archive must be compatible with the Data Flow runtime. For example, Data Flow runs on Oracle Linux using particular versions of Java and Python. Binary code compiled for other OSs, or JAR files compiled for other Java versions, might cause the Run to fail. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary Zip files, so you're free to create them any way you want. If you use your own tools, you're responsible for ensuring compatibility.

Dependency archives, like your Spark applications, are loaded to Data Flow. Your Data Flow Application definition contains a link to this archive, which can be overridden at runtime. When you run your Application, the archive is downloaded and installed before the Spark job runs. The archive is private to the Run. This means, for example, that you can concurrently run two instances of the same Application with different dependencies, without any conflicts. Dependencies don't persist between Runs, so there aren't any problems with conflicting versions for other Spark applications that you might run.
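
For example, using the OCI CLI, an Application can link to an archive in Object Storage and a Run can override it. This is a minimal sketch: the OCIDs, bucket, and namespace are placeholders, and the exact parameter names should be confirmed against the CLI reference for your version:

# Create an Application whose definition links to a dependency archive
oci data-flow application create \
    --compartment-id <compartment_ocid> \
    --display-name "my-pyspark-app" \
    --language PYTHON \
    --spark-version 3.5.0 \
    --file-uri oci://<bucket>@<namespace>/my_app.py \
    --archive-uri oci://<bucket>@<namespace>/archive.zip \
    --driver-shape VM.Standard2.1 \
    --executor-shape VM.Standard2.1 \
    --num-executors 1

# Override the archive for a single Run without changing the Application definition
oci data-flow run create \
    --application-id <application_ocid> \
    --compartment-id <compartment_ocid> \
    --display-name "my-run-with-alternate-deps" \
    --archive-uri oci://<bucket>@<namespace>/archive-v2.zip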

Building a Dependency Archive Using the Data Flow Dependency Packager

  1. Download and install Docker.
  2. Download the packager tool image:
    ARM Shape:
    docker pull phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest
    AMD Shape:
    docker pull phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest
  3. For Python dependencies, create a requirements.txt file. For example, it might look similar to:
    numpy==1.18.1
    pandas==1.0.3
    pyarrow==0.14.0
    Note

    Don't include pyspark or py4j. These dependencies are provided by Data Flow, and including them causes Runs to fail.
    The Data Flow Dependency Packager uses Python's pip tool to install all dependencies. If you have Python wheels that can't be downloaded from public sources, place them in a directory beneath where you build the package. Reference them in requirements.txt with a prefix of /opt/dataflow/. For example:
    /opt/dataflow/<my-python-wheel.whl>

    where <my-python-wheel.whl> represents the name of the Python wheel. Pip sees it as a local file and installs it normally.

  4. For Java dependencies, create a file called packages.txt. For example, it might look similar to:
    ml.dmlc:xgboost4j:0.90
    ml.dmlc:xgboost4j-spark:0.90
    https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/8.11/nimbus-jose-jwt-8.11.jar

    The Data Flow Dependency Packager uses Apache Maven to download dependency JAR files. If you have JAR files that can't be downloaded from public sources, place them in a local directory beneath where you build the package. Any JAR files in any subdirectory where you build the package are included in the archive.

  5. Use the Docker container to create the archive.
    Note

    The Python version must be set to 3.11 when using Spark 3.5.0, 3.8 when using Spark 3.2.1, and 3.6 when using Spark 3.0.2 or Spark 2.4.4. In the following commands, <python_version> represents this number.

    If using macOS or Linux, use the command for your shape:

    AMD64:

    docker run --platform linux/amd64 --rm -v $(pwd):/opt/dataflow \
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p <python_version>

    ARM64:

    docker run --platform linux/arm64 --rm -v $(pwd):/opt/dataflow \
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p <python_version>

    If using the Windows Command Prompt as Administrator, use the command for your shape:

    AMD64:

    docker run --platform linux/amd64 --rm -v %CD%:/opt/dataflow ^
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p <python_version>

    ARM64:

    docker run --platform linux/arm64 --rm -v %CD%:/opt/dataflow ^
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p <python_version>

    If using Windows PowerShell as Administrator, use the command for your shape:

    AMD64:

    docker run --platform linux/amd64 --rm -v ${PWD}:/opt/dataflow `
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p <python_version>

    ARM64:

    docker run --platform linux/arm64 --rm -v ${PWD}:/opt/dataflow `
      --pull always -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p <python_version>

    To use Podman to create the archive on Linux, use the command for your shape:

    AMD64:

    podman run --platform linux/amd64 --rm -v $(pwd):/opt/dataflow:Z -u root \
      -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p <python_version>

    ARM64:

    podman run --platform linux/arm64 --rm -v $(pwd):/opt/dataflow:Z -u root \
      -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p <python_version>

    These commands create a file called archive.zip.

    In these commands, pwd represents the working directory, and the -v flag maps that directory on the local file system to /opt/dataflow inside the container.

  6. Optionally, add static content. You might want to include other content in the archive, for example, a data file, an ML model file, or an executable that the Spark program calls at runtime. You do this by adding files to archive.zip after you create it in step 5 (a short shell sketch follows this procedure).

    For Java applications:

    1. Unzip archive.zip.
    2. Add the JAR files to the java/ directory only.
    3. Re-zip the archive.
    4. Upload it to Object Storage.
    For Python applications:
    1. Unzip archive.zip.
    2. Add your local modules to only these three subdirectories of the python/ directory:
      python/lib
      python/lib32
      python/lib64
    3. Re-zip the archive.
    4. Upload it to Object Storage.
    Note

    Only these four directories are allowed for storing the Java and Python dependencies.

    When the Data Flow application runs, the static content is available on any node under the directory where you chose to place it. For example, if you added files under python/lib/ in the archive, they're available in the /opt/dataflow/python/lib/ directory on any node.

  7. Upload archive.zip to Object Storage.
  8. Add the archive to the Application. See Creating a Java or Scala Data Flow Application or Creating a PySpark Data Flow Application for how to do this.
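
As an illustration of step 6 for a Python application, the following minimal shell sketch rebuilds the archive with one extra data file; my_model.json, archive_contents, and archive-with-assets.zip are placeholder names, not part of the tooling:

# Unpack the archive that the packager produced
unzip archive.zip -d archive_contents

# Add static content under one of the permitted python/ subdirectories
cp my_model.json archive_contents/python/lib/

# Re-zip the contents into a new archive and upload that file to Object Storage
(cd archive_contents && zip -r ../archive-with-assets.zip .)

At runtime, the file in this sketch is then available to the Spark program as /opt/dataflow/python/lib/my_model.json on every node.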

The Structure of the Dependency Archive

Dependency archives are ordinary ZIP files. Advanced users might choose to build archives with their own tools rather than using the Data Flow Dependency Packager. A correctly constructed dependency archive has this general outline:

python
python/lib
python/lib/python3.6/<your_library1>
python/lib/python3.6/<your_library2>
python/lib/python3.6/<...>
python/lib/python3.6/<your_libraryN>
python/lib/user
python/lib/user/<your_static_data>
java
java/<your_jar_file1>
java/<...>
java/<your_jar_fileN>
Note

Data Flow extracts archive files under the /opt/dataflow directory.
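
Before uploading an archive built with your own tools, you can compare its contents against this layout by listing them locally, for example:

unzip -l archive.zip

The dependency entries are expected under the python/ and java/ directories shown in the outline.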

Validating an archive.zip File Using the Data Flow Dependency Packager

You can use the Data Flow Dependency Packager to validate an archive.zip file locally, before uploading the file to Object Storage.

Navigate to the directory containing the archive.zip file, and run the command for your shape:

ARM64:
docker run --platform linux/arm64 --rm -v $(pwd):/opt/dataflow --pull always \
    -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_arm64_v8:latest -p 3.11 --validate archive.zip
AMD64:
docker run --platform linux/amd64 --rm -v $(pwd):/opt/dataflow --pull always \
    -it phx.ocir.io/axmemlgtri2a/dataflow/dependency-packager-linux_x86_64:latest -p 3.11 --validate archive.zip

Example Requirements.txt and Packages.txt Files

This example of a requirements.txt file includes the Oracle Cloud Infrastructure Python SDK (the oci package) version 2.14.3 in a Data Flow Application:
-i https://pypi.org/simple
certifi==2020.4.5.1
cffi==1.14.0
configparser==4.0.2
cryptography==2.8
oci==2.14.3
pycparser==2.20
pyopenssl==19.1.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
This example of a requirements.txt file includes a mix of PyPI sources, web sources, and local sources for Python wheel files:
-i https://pypi.org/simple
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
cymem==2.0.3
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en-core-web-sm
idna==2.9
importlib-metadata==1.6.0 ; python_version < '3.8'
murmurhash==1.0.2
numpy==1.18.3
plac==1.1.3
preshed==3.0.2
requests==2.23.0
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
tqdm==4.45.0
urllib3==1.25.9
wasabi==0.6.0
zipp==3.1.0
/opt/dataflow/mywheel-0.1-py3-none-any.whl
To connect to Oracle databases such as ADW, you must include the Oracle JDBC JAR files. Download and extract the compatible driver JAR files into a directory under where you build the package. For example, to package the Oracle 18.3 (18c) JDBC driver, ensure that all these JAR files are present (a sketch of the local layout follows the list):
ojdbc8-18.3.jar
oraclepki-18.3.jar
osdt_cert-18.3.jar
osdt_core-18.3.jar
ucp-18.3.jar
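
As a sketch, assuming the driver JAR files have already been downloaded into the build directory and jdbc is an arbitrary subdirectory name, the layout before running the packager might look like this:

# Any JAR files placed in a subdirectory of the build directory are picked up by the packager
mkdir -p jdbc
cp ojdbc8-18.3.jar oraclepki-18.3.jar osdt_cert-18.3.jar osdt_core-18.3.jar ucp-18.3.jar jdbc/

After the packager runs, these JAR files are included under java/ in archive.zip and are added to the Spark classpath for the Run.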