PySpark

A description of the PySpark 3.2 and Feature Store on Python 3.8 (version 3.0) conda environment.


Released	February 9, 2024
Description	The Feature Store conda environment includes the feature store package, which provides a centralized solution for data transformation and access during training and serving, establishing a standardized pipeline for data ingestion and querying. It also includes the Data Flow magic commands to manage the lifecycle of a remote Data Flow Session cluster and remotely run Spark code snippets in the cluster. This conda environment supports ingesting data in the Delta format, making it a first-class citizen within the system. Oracle Data Science feature store supports the DCAT Hive Metastore, which serves as a registry for schema metadata and lets users register and manage the metadata associated with schemas. To get started with the Feature Store environment, review the getting-started notebook, using the Launcher.
Python Version	3.8
Slug	`fspyspark32_p38_cpu_v3`
Object Storage Path	`oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/3.0/fspyspark32_p38_cpu_v3`
Top Libraries	Data Flow Sparkmagic (1.0.14), oracle-ads (v2.10.0), oraclejdk (v8), pyspark (v3.2.1), sparksql-magic (v0.0.3), oracle-ml-insights (v1.0.4), spark-nlp (v4.2.1), transformers (v4.32.1), langchain (v0.0.267). For a complete list of preinstalled Python libraries, see fspyspark32_p38_cpu_v3.txt.
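
The Object Storage Path values listed for each environment follow the `oci://<bucket>@<namespace>/<object name>` form. As a minimal sketch of how such a path breaks down, the helper below splits one into its parts using only the Python standard library. The names `parse_oci_path` and `ObjectStoragePath` are hypothetical and introduced here for illustration; they are not part of any Oracle library.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class ObjectStoragePath(NamedTuple):
    """Parts of an oci:// URI (hypothetical helper type, for illustration only)."""
    bucket: str
    namespace: str
    object_name: str


def parse_oci_path(uri: str) -> ObjectStoragePath:
    """Split 'oci://<bucket>@<namespace>/<object name>' into its three parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "oci" or "@" not in parsed.netloc:
        raise ValueError(f"not an oci://bucket@namespace/... URI: {uri!r}")
    # The authority part of the URI holds '<bucket>@<namespace>'.
    bucket, namespace = parsed.netloc.split("@", 1)
    # The remainder of the URI is the object name (without the leading slash).
    return ObjectStoragePath(bucket, namespace, parsed.path.lstrip("/"))


if __name__ == "__main__":
    path = parse_oci_path(
        "oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/"
        "PySpark_3.2_and_Feature_Store/3.0/fspyspark32_p38_cpu_v3"
    )
    print(path.bucket)
    print(path.namespace)
    print(path.object_name)
```

For the path above, the bucket is `service-conda-packs`, the namespace is `id19sfcrra6z`, and the object name ends with the environment's slug.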

A description of the PySpark 3.5 and Data Flow CPU on Python 3.11 (version 1.0) conda environment.


Released	September 25, 2024
Description	This conda environment includes the Data Flow magic commands to manage the lifecycle of a remote Data Flow Session cluster and remotely execute Spark code snippets in the cluster. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. PySpark leverages the full power of a notebook session by using parallel computing. Data Flow is also integrated with the Data Catalog Hive Metastore. To get started with this conda environment, review the Getting Started notebook, using the Launcher.
Python Version	3.11
Object Storage Path	`oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.5_and_Data_Flow/1.0/pyspark35_p311_cpu_x86_64_v1`
Slug	`pyspark35_p311_cpu_x86_64_v1`
Top Libraries	Data Flow Sparkmagic (1.0.88), oracle-ads (2.11.17), oraclejdk (11), pyspark (3.5.0), python (3.11), sparksql-magic (0.0.3), spark-nlp (v5.3.3). For a complete list of preinstalled Python libraries, see pyspark35_p311_cpu_v1.txt.
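
The descriptions on this page mention using PySparkSQL to analyze structured data. The following is a minimal sketch of that pattern, run against a local Spark session and a small in-memory data set rather than Object Storage or a Data Flow Session cluster. The view name `events`, the column names, and the sample rows are all hypothetical. It assumes `pyspark` and a compatible JDK are available, as they are in these environments.

```python
# Sketch only: the view name, column names, and sample rows are hypothetical.
SAMPLE_ROWS = [("a", 1), ("a", 3), ("b", 5)]
QUERY = "SELECT key, SUM(value) AS total FROM events GROUP BY key ORDER BY key"


def run_query(spark):
    """Register the sample rows as a temporary view and aggregate them with Spark SQL."""
    df = spark.createDataFrame(SAMPLE_ROWS, schema=["key", "value"])
    df.createOrReplaceTempView("events")
    return [(row["key"], row["total"]) for row in spark.sql(QUERY).collect()]


def main():
    # Imported here so the module can still be loaded where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")  # local session; a remote cluster is configured differently
        .appName("pyspark-sql-sketch")
        .getOrCreate()
    )
    try:
        print(run_query(spark))
    finally:
        spark.stop()
```

Calling `main()` in a notebook cell runs the aggregation locally. For data in Object Storage, the DataFrame would instead come from a reader such as `spark.read.parquet(...)`, with the connection details depending on how the session is configured.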

A description of the PySpark 3.2 and Data Flow CPU on Python 3.8 (version 3.0) conda environment.


Released	July 10, 2023
Description	This conda environment includes the Data Flow magic commands to manage the lifecycle of a remote Data Flow Session cluster and remotely run Spark code snippets in the cluster. This conda environment allows data scientists to leverage Apache Spark, including the machine learning algorithms in MLlib. Use PySparkSQL to analyze structured and semi-structured data stored in Object Storage. PySpark leverages the full power of a notebook session by using parallel computing. Data Flow is also integrated with the Data Catalog Hive Metastore. To get started with this conda environment, review the Getting Started notebook, using the Launcher.
Python Version	3.8
Object Storage Path	`oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Data_Flow/3.0/pyspark32_p38_cpu_v3`
Slug	`pyspark32_p38_cpu_v3`
Top Libraries	Data Flow Sparkmagic (1.0.14), oracle-ads (v2.8.7), oraclejdk (v8), pyspark (v3.2.1), sparksql-magic (v0.0.3), spark-nlp (v4.2.1). For a complete list of preinstalled Python libraries, see pyspark32_p38_cpu_v3.txt.

A description of the PySpark 3.2 and Data Flow CPU on Python 3.8 (version 2.0) conda environment.


Released	December 1, 2022
Description	This conda environment includes the Data Flow magic commands to manage the lifecycle of a remote Data Flow Session cluster and remotely execute Spark code snippets in the cluster. This conda environment allows data scientists to leverage Apache Spark, including the machine learning algorithms in MLlib. Use PySparkSQL to analyze structured and semi-structured data stored in Object Storage. PySpark leverages the full power of a notebook session by using parallel computing. Data Flow is also integrated with the Data Catalog Hive Metastore. To get started with this conda environment, review the Getting Started notebook, using the Launcher.
Python Version	3.8
Object Storage Path	`oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Data_Flow/2.0/pyspark32_p38_cpu_v2`
Slug	`pyspark32_p38_cpu_v2`
Top Libraries	Data Flow Sparkmagic (1.0.7.e08b59192e8), oracle-ads (v2.6.8), oraclejdk (v8), pyspark (v3.2.1), sparksql-magic (v0.0.3), spark-nlp (v4.2.1). For a complete list of preinstalled Python libraries, see pyspark32_p38_cpu_v2.txt.