With Data Flow, you can configure Data Science notebook sessions to run applications interactively against Data Flow.
Data Flow uses fully managed Jupyter notebooks to enable data scientists and data engineers to create, visualize, collaborate on, and debug data engineering and data science applications. You can write these applications in Python, Scala, and PySpark.
When you connect a Data Science notebook session to Data Flow, the Data Flow kernels and applications run on Oracle Cloud Infrastructure Data Flow.
Apache Spark is a distributed compute system designed to process data at scale. It supports large-scale SQL, batch and stream processing, and machine learning tasks. Spark SQL provides database-like support: to query structured data, use Spark SQL, which is an ANSI-standard SQL implementation.
Data Flow Sessions support autoscaling Data Flow cluster capabilities. For more information, see Autoscaling in the Data Flow documentation.
Data Flow Sessions support the use of conda environments as customizable Spark runtime
environments.
Figure 1. Data Science Notebook to Data Flow
Limitations
Data Flow Sessions last up to 7 days or 10,080
mins (maxDurationInMinutes).
Data Flow Sessions have a default idle timeout
value of 480 mins (8 hours) (idleTimeoutInMinutes). You can configure a different
value.
The Data Flow Session is only available through a
Data Science Notebook Session.
The notebook session must be in the region where the service was enabled
for the tenancy.
The notebook session must be in the compartment of the dynamic group of
notebook sessions.
Install and activate the pyspark32_p38_cpu_v3 conda environment in the notebook
session:
odsc conda install -s pyspark32_p38_cpu_v3
Note
We recommend installing the updated conda environment
pyspark32_p38_cpu_v3, as it includes support for automatically restarting
sessions that have stopped because they have exceeded the maximum timeout
duration.
The previous stable conda environment was
pyspark32_p38_cpu_v2.
Activate the pyspark32_p38_cpu_v3 conda environment:
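A minimal sketch of the activation command, assuming the environment was installed to the default path in the notebook session (adjust the path if your installation differs):
# Path assumes the default conda environment location in a notebook session.
source activate /home/datascience/conda/pyspark32_p38_cpu_v3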
For a list of all the supported commands, use the %help
command.
The commands in the following steps apply to both Spark 3.5.0 and Spark 3.2.1. Spark 3.5.0 is used in the examples. Set the value of sparkVersion in the session configuration according to the Spark version you're using.
Set up authentication in ADS.
The ADS SDK controls the authentication type used in Data Flow Magic. API key authentication is used by default. To change the authentication type, use the ads.set_auth("resource_principal") command:
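import ads

# Switch from the default API key authentication to resource principal authentication.
ads.set_auth("resource_principal")
If you haven't created a Data Flow session yet, the following is a hedged sketch of how one might be created with Data Flow Magic's %create_session command (use %help to confirm the exact options). Every OCID, shape, bucket, and timeout value is a placeholder or assumption, including idleTimeoutInMinutes and maxDurationInMinutes, which correspond to the limits described earlier; adapt it to your tenancy:
# A sketch only: replace the placeholder values before running.
%load_ext dataflow.magics
%create_session -l python -c '{"compartmentId":"<compartment_ocid>","displayName":"DataFlowSessionExample","sparkVersion":"3.5.0","driverShape":"VM.Standard.E4.Flex","driverShapeConfig":{"ocpus":1,"memoryInGBs":16},"executorShape":"VM.Standard.E4.Flex","executorShapeConfig":{"ocpus":1,"memoryInGBs":16},"numExecutors":1,"logsBucketUri":"oci://<bucket>@<namespace>/","idleTimeoutInMinutes":480,"maxDurationInMinutes":1440,"type":"SESSION"}'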
Install the pre-trained spark-nlp models and pipelines.
If you need any pre-trained spark-nlp models, download them and unzip them in the conda environment folder. Data Flow doesn't yet support egress to the public internet, so you can't dynamically download pre-trained models from AWS S3 in Data Flow.
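For example, this is a hedged sketch of the download step, run from the notebook session (which does have internet access); the model name and download URL are placeholders:
# Create a models folder inside the conda environment so the files travel with it.
mkdir -p /home/datascience/conda/pyspark32_p38_cpu_v3/sparknlp-models
# Download and unzip the pre-trained model; substitute the real model URL and name.
curl -L -o /tmp/explain_document_dl_en.zip "<pretrained-model-download-url>"
unzip /tmp/explain_document_dl_en.zip -d /home/datascience/conda/pyspark32_p38_cpu_v3/sparknlp-models/explain_document_dl_en/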
Test it with a snippet of code from the spark-nlp GitHub repository:
%%spark
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
# Start SparkSession with Spark NLP
# The start() function has 3 parameters: gpu, m1, and memory
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(m1=True) will start the session with macOS M1 support
# sparknlp.start(memory="16G") to change the default driver memory in SparkSession
spark = sparknlp.start()
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en', disk_location="/opt/spark/work-dir/conda/sparknlp-models/")
# Your testing dataset
text = """
Lawrence Joseph Ellison (born August 17, 1944) is an American business magnate and investor who is the co-founder,
executive chairman, chief technology officer (CTO) and former chief executive officer (CEO) of the
American computer technology company Oracle Corporation.[2] As of September 2022, he was listed by
Bloomberg Billionaires Index as the ninth-wealthiest person in the world, with an estimated
fortune of $93 billion.[3] Ellison is also known for his 98% ownership stake in Lanai,
the sixth-largest island in the Hawaiian Archipelago.[4]
"""
# Annotate your testing dataset
result = pipeline.annotate(text)
# What's in the pipeline
print(list(result.keys()))
# Check the results
print(result['entities'])
The variable sc represents the Spark context and is available when the %%spark magic command is used. The following cell is a toy example of how to use sc in a Data Flow Magic cell. The cell calls the .parallelize() method, which creates an RDD, numbers, from a list of numbers. Information about the RDD is printed: the .toDebugString() method returns a description of the RDD.
%%spark
print(sc.version)
numbers = sc.parallelize([4, 3, 2, 1])
print(f"First element of numbers is {numbers.first()}")
print(f"The RDD, numbers, has the following description\n{numbers.toDebugString()}")
Spark SQL
The -c sql option lets you run Spark SQL commands in a cell. In this section, the Citi Bike dataset is used. The following cell reads the dataset into a Spark dataframe and saves it as a table, which is then used to demonstrate Spark SQL.
The Citi Bike dataset was uploaded to Object Storage. You might need to do the same in your realm.
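The read step itself is shown here as a minimal sketch; the bucket, namespace, and file name are placeholders for your own Object Storage location, and header=False is assumed because the queries that follow use the default _c0, _c4, and similar column names:
%%spark
# Read the Citi Bike CSV from Object Storage (placeholder URI) into a Spark dataframe.
df_bike_trips = spark.read.csv("oci://<bucket>@<namespace>/citibike-tripdata.csv", header=False, inferSchema=True)
# Register it as a table so it can be queried with Spark SQL.
df_bike_trips.createOrReplaceTempView("bike_trips")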
The next example uses the -c sql option to tell Data Flow Magic that the contents of the cell are Spark SQL. The -o <variable> option takes the results of the Spark SQL operation and stores them in the defined variable. In this case, df_bike_trips is a Pandas dataframe that's available for use in the notebook.
%%spark -c sql -o df_bike_trips
SELECT _c0 AS Duration, _c4 AS Start_Station, _c8 AS End_Station, _c11 AS Bike_ID FROM bike_trips;
Print the first few rows of
data:
df_bike_trips.head()
Similarly, you can use sqlContext to query the table:
%%spark
df_bike_trips_2 = sqlContext.sql("SELECT * FROM bike_trips")
df_bike_trips_2.show()
Finally, you can list the available tables:
%%spark -c sql
SHOW TABLES
Auto-visualization Widget
Data Flow Magic comes with autovizwidget, which enables the visualization of Pandas dataframes. The display_dataframe() function takes a Pandas dataframe as a parameter and generates an interactive GUI in the notebook. It has tabs that show the data in various forms, such as a table, pie charts, scatter plots, and area and bar graphs.
The following cell calls display_dataframe() with the df_bike_trips dataframe that was created in the Spark SQL section of the notebook:
from autovizwidget.widget.utils import display_dataframe
display_dataframe(df_bike_trips)
Matplotlib
A common task that data scientists perform is to visualize their data. With large datasets, it's usually not possible, and almost always not preferable, to pull the data from the Data Flow Spark cluster into the notebook session. This example shows how to use server-side resources to generate a plot and include it in the notebook.
The df_bike_trips dataframe is defined in the session and is reused. To produce a Matplotlib plot, import the required libraries and generate the plot. Use the %matplot plt magic command to display the plot in the notebook, even though it's rendered on the server side:
%%spark
import matplotlib.pyplot as plt
df_bike_trips.groupby("_c4").count().toPandas().plot.bar(x="_c4", y="count")
%matplot plt