Reference section for Oracle Cloud Infrastructure Data Flow including a
Glossary of terms and links to REST API documentation.
Glossary
Concept Glossary
Apache Spark: Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Spark Application: A Spark Application uses the Spark API to perform
distributed data processing tasks. Spark Applications can be
written in several languages, including Java and Python, among others.
They are packaged as files, such as JAR files, that run
within the Spark framework.
Data Flow Application:
A Data Flow Application is a template that bundles
a Spark Application with a Spark version, default resource sizing, and
default application parameters.
Data Flow Run:
A Data Flow Application can be
run many times. Each execution creates a Data Flow Run object.
You can optionally override resource sizes and parameters at
run time. A Data Flow Run tracks information relevant to the Application's
execution, such as statistics and logs.
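The relationship between a Data Flow Application and its Runs can be sketched in a few lines of Python. This is only an illustration of the template-plus-override idea described above, not the OCI SDK: the class names `ApplicationTemplate` and `RunRecord`, the function `start_run`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ApplicationTemplate:
    """Hypothetical stand-in for a Data Flow Application (not an OCI SDK class)."""
    spark_artifact: str                 # for example, a JAR file
    spark_version: str
    default_num_executors: int          # simplified "default resource sizing"
    default_parameters: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for a Data Flow Run."""
    application: ApplicationTemplate
    num_executors: int
    parameters: Dict[str, str]


def start_run(
    app: ApplicationTemplate,
    num_executors: Optional[int] = None,
    parameters: Optional[Dict[str, str]] = None,
) -> RunRecord:
    """Create a new run, applying optional run-time overrides to the defaults."""
    # Overrides win over the template defaults; the template itself is not modified.
    merged = {**app.default_parameters, **(parameters or {})}
    size = app.default_num_executors if num_executors is None else num_executors
    return RunRecord(application=app, num_executors=size, parameters=merged)


app = ApplicationTemplate("example.jar", "3.x", 2, {"input": "a", "mode": "full"})
first = start_run(app)                                  # uses the defaults
second = start_run(app, num_executors=8, parameters={"mode": "delta"})
```

In this sketch one template produces two independent run records, and the second run changes only the values it overrides.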
Spark UI: The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. You can access the Spark UI for any Data Flow Run, subject to the Run's authorization policies.
Spark Logs: Spark generates log files that are useful for debugging and diagnostics. Each
Data Flow Run
automatically stores its log files, which you can access through the
Spark UI or the API, subject to the Run's authorization
policies.
Free-form resource requests aren't supported when using the Livy API.
The implementation accepts two new parameters:
driverShape
executorShape
If one or both are missing, a warning is logged. If an invalid shape is passed,
the resulting error lists the supported shapes. A shapes endpoint
returns the valid shapes with their sizes and types.
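The behavior described for driverShape and executorShape can be illustrated with a small client-side check. This is a sketch under stated assumptions, not the service's implementation: the shape names in `SUPPORTED_SHAPES` are placeholders (the real list would come from the shapes endpoint), and `check_shapes` and `InvalidShapeError` are hypothetical names.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("livy_shape_sketch")

# Placeholder names and sizes only; these are NOT real Data Flow shapes.
SUPPORTED_SHAPES = {
    "EXAMPLE.Small": "1 core, 8 GB",
    "EXAMPLE.Large": "8 cores, 64 GB",
}


class InvalidShapeError(ValueError):
    """Raised when a request names a shape that is not in the supported list."""


def check_shapes(request: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Mimic the documented behavior: warn when a shape is missing, reject invalid ones."""
    resolved: Dict[str, Optional[str]] = {}
    for key in ("driverShape", "executorShape"):
        shape = request.get(key)
        if shape is None:
            # A missing parameter only produces a warning.
            logger.warning("%s is missing from the request", key)
        elif shape not in SUPPORTED_SHAPES:
            # The error message lists every supported shape.
            supported = ", ".join(sorted(SUPPORTED_SHAPES))
            raise InvalidShapeError(
                f"{key}={shape!r} is not supported; supported shapes: {supported}"
            )
        resolved[key] = shape
    return resolved
```

Calling `check_shapes({"driverShape": "EXAMPLE.Small"})` logs a warning for the missing executorShape and returns it as `None`, while an unknown shape raises an error that names the supported ones.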
IP Address Allowlist
If you read or write data using Data Flow without a private
networking solution such as a VCN, you need to allowlist the IP addresses that Data Flow runs from. You can allowlist
the following IP addresses: