Getting Started with Oracle Cloud Infrastructure Data Flow

This tutorial introduces you to Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark application at any scale with no infrastructure to deploy or manage.

If you've used Spark before, you'll get more out of this tutorial, but no prior Spark knowledge is required. All Spark applications and data have been provided for you. This tutorial shows how Data Flow makes running Spark applications easy, repeatable, secure, and simple to share across the enterprise.

In this tutorial you learn:
  1. How to use Java to perform ETL in a Data Flow Application.
  2. How to use SparkSQL in a SQL Application.
  3. How to create and run a Python Application to perform a simple machine learning task.

You can also perform this tutorial using spark-submit from the CLI or using spark-submit with the Java SDK.
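As a rough illustration of the SDK route, the following sketch submits a spark-submit compatible run through the OCI Python SDK. The compartment OCID, bucket, namespace, JAR name, and main class are placeholders, and the client and field names shown are assumptions; check them against the current OCI SDK reference before relying on them.

  # Sketch: submit a spark-submit compatible run with the OCI Python SDK.
  # All identifiers below are placeholders, not the tutorial's actual values.
  import oci

  config = oci.config.from_file()  # reads ~/.oci/config by default
  client = oci.data_flow.DataFlowClient(config)

  run_details = oci.data_flow.models.CreateRunDetails(
      compartment_id="ocid1.compartment.oc1..<placeholder>",
      display_name="tutorial-spark-submit-run",
      # spark-submit style command string executed by Data Flow
      execute="--class example.Convert oci://<bucket>@<namespace>/your-etl-app.jar",
  )

  response = client.create_run(run_details)
  print(response.data.id, response.data.lifecycle_state)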

Before You Begin

To successfully perform this tutorial, you must have Set Up Your Tenancy and be able to Access Data Flow.

1. ETL with Java

An exercise to learn how to create a Java application in Data Flow.

The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
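The Java application in this exercise converts a raw CSV dataset in Object Storage into an optimized format that the later exercises read. As a rough sketch of that ETL step, the following PySpark fragment shows the same kind of CSV-to-Parquet conversion; the bucket, namespace, and object names are placeholders rather than the tutorial's actual paths.

  # Sketch of the CSV-to-Parquet conversion performed by the Java ETL exercise.
  # Paths are placeholders; substitute your own bucket, namespace, and objects.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

  # Read the raw CSV input from Object Storage (header row, inferred types).
  raw = (
      spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("oci://<bucket>@<namespace>/airbnb-listings.csv")
  )

  # Write the cleaned dataset back to Object Storage as Parquet,
  # which is what the SQL and machine learning exercises read as input.
  raw.write.mode("overwrite").parquet("oci://<bucket>@<namespace>/optimized_listings")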

2. SparkSQL Made Simple

In this exercise, you run a SQL script to perform basic profiling of a dataset.

This exercise uses the output you generated in 1. ETL with Java. You must have completed it successfully before you can try this one.

The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
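As a rough sketch of the kind of profiling the SQL script performs, the following PySpark fragment registers the Parquet output of exercise 1 as a view and runs a simple aggregate query. The path and the neighbourhood and price column names are assumptions about the dataset, not the tutorial's exact script.

  # Sketch of basic profiling with SparkSQL over the output of exercise 1.
  # The path and column names are assumptions, not the tutorial's exact script.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("sql-profiling-sketch").getOrCreate()

  # Register the ETL output as a temporary view so it can be queried with SQL.
  listings = spark.read.parquet("oci://<bucket>@<namespace>/optimized_listings")
  listings.createOrReplaceTempView("listings")

  # Basic profiling: row counts and average price per neighbourhood.
  profile = spark.sql("""
      SELECT neighbourhood,
             COUNT(*)   AS listing_count,
             AVG(price) AS avg_price
      FROM listings
      GROUP BY neighbourhood
      ORDER BY listing_count DESC
  """)
  profile.show(20, truncate=False)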

3. Machine Learning with PySpark

Use PySpark to perform a simple machine learning task over input data.

This exercise uses the output from 1. ETL with Java as its input data. You must have successfully completed the first exercise before you can try this one. This time, your objective is to identify the best bargains among the various Airbnb listings using Spark machine learning algorithms.

The steps here are for using the Console UI. You can complete this exercise using spark-submit from the CLI or spark-submit with the Java SDK.
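As a rough sketch of one way to find bargains with Spark machine learning, the following PySpark fragment fits a linear regression that predicts price from a few listing features and then ranks listings by how far their actual price falls below the prediction. The feature and column names are assumptions about the dataset, not the tutorial's exact script.

  # Sketch of a "find the bargains" approach with Spark ML: predict price with a
  # simple linear regression, then rank listings priced well below the prediction.
  # Column names are assumptions about the dataset.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.regression import LinearRegression

  spark = SparkSession.builder.appName("ml-bargains-sketch").getOrCreate()

  listings = spark.read.parquet("oci://<bucket>@<namespace>/optimized_listings")

  # Assemble a few numeric columns into a feature vector for the regressor.
  features = ["accommodates", "bedrooms", "minimum_nights", "number_of_reviews"]
  data = listings.select("name", "price", *features).dropna()
  assembled = VectorAssembler(inputCols=features, outputCol="features").transform(data)

  # Fit a linear regression that predicts price from the features.
  model = LinearRegression(featuresCol="features", labelCol="price").fit(assembled)
  scored = model.transform(assembled)

  # A "bargain" is a listing priced well below what the model predicts.
  bargains = (
      scored
      .withColumn("discount", F.col("prediction") - F.col("price"))
      .orderBy(F.col("discount").desc())
  )
  bargains.select("name", "price", "prediction", "discount").show(10, truncate=False)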

What's Next

Now you can create and run Java, Python, or SQL applications with Data Flow, and explore the results.

Data Flow handles all details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.