Administer Data Flow

Learn how to administer Data Flow, including how to tune application runs, optimize performance, tune Object Storage access, and apply Spark performance best practices, and how to troubleshoot and correct common problems.

Tune Data Flow Runs

Tuning Overview

Before you tune a Spark application run in Data Flow, it's important to understand how Data Flow runs Spark applications. When you run a Spark application, a cluster of VMs is provisioned based on the VM shapes and counts you chose, and Spark runs within this cluster. These VMs are private to your Spark application, but run within a shared, multi-tenant hardware and software environment.

When you start an application, you select one VM shape for the Driver and another for the Workers, and you specify how many Workers you want. There is always exactly one Driver. The Driver and Workers are automatically sized to consume all the CPU and memory resources of their VMs. If your workload needs larger or smaller Java Virtual Machines, control this by choosing a larger or smaller VM shape.
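If you manage Applications programmatically, the Driver shape, Worker shape, and Worker count are set when the Application is defined. The following is a minimal sketch using the OCI Python SDK; the field names mirror the Data Flow CreateApplicationDetails REST model, and all OCIDs, paths, and shape names are placeholders to replace with your own values.

    import oci

    # Load the default OCI configuration (~/.oci/config).
    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    # All values below are placeholders; substitute your own.
    details = oci.data_flow.models.CreateApplicationDetails(
        compartment_id="ocid1.compartment.oc1..example",
        display_name="sales-etl",
        language="PYTHON",
        spark_version="3.2.1",
        file_uri="oci://my-bucket@my-namespace/apps/sales_etl.py",
        driver_shape="VM.Standard2.4",    # one Driver per Run
        executor_shape="VM.Standard2.4",  # shape shared by all Workers
        num_executors=4,                  # number of Workers
    )

    application = client.create_application(details).data
    print(application.id)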

Data Flow is designed to make using Oracle Object Storage simple and transparent. As a result, Object Storage is always one of the first things you should investigate when examining an under-performing application.

Performance Optimization

Adding more resources is sometimes the most effective approach, especially if the application is already heavily optimized. Data Flow simplifies this by tracking the runtime history of all Spark applications and centralizing it in one place. In the Data Flow UI, load the application you want to optimize. The Application detail screen shows historical Runs and the resources used during those Runs. Often, hitting your SLA is as simple as adding CPU and memory resources.
Figure 1. Performance Optimization
There is a figure representing a developer. An arrow flows to a box labelled Data Flow Home Page. An arrow labelled Runs Tab flows to a box labelled Locate Application to be Optimized. An arrow labelled Open Application flows to a box labelled View Runtime History Graph. An arrow labelled Select App from Catalog flows to a box labelled Tune Default Resources. Finally an arrow flows from it to End.
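When the runtime history shows a Run is resource-bound, a quick experiment is to override the Application's default resources for a single Run rather than editing the Application itself. The sketch below uses the OCI Python SDK; the field names mirror the Data Flow CreateRunDetails REST model, and the OCIDs and shape names are placeholders.

    import oci

    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    # Override the Application defaults for this Run only (values are examples).
    run_details = oci.data_flow.models.CreateRunDetails(
        compartment_id="ocid1.compartment.oc1..example",
        application_id="ocid1.dataflowapplication.oc1..example",
        display_name="sales-etl-larger-cluster",
        driver_shape="VM.Standard2.8",
        executor_shape="VM.Standard2.8",
        num_executors=8,
    )

    run = client.create_run(run_details).data
    print(run.id, run.lifecycle_state)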

Tune Object Storage Access

Object Storage is deployed in all Oracle Cloud Infrastructure data centers. Access to Object Storage is highly performant, provided your Spark application runs in the same Oracle Cloud Infrastructure Region where your data is stored. If data reads or writes are slow, confirm that your Data Flow Application runs in the same region as your data. You can see the active region in the Oracle Cloud Infrastructure UI. REST API calls must also target a specific region.
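Within a Data Flow Run, Object Storage paths use the oci://<bucket>@<namespace>/ scheme. The following PySpark sketch reads and writes data that lives in the same region as the cluster; the bucket names, namespace, and paths are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("object-storage-io").getOrCreate()

    # Bucket and namespace are placeholders; both should be in the region
    # where the Data Flow Run executes to get full Object Storage bandwidth.
    source = "oci://source-bucket@my-namespace/raw/transactions.csv"
    target = "oci://warehouse-bucket@my-namespace/curated/transactions/"

    df = spark.read.option("header", "true").csv(source)
    df.write.mode("overwrite").parquet(target)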

Spark Performance Best Practices

If you suspect your Spark application isn't well optimized, consider using these optimizations:
  1. Use object storage. Object storage provides substantially more bandwidth than reading data from an RDBMS. Copying data to object storage ahead of time substantially speeds up processing.
  2. Use the Parquet format whenever possible. Parquet files are up to ten times smaller, and your jobs read only the data they need rather than entire files (see the sketch after this list).
  3. Partition your datasets appropriately. Most analytics applications only access the last week or so of data. Ensure you partition your data so that recent data is in separate files from older data.
  4. Identify data skew issues by looking for long-running executors within the Spark UI.
  5. Avoid driver bottlenecks. Collecting data in Spark sends all the data back to the Spark driver, so perform collect operations as late as possible in your jobs (see the sketch after this list). If you must collect a large dataset, consider scaling the driver node up to a larger VM shape to ensure adequate memory and CPU resources.
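The PySpark sketch below illustrates items 2, 3, and 5: convert raw data to Parquet, partition it by date so recent data sits in separate files from older data, and keep aggregation on the Executors instead of collecting a large dataset to the Driver. The paths and column names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-best-practices").getOrCreate()

    # 2. Convert raw CSV to Parquet so later jobs read only the columns they need.
    raw = spark.read.option("header", "true").csv("oci://raw-bucket@my-namespace/events/")

    # 3. Partition by event date so queries for the last week touch only recent files.
    (raw.withColumn("event_date", F.to_date("event_time"))
        .write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("oci://curated-bucket@my-namespace/events_parquet/"))

    # Reads that filter on the partition column skip older partitions entirely.
    recent = (spark.read.parquet("oci://curated-bucket@my-namespace/events_parquet/")
                   .filter(F.col("event_date") >= F.date_sub(F.current_date(), 7)))

    # 5. Bring back only a small summary instead of calling collect() on the full dataset.
    summary = recent.groupBy("event_date").count()
    summary.show()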

Common Problems with Jobs

Data Flow jobs fail for many reasons, but most failures are caused by:

  • Application errors.
  • Out-of-memory errors.
  • Transient runtime problems.
There are three tools to diagnose and correct failing jobs in Data Flow:
  • Logs from Oracle Cloud Infrastructure Logging, available if you have followed the steps to enable Oracle Cloud Infrastructure Logging logs in Data Flow.
  • The Spark UI.
  • The Spark log, and the stdout and stderr streams for the Spark driver and all executors.

All of these are securely accessed from your browser by loading the Run in question.

Figure 2. Accessing Log Files
There is a figure representing a developer. An arrow flows to a box labelled Data Flow Home Page. An arrow labelled Runs Tab flows to a box labelled Locate Failing Run. An arrow labelled Open Run flows to a box labelled Choose Debugging Approach. An arrow labelled Spark UI link flows to a box labelled Spark UI, whilst a second arrow labelled View Logs Link flows from Choose Debugging Approach to a box labelled Driver and Executor Logs.

When a job fails, first look at the Driver's stderr log file, as most errors appear there. Next, check the stderr log files for the Executors. If the log files don't contain specific errors, load the Spark UI to investigate the Spark application further.
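If you prefer to pull the log files without the console, the Driver and Executor logs for a Run can also be retrieved with the OCI Python SDK. The sketch below is a minimal example; the list_run_logs and get_run_log operations mirror the Data Flow REST API, the Run OCID and log file name are placeholders, and the response handling may differ slightly by SDK version.

    import oci

    config = oci.config.from_file()
    client = oci.data_flow.DataFlowClient(config)

    run_id = "ocid1.dataflowrun.oc1..example"  # placeholder Run OCID

    # Each entry describes one log file, such as the Driver's stderr.
    for log in client.list_run_logs(run_id).data:
        print(log.name, log.source, log.type)

    # Fetch a specific log file by name; the name below is illustrative and
    # should come from the listing above.
    response = client.get_run_log(run_id, "spark_application_stderr")
    print(response.data.text)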

Based on the type of problem:
  1. Application errors need to be corrected at source. Then a new Application can be created.
  2. Tackle out-of-memory errors by running larger instances, by processing less data per task through fine-grained partitioning (see the sketch after this list), or by optimizing the application to be more efficient.
  3. Data Flow strives to shield you from transient runtime problems. If they persist, contact Oracle support.
  4. An invalid request returns a bad request error, which you can see in the log files.
  5. If the user isn't authorized (but the request is valid), then a Not Authorized error is returned.
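For item 2, one way to process less data per task is to repartition the input so each Executor works on smaller slices. The following PySpark sketch uses illustrative paths and an illustrative partition count; tune the count to your data volume and Executor memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oom-mitigation").getOrCreate()

    df = spark.read.parquet("oci://curated-bucket@my-namespace/events_parquet/")

    # Increase the partition count so each task, and therefore each Executor's
    # working set, holds a smaller slice of the data. 400 is an example value.
    df = df.repartition(400)

    result = df.groupBy("event_date").count()
    result.write.mode("overwrite").parquet("oci://results-bucket@my-namespace/daily_counts/")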
