Developing Data Flow Applications

Learn about the Library, including reusable Spark application templates and application security. Also learn how to create, view, edit, and delete applications, and how to apply arguments or parameters.

Data Flow automatically stops long-running batch jobs (those running for more than 24 hours) using a delegation token. If the application hasn't finished processing the data by then, the run might fail and the job is left unfinished. To prevent this, use one of the following options to limit the total time the application can run (see the SDK sketch after this list):
When Creating Applications using the Console
Under Advanced Options, specify the duration in Max run duration minutes.
When Creating Applications using the CLI
Pass the command-line option --max-duration-in-minutes <number>.
When Creating Applications using the SDK
Provide the optional argument max_duration_in_minutes.
When Creating Applications using the API
Set the optional argument maxDurationInMinutes.
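
A minimal sketch of setting this limit through the Python SDK, assuming the oci package and a standard ~/.oci/config profile; the OCID, file URI, shapes, and Spark version below are placeholders, and max_duration_in_minutes is the optional argument described above:

import oci

config = oci.config.from_file()  # reads the default ~/.oci/config profile
client = oci.data_flow.DataFlowClient(config)

# max_duration_in_minutes caps the total run time; here the run is
# stopped after at most two hours.
details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
    display_name="nightly-batch",
    language="PYTHON",
    file_uri="oci://bucket@namespace/app.py",  # placeholder application file
    spark_version="3.2.1",
    driver_shape="VM.Standard2.1",
    executor_shape="VM.Standard2.1",
    num_executors=2,
    max_duration_in_minutes=120,
)
application = client.create_application(details).data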

Reusable Spark Application Templates

An Application is an infinitely reusable Spark application template.

Data Flow Applications consist of a Spark application, its dependencies, default parameters, and a default run-time resource specification. Once a Spark developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying it, setting it up, or running it. You can use it for Spark analytics in custom dashboards, reports, scripts, or REST API calls.

[Figure: Spark developers publish a parameterized Application; non-developers execute it to produce custom reports and custom dashboards.]

Every time you invoke the Data Flow Application, you create a Run. It fills in the details of the application template and launches it on a specific set of IaaS resources, as in the sketch following the figure.

[Figure: A Data Flow Application (link to Spark code, dependencies, default driver/executor shape and count, arguments, default parameters) becomes, when run, a Data Flow Run (link to Spark code, dependencies, specific driver/executor shapes and counts, arguments, specific parameters, Spark UI, log output).]
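
A minimal sketch of launching a Run from an existing Application with the Python SDK; the OCIDs and arguments are placeholders, and the executor count override illustrates how a Run supplies a specific resource specification on top of the Application's defaults:

import oci

config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# A Run fills in the Application template; values set here override
# the Application's defaults for this invocation only.
run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",          # placeholder
    application_id="ocid1.dataflowapplication.oc1..example",  # placeholder
    display_name="nightly-batch-2024-06-01",
    num_executors=8,  # run-specific executor count
    arguments=["--input", "oci://bucket@namespace/data/"],  # placeholder args
)
run = client.create_run(run_details).data
print(run.lifecycle_state)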