Best Practices

The following are best practices when using Spark dynamic allocation with Data Flow.

Short Runs

If Spark processing takes less than 10 minutes, dynamic allocation provides no benefit. The allocation overhead makes the run take longer than it would with statically allocated resources.

Run Start Up Time

If the Spark application doesn't need all its resources at the start of processing, dynamic allocation can improve run start-up time. Specify the minimum number of nodes the application needs when it starts; the remaining nodes are allocated in parallel while the application runs.
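As a sketch, the minimum, maximum, and initial executor counts can be set with the standard Spark dynamic allocation properties. The values below are illustrative, not recommendations; tune them to the workload:

```properties
# Enable dynamic allocation and start small, letting Spark
# add executors in parallel as the workload demands them.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.initialExecutors=2
spark.dynamicAllocation.maxExecutors=20
```

With a low initialExecutors value, the run can begin processing as soon as the minimum resources are available instead of waiting for the full allocation.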

Hot and Cold Start

Data Flow retains a tenant pool of resources that's shared across runs to speed up resource allocation. A hot start occurs when dynamic allocation can borrow from this pool, so executors are added within a few minutes.

If no runs occur in the tenancy for five minutes, the resource pool might be released. A cold start occurs when the pool has been released; dynamic allocation might then take 10 to 20 minutes to add executors, because the resources must be reallocated.

Idle Timeout

Adding an executor takes more time than releasing one. So setting too low a value for spark.dynamicAllocation.executorIdleTimeout might increase the overall run duration, because of the overhead of repeatedly adding executors back.
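For example, raising the idle timeout above Spark's 60-second default keeps executors alive through short lulls between stages, at the cost of holding resources longer. The value below is illustrative only:

```properties
# Spark's default executorIdleTimeout is 60s; a larger value avoids
# releasing executors that the next stage would have to add back.
spark.dynamicAllocation.executorIdleTimeout=300s
```

A longer timeout trades some idle-resource cost for steadier run duration on bursty workloads.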
