The following are best practices when using Spark dynamic allocation with Data Flow.
Short Runs
If Spark processing takes less than 10 minutes, dynamic allocation provides no benefit. The overhead of adding and removing executors makes the run take longer than it would with statically allocated resources.
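For a short job, you can keep a fixed executor count by leaving dynamic allocation disabled. The following is a minimal PySpark sketch; the application name and executor count are illustrative, not recommendations:

    from pyspark.sql import SparkSession

    # Illustrative sketch: a short job with a fixed executor count,
    # avoiding the overhead of dynamic allocation.
    spark = (
        SparkSession.builder
        .appName("short-batch-job")  # hypothetical name
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", "4")  # statically allocated executors
        .getOrCreate()
    )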
Run Start-Up Time
If the Spark application doesn't need all its resources during the initial phase of processing, dynamic allocation can improve run start-up time. Specify the minimum number of executors the application needs to start; the remaining executors are allocated in parallel while the application runs, as in the sketch below.
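As a sketch, the standard Spark properties below enable dynamic allocation with a small minimum so the run can start quickly. The values and application name are illustrative; the right bounds depend on the workload:

    from pyspark.sql import SparkSession

    # Illustrative sketch: start with a small minimum so the run begins
    # quickly, then let dynamic allocation add executors up to the maximum
    # in parallel as demand grows. Values are examples, not recommendations.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-example")  # hypothetical name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )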
Hot and Cold Start
Data Flow retains a tenant pool of resources that's shared across runs to speed up resource allocation. A hot start occurs when dynamic allocation can borrow executors from this pool, so they're added within a few minutes. If no runs occur in the tenancy for five minutes, the resource pool might be released. A cold start occurs when the pool has been released; dynamic allocation might then take 10 to 20 minutes to add executors, because the resources must be allocated again.
Idle Timeout
Adding an executor takes more time than releasing one, so setting spark.dynamicAllocation.executorIdleTimeout too low can increase the overall run duration because of the overhead of repeatedly adding executors back.
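As an illustration, the standard Spark property can be raised above its 60s default so idle executors are kept a little longer. The timeout value shown is an example, trading some idle cost for fewer expensive re-adds:

    from pyspark.sql import SparkSession

    # Illustrative sketch: keep idle executors for 5 minutes instead of
    # the 60s default, so transient lulls don't trigger a release followed
    # by a slow re-add.
    spark = (
        SparkSession.builder
        .appName("idle-timeout-example")  # hypothetical name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.executorIdleTimeout", "300s")
        .getOrCreate()
    )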