Data Flow now supports Pools

  • Services: Data Flow
  • Release Date: June 21, 2023

A Data Flow Pool is a group of pre-allocated Compute resources that can be used to run Data Flow-based Spark workloads with faster startup times.

Use cases:
• Time-sensitive, large production workloads with many executors that need startup times measured in seconds.
• Isolating critical production workloads from dynamic development workloads by allocating their resources from different Pools.
• Separating cost and usage between development and production workloads, with IAM policies that let you submit specific Data Flow Runs only to specific Pools.
• Executing a large number of Data Flow Runs back-to-back with reduced startup time.
• Queueing Data Flow Runs in a Pool for efficient use of resources and cost control.
• Automatically starting a Pool on a schedule, and automatically terminating it after a configured idle time.
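The queueing and lifecycle behavior described in the list above can be sketched as a toy model. This is illustrative only: the class, field, and method names here are invented for the sketch and are not the Data Flow service API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of a Data Flow Pool: pre-allocated capacity, a run
    queue, and idle-based auto-termination. Not the service API."""
    capacity: int          # pre-allocated executor slots
    idle_timeout: int      # ticks of inactivity before auto-termination
    active: bool = False
    in_use: int = 0
    idle_ticks: int = 0
    queue: deque = field(default_factory=deque)

    def submit(self, run: str, executors: int) -> None:
        # Runs queue until capacity frees up, giving cost control.
        self.active = True  # stands in for on-demand/scheduled start
        self.queue.append((run, executors))
        self._drain()

    def finish(self, executors: int) -> None:
        # A run completed; its slots return to the Pool.
        self.in_use -= executors
        self._drain()

    def _drain(self) -> None:
        # Start queued runs while pre-allocated capacity remains.
        while self.queue and self.in_use + self.queue[0][1] <= self.capacity:
            _, executors = self.queue.popleft()
            self.in_use += executors
        self.idle_ticks = 0

    def tick(self) -> None:
        # Called periodically; terminate the Pool after sustained idleness.
        if self.active and self.in_use == 0 and not self.queue:
            self.idle_ticks += 1
            if self.idle_ticks >= self.idle_timeout:
                self.active = False


pool = Pool(capacity=10, idle_timeout=3)
pool.submit("run-a", 6)
pool.submit("run-b", 6)   # queues: only 4 slots remain
pool.finish(6)            # run-a done, run-b starts from the queue
```

The point of the sketch is that a run submitted to a full Pool waits in a queue rather than failing or provisioning new resources, and an idle Pool eventually shuts itself down, which is why Pools help with both startup latency and cost control.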

For more information, see the Data Flow Service Limits documentation.