About External Tables with Source File Partitioning

On Autonomous Database you can create partitioned external tables from Hive style partitioned data or from simple folder partitioned data stored on your Cloud Object Store.

Using source file partitioning, instead of supplying a complete partition specification the procedure derives partitioning information from the file path for certain file patterns. For example, consider the following data file specifications:

  • Hive style: for example: sales/country=USA/year=2020/month=01/file1.csv

  • Simple folder partitioning style: for example: sales/USA/2020/01/file1.parquet

Using one of these common partitioning formats greatly simplifies both the creation and management of partitioned external tables. In addition, even though partition columns may not appear in the data file, they can still be queried using SQL. Partitioning data also improves query performance by dramatically reducing the amount of data scanned. In this example, when you query ‘USA’ data, the query can skip scanning the files for other countries.

Hive Format Partitioned Data in Cloud Object Store

Hive offers a standard metadata format for big data processing engines. Partitioned data in Cloud Object Store that is generated in Hive format is represented in a folder/subfolder format. For example, on Cloud Object Store a Hive format data file is stored as follows:

table/partition1=partition1_value/partition2=partition2_value/data_file.csv

Files saved in Hive partitioned format provide partition information in the data file path name. The data file path name includes information about the object contents, including partition column names and partition column values (the data file does not include the partition columns and their associated values).

For example, consider an external partitioned SALES table created from Hive format data on Cloud Object Store:

.../sales/country=USA/year=2020/month=01/file1.csv
.../sales/country=USA/year=2020/month=01/file2.csv
.../sales/country=USA/year=2020/month=02/file3.csv
.../sales/country=USA/year=2020/month=03/file1.csv
.../sales/country=FRA/year=2020/month=03/file1.csv

The Hive format partition information shows the data files in Cloud Object Store are partitioned by country, year, and month and the values for these partition columns are also specified within the Hive format path name for each data file (the path name includes values for the partitioned columns: country, year, and month).

The column names in the path will be used by the API to simplify the table definition.

Simple Folder Format Partitioned Data in Cloud Object Store

Partitioned data in Cloud Object Store that is generated in folder format is represented in a folder/subfolder format, similar to Hive format partitioned data, but the information in the path shows the column values and does not include the column names. Also, with folder format partitioned data the partition order specified in the object name is significant, and must match the order in the table columns.

For example, on Cloud Object Store a folder format data file is stored as follows:

table/partition1_value/partition2_value/*.parquet

The path includes both partition column values, in partition column order, and the data files. Autonomous Database allows you to create an external partitioned table from folder format data and you can perform a query using the specified partitions.

Files saved in folder partitioned format provide the data partition column values in the file name. Unlike Hive, the paths do not include the column name, therefore the column names must be provided. The order of partition columns is important and the order in the file name for column partition names must match the order in the partition_columns parameter.

About Querying Partitioned Data in Cloud Object Store

When you query external partitioned data in Hive format, the query engine understands and utilizes the partitioning information from the file path name. For example, consider an external partitioned SALES table where the source file, sales/country=USA/year=2020/month=02/file3.csv on Object Store includes the following sales data:

tents, 291
canoes, 22
backpacks, 378

The country values in the path name, and the time period values for month and year are not specified as columns within the data file. The partition column values are specified only in the path name with values shown: USA, 2020, and 02. After you create an external partitioned table with this data file you can use the partition columns and their values when you run a query on the external partitioned table.

For example:

SELECT year, month, product, units 
     FROM SALES WHERE year='2020' AND month='02' AND country='USA'

The benefit of creating an external partitioned table with data generated as Hive format partitioned data is that the query engine is optimized to partition prune the data to select the correct partition and the query only selects data from one partition and only needs to search a single data file. Thus, the query would only require a scan of the file3.csv file (/sales/country=USA/year=2020/month=02/file3.csv). For large amounts of data, such partition pruning can provide significant performance improvements.

Using standard Oracle Database external tables, the partition column must be available as a column within the data file to use it for queries or partition definitions. Without the special handling that is available with external partitioned tables on Autonomous Database, this would be a problem if you want to use data stored in Hive format on Cloud Object Store, as you would need to regenerate the data files to include the partition as a column in the data file.

About Creating Partitioned External Tables

When you use unstructured data stored in Hive format on Cloud Object Store and you create an external partitioned table, the columns and their types cannot be derived from the source file. Thus, the columns and their data types must be specified with the column_list parameter. To create the partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE to specify the partition columns and their types as follows:

  • The root for the file list is specified in the path name with the file_uri_list parameter. For example, http://.../sales/*

  • The column names and data types are specified with the column_list parameter.

  • The option partition_columns in the format parameter specifies the partition columns.

  • The generated DLL includes the columns specified in the path name.

For this example, when the external table is created the country, year, and month columns are added in the column_list parameter. The external table is created with the country, year, and month columns, which are not in the data files, and list partitions are created enabling partition pruning.

When you use structured data, such as Parquet, Avro, or ORC files stored in folder format on Cloud Object Store, the columns and their data types are known. and you do not need to specify the column list as is required with unstructured data. To create the partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE to specify the partition columns and their types as follows:

  • The root for the file list is specified in the path name with the file_uri_list parameter. For example, http://.../sales/*
  • The column_list parameter is not required for structured files. If you do not specify the column list, you must define the partition columns and their data types when you create the external partitioned table. Use the option partition_columns in the format parameter to specify the partition columns and their data types.
  • The generated DLL includes the columns specified in the path name.

See Query External Partitioned Data with Hive Format Source File Organization and Query External Partitioned Data with Folder Format Source File Organization for complete examples.

External Partitioning: CSV Source Files with Hive-style Folders

Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in Hive-style folders.

Source file list:

.../sales/country=USA/year=2020/month=01/file1.csv
.../sales/country=USA/year=2020/month=01/file2.csv
.../sales/country=USA/year=2020/month=02/file3.csv
.../sales/country=USA/year=2020/month=03/file1.csv
.../sales/country=FRA/year=2020/month=03/file1.csv

API:

DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
   table_name        => 'mysales',
   credential_name   => 'mycredential', 
   file_uri_list     => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv', 
   column_list       => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)', 
   field_list        => 'product, units', --[Because country, year and month are not in the file, they are not listed in the field list]
   format            => '{"type":"csv","partition_columns":["country", "year", "month"]}');
Note

The partition_columns in the format parameter must match the column names found in the path (for example, the country column matches “country=…”)

External Partitioning: CSV Source Files with Simple Folders

Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in simple folder format.

Source file list:

.../sales/USA/2020/01/file1.csv
.../sales/USA/2020/01/file2.csv
.../sales/USA/2020/02/file3.csv
.../sales/USA/2020/03/file1.csv
.../sales/FRA/2020/03/file1.csv

API:

DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
   table_name        => 'mysales',
   credential_name   => 'mycredential', 
   file_uri_list     =>  'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv', 
   column_list       => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)', 
   field_list        => 'product, units', --[Because country, year and month are not in the file, they are not listed in the field list]
   format            => '{"type":"csv","partition_columns":["country", "year", "month"]}');
Note

The API call is the same as in the previous example, but the order of the partition_columns in the format parameter is significant because the column name is not in the file path.

External Partitioning: Parquet Source Files with Hive-style Folders

Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in Hive-style folders.

Source file list:

.../sales/USA/2020/01/file1.parquet
.../sales/USA/2020/01/file2.parquet
.../sales/USA/2020/02/file3.parquet
.../sales/USA/2020/03/file1.parquet
.../sales/FRA/2020/03/file1.parquet

API:

DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
   table_name            => 'mysales',
     credential_name     => 'mycredential', 
     file_uri_list       => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet',
     format              => 
         json_object(    'type' value 'parquet', 
               'schema' value 'first',
               'partition_columns' value 
                  json_array(
                      json_object('name' value 'country', 'type' value 'varchar2(100)'),
                      json_object('name' value 'year', 'type' value 'number'),
                      json_object('name' value 'month', 'type' value 'varchar2(2)')
            )
        )

);
Note

The column_list parameter is not specified. As shown, for each partition column specify both the name and data type in the format parameter partition_columns.

External Partitioning: Parquet with Simple Folders

Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in simple folder format.

Source file list:

.../sales/USA/2020/01/file1.parquet
.../sales/USA/2020/01/file2.parquet
.../sales/USA/2020/02/file3.parquet
.../sales/USA/2020/03/file1.parquet
.../sales/FRA/2020/03/file1.parquet

API:

DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
   table_name        => 'mysales',
   credential_name   => 'mycredential', 
   file_uri_list     => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet', 
   format            => 
     json_object(    'type' value 'parquet', 
                'schema' value 'first',
         'partition_columns' value 
           json_array(
              json_object('name' value 'country', 'type' value 'varchar2(100)'),
              json_object('name' value 'year', 'type' value 'number'),
              json_object('name' value 'month', 'type' value 'varchar2(2)')
            )
        )

);
Note

The column_list parameter is not specified. You must include both the name and data type for the partition columns. In addition, the order of the partition_columns in the format clause matters because the column name is not in the file path.