About External Tables with Source File Partitioning
On Autonomous Database you can create partitioned external tables from Hive style partitioned data or from simple folder partitioned data stored on your Cloud Object Store.
With source file partitioning, instead of supplying a complete partition specification, the procedure derives the partitioning information from the file path for certain file patterns. For example, consider the following data file specifications:
- Hive style, for example:
  sales/country=USA/year=2020/month=01/file1.csv
- Simple folder partitioning style, for example:
  sales/USA/2020/01/file1.parquet
Using one of these common partitioning formats greatly simplifies both the creation and management of partitioned external tables. In addition, even though partition columns may not appear in the data file, they can still be queried using SQL. Partitioning data also improves query performance by dramatically reducing the amount of data scanned. In this example, when you query ‘USA’ data, the query can skip scanning the files for other countries.
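For example, the following query is a minimal sketch, assuming a SALES external partitioned table with product and units data columns and country, year, and month partition columns (similar to the tables created later in this topic). The filter on country means the query only needs to read the files stored under the USA folders:
SELECT product, SUM(units) AS total_units
FROM SALES
WHERE country = 'USA'
GROUP BY product;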
Hive Format Partitioned Data in Cloud Object Store
Hive offers a standard metadata format for big data processing engines.
Partitioned data in Cloud Object Store that is generated in Hive format is represented in a folder/subfolder format. For example, on Cloud Object Store a Hive format data file is stored as follows:
table/partition1=partition1_value/partition2=partition2_value/data_file.csv
Files saved in Hive partitioned format provide partition information in the data file path name. The data file path name includes information about the object contents, including partition column names and partition column values (the data file does not include the partition columns and their associated values).
For example, consider an external partitioned SALES table created from Hive format data on Cloud Object Store:
.../sales/country=USA/year=2020/month=01/file1.csv
.../sales/country=USA/year=2020/month=01/file2.csv
.../sales/country=USA/year=2020/month=02/file3.csv
.../sales/country=USA/year=2020/month=03/file1.csv
.../sales/country=FRA/year=2020/month=03/file1.csv
The Hive format partition information shows that the data files in Cloud Object Store are partitioned by country, year, and month, and that the values for these partition columns are also specified within the Hive format path name for each data file (the path name includes values for the partition columns: country, year, and month).
The column names in the path will be used by the API to simplify the table definition.
Simple Folder Format Partitioned Data in Cloud Object Store
Partitioned data in Cloud Object Store that is generated in folder format is represented in a folder/subfolder format, similar to Hive format partitioned data, but the path shows only the partition column values and does not include the column names. Also, with folder format partitioned data the partition order specified in the object name is significant and must match the order of the table's partition columns.
For example, on Cloud Object Store a folder format data file is stored as follows:
table/partition1_value/partition2_value/*.parquet
The path includes both partition column values, in partition column order, and the data files. Autonomous Database allows you to create an external partitioned table from folder format data and you can perform a query using the specified partitions.
Files saved in folder partitioned format provide the partition column values in the file path. Unlike Hive format, the path does not include the column names, so the column names must be provided. The order of partition columns is important: the order of the values in the file path must match the order of the columns in the partition_columns parameter.
About Querying Partitioned Data in Cloud Object Store
When you query external partitioned data in Hive format, the query engine understands and utilizes the partitioning information from the file path name. For example, consider an external partitioned SALES table where the source file sales/country=USA/year=2020/month=02/file3.csv on Object Store includes the following sales data:
tents, 291
canoes, 22
backpacks, 378
The country value in the path name and the time period values for month and year are not specified as columns within the data file. The partition column values are specified only in the path name, with the values USA, 2020, and 02. After you create an external partitioned table with this data file, you can use the partition columns and their values when you run a query on the external partitioned table.
For example:
SELECT year, month, product, units
FROM SALES WHERE year='2020' AND month='02' AND country='USA'
The benefit of creating an external partitioned table with data generated as Hive format partitioned data is that the query engine is optimized to partition prune the data: the query selects data from only one partition and only needs to search a single data file. Thus, the query only requires a scan of the file3.csv file (/sales/country=USA/year=2020/month=02/file3.csv). For large amounts of data, such partition pruning can provide significant performance improvements.
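As a minimal sketch of verifying this behavior (assuming the SALES external partitioned table from this example exists in your schema), you can generate an execution plan; the Pstart and Pstop columns of the plan typically show which partitions are accessed:
EXPLAIN PLAN FOR
  SELECT year, month, product, units
  FROM SALES WHERE year='2020' AND month='02' AND country='USA';

-- Display the plan; the Pstart/Pstop columns indicate the pruned partitions
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);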
Using standard Oracle Database external tables, the partition column must be available as a column within the data file to use it for queries or partition definitions. Without the special handling that is available with external partitioned tables on Autonomous Database, this would be a problem if you want to use data stored in Hive format on Cloud Object Store, as you would need to regenerate the data files to include the partition as a column in the data file.
About Creating Partitioned External Tables
When you use unstructured data stored in Hive format on Cloud Object Store and you create an external partitioned table, the columns and their types cannot be derived from the source file. Thus, the columns and their data types must be specified with the column_list parameter. To create the partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE and specify the partition columns and their types as follows:
- The root for the file list is specified in the path name with the file_uri_list parameter. For example: http://.../sales/*
- The column names and data types are specified with the column_list parameter.
- The option partition_columns in the format parameter specifies the partition columns.
- The generated DDL includes the columns specified in the path name.
For this example, when the external table is created the country, year, and month columns are added in the column_list parameter. The external table is created with the country, year, and month columns, which are not in the data files, and list partitions are created, enabling partition pruning.
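As a brief sketch of inspecting the result (assuming a table named MYSALES created with the examples later in this topic), the list partitions generated from the path values are visible in the data dictionary:
-- Each generated list partition corresponds to a combination of path values
SELECT partition_name, high_value
FROM user_tab_partitions
WHERE table_name = 'MYSALES'
ORDER BY partition_position;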
When you use structured data, such as Parquet, Avro, or ORC files stored in folder format on Cloud Object Store, the columns and their data types are known, and you do not need to specify the column list as is required with unstructured data. To create the partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE and specify the partition columns and their types as follows:
- The root for the file list is specified in the path name with the file_uri_list parameter. For example: http://.../sales/*
- The column_list parameter is not required for structured files. If you do not specify the column list, you must define the partition columns and their data types when you create the external partitioned table. Use the option partition_columns in the format parameter to specify the partition columns and their data types.
- The generated DDL includes the columns specified in the path name.
See Query External Partitioned Data with Hive Format Source File Organization and Query External Partitioned Data with Folder Format Source File Organization for complete examples.
- External Partitioning: CSV Source Files with Hive-style Folders
  Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in Hive-style folders.
- External Partitioning: CSV Source Files with Simple Folders
  Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in simple folder format.
- External Partitioning: Parquet Source Files with Hive-style Folders
  Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in Hive-style folders.
- External Partitioning: Parquet Source Files with Simple Folders
  Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in simple folder format.
External Partitioning: CSV Source Files with Hive-style Folders
Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in Hive-style folders.
Source file list:
.../sales/country=USA/year=2020/month=01/file1.csv
.../sales/country=USA/year=2020/month=01/file2.csv
.../sales/country=USA/year=2020/month=02/file3.csv
.../sales/country=USA/year=2020/month=03/file1.csv
.../sales/country=FRA/year=2020/month=03/file1.csv
API:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
    table_name      => 'mysales',
    credential_name => 'mycredential',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv',
    column_list     => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)',
    field_list      => 'product, units', -- country, year, and month are not in the file, so they are not listed in the field list
    format          => '{"type":"csv","partition_columns":["country", "year", "month"]}');
END;
/
The partition_columns in the format parameter must match the column names found in the path (for example, the country column matches “country=…”).
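Once the table is created, a query such as the following hypothetical example can filter on the partition columns, even though country, year, and month are not present in the CSV files:
SELECT product, SUM(units) AS total_units
FROM mysales
WHERE country = 'USA' AND year = 2020 AND month = '02'
GROUP BY product;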
External Partitioning: CSV Source Files with Simple Folders
Shows how to create external partitioned tables with CSV source files stored on Cloud Object Store in simple folder format.
Source file list:
.../sales/USA/2020/01/file1.csv
.../sales/USA/2020/01/file2.csv
.../sales/USA/2020/02/file3.csv
.../sales/USA/2020/03/file1.csv
.../sales/FRA/2020/03/file1.csv
API:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
    table_name      => 'mysales',
    credential_name => 'mycredential',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv',
    column_list     => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)',
    field_list      => 'product, units', -- country, year, and month are not in the file, so they are not listed in the field list
    format          => '{"type":"csv","partition_columns":["country", "year", "month"]}');
END;
/
The API call is the same as in the previous example, but the order of the partition_columns in the format parameter is significant because the column names are not in the file path.
External Partitioning: Parquet Source Files with Hive-style Folders
Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in Hive-style folders.
Source file list:
.../sales/country=USA/year=2020/month=01/file1.parquet
.../sales/country=USA/year=2020/month=01/file2.parquet
.../sales/country=USA/year=2020/month=02/file3.parquet
.../sales/country=USA/year=2020/month=03/file1.parquet
.../sales/country=FRA/year=2020/month=03/file1.parquet
API:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
    table_name      => 'mysales',
    credential_name => 'mycredential',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet',
    format          =>
      json_object('type' value 'parquet',
                  'schema' value 'first',
                  'partition_columns' value
                    json_array(
                      json_object('name' value 'country', 'type' value 'varchar2(100)'),
                      json_object('name' value 'year',    'type' value 'number'),
                      json_object('name' value 'month',   'type' value 'varchar2(2)')
                    )
      )
  );
END;
/
The column_list parameter is not specified. As shown, for each partition column you specify both the name and the data type in the format parameter option partition_columns.
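As a brief hypothetical usage example, the partition columns derived from the path can be queried alongside the columns read from the Parquet files:
SELECT country, year, month, COUNT(*) AS row_count
FROM mysales
GROUP BY country, year, month
ORDER BY country, year, month;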
External Partitioning: Parquet Source Files with Simple Folders
Shows how to create external partitioned tables with Parquet source files stored on Cloud Object Store in simple folder format.
Source file list:
.../sales/USA/2020/01/file1.parquet
.../sales/USA/2020/01/file2.parquet
.../sales/USA/2020/02/file3.parquet
.../sales/USA/2020/03/file1.parquet
.../sales/FRA/2020/03/file1.parquet
API:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
    table_name      => 'mysales',
    credential_name => 'mycredential',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet',
    format          =>
      json_object('type' value 'parquet',
                  'schema' value 'first',
                  'partition_columns' value
                    json_array(
                      json_object('name' value 'country', 'type' value 'varchar2(100)'),
                      json_object('name' value 'year',    'type' value 'number'),
                      json_object('name' value 'month',   'type' value 'varchar2(2)')
                    )
      )
  );
END;
/
The column_list parameter is not specified. You must include both the name and the data type for the partition columns. In addition, the order of the partition_columns in the format parameter matters because the column names are not in the file path.