About External Tables with Source
File Partitioning
On Autonomous Database you
can create partitioned external tables from Hive style partitioned data or from simple
folder partitioned data stored on your Cloud Object Store.
Using source file partitioning, instead of supplying a complete partition
specification the procedure derives partitioning information from the file path for
certain file patterns. For example, consider the following data file specifications:
Hive style: for example:
sales/country=USA/year=2020/month=01/file1.csv
Simple folder partitioning style: for example:
sales/USA/2020/01/file1.parquet
Using one of these common partitioning formats greatly simplifies both the
creation and management of partitioned external tables. In addition, even though
partition columns may not appear in the data file, they can still be queried using SQL.
Partitioning data also improves query performance by dramatically reducing the amount of
data scanned. In this example, when you query ‘USA’ data, the query can skip scanning
the files for other countries.
Hive Format Partitioned Data in
Cloud Object Store
Hive offers a standard metadata format for big data processing engines.
Partitioned data in Cloud Object Store that is generated in Hive format is
represented in a folder/subfolder format. For example, on Cloud
Object Store a Hive format data file is stored as follows:
Files saved in Hive partitioned format provide partition information in
the data file path name. The data file path name includes information about the
object contents, including partition column names and partition column values (the
data file does not include the partition columns and their associated values).
For example, consider an external partitioned SALES
table created from Hive format data on Cloud Object Store:
The Hive format partition information shows the data files in Cloud
Object Store are partitioned by country,
year, and month and the values for
these partition columns are also specified within the Hive format path name for
each data file (the path name includes values for the partitioned columns:
country, year, and
month).
The column names in the path will be used by the API to simplify the table
definition.
Simple Folder Format Partitioned
Data in Cloud Object Store
Partitioned data in Cloud Object Store that is generated in folder format
is represented in a folder/subfolder format, similar to Hive
format partitioned data, but the information in the path shows the column values and
does not include the column names. Also, with folder format partitioned data the
partition order specified in the object name is significant, and must match the
order in the table columns.
For example, on Cloud Object Store a folder format data file is stored as
follows:
table/partition1_value/partition2_value/*.parquet
The path includes both partition column values, in partition column
order, and the data files. Autonomous Database allows you to create an external partitioned table from folder
format data and you can perform a query using the specified partitions.
Files saved in folder partitioned format provide the data partition column values in
the file name. Unlike Hive, the paths do not include the column name, therefore the
column names must be provided. The order of partition columns is important and the
order in the file name for column partition names must match the order in the
partition_columns parameter.
About Querying Partitioned Data in
Cloud Object Store
When you query external partitioned data in Hive format, the query
engine understands and utilizes the partitioning information from the file path
name. For example, consider an external partitioned SALES table
where the source file,
sales/country=USA/year=2020/month=02/file3.csv on Object
Store includes the following sales data:
tents, 291
canoes, 22
backpacks, 378
The country values in the path name, and the time period
values for month and year are not specified as
columns within the data file. The partition column values are specified only in the
path name with values shown: USA, 2020, and 02. After you create an external
partitioned table with this data file you can use the partition columns and their
values when you run a query on the external partitioned table.
For example:
SELECT year, month, product, units
FROM SALES WHERE year='2020' AND month='02' AND country='USA'
The benefit of creating an external partitioned table with data generated
as Hive format partitioned data is that the query engine is optimized to partition
prune the data to select the correct partition and the query only selects data from
one partition and only needs to search a single data file. Thus, the query would
only require a scan of the file3.csv file
(/sales/country=USA/year=2020/month=02/file3.csv). For
large amounts of data, such partition pruning can provide significant performance
improvements.
Using standard Oracle Database external tables, the partition column
must be available as a column within the data file to use it for queries or
partition definitions. Without the special handling that is available with external
partitioned tables on Autonomous Database,
this would be a problem if you want to use data stored in Hive format on Cloud
Object Store, as you would need to regenerate the data files to include the
partition as a column in the data file.
About Creating Partitioned
External Tables
When you use unstructured data stored in Hive format on Cloud Object
Store and you create an external partitioned table, the columns and their types
cannot be derived from the source file. Thus, the columns and their data types must
be specified with the column_list parameter. To create the
partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE to specify the
partition columns and their types as follows:
The root for the file list is specified in the path name with
the file_uri_list parameter. For example,
http://.../sales/*
The column names and data types are specified with the
column_list parameter.
The option partition_columns in the
format parameter specifies the partition columns.
The generated DLL includes the columns specified in the path
name.
For this example, when the external table is created the
country, year, and month
columns are added in the column_list parameter. The external table
is created with the country, year, and
month columns, which are not in the data files, and list
partitions are created enabling partition pruning.
When you use structured data, such as Parquet, Avro, or ORC files stored
in folder format on Cloud Object Store, the columns and their data types are known.
and you do not need to specify the column list as is required with unstructured
data. To create the partitioned external tables, use the procedure DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE
to specify the partition columns and their types as follows:
The root for the file list is specified in the path name with the
file_uri_list parameter. For example,
http://.../sales/*
The column_list parameter is not required for
structured files. If you do not specify the column list, you must define the
partition columns and their data types when you create the external partitioned
table. Use the option partition_columns in the
format parameter to specify the partition columns and their
data types.
The generated DLL includes the columns specified in the path name.
DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
table_name => 'mysales',
credential_name => 'mycredential',
file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv',
column_list => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)',
field_list => 'product, units', --[Because country, year and month are not in the file, they are not listed in the field list]
format => '{"type":"csv","partition_columns":["country", "year", "month"]}');
Note
The partition_columns
in the format parameter must match the column names found in the path
(for example, the country column matches
“country=…”)
DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE(
table_name => 'mysales',
credential_name => 'mycredential',
file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.csv',
column_list => 'product varchar2(100), units number, country varchar2(100), year number, month varchar2(2)',
field_list => 'product, units', --[Because country, year and month are not in the file, they are not listed in the field list]
format => '{"type":"csv","partition_columns":["country", "year", "month"]}');
Note
The API call is the same as in the
previous example, but the order of the partition_columns in the
format parameter is significant because the column name is not in
the file path.
DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
table_name => 'mysales',
credential_name => 'mycredential',
file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet',
format =>
json_object( 'type' value 'parquet',
'schema' value 'first',
'partition_columns' value
json_array(
json_object('name' value 'country', 'type' value 'varchar2(100)'),
json_object('name' value 'year', 'type' value 'number'),
json_object('name' value 'month', 'type' value 'varchar2(2)')
)
)
);
Note
The column_list
parameter is not specified. As shown, for each partition column specify both the name
and data type in the format parameter
partition_columns.
DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
table_name => 'mysales',
credential_name => 'mycredential',
file_uri_list => 'https://objectstorage.us-phoenix-1.oraclecloud.com/.../sales/*.parquet',
format =>
json_object( 'type' value 'parquet',
'schema' value 'first',
'partition_columns' value
json_array(
json_object('name' value 'country', 'type' value 'varchar2(100)'),
json_object('name' value 'year', 'type' value 'number'),
json_object('name' value 'month', 'type' value 'varchar2(2)')
)
)
);
Note
The column_list
parameter is not specified. You must include both the name and data type for the
partition columns. In addition, the order of the partition_columns in
the format clause matters because the column name is not in the file path.