About Querying Apache Iceberg Tables Autonomous Database supports querying of Apache Iceberg tables stored in Amazon Web Services (AWS) or in Oracle Cloud Infrastructure (OCI) Object Storage.
Examples: Querying Apache Iceberg Tables These examples show how to query Apache Iceberg tables on Amazon Web Services (AWS) and Oracle Cloud Infrastructure (OCI), using a data catalog and using direct URLs for the root manifest file.
Autonomous Database supports
querying of Apache Iceberg tables stored in Amazon Web Services (AWS) or in Oracle Cloud
Infrastructure (OCI) Object Storage.
Supported Configurations
These specific configurations are supported:
Iceberg tables on AWS:
Iceberg tables registered with AWS Glue Data Catalog,
created with Spark or Athena.
Iceberg tables stored on OCI Object Storage by directly providing the
URL for the root metadata file.
Restrictions
Partitioned Iceberg tables
Oracle does not
support Iceberg partitioned tables.
Row-level updates of Iceberg tables
Oracle does
not support merge-on-read for Iceberg table updates. Queries
encountering deleted files in the Iceberg metadata will fail. For more
information about merge-on-read, see Enum
RowLevelOperationMode.
Schema evolution
The schema for Oracle external
tables is fixed and reflects the Iceberg schema version at external
table creation time. Queries fail if the Iceberg metadata points to a
different schema version compared to the one used to create the Iceberg
table. If the Iceberg schema changes after the external table has been
created, it is recommended that the external table be
recreated.
Concepts Related to Querying Apache Iceberg
Tables 🔗
An understanding of the following concepts is helpful for querying Apache
Iceberg tables.
Iceberg Catalog
The Iceberg catalog is a service that manages table metadata, such
table snapshots, the table schema and partitioning information. In order to
query the latest snapshot for an Iceberg table, query engines must first
access the catalog and obtain the location of the most recent metadata file.
There are already a number of available catalog implementations, including
AWS Glue, Hive, Nessie, and Hadoop. Autonomous Database supports the AWS
Glue catalog and the HadoopCatalog implementation used by Spark.
The metadata file is a JSON document that keeps track of the table
snapshots, partitioning scheme and schema information. The metadata
file is the entry point to a hierarchy of manifest lists and
manifest files. The manifests track the table's data files along
with information including partitioning and column statistics. See
the Iceberg Table
Specification, for more information.
Transactions
Iceberg supports row-level updates to tables using either
copy-on-write or merge-on-read. Copy-on-write generates new data
files that reflect the updated rows, while merge-on-read generates
new "delete files" that must be merged with the data files during
reading. Oracle supports copy-on-write. Queries on iceberg tables
fail if they encounter a delete file. For more information, see
RowLevelOperationMode.
Schema Evolution
Iceberg supports schema evolution. Schema changes are reflected in the
Iceberg metadata using a schema ID. Note that Oracle external tables
have a fixed schema, determined by the most current schema version
at table creation time. Iceberg queries fail when the queried
metadata points to a different schema version compared to the one
used at table creation time. For more information, see Schema Evolution.
Partitioning
Iceberg supports advanced partitioning options such as
hidden partitioning and partition evolution that rely on
processing/altering the table's metadata without costly data layout
changes.
These examples show how to query Apache Iceberg tables on Amazon Web
Services (AWS) and Oracle Cloud Infrastructure (OCI), using a data catalog and using
direct URLs for the root manifest file.
In the protocol_config section we specify that the
table uses AWS Glue as the catalog type, and set the catalog's
region to us-west-1.
The
credential AWS_CRED is created using
dbms_cloud.create_credential with an AWS
API key. The Glue Data Catalog instance is determined by the account
ID associated with AWS_CRED for region
us-west-1, as there is a single Glue Data
Catalog region for each account. The
iceberg_table_path element in the
protocol_config section uses a
$database_name.$table_name path to specify
the Glue table name and database name. Finally, the
column_list and
field_list parameters are left null because
the table's schema is automatically derived from the Iceberg
metadata.
Query an Iceberg table on AWS using the location of the root
metadata file
If we know the location of the metadata file for an Iceberg table, we can
create an external table without specifying a catalog, as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim',
credential_name => 'AWS_CRED',
file_uri_list => 'https://my-iceberg-bucket.s3.us-west-1.amazonaws.com/iceberg-loc/metadata/00004-1758ee2d-a204-4fd9-8d52-d17e5371a5ce.metadata.json',
format =>'{"access_protocol":{"protocol_type":"iceberg"}}');
END;
/
We use the file_uri_list parameter to specify the
location of the metadata file in AWS S3 virtual-hosted-style URL
format. For information on this format, see Methods for accessing an AWS
S3 bucket.
In this example, the database accesses the metadata file directly, so
there is no need to provide a protocol_config
section in the format parameter. When using the
metadata file location to create an external table, the database
queries the most recent snapshot referenced by the metadata file.
Subsequent updates to the Iceberg table, that create new snapshots
and new metadata files, will not be visible to the database.
Query
an Iceberg table that uses Hadoop Catalog on OCI
In this example, we query the Iceberg
table icebergTablePy, created using OCI Data Flow,
where Spark uses the HadoopCatalog implementation for the Iceberg
catalog. HadoopCatalog uses a warehouse directory
and puts the Iceberg metadata in a
$database_name/$table_name subfolder under
this directory. It also uses a version-hint.text
file that contains the version number for the most recent metadata
file version. See Iceberg Support
on OCI Data Flow for the example on Github.
The sample table db.icebergTablePy was
created using a warehouse folder, named
iceberg, in OCI bucket
my-iceberg-bucket. The storage layout on
OCI for table icebergTablePy is shown below:
Create an external table for table
db.icebergTablePy as follows:
Query an Iceberg table on OCI using the location of the root
metadata file
We can query the Iceberg table described in the previous section by
directly using the URL for the metadata file, as follows:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
table_name => 'iceberg_parquet_time_dim4',
credential_name => 'OCI_CRED',
file_uri_list => 'https://objectstorage.uk-cardiff-1.oraclecloud.com/n/my-tenancy/b/my-iceberg-bucket/o/iceberg/db/icebergTablePy/metadata/v2.metadata.json',
format =>'{"access_protocol":{"protocol_type":"iceberg"}}'
);
END;
/
In this example, we use the file_uri_list parameter to
specify the URI for the metadata file using the native OCI URI
format. When using the metadata file URI, the external table always
queries the latest snapshot stored in the specific file. Subsequent
updates that generate new snapshots and new metadata files are not
accessible to the query.