Query Apache Iceberg Tables

Autonomous Database supports querying Apache Iceberg tables.

About Querying Apache Iceberg Tables

Autonomous Database supports querying of Apache Iceberg tables stored in Amazon Web Services (AWS) or in Oracle Cloud Infrastructure (OCI) Object Storage.

Supported Configurations

These specific configurations are supported:

Restrictions

  • Partitioned Iceberg tables

    Oracle does not support Iceberg partitioned tables.

  • Row-level updates of Iceberg tables

    Oracle does not support merge-on-read for Iceberg table updates. Queries encountering deleted files in the Iceberg metadata will fail. For more information about merge-on-read, see Enum RowLevelOperationMode.

  • Schema evolution

    The schema for Oracle external tables is fixed and reflects the Iceberg schema version at external table creation time. Queries fail if the Iceberg metadata points to a different schema version compared to the one used to create the Iceberg table. If the Iceberg schema changes after the external table has been created, it is recommended that the external table be recreated.

Concepts Related to Querying Apache Iceberg Tables

An understanding of the following concepts is helpful for querying Apache Iceberg tables.

Iceberg Catalog

The Iceberg catalog is a service that manages table metadata, such table snapshots, the table schema and partitioning information. In order to query the latest snapshot for an Iceberg table, query engines must first access the catalog and obtain the location of the most recent metadata file. There are already a number of available catalog implementations, including AWS Glue, Hive, Nessie, and Hadoop. Autonomous Database supports the AWS Glue catalog and the HadoopCatalog implementation used by Spark.

For more information, see Iceberg Catalog, Optimistic Concurrency, Catalog Implementations, and Class HadoopCatalog.

Metadata Files

The metadata file is a JSON document that keeps track of the table snapshots, partitioning scheme and schema information. The metadata file is the entry point to a hierarchy of manifest lists and manifest files. The manifests track the table's data files along with information including partitioning and column statistics. See the Iceberg Table Specification, for more information.

Transactions

Iceberg supports row-level updates to tables using either copy-on-write or merge-on-read. Copy-on-write generates new data files that reflect the updated rows, while merge-on-read generates new "delete files" that must be merged with the data files during reading. Oracle supports copy-on-write. Queries on iceberg tables fail if they encounter a delete file. For more information, see RowLevelOperationMode.

Schema Evolution

Iceberg supports schema evolution. Schema changes are reflected in the Iceberg metadata using a schema ID. Note that Oracle external tables have a fixed schema, determined by the most current schema version at table creation time. Iceberg queries fail when the queried metadata points to a different schema version compared to the one used at table creation time. For more information, see Schema Evolution.

Partitioning

Iceberg supports advanced partitioning options such as hidden partitioning and partition evolution that rely on processing/altering the table's metadata without costly data layout changes.

Examples: Querying Apache Iceberg Tables

These examples show how to query Apache Iceberg tables on Amazon Web Services (AWS) and Oracle Cloud Infrastructure (OCI), using a data catalog and using direct URLs for the root manifest file.

For detailed information on creating external tables for Apache Iceberg, see CREATE_EXTERNAL_TABLE Procedure for Apache Iceberg.

Query an Iceberg table on AWS using a Glue Data Catalog

In this example, we query the Iceberg table iceberg_parquet_time_dim.Description of example_1_table.png follows

The table belongs to Glue database my-iceberg-db and is stored in folder s3://my-iceberg-bucket/iceberg-loc.

The table details for iceberg_parquet_time_dim are shown here:

Description of example_1_details_v1.png follows

We can create an external table for iceberg_parquet_time_dim as follows:

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name       => 'iceberg_parquet_time_dim',
    credential_name  => 'AWS_CRED',
    file_uri_list    => '',
    format           =>
      '{"access_protocol":
        {
         "protocol_type": "iceberg",
         "protocol_config":
          {
           "iceberg_catalog_type": "aws_glue",
           "iceberg_glue_region": "us-west-1",
           "iceberg_table_path": "my-iceberg-db.iceberg_parquet_time_dim"
          }
        }
      }'
  );
END;
/

In the protocol_config section we specify that the table uses AWS Glue as the catalog type, and set the catalog's region to us-west-1.

The credential AWS_CRED is created using dbms_cloud.create_credential with an AWS API key. The Glue Data Catalog instance is determined by the account ID associated with AWS_CRED for region us-west-1, as there is a single Glue Data Catalog region for each account. The iceberg_table_path element in the protocol_config section uses a $database_name.$table_name path to specify the Glue table name and database name. Finally, the column_list and field_list parameters are left null because the table's schema is automatically derived from the Iceberg metadata.

For further information on creating the credential, see CREATE_CREDENTIAL Procedure. For information on AWS Glue resources, see Specifying AWS Glue resource ARNs.

Query an Iceberg table on AWS using the location of the root metadata file

If we know the location of the metadata file for an Iceberg table, we can create an external table without specifying a catalog, as follows:

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name       => 'iceberg_parquet_time_dim',
    credential_name  => 'AWS_CRED',
    file_uri_list    => 'https://my-iceberg-bucket.s3.us-west-1.amazonaws.com/iceberg-loc/metadata/00004-1758ee2d-a204-4fd9-8d52-d17e5371a5ce.metadata.json',
    format           =>'{"access_protocol":{"protocol_type":"iceberg"}}');
END;
/

We use the file_uri_list parameter to specify the location of the metadata file in AWS S3 virtual-hosted-style URL format. For information on this format, see Methods for accessing an AWS S3 bucket.

In this example, the database accesses the metadata file directly, so there is no need to provide a protocol_config section in the format parameter. When using the metadata file location to create an external table, the database queries the most recent snapshot referenced by the metadata file. Subsequent updates to the Iceberg table, that create new snapshots and new metadata files, will not be visible to the database.

Query an Iceberg table that uses Hadoop Catalog on OCI

In this example, we query the Iceberg table icebergTablePy, created using OCI Data Flow, where Spark uses the HadoopCatalog implementation for the Iceberg catalog. HadoopCatalog uses a warehouse directory and puts the Iceberg metadata in a $database_name/$table_name subfolder under this directory. It also uses a version-hint.text file that contains the version number for the most recent metadata file version. See Iceberg Support on OCI Data Flow for the example on Github.

The sample table db.icebergTablePy was created using a warehouse folder, named iceberg, in OCI bucket my-iceberg-bucket. The storage layout on OCI for table icebergTablePy is shown below:

Description of example_3_table_v1.png follows

Create an external table for table db.icebergTablePy as follows:

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name       => 'iceberg_parquet_time_dim3',
    credential_name  => 'OCI_CRED',
    file_uri_list    => '',
    format           =>'{"access_protocol":{"protocol_type":"iceberg",
        "protocol_config":{"iceberg_catalog_type": "hadoop",
        "iceberg_warehouse":"https://objectstorage.uk-cardiff-1.oraclecloud.com/n/my-tenancy/b/my-iceberg-bucket/o/iceberg",
        "iceberg_table_path": "db.icebergTablePy"}}}');
END;
/

Query an Iceberg table on OCI using the location of the root metadata file

We can query the Iceberg table described in the previous section by directly using the URL for the metadata file, as follows:

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
    table_name       => 'iceberg_parquet_time_dim4',
    credential_name  => 'OCI_CRED',
    file_uri_list    => 'https://objectstorage.uk-cardiff-1.oraclecloud.com/n/my-tenancy/b/my-iceberg-bucket/o/iceberg/db/icebergTablePy/metadata/v2.metadata.json',
    format           =>'{"access_protocol":{"protocol_type":"iceberg"}}'
    );
  END;
/

In this example, we use the file_uri_list parameter to specify the URI for the metadata file using the native OCI URI format. When using the metadata file URI, the external table always queries the latest snapshot stored in the specific file. Subsequent updates that generate new snapshots and new metadata files are not accessible to the query.

For more information on the native OCI URI format, see Cloud Object Storage URI Formats.