Use Full-Text Search on Files in Object Storage

The PL/SQL package DBMS_CLOUD enables you to build a text index on the object store files, which allows you to search the text and use wildcards with your search.

About Full-Text Search on Files in Object Storage

You can create a text index on files in object storage. A text index allows you to perform a word-based search across very large data sets in object store.

DBMS_CLOUD provides fast and efficient ways to manage data in object store. The DBMS_CLOUD APIs let you create, copy, download, delete, and traverse files present in object store. When you define external tables you can run SQL queries on data stored in your object store (or with hybrid partitioned external tables, across data in your database and in object store). When you use DBMS_CLOUD to define a text index, this allows you to search your data for text and use wildcards.

Autonomous Database support for word-based search works for commonly used data formats, for example CSV or JSON and with formatted documents (binary), for example PDF and DOC (MS Word) formats. You can configure a refresh rate that indicates the frequency in minutes at which the index is refreshed for any new uploads or deletes.

A local table with the standard suffix INDEX_NAME$TXTIDX is created when you create an index on the object storage, and you can utilize the table to perform a search using the CONTAINS keyword.

See Indexing with Oracle Text for more information.

Create a Text Index on Object Storage Files

Use DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX to create a text index on files in object storage.

Formatted documents (binary) are supported when you specify the binary_files format option with DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX.

You can include a stop word list when you specify the stop_words format option with DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX.

See Indexing with Oracle Text for more information on Oracle Text stop words and working with binary files.

  1. Create a credential object to access the source location.

    See CREATE_CREDENTIAL Procedure for more information.

  2. Run the DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX procedure to create a text index on the object storage files.
    BEGIN 
    DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX (
            credential_name => 'DEFAULT_CREDENTIAL',
            location_uri    => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/ts_data/'
            index_name      => 'EMP',
            format          => JSON_OBJECT ('refresh_rate' value 10)
    );
    END;
    /

    This example creates a text index EMP on the object storage files located at the URI specified in the location_uri parameter. The refresh_rate option in the format parameter specifies that the index EMP is refreshed at an interval of 10 minutes.

    This creates a local table INDEX_NAME$TXTIDX. You can utilize the table INDEX_NAME$TXTIDX to perform a search using CONTAINS.

    For example:

    
     SELECT object_name FROM EMP$TXTIDX
          WHERE CONTAINS(object_name,'king') > 0;
    

    This query returns the object or file names that contain the string king.

    See Text Index Reference Table for more information.

    You can query an external table using the EXTERNAL MODIFY clause to retrieve the actual records.

    
    SELECT * FROM EMPEXTTAB EXTERNAL MODIFY ((location_url object_name));
    
    Note

    The external table EMPEXTTAB is a sample external table that is created on the same location_url.

    See Query External Data with Autonomous Database for more information.

    See CREATE_EXTERNAL_TEXT_INDEX Procedure for more information.

    See Configure Policies and Roles to Access Resources for more information.

Drop an Index on the Cloud Storage Files

Use the DBMS_CLOUD.DROP_EXTERNAL_TEXT_INDEX procedure to drop a text index on object storage files.

Run the DBMS_CLOUD.DROP_EXTERNAL_TEXT_INDEX procedure to drop a text index on files in object storage.

BEGIN 
DBMS_CLOUD.DROP_EXTERNAL_TEXT_INDEX (
        index_name => 'EMP',
);
END;
/

This example drops the EMP text index.

See DROP_EXTERNAL_TEXT_INDEX Procedure for more information.

Text Index Reference Table

A local table is created within your database with a standard suffix INDEX_NAME$TXTIDX. This table is created internally when you run DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX.

You can query the INDEX_NAME$TXTIDX table to search for a string using the CONTAINS keyword. For example, when you call DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX procedure the INDEX_NAME value as EMP, this creates the EMP$TXTIDX the text reference table.

The text reference table has the following columns:
  • object_name: is the file name on the object storage that contains the searched text string.

  • object_path: is the object storage bucket or folder URI that contains the object storage file.

  • mtime: is the last modified timestamp of the object storage file. This is the time when the file was last accessed by DBMS_CLOUD.

For example:
SELECT object_path, object_name FROM EMP$TXTIDX WHERE CONTAINS(OBJECT_NAME, 'king') > 0;
 
OBJECT_PATH                                                                                      OBJECT_NAME
------------------------------------------------------------------------------------------      ------------------------------------
https://objectstorage.us-phoenix-1.oraclecloud.com/n/example1/b/adbs_data_share/o/ts_data/       data_2_20221026T195313585601Z.json

This query returns the file names and location URI on the object storage which contains the text string king, in either upper or lowercase.

SELECT object_name, mtime FROM EMP$TXTIDX;
 
OBJECT_NAME                     MTIME
----------------------------- -------------------------------------
data_1_20220531T165402Z.json    31-MAY-22 04.54.02.979000 PM +00:00
data_1_20220531T165427Z.json    31-MAY-22 04.54.27.997000 PM +00:00

This query returns file name and last modified timestamp of the object files on which the index EMP is created.

Monitor Text Index Creation

When you run DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX the text index creation operation is logged in the ALL_SCHEDULER_JOB_RUN_DETAILS view.

You can query the ALL_SCHEDULER_JOB_RUN_DETAILS view to obtain the status and any error reported by the index creation job.

The name of the DBMS_SCHEDULER job is derived from the INDEX_NAME parameter specified when you call DBMS_CLOUD.CREATE_EXTERNAL_TEXT_INDEX.

To query the ALL_SCHEDULER_JOB_RUN_DETAILS view, you must be logged in as the ADMIN user or have READ privilege on the ALL_SCHEDULER_JOB_RUN_DETAILS view.

For example, the following SELECT statement with a WHERE clause on job_name shows the run details for the job:

SELECT status, additional_info 
   FROM all_scheduler_job_run_details WHERE LOWER(job_name) = LOWER('index_name$JOB');

You can also query for the existence of an index creation scheduler job.

For example:
SELECT status 
    FROM all_scheduler_jobs where LOWER(job_name) = LOWER('index_name$JOB');

See CREATE_EXTERNAL_TEXT_INDEX Procedure for more information.