Reading data from OCI Autonomous Data Warehouse(ADW)

Overview

Insights can read data from OCI ADW tables using user-provided SQL queries. Insights provides two Data Readers to do this:

  • ADWNativeDataReader : Read data from ADW tables using Native Execution Engine

  • ADWDaskDataReader: Read data from ADW tables using Dask Execution Engine

The general workflow for generating metrics when using the ADW reader is:

  1. The OCI ADW instance contains tables to read data from.

  2. An Appropriate authentication method is enabled. Currently ADW Readers only support mTLS.

  3. Wallet files are made available to ADW Readers (either locally or using OCI Vault).

  4. Use the Insights Config API or the Insights Builder API to configure ADW Reader.

  5. Run the Builder API to generate the Insights Profile/Test Results.

Prerequisites and Installation

This section describes the prerequisites, installation and setup of ADW Reader

Prerequisites

  • OCI Autonomous Data Warehouse instance

  • Configure ADW instance to support mTLS Connection

    See here for details on how to enable `mTLS Connection `

  • Wallet files are either made available locally or using OCI Vault

  • If using OCI Vault, appropriate IAM policies are configured to allow custom application to read wallet files from Vault

Installation

See here for detailed instructions on installing the library

ADW library Installation

ADW Reader Authentication

ADW Readers connect to an OCI ADW instance using mTLS connection. This requires the use of wallet files which contain the required authentication information. Wallet files can be downloaded from OCI ADW and need to be made available to Insights library in one of the following ways:

  1. Local file system.

  2. OCI Vault.

Using Local file system

Insights ADW Reader can read wallet files from a local directory. The steps are:

  1. Download the wallet files from OCI ADW and save them in a local directory in an unzipped format.

  2. When using Insights Config API, add the wallet location, database username and database password in the reader configuration as follows:

    "reader": {
       "type": "ADWNativeDataReader",
       "params": {
         "auth": {
           "wallet_location": "/Users/your_name/Downloads/Wallet_folder",
           "db_username":"<username>",
           "db_password": "<password>"
         },
        }
    }
    
    • wallet_location: Set the path of the directory that contains the wallet files.

    • db_username: Set the username for the ADW instance.

    • db_password: Set the password for the ADW instance.

Note

  • We recommend using OCI Vault for production scenarios since using the local file system requires the username and password to be set in the configuration file

Using OCI Vault

Insights ADW Reader can read wallet files from OCI Vault. The steps are as below:

  1. Upload each wallet file to OCI Vault. This creates a secret for each file

  2. Write the appropriate policy to allow the principal (user/resource) to read secrets from OCI Vault

  3. While using Insights Config API, add OCI Vault details in the reader config as shown below:

    "reader": {
       "type": "ADWNativeDataReader",
       "params": {
         "auth": {
           "vault_ocid": "<vault_ocid>",
           "vault_tenancy_ocid": "<tanancy_ocid>"
         },
        }
    }
    
    • vault_ocid: The OCID of the vault that contains the secret.

    • vault_tenancy_ocid: The OCID of the tenancy where vault resides

  4. By default, The ADW Reader expects the following wallet files to be present in OCI Vault in base 64 encoded format:

[
    "cwallet.sso",
    "ewallet.pem",
    "ewallet.p12",
    "keystore.jks",
    "ojdbc.properties",
    "tnsnames.ora",
    "truststore.jks",
    "sqlnet.ora"
]
  1. We recommend storing the username as db_username and password as db_password as secret names in OCI Vault for production scenarios for better security. However, for development use cases one can explicitly specify the username and password in the monitor configuration.

6. If the secret names in OCI Vault are not same as default file names , provide the names explicitly in the configuration. As shown below, user has uploaded the cwallet.sso file as custom-cwallet.sso in OCI Vault. In this case, the ADW Reader is configured to read the custom wallet file names from OCI Vault.

"reader": {
   "type": "ADWNativeDataReader",
   "params": {
     "auth": {
       "vault_ocid": "<vault_ocid>",
        "vault_tenancy_ocid": "<tenancy_ocid>",
        "vault_secret_names": {
            "db_username": "custom_db_username",
            "db_password": "custom_db_password",
            "cwallet.sso": "custom-cwallet.sso",
            "ewallet.pem": "custom-ewallet.pem",
            "ewallet.p12": "custom-ewallet.p12",
            "keystore.jks": "custom-keystore.jks",
            "ojdbc.properties": "custom-ojdbc.properties",
            "tnsnames.ora": "custom-tnsnames.ora",
            "truststore.jks": "custom-truststore.jks",
            "sqlnet.ora": "custom-sqlnet.ora"
        }
     }
    }
}

Read data using ADW readers

ADW readers read data from OCI cloud hosted Autonomous Data Warehouse using SQL query as mandatory input, which in return, output the data into the dataframe on which you can run the data analytics specifying the metrics, test results

Read data using ADW Native Reader

Using Builder API

  1. Begin by loading the required libraries and modules:

from mlm_insights.builder.insights_builder import InsightsBuilder
  1. Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'
  1. Create ADWNativeDataReader instance using Vault Authentication. For other authentication options See ADW Reader Authentication.

from mlm_insights.mlm_native.readers.adw_native_data_reader import ADWNativeDataReader

     params = {
           "auth": {
            "vault_ocid": "<vault_ocid>",
            "vault_tenancy_ocid": "<tanancy_ocid>"
        },
          'sql_query': "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
          'chunksize': 10000
        }
adw_data_reader = ADWNativeDataReader(**params)
  1. Create Builder component

run_result = InsightsBuilder().
    with_input_schema(get_input_schema()).
    with_metrics(metrics=get_metrics()).
    with_reader(reader=adw_data_reader).
    build().run()
  1. The resultant profile can be extracted from the run result object.

profile = run_result.profile

Using Config API

  1. Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader
  1. Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'
  1. Create the Insights builder using config and reference profile (optional)

run_result  = InsightsConfigReader(config_location=<location of monitor config path>)
                                .get_builder()
                                .build()
                                .run()
  1. The resultant profile can be extracted from the run result object.

profile = run_result.profile
  1. A sample Json Configuration for ADW Native Data reader

sample_config.json
{
    "input_schema": {
      "salary": {
        "data_type": "INTEGER",
        "variable_type": "CONTINUOUS",
        "column_type": "INPUT"
      },
      "fn": {
        "data_type": "STRING",
        "variable_type": "NOMINAL",
        "column_type": "INPUT"
      }
    },
    "dataset_metrics": [
      {
        "type": "RowCount"
      }
    ],
    "feature_metrics": {
      "salary": [
        {
          "type": "Count"
        },
        {
          "type": "Quartiles"
        },
        {
          "type": "FrequencyDistribution"
        },
        {
          "type": "ProbabilityDistribution"
        },
        {
          "type": "DistinctCount"
        },
        {
          "type": "Skewness"
        },
        {
          "type": "Variance"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "KolmogorovSmirnov"
        }
      ],
      "fn": [
        {
          "type": "Count"
        },
        {
          "type": "TopKFrequentElements"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "ChiSquare"
        }
      ]
    },
    "reader": {
      "type": "ADWNativeDataReader",
      "params": {
        "auth": {
          "wallet_location": "",
          "db_username":"",
          "db_password": ""
        },
        "sql_query": "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
        "chunksize": 10000
      }
    }
  }

The user API parameters of ADWNativeDataReader are defined in the following table

Parameter

Description

Examples

sql_query

  • ADW readers read data from OCI cloud hosted Autonomous Data Warehouse using SQL query as mandatory input parameter.

  • The datatypes provided in the input schema are honoured over the datatypes of the SQL schema in the query view.

  • You can query the data using valid Oracle SQL compatible with oracle-db python driver

  • Required : Yes

query example

select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees

bind_variables

  • Include the place holders for bind variables in the SQL query and mention the bind variables as key value parameters.

  • Required: No

bind variables example
  • sql_query: ``select * EMPLOYEES where hire_date > TO_DATE(:input_date,’DD-MM-YY’) ``

  • bind_variables:{ "input_date" : "13-01-08"}

auth

  • Use this parameter to provide authentication details required to connect to ADW to read data

  • The auth should contain a configuration of:

  • wallet files

  • database username

  • database password

ADW Reader Authentication

chunksize

  • To read large dataset in a paginated format, Specify the number of rows to be read in a single page

  • Required: No

  • Default: If not provided all the data present in the dataset is read in a single page

chunksize:100000

Read data using ADW Dask Reader

Using Builder API

  1. Begin by loading the required libraries and modules:

from mlm_insights.builder.insights_builder import InsightsBuilder
  1. Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'
  1. Create ADWDaskDataReader instance using Vault Authentication. For other authentication options See ADW Reader Authentication.

from mlm_insights.mlm_dask.readers.adw_dask_data_reader import ADWDaskDataReader

     params = {
          "auth": {
            "vault_ocid": "<vault_ocid>",
            "vault_tenancy_ocid": "<tanancy_ocid>"
        },
          'sql_query': "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
          'index_col': "en",
          'partition_size': 10240
        }
adw_data_reader = ADWDaskDataReader(**params)
  1. Create Builder component

run_result = InsightsBuilder().
    with_input_schema(get_input_schema()).
    with_metrics(metrics=get_metrics()).
    with_reader(reader=adw_data_reader).
    build().run()
  1. The resultant profile can be extracted from the run result object.

profile = run_result.profile

Using Config API

  1. Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader
  1. Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation)

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'
  1. Create the Insights builder using config and reference profile (optional)

run_result  = InsightsConfigReader(config_location=<location of monitor config path>)
                                .get_builder()
                                .build()
                                .run()
  1. The resultant profile can be extracted from the run result object.

profile = run_result.profile
  1. A sample Json Configuration for ADW Native Data reader

sample_config.json
{
    "input_schema": {
      "salary": {
        "data_type": "INTEGER",
        "variable_type": "CONTINUOUS",
        "column_type": "INPUT"
      },
      "fn": {
        "data_type": "STRING",
        "variable_type": "NOMINAL",
        "column_type": "INPUT"
      }
    },
    "dataset_metrics": [
      {
        "type": "RowCount"
      }
    ],
    "feature_metrics": {
      "salary": [
        {
          "type": "Count"
        },
        {
          "type": "Quartiles"
        },
        {
          "type": "FrequencyDistribution"
        },
        {
          "type": "ProbabilityDistribution"
        },
        {
          "type": "DistinctCount"
        },
        {
          "type": "Skewness"
        },
        {
          "type": "Variance"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "KolmogorovSmirnov"
        }
      ],
      "fn": [
        {
          "type": "Count"
        },
        {
          "type": "TopKFrequentElements"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "ChiSquare"
        }
      ]
    },
    "engine_detail": {
      "engine_name": "dask"
    },
    "reader": {
      "type": "ADWDaskDataReader",
      "params": {
        "auth": {
          "wallet_location": "",
          "db_username":"",
          "db_password": ""
        },
        "sql_query": "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
        "index_col": "en",
        "partition_size": 10240
      }
    }
  }

The user API parameters of ADWDaskDataReader are defined in the following table

Parameter

Description

Examples

sql_query

  • The ADW readers read data from OCI Cloud hosted Autonomous Data Warehouse using SQL query as mandatory input parameter.

  • The datatypes provided in the input schema are honoured over the datatypes of the SQL schema in the query view.

  • One can query the data using valid Oracle SQL compatible with oracle-db python driver

  • Required: Yes

query example

select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees

bind_variables

  • Include the place holders for bind variables in the SQL query and mention the bind variables as key value parameters

  • Required: No

bind variables example
  • sql_query: ``select * EMPLOYEES where hire_date > TO_DATE(:input_date,’DD-MM-YY’) ``

  • bind_variables:{ "input_date" : "13-01-08"}

auth

  • Use this parameter to provide authentication details required to connect to ADW to read data

  • The auth should contains a configuration of

  • wallet files

  • database username

  • database password

ADW Reader Authentication

index_col

  • To read a large dataset efficiently in a partitioned manner , index_col serve as the partition key.

  • index_col column must be present in the select query view.

  • index_col must be NUMBER as its SQL datatype

  • Required: No

  • Default: If not provided, all the data is read using a single partition only.

index column example
  • sql_query:select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees

  • index_col:en

no_of_partitions

  • no_of_partitions is used to specify the number of partitions required to read a large dataset.

  • index_col must be specified to use the number of partitions.

  • Required: No

  • Default: If not provided, a default value of 1 is used

number of partition example
  • sql_query:select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees

  • index_col:en

  • no_of_partitions :3

partition_size

  • partition_size is used to specify the memory size of a partition required to read a large dataset

  • index_col must be specified to use the partition_size

  • The value of the partition size can take a numeric values like 10240 for 10 MB or string values like 10 MiB for 10 MB.

  • The number of partitions are calculated dynamically using size of the data read and partition_size value.

  • Required: No

  • Default: If not provided, a default value of 256 MiB is used

partition size example
  • sql_query:select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees

  • index_col:en

  • partition_size :10240

Note

  • If the partition_size and no_of_partitions are not provided, the number of partitions are calculated using

the default value of partition_size which is 256 MiB.

  • If the partition_size and no_of_partitions are both provided no_of_partitions is honoured.