Reading data from OCI Autonomous Data Warehouse(ADW)¶

Overview
Prerequisites and Installation
ADW Reader Authentication
- Using Local file system
- Using OCI Vault
Read data using ADW readers
- Read data using ADW Native Reader
- Read data using ADW Dask Reader

Overview¶

Insights can read data from OCI ADW tables using user-provided SQL queries. Insights provides two Data Readers to do this:

ADWNativeDataReader : Read data from ADW tables using Native Execution Engine

ADWDaskDataReader: Read data from ADW tables using Dask Execution Engine

The general workflow for generating metrics when using the ADW reader is:

The OCI ADW instance contains tables to read data from.
An Appropriate authentication method is enabled. Currently ADW Readers only support mTLS.
Wallet files are made available to ADW Readers (either locally or using OCI Vault).
Use the Insights Config API or the Insights Builder API to configure ADW Reader.
Run the Builder API to generate the Insights Profile/Test Results.

Prerequisites and Installation¶

This section describes the prerequisites, installation and setup of ADW Reader

Prerequisites

OCI Autonomous Data Warehouse instance
Configure ADW instance to support mTLS Connection
See here for details on how to enable `mTLS Connection `
Wallet files are either made available locally or using OCI Vault
If using OCI Vault, appropriate IAM policies are configured to allow custom application to read wallet files from Vault

Installation

See here for detailed instructions on installing the library

ADW library Installation

ADW Reader Authentication¶

ADW Readers connect to an OCI ADW instance using mTLS connection. This requires the use of wallet files which contain the required authentication information. Wallet files can be downloaded from OCI ADW and need to be made available to Insights library in one of the following ways:

Local file system.
OCI Vault.

Using Local file system¶

Insights ADW Reader can read wallet files from a local directory. The steps are:

Download the wallet files from OCI ADW and save them in a local directory in an unzipped format.
When using Insights Config API, add the wallet location, database username and database password in the reader configuration as follows:
"reader": { "type": "ADWNativeDataReader", "params": { "auth": { "wallet_location": "/Users/your_name/Downloads/Wallet_folder", "db_username":"<username>", "db_password": "<password>" }, } }
- wallet_location: Set the path of the directory that contains the wallet files.
- db_username: Set the username for the ADW instance.
- db_password: Set the password for the ADW instance.

Note

We recommend using OCI Vault for production scenarios since using the local file system requires the username and password to be set in the configuration file

Using OCI Vault¶

Insights ADW Reader can read wallet files from OCI Vault. The steps are as below:

Upload each wallet file to OCI Vault. This creates a secret for each file
Write the appropriate policy to allow the principal (user/resource) to read secrets from OCI Vault

While using Insights Config API, add OCI Vault details in the reader config as shown below:

"reader": {
   "type": "ADWNativeDataReader",
   "params": {
     "auth": {
       "vault_ocid": "<vault_ocid>",
       "vault_tenancy_ocid": "<tanancy_ocid>"
     },
    }
}

vault_ocid: The OCID of the vault that contains the secret.
vault_tenancy_ocid: The OCID of the tenancy where vault resides

By default, The ADW Reader expects the following wallet files to be present in OCI Vault in base 64 encoded format:

[
    "cwallet.sso",
    "ewallet.pem",
    "ewallet.p12",
    "keystore.jks",
    "ojdbc.properties",
    "tnsnames.ora",
    "truststore.jks",
    "sqlnet.ora"
]

We recommend storing the username as db_username and password as db_password as secret names in OCI Vault for production scenarios for better security. However, for development use cases one can explicitly specify the username and password in the monitor configuration.

6. If the secret names in OCI Vault are not same as default file names , provide the names explicitly in the configuration. As shown below, user has uploaded the cwallet.sso file as custom-cwallet.sso in OCI Vault. In this case, the ADW Reader is configured to read the custom wallet file names from OCI Vault.

"reader": {
   "type": "ADWNativeDataReader",
   "params": {
     "auth": {
       "vault_ocid": "<vault_ocid>",
        "vault_tenancy_ocid": "<tenancy_ocid>",
        "vault_secret_names": {
            "db_username": "custom_db_username",
            "db_password": "custom_db_password",
            "cwallet.sso": "custom-cwallet.sso",
            "ewallet.pem": "custom-ewallet.pem",
            "ewallet.p12": "custom-ewallet.p12",
            "keystore.jks": "custom-keystore.jks",
            "ojdbc.properties": "custom-ojdbc.properties",
            "tnsnames.ora": "custom-tnsnames.ora",
            "truststore.jks": "custom-truststore.jks",
            "sqlnet.ora": "custom-sqlnet.ora"
        }
     }
    }
}

Read data using ADW readers¶

ADW readers read data from OCI cloud hosted Autonomous Data Warehouse using SQL query as mandatory input, which in return, output the data into the dataframe on which you can run the data analytics specifying the metrics, test results

Read data using ADW Native Reader¶

Using Builder API¶

Begin by loading the required libraries and modules:

from mlm_insights.builder.insights_builder import InsightsBuilder

Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'

Create ADWNativeDataReader instance using Vault Authentication. For other authentication options See ADW Reader Authentication.

from mlm_insights.mlm_native.readers.adw_native_data_reader import ADWNativeDataReader

     params = {
           "auth": {
            "vault_ocid": "<vault_ocid>",
            "vault_tenancy_ocid": "<tanancy_ocid>"
        },
          'sql_query': "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
          'chunksize': 10000
        }
adw_data_reader = ADWNativeDataReader(**params)

Create Builder component

run_result = InsightsBuilder().
    with_input_schema(get_input_schema()).
    with_metrics(metrics=get_metrics()).
    with_reader(reader=adw_data_reader).
    build().run()

The resultant profile can be extracted from the run result object.

profile = run_result.profile

Using Config API¶

Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader

Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'

Create the Insights builder using config and reference profile (optional)

run_result  = InsightsConfigReader(config_location=<location of monitor config path>)
                                .get_builder()
                                .build()
                                .run()

The resultant profile can be extracted from the run result object.

profile = run_result.profile

A sample Json Configuration for ADW Native Data reader

sample_config.json

{
    "input_schema": {
      "salary": {
        "data_type": "INTEGER",
        "variable_type": "CONTINUOUS",
        "column_type": "INPUT"
      },
      "fn": {
        "data_type": "STRING",
        "variable_type": "NOMINAL",
        "column_type": "INPUT"
      }
    },
    "dataset_metrics": [
      {
        "type": "RowCount"
      }
    ],
    "feature_metrics": {
      "salary": [
        {
          "type": "Count"
        },
        {
          "type": "Quartiles"
        },
        {
          "type": "FrequencyDistribution"
        },
        {
          "type": "ProbabilityDistribution"
        },
        {
          "type": "DistinctCount"
        },
        {
          "type": "Skewness"
        },
        {
          "type": "Variance"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "KolmogorovSmirnov"
        }
      ],
      "fn": [
        {
          "type": "Count"
        },
        {
          "type": "TopKFrequentElements"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "ChiSquare"
        }
      ]
    },
    "reader": {
      "type": "ADWNativeDataReader",
      "params": {
        "auth": {
          "wallet_location": "",
          "db_username":"",
          "db_password": ""
        },
        "sql_query": "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
        "chunksize": 10000
      }
    }
  }

The user API parameters of ADWNativeDataReader are defined in the following table

Parameter	Description	Examples
`sql_query`	ADW readers read data from OCI cloud hosted Autonomous Data Warehouse using SQL query as mandatory input parameter. The datatypes provided in the input schema are honoured over the datatypes of the SQL schema in the query view. You can query the data using valid Oracle SQL compatible with `oracle-db` python driver Required : Yes	query example `select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees`
`bind_variables`	Include the place holders for bind variables in the SQL query and mention the bind variables as key value parameters. Required: No	bind variables example `sql_query`: ``select * EMPLOYEES where hire_date > TO_DATE(:input_date,’DD-MM-YY’) `` `bind_variables`:`{ "input_date" : "13-01-08"}`
`auth`	Use this parameter to provide authentication details required to connect to ADW to read data The auth should contain a configuration of: wallet files database username database password Details are provided in ADW Reader Authentication Required: Yes	ADW Reader Authentication
`chunksize`	To read large dataset in a paginated format, Specify the number of rows to be read in a single page Required: No Default: If not provided all the data present in the dataset is read in a single page	`chunksize`:`100000`

Read data using ADW Dask Reader¶

Using Builder API¶

Begin by loading the required libraries and modules:

from mlm_insights.builder.insights_builder import InsightsBuilder

Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation).

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'

Create ADWDaskDataReader instance using Vault Authentication. For other authentication options See ADW Reader Authentication.

from mlm_insights.mlm_dask.readers.adw_dask_data_reader import ADWDaskDataReader

     params = {
          "auth": {
            "vault_ocid": "<vault_ocid>",
            "vault_tenancy_ocid": "<tanancy_ocid>"
        },
          'sql_query': "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
          'index_col': "en",
          'partition_size': 10240
        }
adw_data_reader = ADWDaskDataReader(**params)

Create Builder component

run_result = InsightsBuilder().
    with_input_schema(get_input_schema()).
    with_metrics(metrics=get_metrics()).
    with_reader(reader=adw_data_reader).
    build().run()

The resultant profile can be extracted from the run result object.

profile = run_result.profile

Using Config API¶

Begin by loading the required libraries and modules:

from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader

Set up the environment variable to enable the thick mode for reading the data from OCI ADW (if it isn’t already set during installation)

import os
os.environ['INSTANT_CLIENT_LIBRARY_PATH'] = '/Users/your_name/Downloads/instantclient_19_16'

Create the Insights builder using config and reference profile (optional)

run_result  = InsightsConfigReader(config_location=<location of monitor config path>)
                                .get_builder()
                                .build()
                                .run()

The resultant profile can be extracted from the run result object.

profile = run_result.profile

A sample Json Configuration for ADW Native Data reader

sample_config.json

{
    "input_schema": {
      "salary": {
        "data_type": "INTEGER",
        "variable_type": "CONTINUOUS",
        "column_type": "INPUT"
      },
      "fn": {
        "data_type": "STRING",
        "variable_type": "NOMINAL",
        "column_type": "INPUT"
      }
    },
    "dataset_metrics": [
      {
        "type": "RowCount"
      }
    ],
    "feature_metrics": {
      "salary": [
        {
          "type": "Count"
        },
        {
          "type": "Quartiles"
        },
        {
          "type": "FrequencyDistribution"
        },
        {
          "type": "ProbabilityDistribution"
        },
        {
          "type": "DistinctCount"
        },
        {
          "type": "Skewness"
        },
        {
          "type": "Variance"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "KolmogorovSmirnov"
        }
      ],
      "fn": [
        {
          "type": "Count"
        },
        {
          "type": "TopKFrequentElements"
        },
        {
          "type": "TypeMetric"
        },
        {
          "type": "ChiSquare"
        }
      ]
    },
    "engine_detail": {
      "engine_name": "dask"
    },
    "reader": {
      "type": "ADWDaskDataReader",
      "params": {
        "auth": {
          "wallet_location": "",
          "db_username":"",
          "db_password": ""
        },
        "sql_query": "select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN, dept.department_name as DN, emp.salary as salary, emp.hire_date as hire_date from EMPLOYEES emp , DEPARTMENTS dept where emp.department_id = dept.department_id",
        "index_col": "en",
        "partition_size": 10240
      }
    }
  }

The user API parameters of ADWDaskDataReader are defined in the following table

Parameter	Description	Examples
`sql_query`	The ADW readers read data from OCI Cloud hosted Autonomous Data Warehouse using SQL query as mandatory input parameter. The datatypes provided in the input schema are honoured over the datatypes of the SQL schema in the query view. One can query the data using valid Oracle SQL compatible with `oracle-db` python driver Required: Yes	query example `select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees`
`bind_variables`	Include the place holders for bind variables in the SQL query and mention the bind variables as key value parameters Required: No	bind variables example `sql_query`: ``select * EMPLOYEES where hire_date > TO_DATE(:input_date,’DD-MM-YY’) `` `bind_variables`:`{ "input_date" : "13-01-08"}`
`auth`	Use this parameter to provide authentication details required to connect to ADW to read data The auth should contains a configuration of wallet files database username database password Details are provided in ADW Reader Authentication Required: Yes	ADW Reader Authentication
`index_col`	To read a large dataset efficiently in a partitioned manner , `index_col` serve as the partition key. `index_col` column must be present in the select query view. `index_col` must be NUMBER as its SQL datatype Required: No Default: If not provided, all the data is read using a single partition only.	index column example `sql_query`:`select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees` `index_col`:`en`
`no_of_partitions`	`no_of_partitions` is used to specify the number of partitions required to read a large dataset. `index_col` must be specified to use the number of partitions. Required: No Default: If not provided, a default value of 1 is used	number of partition example `sql_query`:`select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees` `index_col`:`en` `no_of_partitions` :`3`
`partition_size`	`partition_size` is used to specify the memory size of a partition required to read a large dataset `index_col` must be specified to use the `partition_size` The value of the partition size can take a numeric values like `10240` for 10 MB or string values like `10 MiB` for 10 MB. The number of partitions are calculated dynamically using size of the data read and `partition_size` value. Required: No Default: If not provided, a default value of `256 MiB` is used	partition size example `sql_query`:`select emp.employee_id as en ,emp.first_name as FN, emp.last_name as LN from employees` `index_col`:`en` `partition_size` :`10240`

Note

If the partition_size and no_of_partitions are not provided, the number of partitions are calculated using

the default value of partition_size which is 256 MiB.

If the partition_size and no_of_partitions are both provided no_of_partitions is honoured.