Publish Lineage Information from Custom Applications to OCI Data Catalog
In this tutorial, you set up a data processing application to push and publish data lineage to Data Catalog. Key tasks include how to:
Set up Data Catalog to accept data lineage.
Add the Openlineage-spark plugin to the Spark application to generate data lineage.
Publish data lineage to Data Catalog.
Data lineage indicates the journey that data takes as it flows from data sources to consumption. Through metadata, data consumers can understand and visualize the transformations that the data went through in the data pipelines.
Data lineage can be generated by any data processing application or service. Most standard data processing applications support generating and publishing data lineage.
Data Catalog provides options to capture data lineage from different services, including
A Data Catalog instance. For more information, see Creating a Data Catalog Instance. You don't need to be a catalog administrator, but the following IAM policy is required:
allow group lineage-group to CATALOG_LINEAGE_IMPORT in tenancy where all {target.catalog.id = <catalog-ocid>, target.data-asset.key=<data-asset-key>}
If you have administrative rights to your account, skip the rest of this section. Otherwise, have your administrator add the following policy to your account:
allow group <the-group-your-username-belongs> to manage all-resources in compartment catalog-compartment
In this section, you set up the data processing applications to push data lineage to Data Catalog.
The following task is an example of a Spark application.
Open the navigation menu and select Analytics & AI. Under Data Lake, select Data Catalog.
Fill in the following information:
Name: Lineage - Sales application
For Type, select Custom Lineage Provider.
Click Create.
Add Openlineage-spark Plugin to the Spark Application to Generate Data Lineage
OpenLineage provides an Apache Spark plugin that binds to the Spark context and generates data lineage from it. This plugin can be extended to publish the data lineage to Data Catalog. The following snippets provide the plugin extension code and the spark-submit options to invoke the plugin.
package io.openlineage.client.transports;

public class OciTransportBuilder implements TransportBuilder {
    @Override
    public String getType() {
        return "oci";
    }

    @Override
    public TransportConfig getConfig() {
        return new OciConfig();
    }

    @Override
    public Transport build(TransportConfig config) {
        return new OciTransport((OciConfig) config);
    }
}
Spark Submit Options
--packages "io.openlineage:openlineage-spark:1.8.0,com.oracle.oci.sdk:oci-java-sdk-common:{oci-sdk-version},com.oracle.oci.sdk:oci-java-sdk-common-httpclient-jersey:{oci-sdk-version},com.oracle.pic.dcat:datacatalog-java-client:{oci-sdk-version},io.openlineage:openlineage-oci-extension:{generated-from-above-mentioned-code-snippet}"
--conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
--conf "spark.openlineage.debugFacet=enabled"
--conf "spark.openlineage.transport.type=oci"
--conf "spark.openlineage.transport.catalogId={catalog instance OCID}"
--conf "spark.openlineage.transport.dataAssetKey={UUID of the dataAsset registered in Data catalog}"
--conf "spark.openlineage.transport.authProfile={user authentication profile}"
--conf "spark.openlineage.transport.authType={optionally specify security_token as auth type}"
--conf "spark.openlineage.application.name={application name to display on catalog UI}"
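The options above can also be assembled programmatically, which helps when the catalog OCID or data asset key differs per environment. The following is a minimal sketch in Python, not part of the plugin: the function name build_spark_submit_args and its parameters are hypothetical, and the sample values are placeholders to replace with your own.

```python
import shlex


def build_spark_submit_args(packages, transport_conf, app_name, script):
    """Return a spark-submit argument list that enables the OpenLineage listener.

    transport_conf holds the transport settings shown above without their
    prefix, for example {"catalogId": "...", "dataAssetKey": "..."}.
    """
    conf = {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "oci",
        "spark.openlineage.application.name": app_name,
    }
    # Prefix every transport setting with spark.openlineage.transport.
    for key, value in transport_conf.items():
        conf[f"spark.openlineage.transport.{key}"] = value

    args = ["spark-submit", "--packages", ",".join(packages)]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


# Placeholder values for illustration only.
example_args = build_spark_submit_args(
    packages=["io.openlineage:openlineage-spark:1.8.0"],
    transport_conf={
        "catalogId": "<catalog-ocid>",
        "dataAssetKey": "<data-asset-key>",
        "authProfile": "DEFAULT",
    },
    app_name="Sales application",
    script="sales_job.py",
)
# Print a shell-quoted command line for review before running it.
print(shlex.join(example_args))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with values that contain spaces.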
Publish Data Lineage to Data Catalog
To publish data lineage to Data Catalog, you must create an IAM user and set up authentication for this user on the system where the data processing applications run. This document provides details on the available authentication methods to connect to OCI services. Specify the following IAM policy for the user:
allow group lineage-group to CATALOG_LINEAGE_IMPORT in tenancy where all {target.catalog.id = <catalog-ocid>, target.data-asset.key=<data-asset-key>}
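If you maintain one such policy per catalog or data asset, the statement can be generated from a template. The following sketch only fills in the statement shown above. The function name lineage_import_policy is hypothetical, and the statement text mirrors this tutorial rather than any validated policy grammar, so review the output before applying it.

```python
def lineage_import_policy(group, catalog_ocid, data_asset_key):
    """Fill the policy statement from this tutorial with concrete values."""
    return (
        f"allow group {group} to CATALOG_LINEAGE_IMPORT in tenancy "
        f"where all {{target.catalog.id = {catalog_ocid}, "
        f"target.data-asset.key={data_asset_key}}}"
    )


# Placeholder values for illustration only.
print(lineage_import_policy("lineage-group", "<catalog-ocid>", "<data-asset-key>"))
```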
For any other data processing application in your ecosystem, follow the previous steps.
Only the data lineage generation step varies depending on the data processing application. You can look up the OpenLineage integrations page to verify whether an OpenLineage plugin already exists for your application. Otherwise, you can write your own implementation to produce the data lineage payload in OpenLineage format. The following is sample code to call the importLineage endpoint of the OCI Data Catalog service.
import oci

# Create a default config using DEFAULT profile in default location
# Refer to
# https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm#SDK_and_CLI_Configuration_File
# for more info
config = oci.config.from_file()

# Initialize service client with default config file
data_catalog_client = oci.data_catalog.DataCatalogClient(config)

# Send the request to service, some parameters are not required, see API
# doc for more info
import_lineage_response = data_catalog_client.import_lineage(
    catalog_id="ocid1.test.oc1..<unique_ID>EXAMPLE-catalogId-Value",
    data_asset_key="EXAMPLE-dataAssetKey-Value",
    import_lineage_details=oci.data_catalog.models.ImportLineageDetails(
        lineage_payload="openlineage-payload"),
    opc_retry_token="EXAMPLE-opcRetryToken-Value",
    opc_request_id="AOR2TTPR06HMXDHS3NZU<unique_ID>")

# Get the data from response
print(import_lineage_response.data)
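The sample above passes a placeholder string as the lineage payload. If no OpenLineage plugin exists for your application, one way to produce a payload is to build an OpenLineage run event by hand. The following is a minimal sketch using only the Python standard library. The function name build_run_event, the producer URI, and the dataset names are hypothetical, and the schemaURL version is an assumption to check against the OpenLineage specification you target. Whether Data Catalog expects the serialized JSON string exactly as shown is also an assumption to confirm in the API documentation.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed spec version; verify against the OpenLineage specification you target.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"


def build_run_event(job_namespace, job_name, inputs, outputs,
                    event_type="COMPLETE", run_id=None,
                    producer="https://example.com/sales-application"):
    """Build an OpenLineage run event as a plain dictionary.

    inputs and outputs are lists of (namespace, name) tuples naming datasets.
    """
    def dataset(namespace, name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ns, name) for ns, name in inputs],
        "outputs": [dataset(ns, name) for ns, name in outputs],
        "producer": producer,
        "schemaURL": SCHEMA_URL,
    }


# Hypothetical job that reads one table and writes another.
event = build_run_event(
    job_namespace="sales",
    job_name="daily_orders_load",
    inputs=[("example-source", "raw.orders")],
    outputs=[("example-target", "curated.orders")],
)
payload = json.dumps(event)
print(payload)
```

Under these assumptions, the resulting payload string would take the place of "openlineage-payload" in the import_lineage call.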