Publish Lineage Information from Custom Applications to OCI Data Catalog

In this tutorial, you set up a data processing application to publish data lineage to Data Catalog. Key tasks include how to:

  • Set up Data Catalog to accept data lineage.
  • Add the openlineage-spark plugin to the Spark application to generate data lineage.
  • Publish data lineage to Data Catalog.

Data lineage indicates the journey that data takes as it flows from data sources to consumption. Through metadata, data consumers can understand and visualize the transformations that the data went through in the data pipelines.
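
As a rough illustration of this idea, lineage can be modeled as a directed graph whose edges point from each dataset to the datasets derived from it. The sketch below is not a Data Catalog API. The dataset names and the `upstream_of` helper are hypothetical and only show how a consumer might trace a dataset back to its sources.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source dataset, derived dataset).
EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("raw.customers", "staging.customers_clean"),
    ("staging.orders_clean", "reports.revenue_by_customer"),
    ("staging.customers_clean", "reports.revenue_by_customer"),
]


def upstream_of(dataset, edges):
    """Return every dataset that `dataset` was directly or indirectly derived from."""
    # Index the edges by target so each dataset maps to its direct sources.
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)

    # Depth-first walk from the dataset back toward its sources.
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(upstream_of("reports.revenue_by_customer", EDGES)))
```

Walking the same edges in the opposite direction would answer the complementary question of which downstream datasets are affected when a source changes.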

Data lineage can be generated by any data processing application or service. Most standard data processing applications support generating and publishing data lineage.

Data Catalog provides options to capture data lineage from different services.

Standard Practice for Capturing Data Lineage

To understand the standard for collecting and analyzing data lineage, see OpenLineage.

OpenLineage Specification

According to the OpenLineage specification, the data lineage generated by any data processing application should be represented as follows:

OpenLineage Format

{
    "eventTime": "2019-08-24T14:15:22Z",
    "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
    "schemaURL": "https://openlineage.io/spec/0-0-1/OpenLineage.json",
    "eventType": "START|RUNNING|COMPLETE|ABORT|FAIL|OTHER",
    "run": {
        "runId": "78c33d18-170c-44d3-a227-b3194f134f73",
        "facets": {
            "property1": {
                "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
            },
            "property2": {
                "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
            }
        }
    },
    "job": {
        "namespace": "my-scheduler-namespace",
        "name": "myjob.mytask",
        "facets": {
            "property1": {
                "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                "_deleted": true
            },
            "property2": {
                "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                "_deleted": true
            }
        }
    },
    "inputs": [
        {
            "namespace": "my-datasource-namespace",
            "name": "instance.schema.table",
            "facets": {
                "property1": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                    "_deleted": true
                },
                "property2": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                    "_deleted": true
                }
            },
            "inputFacets": {
                "property1": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
                },
                "property2": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
                }
            }
        }
    ],
    "outputs": [
        {
            "namespace": "my-datasource-namespace",
            "name": "instance.schema.table",
            "facets": {
                "property1": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                    "_deleted": true
                },
                "property2": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet",
                    "_deleted": true
                }
            },
            "outputFacets": {
                "property1": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
                },
                "property2": {
                    "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
                    "_schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/BaseFacet"
                }
            }
        }
    ]
}
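
The example above lists every section, including the optional facets. A much smaller event is often enough to describe a run. The following sketch builds such an event with only the Python standard library. The `build_run_event` helper and the producer URL are illustrative assumptions, the `schemaURL` value is copied from the example above and should be checked against the specification version that you target, and the allowed event types are the ones listed in the example.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed values for illustration only; replace them with your own.
PRODUCER = "https://example.com/my-spark-app"
# Copied from the example above; verify it against the spec version you use.
SCHEMA_URL = "https://openlineage.io/spec/0-0-1/OpenLineage.json"

# Event types as listed in the example above.
EVENT_TYPES = {"START", "RUNNING", "COMPLETE", "ABORT", "FAIL", "OTHER"}


def build_run_event(event_type, job_namespace, job_name,
                    inputs=(), outputs=(), run_id=None):
    """Build an OpenLineage-style run event as a plain dictionary.

    `inputs` and `outputs` are sequences of (namespace, name) pairs.
    """
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "eventType": event_type,
        # Generate a fresh run ID unless the caller supplies one.
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


if __name__ == "__main__":
    event = build_run_event(
        "COMPLETE",
        "my-scheduler-namespace",
        "myjob.mytask",
        inputs=[("my-datasource-namespace", "instance.schema.table")],
        outputs=[("my-datasource-namespace", "instance.schema.other_table")],
    )
    print(json.dumps(event, indent=4))
```

Events that describe the same run share one `runId`, so pass the `run_id` of the START event when you build the matching COMPLETE event.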

Before You Begin

To successfully perform this tutorial, you must have:

If you have administrative rights to your account, skip the rest of this section. Otherwise, have your administrator add the following policy to your account:
allow group <the-group-your-username-belongs> to manage all-resources in compartment catalog-compartment

See Common Policies for more examples.

Note

In the next section, you create a compartment for your Data Catalog instances, named catalog-compartment.