Data Lineage Overview

Data lineage indicates the journey that data takes as it flows from data sources to consumption. Through metadata, data consumers can understand and visualize the transformations that the data went through in the data pipelines.

Supported Data Sources for Lineage

In Data Catalog, lineage is supported for the following data sources:

  • Apache Hive Database
  • Autonomous Data Warehouse
  • Autonomous Transaction Processing
  • IBM DB2
  • Microsoft Azure SQL Database
  • Microsoft SQL Server Database
  • MySQL Database
  • Oracle Database
  • Oracle Object Storage
  • PostgreSQL

Data Lineage

In Data Catalog, you can view lineage for entities and their attributes, that is, both table-level and column-level lineage. Lineage is available for data processed by Data Integration applications, Data Flow applications, or your custom applications. Each one requires configuration, as explained in the sections below.

Data Lineage for Data Integration

To view the lineage in Data Catalog, you should:

When Data Catalog fetches lineage information from a Data Integration workspace, that information describes the data assets and the tasks executed in the applications. If the lineage information refers to a data asset that has no corresponding data asset in the catalog, Data Catalog creates that data asset. The name of the new data asset is the same as the name defined in the Data Integration workspace.

While working with data lineage, note the following:
  • Lineage is available only for data processed by Integration tasks and Data Loader tasks in the Data Integration workspace.

  • Column-level lineage isn't available for tasks with Flatten, Pivot, and Function operators.

Data Lineage for Data Flow

To view the lineage for an application in Data Flow, select the Enable data lineage collection check box in the application configuration in the OCI Data Flow workspace. This setting generates the lineage metadata. The first time lineage metadata is pushed to the catalog, a data asset is automatically created in Data Catalog for the Data Flow service in the same tenancy. The name of this data asset has the format OCI Data Flow – <tenancy name>. See Required IAM Policies for Data Flow Data Asset and Data Flow.

To capture lineage for applications running in Data Flow on a separate tenancy, you must create a data asset for that Data Flow service. Be sure to set the following policies.

The Data Flow data asset is updated at preset intervals as the lineage is updated within Data Flow.

Custom Lineage Ingestion

Data Catalog lets you extend the lineage capability by providing lineage metadata for data that is processed or transformed in applications that Data Catalog does not natively support for lineage harvesting. You do this using the ImportLineage API.

  • Data asset creation for a custom lineage provider: You must create a data asset for every custom lineage provider. Note the data asset key of each such data asset, because the key identifies the lineage provider in the ImportLineage API.

  • Ingesting custom lineage into the catalog: You can ingest lineage metadata into the catalog for data processed in applications or other data processing engines that the OCI Data Catalog service does not natively support for lineage harvesting. Lineage ingestion from Spark applications is supported.

    The ImportLineage API accepts the lineage payload in an OpenLineage-compatible format. For more details about the API, see ImportLineage.

  • Viewing custom ingested lineage in a lineage graph: In the lineage graph of a data entity, users can use a toggle in the UI to highlight paths that were provided by custom lineage providers using the ImportLineage API.
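As an illustration of what an OpenLineage-compatible payload can look like, the following Python sketch builds a minimal run event as a JSON string. It is a hedged example, not the documented request format for ImportLineage: the function name `build_run_event`, the namespace, the dataset names, the producer URI, and the schemaURL version are all assumptions made for the example. Check the ImportLineage reference for the exact payload shape the service expects and for how the data asset key is passed.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example-namespace"):
    """Build a minimal OpenLineage-style COMPLETE run event as a dict.

    All names and URIs here are hypothetical placeholders.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Datasets read by the job.
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        # Datasets written by the job.
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder URI identifying the code that produced this event.
        "producer": "https://example.com/my-custom-lineage-provider",
        # Assumed spec version; use the one your producer targets.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event("daily_load", ["raw.orders"], ["dw.orders"])
payload = json.dumps(event, indent=2)  # JSON text to submit as the lineage payload
print(payload)
```

The resulting string would be submitted through the ImportLineage API together with the data asset key of the custom lineage provider.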

Viewing Data Lineage for an Entity

The lineage represents the flow of data from the source to this target entity.

Note

If a warning icon appears next to the name of a newly created data asset or its folders and entities, you must create a connection to harvest the folders and entities. Harvesting ensures that all attributes of the entities are available in the catalog, because lineage metadata might contain only the attributes that contribute to the lineage.
    1. In the Search field of the Home tab, enter the name of the entity.
    2. On the search results page, select the required entity.
    3. In the entity details page, click the Lineage tab.

    In the lineage graph, the entity on which you launch the lineage is identified by an anchor icon on it. The anchor object can appear anywhere on the lineage graph. The left side of this anchor object shows the lineage and the right side indicates the impact.

  • CLI: This task can't be performed using the CLI.

  • API: Run the FetchEntityLineage operation to fetch lineage for an entity.
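The split between lineage (left of the anchor) and impact (right of the anchor) can be pictured as two graph traversals from the anchor entity: one that follows edges backward and one that follows them forward. The Python sketch below illustrates this on a plain edge list. It is a conceptual example only; the edge list format, the node names, and the functions `reachable` and `lineage_and_impact` are assumptions for illustration and do not reflect the response format of FetchEntityLineage.

```python
from collections import defaultdict, deque


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lineage_and_impact(edges, anchor):
    """Split a directed flow graph into upstream and downstream of anchor."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return reachable(anchor, upstream), reachable(anchor, downstream)


# Hypothetical flow: data nodes connected through process nodes.
edges = [
    ("raw.orders", "task:clean_orders"),
    ("task:clean_orders", "staging.orders"),
    ("staging.orders", "task:aggregate"),
    ("task:aggregate", "dw.daily_sales"),
]
lineage, impact = lineage_and_impact(edges, "staging.orders")
print(sorted(lineage))  # ['raw.orders', 'task:clean_orders']
print(sorted(impact))   # ['dw.daily_sales', 'task:aggregate']
```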

Lineage Graph Visualization

The lineage graph contains process nodes and data nodes that are connected by lines to indicate the flow:

  • Process: Represents the Data Integration task objects, Data Flow applications, or custom applications. When you click a process node, you can find the Actions menu.

    For Data Integration, click Open in Data Integration to view the details of the Data Integration task run in the Data Integration Console.

    For Data Flow applications, click Open in Data Flow to view the details of the application in the Data Flow Console. If the application is in a different tenancy, you must sign in to that OCI tenancy. To do this, copy the link and open it in a separate browser window.

  • Data: Represents the Data Catalog objects. You can expand these nodes to view the column-level lineage. When you click a data node icon, you can find the Actions menu. Click Show object summary to view the summary of the Data Catalog object in a new tab.

    Note

    If Data Catalog does not accurately map a data asset from Data Integration, you might encounter a duplicate data asset in the lineage graph.

Lineage Graph in Data Catalog

Note

The lineage nodes aren't visible in the Safari browser.

Enable the Show property panel toggle to view details such as Name, Path, and Description for a selected node.

When you open the lineage for an entity, you can view the following:
  • The entity-level lineage
  • The columns, by expanding the entity
  • The column-level lineage of a column by selecting the column
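The relationship between the entity-level and column-level views can be sketched in a few lines of Python: entity-level lineage is what remains when column-level edges are collapsed to their parent entities, and selecting a column narrows the view to the edges that end at that column. This is a hedged illustration, not a description of how Data Catalog stores lineage; the entity names, column names, and the helper functions `entity_level` and `sources_of` are hypothetical.

```python
# Each edge maps a source (entity, column) to a target (entity, column).
column_edges = [
    (("staging.orders", "amount"), ("dw.daily_sales", "total_amount")),
    (("staging.orders", "order_date"), ("dw.daily_sales", "sales_date")),
    (("staging.customers", "region"), ("dw.daily_sales", "region")),
]


def entity_level(column_edges):
    """Collapse column-level edges into distinct entity-to-entity edges."""
    return sorted({(src[0], dst[0]) for src, dst in column_edges})


def sources_of(column_edges, entity, column):
    """Return the source columns that feed one selected target column."""
    return sorted(src for src, dst in column_edges if dst == (entity, column))


print(entity_level(column_edges))
# [('staging.customers', 'dw.daily_sales'), ('staging.orders', 'dw.daily_sales')]
print(sources_of(column_edges, "dw.daily_sales", "total_amount"))
# [('staging.orders', 'amount')]
```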