Hitachi Vantara Lumada and Pentaho Documentation

Lineage

Use the Lumada Data Catalog data lineage tools to help track the relationships between data resources in your data environment, which is especially helpful when you frequently merge and duplicate data. Knowing where the data has come from can help you track down data quality problems, know whether the data can be trusted, or confirm that data from a particular system or region is included. Knowing where the data is going can help you determine who depends on it, and see how the data flows through systems and business processes.

Data Catalog uses resource data along with its metadata to discover cluster resources that are related to each other. It identifies copies of data, merges of resources, and horizontal and vertical subsets of these resources.
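To make the subset terminology concrete, the following minimal sketch treats a table as a list of rows and assumes the usual database meaning of the terms: a horizontal subset keeps some of the rows with the same columns, and a vertical subset keeps the same rows with only some of the columns. The function names are hypothetical and this is not Data Catalog's discovery algorithm, only an illustration of the relationships it looks for.

```python
# Illustrative only: toy tables as lists of dicts, not Data Catalog's algorithm.

def columns(table):
    """Column names of a table (empty set for an empty table)."""
    return set(table[0].keys()) if table else set()

def project(table, cols):
    """Rows of the table restricted to the given columns, as tuples."""
    return [tuple(row[c] for c in cols) for row in table]

def is_horizontal_subset(child, parent):
    """Same columns, and every child row also appears in the parent."""
    if columns(child) != columns(parent):
        return False
    cols = sorted(columns(parent))
    parent_rows = set(project(parent, cols))
    return all(row in parent_rows for row in project(child, cols))

def is_vertical_subset(child, parent):
    """Strictly fewer columns, with the same rows once the parent is projected."""
    cols = sorted(columns(child))
    if not set(cols) < columns(parent):
        return False
    return sorted(project(child, cols)) == sorted(project(parent, cols))

orders = [
    {"id": 1, "region": "EU", "amount": 10},
    {"id": 2, "region": "US", "amount": 20},
]
eu_orders = [orders[0]]                                   # fewer rows
ids_and_regions = [{"id": r["id"], "region": r["region"]} for r in orders]  # fewer columns

print(is_horizontal_subset(eu_orders, orders))
print(is_vertical_subset(ids_and_regions, orders))
```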

Important: Profiled data is a prerequisite for lineage discovery. You must complete profiling for all data resources before you run a lineage discovery job.

For lineage terminology definitions, see Lineage terminology. For information about lineage types, see Types of lineage.

View lineage

When you are viewing a resource in the Data Canvas, you can view the lineage information for a resource by clicking View Lineage in the Data Lineage area of the screen. This opens the Data Lineage page.

The Lineage graph visually traces relationships between resources with overlapping data.

You can use the following tools in the Data Lineage menu to help you analyze the graph and decide whether to accept or reject the traced lineages.

Data Lineage page actions

  • Find in graph

    Type the name of the resource you want to find on the current lineage graph. Suggested best matches are highlighted in the graph.

  • View

    Select the type of lineages to show within the current scope of the graph, from Suggested, Accepted, and Rejected.

  • Resource view/Field view (toggle)

    To show the resource lineage, click the resource icon, or to show the field lineage, click the field icon.

  • Upstream and Downstream

    Click the down arrow to set the hop level in the current graph. Data Catalog can provide up to three (3) lineage hops upstream or downstream. The anchor resource is set at level 0 and is the default lineage graph for any resource for which lineage has not yet been discovered. Use the hop level to validate the authenticity of the data flowing into the anchor resource to determine whether to accept or reject the lineage.

  • Enter Focus/Leave Focus (toggle, available only in Field view)

    Focused mode is selected by default when you select a field and click View Lineage. It displays the lineage graph associated with the field in focus. Click Leave Focus to view all fields and respective lineage graphs associated with the resource of the selected entity.

  • Graph controls

    The graph controls include icons for zoom out, fit and center, and zoom in. Click to resize or reposition the lineage graph.
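The hop level can be pictured as a breadth-first walk outward from the anchor resource, which sits at level 0. The sketch below is a minimal illustration under assumptions: the lineage is given as hypothetical (source, target) pairs, and the function and resource names are invented for the example. It does not describe how Data Catalog stores or queries lineage.

```python
from collections import deque

MAX_HOPS = 3  # the documented limit on hops upstream or downstream

def resources_within_hops(edges, anchor, hops, direction="upstream"):
    """Return {resource: hop_level} for resources reachable from the anchor.

    edges is a list of (source, target) pairs; the anchor is at level 0.
    """
    if not 0 <= hops <= MAX_HOPS:
        raise ValueError(f"hops must be between 0 and {MAX_HOPS}")
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    # Upstream follows edges from target back to source; downstream the reverse.
    neighbours = {}
    for source, target in edges:
        if direction == "upstream":
            neighbours.setdefault(target, []).append(source)
        else:
            neighbours.setdefault(source, []).append(target)

    levels = {anchor: 0}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        if levels[node] == hops:
            continue  # do not expand past the requested hop level
        for nxt in neighbours.get(node, []):
            if nxt not in levels:
                levels[nxt] = levels[node] + 1
                queue.append(nxt)
    return levels

edges = [
    ("archive", "raw"), ("raw", "staged"), ("staged", "clean"),
    ("clean", "report"), ("report", "dashboard"),
]
print(resources_within_hops(edges, "report", hops=2))
print(resources_within_hops(edges, "report", hops=1, direction="downstream"))
```

With a hop level of 0 the result contains only the anchor, which matches the default graph described for a resource whose lineage has not yet been discovered.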

A data lineage graph can contain the following elements:

Graph elements

  • Parent or source resource

    Displayed per hop level. A source with a dotted boundary indicates the presence of an upstream source, while a source with a solid boundary indicates the parent resource.

  • Anchor or target resource

    The resource of interest for which lineage is being examined. It is indicated by a color-filled Resource node with a target icon.

  • Suggested lineage

    Denoted by a dotted line between the source and target Resource nodes via an Operation node.

  • Accepted lineage

    Denoted by a solid line between the source and target Resource nodes via an Operation node.

  • Rejected lineage

    Denoted by a red line and a link icon with a diagonal line through it in the Operation node between the source and target Resource nodes.

    Note: Rejected lineage is visible only when the Rejected checkbox is selected in the View element of the Data Lineage menu bar.

Note: Even if your user role does not have access to all the virtual folders, you can still view the full lineage graph. However, the resource details (Resource Name and Field Names) for the virtual folders that you cannot access are masked and shown as LOCKED.
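The masking behavior can be sketched as a simple filter over graph nodes. This is only an illustration: the node dictionary layout and the function name are assumptions made for the example, and masking each field name individually is one plausible reading of the note rather than a documented detail.

```python
def mask_for_user(node, accessible_folders):
    """Return a copy of the node, masking details the user may not see.

    node is a hypothetical dict with 'resource_name', 'virtual_folder'
    and 'field_names'; accessible_folders is the set of virtual folders
    the user's role can access.
    """
    if node["virtual_folder"] in accessible_folders:
        return dict(node)
    # The node stays in the graph, but its name and field names are hidden.
    return {
        **node,
        "resource_name": "LOCKED",
        "field_names": ["LOCKED"] * len(node["field_names"]),
    }

node = {
    "resource_name": "orders.csv",
    "virtual_folder": "finance",
    "field_names": ["id", "amount"],
}
print(mask_for_user(node, accessible_folders={"sales"}))
```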

Lineage details

You can gain deeper insight into the lineage discovered with the details displayed on the Details pane, which change depending on whether you select the Operations, Resource, or Field node.

Operations node details

You can click the Operation node (the node with a link icon) to display the lineage information on the Details pane. The lineage actions that display vary depending on whether the lineage is accepted, rejected, or suggested, and which lineage actions are selected for viewing by the View setting in the Data Lineage menu bar.

Lineage actions

On the Data Lineage page, you can use the Details pane to accept or reject a suggested lineage or create your user-defined lineage as long as you have permission.

The Steward and Administrator roles have the permission to curate lineage.

  • Accept Lineage

    Accept a Suggested or Rejected lineage.

  • Add Source

    Establish a user-defined (factual) lineage on the operation by specifying the absolute path to a parent resource. When you add factual lineage, all suggested edges for the Operation node automatically become accepted. Factual lineage on the Operation node is then validated: Data Catalog performs path validations and actual metadata/data relation checks on your user-defined lineages on the Operation nodes, and the results are displayed in the Field Mapping details for the added resource.

  • Allow Discovery

    Allow discovery of the resource when a lineage discovery job is run by selecting the resource and clicking Allow Discovery. Allow Discovery is a toggle with Forbid Discovery, and both are set at the resource level.

  • Delete Lineage

    Delete an Accepted or Rejected lineage. Deleted lineages are rediscovered by the next non-incremental Lineage Discovery job and automatically return to the Suggested state.

    Note: To rediscover a previously deleted lineage, you must click Allow Discovery for the resource.

  • Forbid Discovery

    Forbid discovery of a resource when a lineage discovery job is run by selecting the resource and clicking Forbid Discovery. This is useful if you want to ignore backup files, for example. Forbid Discovery is a toggle with Allow Discovery, and both are set at the resource level.

  • Reject Lineage

    Reject a Suggested or Accepted lineage. Unlike deleted lineages, rejected lineages are not rediscovered by the next non-incremental Lineage Discovery job, so they do not automatically return to the Suggested state.

  • View Mapping

    View the field level overlap relationships between the immediate source and target resources involved in that Operation node.

  • Visit Resource

    Go to a lineage view with this resource as the target resource.
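The Accept, Reject, and Delete actions above amount to a small set of state transitions. The following sketch summarizes them as a lookup table. The names are hypothetical, DELETED is an illustrative marker rather than a status shown in the product, and the rediscover step assumes that discovery is allowed for the resource and that a non-incremental Lineage Discovery job is run.

```python
from enum import Enum

class Status(Enum):
    SUGGESTED = "Suggested"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"
    DELETED = "Deleted"  # hypothetical marker for a removed lineage, not a UI status

# (action, current status) -> resulting status, as described in the list above.
TRANSITIONS = {
    ("accept", Status.SUGGESTED): Status.ACCEPTED,
    ("accept", Status.REJECTED): Status.ACCEPTED,
    ("reject", Status.SUGGESTED): Status.REJECTED,
    ("reject", Status.ACCEPTED): Status.REJECTED,
    ("delete", Status.ACCEPTED): Status.DELETED,
    ("delete", Status.REJECTED): Status.DELETED,
    # Only deleted lineages come back as suggestions; rejected ones do not.
    ("rediscover", Status.DELETED): Status.SUGGESTED,
}

def apply_action(action, status):
    """Return the status after an action, or raise if the action does not apply."""
    try:
        return TRANSITIONS[(action, status)]
    except KeyError:
        raise ValueError(f"cannot {action} a {status.value} lineage") from None

status = apply_action("accept", Status.SUGGESTED)
status = apply_action("delete", status)
status = apply_action("rediscover", status)
print(status.value)
```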

Rejecting or deleting edges on the Operations node

Lumada Data Catalog does not support lineage actions for independent edges. Any action on an edge will be performed on the Operation node. Exercise caution when adding factual sources on an Operation node. These sources cannot be independently rejected or deleted without affecting the other resources associated with the operation.

To remove a factual source from an operation node:

  1. Reject the operation.
  2. Delete the operation.
  3. Re-run lineage discovery to recover any suggested lineages associated with the Operation node that you deleted in step 2.
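The three steps above can be sketched in code to show why a factual source cannot be removed on its own: every action applies to the whole Operation node, so the discovered sources are removed along with the factual one and have to be recovered by rediscovery. The Operation class, the status strings, and the discover_sources callback are all hypothetical stand-ins for this illustration, not part of any Data Catalog API.

```python
from dataclasses import dataclass, field

@dataclass
class Operation:
    target: str
    status: str                                  # "Suggested", "Accepted" or "Rejected"
    sources: dict = field(default_factory=dict)  # source path -> "discovered" or "factual"

def remove_factual_source(operations, target, discover_sources):
    """Reject, delete, then rediscover the operation for a target resource.

    operations maps a target path to its Operation; discover_sources is a
    hypothetical stand-in for a lineage discovery run and returns the
    source paths it finds for the target.
    """
    operation = operations[target]
    operation.status = "Rejected"      # step 1: rejects every edge of the operation
    del operations[target]             # step 2: deletes the whole operation
    rediscovered = Operation(          # step 3: only discovered lineage comes back
        target=target,
        status="Suggested",
        sources={path: "discovered" for path in discover_sources(target)},
    )
    operations[target] = rediscovered
    return rediscovered

operations = {
    "/data/report": Operation(
        target="/data/report",
        status="Accepted",
        sources={"/data/clean": "discovered", "/data/manual": "factual"},
    )
}
result = remove_factual_source(operations, "/data/report", lambda target: ["/data/clean"])
print(result.status, sorted(result.sources))
```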

Additional lineage information

You can find additional information about the lineage on the Details pane. The panels visible depend upon the element selected in the lineage graph.

  • Actions

    Depending on the element selected, you can:

    • Accept Lineage
    • Add Source
    • Allow Discovery
    • Delete Lineage
    • Forbid Discovery
    • Reject Lineage
    • View Mapping
    • Visit Resource
    For more information, see Lineage actions.
  • Allow Discovery

    Allow discovery of the resource when a lineage discovery job is run.

  • Code

    Populated by Data Catalog with a default expression used to identify the relationship for the selected Operation node.

  • Comment

    Notes related to the lineage.

  • Description

    Description of the lineage node.

  • Expression

    Placeholder for a future Data Catalog use.

  • Forbid Discovery

    Forbid discovery of certain resources when a lineage discovery job is run.

  • Glossaries

    Glossaries, if any, related to the lineage element.

  • History

    Displays the creation and last-modified timestamps for the lineage operation.

  • Resource

    Name of the resource.

  • Resource Terms

    Terms, if any, related to the resource. You can click Add Term to assign an existing term to the resource.

Resource node details

On the Data Lineage page, click the Resource node to display the resource-related lineage actions and information on the Details pane.

  • Resource Type & Path

    The Code section identifies the type of resource (file/collection/table and so on) with the absolute path.

  • Actions

    If your user role allows, you can use the following actions depending on whether the resource is a target (anchor) resource, or a source resource:

    • Add Source

      For a target (anchor) resource, you can add and define factual lineage by specifying the absolute path to a parent resource. For a source resource, you can visit the resource.

      Note: Factual lineage added on the Resource node is not validated, because Data Catalog does not perform path validations or actual metadata/data relation checks on user-defined lineages added here. It assumes that the relationship you have established is valid.

  • Resource Terms

    Lists any resource business terms associated with that resource. Users with permission can click Add Term to add a term.

  • Glossaries

    Lists the glossaries of any field business terms associated with or suggested on the fields of that resource.

  • Description

    The plain text description of the resource that is obtained from the Summary tab.

  • History

    Displays the time created and time last modified timestamps for the resource.

Field node details

On the Field node, you can select the target (anchor) resource from the drop-down menu to display the lineage for the field between the source and target resources. The details for the Field node provide the field-related information. If a node is selected, its status is shown (Suggested, Accepted, or Rejected).

  • Actions

    You can take the following actions depending on whether the selected element is a target (anchor) field, a target (anchor) resource, or a source resource:

    • Add Lineage

      For a target (anchor) field, you can manually add lineage relationships with other fields. Select the field, right-click and select Add Lineage, then select the field from the navigation tree or pop-up (you can select multiple fields). Because the lineage is added manually, it is created in the Accepted state.

    • Forbid Discovery or Allow Discovery

      For a target (anchor) field or resource, you can select Forbid Discovery or Allow Discovery (toggle). Forbid Discovery prevents the resource from being discovered in the next lineage job run. Allow Discovery allows the resource to be discovered when the lineage job is run again.

    • Add Source

      For a target (anchor) resource, you can select Add Source and add the source to it.

    • Visit Resource

      For a source resource, you can select Visit Resource, and it takes you to the resource view.

    Note: You can also perform curation actions such as Accept Lineage, Reject Lineage, and Delete Lineage at the field level, in the same way as for resource-level lineage.

  • Description

    The flattened text description associated with a field, as set through the REST API or Hive/JDBC comments.

  • Glossaries

    Lists the glossaries of Field Terms associated with the resource field.

  • History

    Displays the creation and last-modified timestamps for the resource.

Importing lineages

Lumada Data Catalog provides a framework for integration with other applications, from email servers to data cleansing and visualization tools. The integration framework lets you define an action menu option at the resource level that initiates a client-side or server-side operation.

With the help of this framework, you can import the lineages from third-party tools.

Integrating the Atlas third-party tool

In Data Catalog, you can integrate the Apache Atlas third-party tool to import lineages. To import lineages, click Tools on the left navigation menu and click Lineage – Import/Export. Upload a file with lineage information that you want to import. From the drop-down menu, select Import Operations from Atlas.

After you import the operations, you can view the imported lineages by selecting View Lineage from a resource selected in the Data Canvas. The Description displays Atlas Lineage Import. All the imported lineages are in the Accepted status.