Hitachi Vantara Lumada and Pentaho Documentation

Lineage

Use the data lineage tools to track the relationships between data resources in your data lake, which is especially helpful when you frequently merge and duplicate data. Lumada Data Catalog uses resource metadata and resource data to identify cluster resources that are related to each other. It identifies copies of the same data, merges between resources, and horizontal and vertical subsets of these resources. These relationships are the lineages of the resources.

Types of lineage

There are two types of lineage: inferred and factual.

  • Inferred Lineage

    This lineage is deduced by Lumada Data Catalog. It is inferred from metadata and data matches or overlaps between resources, such as timestamp, path, field values, and content.

  • Factual Lineage

    Any lineage that is not inferred is a factual lineage and is further categorized as:

    • Imported Lineage: This lineage is imported from third-party tools.
    • User-Defined Lineage: This lineage is defined by establishing the parent with the Add Parent action and by specifying the path to the parent resource. Data Catalog does not employ path or actual lineage checks on user-defined lineages and merely assumes the relationship the user has established is valid.
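
The classification above can be summarized in a short Python sketch. The `LineageKind` enum and the `is_factual` helper are hypothetical names chosen for illustration; they are not part of any Data Catalog API.

```python
from enum import Enum


class LineageKind(Enum):
    """Illustrative taxonomy of the lineage types described above."""
    INFERRED = "inferred"          # deduced by Data Catalog from metadata and data overlap
    IMPORTED = "imported"          # factual: imported from a third-party tool
    USER_DEFINED = "user-defined"  # factual: established with the Add Parent action


def is_factual(kind: LineageKind) -> bool:
    """Any lineage that is not inferred counts as factual."""
    return kind is not LineageKind.INFERRED
```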

Useful terminology for lineage

Lineage uses the following common terms:

  • Anchor Resource

    The current resource for which the lineage is being examined.

  • Edges

    These are the trace branches between the Operation node and Resource nodes.

  • Factual lineage

    All non-suggested lineage is considered to be factual lineage and is derived from user curations or third-party tools.

  • Field-level lineage

    For an established resource lineage, field-level lineage traces the field relationships between source and target resources. The field-level mapping details the relationships.

  • Inferred lineage

    This lineage is suggested by Data Catalog and is inferred or derived from the metadata and data of data resources across the platform.

  • Multi-hop lineage

    Data Catalog can trace lineage up to five levels upstream (to source) and five levels downstream (to targets).

  • Nodes

    A node is a grouping point on the lineage graph. Data Catalog identifies four types of nodes on the lineage graph:

    • Field node

      A drop-down field selector for the anchor resource to visualize field lineage between source and target resources.

    • Operation node

      This node identifies the lineage type with all the details and offers actionable options of:

      • Accept lineage
      • Reject lineage
      • Add source (for user-defined factual lineages)
    • Resource node

      A visual grouping of the fields that participate in the lineage.

    • System node

      A visual grouping of all resources belonging to the same system. These resources will appear inside the same system box.

  • Resource-level lineage

    This type of lineage traces the relationship between upstream and downstream resources in a data lake.

  • Systems

    The system is the outermost boundary containing a resource. Every resource sits within a system. The system helps lineage users understand the flow of data across the firm.

    Note: The system name is defined in Data Catalog as a property label on a Virtual Folder. It is used to enable common name cluster identification in the next-gen lineage. Refer to Managing Virtual Folders for details. If the system name is not set, it defaults to the Data Source name. A system may contain multiple resources.
  • Target vs Source

    A target is any resource that pulls data from another resource. A source is any resource that feeds data out to a target. In a single-hop lineage, the immediate upstream resource is also known as the parent and the immediate downstream resource as the child.

  • Upstream vs Downstream

    Data lineage represents the flow of data as it comes from a source upstream and flows into a target downstream. Using the anchor resource as a vantage point, the graph shows this flow from left to right.
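
As a rough illustration of the multi-hop and upstream/downstream terms above, the following Python sketch walks a small, made-up lineage graph outward from an anchor resource and stops at five hops. The graph layout, the resource names, and the `invert` and `resources_within_hops` helpers are assumptions for illustration only and do not reflect how Data Catalog stores or traverses lineage.

```python
from collections import deque

MAX_HOPS = 5  # the documented limit: five levels upstream and five downstream

# Hypothetical edges: each key is a source, each value lists its direct targets.
DOWNSTREAM = {
    "raw/orders.csv": ["staging/orders"],
    "staging/orders": ["warehouse/orders"],
    "warehouse/orders": ["reports/daily_orders"],
}


def invert(edges):
    """Build the upstream view (target -> sources) from the downstream view."""
    upstream = {}
    for source, targets in edges.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def resources_within_hops(edges, anchor, max_hops=MAX_HOPS):
    """Breadth-first walk from the anchor; returns {resource: hop level}."""
    levels = {anchor: 0}  # the anchor resource sits at level 0
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        if levels[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(current, []):
            if neighbor not in levels:
                levels[neighbor] = levels[current] + 1
                queue.append(neighbor)
    return levels


anchor = "staging/orders"
downstream_levels = resources_within_hops(DOWNSTREAM, anchor)
upstream_levels = resources_within_hops(invert(DOWNSTREAM), anchor)
```

Running the same walk over the inverted edge map gives the upstream side, so one helper covers both directions from the anchor.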

View lineage

On the Lineage tab, you can view the lineage information for the resource.

The Lineage graph visually traces relationships between resources with overlapping data.

The Lineage navigation bar offers tools to help you analyze the graph and decide whether to accept or reject the traced lineages.

Menu Bar / Actions

  • Find in graph

    Type the name of the resource you want to find on the current lineage graph and select the best match from the list that displays. Suggested best matches are also highlighted in the graph.

  • View

    Select the type of lineages to show within the current scope of the graph, such as accepted, suggested, or rejected.

  • Field lineage (toggle)

    Click the switch to the (on) position to display the field lineage between the upstream and downstream resources. For more details refer to Field Lineage below.

  • Upstream/Downstream

    Click to set the hop level in the current graph. Lumada Data Catalog can provide up to five (5) lineage hops upstream or downstream. The drop-down count is the highest hop level in Data Catalog. The anchor resource is set at level 0 and is the default lineage graph for any resource for which lineage has not yet been discovered. Use the hop level to validate the authenticity of the data flowing into the anchor resource to determine whether to accept or reject the lineage.

  • Graph controls

    Click to resize, magnify, and refocus the lineage graph.

Canvas Elements

  • Parent or Source Resources

    Displayed per hop level. A source with a dotted system boundary indicates the presence of an upstream source, while a source with a solid system boundary indicates the parent resource.

  • Anchor or Target Resources

    The resource of interest for which lineage is being examined. It is indicated by a color-filled Resource node with an anchor icon.

  • Suggested Lineage

    Denoted by a dotted grey line between the source and target Resource nodes via an Operation node.

  • Accepted Lineage

    Denoted by a solid grey line between the source and target Resource nodes via an Operation node.

  • Rejected Lineage

    Denoted by a dotted line and open link between the source and target Resource nodes via an Operation node.

Note: Even if your user role does not have access to all the virtual folders, you can still view the full lineage graph. However, the resource details (Resource Name and Field Names) for the virtual folders will be masked and shown as LOCKED.
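
The masking behavior described in the note can be pictured with a minimal Python sketch. The node dictionary layout, the `mask_node` helper, and the choice to leave the virtual folder name visible are all assumptions made for illustration; the source states only that the Resource Name and Field Names are shown as LOCKED.

```python
LOCKED = "LOCKED"


def mask_node(node, accessible_folders):
    """Return a copy of a graph node, masking details the role may not see.

    `node` is a hypothetical dict with the keys "virtual_folder",
    "resource_name", and "field_names".
    """
    if node["virtual_folder"] in accessible_folders:
        return dict(node)  # the role can see this folder: nothing is masked
    return {
        "virtual_folder": node["virtual_folder"],  # assumption: folder stays visible
        "resource_name": LOCKED,
        "field_names": [LOCKED for _ in node["field_names"]],
    }
```

The node itself stays in the graph either way, which matches the statement that the full lineage graph remains viewable.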

Lineage details

In Lumada Data Catalog, the details displayed on the Recommendations pane of the Details tab offer deeper insight into the lineage discovered, depending on whether you select the Operations or Resource node.

Operations node details

In Lumada Data Catalog, you can click the Operation node (the node with a link icon) to display the lineage information on the Details tab. The lineage actions that are displayed vary depending on whether the lineage is accepted, rejected, or suggested.

Lineage Actions

On the Lineage Actions tab, you can accept or reject a suggested lineage or create your own user-defined lineage if your user role is Steward or Administrator.

The Steward and Administrator roles have the privilege to perform the following lineage actions:

  • Add Operation Source

    Establishes your user-defined factual lineage to the operation when you specify the absolute path to a parent resource. When you add factual lineage, all suggested edges for the Operation node automatically become accepted. Factual Lineage on the Operation node is then validated. Data Catalog performs path validations and actual metadata/data relation checks on your user-defined lineages on the Operation nodes, as displayed in Field Mapping details for the added resource.

  • Accept lineage

    You can accept a Suggested or Rejected lineage.

  • Reject lineage

    You can reject a Suggested or Accepted lineage. Rejected lineages are not rediscovered by the next non-incremental Lineage Discovery job, so they do not automatically return to the Suggested state.

  • Delete lineage

    You can delete only a Rejected lineage. Deleted lineages are rediscovered by the next non-incremental Lineage Discovery job and automatically return to the Suggested state.

  • View mapping

    In addition to identifying the resource lineage, Data Catalog can also provide the field-level overlap relationships between the immediate source and target resources involved in that Operation node.

    The types of field-level lineage identified are:

    • Partial Copy (vertical copy)
    • Copy
    • Union: The child resource generally has two or more parents and is the union of the identified parents.
    • Join
    • Superset (horizontal and vertical copy)
    • Subset (horizontal and vertical copy)
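
The accept, reject, and delete rules above amount to a small state machine. The following Python sketch is one way to picture it; the `State` enum, the `apply_action` and `rediscover` helpers, and the explicit `DELETED` state are hypothetical modeling choices, not Data Catalog identifiers.

```python
from enum import Enum


class State(Enum):
    SUGGESTED = "suggested"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    DELETED = "deleted"  # modeling assumption: tracks a lineage that was removed


# Allowed user actions, keyed by (action, current state).
TRANSITIONS = {
    ("accept", State.SUGGESTED): State.ACCEPTED,
    ("accept", State.REJECTED): State.ACCEPTED,
    ("reject", State.SUGGESTED): State.REJECTED,
    ("reject", State.ACCEPTED): State.REJECTED,
    ("delete", State.REJECTED): State.DELETED,  # only rejected lineages can be deleted
}


def apply_action(action, state):
    """Return the new state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a lineage in state {state.value}") from None


def rediscover(state):
    """Model the next non-incremental Lineage Discovery job.

    Deleted lineages come back as suggested; rejected lineages stay rejected.
    """
    return State.SUGGESTED if state is State.DELETED else state
```

In this model, returning a rejected lineage to the Suggested state takes a delete followed by a rediscovery, for example `rediscover(apply_action("delete", State.REJECTED))`.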

Rejecting or deleting edges on the Operations node

Lumada Data Catalog does not support lineage actions for independent edges. Any action on an edge will be performed on the Operation node. Exercise caution when adding factual sources on an Operation node. These sources cannot be independently rejected or deleted without affecting the other resources associated with the operation.

To remove a factual source from an operation node:

  1. Reject the operation.
  2. Delete the operation.
  3. Re-run lineage discovery to recover any suggested lineages associated with the Operation node that you deleted in step 2.

User annotations

The User Annotations section offers a placeholder for any annotations you want to make regarding the lineages discovered. Edits in this section are not used in any Data Catalog processing; they serve only as a placeholder for your notes.

  • Expression

    Used as a placeholder field for future Data Catalog use.

  • Description

    Used to define a filter that can be further exported to external tools for additional processing.

  • Comment

    Used for personal notes related to the lineage identified.

  • Code

    This field is populated by Data Catalog with a default expression used to identify the relationship for this Operation node. You can either use this as-is or modify it to reflect a more usable formula, like (Source) Partial Copy (Target) on field 3,6,9.

History Facet

In Lumada Data Catalog, the History Facet displays the creation and last-modification timestamps for the lineage operation.

Resource node details

On the Details tab, click the Resource node to display the resource-related lineage actions and information.

  • Resource Type & Path

    The Details tab identifies the type of resource (file/collection/table and so on) with the absolute path.

  • Actions

    If your user role allows, you can use the following actions depending on whether the resource is an anchor, a target resource, or a source resource:

    • Add Source

      For an anchor resource, you can add and define factual lineage by specifying the absolute path to a parent resource. For a source resource, you can visit the resource.

      Note: The factual lineage on the Resource node cannot be validated because Lumada Data Catalog does not perform path validations or actual metadata/data relation checks on your user-defined lineages. It only assumes that the relationship you have established with your user role is valid.

  • Resource Terms

    Lists any resource business terms associated with that resource. Users with permission can click Add Term to add a term.

  • Glossaries

    Lists the glossaries of any field business terms associated with or suggested on the fields of that resource.

  • Description

    The plain text description of the resource that is obtained from the Summary tab.

  • History

    Displays the time created and time last modified timestamps for the resource.

Field node details

On the Field node, you can select a field of the anchor resource from the drop-down menu to display the lineage for that field between the source and target resources. The details for the Field node provide the field-related information.

  • Field Data Type and Path

    The Details tab identifies the Field data type (string, Boolean, integer, or timestamp) from the Lumada Data Catalog discovery engine along with the absolute path to the field.

  • Actions

    You can take the following recommended actions depending on whether the resource is an anchor, target resource, or a source resource:

    • For an anchor resource, no action is offered.
    • For a source resource, select Visit Resource.
  • Status

    Lists the discovered field details like Data type, Selectivity, and Cardinality.

  • Field Term Glossaries

    Lists the glossaries of Field Terms associated with the resource field.

  • Description

    The flattened text description associated with a field, as set through the REST API or Hive/JDBC comments.

  • History

    This facet displays the creation and last-modified timestamps for the resource.

Importing lineages

Lumada Data Catalog provides a framework for integration with other applications, from email servers to data cleansing and visualization tools. The integration framework lets you define an action menu option at the resource level that initiates a client-side or server-side operation.

With the help of this framework, you can import the lineages from third-party tools.

Integrating the Atlas third-party tool

In Data Catalog, you can integrate the Apache Atlas third-party tool to import lineages. To import lineages, drill down to the resource detail view and click the More actions icon. From the drop-down menu, select Import Operations from Atlas.

After you import the operations, you can view the imported lineages on the Lineage tab of the resource detail view. The Description displays Atlas Lineage Import.