Hitachi Vantara Lumada and Pentaho Documentation

Lineage and origins

Use the data lineage tools to track the relationships between data resources in your data lake, which is especially helpful when you frequently merge and duplicate data. Lumada Data Catalog uses resource metadata and resource data to identify cluster resources that are related to each other. It identifies copies of the same data, merges between resources, and horizontal and vertical subsets of these resources. These relationships are the lineages of the resources.

The place where data comes into the cluster is called the origin of that data. Typically, the origin is the data source that the resource belongs to or comes from. Data Catalog propagates the origin information across the lineage relationships to show the origin of the data.

Types of lineage

There are two types of lineages, inferred and factual.

  • Inferred Lineage

    This lineage is deduced by Lumada Data Catalog from matches or overlaps in metadata and data between resources, such as timestamp, path, field values, and content.

  • Factual Lineage

    Any lineage that is not inferred is a factual lineage and is further categorized as:

    • Imported Lineage: This lineage is imported from third-party tools.
    • User-Defined Lineage: This lineage is defined by establishing the parent with the Add Parent action and by specifying the path to the parent resource. Data Catalog does not employ path or actual lineage checks on user-defined lineages and merely assumes the relationship the user has established is valid.

Useful terminology for lineage and origins

Lineage and Origins use the following common terms:

  • Anchor Resource

    The current resource for which the lineage is being examined.

  • Edges

    These are the trace branches between the Operation node and Resource nodes.

  • Factual lineage

    All non-suggested lineage is considered to be factual lineage and is derived from user curations or third-party tools.

  • Field-level lineage

    For an established resource lineage, field-level lineage traces the field relationships between source and target resources. The field-level mapping details the relationships.

  • Inferred lineage

    This lineage is suggested by Lumada Data Catalog and is inferred from the metadata and data of resources across the platform.

  • Multi-hop lineage

    Data Catalog can trace lineage up to five levels upstream (to source) and five levels downstream (to targets).

  • Nodes

    A node is a grouping point on the lineage graph. Data Catalog identifies four types of nodes on the lineage graph:

    • Field node

      A drop-down field selector for the anchor resource to visualize field lineage between source and target resources.

    • Operation node

      This node identifies the lineage type with all the details and offers actionable options of:

      • Accept lineage
      • Reject lineage
      • Add source (for user-defined factual lineages)
    • Resource node

      A visual grouping of the fields that participate in the lineage.

    • System node

      A visual grouping of all resources belonging to the same system. These resources will appear inside the same system box.

  • Resource-level lineage

    This type of lineage traces the relationship between the upstream to downstream resources in a data lake.

  • Systems

    The system is the outermost boundary containing a resource. Every resource sits within a system. The system proves invaluable for lineage users to understand the flow of data across the firm.

    Note: The system name is defined in Data Catalog as a property label on a Virtual Folder. It is used to enable common name cluster identification in the next-gen lineage. Refer to Managing Virtual Folders for details. If the system name is not set, it defaults to the Data Source name. A system may contain multiple resources.
  • Target vs Source

    A target is any resource that is pulling data from another data source. A source is any resource that feeds data out to a target. For single-hop instances, the immediate upstream and downstream resources are also known as the parent and child of the target.

  • Upstream vs Downstream

    Data lineage represents the flow of data as it comes from a source upstream and flows into a target downstream. With the anchor resource as the vantage point, the graph shows this flow from left to right.
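
The multi-hop and upstream/downstream terms above can be illustrated with a small graph walk. This is only a sketch, not Data Catalog code: the function name `resources_within_hops`, the `(source, target)` edge pairs, and the resource names are hypothetical. The only value taken from this page is the five-hop limit, with the anchor resource at level 0.

```python
from collections import deque

MAX_HOPS = 5  # documented limit in each direction; the anchor is level 0


def resources_within_hops(edges, anchor, direction="upstream", max_hops=MAX_HOPS):
    """Return {resource: hop_level} for resources reachable from the anchor.

    edges is an iterable of (source, target) pairs (a hypothetical input shape).
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    if not 0 <= max_hops <= MAX_HOPS:
        raise ValueError(f"max_hops must be between 0 and {MAX_HOPS}")

    # Build the adjacency list in the direction we want to walk.
    adjacency = {}
    for source, target in edges:
        if direction == "upstream":
            adjacency.setdefault(target, []).append(source)
        else:
            adjacency.setdefault(source, []).append(target)

    # Breadth-first search so each resource gets its smallest hop level.
    levels = {anchor: 0}
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        if levels[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(current, []):
            if neighbor not in levels:
                levels[neighbor] = levels[current] + 1
                queue.append(neighbor)
    return levels
```

For example, with edges `[("a", "b"), ("b", "c"), ("c", "d")]` and anchor `"c"`, the upstream walk reaches `b` at hop 1 and `a` at hop 2, while the downstream walk reaches only `d`.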

View lineage

On the Lineage tab, you can view the lineage information for the SRV resource.

The Lineage graph visually traces relationships between resources with overlapping data.

The Lineage navigation bar offers tools to help you analyze the graph and decide whether to accept or reject the traced lineages.

Menu Bar / Actions

  • Find in graph

    Type the name of the resource you want to find on the current lineage graph and select the best match from the list that displays. Suggested best matches are also highlighted in the graph.

  • View

    Select the type of lineages to show within the current scope of the graph, such as accepted, suggested, or rejected.

  • Field lineage (toggle)

    Click the switch to the on position to display the field lineage between the upstream and downstream resources. For more details, refer to Field Lineage below.

  • Upstream/Downstream

    Click to set the hop level in the current graph. Lumada Data Catalog can provide up to five (5) lineage hops upstream or downstream. The drop-down count is the highest hop level in Data Catalog. The anchor resource is set at level 0 and is the default lineage graph for any resource for which lineage has not yet been discovered. Use the hop level to validate the authenticity of the data flowing into the anchor resource to determine whether to accept or reject the lineage.

  • Graph controls

    Click to resize, magnify, and refocus the lineage graph.

Canvas Elements

  • Parent or Source Resources

    Displayed per hop level. A source with a dotted system boundary indicates the presence of an upstream source, while a source with a solid system boundary indicates the parent resource.

  • Anchor or Target Resources

    The resource of interest for which lineage is being examined. It is indicated by a color-filled Resource node with an anchor icon.

  • Suggested Lineage

    Denoted by a dotted grey line between the source and target Resource nodes via an Operation node.

  • Accepted Lineage

    Denoted by a solid grey line between the source and target Resource nodes via an Operation node.

  • Rejected Lineage

    Denoted by a dotted line and open link between the source and target Resource nodes via an Operation node.

Symbols

  • Linked
  • Unlinked
  • Lineage rejected

Note: Even if your user role does not have access to all the virtual folders, you can still view the full lineage graph. However, the resource details (Resource Name and Field Names) for the virtual folders that you cannot access are masked and shown as LOCKED.
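
The masking behavior in the note above can be pictured with a short sketch. The dictionary keys (`virtual_folder`, `resource_name`, `field_names`) and the function name `mask_node` are hypothetical and are not part of any Data Catalog API. Only the LOCKED placeholder and the two masked details come from this page.

```python
LOCKED = "LOCKED"


def mask_node(node, accessible_folders):
    """Return a copy of a graph node, masking details the user may not see.

    node is a dict with hypothetical keys: virtual_folder, resource_name,
    field_names. The node stays in the graph; only its details are hidden.
    """
    if node["virtual_folder"] in accessible_folders:
        return dict(node)
    return {
        **node,
        "resource_name": LOCKED,
        "field_names": [LOCKED] * len(node["field_names"]),
    }
```

The node is still returned for folders the user cannot access, which mirrors the statement that the full lineage graph remains visible.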

Lineage details

In Lumada Data Catalog, the details displayed on the Recommendations pane of the Details tab offer deeper insight into the lineage discovered, depending on whether you select the Operations or Resource node.

Operations node details SRV

In Lumada Data Catalog, you can click the Operation node (the node with a link icon) to display the lineage information on the Details tab. The lineage actions which display vary depending on whether the lineage is an accepted, rejected, or suggested lineage.

Lineage Actions

On the Lineage Actions tab, you can accept or reject a suggested lineage or create your own user-defined lineage if your user role is Steward or Administrator.

The Steward and Administrator roles have the privilege to perform the following lineage actions:

  • Add Operation Source

    Establishes your user-defined factual lineage to the operation when you specify the absolute path to a parent resource. When you add factual lineage, all suggested edges for the Operation node automatically become accepted. Factual Lineage on the Operation node is then validated. Data Catalog performs path validations and actual metadata/data relation checks on your user-defined lineages on the Operation nodes, as displayed in Field Mapping details for the added resource.

  • Accept lineage

    You can accept a Suggested or Rejected lineage.

  • Reject lineage

    You can reject a Suggested or Accepted lineage. A rejected lineage is not rediscovered by the next non-incremental Lineage Discovery job, so it does not automatically return to the Suggested state.

  • Delete lineage

    You can only delete a Rejected lineage. A deleted lineage is rediscovered by the next non-incremental Lineage Discovery job and automatically transitions into the Suggested state.

  • View mapping

    In addition to identifying the resource lineage, Data Catalog can also show the field-level overlap relationships between the immediate source and target resources involved in that Operation node.

    The types of field-level lineage identified are:

    • Partial Copy (vertical copy)
    • Copy
    • Union - generally would have two or more parents and the child resource is the union of the identified parents.
    • Join
    • Superset (horizontal and vertical copy)
    • Subset (horizontal and vertical copy)

Rejecting or deleting edges on the Operations node

Lumada Data Catalog does not support lineage actions for independent edges. Any action on an edge will be performed on the Operation node. Exercise caution when adding factual sources on an Operation node. These sources cannot be independently rejected or deleted without affecting the other resources associated with the operation.

To remove a factual source from an operation node:

  1. Reject the operation.
  2. Delete the operation.
  3. Re-run lineage discovery to recover any suggested lineages associated with the Operation node that you deleted in step 2.
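
The lineage actions and the removal procedure above amount to a small state machine. The sketch below models it under stated assumptions: the state and action names are lowercase labels chosen for illustration, `apply_action`, `rediscover`, and `remove_and_recover` are hypothetical helpers, and rediscovery is simplified to a single function rather than a real Lineage Discovery job.

```python
# Allowed (state, action) pairs, taken from the actions described above.
TRANSITIONS = {
    ("suggested", "accept"): "accepted",
    ("rejected", "accept"): "accepted",
    ("suggested", "reject"): "rejected",
    ("accepted", "reject"): "rejected",
    ("rejected", "delete"): "deleted",  # only a rejected lineage can be deleted
}


def apply_action(state, action):
    """Return the new lineage state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a {state} lineage") from None


def rediscover(state):
    """Model a non-incremental discovery run for one inferred lineage.

    A deleted lineage comes back as suggested; every other state, including
    rejected, is left unchanged.
    """
    return "suggested" if state == "deleted" else state


def remove_and_recover(state="accepted"):
    """Follow the three documented steps: reject, delete, re-run discovery."""
    state = apply_action(state, "reject")
    state = apply_action(state, "delete")
    return rediscover(state)
```

Modeling the transitions as a lookup table keeps the rule that an accepted lineage must be rejected before it can be deleted explicit, which is why the procedure has two steps before rediscovery.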

User annotations

The User Annotations section is a place for any notes you want to make about the discovered lineages. Data Catalog does not use edits in this section for any processing.

  • Expression

    Used as a placeholder field for future Data Catalog use.

  • Description

    Used to define a filter that can be further exported to external tools for additional processing.

  • Comment

    Used for personal notes related to the lineage identified.

  • Code

    This field is populated by Data Catalog with a default expression used to identify the relationship for this Operation node. You can either use this as-is or modify it to reflect a more usable formula, like (Source) Partial Copy (Target) on field 3,6,9.

History Facet

In Lumada Data Catalog, the History facet displays the creation timestamp and the last-modified timestamp of the lineage operation.

Resource node details

On the Details tab, click the Resource node to display the resource-related lineage actions and information.

  • Resource Type & Path

    The Details tab identifies the type of resource (file/collection/table and so on) with the absolute path.

  • Actions

    If your user role allows, you can use the following actions depending on whether the resource is an anchor, a target resource, or a source resource:

    • Add Lineage

      For an anchor resource, you can add and define factual lineage by specifying the absolute path to a parent resource. For a source resource, you can visit the resource.

Note: The factual lineage on the Resource node cannot be validated because Lumada Data Catalog does not perform path validations or actual metadata/data relation checks on your user-defined lineages. It assumes that the relationship you have established is valid.
  • Resource Tags

    Lists any resource tags associated with that resource.

  • Field Tag domains

    Lists the tag domains of any field tags associated with or suggested on the fields of that resource.

  • Description

    The flattened (plain text-only) rich-text description of the resource that is obtained from the Overview tab.

  • History

    This facet displays the creation timestamp and the last-modified timestamp of the resource.

Field lineage curation

You can add, edit, delete, or accept any inferred lineage relationship between two resources at the field level to establish the factual lineage. This process is called field lineage curation. You can use field lineage curation to reestablish the context and origin of fields that may have been lost when that data was copied or joined with other data.

This process is useful for protecting sensitive data. For example, when resources are fully or partially copied, or some fields of Resource A are joined with some fields of Resource B, the column and field names may not have carried over. If one of the participating fields in the copy or join is a sensitive data field, such as an SSN, as you make further copies or joins, the SSN context may be lost, potentially allowing it to be shared as public data. By reestablishing that relationship, you can prevent security breaches of sensitive data.

Impact analysis is another example where, if corrupt data is discovered in a field, you can see how many and which resources are impacted to take corrective action.

Data Catalog provides two ways of curating field-level lineages:

  • Curation of the lineage relationship

    You can use the Field lineage graph to curate the relationship of the nodes, as represented by edges, at the resource level of your data. In this graph, you can trace the lineage for every field in the anchor resource and select the edge of the relationship that you want to curate.

    For example, data for the _c1 field can be traced to multiple resources both upstream and downstream as shown by the Field Mapping cards. Based on the overlap information, you can independently accept or delete this field lineage by clicking on the respective edges.

    You can select additional fields on the anchor resource from the drop-down list on the field node of the anchor resource.

    Some field nodes may have different lineage graphs. Some fields may also appear alone with no source feeding into the anchor resource, which implies that the field is a new field introduced in the anchor resource.

    See Curate the lineage relationship for details.

  • Curation using the Operation node

    To help you make a more informed curation decision, you can use the Operation node, which offers both a graphical visualization and overlap percentages. In this graph, you can trace the lineage for every field in the anchor resource and select the Operation node of the relationship that you want to curate.

    After you select an Operation node on a resource, a curation view opens with details about the field.

    The curation view provides these features and details:

    • Source(s)

      Identifies the number of resources participating in the operation and lists the full path for a maximum of two participating resources. If there are more than two participating resources, you can click the hyperlinked numeral to view each resource. You can accept or reject lineages for these participating resources in bulk from this common drop-down menu. Alternatively, you can accept, reject, or delete individual lineages by selecting the action from the Status drop-down menu.

      Note: You cannot perform a bulk delete action in Data Catalog.
    • + Add source

      Click this button to add a new manual lineage to the Operation node. Refer to Resource node details for information.

    • Source fields

      The fields that are available from all participating resources along with the overlap information. In the Source fields table, sorting is supported on all columns.

    • Target fields

      The fields of the resource you are examining. In the Target fields table, sorting is supported on the Name column.

    • Status

      Discovered field lineages for the selected Target field are grouped at the top of the Source fields list and are marked with the link icon. Use the action drop-down menu to Accept or Delete the association.

      Note: Using Delete on field lineages does not affect the resource lineage. It only removes the lineage suggestion that Data Catalog has discovered. A deleted lineage can be rediscovered when Lineage discovery is triggered again by selecting Allow Discovery on that resource.
    • Src

      Identifies the resource by number, and its path can be derived by referring to the Source(s) list.

    • Field

      The field number in the source resource.

    • Field Name

      The name of the field in the source resource.

    • Overlap %

      Lists the overlap percentages for the Source (S-fit) and Target (T-fit) resources. For example, an Overlap % of 100 indicates an ideal match, such that all values in the target and the source align for that field so the match should be accepted. Field lineages discovered to be weaker matches with a lower Overlap % appear further down in the list.

      Click the % links to display sample values from these resources.

    • Links

      Indicates the number of target fields linked to this source field. Links are common when similar data is repeated in multiple columns of the same resource.

    See Curate using the Operation node for details.
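
To make the Overlap % column more concrete, here is a minimal sketch of one way such percentages could be computed. This page does not define how Data Catalog calculates S-fit and T-fit, so the definitions below are assumptions: S-fit is taken as the share of distinct source values that also appear in the target, and T-fit as the share of distinct target values that also appear in the source. The function names are hypothetical.

```python
def overlap_percentages(source_values, target_values):
    """Return (s_fit, t_fit) as percentages based on distinct values.

    Assumed definitions, not confirmed by the product documentation:
    s_fit = shared distinct values / distinct source values
    t_fit = shared distinct values / distinct target values
    """
    source, target = set(source_values), set(target_values)
    shared = source & target
    s_fit = 100.0 * len(shared) / len(source) if source else 0.0
    t_fit = 100.0 * len(shared) / len(target) if target else 0.0
    return s_fit, t_fit


def rank_candidates(candidates, target_values):
    """Sort candidate source fields so the strongest matches come first.

    candidates maps a field name to its list of values. Rows are
    (field_name, s_fit, t_fit), ordered by t_fit and then s_fit.
    """
    scored = [
        (name, *overlap_percentages(values, target_values))
        for name, values in candidates.items()
    ]
    return sorted(scored, key=lambda row: (row[2], row[1]), reverse=True)
```

Under these assumptions, a source field holding `[1, 2, 3, 4]` and a target field holding `[1, 2]` gives an S-fit of 50 and a T-fit of 100: every target value is found in the source, but only half of the source values reach the target, which is the pattern you would expect for a horizontal subset.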

Curate the lineage relationship

Follow the steps below to curate field lineages using the lineage edge.

Procedure

  1. Navigate to the resource that you want to curate.

  2. Click Lineage.

    The lineage graph of the resource opens.
  3. Set the Field lineage switch to the on position.

    The lineage graph switches to display the field lineage.

  4. Click the down arrow in the anchor resource, and then select the field that you want to curate from the drop-down list that displays.

  5. Click the lineage edge and then click Accept or Delete in the Recommendations pane.

  6. (Optional) You can continue to curate additional fields by selecting the fields and clicking the respective lineage edge and then clicking Accept or Delete in the Recommendations pane.

Results

Curation is complete. Accepted lineages appear as gray edges, rejected lineages appear as red edges, and deleted lineages are removed from the Lineage graph.

Curate using the Operation node

Follow the steps below to curate field lineages using the Operation node.

Procedure

  1. Navigate to the resource you want to curate.

  2. Click Lineage.

    The lineage graph of the resource opens.
  3. Set the Field lineage switch to the off position.

  4. Click the Operation node that you want to examine.

    The Recommendations pane opens.

  5. Click Curate Lineage.

    The curation view opens.

  6. Select a field in the Target fields list.

    Click Add source to add a new manual lineage to this Operation node. See Resource node details for information.

    The Source fields list then shows the corresponding fields from all the participating resources along with the overlap information.
  7. Click Accept, Reject, or Delete in the Status drop-down menu to make individual selections for the resource.

    You can also perform a bulk operation for the sources by clicking the Source(s) drop-down menu and then selecting Accept or Reject. Bulk delete is not supported.

    The row of the accepted field lineage displays in green and the corresponding resource lineage is automatically accepted.

  8. (Optional) You can continue to curate individual fields by selecting additional Target fields and then clicking Accept, Reject, or Delete in the Status drop-down menu.

Results

Curation is complete. Accepted lineages appear as gray edges, rejected lineages appear as red edges, and deleted lineages are removed from the Lineage graph.

Field node details

On the Field node, you can select a field of the anchor resource from the drop-down menu to display the lineage for that field between the source and target resources. The details for the Field node provide the field-related information.

  • Field Data Type and Path

    The Details tab identifies the Field data type (string, Boolean, integer, or timestamp) from the Lumada Data Catalog discovery engine along with the absolute path to the field.

  • Actions

    You can take the following recommended actions depending on whether the resource is an anchor, target resource, or a source resource:

    • For an anchor resource, no action is offered.
    • For a source resource, select Visit Resource.
  • Status

    Lists the discovered field details like Data type, Selectivity, and Cardinality.

  • Field Tag Domains

    Lists the tag domains of Field Tags associated with the resource field.

  • Description

    Refers to the flattened text description associated with a field, as set through the REST API or Hive/JDBC comments.

  • History

    This facet displays the creation timestamp and the last-modified timestamp for the resource.

Multiple origins and origin propagation

The origin identifies the data source to which the resource belongs, even when the resource is part of a derived virtual folder.

Sometimes a resource is a union of subsets of two different resources from two different data sources. When this type of lineage exists, you or someone with administrative privileges must propagate the origin(s) of this lineage to these resources. Once the origins have been propagated, you can view multiple origins for the resource. The multiple origins are derived from the merging of resource subsets across different data sources.
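
As an illustration of how multiple origins can arise, the sketch below propagates origins downstream along lineage edges until nothing changes. It is a conceptual model only: `propagate_origins`, the `(source, target)` edge pairs, and the data source names are hypothetical, and the assumption that a resource keeps its own data source in addition to inherited origins is not confirmed by this page.

```python
def propagate_origins(own_origin, edges):
    """Return {resource: set of origins} after propagating along lineage.

    own_origin maps a resource to the data source it belongs to.
    edges is an iterable of (source, target) lineage pairs.
    Assumption: a target inherits every origin of each of its sources
    and also keeps its own data source.
    """
    edges = list(edges)
    origins = {resource: {source} for resource, source in own_origin.items()}

    # Repeat until a full pass adds nothing new; sets only grow, so this
    # terminates even if the lineage graph contains a cycle.
    changed = True
    while changed:
        changed = False
        for source, target in edges:
            inherited = origins.get(source, set())
            current = origins.setdefault(target, set())
            if not inherited <= current:
                current |= inherited
                changed = True
    return origins
```

For example, if resource `c` is a union of `a` (from one data source) and `b` (from another), `c` ends up with both origins, and so does any resource derived from `c`.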

Importing lineages

Lumada Data Catalog provides a framework for integration with other applications, from email servers to data cleansing and visualization tools. The integration framework lets you define an action menu option at the resource level that initiates a client-side or server-side operation.

With the help of this framework, you can import the lineages from third-party tools.

Integrating third-party tools Atlas and Navigator

In Lumada Data Catalog, you can integrate third-party tools such as Atlas and Navigator to import lineages. To import lineages from either tool, drill down to the resource detail view and click the More actions icon. From the drop-down menu, select either Import Operations from Atlas or Import Operations from Navigator.

After you import the operations, you can view the imported lineages on the Lineage tab of the resource detail view. The Description displays Atlas Lineage Import or Navigator Lineage Import depending on the third-party tool you are using.
