Lineage
Use the data lineage tools to track the relationships between data resources in your data lake, which is especially helpful when you frequently merge and duplicate data. Lumada Data Catalog uses resource metadata and resource data to identify cluster resources that are related to each other. It identifies copies of the same data, merges between resources, and horizontal and vertical subsets of these resources. These relationships are the lineages of the resources.
Types of lineage
There are two types of lineage: inferred and factual.
Inferred Lineage
This lineage is deduced by Lumada Data Catalog from matches or overlaps in metadata and data between resources, such as timestamp, path, field values, and content.
Factual Lineage
Any lineage that is not inferred is a factual lineage and is further categorized as:
- Imported Lineage: This lineage is imported from third-party tools.
- User-Defined Lineage: This lineage is defined by establishing the parent with the Add Parent action and by specifying the path to the parent resource. Data Catalog does not employ path or actual lineage checks on user-defined lineages and merely assumes the relationship the user has established is valid.
Useful terminology for lineage
Lineage uses the following common terms:
Anchor Resource
The current resource for which the lineage is being examined.
Edges
These are the trace branches between the Operation node and Resource nodes.
Factual lineage
All non-suggested lineage is considered to be factual lineage and is derived from user curations or third-party tools.
Field-level lineage
For an established resource lineage, field-level lineage traces the field relationships between source and target resources. The field-level mapping details the relationships.
Inferred lineage
This lineage is suggested by Data Catalog and is inferred or derived from the metadata and data of resources across the platform.
Multi-hop lineage
Data Catalog can trace lineage up to five levels upstream (to source) and five levels downstream (to targets).
Nodes
A node is a grouping point on the lineage graph. Data Catalog identifies four types of nodes on the lineage graph:
Field node
A drop-down field selector for the anchor resource to visualize field lineage between source and target resources.
Operation node
This node identifies the lineage type with all the details and offers actionable options of:
- Accept lineage
- Reject lineage
- Add source (for user-defined factual lineages)
Resource node
A visual grouping of the fields that participate in the lineage.
System node
A visual grouping of all resources belonging to the same system. These resources will appear inside the same system box.
Resource-level lineage
This type of lineage traces the relationship between upstream and downstream resources in a data lake.
Systems
The system is the outermost boundary containing a resource. Every resource sits within a system. Systems help lineage users understand how data flows across the organization.
Note: The system name is defined in Data Catalog as a property label on a Virtual Folder. It is used to enable common name cluster identification in the next-gen lineage. Refer to Managing Virtual Folders for details. If the system name is not set, it defaults to the Data Source name. A system may contain multiple resources.
Target vs Source
A target is any resource that is pulling data from another data source. A source is any resource that feeds data out to a target. For single-hop instances, the immediate upstream and downstream resources are also known as the parent and child of the target.
Upstream vs Downstream
Data lineage represents the flow of data as it comes from a source upstream and flows into a target downstream. With the anchor resource as the vantage point, the graph shows this flow from left to right.
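The multi-hop, upstream, and downstream terms above can be pictured as a bounded breadth-first walk from the anchor resource. The following is a minimal sketch, not Data Catalog's implementation: the `trace` function, the flat list of `(source, target)` pairs, and the resource names are all hypothetical and used only for illustration. The five-hop limit and the anchor sitting at level 0 come from this document.

```python
from collections import deque

MAX_HOPS = 5  # hop limit stated in this document


def trace(edges, anchor, direction="upstream", hops=MAX_HOPS):
    """Return {resource: hop_level} for resources within `hops` of the anchor.

    `edges` is a list of (source, target) pairs. This flat structure is an
    assumption for illustration, not Data Catalog's storage model.
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    if not 0 <= hops <= MAX_HOPS:
        raise ValueError(f"hops must be between 0 and {MAX_HOPS}")

    # Build an adjacency map that points in the direction we want to walk.
    neighbors = {}
    for source, target in edges:
        if direction == "upstream":
            neighbors.setdefault(target, []).append(source)
        else:
            neighbors.setdefault(source, []).append(target)

    levels = {anchor: 0}  # the anchor resource is always level 0
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        if levels[current] == hops:
            continue  # do not walk past the requested hop level
        for nxt in neighbors.get(current, []):
            if nxt not in levels:
                levels[nxt] = levels[current] + 1
                queue.append(nxt)
    return levels


# Hypothetical resources: raw -> staged -> clean -> report / export
edges = [("raw", "staged"), ("staged", "clean"),
         ("clean", "report"), ("clean", "export")]
upstream = trace(edges, "clean", "upstream")
downstream = trace(edges, "clean", "downstream", hops=1)
```

With `clean` as the anchor, `upstream` holds the sources to its left on the graph and `downstream` holds the targets to its right, each tagged with its hop level.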
View lineage
On the Lineage tab, you can view the lineage information for the resource.
The Lineage graph visually traces relationships between resources with overlapping data.
The Lineage navigation bar offers tools to help you analyze the graph and decide whether to accept or reject the traced lineages.
Menu Bar / Actions | |
Label | Activity |
Find in graph | Type the name of the resource you want to find on the current lineage graph and select the best match from the list that displays. Suggested best matches are also highlighted in the graph. |
View | Select the type of lineages to show within the current scope of the graph, such as accepted, suggested, or rejected. |
Field lineage (toggle) | Click the switch to the (on) position to display the field lineage between the upstream and downstream resources. For more details refer to Field Lineage below. |
Upstream/Downstream | Click to set the hop level in the current graph. Lumada Data Catalog can provide up to five (5) lineage hops upstream or downstream. The drop-down count is the highest hop level in Data Catalog. The anchor resource is set at level 0 and is the default lineage graph for any resource for which lineage has not yet been discovered. Use the hop level to validate the authenticity of the data flowing into the anchor resource to determine whether to accept or reject the lineage. |
Graph controls | Click to resize, magnify, and refocus the lineage graph. |
Canvas Elements | |
Label | Description |
Parent or Source Resources | Displayed per hop level. A source with a dotted system boundary indicates the presence of a further upstream source, while a source with a solid system boundary indicates the parent resource. |
Anchor or Target Resources | The resource of interest for which lineage is being examined. It is indicated by a color-filled Resource node with an anchor icon. |
Suggested Lineage | Denoted by a dotted grey line between the source and target Resource nodes via an Operation node. |
Accepted Lineage | Denoted by a solid grey line between the source and target Resource nodes via an Operation node. |
Rejected Lineage | Denoted by a dotted line and open link between the source and target Resource nodes via an Operation node. |
Lineage details
In Lumada Data Catalog, the details displayed on the Recommendations pane of the Details tab offer deeper insight into the lineage discovered, depending on whether you select the Operations or Resource node.
Operations node details
In Lumada Data Catalog, you can click the Operation node (the node with a link icon) to display the lineage information on the Details tab. The lineage actions which display vary depending on whether the lineage is an accepted, rejected, or suggested lineage.
Lineage Actions
On the Lineage Actions tab, you can accept or reject a suggested lineage or create your own user-defined lineage, provided your user role is Steward or Administrator.
These roles can perform the following lineage actions:
Add Operation Source
Establishes your user-defined factual lineage to the operation when you specify the absolute path to a parent resource. When you add factual lineage, all suggested edges for the Operation node automatically become accepted. Factual lineage on the Operation node is not validated. Data Catalog does not perform path validations or actual metadata/data relation checks on your user-defined lineages on the Operation nodes, as displayed in Field Mapping details for the added resource.
Accept lineage
You can accept a Suggested or Rejected lineage.
Reject lineage
You can reject a Suggested or Accepted lineage. Rejected lineages are not rediscovered by the next non-incremental Lineage Discovery job, so they do not automatically return to the Suggested state.
Delete lineage
You can delete only a Rejected lineage. Deleted lineages are rediscovered by the next non-incremental Lineage Discovery job and automatically return to the Suggested state.
View mapping
In addition to identifying the resource lineage, Data Catalog can show the field-level overlap relationships between the immediate source and target resources involved in that Operation node.
The types of field-level lineage identified are:
- Partial Copy (vertical copy)
- Copy
- Union: generally has two or more parents, and the child resource is the union of the identified parents.
- Join
- Superset (horizontal and vertical copy)
- Subset (horizontal and vertical copy)
Rejecting or deleting edges on the Operations node
Lumada Data Catalog does not support lineage actions for independent edges. Any action on an edge will be performed on the Operation node. Exercise caution when adding factual sources on an Operation node. These sources cannot be independently rejected or deleted without affecting the other resources associated with the operation.
To remove a factual source from an Operation node:
1. Reject the operation.
2. Delete the operation.
3. Re-run lineage discovery to recover any suggested lineages associated with the Operation node that you deleted in step 2.
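The accept, reject, and delete rules described in this section, together with the removal steps above, can be summarized as a small state table. This is an illustrative sketch only: the `State` enum, the `TRANSITIONS` table, and the `apply` function are hypothetical names, and the `DELETED` state is a modeling convenience for a lineage that has been removed and is waiting to be rediscovered. It is not how Data Catalog stores lineage.

```python
from enum import Enum


class State(Enum):
    SUGGESTED = "suggested"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    DELETED = "deleted"  # modeling assumption: removed, awaiting rediscovery


# Allowed transitions, taken from the action descriptions in this section.
TRANSITIONS = {
    ("accept", State.SUGGESTED): State.ACCEPTED,
    ("accept", State.REJECTED): State.ACCEPTED,
    ("reject", State.SUGGESTED): State.REJECTED,
    ("reject", State.ACCEPTED): State.REJECTED,
    ("delete", State.REJECTED): State.DELETED,
    # A non-incremental discovery job leaves rejected lineages untouched
    # and brings deleted lineages back as suggestions.
    ("rediscover", State.REJECTED): State.REJECTED,
    ("rediscover", State.DELETED): State.SUGGESTED,
}


def apply(action, state):
    """Return the next state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a {state.value} lineage") from None


# The removal procedure: reject, delete, then re-run discovery.
state = State.ACCEPTED
for action in ("reject", "delete", "rediscover"):
    state = apply(action, state)
```

Because actions apply to the whole Operation node rather than to a single edge, `state` here stands for the operation. After the three steps, any lineage that discovery can infer is back in the suggested state.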
User annotations
The User Annotations section offers a place for any annotations you want to make regarding the lineages discovered. Edits in this section are not used in any Data Catalog processing; they serve only as a place for your notes.
Expression
Used as a placeholder field for future Data Catalog use.
Description
Used to define a filter that can be further exported to external tools for additional processing.
Comment
Used for personal notes related to the lineage identified.
Code
This field is populated by Data Catalog with a default expression used to identify the relationship for this Operation node. You can either use this as-is or modify it to reflect a more usable formula, like (Source) Partial Copy (Target) on field 3,6,9.
History Facet
In Lumada Data Catalog, the History facet displays the creation and last-modification timestamps for the lineage operation.
Resource node details
Click the Resource node to display the resource-related lineage actions and information on the Details tab.
Resource Type & Path
The Details tab identifies the type of resource (file/collection/table and so on) with the absolute path.
Actions
If your user role allows, you can use the following actions depending on whether the resource is an anchor, a target resource, or a source resource:
Add Source
For an anchor resource, you can add and define factual lineage by specifying the absolute path to a parent resource. For a source resource, you can visit the resource.
Resource Terms
Lists any resource business terms associated with that resource. Users with permission can click Add Term to add a term.
Glossaries
Lists the glossaries of any field business terms associated with or suggested on the fields of that resource.
Description
The plain text description of the resource that is obtained from the Summary tab.
History
Displays the time created and time last modified timestamps for the resource.
Field node details
On the Field node, you can select the anchor resource from the drop-down menu to display the lineage for the field between the source and target resources. The details for the Field node provide the field-related information.
Field Data Type and Path
The Details tab identifies the Field data type (string, Boolean, integer, or timestamp) from the Lumada Data Catalog discovery engine along with the absolute path to the field.
Actions
You can take the following recommended actions depending on whether the resource is an anchor, target resource, or a source resource:
- For an anchor resource, no action is offered.
- For a source resource, select Visit Resource.
Status
Lists the discovered field details like Data type, Selectivity, and Cardinality.
Field Term Glossaries
Lists the glossaries of Field Terms associated with the resource field.
Description
Refers to the flattened text description associated with a field, as set through the REST API or Hive/JDBC comments.
History
This facet displays the creation and last-modified timestamps for the resource.
Importing lineages
Lumada Data Catalog provides a framework for integrating with other applications, from email servers to data cleansing and visualization tools. The integration framework lets you define an action menu option at the resource level that initiates a client-side or server-side operation.
With the help of this framework, you can import the lineages from third-party tools.
Integrating the Atlas third-party tool
In Data Catalog, you can integrate the Apache Atlas third-party tool to import lineages. To import lineages, drill down to the resource detail view and click the More actions icon. From the drop-down menu, select Import Operations from Atlas.
After you import the operations, you can view the imported lineages on the Lineage tab of the resource detail view. The Description displays Atlas Lineage Import.