With resources in a data lake constantly changing and updating, and data frequently being merged or duplicated, it becomes important to keep track of the relationships between these data resources. Lumada Data Catalog uses resource metadata and resource data to identify cluster resources that are related to each other. It identifies copies of the same data, merges between resources, and horizontal and vertical subsets of these resources. These relationships are the lineages of the resources.
The places where data comes into the cluster are called the origins of that data. Typically, the origin is the data source belonging to or coming from the resource. Data Catalog propagates the origin information across the lineage relationships to show the origin of the data.
Types of lineage
There are two types of lineages, inferred and factual.
Inferred lineage is deduced by Lumada Data Catalog from metadata and data matches or overlaps between resources, such as timestamp, path, field values, and content.
Any lineage that is not inferred is a factual lineage and is further categorized as:
- Imported Lineage: This lineage is imported from third-party tools.
- User-Defined Lineage: This lineage is defined by establishing the parent with the Add Parent action and by specifying the path to the parent resource. Data Catalog does not employ path or actual lineage checks on user-defined lineages and merely assumes the relationship the user has established is valid.
Useful terminology for lineage and origins
Lineage and Origins use the following common terms:
The anchor is the current resource for which the lineage is being examined.
Edges are the trace branches between the Operation node and Resource nodes.
All non-suggested lineage is considered to be factual lineage and is derived from user curations or third-party tools.
For an established resource lineage, field-level lineage traces the field relationships between source and target resources. The field-level mapping details the relationships.
Inferred lineage is suggested by Lumada Data Catalog. It is inferred or derived from the metadata and data of data resources across the platform.
Data Catalog can trace lineage up to five levels upstream (to source) and five levels downstream (to targets).
A node is a grouping point on the lineage graph. Data Catalog identifies four types of nodes on the lineage graph:
The Field node is a drop-down field selector for the anchor resource to visualize field lineage between source and target resources.
The Operation node identifies the lineage type with all the details and offers the following actions:
- Accept lineage
- Reject lineage
- Add source (for user-defined factual lineages)
The Resource node is a visual grouping of the fields that participate in the lineage.
The System node is a visual grouping of all resources belonging to the same system. These resources appear inside the same system box.
Resource lineage traces the relationship between upstream and downstream resources in a data lake.
The system is the outermost boundary containing a resource. Every resource sits within a system. The system proves invaluable for lineage users to understand the flow of data across the firm.
Note: The system name is defined in Data Catalog as a property label on a Virtual Folder. It is used to enable common name cluster identification in the next-gen lineage. Refer to Managing Virtual Folders for details. If the system name is not set, it defaults to the Data Source name. Note that a system may contain multiple resources.
Target vs Source
A target is any resource that is pulling data from another data source. A source is any resource that feeds data out to a target. For single-hop instances, the immediate upstream and downstream resources are also known as the parent and child of the target.
Upstream vs Downstream
Data lineage represents the flow of data as it comes from a source upstream and flows into a target downstream. With the anchor resource as the vantage point, the lineage graph shows this flow from left to right.
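The hop-limited tracing described in this section can be pictured as a breadth-first walk over lineage edges that starts at the anchor (level 0) and stops at the hop limit. The following is a minimal sketch, not Data Catalog's implementation: the `trace` function, the edge list, and the resource paths are hypothetical, and only the five-hop limit comes from the text.

```python
from collections import deque

MAX_HOPS = 5  # limit stated in the text; everything else here is illustrative


def trace(edges, anchor, direction="upstream", max_hops=MAX_HOPS):
    """Return {resource: hop_level} for resources within max_hops of the anchor.

    edges is a list of (source, target) pairs; the anchor sits at level 0.
    """
    # Build adjacency in the direction we want to walk.
    neighbors = {}
    for source, target in edges:
        if direction == "upstream":
            neighbors.setdefault(target, []).append(source)
        else:
            neighbors.setdefault(source, []).append(target)

    levels = {anchor: 0}
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        if levels[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(current, []):
            if nxt not in levels:
                levels[nxt] = levels[current] + 1
                queue.append(nxt)
    return levels


# Hypothetical chain of eight resources: /data/r0 feeds /data/r1, and so on.
edges = [(f"/data/r{i}", f"/data/r{i + 1}") for i in range(7)]
upstream = trace(edges, "/data/r7", "upstream")
downstream = trace(edges, "/data/r0", "downstream")
```

In this chain, resources more than five hops away from the anchor are left out of the result, which mirrors how the graph stops at the selected hop level.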
On the Lineage tab, you can view the lineage information for the resource.
The Lineage graph visually traces relationships between resources with overlapping data.
The Lineage navigation bar offers tools to help you analyze the graph and decide whether to accept or reject the traced lineages.
|Menu Bar / Actions|Description|
|---|---|
|Find in graph|Type the name of the resource you want to find on the current lineage graph and select the best match from the list that displays. Suggested best matches are also highlighted in the graph.|
|View|Select the type of lineages to show within the current scope of the graph, such as accepted, suggested, or rejected.|
|Field lineage (toggle)|Click the switch to the on position to display the field lineage between the upstream and downstream resources. For more details, see Field lineage curation.|
|Upstream/Downstream|Click to set the hop level in the current graph. Lumada Data Catalog can provide up to five (5) lineage hops upstream or downstream. The drop-down count is the highest hop level in Data Catalog. The anchor resource is set at level 0 and is the default lineage graph for any resource for which lineage has not yet been discovered. Use the hop level to validate the authenticity of the data flowing into the anchor resource to determine whether to accept or reject the lineage.|
|Graph controls|Click to resize, magnify, and refocus the lineage graph.|
|Parent or Source Resources|Displayed per hop level. A source with a dotted system boundary indicates the presence of an upstream source, while a source with a solid system boundary indicates the parent resource.|
|Anchor or Target Resources|The resource of interest for which lineage is being examined. It is indicated by a color-filled Resource node with an anchor icon.|
|Suggested Lineage|Denoted by a dotted grey line between the source and target Resource nodes via an Operation node.|
|Accepted Lineage|Denoted by a solid grey line between the source and target Resource nodes via an Operation node.|
|Rejected Lineage|Denoted by a dotted line and open link between the source and target Resource nodes via an Operation node.|
In Lumada Data Catalog, the details displayed on the Recommendations pane of the Details tab offer deeper insight into the lineage discovered, depending on whether you select the Operations or Resource node.
Operations node details
In Lumada Data Catalog, you can click the Operation node (the node with a link icon) to display the lineage information on the Details tab. The lineage actions that are displayed vary depending on whether the lineage is accepted, rejected, or suggested.
On the Lineage Actions tab, you can accept or reject a suggested lineage or create your own user-defined lineage, provided that your user role is Steward or Administrator.
The Steward and Administrator roles have the privilege to perform the following lineage actions:
Add Operation Source
Establishes your user-defined factual lineage to the operation when you specify the absolute path to a parent resource. When you add factual lineage, all suggested edges for the Operation node automatically become accepted. Factual Lineage on the Operation node is then validated. Data Catalog performs path validations and actual metadata/data relation checks on your user-defined lineages on the Operation nodes, as displayed in Field Mapping details for the added resource.
You can accept a Suggested or Rejected lineage.
You can reject a Suggested or Accepted lineage. A rejected lineage is not rediscovered by the next non-incremental Lineage Discovery job, so it does not automatically return to the Suggested state.
You can delete only a Rejected lineage. A deleted lineage is rediscovered by the next non-incremental Lineage Discovery job and automatically returns to the Suggested state.
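The accept, reject, and delete rules above amount to a small state machine. The sketch below encodes only the transitions stated in the text; the `State` enum, the `TRANSITIONS` table, and the `apply_action` function are hypothetical names used for illustration and are not part of any Data Catalog API.

```python
from enum import Enum


class State(Enum):
    SUGGESTED = "suggested"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    DELETED = "deleted"


# Allowed transitions as described in the text: action -> {from_state: to_state}
TRANSITIONS = {
    "accept": {State.SUGGESTED: State.ACCEPTED, State.REJECTED: State.ACCEPTED},
    "reject": {State.SUGGESTED: State.REJECTED, State.ACCEPTED: State.REJECTED},
    "delete": {State.REJECTED: State.DELETED},
    # A non-incremental discovery job brings a deleted lineage back as suggested.
    # Rejected lineages are deliberately absent: they are not rediscovered.
    "rediscover": {State.DELETED: State.SUGGESTED},
}


def apply_action(state, action):
    """Return the new state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[action][state]
    except KeyError:
        raise ValueError(f"cannot {action} a lineage in state {state.value}") from None


# Walk a suggested lineage through reject, delete, and rediscovery.
state = State.SUGGESTED
for action in ("reject", "delete", "rediscover"):
    state = apply_action(state, action)
```

Keeping the rules in one table makes the asymmetry easy to see: an accepted lineage must be rejected before it can be deleted, and only a deleted lineage returns to the suggested state on rediscovery.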
In addition to identifying the resource lineage, Data Catalog is also able to provide you with the field level overlap relationships between the immediate source and target resources involved in that operation node.
The types of field-level lineage identified are:
- Partial Copy (vertical copy)
- Union: generally has two or more parents, and the child resource is the union of the identified parents.
- Superset (horizontal and vertical copy)
- Subset (horizontal and vertical copy)
Rejecting or deleting edges on the Operations node
Lumada Data Catalog does not support lineage actions for independent edges. Any action on an edge will be performed on the Operation node. Exercise caution when adding factual sources on an Operation node. These sources cannot be independently rejected or deleted without affecting the other resources associated with the operation.
To remove a factual source from an Operation node:
1. Reject the operation.
2. Delete the operation.
3. Re-run lineage discovery to recover any suggested lineages associated with the Operation node that you deleted in step 2.
The User Annotations section offers a place for any annotations you want to make regarding the lineages discovered. Data Catalog does not use edits in this section for any processing; the section serves only as a placeholder for your notes.
Used as a placeholder field for future Data Catalog use.
Used to define a filter that can be further exported to external tools for additional processing.
Used for personal notes related to the lineage identified.
This field is populated by Data Catalog with a default expression used to identify the relationship for this Operation node. You can either use this as-is or modify it to reflect a more usable formula, like (Source) Partial Copy (Target) on field 3,6,9.
In Lumada Data Catalog, the History facet displays the creation and last-modification timestamps for the lineage operation in chronological order.
Resource node details
On the Details tab, click the Resource node to display the resource-related lineage actions and information.
Resource Type & Path
The Details tab identifies the type of resource (file/collection/table and so on) with the absolute path.
If your user role allows, you can use the following actions depending on whether the resource is an anchor, a target resource, or a source resource:
For an anchor resource, you can add and define factual lineage by specifying the absolute path to a parent resource. For a source resource, you can visit the resource.
Lists any resource tags associated with that resource.
Field Tag domains
Lists the tag domains of any field tags associated with or suggested on the fields of that resource.
The flattened (plain text-only) rich-text description of the resource that is obtained from the Overview tab.
This facet displays the creation and last-modified timestamps for the resource in chronological order.
Field lineage curation
You can add, edit, delete, or accept any inferred lineage relationship between two resources at the field level to establish the factual lineage. This process is called field lineage curation. You can use field lineage curation to reestablish the context and origin of fields that may have been lost when that data was copied or joined with other data.
This process is useful for protecting sensitive data. For example, when resources are fully or partially copied, or some fields of Resource A are joined with some fields of Resource B, the column and field names may not have carried over. If one of the participating fields in the copy or join is a sensitive data field, such as an SSN, as you make further copies or joins, the SSN context may be lost, potentially allowing it to be shared as public data. By reestablishing that relationship, you can prevent security breaches of sensitive data.
Impact analysis is another use. If corrupt data is discovered in a field, you can see how many and which resources are affected, and then take corrective action.
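The impact analysis described above can be pictured as following field-level lineage edges downstream from the affected field and grouping what is reached by resource. This is a minimal sketch under assumed data structures: the `impacted_resources` function, the edge format, and the example paths and field names are hypothetical and do not reflect how Data Catalog stores lineage.

```python
from collections import defaultdict


def impacted_resources(field_edges, resource, field):
    """Return {resource: set of fields} reachable downstream of one field.

    field_edges is an iterable of
    ((source_resource, source_field), (target_resource, target_field)) pairs.
    """
    downstream = defaultdict(list)
    for src, tgt in field_edges:
        downstream[src].append(tgt)

    # Depth-first walk over field-level edges, starting at the affected field.
    seen = set()
    stack = [(resource, field)]
    while stack:
        node = stack.pop()
        for nxt in downstream[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)

    # Group the reached fields by the resource that contains them.
    impact = defaultdict(set)
    for res, fld in seen:
        impact[res].add(fld)
    return dict(impact)


# Hypothetical field lineage: an SSN column copied twice under new names.
field_edges = [
    (("/raw/users", "ssn"), ("/stage/users", "_c1")),
    (("/stage/users", "_c1"), ("/mart/report", "id")),
    (("/raw/users", "name"), ("/stage/users", "_c0")),
]
impact = impacted_resources(field_edges, "/raw/users", "ssn")
```

The number of keys in the result answers "how many resources", and the keys and field sets answer "which ones".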
Data Catalog provides two ways of curating field-level lineages:
Curation of the lineage relationship
You can use the Field lineage graph to curate the relationship of the nodes, as represented by edges, at the resource level of your data. In this graph, you can trace the lineage for every field in the anchor resource and select the edge of the relationship that you want to curate.
For example, data for the _c1 field can be traced to multiple resources both upstream and downstream as shown by the Field Mapping cards. Based on the overlap information, you can independently accept or delete this field lineage by clicking on the respective edges.
You can select additional fields on the anchor resource from the drop-down list on the field node of the anchor resource.
Some field nodes may have different lineage graphs. Some fields may also appear alone with no source feeding into the anchor resource, which implies that the field is a new field introduced in the anchor resource.
See Curate the lineage relationship for details.
Curation using the Operation node
To help you make a more informed curation decision, you can use the Operation node, which offers both a graphical visualization and overlap percentages. In this graph, you can trace the lineage for every field in the anchor resource and select the Operation node of the relationship that you want to curate.
After you select an Operation node on a resource, a curation view opens with details about the field.
The curation view provides these features and details:
Identifies the number of resources participating in the operation and lists the full path for a maximum of two participating resources. If there are more than two participating resources, you can click the hyperlinked numeral to view each resource. You can accept or reject lineages for these participating resources in bulk from this common drop-down menu. Alternatively, you can accept, reject, or delete individual lineages by selecting the action from the Status drop-down menu.
Note: You cannot perform a bulk delete action in Data Catalog.
+ Add source
Click this button to add a new manual lineage to the Operation node. Refer to Resource node details for information.
The fields that are available from all participating resources along with the overlap information. In the Source fields table, sorting is supported on all columns.
The fields of the resource you are examining. In the Target fields table, sorting is supported on the Name column.
Discovered field lineages for the selected Target field are grouped at the top of the Source fields list with the link icon on the action drop-down menu that allows you to Accept or Delete the association.
Note: Using Delete on field lineages does not affect the resource lineage, and only removes the lineage suggestion that Data Catalog has discovered. A deleted lineage can be rediscovered when Lineage discovery is triggered again by selecting Allow Discovery on that resource.
Identifies the resource by number, and its path can be derived by referring to the Source(s) list.
The field number in the source resource.
The name of the field in the source resource.
Lists the overlap percentages for the Source (S-fit) and Target (T-fit) resources. For example, an Overlap % of 100 indicates an ideal match, such that all values in the target and the source align for that field so the match should be accepted. Field lineages discovered to be weaker matches with a lower Overlap % appear further down in the list.
Click the % links to display sample values from these resources.
Indicates the number of target fields linked to this source field. Links are common when similar data is repeated in multiple columns of the same resource.
See Curate using the Operation node for details.
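The text does not define how the S-fit and T-fit percentages are calculated. As one plausible reading, the sketch below treats S-fit as the share of distinct source values that also appear in the target field, and T-fit as the share of distinct target values that also appear in the source field, then lists stronger matches first. The formulas, function names, and sample values are assumptions made for illustration only.

```python
def overlap_percentages(source_values, target_values):
    """Return (s_fit, t_fit) as percentages of distinct values that overlap.

    Assumed definitions, not taken from Data Catalog:
      s_fit = shared distinct values / distinct source values
      t_fit = shared distinct values / distinct target values
    """
    source, target = set(source_values), set(target_values)
    shared = source & target
    s_fit = 100.0 * len(shared) / len(source) if source else 0.0
    t_fit = 100.0 * len(shared) / len(target) if target else 0.0
    return s_fit, t_fit


def rank_candidates(candidates, target_values):
    """Sort candidate source fields so that stronger matches come first.

    candidates maps a source field name to its list of values.
    Each result row is (field_name, s_fit, t_fit).
    """
    scored = [
        (name, *overlap_percentages(values, target_values))
        for name, values in candidates.items()
    ]
    # Order by T-fit first, then S-fit, both descending.
    return sorted(scored, key=lambda row: (row[2], row[1]), reverse=True)


# Hypothetical sample values for one target field and two candidate sources.
target_field = [1, 2]
candidates = {"unrelated": [9], "partial_copy_parent": [1, 2, 3, 4]}
ranking = rank_candidates(candidates, target_field)
```

Under these assumed formulas, a target that is a horizontal subset of its source shows a T-fit of 100 and a lower S-fit, while a full copy shows 100 for both.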
Curate the lineage relationship
Navigate to the resource that you want to curate.
Click Lineage. The lineage graph of the resource opens.
Set the Field lineage switch to the on position. The lineage graph switches to display the field lineage.
Click the down arrow in the anchor resource, and then select the field that you want to curate from the drop-down list that displays.
Click the lineage edge and then click Accept or Delete in the Recommendations pane.
(Optional) You can continue to curate additional fields by selecting the fields and clicking the respective lineage edge and then clicking Accept or Delete in the Recommendations pane.
Curate using the Operation node
Follow the steps below to curate field lineages using the Operation node.
Navigate to the resource you want to curate.
Click Lineage. The lineage graph of the resource opens.
Set the Field lineage switch to the off position.
Click the Operation node that you want to examine. The Recommendations pane opens.
Click Curate Lineage. The menu opens.
Select a field in the Target fields list. The Source fields list then shows the corresponding fields from all the participating resources along with the overlap information. (Optional) Click Add source to add a new manual lineage to this Operation node. See Resource node details for information.
Click Accept, Reject, or Delete in the Status drop-down menu to make individual selections for the resource. You can also perform a bulk operation for the sources by clicking the Source(s) drop-down menu and then selecting Accept or Reject. Bulk delete is not supported. The row of the accepted field lineage displays in green and the corresponding resource lineage is automatically accepted.
(Optional) You can continue to curate individual fields by selecting additional Target fields and then clicking Accept, Reject, or Delete in the Status drop-down menu.
Field node details
On the Field node, you can select the anchor resource from the drop-down menu to display the lineage for the field between the source and target resources. The details for the Field node provide the field-related information.
Field Data Type and Path
The Details tab identifies the Field data type (string, Boolean, integer, or timestamp) from the Lumada Data Catalog discovery engine along with the absolute path to the field.
You can take the following recommended actions depending on whether the resource is an anchor, target resource, or a source resource:
- For an anchor resource, no action is offered.
- For a source resource, select Visit Resource.
Lists the discovered field details like Data type, Selectivity, and Cardinality.
Field Tag Domains
Lists the tag domains of Field Tags associated with the resource field.
Refers to the flattened text description associated with a field as set via REST API or Hive/JDBC comments.
This facet displays the creation and last-modified timestamps for the resource in chronological order.
Multiple origins and origin propagation
The origin identifies the data source of the resource, even when it is part of a derived virtual folder. It identifies the data source to which the resource belongs.
Sometimes a resource can be a union of subsets of two different resources from two different data sources. When this type of lineage exists, you or someone with administrative privileges must propagate the origins of this lineage to these resources. Once the origins have been propagated, you can view multiple origins for the resource. The multiple origins are derived from the merging of resource subsets across different data sources.
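Origin propagation can be pictured as each resource collecting the origins of everything upstream of it along lineage edges. The following is a minimal sketch rather than Data Catalog's algorithm: the `propagate_origins` function, the data source names, and the paths are hypothetical, and the choice to keep a resource's own data source alongside inherited origins is an assumption.

```python
def propagate_origins(own_origin, edges):
    """Return {resource: set of origins} after propagating along lineage edges.

    own_origin maps each resource to the data source it belongs to.
    edges is a list of (source, target) resource pairs.
    Assumption: a resource keeps its own origin and adds those of its ancestors.
    """
    origins = {resource: {origin} for resource, origin in own_origin.items()}

    # Repeat until no set grows; sets only grow, so this always terminates.
    changed = True
    while changed:
        changed = False
        for source, target in edges:
            target_origins = origins.setdefault(target, set())
            before = len(target_origins)
            target_origins |= origins.get(source, set())
            if len(target_origins) != before:
                changed = True
    return origins


# Hypothetical example: a merged resource built from two data sources.
own_origin = {
    "/hr/employees": "HR_DB",
    "/sales/customers": "Sales_DB",
    "/lake/merged": "Lake",
}
edges = [
    ("/hr/employees", "/lake/merged"),
    ("/sales/customers", "/lake/merged"),
]
result = propagate_origins(own_origin, edges)
```

After propagation, the merged resource carries more than one origin, while the two parent resources still show only their own data source.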
Lumada Data Catalog provides a framework for integrating with other applications, from email servers to data cleansing and visualization tools. The integration framework lets you define an action menu option at the resource level that initiates a client-side or server-side operation.
With the help of this framework, you can import the lineages from third-party tools.
Integrating third-party tools Atlas and Navigator
In Lumada Data Catalog, you can integrate third-party tools such as Atlas and Navigator to import lineages. To import lineages from either tool, drill down to the resource detail view and click the More actions icon. From the drop-down menu, select either Import Operations from Atlas or Import Operations from Navigator.
After you import the operations, you can view the imported lineages on the Lineage tab of the resource detail view. The Description displays Atlas Lineage Import or Navigator Lineage Import depending on the third-party tool you are using.