Lineage discovery

Use the Lumada Data Catalog data lineage tools to help track the relationships between data resources in your data environment, which is especially helpful when you frequently merge and duplicate data. Knowing where the data has come from can help you find data quality problems, know whether the data can be trusted, or confirm that data from a particular system or region is included. Knowing where the data is going can help you determine who depends on it, and see how the data flows through systems and business processes.

Data Catalog uses resource data along with its metadata to discover cluster resources that are related to each other. It identifies copies and merges of data, as well as resources that are horizontal or vertical subsets of other resources.

Important: Profiled data is a precondition to lineage discovery. You must complete profiling for all data resources before you run a lineage discovery job.

Lineage terminology

Lineage in Data Catalog uses the following common terms:

Anchor Resource
The current resource for which the lineage is being examined. Can also be referred to as a target resource.
Curation
Any action you take on a lineage suggestion, such as Accept or Reject.
Edges
These are the trace branches between the Operation node and Resource nodes.
Factual lineage
All non-suggested lineage is considered to be factual lineage and is derived from user curations or third-party tools like Atlas. See Imported or factual lineage for more information.

Inferred lineage
A lineage suggested by Data Catalog. It is inferred or derived based on the metadata and data for resources across the platform. See Inferred lineage discovery for more information.
Multi-hop lineage
Data Catalog can trace lineage up to 5 levels upstream (to source) and 5 levels downstream (to targets/anchors).
Nodes
A node is a grouping point on the lineage graph. Data Catalog identifies the following types of nodes on the lineage graph:
- Field node
  A drop-down field selector for the target (anchor) resource to visualize field lineage between source and target resources.
- Operation node
  This node identifies the lineage type with all the details and offers actionable options of:
  - Accept lineage
  - Reject lineage
  - Add source (for user-defined factual lineages)
- Resource node
  A visual grouping of the fields that participate in the lineage.
Resource-level lineage
This type of lineage traces the relationship between upstream and downstream resources in a data environment.
Target and Source
A target is any resource that pulls data from another data source. Can also be referred to as an anchor resource. A source is any resource that feeds data out into a target. For single-hop instances, the immediate upstream and downstream resources are known as parent and child of the target (anchor).
Upstream and Downstream
Data lineage represents the flow of data. Data comes from a source upstream and flows into a target (anchor) downstream. Using the target (anchor) resource as a vantage point, Data Catalog expresses flow in the graph from the left to right of the map.

Common lineage discovery tasks include:

Impact assessment
Discovering downstream data impact, or which target (anchor) resources this data will affect.
Lineage tracing
Discovering upstream data, or which resources the data is coming from.
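
Both tasks amount to walking the lineage graph from the anchor in one direction and stopping at the five-hop limit described under Multi-hop lineage. The following Python sketch illustrates the idea. It assumes a hypothetical in-memory list of (source, target) edges; the function name trace and this edge representation are illustrative only and are not part of Data Catalog.

```python
from collections import deque

MAX_HOPS = 5  # multi-hop limit stated in the terminology list


def trace(edges, anchor, direction="upstream", max_hops=MAX_HOPS):
    """Return {resource: hop_count} for resources reachable from the anchor.

    edges is an iterable of (source, target) pairs. This is a hypothetical
    representation chosen for the sketch, not a Data Catalog structure.
    """
    # Build an adjacency map that points in the direction being walked.
    neighbors = {}
    for source, target in edges:
        if direction == "upstream":
            neighbors.setdefault(target, []).append(source)
        else:
            neighbors.setdefault(source, []).append(target)

    hops = {}
    seen = {anchor}
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                hops[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return hops
```

Calling trace(edges, anchor, "upstream") corresponds to lineage tracing, and trace(edges, anchor, "downstream") corresponds to impact assessment.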

Types of lineage

Lineage in Data Catalog can be inferred or factual, and sometimes is heterogeneous. Data Catalog can import the factual lineage collected by external tools or discover inferred lineage relationships.

Inferred lineage
Data Catalog's Lineage Discovery job discovers inferred lineage relationships among all profiled files and tables. Inferred lineage discovery is the backend process that looks for the potential parent-child relationships between data resources across data sources. See Inferred lineage discovery for more information.
Factual lineage
Any lineage that is not inferred is a factual lineage, and can be further categorized as imported lineage or user-defined lineage. See Imported or factual lineage for more information.
Heterogeneous lineage
Lineage can be formed or discovered between different types of data sources, such as between HDFS resources and S3, Hive, or JDBC-based resources.

Imported or factual lineage

Factual lineage can be categorized as:

Imported lineage: This lineage is imported from third-party tools, such as the Atlas tool.
User-defined lineage: This lineage is defined by establishing the parent with the Add Source action and by specifying the path to the parent resource. Data Catalog does not employ path or actual lineage checks on user-defined lineages and merely assumes the relationship the user has established is valid.

Inferred lineage discovery

Data Catalog's Lineage Discovery job discovers inferred lineage relationships among all profiled files and tables. Inferred lineage discovery is the backend process that looks for the potential parent-child relationships between data resources across data sources, and infers lineage intelligently from metadata and data matches or overlaps between resources like timestamp, path, field values, and content.

Data Catalog uses a heuristic, data-centric approach to find similar data in an enterprise. This method helps to discover lineage relationships when audit logs or other historic metadata are unavailable. Because knowledge discovered through heuristics is not precise, the resulting lineage is inferred and not factual.

Inferred lineage is discovered across platforms, between different data source types. The following specific parent-child relationships are targeted:

Copy
Partial copy (vertical copy)
Subset (horizontal and vertical copy)
Superset (horizontal and vertical copy)
Union (generally would have two or more parents, and the child resource is the union of the identified parents)
Partial union
Join

Other types of lineage, such as join/merge relationships, are likely to be discovered because Data Catalog does not require all fields to match.
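
To make the first few relationship types concrete, the following Python sketch classifies a parent-child pair by comparing field names and per-field value sets. It is a simplified illustration under stated assumptions, not the Data Catalog algorithm: resources are modeled as hypothetical dictionaries of field name to value set, matching is exact rather than heuristic, row alignment is ignored, and union and join detection are out of scope.

```python
def classify(parent, child):
    """Classify a child against a parent.

    parent and child map field name -> set of values (a hypothetical model
    used only for this sketch).
    """
    shared = parent.keys() & child.keys()
    if not shared:
        return "unrelated"
    if child.keys() - parent.keys():
        # Child has fields the parent lacks, for example a join or union.
        return "other"
    if all(child[f] == parent[f] for f in shared):
        # Same values; fewer fields means a vertical (partial) copy.
        return "copy" if child.keys() == parent.keys() else "partial copy"
    if all(child[f] <= parent[f] for f in shared):
        return "subset"
    if all(parent[f] <= child[f] for f in shared):
        return "superset"
    return "unrelated"
```

For example, a child that keeps only the id field of a parent with all of its values is reported as a partial copy, while a child that keeps only some of the id values is reported as a subset.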

Inferred lineage discovery is based on the following assumptions:

Ability to match resource data for individual fields: Information is available as a result of the profiling.
Ability to order resources by creation date, on the assumption that a parent is created before its child. How this assumption is applied differs by data source type, which leads to the processing differences, data source compatibility rules, and other restrictions of the current implementation.
For parent and child data sources, it is important to use hints to scale down computation since the enterprise is moving data from one location (data source) to another. Required hints can be target virtual folders and an optional list of parent virtual folders.
Lineage is performed on regular resources and collection roots. Collection members are excluded from the lineage discovery.
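
The following Python sketch shows how these assumptions narrow the set of parent-child pairs that need to be compared. It is an illustration only: the Resource class, its attributes, and the candidate_pairs function are hypothetical names, and the strict date ordering shown here applies only to data source combinations that use dates.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Resource:
    # Hypothetical record for this sketch; not a Data Catalog type.
    path: str
    virtual_folder: str
    created: datetime
    is_collection_member: bool = False


def candidate_pairs(resources, target_folder, parent_folders=None):
    """Return (parent, child) pairs worth comparing."""
    # Collection members are excluded from lineage discovery.
    eligible = [r for r in resources if not r.is_collection_member]
    children = [r for r in eligible if r.virtual_folder == target_folder]
    if parent_folders is None:
        # No parent hint: parents and children come from the same folder.
        parents = children
    else:
        parents = [r for r in eligible if r.virtual_folder in parent_folders]
    # Keep only pairs where the parent was created before the child.
    return [(p, c) for c in children for p in parents if p.created < c.created]
```

Supplying the folder hints keeps the comparison from growing with every resource in the catalog, which is why they matter for scaling down computation.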

Spark execution parameters

Inferred lineage discovery is a resource-intensive Spark application that performs a cross product between the fingerprints of different data sources. Start with a large number of Spark executors (worker nodes), each with a large amount of memory allocated.

Best practices for lineage discovery

Most Data Catalog jobs, including lineage discovery, can be triggered from the user interface. See Managing jobs for details.

Consider the following best practices when running lineage discovery.

The lineage discovery job sequence operates on data in the Data Catalog HDFS metadata store. If new files are added to the cluster, run a profile job to collect profiling data so you can see information for the new files reflected in lineage relationships.
To optimize performance, profile all data on the cluster before running lineage discovery. During regular maintenance, run lineage discovery after a substantial number of files are added rather than running it for each incremental change.
Limit lineage discovery to a specific parent or child directory if your data environment is organized to allow you to isolate lineage to specific areas. Consult your administrator about modifying Spark parameters in a job if you want to discover lineages within the same folder.
To run lineage against a set of test files, do not provide the -parentVirtualFolderList parameter. In that case, both parent and child resources are picked from the virtual folder specified in the -virtualFolder parameter, as in this example:
<AGENT-HOME>$ bin/ldc lineage -virtualFolder Fin_Asia -path /data/Finance
The lineage command in this example triggers lineage discovery across the entire Fin_Asia folder, re-evaluating any existing suggested lineage relationships. The progress of the job is indicated by messages on the console and logged in the ldc-jobs.log file, which is found by default in /var/log/ldc.
Note: This shortcut for discovering lineages within the same folder will not work for JDBC resources.
Lineage discovery is a memory-intensive process. As a best practice, allow for additional time to run the initial lineage discovery.

Data source specific parent-child relationship

The following table lists specific parent-child relationships.

Parent data set type	Child data set type	Parent-child validation rules (ordering)
HDFS	HDFS	Parent is older than child. Ordered by last modified date.
HDFS	Hive	Factual lineage only. Creates the lineage during Hive schema discovery. Format discovery or UI browsing is required.
HDFS	JDBC	Do not use creation dates.
Hive	HDFS	N/A. HDFS tables are not created from Hive.
Hive	Hive	Parent is older than child. Ordered by creation date.
Hive	DB	Do not use creation dates.
DB	HDFS	Do not use creation dates.
DB	Hive	Do not use creation dates.
DB	DB	Do not use creation dates. Do not allow for the same virtual folder as a parent and as a child.
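
The table can be read as a lookup from a (parent type, child type) pair to an ordering rule. The following Python sketch encodes it that way for illustration. The rule labels and the may_infer function are hypothetical, and the data source names are copied from the table as written.

```python
from datetime import datetime

# Ordering rule per (parent type, child type), transcribed from the table.
ORDERING = {
    ("HDFS", "HDFS"): "older",    # ordered by last modified date
    ("HDFS", "Hive"): "factual",  # factual lineage only
    ("HDFS", "JDBC"): "no_dates",
    ("Hive", "HDFS"): "n/a",
    ("Hive", "Hive"): "older",    # ordered by creation date
    ("Hive", "DB"): "no_dates",
    ("DB", "HDFS"): "no_dates",
    ("DB", "Hive"): "no_dates",
    ("DB", "DB"): "no_dates",
}


def may_infer(parent_type, child_type, parent_time=None, child_time=None,
              same_virtual_folder=False):
    """Return True if the pair passes the ordering rule for inferred lineage."""
    rule = ORDERING[(parent_type, child_type)]  # KeyError for unlisted pairs
    if rule in ("factual", "n/a"):
        return False
    if rule == "older":
        return parent_time < child_time
    # Dates are not used; DB-to-DB also disallows a shared virtual folder.
    if (parent_type, child_type) == ("DB", "DB") and same_virtual_folder:
        return False
    return True
```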

Lineage configuration settings

The following table contains configuration settings for lineage with default values and related resources. Users with permission can modify these settings using Configuration on the Management page. For more information, see Managing configurations.

Note: As a best practice, do not change these configuration settings.

Setting	Description	Default	Resource specific
Use access time filter	Use last-access time when checking for lineage discovery. Only applicable if the HDFS settings for the last access date are enabled in the property `dfs.namenode.accesstime.precision` of hdfs-site.xml. Currently, this optimization is not recommended.	false	HDFS
Selectivity difference	Child field selectivity should be more than this value times the differences from selectivity of the matching parent field to consider resources for a lineage relationship. Disabled if cardinality difference is set to 0.0.	2	--
Percent of non-matching fields of same name to be excluded from the lineage consideration	If greater than this percentage of the child fields do not match for the same schema, then the parent-child is not considered for lineage discovery.	0.3	--
Minimum same schema fields to match	Minimum same schema fields to match. Otherwise, you cannot use same schema optimization.	5	--
Percent of fields with the same name to be considered as the same schema	Percent of fields with the same name to be considered as the same schema. For the same schema, lineage discovery matches the same name fields.	0.8	--
Min overlap values	Percentage of overlapped values to total child field values must be greater than or equal to this value to consider resources as candidates for a lineage relationship. Non-distinct or anonymous fields are not considered for lineage discovery. It is recommended that you not change this value. Values lower than 0.8 may lead to a significant number of false positive matches.	0.8	--
Maximum number of fields not to ignore anonymous values	If the number of fields for both parent and child lineage candidate resources are greater than this value, Data Catalog ignores optimization that disallows matching the anonymous values.	5	--
Max cardinality	Maximum cardinality value for all matched fields for lineage discovery to consider. Should be greater than or equal to this value to be considered as a valid lineage.	3	--
Min rate of matching fields	Minimum rate of matching fields count required to consider two resources as candidates for a lineage relationship. The field count or the value of this setting is used, whichever is larger.	0.1	--
Min matching fields	Minimum number of matching fields required to consider two resources as candidates for a lineage relationship.	2	--
Field name similarity	Minimal level of field name similarity allowed for lineage discovery.	0.8	--
Same directory modified time difference	Shortest interval allowed between last modified dates for files in the same directory to be considered for lineage relationships. If you have a transformation process that runs on files to create similar or 'refined' copies of data in the same directory, consider reducing this limit to ensure that Data Catalog inventory finds lineage relationships between the original file and the modified file.	30	HDFS
Cardinality difference	Child field cardinality can be up to this value times larger than cardinality of matching parent field to consider for lineage relationships. Disabled if value is 0.0.	0	--
Max access time window	Longest interval (in hours) between parent access time and child modification date for discovery to consider the two resources to be candidates for lineage relationships. Time checking is ignored if this value is set to 0.	24	HDFS
Batch size for left entity caching for discovery framework	Use this batch size to retrieve the left entity, which is the first entity type defined for the discovery cross product. For business term propagation, it is a term. For lineage discovery, it is a parent data resource.	500	--
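
To show how a few of these thresholds interact, the following Python sketch applies the default values of Min overlap values, Min matching fields, and Min rate of matching fields to a parent-child pair. It is a rough illustration and not the Data Catalog implementation: resources are modeled as hypothetical dictionaries of field name to value set, the rate is assumed to be relative to the number of child fields, and the other settings in the table are ignored.

```python
import math

MIN_OVERLAP = 0.8        # default of "Min overlap values"
MIN_MATCHING_FIELDS = 2  # default of "Min matching fields"
MIN_MATCH_RATE = 0.1     # default of "Min rate of matching fields"


def field_matches(parent_values, child_values, min_overlap=MIN_OVERLAP):
    """True if enough of the child's values also appear in the parent field."""
    if not child_values:
        return False
    overlap = len(child_values & parent_values) / len(child_values)
    return overlap >= min_overlap


def is_candidate(parent, child):
    """parent and child map field name -> set of values (hypothetical model)."""
    # Count child fields that overlap sufficiently with any parent field.
    matched = sum(
        1 for cv in child.values()
        if any(field_matches(pv, cv) for pv in parent.values())
    )
    # Use whichever requirement is larger: the fixed count or the rate.
    # Assumption: the rate applies to the child's field count.
    required = max(MIN_MATCHING_FIELDS, math.ceil(MIN_MATCH_RATE * len(child)))
    return matched >= required
```

With the defaults, a two-field child whose fields both overlap a parent by at least 80 percent qualifies as a candidate, while a child with only one sufficiently overlapping field does not.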
