Lineage discovery

Last updated
Save as PDF

Use the data lineage tools to help track the relationships between data resources in your data lake, which is especially helpful when you frequently merge and duplicate data. Lumada Data Catalog uses resource metadata and resource data to discover cluster resources that are related to each other. It identifies copies of the same data and merges between resources and the horizontal and vertical subsets of these resources. These relationships are the lineages of the resources.

The places where data comes into the cluster is called the origin of that data. Typically, the origin is the data source belonging to or coming from the resource. Data Catalog propagates the origin information across the lineage relationships to show the origin of the data.

Using lineage terminology

As a best practice, you may want to become familiar with the common terms used in data lineage and origins.

Upstream vs Downstream
Data lineage represents the flow of data. Data comes from a source upstream and flows into a target downstream. Using the anchor resource as a vantage point, Data Catalog expresses flow in the graph from the left to right of the map.
Target vs Source
A target is any resource that is pulling data from another data source. A source is any resource that feeds data out into a target. Single hop instances that are immediately upstream and downstream are known as parent and child of the target.
Inferred lineage
This lineage is suggested by Data Catalog. It is inferred or derived based on the metadata and data for data resources across the platform.
Factual lineage
All non-suggested lineage is considered to be factual lineage and is derived from user curations or third party tools, like Atlas or Navigator.
Multi-hop lineage
Data Catalog can trace lineage up to 5 levels upstream to a source and 5 levels downstream to targets.
Resource-level lineage
This type of lineage traces the relationship between the upstream to downstream resources in a data lake.
Field-level lineage
For an established resource lineage, field-level lineage traces the field relationships between the source and target resources. The field-level mapping details the relationships.
Systems
The system is the outermost boundary containing a resource. Every resource sits within a system. A system corresponds to a business analyst's concept of an application, library, or logical repository, such as "Sales Production" or "Corporate Warehouse". The system is key to understanding the flow of data across the firm.
For example, if a business analyst is looking at the "Sales Team Compensation" file, then they may see a file with a sales figure upstream. If it is in the sales production system, the sales figures are probably authoritative. If they are in a training services system, they are much less authoritative.
As lineages become longer chains, and the farther away a file is in the lineage, the user is less likely to know of the specifics of folders and such, but the system names are more likely to be informative.
System is defined in Data Catalog as a property on a virtual folder. If the system property is not set, it defaults to the data source name. Also, a system may contain multiple resources. For more information, see Manage virtual folders for details.

Common lineage discovery tasks include:

Lineage tracing
Discovering upstream data origination, or which source resources the data is coming from.
Impact assessment
Discovering downstream data impact, or which target resources this data will affect.

Types of lineages

Data Catalog can import the factual lineage collected by external tools or discover inferred lineage relationships, as described in the following topics:

Imported or factual lineage

In Data Catalog, you can import the factual lineage gathered by Atlas or Navigator tools. See Lineage and origins for more details.

Inferred lineage discovery

Data Catalog's lineage discovery job discovers inferred lineage relationships among all profiled files and tables, then calculates file and table origins. Inferred lineage discovery is the backend process that is looking for the potential parent-child relationships between data resources across data sources.

Lineage is inferred in contrast with Atlas and Navigator factual lineage which is based on audit logs. Data Catalog uses a heuristic, data-centric approach to find the similar data in an enterprise. When audit logs or other historic metadata are unavailable, this method helps to discover lineage relationships. Because discovered knowledge based on heuristics is not precise, the lineage is inferred and not factual.

Inferred lineage is discovered across platform, between different data source types. The following specific parent-child relationships are targeted:

Copy
Partial copy (vertical copy)
Subset (horizontal and vertical copy)
Superset (horizontal and vertical copy)
Union (generally would have two or more parents, and the child resource is the union of the identified parents)
Partial union
Join

Other types of lineage, such as join/merge relationships, are likely to be discovered because Data Catalog does not require all fields to match.

Inferred lineage discovery is based on the following assumptions:

Ability to match resource data for individual fields: Information is available as a result of the profiling.
Ability to arrange resources based on creation date with the assumption that the parent is created before the child. This assumption results in different processing details and data source compatibility and other restrictions based on the current implementation details.
For parent and child data sources, it is important to use hints to scale down computation since the enterprise is moving data from one location (data source) to another. Required hints can be target virtual folders and an optional list of parent virtual folders.
Lineage is performed on regular resources and collection roots. Collection members are excluded from the lineage discovery.

Profiled data is a pre-condition to lineage discovery. Profiling must be completed for all data resources before lineage discovery.

Spark execution parameters

Inferred lineage discovery is a resource-consuming Spark application that performs across the product between different data source fingerprints. It is necessary to start with large number of executors with large amount of memory allocated for each executor.

Best practices for lineage discovery

Most Data Catalog jobs, including lineage discovery, can now be triggered from the user interface. Previously, such jobs were kicked off using the command line interface (CLI). See Managing jobs for details.

Consider the following best practices when running lineage discovery.

The lineage discovery job sequence operates on data in the Data Catalog HDFS metadata store. If new files are added to the cluster, run a profile job to collect profiling data so you can see information for the new files reflected in lineage relationships.
To optimize performance, profile all data on the cluster before running lineage discovery. During regular maintenance, run lineage discovery after a substantial number of files are added rather than running it for each incremental change.
Limit lineage discovery to a specific parent or child directory if your data lake is organized to allow you to isolate lineage to specific areas. Consult your administrator about modifying Spark parameters in a job if you want to discover lineages within the same folder.
To run lineage against a set of test files, do not provide the -parentVirtualFolderList parameter. In such a case, both parent and children resources are picked from the -virtualFolder parameter listed, like in this example:
<AGENT-HOME>$ bin/ldc lineage -virtualFolder Fin_Asia -path /data/Finance
The Lineage command in this example triggers lineage discovery across the entire Fin_Asia folder, re-evaluating any existing suggested lineage relationships. The progress of the job is indicated by messages on the console and logged in the ldc‑jobs.log found by default in /var/log/ldc.
NoteThis shortcut for discovering lineages within the same folder will not work for JDBC resources.
Lineage discovery is a memory-intensive process. As a best practice, allow for additional time to run the initial lineage discovery.

Data source specific parent-child relationship

The following table lists specific parent-child relationships.

Parent data set type	Child data set type	Parent-child validation rules (ordering)
HDFS	HDFS	Parent is older than child. Ordered by last modified date. Other configured time sensitive restrictions (see configuration.json).
HDFS	HIVE	Factual lineage only. Creates the lineage during HIVE schema discovery. Format discovery or UI browsing is required.
HDFS	JDBC	Do not use creation dates.
HIVE	HDFS	N/A. HDFS tables are not created from HIVE.
HIVE	HIVE	Parent is older than child. Ordered by creation date.
HIVE	DB	Do not use creation dates.
DB	HDFS	Do not use creation dates.
DB	HIVE	Do not use creation dates.
DB	DB	Do not use creation dates. Do not allow for the same virtual folder as a parent and as a child.

Inferred lineage configuration parameters

The following table contains configuration parameters for inferred linages with default values and related resources.

Parameter	Name	Description	Default	Resource specific
batch_window_hours	Max access time window	Longest interval (in hours) between parent access time and child modification date for discovery to consider the two resources to be candidates for lineage relationships. Time checking is ignored if this value is set to 0.	24	HDFS
diff_same_directory_sec	Same directory modified time difference	Shortest interval allowed between last modified dates for files in the same directory to be considered for lineage relationships. If you have a transformation process that runs on files to create similar or 'refined' copies of data in the same directory, reduce this limit to ensure that Data Catalog inventory finds lineage relationships between the original file and the modified file.	30	HDFS
min_lineage_field_count	Min matching fields	Minimum number of matching fields required to consider two resources as candidates for a lineage relationship.	2	--
min_max_cardinality	Max cardinality	Maximum cardinality value for all matched fields for lineage discovery to consider. Should be greater or equal to this value to be considered as a valid lineage.	3	--
overlap	Min overlap values	Portion of overlapped values to cardinality of the child field for filter (copy, partial copy) lineage or parent field for union lineage. Portion must be greater than or equal to this value to consider parent-child field as a valid candidate for a lineage relationship. This value is not recommended to change. Values lower that 0.8 may lead to a significant number of false positive matches.	0.8	--
filter.min_fields2infer	Minimum number for mapped child fields to infer the filter lineage	Infers filter lineage (copy or partial copy) if the portion of mapped child fields is greater or equal to this value.	0.6	--
union.min_fields2infer	Minimum number for mapped child fields to infer the lineage	Infers union lineage if the portion of mapped child fields is greater or equal to this value.	0.8	--
filter.max_alternatives	Maximum number of alternative filter lineages	Maximum number of alternative filter lineages (copy and partial copy) to infer.	2	--
union.max_alternatives	Maximum number of alternative union lineages	Maximum number of alternative union lineages to infer.	2	--
use_access_time_filter	Use access time filter	Use last-access time when checking for lineage discovery. Only applicable if the HDFS settings for last access date are enabled in the property `dfs.namenode.accesstime.precision` of hdfs-site.xml. Currently, this optimization is not recommended.	false	HDFS
same_schema	Percent of fields with the same name to be considered as the same schema	Percent of fields with the same name to be considered as the same schema. For the same schema, lineage discovery matches the same name fields.	0.8	--
same_schema.non_matching	Percent of non-matching fields of same name to be excluded from the lineage consideration	If greater than this percentage of the child fields do not match the same schema, then the parent-child is not considered for lineage discovery.	0.3	--
same_schema.min2check	Minimum same schema fields to match	Minimum same schema fields to match. Otherwise, you cannot use same schema optimization.	5	--
lpupdate.batchsize	Batch size for discovery cache objects	Use this batch size to save discovery cache objects. One batch normally is saved in one HDFS MapFile in the location configured using `ldc.metadata.hdfs.large_properties.uri` and `ldc.metadata.hdfs.large_properties.path`	200	--
discovery.framework.right.batchsize	Batch size for left entity caching for discovery framework	Use this batch size to retrieve left entity: first entity type defined for the discovery cross product. For tag propagation, it is a tag. For lineage discovery, it is a parent data resource.	500	--
discovery.framework.right.batchsize	Batch size for right entity caching for discovery framework	Use this batch size to retrieve the right entity: second entity type defined for the discovery cross product. Currently, it is always a data resource.	400	--

Origins

The places where data comes into the cluster is called the origin of that data. Typically, the origin is the data source or the virtual folder belonging to or coming from the resource. Data Catalog propagates the origin information across the lineage relationships to show the origin of the data.

Origins of any resource are data sources related to this data resource directly (belongs to) or indirectly, through imported, inferred, or manually created (factual) lineages.

To see origins in the single resource view, navigate to Manage, then click Tools. Select Utils from the Tools dialog box that displays. For details about entering parameters to view origins, see Examine data resource origins.

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.