Hitachi Vantara Lumada and Pentaho Documentation

Discovery commands

The Discovery jobs use the metadata collected in profiling to suggest tag associations and lineage relationships for files in the inventory. This information is also stored in the repository.

For more information about what tags are and how they work, refer to Discovery & Propagation. To learn how to control tag discovery, refer to Tag Discovery Concepts & Controls in the User Guide. The discovery commands are run through the waterline script, which resides in the Agent component of Data Catalog under <Agent Location>/bin.

The following topics describe the discovery jobs in Data Catalog:

Tag discovery

The general syntax for this job is as follows:

$ bin/waterline tag -virtualFolder [Virtual Folder Name] [-path [tag propagation path] -incremental [true/false] [optional system flags]] 

Tag discovery is incremental by default (the -incremental flag is true): it runs only against updated or newly added resources in the catalog.
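
As a minimal sketch of how the documented flags combine, the following helper assembles a tag discovery command and prints it rather than running it. The function name build_tag_cmd and the virtual folder name Marketing are hypothetical, chosen only for illustration; the helper uses no flags beyond -virtualFolder and -incremental from the syntax above.

```shell
# Sketch only: build_tag_cmd is a hypothetical helper, not part of the
# waterline script. It prints a tag discovery command (dry run).
build_tag_cmd() {
  # $1: virtual folder name; $2: incremental flag (documented default: true)
  case "${2:-true}" in
    true|false) ;;
    *) echo "incremental must be true or false" >&2; return 1 ;;
  esac
  printf 'bin/waterline tag -virtualFolder %s -incremental %s\n' "$1" "${2:-true}"
}

# Incremental run (default), then a forced full run.
build_tag_cmd Marketing
build_tag_cmd Marketing false
```

Printing the command first makes it easy to review the flags before running the job against the catalog.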

Caution

As of Waterline Data 4.4, the -regex true/false flag for tag discovery is obsolete.

  • Regular expressions are re-evaluated internally against the most frequent values discovered (up to 2000 by default).
  • Regular expression tags are evaluated only when necessary: when a tag was not previously evaluated, or when it has a qualifying change.

    Qualifications:

    • The tag was not evaluated during profiling (for instance, it was added after the resource was profiled).
    • The regular expression was modified (the expression itself, or its minimum or maximum length).

Note

Actual tag association happens on top of this evaluation during tag discovery. If you are not satisfied with the evaluation during tag discovery, you can always re-run profiling with -incremental false and then re-run tag discovery.

To force tag discovery across the entire data catalog run the following command:

$ bin/waterline tag -incremental false <additional job options>

In releases earlier than Waterline Data 4.4, where the -regex flag is still supported, you can propagate both regular expression tags and value tags by including the regular expression tags in the same tag command:

$ bin/waterline tag -regex true <additional job options>

Note

Tag discovery requires that the profiling job has run at least once to gather the metadata required for tag discovery.

Lineage discovery

Lumada Data Catalog visualizes relationships among resources in the form of a lineage graph. This graph shows parent-to-child relationships among resources, where the child resource has the same or a subset of rows and the same or a subset of columns from the parent and where the parent was created before the child.

Lineage relationships can be determined in two different ways:

  1. Imported from an external source such as Cloudera Navigator or Apache Atlas.
    • Navigator integration
    • Atlas integration
  2. Inferred from Data Catalog's lineage discovery processing.

Lineage discovery identifies lineage relationships among all profiled files and tables and calculates file and table origins.

The general syntax for lineage discovery is as follows:

$ bin/waterline lineage -virtualFolder [child virtual folder name] \
                        -parentVirtualFolderList [parents virtual folder list] \
                        -incremental [true/false]

Where:

-virtualFolder is the child root virtual folder.

-parentVirtualFolderList is a quoted, comma-separated list (commas only, no spaces) of potential parent root virtual folders, such as "Parent1,Parent2,Parent3", when there is more than one potential parent; otherwise, it is the name of the single potential parent virtual folder.

-incremental, if set to false, runs lineage discovery across the entire specified data resource list, not just the new or updated resources. Its default value is true.
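
The comma-only list format can be produced from separate folder names with a small shell helper. This is a sketch: join_parents is a hypothetical function, and the folder names are the ones used in the sample commands. The lineage command is printed rather than executed.

```shell
# Sketch only: join_parents is a hypothetical helper. It joins its arguments
# with commas and no spaces, the format described for -parentVirtualFolderList.
join_parents() {
  # "$*" joins the arguments with the first character of IFS; the subshell
  # keeps the IFS change from leaking into the caller.
  ( IFS=,; printf '%s\n' "$*" )
}

parents=$(join_parents Corp-HQ MagnUX DataWarehouse)

# Dry run: print the lineage command with the list quoted.
echo "bin/waterline lineage -virtualFolder EUMEAS -parentVirtualFolderList \"$parents\" -incremental false"
```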

A sample lineage discovery command would be as follows:

$ bin/waterline lineage -virtualFolder EUMEAS \
                        -parentVirtualFolderList "Corp-HQ,MagnUX,DataWarehouse" \
                        -incremental false

Or

$ bin/waterline lineage -virtualFolder Marketing \
                        -parentVirtualFolderList Corp-HQ

Caution

If the lineage job fails (returns a status code of -1), the origin job does not run.

The two data source parameters (-virtualFolder and -parentVirtualFolderList) are optional but recommended, especially for evaluation. At a minimum, include the child data source, which is the set of resources that you want to show lineage for. By default, Data Catalog searches the entire repository for parent candidates for these resources. If you also specify one or more parent data sources, Data Catalog restricts the search for parents to just those resources.

There are additional parameters, but these are the important ones.

To run lineage against a set of test files, set both the parent and child directories to the same location, like this:

$ bin/waterline lineage -virtualFolder /data/Finance \
                        -parentVirtualFolderList /data/Finance

Caution

Lineage discovery is a memory-intensive process; initial lineage discovery runs can take a long time.

Origins

Origins of a data resource are the data sources related to that resource either directly (the resource belongs to the data source) or indirectly, through imported, inferred, or manually created lineages. Important: In this release, origins are not provided automatically. To see origins in the single file view of a data resource, run the following command:

$ bin/waterline utils com.waterlinedata.origin.OriginPropagation -mode <value> 

Where:

-mode determines the type of lineage the origin will propagate through:

  • inferred: propagates origins only through inferred lineages, accepted and suggested
  • imported: propagates origins only through factual imported lineages
  • all: (default) propagates origins through inferred and imported lineages
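
A wrapper script can check the -mode value before calling the utility. The following is a sketch under stated assumptions: origin_mode is a hypothetical helper, and it accepts only the three values listed above, falling back to the documented default of all. The command is printed rather than executed.

```shell
# Sketch only: origin_mode is a hypothetical helper that validates a -mode
# value and defaults to "all" when no value is given.
origin_mode() {
  case "${1:-all}" in
    inferred|imported|all) printf '%s\n' "${1:-all}" ;;
    *) echo "unknown mode: $1 (expected inferred, imported, or all)" >&2; return 1 ;;
  esac
}

# Dry run: print the propagation command only when the mode is valid.
mode=$(origin_mode inferred) &&
  echo "bin/waterline utils com.waterlinedata.origin.OriginPropagation -mode $mode"
```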

For instance, to propagate origins only through inferred lineages, run the following command:

$ bin/waterline utils com.waterlinedata.origin.OriginPropagation -mode inferred 

After running lineage discovery, you may want to run:

$ bin/waterline utils com.waterlinedata.origin.OriginPropagation -mode all
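
Because origin propagation is a separate step, the two commands can be chained in a script so that propagation is skipped when lineage discovery fails. The sketch below assumes a WATERLINE variable pointing at the waterline script and a hypothetical wrapper named run_lineage_then_origins; neither is part of the product. Note that a process exit code of -1 is normally seen by the shell as 255, so the wrapper tests for any non-zero status instead of comparing against -1.

```shell
# Sketch only: WATERLINE and run_lineage_then_origins are assumptions made
# for this example, not part of the waterline script.
WATERLINE="${WATERLINE:-bin/waterline}"

run_lineage_then_origins() {
  # $1: child virtual folder; $2: comma-separated parent virtual folder list
  if "$WATERLINE" lineage -virtualFolder "$1" -parentVirtualFolderList "$2"; then
    # Lineage discovery succeeded: propagate origins through all lineages.
    "$WATERLINE" utils com.waterlinedata.origin.OriginPropagation -mode all
  else
    status=$?   # exit status of the failed lineage command
    echo "lineage discovery failed with status $status; skipping origin propagation" >&2
    return "$status"
  fi
}
```

For example, run_lineage_then_origins Marketing Corp-HQ would run lineage discovery for the Marketing virtual folder and propagate origins only if that job exits successfully.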

Performance notes

Multiple factors impact the performance of lineage discovery.

  • Size of the data: You can reduce the scope of lineage discovery by providing fewer potential parent data sources. To get a recommendation, run the following script:
    $ bin/waterline utils com.waterlinedata.lp_service.utils.RepoCacheMaintenance
  • Level of parallelism: Explicitly allocate multiple executors per node, with at least --num-executors N*2, where N is the number of nodes.
  • Amount of executor memory: The more memory you allocate, the faster lineage discovery runs, because the cache size automatically increases up to the limits set by the properties below. However, more memory can also slow down performance because of garbage collection overhead.
  • Relevant batch sizes in configuration.json
    waterlinedata.discovery.framework.left.batchsize (default value set to 500) 
    waterlinedata.discovery.framework.right.batchsize (default value set to 400) 

Increasing these batch sizes, together with executor memory, can raise the caching limits. Exercise caution, because increasing these values enlarges the portion of the Cartesian product generated for lineage and can result in out-of-memory (OOM) exceptions, performance degradation, or both.
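
The N*2 guideline for parallelism is simple arithmetic. As a sketch, the hypothetical helper min_executors below computes the minimum executor count for a given node count; the node count of 4 is an assumed value for illustration, and how the resulting flag is passed to the job depends on your environment.

```shell
# Sketch only: min_executors is a hypothetical helper that applies the
# N*2 guideline, where N is the number of nodes.
min_executors() {
  echo $(( $1 * 2 ))
}

nodes=4   # assumed cluster size, for illustration only
echo "--num-executors $(min_executors "$nodes")"
```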