After adding resources from the organization's data lake to Lumada Data Catalog, the next step is to profile and discover the data.
Running data catalog jobs
Data Catalog starts profiling, tag discovery, and lineage discovery operations on the edge node where Data Catalog is installed, and then triggers Spark jobs.
While almost all jobs can be initiated from the UI, some users or applications may prefer to run jobs from the command line or in a script.
This section details the command syntax for the different jobs supported by Data Catalog. For profiling, these jobs populate the Data Catalog repository with file format and schema information, sample data, and data quality metrics for files in HDFS, Hive, and JDBC sources. Much of the profiling information is stored on HDFS so that it is available for discovery operations.
Tag discovery and lineage discovery use the metadata collected during profiling to suggest tag associations and lineage relationships for resources in the inventory. This information is also stored in the repository.
Data Catalog jobs run from the command line are triggered on the edge node on which Data Catalog is installed. The jobs are started using scripts located in the bin subdirectory of the Data Catalog Agent's installation location. They can also be run as workflows in Apache Oozie.
Command syntax options for profiling and discovery jobs
The generic command syntax for a Data Catalog profiling or discovery job is as follows. Unless otherwise specified, these parameters are valid for most Data Catalog jobs.
These jobs are run as options to the waterline script, found in the installation directory /opt/waterlinedata/agent:
$ bin/waterline <job command> <necessary parameters> [- <optional Waterline parameters>] [-- <optional system parameters>]
The parameters in [..] brackets are optional; when absent, Data Catalog runs with their default values as listed in the matrix below.
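As an illustrative sketch of how the generic syntax is assembled, the snippet below composes an invocation from its parts and prints it. The job command `profile` and the `-path` parameter are placeholders for this example, not confirmed Data Catalog options; consult the matrix below for the actual commands.

```shell
#!/bin/sh
# Sketch only: job command and parameter names below are placeholders.
WLD_HOME=/opt/waterlinedata/agent   # installation directory noted above
JOB_CMD="profile"                   # hypothetical job command
JOB_PARAMS="-path /data/landing"    # hypothetical required parameter

# Assemble and display the command line that would be run:
echo "$WLD_HOME/bin/waterline $JOB_CMD $JOB_PARAMS"
```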
In addition to the optional waterline script parameters listed in the matrix below, users can also specify certain system parameters, such as:
| Parameter | Description |
|---|---|
| --driver-memory | Specifies the driver memory size |
| --executor-memory | Specifies the executor memory size |
| --queue | YARN queue name defined for Data Catalog jobs |
| --num-executors | Specifies the number of executors to be used |
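Per the generic syntax, these system parameters follow the `--` separator. A minimal sketch, assuming Spark-style values; the job command `profile` and all values shown are placeholders for illustration:

```shell
#!/bin/sh
# Placeholders throughout: "--" separates Waterline options from
# system (Spark/YARN) parameters, as in the generic syntax above.
JOB="profile"   # hypothetical job command
SYS="--driver-memory 4g --executor-memory 8g --num-executors 4 --queue wld"

# Print the command line that would be submitted:
echo "bin/waterline $JOB -- $SYS"
```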
The following matrix summarizes the various waterline script commands and the options that apply to them: