
Command syntax and options

After adding resources from your organization's data lake to Lumada Data Catalog, the next step is to profile and discover the data.

Running Data Catalog jobs

The Data Catalog starts profiling, tag discovery, and lineage discovery operations on the edge node where Data Catalog is installed and then triggers Spark jobs.

While almost all jobs can be initiated from the UI, users or applications may prefer to run some jobs from the command line or in a script.

This section details the command syntax for the different jobs supported by Data Catalog. For profiling, these jobs populate the Data Catalog repository with file format and schema information, sample data, and data quality metrics for resources in HDFS, Hive, and JDBC data sources. Much of the profiling information is stored on HDFS so that it is available for discovery operations.

Tag discovery and lineage discovery use the metadata collected during profiling to suggest tag associations and lineage relationships for resources in the inventory. This information is also stored in the repository.

Data Catalog jobs run from the command line are triggered on the edge node on which Data Catalog is installed. The jobs are started using scripts located in the bin subdirectory of the Data Catalog Agent's installation location. They can also be run as workflows in Apache Oozie.

Command syntax options for profiling and discovery jobs

The generic command syntax for Data Catalog profiling and discovery jobs is as follows. Unless otherwise specified, these parameters are valid for most Data Catalog jobs.

These jobs are run as options to the waterline script found in the installation directory, /opt/waterlinedata/agent.

Note: It is assumed that all Data Catalog scripts and jobs are run from the installation directory, /opt/waterlinedata/agent, unless otherwise specified. Moreover, jobs pertaining to a data source must be run on an agent running on the same cluster as the data source for Hadoop and Hive data sources, and in close network proximity for JDBC data sources.
$ bin/waterline <job command> <necessary parameters> [- <optional Waterline parameters>] [-- <optional system parameters>]

Where:

The parameters in [..] brackets are optional; when absent, Data Catalog runs with their default values as listed in the matrix below.

In addition to the optional waterline script parameters listed in the matrix below, users can also specify certain system parameters:

System parameter     Description
--driver-memory      Specifies the driver memory size
--executor-memory    Specifies the executor memory size
--queue              YARN queue name defined for Data Catalog jobs
--num-executors      Specifies the number of executors to be used
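For example, the following invocation combines the generic syntax with the system parameters above. The job command and required parameters are placeholders to be taken from the matrix below; the memory sizes, executor count, and queue name are illustrative values only, not defaults, and must match your cluster's YARN configuration:

$ cd /opt/waterlinedata/agent
$ bin/waterline <job command> <necessary parameters> --driver-memory 4g --executor-memory 8g --num-executors 4 --queue default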

The following matrix summarizes the various waterline script commands and the options that apply to them:

Command syntax and options for profiling and discovery jobs matrix