Hitachi Vantara Lumada and Pentaho Documentation

Profiling datasets

A Dataset is one of the data assets in Lumada Data Catalog. Datasets let you group resources that have the same schema, and that may span different virtual folders in your data lake, into a single data unit for easier management.

You can think of a Dataset as a user-defined virtual collection of resources that have a matching schema but may have different path specifications or hierarchies with respect to their physical location in your data lake, irrespective of their data source type.

Datasets differ substantially from Virtual Folders in structure and behavior, so profiling Datasets requires a separate set of commands, which are described in this section.

Once resources are added to a Dataset, you can manage and profile the Dataset from the command line using the Data Catalog command line jobs described in this section.

Create Dataset

While not a profiling command per se, the createDataSet command lets you create Datasets from the CLI or from scripts.

createDataSet: This command line job creates a Dataset with the specified name, dataSetId, and schemaVersion, and with a path specification built from the source path and the optional include and exclude patterns.

Usage:

<WLD Agent>$ bin/waterline createDataSet -virtualFolder <Source Virtual Folder> \
                                                  -dataSet <Dataset Name> \
                                                  -dataSetId <Dataset ID> \
                                                  -schemaVersion <Schema Version number> \
                                                  -path <Source path for the pathSpec template> \
                                                  [-includePattern <Regex include pattern>] \
                                                  [-excludePattern <Regex exclude pattern>] 

Example:

<WLD Agent>$ bin/waterline createDataSet -virtualFolder VF_country_1 \
                                                  -dataSet myCountry_DS1 \
                                                  -dataSetId DSET0001 \
                                                  -schemaVersion 1 \
                                                  -path /user/cloudera/demodata-19/pub/insurance/countries/Asia \
                                                  -includePattern ".*csv" 
Note
  • When creating the dataset, you must specify the source virtual folder and the source path for the path specification template.
  • Currently, only S3 and HDFS data sources are supported.
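
If you drive createDataSet from a script, the regex patterns need to be quoted so that the shell does not expand them. The following Python sketch assembles the command from the flags listed under Usage and prints a shell-safe command line. The helper name create_dataset_args is hypothetical and is not part of Data Catalog, and the sketch only builds the command without running it.

```python
import shlex

def create_dataset_args(virtual_folder, data_set, data_set_id, schema_version,
                        path, include_pattern=None, exclude_pattern=None):
    """Build the createDataSet argument list (hypothetical helper, not a Data Catalog API)."""
    args = [
        "bin/waterline", "createDataSet",
        "-virtualFolder", virtual_folder,
        "-dataSet", data_set,
        "-dataSetId", data_set_id,
        "-schemaVersion", str(schema_version),
        "-path", path,
    ]
    # The include and exclude patterns are optional in the usage above.
    if include_pattern is not None:
        args += ["-includePattern", include_pattern]
    if exclude_pattern is not None:
        args += ["-excludePattern", exclude_pattern]
    return args

args = create_dataset_args(
    "VF_country_1", "myCountry_DS1", "DSET0001", 1,
    "/user/cloudera/demodata-19/pub/insurance/countries/Asia",
    include_pattern=".*csv",
)
# shlex.join quotes arguments that contain shell metacharacters, such as the regex.
print(shlex.join(args))
```

Run the printed command from the <WLD Agent> directory, as in the example above.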

Add resource to dataset

The current UI is limited in how it adds resources to Datasets. You can use this command in a script to add multiple resources rather than adding them one at a time.

addResourceToDataSet: This command adds a resource to the Dataset by specifying its path.

Usage:

<WLD Agent>$ bin/waterline addResourceToDataSet -virtualFolder <Resource Virtual Folder> \
                                                         -dataSet <Dataset Name> \
                                                         -dataSetId <Dataset ID> \
                                                         -schemaVersion <Schema Version number> \
                                                         -path <Resource path> 

Example:

<WLD Agent>$ bin/waterline addResourceToDataSet -dataSet myCountry_DS1 \
                                                         -virtualFolder VF_country_1 \
                                                         -schemaVersion 1 \
                                                         -dataSetId DSET0001 \
                                                         -path /user/cloudera/demo-data-19/pub/insurance/countries/Asia/transactions_china.csv 
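
As a minimal sketch of the scripted approach, the following Python code generates one addResourceToDataSet command per resource path. The helper name add_resource_commands is hypothetical, and transactions_india.csv is an illustrative file name that does not appear elsewhere in this article. The sketch prints the commands rather than running them.

```python
import shlex

def add_resource_commands(virtual_folder, data_set, data_set_id, schema_version, paths):
    """Yield one addResourceToDataSet command line per resource path (hypothetical helper)."""
    for path in paths:
        yield shlex.join([
            "bin/waterline", "addResourceToDataSet",
            "-virtualFolder", virtual_folder,
            "-dataSet", data_set,
            "-dataSetId", data_set_id,
            "-schemaVersion", str(schema_version),
            "-path", path,
        ])

# Illustrative paths: only transactions_china.csv comes from the example above.
base = "/user/cloudera/demo-data-19/pub/insurance/countries/Asia"
paths = [f"{base}/transactions_china.csv", f"{base}/transactions_india.csv"]

commands = list(add_resource_commands("VF_country_1", "myCountry_DS1", "DSET0001", 1, paths))
for command in commands:
    print(command)
```

You could write the printed lines to a shell script and run it from the <WLD Agent> directory.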

Dataset format discovery

Dataset format discovery registers the Dataset in the Data Catalog repository and marks it as available for further processing. Each Dataset is included in the catalog and is available for searching and tagging.

dataSetFormat: This command runs format discovery on the members of the Dataset (S3/HDFS only for this release).

Usage:

<WLD Agent>$ bin/waterline dataSetFormat -dataSet <Dataset Name> \
                                                  [-dataSetId <Dataset ID>] \
                                                  [-schemaVersion <Schema version number>] \
                                                  [-virtualFolder <Virtual Folder Name>] 

Example:

<WLD Agent>$ bin/waterline dataSetFormat -virtualFolder VF_country_1 \
                                                  -dataSet myCountry_DS1 \
                                                  -dataSetId DSET0001 \
                                                  -schemaVersion 1

Dataset schema discovery

Dataset schema discovery applies format-specific algorithms to determine the structure of the dataset members.

dataSetSchema: This command runs schema discovery on the Dataset members.

Usage:

<WLD Agent>$ bin/waterline dataSetSchema -dataSet <Dataset Name> \
                                                  [-dataSetId <Dataset ID>] \
                                                  [-schemaVersion <Schema version number>] \
                                                  [-virtualFolder <Virtual Folder Name>] 

Example:

<WLD Agent>$ bin/waterline dataSetSchema -virtualFolder VF_country_1 \
                                                  -dataSet myCountry_DS1 \
                                                  -dataSetId DSET0001 \
                                                  -schemaVersion 1

Dataset profiling

Dataset basic profiling applies the schema information from Dataset schema discovery to collect statistics for each field in the member resources. During this process, the Dataset profiling engine also picks the reference schema for the Dataset, against which the schema of each member resource is compared. Based on the results of this comparison, the engine includes or ignores each member resource as part of the Dataset for further processing. The reference schema is taken from the member resource with the oldest timestamp in the repository.
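
The reference schema selection described above can be pictured with a short Python sketch. This is an illustration only, not the profiling engine's implementation. The Member class, the integer timestamps, the tuple representation of a schema, the exact-equality comparison, and the sample paths are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Member:
    path: str
    timestamp: int   # assumed repository timestamp; smaller means older
    schema: tuple    # assumed representation: ordered (field, type) pairs

def split_by_reference_schema(members):
    """Return (reference_schema, included, ignored) for a list of members."""
    # The reference schema comes from the member with the oldest timestamp.
    reference = min(members, key=lambda m: m.timestamp).schema
    # Assumption: a member matches only if its schema equals the reference exactly.
    included = [m for m in members if m.schema == reference]
    ignored = [m for m in members if m.schema != reference]
    return reference, included, ignored

# Hypothetical members; the third one has a different schema.
members = [
    Member("/asia/transactions_china.csv", 100, (("id", "int"), ("amount", "double"))),
    Member("/asia/transactions_japan.csv", 200, (("id", "int"), ("amount", "double"))),
    Member("/asia/customers.csv", 300, (("id", "int"), ("name", "string"))),
]
reference, included, ignored = split_by_reference_schema(members)
print([m.path for m in included])
print([m.path for m in ignored])
```

In this sketch the oldest member always lands in the included list, because its own schema is the reference.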

dataSetProfile: This command does basic profiling of the Dataset members.

Usage:

<WLD Agent>$ bin/waterline dataSetProfile -dataSet <Dataset Name> \
                                                   [-dataSetId <Dataset ID>] \
                                                   [-schemaVersion <Schema version number>] \
                                                   [-virtualFolder <Virtual Folder Name>] 
Note
  • When a Dataset is created on a data lake that has already been profiled, you need only perform dataSetProfile on the newly created Dataset.

Example:

<WLD Agent>$ bin/waterline dataSetProfile -virtualFolder VF_country_1 \
                                                   -dataSet myCountry_DS1 \
                                                   -dataSetId DSET0001 \
                                                   -schemaVersion 1 

Tag discovery

Tag discovery propagates tags across the data lake for all resources that have undergone extended profiling. Datasets are no exception: tags seeded in Datasets are also propagated, and similar associations are discovered and suggested as for any other resource in the data lake.

tag: When you specify a Virtual Folder as a command line option, this command runs tag discovery on the member resources of the Virtual Folder and then runs tag discovery on all the Datasets associated with that Virtual Folder.

Usage:

<WLD Agent>$ bin/waterline tag [-virtualFolder <Virtual Folder Name>] \
                                        [-incremental <true/false> ] \
                                        [-path <Tag propagation path>] \
                                        [-regex <true/false>] 

Example:

<WLD Agent>$ bin/waterline tag -virtualFolder VF_All 
Note
  • If an added resource is already profiled, the Dataset profiling jobs simply fetch the information for that resource from the Data Catalog metadata directory. The file or table is not re-profiled unless you run the command with the -incremental option set to false.

Dataset process

dataSetProcess: Much like the Waterline process command, this command runs dataSetFormat, dataSetSchema, and dataSetProfile, in that order.

Usage:

<WLD Agent>$ bin/waterline dataSetProcess -dataSet <Dataset Name> \
                                                   [-dataSetId <Dataset ID>] \
                                                   [-schemaVersion <Schema version number>] \
                                                   [-virtualFolder <Virtual Folder Name>]

Example:

<WLD Agent>$ bin/waterline dataSetProcess -virtualFolder VF_country_1 \
                                                   -dataSet myCountry_DS1 \
                                                   -dataSetId DSET0001 \
                                                   -schemaVersion 1
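
To make the ordering concrete, the following Python sketch prints the three individual commands that dataSetProcess is described as running in sequence. The helper name individual_step_commands is hypothetical. The sketch only illustrates the order of the steps and makes no claim that running the commands separately is identical to dataSetProcess in every respect.

```python
import shlex

# Order taken from the description above: format, then schema, then profile.
STEPS = ("dataSetFormat", "dataSetSchema", "dataSetProfile")

def individual_step_commands(virtual_folder, data_set, data_set_id, schema_version):
    """Return the per-step command lines in processing order (hypothetical helper)."""
    options = [
        "-virtualFolder", virtual_folder,
        "-dataSet", data_set,
        "-dataSetId", data_set_id,
        "-schemaVersion", str(schema_version),
    ]
    return [shlex.join(["bin/waterline", step, *options]) for step in STEPS]

commands = individual_step_commands("VF_country_1", "myCountry_DS1", "DSET0001", 1)
for command in commands:
    print(command)
```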

Purge Dataset resources

The purge job flushes the resources marked for deletion from the Data Catalog repository.

Sometimes, after Data Catalog has processed and fingerprinted a resource, the resource itself is deleted from the data source. Data Catalog marks such resources for deletion only during the next non-incremental processing cycle.

Marking makes the absent resource unavailable for further data curation, but the Data Catalog repositories retain these entries, which can still show up in search queries or listings.

If such resources are members of Datasets, they continue to produce Resource not found! messages in the logs during the next processing cycle of the Dataset.

The purge job flushes all resources that are marked for deletion from the repositories, so that deleted resources no longer appear in search queries and listings.

The CLI script at <WLD Agent>/bin/waterline lets you provide the list of resources to delete as a CSV file. This is particularly helpful for Dataset members.

The syntax for purgeResources is as follows:

<WLD Agent>$ bin/waterline purgeResources -deletedResourcesFile <Fully qualified path to the DeleteResourceListFile> \
                                                   -purgeDatasetMembers <true/false>

Where

  • -deletedResourcesFile: required parameter that specifies a file containing the list of resources to be removed from Lumada Data Catalog.
  • DeleteResourceListFile (sample name used as example) is a file that contains the list of resources to be deleted by this utility, in comma-separated format.
  • -purgeDatasetMembers: defaults to false, which skips resources that are Dataset members. When true, the resource is removed from the Dataset and then deleted from Lumada Data Catalog.

A sample DeleteResourceListFile contains dataSourceName and resourcePath as comma-separated rows, as shown in the following example:

CorpHQ, /corp/hq/resource1.json
Marketing, /marketing/eu_zone/curr_adv.csv 
BeeZign, /corp.session15 

For example, if resource1.json is a Dataset member and is listed in the DeleteResourceListFile, it is deleted when -purgeDatasetMembers is set to true and skipped otherwise.
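
The effect of -purgeDatasetMembers on the sample file can be sketched in Python as follows. This models only the skip-or-delete decision described above and is not the purge job itself. The function name plan_purge and the set of Dataset members are assumptions made for the example, and the sketch ignores whether a listed resource has actually been marked for deletion.

```python
import csv
import io

# Contents of the sample DeleteResourceListFile shown above.
SAMPLE = """CorpHQ, /corp/hq/resource1.json
Marketing, /marketing/eu_zone/curr_adv.csv
BeeZign, /corp.session15
"""

def plan_purge(list_text, dataset_members, purge_dataset_members=False):
    """Return (to_delete, skipped) lists of (dataSourceName, resourcePath) rows."""
    to_delete, skipped = [], []
    # skipinitialspace drops the space that follows each comma in the sample.
    for row in csv.reader(io.StringIO(list_text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        entry = (row[0].strip(), row[1].strip())
        if entry in dataset_members and not purge_dataset_members:
            skipped.append(entry)   # Dataset members are skipped by default
        else:
            to_delete.append(entry)
    return to_delete, skipped

# Assumption for the example: resource1.json is the only Dataset member.
members = {("CorpHQ", "/corp/hq/resource1.json")}
print(plan_purge(SAMPLE, members))
print(plan_purge(SAMPLE, members, purge_dataset_members=True))
```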

Important notes about Datasets

This section describes the known by-design limitations of the Dataset feature in Data Catalog.

Resource
  • A resource cannot be a member of more than one Dataset at a time. Datasets can be created only for S3/HDFS data sources.
  • Data Catalog uses the schema of the member resource with the oldest timestamp in its metadata directory as the reference schema against which the other Dataset members are profiled. Added resources whose schema does not match the reference schema are not profiled as Dataset members, and an appropriate log message is displayed.
Virtual Folder
  • A dataset can have only one virtual folder.
  • By design, Dataset uniqueness is enforced across the whole catalog, not just within a source virtual folder. No two Datasets with the same name, ID, and schema version can exist, even when their source virtual folders are different.
  • Datasets can have only files as member resources. Data Catalog does not allow folders to be a part of the Dataset.
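
Two of the rules above, catalog-wide uniqueness of the name, ID, and schema version combination and single Dataset membership per resource, can be illustrated with a small Python model. This is a sketch for reasoning about the rules, not how Data Catalog stores Datasets. The DatasetRegistry class is hypothetical, and identifying a resource by its path alone is a simplification.

```python
class DatasetRegistry:
    """Toy model of the Dataset rules above (hypothetical, not a Data Catalog API)."""

    def __init__(self):
        self._datasets = {}     # (name, id, schema version) -> source virtual folder
        self._membership = {}   # resource path -> owning Dataset key

    def create(self, name, dataset_id, schema_version, virtual_folder):
        key = (name, dataset_id, schema_version)
        # Uniqueness holds across the catalog, regardless of the virtual folder.
        if key in self._datasets:
            raise ValueError(f"Dataset {key} already exists")
        self._datasets[key] = virtual_folder
        return key

    def add_resource(self, key, path):
        if key not in self._datasets:
            raise KeyError(f"Unknown Dataset {key}")
        owner = self._membership.get(path)
        # A resource cannot be a member of more than one Dataset at a time.
        if owner is not None and owner != key:
            raise ValueError(f"{path} already belongs to Dataset {owner}")
        self._membership[path] = key

registry = DatasetRegistry()
first = registry.create("myCountry_DS1", "DSET0001", 1, "VF_country_1")
registry.add_resource(first, "/countries/Asia/transactions_china.csv")

try:
    # Same name, ID, and schema version under a different virtual folder.
    registry.create("myCountry_DS1", "DSET0001", 1, "VF_country_2")
except ValueError as error:
    print(error)
```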
Security

Only users with Administrator privileges can create, modify, or delete a Dataset or define a Reported Schema for a Dataset.

Tagging

If a resource that is a part of a collection becomes a Dataset member, this resource cannot be tagged.