
Profiling Lumada Data Catalog Resources

With your data sources set up, it's time to start profiling them for cataloging. These instructions walk you through all the steps needed to set up your catalog.


Data Catalog's profiling engine reads the data sources and fills the catalog with metadata about the resources, such as databases, tables, folders, and files, in each data source.

Profiling is done in three distinct stages:

  • Format discovery

    Data Catalog identifies the format of data resources, marking the resources that can be further processed. Each resource is included in the catalog and is available for searching and tagging at the resource level. Format Discovery applies only to the HDFS/S3 type of data source and will be skipped for HIVE and JDBC data sources.

  • Schema discovery

    Data Catalog applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource's catalog entry. Data Catalog uses this information to determine if files are part of a larger data set - this is referred to as a "collection" in the catalog.

  • Profiling

    Data Catalog uses the schema information to collect statistics from each field in each resource. This information is included in each resource's catalog entry.

Each of the above processes - Format Discovery, Schema Discovery and Profiling - can be run independently, or they can be combined to run as a "process". Refer to Process Commands for details.

Starting with Waterline Data 4.3, we combine light and extended profiling into a single profile command, as shown in the following table:

Profiling Command | 4.3 (Old) | 4.4 (Old) | 5.0 (Old) | 2019.1 (Old) | 2019.2 (Old) | 2019.3 (New/Current)
Format Discovery | format | format | format | format | format | format
Schema Discovery | schema | schema | schema | schema | schema | schema
Profiling (one profile includes profile_ext + profileRegex) | profile | profile | profile | profile | profile | profile
Note: Profiling commands are no longer dependent on the data source type.

If you are upgrading from our previous versions (Waterline Data 4.0.x or Waterline Data 4.1.x), please note that the profiling commands are no longer dependent on the data source type being profiled. Starting with Waterline Data 4.2.x, the data source type sensing logic is incorporated into each single action-specific command.

Data Catalog commands are discussed in detail in the following sections.

Format Discovery

In Format Discovery, Data Catalog identifies the format of data resources, marking the resources that can be further processed. Format Discovery applies only to file system data sources such as HDFS or S3.

Usage: $ bin/waterline format -virtualFolder <Virtual Folder Name> [-path <path to resource>]

Example: $ bin/waterline format -virtualFolder MagnUX -path /user/sam/LnA_Data

Schema Discovery

In the Schema Discovery step, Data Catalog applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource's catalog entry.

Usage: $ bin/waterline schema -virtualFolder <Virtual Folder Name> [-path <path to resource>]

Sample: $ bin/waterline schema -virtualFolder SFAir -path /default.flightraffic -incremental false

Schema Discovery fetches field-level information in addition to the basic metadata information for a resource. Schema Discovery also identifies and labels any inferred collections discovered at this stage. Refer to Collections for more details.

Profiling

Usage:

$ bin/waterline profile -virtualFolder <Virtual Folder Name> \
                               [-path <path to resource>] \
                               [-incremental <true/false>] \
                               [-regex <true/false>] \
                               [-mode <fast/normal>] \
                               [-singleResource <true/false> [-skewedCollection <true/false>]] \
                               [-profileLastPartition <true/false>] 

Where:

Parameter | Required/Optional | Description
-virtualFolder | Required | Name of the virtual folder being profiled.
-path | Optional | Absolute path to a resource. When absent, all the resources in the virtual folder provided in the -virtualFolder parameter are profiled.
-incremental | Optional | The set of resources to profile. When true, the profile job runs against only the newly added or updated resources in the asset. When false, the profile job runs against all the resources, even the previously profiled ones.
-mode | Optional | Profiling mode; the default is normal, which profiles the entire asset. If you need to speed up the job for large resources, pass fast to enable sampling. Sampling is discussed in detail in Sampling Large Data Sources.
-singleResource | Optional | Typically used with a collection data unit; provides the ability to aggregate profile results to the collection root as opposed to individual collection members. If true, the collection is profiled as a single data resource. The default is false.
-skewedCollection | Optional | Applicable only to collection profiling when -singleResource is true. When true (the default), sampling is not applied to collections. If false, collections are profiled with sampling enabled.
-profileLastPartition | Optional | Used with partitioned HIVE tables only; provides the ability to profile just the last partition of the table.
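
For example, the following command sketch combines these options to incrementally profile a single HDFS path with sampling enabled; the virtual folder name SalesHDFS and the path are hypothetical placeholders:

$ bin/waterline profile -virtualFolder SalesHDFS \
                        -path /data/sales/2020 \
                        -incremental true \
                        -mode fast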

Regular expression evaluation happens in two different places.

  • During profiling

We evaluate regular expressions against ALL values as-is. The regular expression tag must exist prior to profiling.

  • During tag discovery

If a regular expression tag was created or changed after data profiling, we approximate regular expression evaluation during tag discovery using the top K values (default 2000, controlled by the waterlinedata.discovery.profile.tipping_point configuration parameter).

While evaluation during profiling is the preferred method, it can be very expensive because the data must be re-profiled with -incremental false to re-evaluate regular expressions, which can be very time consuming.
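
As a sketch, a full re-profile that re-evaluates regular expressions might look like the following; the virtual folder name Finance is a hypothetical placeholder:

$ bin/waterline profile -virtualFolder Finance -incremental false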

Evaluation during tag discovery is much faster but has accuracy problems because:

  • Evaluation is performed on the top K values only, so results may not be accurate.
  • The top K values stored internally are all lower-cased, so regular expressions should be case-insensitive.

Profiling HIVE tables with partitions

When profiling HIVE data sources with partitions, you can profile just the last partition of a table by setting -profileLastPartition to true as follows:

$ bin/waterline profile -virtualFolder <HIVE virtual folder name> -profileLastPartition true

In the case of huge partitioned tables, it may be desirable to sample the table to get a representative view of the data residing in all the partitions of the table. Applying sampling to a single partition will not provide a representative view of all the data.

To circumvent this limitation, you can use the -singleResource flag. With -singleResource set to true, partitioned HIVE tables are treated as non-partitioned tables and sampling can be applied to them as follows:

$ bin/waterline profile -virtualFolder <HIVE virtual folder name> \
                        -singleResource true \
                        -sample.splits <number of splits> \
                        -sample.rows <num rows to be sampled> 

The profiling results of the table will be applied to individual partitions thus marking them as profiled.

Warning: The -profileLastPartition option applies only to HIVE tables.

When both the -profileLastPartition and -singleResource options are passed for HIVE tables, -profileLastPartition takes precedence and the -singleResource option is ignored.

Profiling a Collection as a single resource

When the -singleResource flag is applied to Collections, the Data Catalog engine profiles the complete collection as a single resource, aggregating the profile results to the collection root as opposed to individual collection members.

$ bin/waterline profile -virtualFolder <HDFS/S3 virtual folder name> -path <absolute path to collection folder(discovered during schema discovery)> -singleResource true 

JDBC incremental profiling

Rows newly added to an already profiled JDBC resource can be incrementally profiled by setting an additional flag during profiling as follows:

$ bin/waterline profile -virtualFolder <name> [-path <path name>] -wdc_append_change_key <column name> 

Where:

-wdc_append_change_key is the name of a column holding a unique row identifier (typically with values in ascending order) maintained by some JDBC databases such as Oracle and MySQL.
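
For example, the following sketch incrementally profiles a hypothetical orders table using its ascending ORDER_ID column as the change key; the virtual folder name, path, and column name are placeholders:

$ bin/waterline profile -virtualFolder OracleSales -path /SALES.ORDERS -wdc_append_change_key ORDER_ID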

JDBC parallel profiling

This feature improves the performance of profile discovery on JDBC resources. While profiling JDBC resources, pass the additional flag -jdbcConcurrency <value> to enable parallel profiling.

$ bin/waterline profile -virtualFolder <name> [-path <path name>] -jdbcConcurrency <value> 

Where:

-jdbcConcurrency defines the level of parallelism desired when profiling the JDBC resource. For example, when -jdbcConcurrency is set to 2, Data Catalog makes two concurrent connections to the JDBC resource to profile it.
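
A corresponding sketch, using a hypothetical virtual folder name and two concurrent connections:

$ bin/waterline profile -virtualFolder OracleSales -jdbcConcurrency 2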

Process

The Waterline process command allows the user to trigger all three profiling stages - format (where applicable), schema, and profile - one after the other in that order using a single command.

The usage is as follows:

$ bin/waterline process -virtualFolder <name> [-path <path name>] [additional options] 
Parameter | Required/Optional | Definition
-virtualFolder | Required | The name of the virtual folder to be processed.
-path | Optional | The absolute path to the resource being processed.

For example:

$ bin/waterline process -virtualFolder PanAsia -incremental false 
Caution

For tables with names starting with special characters:

Table names starting with a $ character must be escaped with 3 backslash (\) characters.

For example, here is how the command containing the table name with a preceding $ special character must be entered:

$ bin/waterline process -virtualFolder SQLServerJDBC -path "/Person.CRONUS EXT International Ltd_\$Acc_ Schedule Name" 

Table names starting with other special characters such as (, {, }, and ) must be escaped with a single backslash (\) character.
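
As an illustration only, a hypothetical table whose name starts with an opening parenthesis might be entered as follows (the schema and table name are placeholders):

$ bin/waterline process -virtualFolder SQLServerJDBC -path "/dbo.\(Staging) Customer"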

Sampling large data sources

The sampling implementation targets resources that are very large in size. Because profiling time is linear in the row count, profiling very large resources can be a time-intensive task. The implementation method varies between Hive/HDFS and JDBC data sources.

Note: Data sampling offers a snapshot of the actual data; tag propagation on sampled data may therefore produce acceptable results for drawing inferences. However, running tag propagation on the entire data set should produce better results.

Sampling Hive and HDFS data sources

For Hive and HDFS data sources, sampling should be applied only to large resources. A resource is considered large if it exceeds a certain number of partitions, which is determined by the following command-line parameter:

-sample.splits <number of splits>

For large clusters, it can be tricky to specify the number of splits over which sampling should be performed. If this number is larger than the actual number of partitions for a resource, the entire resource gets profiled rendering sampling functionality futile. If this number is too small, a resource may be sampled without any performance benefits. For this reason, Data Catalog suggests the following guideline formula to arrive at an optimum value for the sample.splits parameter.

Say your data cluster has Nn nodes and each node has Ex executors, then with a scale factor Sf, the sample.splits can be calculated as:

sample.splits = Nn * Ex * Sf

The scale factor is an integer number that determines the trade-off between processing speed and sampling accuracy.

With Sf = 1, the data is essentially sampled in parallel and the processing time is considerably lower. As the Sf value increases, processing time increases linearly but the accuracy of sampling improves. The higher the value of Sf for a given number of nodes Nn and executors Ex, the closer sample.splits approaches the actual number of splits for a resource, implying that the entire resource is profiled.

Arriving at the optimum number for sample.splits

Assume that a very large data resource is spread out across the cluster and has S splits, more than the number of nodes (S > Nn). If Sf = 1, all processing is ideally done in parallel and usually does not take a lot of time, so Sf = 1 does not provide substantial benefits because all processing is already distributed and run in parallel.

From Sf = 2 onward, we start to see performance benefits from sampling.

The compromise here is: as Sf increases, the processing time increases linearly for a file with partition count S > Nn*Ex*Sf, but sampling is applied to fewer resources (only those with more than Nn*Ex*Sf partitions).

If Sf is too small, sampling is applied to many more resources without actual performance benefits.

Considering the goal - reducing the processing time for very large data resources (with some sacrifice in accuracy) while fully profiling the rest of the data resources - the recommendation for large clusters is Sf = 4.

For instance, suppose a user runs profiling for a very large resource:

S = 10000 splits (typical actual size: 128 MB * 10000 = 1250 GB)
Nn = 50 nodes
Ex = 4 executors per node
T = 60 sec (processing one split takes 60 seconds)
Nn*Ex splits are processed in parallel: 50*4 = 200

The total processing time without sampling would be (10000/(50*4)) * 60 = 3000 sec = 50 min.

If sampling is defined with Sf = 4, which translates to -sample.splits = 50*4*4 = 800, the processing time goes down to (800/(50*4)) * 60 = 240 sec = 4 min. Approximately at least 128 MB * 800 = 100 GB of data will be sampled (depending also on the sample.rows parameter value).

It also means that for Sf = 4 (sample.splits = 800), data resources that are approximately smaller than 100 GB will be fully profiled.

If a data resource qualifies for sampling (the specified sample.splits is less than the number of partitions), then the profiler uses only the first sample.rows rows for profiling.

While sample.splits indicates an intent to sample and provides a baseline for how many splits are included in sampling, Data Catalog also requires sample.rows, which limits the number of rows that will be sampled:

-sample.rows <number of rows to sample>
Caution: To enable sampling at profile time, you must pass the -mode flag set to fast.

For a given sample.splits value, the minimum size of data that will be sampled is determined by sample.splits * partition size, where the partition (or split) size is an internal parameter determined or configured by the HDFS system. If a resource is smaller than this value, it is not sampled but fully profiled. Any resource larger than this minimum size is sampled for the sample.rows number of rows.

To sample large HDFS and Hive data sources, the best practice in Data Catalog is to define both command-line parameters: -sample.splits and -sample.rows.

Examples:

$ bin/waterline profile -virtualFolder <HDFS VF> -path <path to HDFS VF> -mode fast -sample.splits <number of splits [Nn*Ex*Sf]> -sample.rows <rows to be sampled> 
$ bin/waterline profile -virtualFolder <HIVE VF> -path /<path to HIVE VF> -mode fast -sample.splits <number of splits [Nn*Ex*Sf]> -sample.rows <rows to be sampled> 
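
Tying this to the worked example above, a concrete sketch might look like the following; the virtual folder name and path are hypothetical, while 800 comes from the Sf = 4 calculation and 500000 matches the default sample.rows value:

$ bin/waterline profile -virtualFolder LakeHDFS -path /data/clickstream -mode fast -sample.splits 800 -sample.rows 500000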

Sampling JDBC data sources

If you want to sample JDBC data sources, use only one command line parameter:

-sample.rows <number of rows to sample>

Data Catalog samples the first sample.rows rows.

Note: JDBC does not guarantee the order of the incoming rows.
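
A minimal sketch, assuming a hypothetical JDBC virtual folder and a sample size of 100000 rows:

$ bin/waterline profile -virtualFolder OracleSales -mode fast -sample.rows 100000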

Sampling configurations

Sampling is controlled by two configuration parameters in configuration.json:

  • waterlinedata.profile.sampling.splits_per_node

    Average number of splits per node for sampling.

    In the 'fast' profiling mode, Data Catalog starts to sample Hive/HDFS resources if the average number of splits processed on one node (total split count / partition count) is greater than this value.

    For partition count calculations, see the description of the waterlinedata.profile.partition.count configuration parameter below. If the partition count is calculated as 0, Data Catalog uses 2 for the calculation. Increase this value if you are satisfied with profiling performance but want fewer resources sampled.

    defaultValue: 2

  • waterlinedata.profile.sampling.rows

    Number of rows to sample.

    In the 'fast' profiling mode, Data Catalog always samples this number of rows for JDBC resources.

    For HIVE and HDFS resources, this number of rows is sampled if the resource qualifies for sampling. For the qualification criteria, see the description of the waterlinedata.profile.sampling.splits_per_node configuration parameter.

    defaultValue: 500000

Related configuration parameter:

  • waterlinedata.profile.partition.count

    Partition count.

    The profiler will coalesce or parallelize partitions to this value as follows:

    • If the executor count is defined, it is used to set this parameter.

    ELSE

    • If spark.dynamicAllocation.enabled is set to false, this value is used, if not 0.

    ELSE

    • This value is set to the cluster node count.
    • In all other cases, the default system behavior for coalesce and parallelize is used.

    defaultValue: 0
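
For reference, here is a minimal sketch of how these entries might appear in configuration.json; the surrounding structure (each parameter as an object with a value field) is an assumption based on the parameter names and defaults documented above:

{
  "waterlinedata.profile.sampling.splits_per_node": { "value": 2 },
  "waterlinedata.profile.sampling.rows": { "value": 500000 },
  "waterlinedata.profile.partition.count": { "value": 0 }
}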

Profiling delimited data

Data Catalog reads delimited data files to determine the file type and the delimiter character, and uses the delimiter to determine the schema for the file. The result is categorized as a CSV file in the catalog. A small illustrative sample follows the list below.

Here are some attributes of delimited data that Data Catalog can profile successfully:

  • Separators (delimiters) can be single characters; letters and numbers are excluded from consideration. Comma, colon, tab, space, and pipe are all supported.
  • Recognized data types are string, int, float, date, and datetimestamp, corresponding to the defaults for the Spark configuration.
  • Null values in fields are recognized.
  • Compressed files are recognized when the compression file contains a single file.
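
As referenced above, here is a small hypothetical pipe-delimited file of the kind Data Catalog can profile, with int, string, float, and date fields and an empty (null) value in the last row:

id|name|amount|order_date
1|Acme Corp|125.50|2020-03-01
2|Globex|89.99|2020-03-02
3||42.00|2020-03-03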

Purge Resources

The purge job essentially flushes the resources marked for deletion from the Data Catalog repository.

Sometimes after Data Catalog has processed and fingerprinted a resource, this resource itself may be deleted from the data lake. Data Catalog marks such resources for deletion only during the next processing cycle on the virtual folder that contained the deleted resource.

While this action marks the absent resource as unavailable for any further metadata curation, the Data Catalog repositories retain these entries, which can then show up in search queries or listings.

The Purge job flushes all resources marked for deletion from the repositories to reflect the absence of the deleted resources from the data lake.

The Purge job can be run from the user interface. The Purge job appears as one of the job sequence options in the Sequence workflow for Run job now action in the UI.

Note: Remember to run the Purge job against the virtual folder, not on individual resources.

The CLI script <WLD Agent>/bin/waterline allows you to provide a list of resources to be deleted in CSV file format. The syntax for purgeDataResources is as follows:

<WLD Agent>$ bin/waterline purgeDataResources -virtualFolder <Virtual Folder Name> \
                           -path <absolute path to the resource to be purged>

Example:

<WLD Agent>$ bin/waterline purgeDataResources -virtualFolder IND931 -path /AB_Test_Files/demoData/Pass1
Caution
  • The Purge job flushes only the resources that have been marked for deletion in the Data Catalog Engine.
  • Resources deleted from the data lake are only marked for deletion on (at the earliest) the next format (HDFS) or schema (HIVE/JDBC) job.

Unlike other Data Catalog job logs, the logs for the Purge job are stored under the <Logs Location>/backgroundJobs path as <virtual_folder_id>.log:

$ ll /var/log/waterlinedata/backgroundJobs/
total 56
drwxr-xr-x 2 waterlinesvc waterlinesvc 174 Mar 31 00:05 2020-03
-rw-r--r-- 1 waterlinesvc waterlinesvc 39346 Mar 31 18:00 ${ctx:logFilename}.log 
-rw-r--r-- 1 waterlinesvc waterlinesvc 1700 Mar 29 22:43 ra6e4c504062c4411a.log 
-rw-r--r-- 1 waterlinesvc waterlinesvc 1700 Mar 30 05:34 ra9ed950cac1974d9f.log 
-rw-r--r-- 1 waterlinesvc waterlinesvc 1698 Mar 30 05:34 rac64a09531764a54.log 
-rw-r--r-- 1 waterlinesvc waterlinesvc 1699 Mar 29 23:44 rad3f956a82cc54841.log

Job Concurrency: Profile multiple jobs in parallel

Data Catalog now supports running profiling jobs concurrently for multiple virtual folders with non-overlapping paths. This employs ZooKeeper functionality to acquire read/write locks on the source paths in each virtual folder's path specification after validating path-overlap conditions.

Before running jobs concurrently, make sure you have enabled the following properties in the <Agent Location>/conf/configuration.json file; a sketch of these entries follows the list:

  • waterlinedata.job.locking.enabled: set the value parameter to true.
  • waterlinedata.job.locking.zookeeper.znode: set the value parameter to the ZooKeeper node, for example ip-172-31-29-206:9983.
  • waterlinedata.metadata.solr.usezk: set the value parameter to true.
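
A minimal sketch of the corresponding configuration.json entries, assuming each property is an object with a value field; the ZooKeeper host and port are the example value from the list above:

{
  "waterlinedata.job.locking.enabled": { "value": true },
  "waterlinedata.job.locking.zookeeper.znode": { "value": "ip-172-31-29-206:9983" },
  "waterlinedata.metadata.solr.usezk": { "value": true }
}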