Data profiling and data identification

After you select the data you want to analyze, you must profile it and run the data identification process to tag the data. See the topics below and the User Guide for more information on profiling and data identification.

Data profiling

The data profiling process generates the statistical and intermediate data required by other analytic processes. This intermediate data is consumed by downstream processes such as data flow and foreign key detection.

The intermediate data generated for each column includes the following (see the sketch after this list):

  • Roaring Bitset

    A bitmap of the hash values for all entries in the column.

  • HyperLogLog (HLL)

    Provides an estimate of the cardinality of the data, with a margin of error of roughly 2%.

  • Data Pattern Analysis

    Performs a rudimentary data pattern analysis using dimensional reduction, tracking the most frequently occurring patterns.

  • Data Quality Pre-Analysis

    Using the Data Pattern Analysis results, Data Catalog performs a statistical estimation of the data quality. The estimate is summarized as an overall percentage and as a heat map for each data pattern. Additionally, Data Catalog makes RegEx recommendations for the most probable matches.

  • Statistics

    Data Catalog gathers the following statistics when examining all the data:

    • Minimum and Maximum values (for numeric columns)
    • Widest and Narrowest (non-null) string widths
    • Null count
    • Total row count
  • Data Sampling

    Data Catalog takes a controlled sampling of the data so that the samples are consistently chosen across different columns.
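
The following is a minimal sketch of how this per-column intermediate data might be assembled. It is an illustration only, not the Data Catalog implementation: it assumes the open-source pyroaring and datasketch Python packages for the Roaring bitmap and HLL, and the profile_column function name is hypothetical.

```python
# Minimal per-column profiling sketch (illustrative, not Data Catalog's code).
# Assumes: pip install pyroaring datasketch
import zlib

from datasketch import HyperLogLog  # probabilistic cardinality estimator
from pyroaring import BitMap        # compressed Roaring bitmap of 32-bit integers

def profile_column(values):
    """Build the intermediate data described above for one column."""
    bitset = BitMap()            # bitmap of hash values for all entries
    hll = HyperLogLog(p=12)      # p=12 gives roughly 2% relative error
    null_count = 0
    non_null = []
    for v in values:
        if v is None:
            null_count += 1
            continue
        encoded = str(v).encode("utf-8")
        bitset.add(zlib.crc32(encoded))  # 32-bit hash added to the bitset
        hll.update(encoded)
        non_null.append(v)
    widths = [len(str(v)) for v in non_null]
    return {
        "bitset": bitset,  # consumed later, e.g., by foreign key detection
        "cardinality_estimate": hll.count(),
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
        "narrowest": min(widths, default=None),
        "widest": max(widths, default=None),
        "null_count": null_count,
        "total_rows": len(values),
    }
```

For example, profile_column([100, 250, None, 975, 250]) would report a cardinality estimate near 3, one null, and five total rows.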

Data identification

The data identification process uses dictionaries and data pattern analysis to automatically classify data, applying tags defined in dictionary and pattern configuration files. In addition to the dictionaries and patterns included with Data Catalog, you can create your own dictionaries and pattern analysis configuration files to better suit your organization's needs.

You can use the data patterns to help identify (or to disqualify) data of a certain type.
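
As a rough illustration of the dictionary side of this process, the sketch below tags a column when enough of its sampled values appear in a dictionary's term set. The dictionary format, the match threshold, and the identify_column name are assumptions made for this example; they are not Data Catalog's actual dictionary or configuration format.

```python
# Illustrative dictionary-based tagging (not Data Catalog's implementation).
def identify_column(sample_values, dictionaries, threshold=0.6):
    """Return tags whose dictionaries match at least `threshold` of the sample."""
    sample = [str(v).strip().lower() for v in sample_values if v is not None]
    tags = []
    for tag, terms in dictionaries.items():
        hits = sum(1 for v in sample if v in terms)
        if sample and hits / len(sample) >= threshold:
            tags.append(tag)
    return tags

# A column whose sampled values are mostly known country names gets tagged.
dictionaries = {"country": {"france", "japan", "brazil", "canada"}}
print(identify_column(["Japan", "Brazil", "Peru"], dictionaries))  # ['country']
```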

Data pattern analysis

Fundamental to data quality analysis is either using a regular expression to check the data or statistically analyzing the data itself to find patterns and outlier patterns (which can indicate bad data).

The data identification process generates roughly the 20 most common patterns that capture the characteristics of the data. You can then use these patterns, along with their statistical frequency and supplementary information, to generate regular expression (RegEx) recommendations for your data. You can tune the RegEx to meet your specific needs, or select the valid patterns so that subsequent data quality checks identify any data entries that fall outside the accepted patterns.
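
To make the idea concrete, the sketch below reduces each value to a coarse pattern (letters to "A", digits to "9"), keeps the most frequent patterns with their frequencies, and translates each pattern into a RegEx recommendation. The symbol choices and helper names are illustrative assumptions; Data Catalog's actual reduction rules and recommendations may differ.

```python
# Illustrative pattern reduction and RegEx recommendation (not the product's exact rules).
import re
from collections import Counter

def reduce_to_pattern(value):
    """Collapse a value to a coarse pattern: letters -> 'A', digits -> '9'."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", str(value)))

def top_patterns(values, n=20):
    """Return the n most common patterns and their relative frequencies."""
    counts = Counter(reduce_to_pattern(v) for v in values if v is not None)
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common(n)]

def pattern_to_regex(pattern):
    """Translate a reduced pattern into a RegEx recommendation."""
    classes = {"A": "[A-Za-z]", "9": "[0-9]"}
    return "^" + "".join(classes.get(ch, re.escape(ch)) for ch in pattern) + "$"

# Prints each pattern, its frequency, and the recommended RegEx;
# the lone "bad row" entry surfaces as an outlier pattern.
values = ["AB-1234", "CD-5678", "EF-9012", "bad row"]
for pattern, freq in top_patterns(values):
    print(pattern, f"{freq:.0%}", pattern_to_regex(pattern))
```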

After running data identification, you can use the Galaxy View feature to visualize the data tagging, identify the data flow, locate your data, and view the sensitivity and security of that data.