Hitachi Vantara Lumada and Pentaho Documentation

Manage data identification methods

Pentaho Data Catalog provides a built-in list of data identification methods called dictionaries and data patterns. Dictionaries are lists of words used to create bitsets, HyperLogLogs (HLLs), and data patterns, which Data Catalog uses to match column data through bitset matching. Patterns define the data pattern, regular expression, column alias, and tags used to identify a data column. You can also import custom dictionary and data pattern configuration files that better suit your organization's specific needs.

View dictionaries and data patterns

In Data Catalog, perform the following steps to view the data identification methods (dictionaries and data patterns) available for use.

Procedure

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.
  2. Click View Methods.

    You can view the list of dictionaries under the Dictionaries tab and the list of data patterns under the Patterns tab.
  3. Locate the dictionary or data pattern you want to view in the table and select the View Details button (>) in its row.

    For Dictionaries, to view the JSON file details, click the Rules tab.

    The Rules tab shows the logic the dictionary uses to apply the tags listed in the JSON file, such as its conditions and confidence scores. Based on these factors, you can apply dictionaries or patterns to a data set.

    For example, in the following JSON file, the dictionary rule specifies that the type is "Dictionary". The confidence score is the weighted sum of "similarity" (weight 0.9) and "metadataScore" (weight 0.1). The condition is met when the confidence score is greater than or equal to 0.7 and the column cardinality is greater than or equal to 1. When the condition is met, the action applies the tag "General" to the dataset. The rule logic therefore determines which tags are applied to a dataset and under what criteria.

    [ 
        { 
            "__typename": "dictionariesRules", 
            "type": "Dictionary", 
            "minSamples": 200, 
            "confidenceScore": { 
                "+": [ 
                    { 
                        "*": [ 
                            { 
                                "var": "similarity" 
                            }, 
                            0.9 
                        ] 
                    }, 
                    { 
                        "*": [ 
                            { 
                                "var": "metadataScore" 
                            }, 
                            0.1 
                        ] 
                    } 
                ] 
            }, 
            "condition": { 
                "and": [ 
                    { 
                        ">=": [ 
                            { 
                                "var": "confidenceScore" 
                            }, 
                            "0.7" 
                        ] 
                    }, 
                    { 
                        ">=": [ 
                            { 
                                "var": "columnCardinality" 
                            }, 
                            "1" 
                        ] 
                    } 
                ] 
            }, 
            "actions": [ 
                { 
                    "applyTags": [ 
                        { 
                            "k": "General" 
                        } 
                    ] 
                } 
            ] 
        } 
    ] 
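The rule above resembles a JsonLogic-style expression tree, so its arithmetic can be traced by hand or with a few lines of code. The following Python sketch is only an illustration of how such a rule might be evaluated; it is not Data Catalog's implementation. The `evaluate` and `tags_for_column` helpers are hypothetical names, the string thresholds ("0.7" and "1") are assumed to be compared numerically, and "minSamples" is left out because its handling is not described here.

```python
from functools import reduce
import operator


def evaluate(expr, data):
    """Evaluate a small JsonLogic-style expression against a dict of variables."""
    if not isinstance(expr, dict):
        return expr  # a literal such as 0.9 or "0.7"
    (op, args), = expr.items()
    if op == "var":
        return data[args]
    values = [evaluate(arg, data) for arg in args]
    if op == "+":
        return sum(float(v) for v in values)
    if op == "*":
        return reduce(operator.mul, (float(v) for v in values), 1.0)
    if op == ">=":
        # Assumption: string thresholds are compared as numbers.
        return float(values[0]) >= float(values[1])
    if op == "and":
        return all(values)
    raise ValueError(f"unsupported operator: {op}")


# The relevant parts of the sample rule, copied from the JSON above.
RULE = {
    "confidenceScore": {
        "+": [
            {"*": [{"var": "similarity"}, 0.9]},
            {"*": [{"var": "metadataScore"}, 0.1]},
        ]
    },
    "condition": {
        "and": [
            {">=": [{"var": "confidenceScore"}, "0.7"]},
            {">=": [{"var": "columnCardinality"}, "1"]},
        ]
    },
    "actions": [{"applyTags": [{"k": "General"}]}],
}


def tags_for_column(rule, similarity, metadata_score, column_cardinality):
    """Return the tag names the rule would apply, or an empty list."""
    data = {
        "similarity": similarity,
        "metadataScore": metadata_score,
        "columnCardinality": column_cardinality,
    }
    # The confidence score is computed first, then made available to the condition.
    data["confidenceScore"] = evaluate(rule["confidenceScore"], data)
    if not evaluate(rule["condition"], data):
        return []
    return [
        tag["k"]
        for action in rule["actions"]
        for tag in action.get("applyTags", [])
    ]
```

With a similarity of 0.8 and a metadata score of 0.5, the confidence score is 0.8 × 0.9 + 0.5 × 0.1 = 0.77, which clears the 0.7 threshold, so the sketch returns the "General" tag as long as the column cardinality is at least 1.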

Import dictionaries and data patterns

Perform the following steps to import custom dictionaries and data patterns.

Procedure

  1. Click Management in the left navigation menu.

    The Manage Your Environment page opens.
  2. Click Dictionaries to upload dictionaries or click Patterns to upload data patterns.

  3. Click Import.

  4. To represent structure and metadata in detail, do the following:

    1. For Dictionaries, import the JSON and CSV files together in a compressed (.zip) file.

      Sample JSON and CSV files:

      Dictionaries JSON file
      Dictionaries CSV file

    2. For Patterns, import a JSON file.

      A sample JSON file:

      Data patterns JSON file
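Because a dictionary import expects the JSON and CSV files in a single compressed (.zip) file, the packaging step can be scripted. The following Python sketch uses the standard `zipfile` module. The `bundle_dictionary` helper and any file names passed to it are hypothetical, and placing both files at the root of the archive is an assumption, not a documented requirement.

```python
import zipfile
from pathlib import Path


def bundle_dictionary(json_path, csv_path, zip_path):
    """Write a dictionary's JSON and CSV files into one .zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in (Path(json_path), Path(csv_path)):
            # Assumption: files sit at the archive root, without parent folders.
            archive.write(path, arcname=path.name)
    return Path(zip_path)
```

For example, `bundle_dictionary("my_dictionary.json", "my_dictionary.csv", "my_dictionary.zip")` would produce an archive that can then be selected after clicking Import.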