Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog assets and features

The following topics can help you understand the assets and features of Pentaho Data Catalog.

Data Catalog assets

Pentaho Data Catalog provides data management and data representation with its own logical data entities, including the following assets:

Data sets

Data sets are named logical groupings of related data objects. They are primarily used to organize data objects together visually, but you can also run a process on the data set.

While in the Data Canvas tree view, you can create a data set, or collection, of data objects by selecting the check box next to those objects you want to organize into a single data set. You can choose to display only user-defined data sets in the tree view, greatly simplifying the navigation of these objects.

Workers

Data Catalog uses Open AI-architected digital worker processes that connect to data sources and automatically perform the following tasks:

  • Connection testing
  • Metadata ingestion
  • Data profiling
  • Data identification
  • Key discovery
  • Data quality
  • Sensitive data discovery

Workers are the background processes that launch whenever an activity is initiated, whether manually or on a schedule. You can view and manage workers, cancel worker processes, and review each worker's progress along with any details or exceptions relating to that process.

Dictionaries and data patterns

Data identification uses two data discovery methods: dictionaries and data pattern analysis. Data Catalog installs a set of pre-configured dictionaries and patterns, and you can define custom dictionaries and patterns if your requirements call for them.

  • Dictionaries

    Dictionaries are word or term lists used to create bitsets and data patterns that you can then use to match column data.

  • Data patterns

    A pattern analysis (or data patterns) document defines the data pattern, regular expression, column aliases, and tags that you can use to identify a column of data. You can use data patterns for a variety of purposes, such as regular expression (RegEx) generation, data identification, and data quality checking.

  • Policies

    Together, the dictionaries and data patterns are referred to as data identification policies. Data Catalog includes many policies covering categories from a wide range of business sectors, such as Finance, Education, Aviation, Law Enforcement, PCI-DSS, and Data Privacy. Data Catalog provides these policies for you to use and build on.
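
The two discovery methods described above can be illustrated with a minimal Python sketch. It is not Data Catalog's implementation or API. The term list, the regular expression, the tag names, the function names, and the 0.8 threshold are all hypothetical and chosen only for illustration. The sketch tags a column when a large enough share of its values appears in a dictionary or fully matches a pattern.

```python
import re

# Hypothetical dictionary: a short term list of country codes (illustrative only).
COUNTRY_CODES = {"US", "DE", "FR", "JP", "IN"}

# Hypothetical data pattern: a regular expression for US-style ZIP codes.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def dictionary_match_ratio(values, terms):
    """Return the fraction of non-empty values found in the term list."""
    cleaned = [v.strip().upper() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    return sum(v in terms for v in cleaned) / len(cleaned)


def pattern_match_ratio(values, pattern):
    """Return the fraction of non-empty values that fully match the pattern."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    return sum(pattern.fullmatch(v) is not None for v in cleaned) / len(cleaned)


def identify_column(values, threshold=0.8):
    """Suggest tags for a column whose match ratio meets the (assumed) threshold."""
    tags = []
    if dictionary_match_ratio(values, COUNTRY_CODES) >= threshold:
        tags.append("country_code")
    if pattern_match_ratio(values, ZIP_PATTERN) >= threshold:
        tags.append("zip_code")
    return tags
```

For example, `identify_column(["12345", "98765-4321"])` suggests the `zip_code` tag, while a column in which only two of three values match falls below the assumed threshold and receives no tag.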

Glossary

The business glossary is an organized list of business terms and their definitions, intended to serve as the single, definitive reference for an organization. You can associate business terms with data elements, business rules, related terms, and custom attributes to form a comprehensive view of your organization’s business concepts and data landscape.

Business rules

Business rules translate business requirements into logic-based rules you can use to tag your data. You can define business rules to manage your data and track its quality by designating whether or not that data is compliant.

You can use the Business Glossary page to define which data and data formats are compliant and which are non-compliant.

Based on these definitions, business rules apply SQL commands (called data quality rules) that identify non-compliant rows in your data. You can add any number of data quality rules to a business rule.

The business rules act as a hierarchical layer above the data quality rules:

  • You can choose whether or not to enable data quality rules.
  • You can also decide if a rule requires supervisor approval before being deployed.
  • Use custom tags to track and group business rules according to your needs.
  • For rules of the data quality type, you can also specify one of the seven standard dimensions of data quality.
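
As a rough illustration of a data quality rule that identifies non-compliant rows with SQL, the following Python sketch runs a query against an in-memory SQLite table. SQLite stands in for a real data source. The `customers` table, the sample rows, the email check, and the function names are hypothetical and are not taken from Data Catalog.

```python
import sqlite3

# Hypothetical data quality rule: a SQL query that returns non-compliant rows.
# Here a row is non-compliant when its email is missing or has no "@"
# with at least one character on each side.
NON_COMPLIANT_EMAIL_RULE = """
    SELECT id, email
    FROM customers
    WHERE email IS NULL OR email NOT LIKE '%_@_%'
    ORDER BY id
"""


def find_non_compliant_rows(conn, rule_sql):
    """Run one data quality rule and return the rows it flags."""
    return conn.execute(rule_sql).fetchall()


def build_sample_db():
    """Create an in-memory table holding made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, email) VALUES (?, ?)",
        [
            (1, "ana@example.com"),
            (2, "not-an-email"),
            (3, None),
            (4, "li@example.org"),
        ],
    )
    return conn
```

Calling `find_non_compliant_rows(build_sample_db(), NON_COMPLIANT_EMAIL_RULE)` returns the rows with `id` 2 and 3. A business rule that groups several such queries would collect the flagged rows from each one.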

Dashboards

Dashboards extend the visual discovery and relationship discovery capabilities of Data Catalog in several ways. They also provide a way to add your own customized insight assets, unique to your organization.

Data Catalog includes the following standard dashboards. See the Product Overview for more information.

  • Data Element Search
  • Data Identification Detail
  • Data Inventory Summary All
  • Data Inventory Summary
  • Docker Host Admin
  • Data Quality Assessment
  • Sensitive Data Discovery

Pentaho Data Storage Optimizer

If you have Data Catalog version 10.0.1, you can enable Pentaho Data Storage Optimizer.

Data Storage Optimizer is an intelligent data storage tiering solution that reduces operating costs and gives you seamless access to Hadoop data through S3-compatible object storage, such as Hitachi Content Platform.

For more information on Data Storage Optimizer, see https://help.hitachivantara.com/Documentation/Pentaho/Data_Storage_Optimizer/10.0.

To install Data Storage Optimizer, see https://help.hitachivantara.com/Documentation/Pentaho/Data_Catalog/10.0/Install/Install_Pentaho_Data_Catalog#Installing_Data_Storage_Optimizer_into_a_Data_Catalog_deployment.

You can install Data Storage Optimizer when you install Data Catalog or, if Data Catalog 10.0.1 is already installed, enable Data Storage Optimizer afterward.