Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog user features

This article describes the Pentaho Data Catalog user interface and the tasks non-admin users can perform. Before proceeding, make sure that a Data Catalog service user or administrator has set up your catalog for you. Data Catalog builds a complete inventory of data assets in a data warehouse, automatically and securely. It provides:

  • exact data discovery and faster delivery to an authenticated user.
  • better understanding of data quality at a glance.
  • an inventory of all assets for efficient data repository governance.

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository, and making the data repository available to those tools for further processing.

Data self-service

Pentaho Data Catalog provides a rich interface for data self-service to help you find the best instances of the integrated data you are looking for. These services include:

  • Role-based access control

    User Access Administrators can grant a role permission to view the metadata for files and fields without granting permission to view the data itself. Metadata-only access is useful because data analysts often do not know whether they need access to a file until they can inspect its metadata.

  • Business Terms

    You can discover metadata about files and fields and have Data Catalog associate fields with your organization's business terms. You can associate business terms with data elements, business rules, related terms, and custom properties to form a comprehensive view of the organization’s business concepts and data landscape.

  • Data quality metrics and statistics

    Data Catalog's profiling processes produce and present detailed data quality metrics and statistics that help you decide if the data is useful, valid, and complete, without having to write code to graph each field in the file.

  • User roles

    User roles in Data Catalog are used for access control and permissions management. They help control who can view, edit, or delete data assets and ensure data security. To see the roles available in Data Catalog, go to Management and click Roles under the Users tile. The list of available user roles is displayed.

  • Table creation

    When you find a file you are interested in, you can create a table for that file. This lets non-technical users find the files that contain the data they need, create a table, and visualize the data with a tool like Grafana without additional effort.
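
The role-based access control item above separates permission to view metadata from permission to view data. The following is a minimal sketch of that idea in Python. It is not Data Catalog's API: the `Role` class, the permission names `view_metadata` and `view_data`, and the `describe_file` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical permission names; Data Catalog's real permission model may differ.
VIEW_METADATA = "view_metadata"
VIEW_DATA = "view_data"


@dataclass(frozen=True)
class Role:
    name: str
    permissions: frozenset


def describe_file(role, file_entry):
    """Return only the parts of a catalog entry that the role may see."""
    visible = {}
    if VIEW_METADATA in role.permissions:
        visible["metadata"] = file_entry["metadata"]
    if VIEW_DATA in role.permissions:
        visible["rows"] = file_entry["rows"]
    return visible


# An analyst role that can browse metadata but cannot read the data itself.
analyst = Role("analyst", frozenset({VIEW_METADATA}))
# A steward role that can see both metadata and data.
steward = Role("steward", frozenset({VIEW_METADATA, VIEW_DATA}))

entry = {
    "metadata": {"path": "/sales/orders.csv", "fields": ["order_id", "amount"]},
    "rows": [{"order_id": 1, "amount": 19.99}],
}

analyst_view = describe_file(analyst, entry)
steward_view = describe_file(steward, entry)
print(sorted(analyst_view))
print(sorted(steward_view))
```

With a metadata-only role, an analyst can still see which fields a file contains and decide whether to request access to the data.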

Data inventory

Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating terms created by users. Data Catalog identifies the formats of the resources and profiles their contents, creating an inventory of data assets in the data warehouse securely.

  • Profiling

    Most of the data-curating process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.

    Data profiling is the process in which Data Catalog examines file data and gathers statistics about the data. It profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics and data statistics. The resulting inventory includes rich metadata for delimited files, JSON files, and Parquet files, as well as files compressed with supported compression algorithms such as gzip.

  • Sensitive data discovery

    Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. Data Catalog’s algorithms identify sensitive data throughout the data clusters as a part of profiling with minimal additional overhead. Identification is the first step, and often the hardest step, in the process of protecting sensitive data. You cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.
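
As an illustration of how sensitive data might be identified during profiling, here is a minimal sketch in Python that flags a column when most of its values match a known pattern. This is not Data Catalog's algorithm. The pattern set, the `detect_sensitive` function, and the 0.75 threshold are assumptions made for this example, and real detection requires far more robust rules.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_sensitive(values, threshold=0.75):
    """Return labels whose pattern matches at least `threshold` of the non-empty values."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


# Made-up sample columns; None represents a missing value.
columns = {
    "contact": ["a@example.com", "b@example.org", None, "c@example.net"],
    "tax_id": ["123-45-6789", "987-65-4321", "n/a", "111-22-3333", "222-33-4444"],
    "city": ["Berlin", "Paris", "Austin"],
}

flagged = {name: detect_sensitive(vals) for name, vals in columns.items()}
for name, labels in flagged.items():
    print(name, labels)
```

Using a match-ratio threshold instead of requiring every value to match tolerates a few malformed entries, such as the `n/a` placeholder in the sample `tax_id` column.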

Data quality

You can discover data quality metrics automatically using large-scale profiling, such as the number of nulls in a data column or its cardinality. For example, you can compare the number of values that should be in a field with the number of values actually found during profiling.
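
To make the metrics above concrete, the following Python sketch computes the null count, cardinality, and completeness of a single column. It only illustrates what these terms mean and is not how Data Catalog computes them. The `profile_column` function and its rule for what counts as a null (`None` or a blank string) are assumptions for this example.

```python
def profile_column(values):
    """Compute simple data quality metrics for one column of values."""
    total = len(values)
    # Assumption: None and empty or whitespace-only strings count as nulls.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    return {
        "total": total,
        "nulls": total - len(non_null),
        # Cardinality is the number of distinct non-null values.
        "cardinality": len(set(non_null)),
        # Completeness is the share of expected values that are actually present.
        "completeness": (len(non_null) / total) if total else 0.0,
    }


# Made-up sample column with one None and one empty string.
sample = ["US", "DE", None, "US", "", "FR"]
metrics = profile_column(sample)
print(metrics)
```

Here the column is expected to hold six values but only four are present, which is the kind of expected-versus-actual comparison described above.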

Note: Data Catalog writes profiling process notifications to the log files.

Data governance

Pentaho Data Catalog provides data governance by securing access to the data, managing metadata creation, enrichment, and approval, and linking physical data to business-related terminology.

Securing access to data

In Pentaho Data Catalog, you can protect resources with secured access using glossaries. A glossary is a logical grouping of business terms that you can assign to a specific project user group. Once you have set up roles for glossaries, you can use them to limit access to data via specific roles and users.

Managing metadata creation, enrichment, and approval

In the Data Catalog user interface, you can create new metadata by adding business terms, term hierarchies, synonyms, and custom properties. Most user roles are permitted to add new terms while browsing content. Only administrators can define new custom properties. Users with the corresponding permissions can read or set properties.

You can change glossary items such as domains, categories, and terms using the Data Catalog user interface, and undo the changes if necessary. You can also push the changes back to the tools from which they were imported into Data Catalog.

Linking physical data to business-defined terms

In Data Catalog, with applicable permissions, you can manually tag data directly in the user interface. To learn more about this process, see Manage associations.

Reporting and data visualization

Reporting in Data Catalog is usually done through dashboards using third-party BI tools. Dashboards extend the visual discovery and relationship discovery capabilities of Data Catalog, and they provide a means to add customized insight assets unique to the organization.