Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog user features

This article describes the Lumada Data Catalog user interface and the tasks non-admin users can perform.

Before proceeding, make sure that a Data Catalog service user or Administrator has set up your catalog for you. For information about configuring and running Data Catalog jobs, see the Data Catalog Administration Guide.

Data Catalog builds a complete inventory of data assets in a data warehouse, automatically and securely.

It provides:

  • Precise data discovery and faster delivery to authenticated users
  • Better understanding of data quality at a glance
  • An inventory of all assets for efficient data repository governance

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository, and by making the data repository available to those tools for further processing.

Data self-service

Lumada Data Catalog provides a rich interface for data self-service. Its multi-faceted metadata search uses terms, detailed properties, bookmarks, and Hive tables to help you find the best instances of the integrated data you are looking for. These services include:

  • Role-based access control

    Administrators can give a role permission to view the metadata for files and fields without giving it permission to view the data itself. Metadata-only access is useful because data analysts often cannot tell whether they need access to a file until they can see what it contains.

  • Multi-faceted, customizable metadata search

    During profiling and discovery, Data Catalog infers facets from both data and metadata properties. You can use facets to refine search results and drill down to the desired file, table, or field without having to write code or ask subject matter experts. Additionally, you can use pre-filters for a specified set of values to customize facets for a specific role, so users in that role can search within a narrower scope for the values or facets relevant to their function.

  • Business Terms

    You can discover metadata about files and fields and have Data Catalog automatically associate fields with your business terms. Data Catalog learns from the way these associations are accepted or rejected, then propagates terms automatically to similar fields.

  • Data quality metrics and statistics

    Data Catalog's profiling processes produce and present detailed data quality metrics and statistics that help you decide if the data is useful, valid, and complete, without having to write code to graph each field in the file.

  • Bookmarks and audit history

    As you search for applicable files and tables to use, Data Catalog helps you keep track of these resources with bookmarks. A bookmark is a shortcut for browsing directly to the bookmarked resource. Bookmarks also register your interest in that resource, so you are notified when changes to the resource, such as schema changes or term changes, are detected.

  • User roles

    User roles help you manage terms in multiple glossaries and control which users can make changes in each glossary. You can open up one glossary to ad hoc tagging of terms while keeping another glossary under strict curation.

  • Job management

    You can delegate job management based on user roles, which allows the owner of a data node or resource, such as a data steward or data analyst, to make administrative decisions for and take ownership of that node or resource.

  • Hive table creation

    When you find a file you are interested in, you can create a Hive table for that file. Non-technical users can find the files that contain the data they need, create a Hive table, and use a popular business intelligence (BI) tool to visualize the data without any additional effort. This makes Hadoop accessible and useful to a much wider audience.

Automated data inventory

Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating terms created by users. Data Catalog identifies the formats of the resources and profiles their contents, creating an inventory of data assets in the data warehouse automatically and securely.

  • Format discovery

    One of the benefits of a data warehouse is that it is a central repository of integrated data from different data source types with different types of files and tables, without a requirement for the data to fit a predefined resource schema. However, this flexibility can also make it difficult to understand the structure of the resource and how key fields may be related within and across resources.

    Data Catalog crawls a data cluster and identifies the formats of a wide variety of resources, including images, PDFs, logs, data files, and tables. All the recognized resources are added to the inventory for tagging, searching, and lineage.

  • Profiling

    Most of the data-curating process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.

    Data profiling is the process in which Data Catalog examines file data and gathers statistics about the data. It profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics and data statistics. The resulting inventory includes rich metadata for delimited files, JSON, ORC, Avro, and Parquet files, and for files compressed with supported compression algorithms such as gzip and Snappy.

  • Lineage discovery

    File lineage is the source of a file or the systems it came from. Knowing the lineage across files is important for audit purposes, to determine if a file is trustworthy, and to do impact analysis to determine how schema changes would affect downstream files. Data Catalog performs file lineage discovery automatically, so you can view and drill into the lineage information as you work with the data.

  • Sensitive data discovery

    Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. Data Catalog’s algorithms identify sensitive data throughout the data clusters as a part of profiling with minimal additional overhead. Identification is the first step, and often the hardest step, in the process of protecting sensitive data. You cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.

  • Business term propagation

    Business term propagation is central to Data Catalog's ability to crowd-source an ontology. It is the process in which Data Catalog learns from the way users tag the data, recognizes similar content across resources, and automatically assigns terms to similar fields.

    Term propagation allows all users to benefit from the insight and knowledge that other users encapsulate in the terms they created. The end result is a rich understanding of the data that was achieved by the community with little effort.
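The idea behind term propagation can be illustrated with a minimal sketch. It is not Data Catalog's actual algorithm, which is not described in this article. The sketch assumes that a field's "fingerprint" is simply the set of its normalized values and that similarity is measured with the Jaccard index. The function names, field names, and the 0.5 threshold are hypothetical.

```python
def fingerprint(values):
    """Reduce a field's values to a set of normalized strings (a crude fingerprint)."""
    return {str(v).strip().lower() for v in values if v is not None}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def propagate_terms(tagged_fields, untagged_fields, threshold=0.5):
    """Suggest a term for each untagged field that resembles a tagged field.

    tagged_fields:   {field_name: (term, values)}  -- terms assigned by users
    untagged_fields: {field_name: values}
    Returns {field_name: (suggested_term, similarity)}.
    """
    suggestions = {}
    for name, values in untagged_fields.items():
        fp = fingerprint(values)
        best_term, best_score = None, 0.0
        for term, tagged_values in tagged_fields.values():
            score = jaccard(fp, fingerprint(tagged_values))
            if score > best_score:
                best_term, best_score = term, score
        # Only propose a term when the overlap is strong enough (assumed threshold).
        if best_term is not None and best_score >= threshold:
            suggestions[name] = (best_term, round(best_score, 2))
    return suggestions


# Hypothetical example data: one field tagged by a user, two untagged fields.
tagged = {"customers.country": ("Country", ["US", "DE", "FR", "JP"])}
untagged = {
    "orders.ship_to": ["us", "de", "fr", "br"],
    "orders.amount": [10.5, 20.0, 7.25],
}
suggestions = propagate_terms(tagged, untagged)
print(suggestions)
```

In this example only `orders.ship_to` receives a suggestion, because three of its four values also appear in the field a user tagged as `Country`.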

Data quality

Large-scale profiling discovers data quality metrics automatically, such as the number of nulls in a column or the column's cardinality. For example, you can compare the number of values you expect in a field with the number actually found during profiling.

Data Catalog writes profiling process notifications to the log files.
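As a rough illustration of the metrics mentioned above, the following sketch computes a null count and cardinality for each column of a small in-memory dataset. It only shows what these two metrics mean and says nothing about how Data Catalog computes them at scale. The function name, the sample rows, and the choice to treat empty strings as nulls are assumptions.

```python
from collections import Counter


def profile_column(values):
    """Compute simple quality metrics for one column.

    Assumption: None and empty strings both count as nulls.
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(counts),  # number of distinct non-null values
    }


# Hypothetical sample data.
rows = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": None},
    {"id": 4, "state": "CA"},
]

# Pivot the rows into columns, then profile each column.
columns = {name: [row[name] for row in rows] for name in rows[0]}
profile = {name: profile_column(vals) for name, vals in columns.items()}

for name, metrics in profile.items():
    print(name, metrics)
```

Here the `state` column has one null among four rows and a cardinality of 2, while `id` has no nulls and a cardinality equal to its row count, which is what you would expect of a unique key.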

Data governance

Lumada Data Catalog provides data governance by securing access to the data, managing metadata creation, enrichment, and approval, and linking physical data to business-related terminology.

Securing access to data

In Lumada Data Catalog, you can protect resources with secured access using glossaries and virtual folders.

  • Glossaries

    A glossary is a logical grouping of business terms that you can assign to a specific project user group.

  • Virtual folders

    A virtual folder is a logical view of the files or tables of a given data source that you can use with filters for inclusion or exclusion.

Once you have set up roles for glossaries and virtual folders, you can use them to limit access to data via specific roles and users.
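The inclusion and exclusion filters of a virtual folder can be pictured as pattern matching over resource paths. The sketch below is only a conceptual model using shell-style wildcards from Python's standard library. The function name, the paths, and the pattern syntax are assumptions and do not reflect Data Catalog's actual filter syntax.

```python
from fnmatch import fnmatchcase


def virtual_folder(paths, include, exclude=()):
    """Return the paths that match any include pattern and no exclude pattern."""
    return [
        p for p in paths
        if any(fnmatchcase(p, pat) for pat in include)
        and not any(fnmatchcase(p, pat) for pat in exclude)
    ]


# Hypothetical resources in a data source.
paths = [
    "/data/sales/2023/orders.csv",
    "/data/sales/2023/orders_tmp.csv",
    "/data/hr/salaries.parquet",
]

# A logical view of the sales area that hides temporary files.
visible = virtual_folder(paths, include=["/data/sales/*"], exclude=["*_tmp.csv"])
print(visible)
```

A role granted access to this virtual folder would see only `/data/sales/2023/orders.csv` in this model, because the HR file is not included and the temporary file is excluded.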

Managing metadata creation, enrichment, and approval

In the Data Catalog user interface, you can create new metadata by adding business terms, term hierarchies, synonyms, and custom properties. Most user roles are permitted to add new terms while browsing content. Only administrators can define new custom properties. Users can read or set properties only if they have the corresponding read or set permissions.

You can change terms or glossaries using the Data Catalog user interface or undo the changes if necessary. You can also push the changes back to the tools from which they were imported into Data Catalog.

The process for profiling and reviewing automatically discovered terms is designed so that data stewards can review terms and work with Data Catalog without the need for a formal approval workflow.

Linking physical data to business-defined terms

A key feature of Lumada Data Catalog is automatic tagging of physical data elements with business terminology.

Data Catalog automatically matches physical content with business terms using a combination of observable metadata, pattern recognition on content, and comparison of data fingerprints with other known sources.

Data Catalog clearly marks matches as suggested associations. You can use terms for search or similar operations whether or not they have been curated. Additional curation by data stewards (with permissions) contributes to a machine learning algorithm that fine-tunes future matching operations.
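Of the matching techniques listed above, pattern recognition on content is the easiest to illustrate. The following sketch suggests a term for a field when most of its values match a regular expression. It is a simplified illustration and not Data Catalog's implementation. The term names, the patterns, the 0.8 ratio, and the `"suggested"` status label are all assumptions.

```python
import re

# Hypothetical business terms and deliberately simple content patterns.
PATTERNS = {
    "Email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US ZIP Code": re.compile(r"\d{5}(-\d{4})?"),
}


def suggest_term(values, min_ratio=0.8):
    """Suggest a term when at least min_ratio of the non-null values fit a pattern.

    The result is marked as a suggestion that a data steward could later
    accept or reject.
    """
    values = [str(v) for v in values if v is not None]
    if not values:
        return None
    for term, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        ratio = hits / len(values)
        if ratio >= min_ratio:
            return {"term": term, "status": "suggested", "match_ratio": ratio}
    return None


emails = ["a@example.com", "b@example.org", "n/a", "c@example.net", "d@example.com"]
zips = ["94105", "10001-1234", "30301"]
colors = ["red", "blue"]

print(suggest_term(emails))
print(suggest_term(zips))
print(suggest_term(colors))
```

Requiring only a ratio of matches, rather than all values, lets the sketch tolerate a few dirty values such as `n/a` while still declining to tag fields that do not fit any pattern.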

With applicable permissions, you can manually tag data directly in the user interface. Data Catalog uses visual markers to distinguish terms associated manually.

To learn more about this process, please see Getting started with business terms and term propagation.

Reporting and data visualization

Reporting in Data Catalog is usually done using third-party BI tools.

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository. Then it makes the data repository available to those tools for further processing.

To combine business terms with reporting, you can tag reports with business terms in the same way you tag raw data, whether the data comes from a report or an ERP system.