Hitachi Vantara Lumada and Pentaho Documentation

Data Catalog overview

This article describes the Lumada Data Catalog user interface and the tasks non-admin users can perform.

Before proceeding, make sure that a Data Catalog service user or Administrator has set up your catalog for you. For information about configuring and running Data Catalog jobs, see the Data Catalog Administration Guide.

Data Catalog builds a complete inventory of data assets in a data warehouse, automatically and securely.

It provides:

  • Exact data discovery and faster delivery to authenticated users
  • A clearer, at-a-glance understanding of data quality
  • An inventory of all assets for efficient data repository governance

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository, and by making the data repository available to those tools for further processing.

Data self-service

Lumada Data Catalog provides a rich interface for data self-service by leveraging multi-faceted metadata search using tags, detailed properties, bookmarks, and Hive tables to help you find the best instances of the integrated data you are looking for. These services include:

  • Role-based access control

    Administrators can grant a role permission to view the metadata for files and fields without granting permission to view the data itself. Metadata-only visibility is useful because data analysts often do not know they need access to a file until they can see what it contains.

    When administrators create roles, they designate the resource read access level for every role using the following settings:

    • Metadata Access

      Determines whether to override system permissions so that members of the selected role are granted read-only access to the resource's metadata.

    • Data Access

      Determines whether members of the role can read the resource's data.

  • Multi-faceted, customizable metadata search

    During profiling and discovery, Data Catalog infers facets from both data and metadata properties. You can use facets to refine search results and drill down to the desired file, table, or field without having to write code or ask subject matter experts. You also can create facets from cluster data.

    You can customize facets for a specific role by pre-filtering for a certain set of values, so that users in that role search within a narrower scope for the values or facets relevant to their function.

  • Tags

    Data Catalog allows you to discover the content of files and tag fields automatically. It can also import an ontology vocabulary so users can leverage predefined terms, and it supports crowd-sourced tag creation. Data Catalog learns from the way users tag the data, and propagates tags to similar fields automatically.

  • Data quality metrics and statistics

    Data Catalog's profiling processes produce and present detailed data quality metrics and statistics that help you decide if the data is useful, valid, and complete, without having to write code to graph each field in the file.

  • Bookmarks and audit history

    As you search for applicable files and tables, Data Catalog helps you keep track of these resources with bookmarks. A bookmark is a shortcut for browsing directly to the bookmarked resource. It also registers your interest in that resource, so you are notified when changes to the resource, such as schema changes or tag changes, are detected.

    In addition to notifying users, Data Catalog keeps an audit history of all user actions and resource changes on the cluster.

  • Hive table creation

    When you find a file you are interested in, you can easily create a Hive table for it. Non-technical users can find the files that contain the data they need, create a Hive table, and use a popular business intelligence (BI) tool to visualize the data without extra effort, making Hadoop accessible and useful to a much wider audience.

  • User roles

    User roles help you manage tags in multiple domains and provide control of the users who are making the changes in each domain. You can open up one domain to ad hoc tagging while keeping another domain under strict curation.

  • Job management

    You can delegate job management based on user roles, which allows the owner of a data node or resource, such as a data steward or data analyst, to make administrative decisions for and take ownership of that node or resource.
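The Metadata Access and Data Access levels described under role-based access control above can be modeled as two independent flags on a role. The sketch below is purely illustrative; the class and function names are hypothetical and do not reflect Data Catalog's actual permission model:

```python
from dataclasses import dataclass

# Hypothetical sketch only: the fields and logic below illustrate the
# Metadata Access / Data Access distinction and are not Data Catalog's
# actual permission model.

@dataclass
class Role:
    name: str
    metadata_access: bool  # read-only access to schema, tags, statistics
    data_access: bool      # read access to the underlying data values

def can_view(role: Role, aspect: str) -> bool:
    """Return True if the role may read the given aspect of a resource."""
    if aspect == "metadata":
        return role.metadata_access
    if aspect == "data":
        return role.data_access
    raise ValueError(f"unknown aspect: {aspect}")

# An analyst role that can browse metadata but not the data itself.
analyst = Role("analyst", metadata_access=True, data_access=False)
print(can_view(analyst, "metadata"))  # True
print(can_view(analyst, "data"))      # False
```

Keeping the two flags independent is what lets administrators expose a resource's schema and statistics for discovery while the raw values stay hidden.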

Automated data inventory

Behind Data Catalog’s self-service user interface is an engine that profiles the data repository and enriches it by propagating tags created by users. Data Catalog identifies the formats of the resources and profiles their contents, creating an inventory of data assets in the data warehouse automatically and securely.

  • Format discovery

    One of the benefits of a data warehouse is that it is a central repository of integrated data from different data source types with different types of files and tables, without a requirement for the data to fit a predefined resource schema. However, this flexibility can also make it difficult to understand the structure of the resource and how key fields may be related within and across resources.

    Data Catalog crawls a data cluster and identifies the formats of a wide variety of resources, including images, PDFs, logs, data files, and tables. All the recognized resources are added to the inventory for tagging, searching, and lineage.

  • Profiling

    Most of the data-curating process entails writing code to profile and graph data. Data Catalog automates this process, improving the productivity of data engineers and data scientists.

    Data profiling is the process in which Data Catalog examines file data and gathers statistics about it. Data Catalog profiles data in the cluster and uses its algorithms to compute detailed properties, including field-level data quality metrics and data statistics. The resulting inventory includes rich metadata for delimited files and for JSON, ORC, Avro, and Parquet files, including files compressed with supported algorithms such as gzip, LZO, and Snappy.

  • Lineage discovery

    File lineage identifies the origin of a file and the systems it came from. Knowing lineage across files is important for audits, for determining whether a file is trustworthy, and for impact analysis of how schema changes would affect downstream files. Data Catalog performs file lineage discovery automatically, so you can view and drill into lineage information as you work with the data.

  • Sensitive data discovery

    Sensitive data residing in the data cluster presents a sizable liability if it is not protected and managed. Data Catalog’s algorithms identify sensitive data throughout the data clusters as a part of profiling with minimal additional overhead. Identification is the first step, and often the hardest step, in the process of protecting sensitive data. You cannot protect sensitive data unless you know where it resides. Data Catalog identifies sensitive data and facilitates the next step of protecting it through masking, encryption, or quarantine.

  • Tag propagation

    Tag propagation is central to Data Catalog's ability to crowd-source an ontology. It is the process in which Data Catalog learns from the way users tag the data, recognizing similar content across resources and automatically assigning tags to similar fields.

    Tag propagation lets all users benefit from the insight and knowledge that other users encapsulate in the tags they create. The result is a rich, community-built understanding of the data, achieved with little individual effort.
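The general idea behind tag propagation can be sketched with a simple similarity measure: compare the value sets of fields and copy a tag to fields whose content overlaps enough. Data Catalog's actual algorithms are more sophisticated; the Jaccard comparison, field names, and threshold below are illustrative assumptions only:

```python
# Illustrative sketch of tag propagation by content similarity; not
# Data Catalog's actual algorithm.

def jaccard(a: set, b: set) -> float:
    """Overlap of two value sets, 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def propagate_tags(fields: dict, tags: dict, threshold: float = 0.5) -> dict:
    """Suggest tags for untagged fields whose values resemble a tagged field."""
    suggested = {}
    for name, values in fields.items():
        if name in tags:
            continue  # already tagged by a user
        for tagged_name, tag in tags.items():
            if jaccard(values, fields[tagged_name]) >= threshold:
                suggested[name] = tag
    return suggested

# Hypothetical fields: one tagged by a user, one similar but untagged.
fields = {
    "customers.state": {"CA", "NY", "TX", "WA"},
    "orders.ship_state": {"CA", "NY", "TX", "OR"},
}
tags = {"customers.state": "US State"}
print(propagate_tags(fields, tags))  # {'orders.ship_state': 'US State'}
```

In Data Catalog the propagated associations surface as suggestions that stewards can curate, which in turn refines future matching.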

Data quality

You can discover data quality metrics automatically using large-scale profiling, such as the number of nulls in a data column or its cardinality. For example, you can compare the number of values that should be in a field against the number actually profiled.
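The metrics named above can be computed over any column. The following minimal sketch shows the quantities involved (null count, cardinality, completeness) on a toy column; it is an illustration of the metrics, not Data Catalog's profiler:

```python
# Minimal sketch of field-level quality metrics; illustrative only.

def profile_column(values):
    """Gather basic quality statistics for one column of data."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(set(non_null)),  # distinct non-null values
        "completeness": len(non_null) / len(values) if values else 0.0,
    }

col = ["CA", "NY", None, "CA", "TX", None]
stats = profile_column(col)
print(stats["nulls"], stats["cardinality"])  # 2 3
```

Profiling at scale reports the same kinds of numbers per field, which is what lets you judge validity and completeness without graphing each field by hand.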

You can search these statistical demographics in the user interface, with the Lumada Data Catalog RESTful API, or extract them to a reporting tool. You can also use the API to integrate Data Catalog with a data quality tool.

Data Catalog writes profiling process notifications to the log files.

Data governance

Lumada Data Catalog provides data governance by securing access to the data, managing metadata creation, enrichment, and approval, and linking physical data to business-related terminology.

Secured access to data

In Lumada Data Catalog, you can protect resources with secured access using tag domains and virtual folders.

  • Tag domains

    A tag domain is a logical grouping of tags to represent business terms that you can assign to a specific project user group.

  • Virtual folders

    A virtual folder is a logical view of the fields of a given data source that you can use with filters for inclusion or exclusion.

Once you have set up tag domains and virtual folders, you can use them to limit access to data.
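A virtual folder's include/exclude filtering can be sketched as pattern matching over resource paths. The glob patterns and paths below are hypothetical examples, not Data Catalog's filter syntax:

```python
import fnmatch

# Hypothetical sketch of a virtual folder: a logical view of a data
# source defined by include/exclude path patterns. Illustrative only.

def virtual_folder(paths, include, exclude=()):
    """Return the paths visible through the folder's filters."""
    visible = []
    for p in paths:
        if not any(fnmatch.fnmatch(p, pat) for pat in include):
            continue  # not covered by any inclusion filter
        if any(fnmatch.fnmatch(p, pat) for pat in exclude):
            continue  # explicitly excluded
        visible.append(p)
    return visible

paths = [
    "/data/sales/2023/orders.csv",
    "/data/sales/2023/payroll.csv",
    "/data/hr/2023/payroll.csv",
]
print(virtual_folder(paths, include=["/data/sales/*"], exclude=["*payroll*"]))
# ['/data/sales/2023/orders.csv']
```

Granting a role access through such a view, rather than to the data source itself, is what limits that role to the included subset.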

Managing metadata creation, enrichment, and approval

In the Data Catalog user interface, you can create new metadata by adding tags, tag hierarchies, synonyms, and custom properties. Most user roles can add new tags as they browse content; however, only admin users can add custom properties to Data Catalog.

You can change terms or glossaries using the Data Catalog user interface or undo the changes if necessary. You can also push the changes back to the tools from which they were imported into Data Catalog.

The process for profiling and reviewing automatically discovered tags is designed so that data stewards can review tags and work with Data Catalog without the need for a formal approval workflow.

Linking physical data to business-defined terms

A key feature of Lumada Data Catalog is automatic tagging of physical data elements with business terminology.

Data Catalog automatically matches physical content with business terms using a combination of observable metadata, pattern recognition on content, and comparison of data fingerprints with other known sources.

Data Catalog clearly marks matches as suggested associations. Whether or not they are curated, you may use tags for search or similar operations. Additional curation by privileged stewards feeds a machine learning algorithm, tuning future matching operations.

With appropriate permissions, you can manually tag data directly in the user interface, with a REST API, or by importing a CSV file. You can also use Data Catalog's Metadata REST API to extract business-defined terms or tags from third-party tools such as:

  • Apache Ranger
  • Apache Atlas
  • Cloudera Navigator
  • Informatica
  • Collibra
  • Zaloni
  • Dataguise

Data Catalog uses visual markers to distinguish tags associated manually or by using an API from tags recommended by automatic discovery.
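Tagging by CSV import works from a file that maps resources and fields to tags. The column layout below (resource, field, tag) is a hypothetical stand-in to show the shape of such a file, not Data Catalog's documented import format:

```python
import csv
import io

# Illustrative only: the (resource, field, tag) columns are a
# hypothetical stand-in for a tag-assignment CSV, not the product's
# documented import format.

rows = [
    {"resource": "/data/sales/orders.csv", "field": "cust_state", "tag": "US State"},
    {"resource": "/data/sales/orders.csv", "field": "cust_email", "tag": "Email"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["resource", "field", "tag"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Consult the Data Catalog Administration Guide for the actual import format your version expects.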

To learn more about this process, see Tags and Tag Propagation.

Reporting and data visualization

Reporting in Data Catalog is usually done with third-party BI tools. Data Catalog provides a RESTful API for reporting tools to consume, so nearly any tool can integrate using standard REST conventions.

Data Catalog complements data visualization, data discovery, and data wrangling tools by streamlining the collection and initial data quality checks of the data repository. Then it makes the data repository available to those tools for further processing.

Report tagging

With Data Catalog, you can tag reports in the same way you tag any raw data, whether the data is for a report, an ERP, or a CRM dataset.

Third-party reporting tools

For reporting, you can easily connect your BI tools to Data Catalog using the REST API. For example, you can use Data Catalog's REST API with Tableau to query the catalog.
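A REST integration boils down to building authenticated HTTP requests against the catalog's API. The host, route, and query parameter below are hypothetical placeholders; consult the Lumada Data Catalog REST API reference for the real endpoints and authentication scheme:

```python
from urllib.parse import urlencode

# Placeholder host and route: not Data Catalog's actual API surface.
BASE_URL = "https://catalog.example.com/api"

def search_url(term: str) -> str:
    """Build a metadata-search URL (hypothetical endpoint and parameter)."""
    return f"{BASE_URL}/search?{urlencode({'query': term})}"

print(search_url("customer"))
# https://catalog.example.com/api/search?query=customer
```

A BI tool such as Tableau would issue such requests (with the real routes and a valid credential) and consume the JSON responses for reporting.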