Hitachi Vantara Lumada and Pentaho Documentation

Product overview

The Data Catalog software builds a metadata catalog from data assets residing in Apache HDFS and Hive, S3, MySQL, Oracle®, Amazon Redshift®, Teradata®, and other sources. It profiles the data assets to produce field-level data quality statistics and to identify representative data so users can efficiently analyze the content and quality of the data.

The Data Catalog's interactive UI provides a user experience customized to each user's business role, promoting rich content authoring and knowledge collaboration on resources through posts and notifications.

  • Data Glossary

    Lumada Data Catalog gives users access to the file and field-level metadata available for the entire catalog of data assets. In addition, for the assets that each user has authorization to view, the Data Catalog displays a rich view of data-based details such as minimum, maximum, and most frequent values. Data Catalog users can add their own information to the catalog in the form of descriptions, ratings, and custom metadata designed for your organization.

  • Tags

    Lumada Data Catalog provides an interface for users to label data with information about its business value. It distributes these labels or tags to similar data across the cluster, producing a powerful index for business language searches. It enables business users to find the right data quickly and to understand the meaning and quality of the data at a glance.

  • Lineage

    Lumada Data Catalog displays lineage relationships among resources using metadata imported from Apache Atlas or from Cloudera Navigator. In addition, the Data Catalog can infer lineage relationships from the metadata collected in discovery profiling.

  • User roles

    Roles assigned to Lumada Data Catalog users let administrators exercise role-based functional and access control: which users can create annotation tags, which users can apply those tags to data, and which users can approve or reject the metadata suggested by the Data Catalog's discovery operations. In addition, you can use roles to establish access control over resources in the catalog at the data source level. Roles also incorporate a set of predefined access levels, which define which aspects of the catalog are available to which users.

  • Access control

    User profiles, roles, and access levels combine to restrict or grant metadata access to users.

This overview describes the following subjects:

  • Architecture: The Data Catalog's place in Apache Hadoop®.
  • Spark: Compute engine for the Data Catalog's collection of metadata.
  • Data profiling: Data format types profiled by the Data Catalog.
  • Interacting with Hive: Configuration and permissions.
  • Relational data sources: MySQL, Oracle, Redshift, and Teradata come preconfigured.

Architecture

Data Catalog consists of three main components:

  • Application Server

    This is where users interact with the Data Catalog to browse data, initiate jobs, and perform business analytics on the processed data.

  • Processing Engine/Agents

    The brains of the Data Catalog. This engine communicates with the data sources in the data lake to process and fingerprint the data.

  • Repository

    The metadata storage for the Data Catalog.

Traditionally, the application server and the processing engine components have resided in the same cluster.

Lumada Data Catalog enables a distributed architecture, where the processing engine is replaced by processing agents that can run on multiple remote clusters while the application server and the metadata repository reside in a centralized location.

Thus, a single unified catalog is fed from multiple distributed data sources, including multiple Hadoop clusters and relational databases.

The metadata collected from processing functions is stored partly in a Solr repository and partly in Postgres. This repository will typically reside with the application server in a centralized location. A metadata server component facilitates communications between the Solr repository and agents.

The following schematic gives an overview of Lumada Data Catalog's architectural components showcasing the distributed application.

[Figure: Distributed architecture]
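
The distributed layout described above can be summarized in simplified text form: agents on remote clusters feed the central metadata server, which persists metadata in the Solr and Postgres repository used by the application server.

    Remote cluster 1               Remote cluster 2
    [Processing agent]             [Processing agent]
              \                           /
               +---- Metadata server ----+
                           |
            Repository (Solr + Postgres)
                           |
              Application server (UI)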

Spark

Data Catalog runs profiling jobs against HDFS and Hive data using Spark. The application code is transferred to each cluster node to be executed against the data that resides on that node. The results are accumulated in Solr or, in the case of deep profiling, in HDFS files. Lumada Data Catalog runs Spark jobs against the resulting metadata to determine tag association suggestions.

It is important to understand that Data Catalog jobs are standard Spark jobs: the expertise your organization already has for tuning cluster operations applies to running and tuning Lumada Data Catalog jobs.
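
To make this concrete, the following is a minimal PySpark sketch of the kind of field-level statistics a profiling job computes (minimum, maximum, and most frequent values). It is not Data Catalog's actual job code; the input path, read options, and output handling are placeholders.

    # Minimal PySpark sketch (not Data Catalog's job code): compute per-field
    # min, max, and most frequent values for a delimited file, the kind of
    # metadata a profiling run accumulates in the catalog.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()
    df = spark.read.option("header", True).csv("hdfs:///data/example.csv")  # placeholder path

    profile = {}
    for column in df.columns:
        stats = df.agg(F.min(column).alias("min"), F.max(column).alias("max")).first()
        top = (df.groupBy(column).count()
                 .orderBy(F.desc("count"))
                 .limit(5)
                 .collect())
        profile[column] = {
            "min": stats["min"],
            "max": stats["max"],
            "top_values": [(row[column], row["count"]) for row in top],
        }

    # A production job would write these results to the metadata repository
    # (Solr, per the architecture above); printing stands in for that step here.
    print(profile)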

Because these jobs read all the data in assets included in the catalog, the user running the jobs needs read access to that data. Configure the Lumada Data Catalog service user with security in mind to ensure this broad access is appropriately controlled.

Data profiling

Data Catalog profiles data in an HDFS cluster, including HDFS files and Hive tables. In addition, Lumada Data Catalog profiles data in relational databases accessible through JDBC. It collects file and field-level data quality metrics, data statistics, and sample values. The catalog includes rich metadata for delimited files, JSON, ORC, Avro, XML, and Parquet, as well as for files compressed with Hadoop-supported compression algorithms such as gzip, LZO, and Snappy.

The first part of the process is identifying the data sources you want to include in your catalog. The next step is to have the Lumada Data Catalog profiling engine read the data sources and populate the catalog with metadata about the resources (databases, tables, folders, and files) in each data source.

  • XML

    At this time, Lumada Data Catalog processes XML files that are constructed with a single root element and any number of repeating row elements (for example, a single <records> element containing repeating <record> elements). Administrators can specify the root and row elements if Data Catalog does not identify the correct elements.

  • Delimited text files

    Lumada Data Catalog format discovery determines how data is organized in a file. For a text file to be profiled as data, each row is assumed to be a line of data. If there are lines in the file that are not data (no delimiter is found), Lumada Data Catalog will not profile the file. For example, if there are titles, codes, or descriptions at the top or bottom of a text file, Lumada Data Catalog categorizes the file as plain text and does not profile it.
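
The following Python sketch approximates this kind of format discovery for delimited text. It is illustrative only; Data Catalog's own discovery logic is internal to the product, and the file path and candidate delimiters are placeholders.

    # Illustrative sketch: decide whether a text file looks like delimited data
    # by checking a sample of lines for a consistent delimiter. Files without a
    # consistent delimiter would be treated as plain text and not profiled.
    import csv

    def detect_delimiter(path, candidates=",;\t|", sample_lines=50):
        with open(path, "r", errors="replace") as f:
            sample = "".join(f.readline() for _ in range(sample_lines))
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
        except csv.Error:
            return None  # no consistent delimiter: categorize as plain text
        return dialect.delimiter

    delimiter = detect_delimiter("example.txt")  # placeholder path
    print("profile as delimited data" if delimiter else "categorize as plain text")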

Interacting with Hive

Lumada Data Catalog can include Hive tables in the catalog. This level of interaction with Hive requires some manual configuration and privileges.

  • Hive authorization

    Profiling Hive tables requires that the Data Catalog service user have read access to the Hive database and table. Browsing Hive tables requires that the active user be authorized to access the database and table and that the Data Catalog service user have read access to the backing file. Creating Hive tables requires that the active user be able to create tables in at least one Hive database and have write access to the folder where the source files reside.

  • Hive authorization on Kerberized cluster

    During profiling, Lumada Data Catalog interacts with Hive through the metastore. In a Kerberized environment, typically the only user allowed to access Hive data through the metastore is the Hive superuser; to perform this operation, Lumada Data Catalog needs the following configurations:

    • The Hive access URL must include the Hive superuser principal name.
    • The Data Catalog service user must be configured as a proxy user in Hadoop so it can perform profiling operations as the Hive superuser. A sample proxy-user configuration is shown after this list.

    These configuration requirements are described in detail in the installation steps in this document.

  • Profiling tables and their backing files

    When Lumada Data Catalog profiles Hive tables, it uses the Hive metadata to determine the backing directory for each table and includes that directory and constituent files in the catalog, whether the HDFS files have been profiled or not. It also includes a lineage relationship between the Hive table and these backing files. By default, it does not profile the backing HDFS files. If you choose to independently profile the backing files, it is possible that the Data Catalog will show different views for the same data based on the input formats and parsing used for the HDFS file itself and for the Hive table. For example, the Hive table may have different column names, a subset of columns or rows, and may use a different delimiter to determine the fields within each row of data.
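
As noted in the Kerberos item above, configuring the Data Catalog service user as a Hadoop proxy user relies on the standard hadoop.proxyuser.* settings in core-site.xml. The property names below are standard Hadoop configuration; the service user name (ldcuser), host, and group values are placeholders, and the installation steps give the exact values for your environment.

    <!-- core-site.xml: allow the Data Catalog service user (placeholder name
         "ldcuser") to impersonate users in the listed groups, such as the Hive
         superuser, from the listed hosts. Restrict these values as tightly as
         your environment allows. -->
    <property>
      <name>hadoop.proxyuser.ldcuser.hosts</name>
      <value>catalog-agent-host.example.com</value>
    </property>
    <property>
      <name>hadoop.proxyuser.ldcuser.groups</name>
      <value>hive</value>
    </property>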

Relational data sources

Lumada Data Catalog can include data from relational data sources in the catalog. Administrators can connect through JDBC to MySQL, Oracle, MSSQL Server, Redshift, Snowflake, SAP HANA, Aurora, or Teradata databases. Once a data source or logical table is created, it can be processed and viewed like any other database or Hive table.
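
For reference, JDBC connection URLs for several of these databases typically take the following forms. Host names, ports, and database or service names are placeholders, and the ports shown are common defaults that may differ in your environment.

    jdbc:mysql://<host>:3306/<database>
    jdbc:oracle:thin:@//<host>:1521/<service_name>
    jdbc:sqlserver://<host>:1433;databaseName=<database>
    jdbc:redshift://<host>:5439/<database>
    jdbc:teradata://<host>/DATABASE=<database>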

At this time, profiling against these data sources involves a full pass through the tables; there is no option to only profile new or updated resources.

The Data Catalog does not ingest data from SQL functions or stored procedures directly. However, the Data Catalog currently has adapters to import SQL from Hive via Apache Atlas and Cloudera Navigator. Additionally, custom adapters can be built to integrate with the Data Catalog's RESTful API, which allows metadata to be pushed into Lumada Data Catalog and then profiled.
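
As a purely illustrative sketch of that integration pattern, the example below pushes a metadata record to the catalog over HTTP. The endpoint path, payload fields, and token are invented placeholders, not the actual Data Catalog REST API; consult the product's API reference for the real endpoints and schemas.

    # Hypothetical custom adapter pushing metadata into the catalog via REST.
    # The URL, payload fields, and token are placeholders, not the real API.
    import json
    import urllib.request

    CATALOG_URL = "https://catalog.example.com/api/resources"  # placeholder endpoint
    TOKEN = "<api-token>"                                       # placeholder credential

    record = {
        "name": "sales_2021",                                   # hypothetical resource
        "source": "warehouse-db",                               # hypothetical data source
        "fields": [{"name": "amount", "type": "decimal"}],
    }

    request = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + TOKEN},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(response.status)  # the pushed resource can then be profiled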