PDI and Data Catalog

Data Catalog users can work with Data Catalog's data profiling, automated discovery, classification, and management of enterprise data from within PDI transformations and jobs.

Note: PDI connects to Data Catalog v7.5.2.

Data Catalog collects metadata for various types of data assets and points to the asset's location in storage. Data assets that are registered in Data Catalog are known as data resources.

Data engineers, data scientists, and business users can use Data Catalog to accelerate metadata discovery and data categorization. Data stewards can use Data Catalog to manage sensitive data.

For example, you can create a PDI transformation that reads the storage location of data for a data resource in Data Catalog, retrieves the data from storage, transforms the data, writes the transformed data back to storage, and then either registers that transformed data as a new data resource in Data Catalog or overwrites an existing data resource in Data Catalog. When data is registered as a data resource in Data Catalog, you can add the storage location of the data and descriptive metadata tags.
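
Conceptually, this round trip maps to a handful of catalog and storage calls. The following minimal sketch shows the same flow outside PDI, assuming a hypothetical Data Catalog REST endpoint (/api/v2/resources), hypothetical payload fields (bucket, key, tags), and an S3 bucket. It illustrates the pattern only; it is not the documented Data Catalog API.

    # Sketch of the round trip described above, outside PDI.
    # Endpoint paths and payload fields are hypothetical illustrations.
    import csv
    import io

    import boto3
    import requests

    CATALOG = "https://catalog.example.com"   # hypothetical catalog host
    AUTH = {"Authorization": "Bearer <token>"}

    # 1. Find the data resource and the storage location of its data.
    resource = requests.get(f"{CATALOG}/api/v2/resources",
                            headers=AUTH, params={"name": "sales_raw"},
                            timeout=30).json()[0]
    bucket, key = resource["bucket"], resource["key"]   # assumed fields

    # 2. Retrieve the data from storage and transform it.
    s3 = boto3.client("s3")
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [{**r, "amount": str(float(r["amount"]) * 1.1)}  # example transform
            for r in csv.DictReader(io.StringIO(raw))]

    # 3. Write the transformed data back to storage.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key="sales_transformed.csv",
                  Body=out.getvalue().encode("utf-8"))

    # 4. Register the result as a new data resource, with location and tags.
    requests.post(f"{CATALOG}/api/v2/resources", headers=AUTH, timeout=30,
                  json={"name": "sales_transformed", "bucket": bucket,
                        "key": "sales_transformed.csv",
                        "tags": ["derived", "finance"]}).raise_for_status()

In a PDI transformation, the Read Metadata, Catalog Input, Catalog Output, and Write Metadata steps described below take the place of these calls.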

You can use the following PDI steps to build transformations that work with Data Catalog metadata and data resources (a sketch that chains these steps follows the list):

  • Read Metadata

    You can use the Read Metadata step to search Data Catalog metadata, find specific data resources, and find the location of those resources. Metadata that is associated with an identified data resource can be passed to another step within a PDI transformation.

  • Write Metadata

    You can use the Write Metadata step to add or update Data Catalog business terms that are associated with a data resource.

  • Catalog Input

    You can use the Catalog Input step to read a Data Catalog data resource that is stored as a CSV text file in the Hadoop Distributed File System (HDFS) or in S3 storage, and then output the data as table rows for use in a transformation.

  • Catalog Output

    You can use the Catalog Output step to write step data, in CSV format, to HDFS or S3 storage at a location provided by Data Catalog. You can write the CSV data as a new data resource or overwrite an existing data resource. After the data is written, the Catalog Output step triggers a Data Catalog profile job to generate the metadata and statistical profile of the data. You can also use the Catalog Output step to create or modify Data Catalog business terms for the data resource.

Note: The Read Metadata, Write Metadata, Catalog Input, and Catalog Output steps support only the Pentaho engine. The steps do not support the Pentaho Adaptive Execution Layer (AEL) or metadata injection (MDI).
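
Because these steps run only on the Pentaho engine, a transformation that chains them (for example, Read Metadata into Catalog Input, through your transform steps, into Catalog Output and Write Metadata) executes locally and can be launched with the Pan command-line tool. The sketch below is a minimal launcher, assuming a hypothetical transformation file catalog_roundtrip.ktr that accepts the resource name as a named parameter; -file, -param, and -level are standard Pan options.

    # Minimal sketch: launch a transformation that chains the catalog steps.
    # The .ktr path and the RESOURCE parameter are hypothetical.
    import subprocess

    result = subprocess.run(
        ["./pan.sh",
         "-file=/opt/pentaho/transformations/catalog_roundtrip.ktr",
         "-param:RESOURCE=sales_raw",   # consumed by the Read Metadata step
         "-level=Basic"],
        capture_output=True, text=True)
    print(result.stdout)
    result.check_returncode()           # non-zero exit: transformation failed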

Prerequisites

To use the Read Metadata or Write Metadata steps, you must set up a VFS connection to a stand-alone instance of Data Catalog and provide your role access credentials. For more information, see Access to Data Catalog.

To use the Catalog Input and Catalog Output steps, you must complete the following tasks:

  • Set up a VFS connection to a stand-alone instance of Data Catalog and provide your role access credentials. For more information, see Access to Data Catalog.
  • To use S3 storage that is provided by Data Catalog, you must configure S3 as the Default S3 Connection in VFS Connections. For details, see Connecting to Virtual File Systems.
  • Establish a PDI connection to each cluster that you plan to use. For example, to access HDFS storage, you must configure the driver for your Hadoop distribution as a named connection. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.
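
Before running a transformation, it can be worth confirming that the credentials behind the Default S3 Connection actually reach the storage that Data Catalog points to. A small sanity-check sketch, assuming a placeholder bucket name:

    # Sanity check for the S3 credentials behind the Default S3 Connection.
    # The bucket name is a placeholder for the bucket Data Catalog points to.
    import boto3

    s3 = boto3.client("s3")
    s3.head_bucket(Bucket="my-catalog-bucket")  # raises ClientError on failure
    resp = s3.list_objects_v2(Bucket="my-catalog-bucket", MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])                       # a few sample object keys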

Supported file types

You can use the Read Metadata and Write Metadata steps to modify the Data Catalog metadata for any resource in the catalog, regardless of the file format.

The Catalog Input and Catalog Output steps support retrieving and performing ETL transformations on CSV files that are stored in HDFS or S3.