Hitachi Vantara Lumada and Pentaho Documentation

Read Metadata

Use the Read Metadata step to search Data Catalog and retrieve the metadata associated with specific data resources that are registered there.

You can create a transformation that searches Data Catalog for metadata that points to data in CSV files that are stored in HDFS or S3. You can then pass all the associated metadata, including the location of the data, to other steps in your transformation for processing.

For example, you can use the Read Metadata step to retrieve the metadata for the location of a data file and then pass that location metadata to a Text File input step or a Catalog Input step. The Text File input step or Catalog Input step uses the location metadata to retrieve data from the file for an ETL operation on the data.
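The hand-off described above can be pictured outside PDI with a small Python sketch. It only illustrates the data flow and is not how PDI implements these steps: the record layout and the `location` and `resource_name` keys are hypothetical, and a local temporary file stands in for an HDFS or S3 location.

```python
import csv
import os
import tempfile


def read_rows_from_metadata(metadata_record):
    """Open the CSV file named by the record's location and return its rows.

    `metadata_record` is a hypothetical dict standing in for the metadata
    row that a metadata lookup passes downstream; the "location" key is an
    assumption made for this sketch.
    """
    location = metadata_record["location"]
    with open(location, newline="") as handle:
        return list(csv.DictReader(handle))


def demo():
    # Create a small CSV file to play the role of the cataloged resource.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as tmp:
        tmp.write("id,name\n1,alpha\n2,beta\n")
        path = tmp.name
    try:
        # Hypothetical metadata record, as a metadata lookup might return it.
        record = {"resource_name": "sample.csv", "location": path}
        return read_rows_from_metadata(record)
    finally:
        os.remove(path)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The reader function never needs to know where the file lives in advance; it relies entirely on the location carried in the metadata record, which is the role the location metadata plays for the downstream input step.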

The Read Metadata step includes search options to identify, locate, and retrieve the metadata associated with the available data resources listed in Data Catalog.

For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text file formats are currently supported. You must have role permissions set in Data Catalog to read the data resources.

Before you begin

Before you can use the Read Metadata step, you must establish a VFS connection to Data Catalog. For more information, see Access to Data Catalog. You must also have role permissions set in Data Catalog to read the metadata that is associated with the data resources that are registered in Data Catalog.

General

The following options are general to the Read Metadata transformation step.

Option | Description
Step name | Specify a unique name for the Read Metadata step. You can customize the name or use the default name.
Connection | Select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

Options

Use the Read Metadata step to search for data resources and associated metadata. You can search for Data Catalog data resources by using the following options:

  • Specific Resources

    Search Data Catalog by using the unique resource ID that is associated with a data resource.

  • Search Criteria

    Search Data Catalog for specific data resources by using criteria that you select from lists of the metadata available in your instance of Data Catalog.

  • Advanced Search

    Search Data Catalog by entering a JSON string that queries the Data Catalog business terms for specific data resources. For more information, see the Data Catalog REST API.

Note: If missing or incomplete search data is returned, you might need to change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

Specific Resources

Select Specific Resources if you know the unique identification key that Data Catalog assigned to a data resource.

Select Column Only if you want to search column-level metadata inside the specified data resources. Clear the Column Only check box if you want to search file metadata.

Search for specific resources

Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.

Option | Description
ID | Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search.
Add | Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog.
Delete | Click Delete to remove one or more selected resources from the Selected Files table.
Edit | Click Edit to make a selected resource in the Selected Files table editable.

Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.

Column | Description
Resource Name | Displays the name of the resource in Data Catalog.
ID | Displays the Data Catalog resource identification key.
Resource Type | Displays the data type associated with the resource in Data Catalog.
Origin | Displays the origin of the resource in Data Catalog, related to its lineage or point of origin in the cluster.

If you have restricted access to a data resource or a specific data type, PDI notifies you.

Search Criteria

Select Search Criteria if you know general information about the data resource. Results are filtered by the criteria selected.

Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.

Use the following options to specify search criteria to use for finding data resources.

Option | Description
Limit Number of Entities | Limit the number of entities that are shown per page in the search results.
Keyword | Enter a Data Catalog keyword to use for the search.
Business Terms | Select one or more Data Catalog business terms that might be assigned to the data resource.
Virtual Folders | Select a Data Catalog virtual folder where the data resource is located or might be located.
Data Sources | Select a Data Catalog data source that might be associated with the data resource.
Resource Type | Select a resource type to search.
Files Size | Select a file size range to search.
File Format | Select a specific file format to search for. Note: Only CSV file formats are supported for Catalog Input and Catalog Output steps.

If you have restricted access to a data resource or a specific data type, PDI notifies you.

Note: If missing or incomplete data is returned, use Advanced Search or change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.
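To see why a result limit can make search results look incomplete, the following Python sketch filters a small in-memory list of resources by criteria and then applies a limit. It is a conceptual model only: the record fields, the `search` function, and its parameters are hypothetical and do not represent the Data Catalog API or the PDI implementation.

```python
def search(resources, keyword=None, file_format=None,
           size_range=None, limit=None):
    """Return resources matching every supplied criterion, up to `limit`."""
    matches = []
    for res in resources:
        if keyword is not None and keyword.lower() not in res["name"].lower():
            continue
        if file_format is not None and res["format"] != file_format:
            continue
        if size_range is not None:
            low, high = size_range
            if not (low <= res["size_bytes"] <= high):
                continue
        matches.append(res)
    # A limit truncates the result list, which is why matching
    # resources can appear to be missing.
    return matches if limit is None else matches[:limit]


# Hypothetical catalog entries used only for this illustration.
CATALOG = [
    {"name": "sales_2023.csv", "format": "csv", "size_bytes": 1200},
    {"name": "sales_2024.csv", "format": "csv", "size_bytes": 5400},
    {"name": "sales_notes.txt", "format": "txt", "size_bytes": 300},
    {"name": "inventory.csv", "format": "csv", "size_bytes": 800},
]

if __name__ == "__main__":
    print(search(CATALOG, keyword="sales", file_format="csv"))
    print(search(CATALOG, keyword="sales", file_format="csv", limit=1))
```

In this model, the second call finds the same two matching resources as the first but returns only one of them, which mirrors the situation where raising the limit or narrowing the criteria brings the missing resources back.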

Advanced Search

To perform an advanced search of Data Catalog, enter a JSON string that is sent to the Data Catalog API. The transformation uses the JSON string and the Data Catalog APIs to run the search and return the results.

Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.

For more information, see the Data Catalog REST API.
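Because Advanced Search expects a JSON string, it can help to build and check the string before entering it in the step. The Python sketch below only verifies that the text is well-formed JSON. The keys `example_field` and `example_list` are placeholders invented for this illustration and are not Data Catalog API fields; the actual query schema is defined by the Data Catalog REST API.

```python
import json


def to_json_string(query):
    """Serialize a query dict to a compact JSON string."""
    return json.dumps(query, separators=(",", ":"))


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Placeholder query: these keys are NOT real Data Catalog API fields.
placeholder_query = {"example_field": "customer", "example_list": ["csv"]}

if __name__ == "__main__":
    text = to_json_string(placeholder_query)
    print(text, is_valid_json(text))
```

A check like this catches common mistakes such as single-quoted keys or a trailing comma, but it cannot confirm that the query is meaningful to Data Catalog.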

If you have restricted access to a data resource or a specific data type, PDI notifies you.