Hitachi Vantara Lumada and Pentaho Documentation

Read Metadata

 

Use the Read Metadata step to search for and retrieve the metadata that Data Catalog holds for specific data resources registered in Data Catalog.

You can create a transformation that searches Data Catalog for metadata that points to data in CSV files that are stored in HDFS or S3. You can then pass all the associated metadata, including the location of the data, to other steps in your transformation for processing.

For example, you can use the Read Metadata step to retrieve the metadata for the location of a data file and then pass that location metadata to a Text File input step or a Catalog Input step. The Text File input step or Catalog Input step uses the location metadata to retrieve data from the file for an ETL operation on the data.
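The handoff described above can be pictured with a small Python sketch. It is not PDI code: the record layout, the `location` key, and the helper name `read_rows_from_metadata` are hypothetical stand-ins for the fields that the Read Metadata step passes downstream, and a local temporary CSV file stands in for a file stored in HDFS or S3.

```python
import csv
import os
import tempfile


def read_rows_from_metadata(metadata_row):
    """Open the CSV file named by the (hypothetical) 'location' field and return its rows."""
    with open(metadata_row["location"], newline="") as handle:
        return list(csv.DictReader(handle))


# Create a small local CSV file that stands in for a file stored in HDFS or S3.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "customers.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "Ada"])
    writer.writerow(["2", "Grace"])

# Hypothetical metadata record, shaped like what an upstream step might emit.
metadata_row = {"resource_name": "customers.csv", "location": csv_path}

# The downstream step uses only the location metadata to find and read the data.
rows = read_rows_from_metadata(metadata_row)
print(rows)
```

The point of the sketch is the separation of concerns: the first stage supplies only metadata, and the reading stage needs nothing beyond the location to fetch the data.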

The Read Metadata step includes search options to identify, locate, and retrieve the metadata associated with the available data resources listed in Data Catalog.

For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.

Before you begin

 

Before you can use the Read Metadata step, you must establish a VFS connection to Data Catalog. For more information, see Access to Data Catalog. You must also have role permissions set in Data Catalog to read the metadata that is associated with the data resources that are registered in Data Catalog.

General

 
Figure: Read Metadata step

The following options are general to the Read Metadata transformation step.

 

| Option | Description |
|---|---|
| Step name | Specify a unique name for the Read Metadata step. You can customize the name or use the default name. |
| Connection | Select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details. |

Options

 

Use the Read Metadata step to search for data resources and associated metadata. You can search for Data Catalog data resources by using the following options:

  • Specific Resources

    Search Data Catalog by using the unique resource ID that is associated with a data resource.

  • Search Criteria

    Search Data Catalog for specific data resources by using criteria that you select from lists of the metadata available in your instance of Data Catalog.

  • Advanced Search

    Search Data Catalog by entering a JSON string that queries the Data Catalog business terms for specific data resources. For more information, see the Data Catalog REST API.

Note: If missing or incomplete search data is returned, you might need to change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

Specific Resources

 

Select Specific Resources if you know the unique identification key that Data Catalog assigned to a data resource.

Select Column Only if you want to search column-level metadata inside the specified data resources. Clear the Column Only check box if you want to search file metadata.

Figure: Search for specific resources

Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.

| Option | Description |
|---|---|
| ID | Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search. |
| Add | Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog. |
| Delete | Click Delete to remove one or more selected resources from the Selected Files table. |
| Edit | Click Edit to make a selected resource in the Selected Files table editable. |

Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.

| Column | Description |
|---|---|
| Resource Name | Displays the name of the resource in Data Catalog. |
| ID | Displays the Data Catalog resource identification key. |
| Resource Type | Displays the data type associated with the resource in Data Catalog. |
| Origin | Displays the origin of the resource in Data Catalog, related to its lineage or point of origin in the cluster. |

If you have restricted access to a data resource or a specific data type, PDI notifies you.

Search Criteria

 

Select Search Criteria if you know general information about the data resource. Results are filtered by the criteria selected.

Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.

Figure: Specify search criteria to find a data resource

Use the following options to specify search criteria to use for finding data resources.

| Option | Description |
|---|---|
| Limit Number of Entities | Limit the number of entities that are shown per page in the search results. |
| Keyword | Enter a Data Catalog keyword to use for the search. |
| Business Terms | Select one or more Data Catalog business terms that might be assigned to the data resource. |
| Virtual Folders | Select a Data Catalog virtual folder where the data resource is located or might be located. |
| Data Sources | Select a Data Catalog data source that might be associated with the data resource. |
| Resource Type | Select a resource type to search. |
| Files Size | Select a file size range to search. |
| File Format | Select a specific file format to search for. |

Note: Only CSV file formats are supported for Catalog Input and Catalog Output steps.

If you have restricted access to a data resource or a specific data type, PDI notifies you.

Note: If missing or incomplete data is returned, use Advanced Search or change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

Advanced Search

 

To perform an advanced search of Data Catalog, enter a JSON string to communicate with the Data Catalog API. The transformation uses the JSON string and the Data Catalog APIs to run the search and return the results.

Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.

For more information, see the Data Catalog REST API.
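As a rough illustration of preparing such a string, the Python sketch below builds a JSON document and checks that it parses before you paste it into the step. The key names (`keyword`, `resource_type`, `limit`) and the helper `build_search_json` are hypothetical placeholders, not the actual Data Catalog query schema; take the real field names from the Data Catalog REST API.

```python
import json


def build_search_json(keyword, resource_type, limit):
    """Serialize hypothetical search criteria to a JSON string."""
    query = {
        "keyword": keyword,              # hypothetical key, not the real schema
        "resource_type": resource_type,  # hypothetical key, not the real schema
        "limit": limit,                  # hypothetical key, not the real schema
    }
    return json.dumps(query, indent=2)


search_json = build_search_json("customer", "file", 50)

# Round-trip the string to confirm it is well-formed JSON before using it.
parsed = json.loads(search_json)
print(search_json)
```

Generating the string with a JSON library rather than typing it by hand avoids quoting and escaping mistakes, which are a common cause of rejected queries.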

Figure: Advanced search option for entering a JSON string

If you have restricted access to a data resource or a specific data type, PDI notifies you.