Hitachi Vantara Lumada and Pentaho Documentation

Read Metadata

You can use the Read Metadata step to search and retrieve any existing metadata in the Lumada Data Catalog that is associated with specific Data Catalog registered data resources.

Specifically, you could create a transformation that searches Data Catalog for existing metadata that points to data in CSV and Parquet files stored in HDFS or Amazon S3. You can then pass all the associated metadata, including the location of the data, to other steps within your transformation for processing.

For example, you could use the Read Metadata step to retrieve the metadata for a data file's cluster location and then pass the metadata to a Text File Input step or a Catalog Input step that retrieves the file's contents for an ETL operation on the data. The transformation can then write the new data contents back to the file or to a new file.

The Read Metadata step includes search options to identify, locate, and retrieve the metadata associated with the available data resources listed in Data Catalog.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text file and Parquet data formats are currently supported. You must have role permissions set in Data Catalog to read the data resources.

Before you begin

Before using the Read Metadata step, you must have an established VFS connection to Data Catalog. For more information, see Access to Lumada Data Catalog. In addition, you must have role permissions set in Data Catalog to read the metadata associated with the data resources registered in Data Catalog.

General

Read metadata step

The following fields are general to this transformation step.

Field | Description
Step name | Specify the unique name of the Read Metadata step on the canvas. You can customize the name or leave it as the default.
Connection | Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

Options

Use the Read Metadata step to search for data resources and associated metadata from the search criteria you specify. You can search for Data Catalog data resources in multiple ways:

  • Specific Resources

    Searches Data Catalog using the unique resource ID that is associated with a data resource.

  • Search Criteria

    Narrows Data Catalog searches for specific data resources using criteria that you select from lists of the existing metadata available in your instance of Data Catalog.

  • Advanced Search

    Creates a JSON script that finds the Data Catalog tags for specific data resources. For more information, see the Lumada Data Catalog REST API.

Note: If a search returns missing or incomplete data, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.

Specific Resources

Select Specific Resources if you know the unique identification key that Data Catalog assigned to a data resource.

Search for specific resources

Type the identification key into the ID field and click Add. The data resource must be profiled in Data Catalog to be available.

The metadata for each ID you add populates the Selected Files table. The metadata may include some or all of the following metadata types:

Field | Description
Resource Name | Displays the name of the resource in Data Catalog.
ID | Displays the Data Catalog resource identification key.
Resource Type | Displays the data type associated with the resource in Data Catalog.
Origin | Displays the origin of the resource in Data Catalog, related to its lineage or its point of origin in the cluster.
Delete (button) | Delete a line item from the Selected Files table by clicking the line item, then clicking Delete.
Edit (button) | Select a line item to edit in the Selected Files table by clicking the line item, then clicking Edit. The ID is inserted into the ID field, where you can modify it.

If you have restricted access to a data resource or a specific data type, PDI notifies you.

Search Criteria

Select Search Criteria if you know general information about the data resource. The search filters all the results by the criteria selected. Type a keyword in the Keyword field or select criteria from the drop-down menus, then click ADD to include it in the search.

Note: If missing or incomplete data is returned, use Advanced Search or change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.

Specify search criteria to find a data resource

Field | Description
Keyword | Enter a Data Catalog keyword to use for the search.
Tags | Select a Data Catalog tag or tags that might be assigned to the data resource.
Virtual Folders | Select a Data Catalog virtual folder in which the data resource is or might be included.
Data Sources | Select a Data Catalog data source that might be associated with the data resource.
Resource Type | Select a resource type to search using the drop-down menu.
Files Size | Select a file size range to search using the drop-down menu.
File Format | Select a specific file format to search for using the drop-down menu. The menu includes CSV and Parquet file formats to read the file format metadata within a transformation that includes a Catalog Input step or a Catalog Output step.

If you have restricted access to a data resource or a specific data type, PDI notifies you.
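
The way stacked criteria narrow a result set can be pictured with a small, self-contained sketch. It is not the Data Catalog implementation: the `Resource` record, its attribute names, the sample entries, and the rule that every supplied criterion must match (AND semantics) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical, simplified stand-in for a Data Catalog resource record."""
    name: str
    file_format: str
    tags: set = field(default_factory=set)
    data_source: str = ""


def matches(resource, keyword=None, tags=None, file_format=None, data_source=None):
    """Return True when the resource satisfies every criterion that was supplied.

    Criteria left as None are ignored. Combining criteria with AND is an
    assumption of this sketch, not documented product behavior.
    """
    if keyword is not None and keyword.lower() not in resource.name.lower():
        return False
    if tags and not set(tags) <= resource.tags:
        return False
    if file_format is not None and resource.file_format.lower() != file_format.lower():
        return False
    if data_source is not None and resource.data_source != data_source:
        return False
    return True


def search(resources, **criteria):
    """Filter a list of resources by the given criteria."""
    return [r for r in resources if matches(r, **criteria)]


# Illustrative sample entries only; none of these come from a real catalog.
SAMPLE = [
    Resource("sales_2023.csv", "CSV", {"finance"}, "hdfs-prod"),
    Resource("sales_2023.parquet", "Parquet", {"finance", "curated"}, "s3-lake"),
    Resource("hr_roster.csv", "CSV", {"hr"}, "hdfs-prod"),
]

if __name__ == "__main__":
    for hit in search(SAMPLE, keyword="sales", file_format="CSV"):
        print(hit.name)
```

Under these assumptions, each criterion added can only shrink the result set or leave it unchanged, which is why a broad keyword combined with a file format is a reasonable starting point when little is known about a resource.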

Advanced Search

To perform advanced searches of the Lumada Data Catalog, provide a JSON string to communicate with the Data Catalog API. The transformation runs the string using the Data Catalog APIs and provides you with the results. For more information, see the Lumada Data Catalog REST API.

Advanced search option for entering a JSON string

If you have restricted access to a data resource or a specific data type, PDI notifies you.
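
Because the Advanced Search field expects a JSON string, it can help to build and check the string outside PDI before pasting it in. The sketch below uses only the Python standard library. The key names `example_field` and `example_limit` are placeholders, not real Data Catalog API fields, and the requirement that the top-level value be a JSON object is also an assumption; take the actual request schema from the Lumada Data Catalog REST API reference.

```python
import json


def build_search_json(criteria):
    """Serialize a dict of search criteria to a JSON string with sorted keys."""
    if not isinstance(criteria, dict):
        raise TypeError("criteria must be a dict")
    return json.dumps(criteria, sort_keys=True)


def check_json(text):
    """Return (True, "") when text parses as a JSON object, else (False, reason)."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    # Assumption of this sketch: the search body is a JSON object, not a list or scalar.
    if not isinstance(parsed, dict):
        return False, "top-level value must be a JSON object"
    return True, ""


if __name__ == "__main__":
    # Placeholder keys for illustration; replace with fields from the REST API reference.
    payload = build_search_json({"example_field": "sales", "example_limit": 50})
    print(payload)
    print(check_json(payload))
    print(check_json('{"example_field": "sales",}'))  # trailing comma is invalid JSON
```

Checking the string first separates plain syntax mistakes (unbalanced braces, trailing commas, single quotes) from problems with the search itself, such as restricted access or a result limit that is too low.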