Read Metadata
You can use the Read Metadata step to search Data Catalog and retrieve the metadata that is associated with specific registered data resources.
You can create a transformation that searches Data Catalog for metadata that points to data in CSV files that are stored in HDFS or S3. You can then pass all the associated metadata, including the location of the data, to other steps in your transformation for processing.
For example, you can use the Read Metadata step to retrieve the metadata for the location of a data file and then pass that location metadata to a Text File input step or a Catalog Input step. The Text File input step or Catalog Input step uses the location metadata to retrieve data from the file for an ETL operation on the data.
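The hand-off described above, where one step looks up location metadata and a later step reads the file at that location, can be sketched outside PDI in a few lines of Python. This is only an illustration of the pattern, not how PDI implements it: the record layout and the `location` and `resource_name` keys are hypothetical, and a temporary local CSV file stands in for a file stored in HDFS or S3.

```python
import csv
import tempfile
from pathlib import Path


def read_rows_from_metadata(record):
    """Open the CSV file named by record["location"] and return its rows.

    The "location" key is a hypothetical stand-in for the location
    metadata that the Read Metadata step passes downstream.
    """
    location = Path(record["location"])
    with location.open(newline="") as handle:
        return list(csv.DictReader(handle))


# Usage: a temporary local file plays the role of the cataloged resource.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "customers.csv"
    sample.write_text("id,name\n1,Ada\n2,Grace\n")

    # Hypothetical metadata record, as a downstream step might receive it.
    record = {"resource_name": "customers.csv", "location": str(sample)}
    rows = read_rows_from_metadata(record)
```

The reader function never hard-codes a path. It relies entirely on the metadata record, which is the same separation the Read Metadata step provides for the Text File input and Catalog Input steps.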
The Read Metadata step includes search options to identify, locate, and retrieve the metadata associated with the available data resources listed in Data Catalog.
For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.
Before you begin
Before you can use the Read Metadata step, you must establish a VFS connection to Data Catalog. For more information, see Access to Data Catalog. You must also have role permissions set in Data Catalog to read the metadata that is associated with the data resources that are registered in Data Catalog.
General
The following options are general to the Read Metadata transformation step.
Option | Description |
Step name | Specify a unique name for the Read Metadata step. You can customize the name or use the default name. |
Connection | Select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details. |
Options
Use the Read Metadata step to search for data resources and associated metadata. You can search for Data Catalog data resources by using the following options:
Specific Resources
Search Data Catalog by using the unique resource ID that is associated with a data resource.
Search Criteria
Search Data Catalog for specific data resources by using criteria that you select from lists of the metadata available in your instance of Data Catalog.
Advanced Search
Search Data Catalog by entering a JSON string that queries the Data Catalog business terms for specific data resources. For more information, see the Data Catalog REST API.
Specific Resources
Select Specific Resources if you know the unique identification key that Data Catalog assigned to a data resource.
Select Column Only if you want to search column-level metadata inside the specified data resources. Clear the Column Only check box if you want to search file metadata.
Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.
Option | Description |
ID | Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search. |
Add | Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog. |
Delete | Click Delete to remove one or more selected resources from the Selected Files table. |
Edit | Click Edit to make a selected resource in the Selected Files table editable. |
Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.
Column | Description |
Resource Name | Displays the name of the resource in Data Catalog. |
ID | Displays the Data Catalog resource identification key. |
Resource Type | Displays the data type associated with the resource in Data Catalog. |
Origin | Displays the origin of the resource in Data Catalog, related to its lineage or point of origin in the cluster. |
If you have restricted access to a data resource or a specific data type, PDI notifies you.
Search Criteria
Select Search Criteria if you know general information about the data resource. Results are filtered by the criteria selected.
Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.
Use the following options to specify search criteria to use for finding data resources.
Option | Description |
Limit Number of Entities | Limit the number of entities that are shown per page in the search results. |
Keyword | Enter a Data Catalog keyword to use for the search. |
Business Terms | Select one or more Data Catalog business terms that might be assigned to the data resource. |
Virtual Folders | Select a Data Catalog virtual folder where the data resource is located or might be located. |
Data Sources | Select a Data Catalog data source that might be associated with the data resource. |
Resource Type | Select a resource type to search. |
File Size | Select a file size range to search. |
File Format | Select a specific file format to search for. Note: Only CSV file formats are supported for the Catalog Input and Catalog Output steps. |
If you have restricted access to a data resource or a specific data type, PDI notifies you.
Advanced Search
To perform an advanced search of Data Catalog, enter a JSON string that is passed to the Data Catalog API. The transformation uses the JSON string and the Data Catalog APIs to run the search and return the results.
Select Column Only if you want to search column-level metadata inside data resources. Clear the Column Only check box if you want to search file metadata.
For more information, see the Data Catalog REST API.
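Because the Advanced Search option expects a well-formed JSON string, it can help to build and validate the string programmatically before pasting it into the step. The following Python sketch shows one way to do that. The key names (`keyword`, `columnOnly`, `limit`) are hypothetical placeholders chosen for illustration. They are not taken from the Data Catalog REST API, so replace them with the fields that the API documentation specifies.

```python
import json


def build_search_json(keyword, column_only=False, limit=25):
    """Return a JSON search string built from the given values.

    All key names below are hypothetical placeholders, not the real
    Data Catalog request schema.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")

    query = {
        "keyword": keyword,         # hypothetical: free-text search term
        "columnOnly": column_only,  # hypothetical: mirrors the Column Only check box
        "limit": limit,             # hypothetical: maximum entities per page
    }

    text = json.dumps(query, indent=2)
    json.loads(text)  # round-trip to confirm the string parses as JSON
    return text


# Usage: print the string so it can be copied into the Advanced Search field.
print(build_search_json("customer", column_only=True, limit=10))
```

Generating the string with `json.dumps` rather than typing it by hand avoids common mistakes such as unquoted keys, trailing commas, and unescaped quotation marks.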