You can use the Catalog Input step to read CSV text files from a Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or S3 ecosystem, and then output the data as table rows that a transformation can use.
You must have role permissions set in Data Catalog to read the data resources.
For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.
Before you begin
Before you can use the Catalog Input step, you must complete the following tasks:
- Establish a Catalog connection to Data Catalog. For details, see Access to Data Catalog.
- To use S3 storage that is provided by Data Catalog, configure S3 as the Default S3 Connection in VFS Connections. For details, see Connecting to Virtual File Systems.
- Establish a PDI connection to each cluster that you plan to use. For example, to access HDFS, configure an HDFS driver for your distribution as a named connection. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.
The following options are general to the Catalog Input transformation step.
|Step Name||Specify a unique name for the Catalog Input step. You can customize the name or use the default name.|
|Connection||Select the name of your connection to Data Catalog from the Connection list. See Connecting to Virtual File Systems for details.|
The Catalog Input step includes settings for the metadata search method used and related property definitions.
You can use the Input tab to specify the search method that is used to find metadata about data that is read from Data Catalog. Searches are performed on the schema of a supported format type according to the specified search method.
You can search for Specific Resources if you know the ID of the data resource.
Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.
|ID||Enter the Data Catalog resource identifier. The resource must be profiled in Data Catalog before a search can find it.|
|Add||Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog.|
|Delete||Click Delete to remove one or more selected resources from the Selected Files table.|
|Edit||Click Edit to make a selected resource in the Selected Files table editable.|
Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.
|Resource Name||Displays the name of the resource in Data Catalog.|
|ID||Displays the Data Catalog resource identifier.|
|Resource Type||Displays the data type that is associated with the resource in Data Catalog.|
|Origin||Displays the origin of the resource in Data Catalog.|
Select Search Criteria if you know only general information about the resource. Results are filtered by the criteria selected.
Use the following options to specify search criteria to use for finding data resources.
|Limit Number of Entities||Limit the number of entities that are shown per page in the search results.|
|Keyword||Enter a keyword to use for the search.|
|Business Terms||Select one or more business terms to search for from the list.|
|Virtual Folders||Select a virtual folder to search for from the list.|
|Data Sources||Select a data source to search for from the list.|
|Resource Type||Select a resource type to search for from the list.|
|File Size||Select a file size range to search for from the list.|
|File Format||Select a specific file format to search for from the list. Only CSV text file formats are currently supported.|
Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. Results are filtered by the query parameters in the JSON string.
|Advanced Search||Enter a JSON string with API-specific query parameters to run against Data Catalog.|
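Because the Advanced Search field takes a single JSON string, it can help to build and check the string programmatically before pasting it into the step. The following Python sketch does only that, and assumes the query is a JSON object. The keys `searchPhrase` and `limit` are hypothetical placeholders, not documented Data Catalog parameters; take the real parameter names from the Lumada Data Catalog REST API reference.

```python
import json


def to_query_string(params: dict) -> str:
    """Serialize query parameters to a compact JSON string."""
    return json.dumps(params, separators=(",", ":"))


def is_valid_json_object(text: str) -> bool:
    """Return True if text parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


# Hypothetical placeholder keys, used only to illustrate the shape of a query.
example_params = {"searchPhrase": "customers", "limit": 10}
query = to_query_string(example_params)
```

Checking the string with `is_valid_json_object` before running the step catches simple mistakes such as a missing quote or brace. It does not confirm that Data Catalog accepts the parameters.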
In the Fields tab, you can define properties for the fields of the data format being read.
See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.
When CSV data is read, the table defines the fields to read as input from the CSV file.
Enter the information for the Catalog Input step fields as shown in the following table.
|Name||The name of the field.|
|Type||The data type of the field.|
|Format||The number format. See Number formats for a description of format symbols.|
|Position||The position needed to process the 'Fixed' Filetype. The position field is zero-based, so the first character starts at position 0.|
|Length||The length of the field, which depends on the field type.|
The value of this field depends on the format:
|Currency||The currency symbol used by Data Catalog to represent the currency (for example, $ or €).|
|Decimal||A decimal point that is represented as either a dot '.' or a comma ',' (for example, 5,000.00 or 5.000,00).|
|Null if||Sets the field to null (empty) if the string is equal to the specified value.|
|Default||The default value that is used if the field in the CSV file is empty.|
|Catalog Type||The data type as defined in Data Catalog. For example, |
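The Decimal, Null if, and Default settings described in the table above can be illustrated with a short Python sketch. This is not how the step is implemented; the function names are hypothetical, and two details are assumptions: the symbol that is not chosen as the decimal point is treated as a grouping separator, and the Default check runs before the Null if check.

```python
from typing import Optional


def parse_number(text: str, decimal: str = ".") -> float:
    """Parse a number whose decimal symbol is '.' or ','.

    Assumption: the other symbol is a grouping separator and is removed.
    """
    if decimal not in (".", ","):
        raise ValueError("decimal must be '.' or ','")
    grouping = "," if decimal == "." else "."
    return float(text.replace(grouping, "").replace(decimal, "."))


def apply_null_and_default(
    raw: str,
    null_if: Optional[str] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Apply the Default and Null if rules to one raw CSV value."""
    if raw == "":
        # Default: used when the field in the CSV file is empty.
        return default
    if null_if is not None and raw == null_if:
        # Null if: the value becomes null when it equals the given string.
        return None
    return raw
```

For example, `parse_number("5.000,00", decimal=",")` and `parse_number("5,000.00")` both yield the same value, which mirrors the two notations given in the Decimal row.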
Use the following options to retrieve and format field information.
|Get Fields||Click Get Fields to retrieve a list of fields derived from the source file in Data Catalog.|
|Minimal width||Click Minimal width to minimize the field length by removing unnecessary characters. If Minimal width is used, string fields are no longer padded to their specified length.|
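As a rough illustration of zero-based positions and of what removing padding means for string fields, the following Python sketch slices one field out of a fixed-width record. The function name and the sample record are hypothetical, and the sketch assumes padding consists of trailing spaces; the step itself may handle padding differently.

```python
def read_fixed_field(
    record: str,
    position: int,
    length: int,
    minimal_width: bool = False,
) -> str:
    """Return the field that starts at a zero-based position.

    Assumption: padding is trailing spaces, removed when minimal_width is set.
    """
    value = record[position:position + length]
    return value.rstrip(" ") if minimal_width else value


# Hypothetical record: a 10-character name field followed by a 2-character age.
record = "Ada".ljust(10) + "36"

padded_name = read_fixed_field(record, position=0, length=10)
trimmed_name = read_fixed_field(record, position=0, length=10, minimal_width=True)
age = read_fixed_field(record, position=10, length=2)
```

Here `padded_name` keeps its full 10-character width, while `trimmed_name` contains only the characters that carry data.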