Catalog Input

You can use the Catalog Input step to read CSV text files from a Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or S3 ecosystem, and then output the data as table rows that a transformation can use.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.

Before you begin

Before you can use the Catalog Input step, you must complete the following tasks:

Establish a Catalog connection to Data Catalog. For details, see Access to Data Catalog.
If you want to use S3 storage that is provided by Data Catalog, you must configure S3 as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
Establish a PDI connection to each cluster that you plan to use. For example, to access HDFS, you must configure an HDFS driver as a named connection for your distribution. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.

General

The following options are general to the Catalog Input transformation step.

Option	Description
Step Name	Specify a unique name for the Catalog Input step. You can customize the name or use the default name.
Connection	Select the name of your connection to Data Catalog from the Connection list. See Connecting to Virtual File Systems for details.

Options

The Catalog Input step includes settings for the metadata search method used and related property definitions.

Input tab

You can use the Input tab to specify the search method that is used to find metadata about data that is read from Data Catalog. Searches are performed on the schema of a supported format type according to the specified search method.

Select Specific Resources if you know the ID of the data resource.

Specific Resources method

Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.

Option	Description
ID	Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search.
Add	Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog.
Delete	Click Delete to remove one or more selected resources from the Selected Files table.
Edit	Click Edit to make a selected resource in the Selected Files table editable.

Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.

Column	Description
Resource Name	Displays the name of the resource in Data Catalog.
ID	Displays the Data Catalog resource identifier.
Resource Type	Displays the data type that is associated with the resource in Data Catalog.
Origin	Displays the origin of the resource in Data Catalog.

Select Search Criteria if you know only general information about the resource. Results are filtered by the criteria selected.

Search Criteria method

Use the following options to specify search criteria to use for finding data resources.

Note: If missing or incomplete data is returned, you can use Advanced Search, or you might need to change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

Option	Description
Limit Number of Entities	Limit the number of entities that are shown per page in the search results.
Keyword	Enter a keyword to use for the search.
Business Terms	Select one or more business terms to search for from the list.
Virtual Folders	Select a virtual folder to search for from the list.
Data Sources	Select a data source to search for from the list.
Resource Type	Select a resource type to search for from the list.
File Size	Select a file size range to search for from the list.
File Format	Select a specific file format to search for from the list. Only CSV text file formats are currently supported.

Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. Results are filtered by the query parameters in the JSON string.

Advanced Search method

Option	Description
Advanced Search	Enter a JSON string with API-specific query parameters to run against Data Catalog.
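
Because the Advanced Search option takes a raw JSON string, it can help to check that the string is well formed before you paste it into the step. The following Python sketch does only that check. The keys `searchPhrase` and `limit` are hypothetical placeholders, not documented Data Catalog parameters, and the requirement that the top level be a JSON object is an assumption. Take the real parameter names from the Lumada Data Catalog REST API reference.

```python
import json


def validate_search_string(text: str) -> dict:
    """Parse a JSON string and confirm that its top level is a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"Not valid JSON: {err}") from err
    # Assumption: query parameters are passed as one JSON object.
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed


# Hypothetical parameter names, for illustration only.
query = json.dumps({"searchPhrase": "sales", "limit": 50})
print(validate_search_string(query))
```

A string that fails this check is not valid JSON for the Advanced Search field either. A string that passes is only syntactically valid; the API may still reject unknown parameters.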

Fields tab

In the Fields tab, you can define properties for the fields of the data format being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

Column	Description
Name	The name of the field.
Type	The data type of the field.
Format	The number format. See Number formats for a description of format symbols.
Position	The position of the field, which is needed to process the Fixed file type. Positions are zero-based, so the first character is at position `0`.
Length	The length of the field, which depends on the field type. Number: total number of significant figures in the number. String: total length of the string. Date: length of the printed output of the string. For example, `4` returns only the year.
Precision	The precision of the field, which depends on the field type. Number: number of floating point digits. String, Date, Boolean: unused.
Currency	The currency symbol used by Data Catalog to represent the currency (for example, `$` or `€`).
Decimal	The decimal symbol, which can be either a dot (`.`) or a comma (`,`). For example, `5,000.00` or `5.000,00`.
Null if	The value to treat as null. If the string in the field equals this value, the field is set to null (empty).
Default	The default value that is used if the field in the CSV file is empty.
Catalog Type	The data type as defined in Data Catalog. For example, `UTF8`.

Use the following options to retrieve and format field information.

Option	Description
Get Fields	Click Get Fields to retrieve a list of fields derived from the source file in Data Catalog.
Minimal width	Click Minimal width to minimize the field length by removing unnecessary characters. If Minimal width is used, string fields are no longer padded to their specified length.
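
To make the Decimal, Null if, and Default settings concrete, the following Python sketch applies comparable rules to a small CSV sample. It is an illustration under stated assumptions, not the step's actual implementation: the function `parse_number`, the sample data, the semicolon delimiter, and the order in which Default and Null if are applied are all hypothetical.

```python
import csv
import io
from decimal import Decimal
from typing import Optional


def parse_number(raw: str, decimal_symbol: str = ".",
                 null_if: Optional[str] = None,
                 default: Optional[str] = None) -> Optional[Decimal]:
    """Interpret one CSV cell using Decimal, Null if, and Default-style settings."""
    value = raw.strip()
    if value == "" and default is not None:
        value = default  # Default: used when the CSV field is empty
    # Assumption: an empty cell with no default is treated as null.
    if value == "" or (null_if is not None and value == null_if):
        return None  # Null if: a matching string becomes null
    grouping = "," if decimal_symbol == "." else "."
    # Drop grouping symbols, then normalize the decimal symbol to a dot.
    normalized = value.replace(grouping, "").replace(decimal_symbol, ".")
    return Decimal(normalized)


# Hypothetical sample with a comma as the decimal symbol.
sample = "id;amount\n1;5.000,00\n2;N/A\n3;\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
amounts = [
    parse_number(row["amount"], decimal_symbol=",", null_if="N/A", default="0")
    for row in rows
]
print(amounts)  # [Decimal('5000.00'), None, Decimal('0')]
```

In the sample, the first amount is parsed with the comma as the decimal symbol, the second matches the Null if value, and the third is empty and falls back to the default.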
