Catalog Input


You can use the Catalog Input step to read CSV text files or Parquet data from a Lumada Data Catalog resource stored in a Hadoop or S3 ecosystem. The step outputs the data payload as rows for use by a transformation.

CSV files include formats generated by spreadsheets and fixed-width flat files. Parquet data formats are decoded and the fields are extracted using the schema defined in the Parquet source files.

Catalog Input can be used with the Catalog Output transformation step to gather data from various Data Catalog resources and move that data into Hadoop or S3 storage.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text file and Parquet data formats are currently supported.

Before you begin

Before using the Catalog Input step, be aware of the following conditions:

You must have an established Catalog connection to Data Catalog. For details, see Access to Lumada Data Catalog.
S3 must be configured as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
You must have an established PDI connection to the cluster(s) you plan on using. For example, a Hadoop driver must be configured as a named connection for your distribution for accessing HDFS. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.

General

The following fields are general to this transformation step.

Field	Description
Step Name	Specify the unique name of the Catalog Input step on the canvas. You can customize the name or leave it as the default.
Connection	Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

Input tab

Use the Input tab to specify the search method used to find the metadata about the payload to read from Data Catalog. Searches are performed on the schema of a supported format type according to the selected method.

Select Specific Resources if you know the key value of the data resource.

Specific Resources method

Field	Description
ID	Enter the Data Catalog resource identifier. The resource must be profiled in Data Catalog to be available.
Add (button)	Click to retrieve the profiled data associated with the Resource ID in Data Catalog.
Delete (button)	Click to remove a selected resource from the Selected Files table.
Edit (button)	Click to edit a selected resource in the Selected Files table.

Results of the search are displayed in the Selected Files table, which provides details about the resources that were found. You can use the Edit button to modify a resource detail or Delete to remove a resource from the search.

Column	Description
Resource Name	Displays the name of the resource in Data Catalog.
ID	Displays the Data Catalog resource identifier.
Resource Type	Displays the data type associated with the resource in Data Catalog.
Origin	Displays the origin of the resource in Data Catalog.

Select Search Criteria if you know general information about the resource. The search filters all the results by the criteria selected. Enter a keyword in the Keyword field or select criteria from a drop-down menu, then click ADD to include it in the search.

Search Criteria method

Field	Description
Keyword	Enter a keyword to use for the search.
Tags	Select a tag or tags to search for using the drop-down menu.
Virtual Folders	Select a virtual folder to search using the drop-down menu. Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Data Sources	Select a data source to search using the drop-down menu. Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Resource Type	Select a resource type to search using the drop-down menu. Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
File Size	Select a file size range to search using the drop-down menu.
File Format	Select a specific file format to search for using the drop-down menu. Only CSV and Parquet file formats are currently supported.
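
As a rough illustration of how combining criteria narrows a result set, the following Python sketch filters a small in-memory list of resources. The resource records, their keys, and the `matches` and `search` helpers are hypothetical, and treating multiple criteria as a logical AND is an assumption. The step itself runs the search in Data Catalog, not locally.

```python
# Hypothetical resource records; the keys are illustrative, not Data Catalog's schema.
RESOURCES = [
    {"name": "sales_2023.csv", "tags": {"finance"}, "format": "CSV"},
    {"name": "sales_2023.parquet", "tags": {"finance", "archive"}, "format": "Parquet"},
    {"name": "hr_roster.csv", "tags": {"hr"}, "format": "CSV"},
]


def matches(resource, keyword=None, tags=None, file_format=None):
    """Return True when the resource satisfies every criterion that was supplied."""
    if keyword is not None and keyword.lower() not in resource["name"].lower():
        return False
    if tags and not set(tags) <= resource["tags"]:
        return False
    if file_format is not None and resource["format"] != file_format:
        return False
    return True


def search(resources, **criteria):
    """Keep only the resources that match all criteria (assumed AND semantics)."""
    return [r for r in resources if matches(r, **criteria)]


found = search(RESOURCES, keyword="sales", tags=["finance"], file_format="CSV")
print([r["name"] for r in found])
```

Each added criterion can only shrink the result list in this sketch, which mirrors the description of the search filtering all results by the criteria selected.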

Select Advanced Search to search using a JSON string. The search filters all the results by the query parameters in the string. For more information, see Lumada Data Catalog REST API.

Advanced Search method

Field	Description
Advanced Search	Enter a JSON string of API-specific query parameters to run against Data Catalog.
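
Because the Advanced Search field expects a JSON string, it can help to build and validate the string programmatically before pasting it in. The Python sketch below does this with the standard `json` module. The keys `exampleKeyword` and `exampleLimit` are placeholders invented for illustration, not Data Catalog query parameters; the actual parameter names are defined by the Lumada Data Catalog REST API.

```python
import json

# Placeholder parameters: these keys are assumptions for illustration only.
query = {
    "exampleKeyword": "sales",
    "exampleLimit": 50,
}

# Serialize to a compact JSON string suitable for pasting into a text field.
query_string = json.dumps(query, separators=(",", ":"))
print(query_string)

# Round-trip the string to confirm that it is well-formed JSON.
parsed = json.loads(query_string)
```

Serializing from a dictionary avoids hand-typing quotes and commas, which are the most common sources of malformed JSON.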

Fields tab

In the Fields tab, you can define properties for the fields of the data type being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

Column	Description
Name	The name of the field.
Type	The data type of the field.
Format	Enter the number format. See Number formats for a description of format symbols.
Position	Required when processing the 'Fixed' file type. The position is zero-based, so the first character is at position `0`.
Length	The value of this field depends on the field type. Number: total number of significant figures in the number. String: total length of the string. Date: length of the printed output of the string; for example, `4` returns only the year.
Precision	The value of this field depends on the field type. Number: number of floating-point digits. String, Date, Boolean: unused.
Currency	Used to interpret the symbol used to represent currencies (for example, `$` or `€`).
Decimal	Used to represent a decimal point, either a dot '`.`' or a comma '`,`' (for example, `5,000.00` or `5.000,00`).
Null if	Sets the field to null (empty) if the string is equal to the specified value.
Default	Specifies the default value to use when the field in the CSV file is empty.
CatalogType	The data type as defined in Data Catalog. For example, `UTF8`.
Get Fields (button)	Click to retrieve a list of fields derived from the source file in Data Catalog.
Minimal width (button)	Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.
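
To make the Position, Length, Null if, Default, and Decimal columns more concrete, here is a minimal Python sketch of how such settings could be applied to one fixed-width line. It is not how PDI implements the step: the `FieldSpec` class, the helper functions, and the sample line are hypothetical, and trimming surrounding whitespace and treating the non-decimal symbol as a grouping separator are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldSpec:
    """Hypothetical stand-in for one row of the CSV fields table."""
    name: str
    position: int                  # zero-based offset of the first character
    length: int                    # number of characters to read
    null_if: Optional[str] = None  # value that should be read as null
    default: Optional[str] = None  # value used when the field is empty


def read_fixed_width(line: str, specs: List[FieldSpec]) -> Dict[str, Optional[str]]:
    """Slice one fixed-width line into named fields."""
    row: Dict[str, Optional[str]] = {}
    for spec in specs:
        # Take `length` characters starting at `position`, then trim padding.
        raw = line[spec.position:spec.position + spec.length].strip()
        if raw == "" and spec.default is not None:
            value: Optional[str] = spec.default  # empty field: use the default
        elif spec.null_if is not None and raw == spec.null_if:
            value = None                         # matches "Null if": read as null
        else:
            value = raw
        row[spec.name] = value
    return row


def parse_number(text: str, decimal: str = ".") -> float:
    """Parse a number whose decimal symbol is '.' or ','.

    The other symbol is assumed to be a grouping separator and is removed.
    """
    grouping = "," if decimal == "." else "."
    return float(text.replace(grouping, "").replace(decimal, "."))


specs = [
    FieldSpec("id", 0, 4),
    FieldSpec("name", 4, 10),
    FieldSpec("status", 14, 6, null_if="N/A"),
    FieldSpec("region", 20, 4, default="EU"),
]
row = read_fixed_width("0042Widget    N/A   ", specs)
print(row)
print(parse_number("5,000.00", decimal="."), parse_number("5.000,00", decimal=","))
```

In the sample line, the `region` field lies past the end of the text, so it is read as empty and receives its default, while `status` matches its Null if value and becomes null.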

Parquet fields

When Parquet data is read, the table defines the fields to read as input from the Parquet file.

Enter the information for the Catalog Input step fields, as shown in the following table.

Column	Description
Path	The name of the field as it appears in the Parquet data file, and the Parquet data type. An associated PDI field type is provided in parentheses.
Name	The name of the input field.
Type	The type of the input field as detected by PDI.
Format	Specify the date format when the Type is Date.
Get Fields (button)	Click to retrieve a list of fields derived from the source file in Data Catalog.

Provide a path to a Parquet data file and click Get Fields. When the fields are retrieved, the Parquet type is converted to an applicable PDI type, as shown in the PDI types table. You can change the type by using the Type drop-down menu or by entering the type manually.

PDI types

The following table shows how Parquet data types map to PDI data types:

Parquet Type	PDI Type
ByteArray	Binary
Boolean	Boolean
Double	Number
Float	Number
FixedLengthByteArray	Binary
Decimal	BigNumber
Date	Date
Enum	String
Int8	Integer
Int16	Integer
Int32	Integer
Int64	Integer
Int96	Timestamp
UInt8	Integer
UInt16	Integer
UInt32	Integer
UInt64	Integer
UTF8	String
TimeMillis	Timestamp
TimestampMillis	Timestamp
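
For scripts that need to anticipate the PDI type of a Parquet column, the table above can be expressed as a lookup. The Python sketch below transcribes the table into a dictionary; the `PARQUET_TO_PDI` name and the `pdi_type_for` helper are hypothetical and are not part of PDI.

```python
# Parquet-to-PDI type mapping, transcribed from the PDI types table.
PARQUET_TO_PDI = {
    "ByteArray": "Binary",
    "Boolean": "Boolean",
    "Double": "Number",
    "Float": "Number",
    "FixedLengthByteArray": "Binary",
    "Decimal": "BigNumber",
    "Date": "Date",
    "Enum": "String",
    "Int8": "Integer",
    "Int16": "Integer",
    "Int32": "Integer",
    "Int64": "Integer",
    "Int96": "Timestamp",
    "UInt8": "Integer",
    "UInt16": "Integer",
    "UInt32": "Integer",
    "UInt64": "Integer",
    "UTF8": "String",
    "TimeMillis": "Timestamp",
    "TimestampMillis": "Timestamp",
}


def pdi_type_for(parquet_type: str) -> str:
    """Look up the PDI type for a Parquet type name; raise on unlisted names."""
    try:
        return PARQUET_TO_PDI[parquet_type]
    except KeyError:
        raise ValueError(
            f"No PDI mapping listed for Parquet type {parquet_type!r}"
        ) from None


print(pdi_type_for("Decimal"))
print(pdi_type_for("Int96"))
```

Raising an error for unlisted names, rather than guessing a fallback type, keeps the lookup limited to the mappings the table documents.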
