Hitachi Vantara Lumada and Pentaho Documentation

Catalog Input

 

You can use the Catalog Input step to read CSV text files or Parquet data from a Lumada Data Catalog resource stored in a Hadoop or S3 ecosystem. The step outputs the data payload as rows for use in a transformation.

CSV files include formats generated by spreadsheets and fixed-width flat files. Parquet data formats are decoded and the fields are extracted using the schema defined in the Parquet source files.

Catalog Input can be used with the Catalog Output transformation step to gather data from various Data Catalog resources and move that data into Hadoop or S3 storage.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

 

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text file and Parquet data formats are currently supported.

Before you begin

 

Before using the Catalog Input step, be aware of the following conditions:

General

 

The following fields are general to this transformation step.

 

Field Description
Step Name Specify the unique name of the Catalog Input step on the canvas. You can customize the name or leave it as the default.
Connection Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

Input tab

 

Use the Input tab to specify the search method used to find the metadata about the payload to read from Data Catalog. Searches are performed on the schema of a supported format type according to the selected method.

Select Specific Resources if you know the key value of the data resource.

Specific Resources method

 

Field Description
ID Enter the Data Catalog resource identifier. The resource must be profiled in Data Catalog to be available.
Add (button) Click to retrieve the profiled data associated with the Resource ID in Data Catalog.
Delete (button) Removes a selected resource from the Selected Files table.
Edit (button) Allows you to edit a selected resource from the Selected Files table.

Results of the search are displayed in the Selected Files table, which provides details about the resources that were found. You can use the Edit button to modify a resource detail or Delete to remove a resource from the search.

 

Column Description
Resource Name Displays the name of the resource in Data Catalog.
ID Displays the Data Catalog resource identifier.
Resource Type Displays the data type associated with the resource in Data Catalog.
Origin Displays the origin of the resource in Data Catalog.

Select Search Criteria if you know general information about the resource. The search filters all the results by the criteria selected. Enter a keyword in the Keyword field or select criteria from a drop-down menu, then click ADD to include it in the search.

Search Criteria method

 

Field Description
Keyword Enter a keyword to use for the search.
Tags Select a tag or tags to search for using the drop-down menu.
Virtual Folders Select a virtual folder to search using the drop-down menu.

 

Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Data Sources Select a data source to search using the drop-down menu.

 

Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Resource Type Select a resource type to search using the drop-down menu.

 

Note: If missing or incomplete data is returned, you can use Advanced Search or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
File Size Select a file size range to search using the drop-down menu.
File Format Select a specific file format to search for using the drop-down menu. Only CSV and Parquet file formats are currently supported.

Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. The search filters all the results by the query parameters.

Advanced Search method

 

Field Description
Advanced Search Enter a JSON string of API-specific query parameters to run against Data Catalog.
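
One way to avoid malformed input in the Advanced Search field is to build the JSON string programmatically and paste the result into the field. The sketch below only shows how to produce a syntactically valid JSON string; the key and value names are placeholders, not confirmed Data Catalog query parameters. See the Lumada Data Catalog REST API reference for the parameters the API actually accepts.

```python
import json

# Placeholder query parameters: these names are hypothetical and must be
# replaced with parameters documented in the Data Catalog REST API.
query = {
    "exampleParameter": "exampleValue",
    "anotherExampleParameter": ["valueA", "valueB"],
}

# Serialize to a compact JSON string suitable for pasting into the field.
advanced_search = json.dumps(query, separators=(",", ":"))

# Round-trip check: parsing the string must give back the same structure.
assert json.loads(advanced_search) == query

print(advanced_search)
```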

Fields tab

 

In the Fields tab, you can define properties for the fields of the data type being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

 

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

 

Column Description
Name The name of the field.
Type The data type of the field.
Format Enter the number format. See Number formats for a description of format symbols.
Position Required when processing the Fixed file type. The position is zero-based, so the first character is at position 0.
Length The value of this field depends on the format:

  • Number: Total number of significant figures in a number.
  • String: Total length of the string.
  • Date: Length of the printed output. For example, a length of 4 returns only the year.
Precision The value of this field depends on the format:

  • Number: Number of floating point digits.
  • String, Date, Boolean: Unused.
Currency Used to interpret the symbol used to represent currencies (for example, $ or €).
Decimal Used to represent a decimal point, either a dot '.' or a comma ',' (for example, 5,000.00 or 5.000,00).
Null if Used to set as null (empty) if the string is equal to the specified value.
Default Used to specify a default value to use in case the field in the CSV file was not specified (empty).
CatalogType The data type as defined in Data Catalog. For example, UTF8.
Get Fields (button) Click to retrieve a list of fields derived from the source file in Data Catalog.
Minimal width (button) Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.
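
As a rough illustration of how the Position, Length, Decimal, Null if, and Default columns interact, the following Python sketch parses one fixed-width record. It is not PDI code: the `FieldSpec` class, the `parse_field` and `parse_record` functions, and the sample record are hypothetical and only mimic the semantics described in the table. Two points are assumptions of this sketch rather than documented behavior: Null if is checked before Default, and an empty field with no default becomes null.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Union


@dataclass
class FieldSpec:
    """Hypothetical stand-in for one row of the CSV fields table."""
    name: str
    position: int                   # zero-based offset of the first character
    length: int                     # number of characters to read
    is_number: bool = False
    decimal: str = "."              # decimal symbol: "." or ","
    null_if: Optional[str] = None   # value that is treated as null
    default: Optional[str] = None   # value used when the field is empty


def parse_field(record: str, spec: FieldSpec) -> Union[str, Decimal, None]:
    # Position is zero-based, so Python slicing maps onto it directly.
    raw = record[spec.position:spec.position + spec.length].strip()
    if spec.null_if is not None and raw == spec.null_if:
        return None                 # "Null if" match
    if raw == "":
        if spec.default is None:
            return None             # assumption: empty without default is null
        raw = spec.default          # "Default" applies to empty fields
    if spec.is_number:
        # The grouping symbol is whichever of "." and "," is not the decimal.
        grouping = "," if spec.decimal == "." else "."
        return Decimal(raw.replace(grouping, "").replace(spec.decimal, "."))
    return raw


def parse_record(record: str, specs: list) -> dict:
    return {spec.name: parse_field(record, spec) for spec in specs}


specs = [
    FieldSpec("name", position=0, length=10),
    FieldSpec("amount", position=10, length=10, is_number=True, decimal=","),
    FieldSpec("status", position=20, length=5, null_if="N/A", default="NEW"),
]

# Build a 25-character fixed-width record: name, amount, status.
record = "widget".ljust(10) + "5.000,00".rjust(10) + "N/A".ljust(5)
row = parse_record(record, specs)
print(row)
```

With a comma as the decimal symbol, `5.000,00` is read as five thousand, and the `N/A` status is returned as null. If the status columns were blank instead, the sketch would return the default value `NEW`.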

Parquet fields

 

When Parquet data is read, the table defines the fields to read as input from the Parquet file.

Enter the information for the Catalog Input step fields, as shown in the following table.

 

Column Description
Path The name of the field as it appears in the Parquet data file, and the Parquet data type. An associated PDI field type is provided in parentheses.
Name The name of the input field.
Type The type of the input field as detected by PDI.
Format Specify the date format when the Type is Date.
Get Fields (button) Click to retrieve a list of fields derived from the source file in Data Catalog.

Provide a path to a Parquet data file and click Get Fields. When the fields are retrieved, the Parquet type is converted to an applicable PDI type, as shown in the PDI types table. You can change the type by using the Type drop-down menu or by entering the type manually.

PDI types
 

The following table shows how each Parquet type is converted to a PDI type:

Parquet Type PDI Type
ByteArray Binary
Boolean Boolean
Double Number
Float Number
FixedLengthByteArray Binary
Decimal BigNumber
Date Date
Enum String
Int8 Integer
Int16 Integer
Int32 Integer
Int64 Integer
Int96 Timestamp
UInt8 Integer
UInt16 Integer
UInt32 Integer
UInt64 Integer
UTF8 String
TimeMillis Timestamp
TimestampMillis Timestamp
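
The conversion table can be expressed as a simple lookup, which may be handy when you want to predict the PDI types of a Parquet schema before opening the step. This is only a sketch: PDI performs the conversion internally, and the names `PARQUET_TO_PDI` and `pdi_type_for` are hypothetical. Raising an error for unlisted types is a choice made for this example, not documented step behavior.

```python
# Parquet type -> PDI type, copied from the table above.
PARQUET_TO_PDI = {
    "ByteArray": "Binary",
    "Boolean": "Boolean",
    "Double": "Number",
    "Float": "Number",
    "FixedLengthByteArray": "Binary",
    "Decimal": "BigNumber",
    "Date": "Date",
    "Enum": "String",
    "Int8": "Integer",
    "Int16": "Integer",
    "Int32": "Integer",
    "Int64": "Integer",
    "Int96": "Timestamp",
    "UInt8": "Integer",
    "UInt16": "Integer",
    "UInt32": "Integer",
    "UInt64": "Integer",
    "UTF8": "String",
    "TimeMillis": "Timestamp",
    "TimestampMillis": "Timestamp",
}


def pdi_type_for(parquet_type: str) -> str:
    """Return the PDI type listed for a Parquet type name."""
    try:
        return PARQUET_TO_PDI[parquet_type]
    except KeyError:
        # Assumption of this sketch: types missing from the table are rejected.
        raise ValueError(f"No PDI mapping listed for: {parquet_type}") from None


# Example: predict PDI types for a small, made-up schema.
schema = {"id": "Int64", "price": "Decimal", "created": "TimestampMillis"}
print({field: pdi_type_for(ptype) for field, ptype in schema.items()})
```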