Catalog Input

Use the Catalog Input step to read a Lumada Data Catalog resource stored in a Hadoop or S3 ecosystem, in either CSV text or Parquet format, and output its data payload as rows for use in a transformation.

CSV files include formats generated by spreadsheets and fixed-width flat files. Parquet data formats are decoded and the fields are extracted using the schema defined in the Parquet source files.

Catalog Input can be used with the Catalog Output transformation step to gather data from various Data Catalog resources and move that data into Hadoop or S3 storage.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text file and Parquet data formats are currently supported.

Before you begin

Before using the Catalog Input step, be aware of the following conditions:

General

The following fields are general to this transformation step.

Field | Description
Step Name | Specify the unique name of the Catalog Input step on the canvas. You can customize the name or leave it as the default.
Connection | Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

Input tab

Use the Input tab to specify the search method used to find the metadata about the payload to read from Data Catalog. Searches are performed on the schema of a supported format type according to the selected method.

Select Specific Resources if you know the key value of the data resource.

Specific Resources method

Field | Description
ID | Enter the Data Catalog resource identifier. The resource must be profiled in Data Catalog to be available.
Add (button) | Click to retrieve the profiled data associated with the Resource ID in Data Catalog.
Delete (button) | Click to remove a selected resource from the Selected Files table.
Edit (button) | Click to edit a selected resource in the Selected Files table.

Results of the search are displayed in the Selected Files table, which provides details about the resources that were found. You can use the Edit button to modify a resource detail or Delete to remove a resource from the search.

Column | Description
Resource Name | Displays the name of the resource in Data Catalog.
ID | Displays the Data Catalog resource identifier.
Resource Type | Displays the data type associated with the resource in Data Catalog.
Origin | Displays the origin of the resource in Data Catalog.

Select Search Criteria if you know general information about the resource. The search filters all the results by the criteria selected. Enter a keyword in the Keyword field or select criteria from a drop-down menu, then click ADD to include it in the search.

Search Criteria method

Field | Description
Keyword | Enter a keyword to use for the search.
Tags | Select one or more tags to search for using the drop-down menu.
Virtual Folders | Select a virtual folder to search using the drop-down menu.
Data Sources | Select a data source to search using the drop-down menu.
Resource Type | Select a resource type to search using the drop-down menu.
File Size | Select a file size range to search using the drop-down menu.
File Format | Select a specific file format to search for using the drop-down menu. Only CSV and Parquet file formats are currently supported.

Note: When searching by Virtual Folders, Data Sources, or Resource Type, if missing or incomplete data is returned, you can use Advanced Search, or you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.

Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. The search filters all the results by the query parameters.

Advanced Search method

Field | Description
Advanced Search | Enter a JSON string of API-specific query parameters to run against Data Catalog.
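
As a rough illustration of preparing an Advanced Search string, the following Python sketch serializes a dictionary to a JSON string and checks that the result parses as a JSON object before it is pasted into the field. The parameter names (`hypothetical_keyword`, `hypothetical_file_format`) are placeholders invented for this example and are not documented Data Catalog parameters; refer to the Lumada Data Catalog REST API for the actual query parameters.

```python
import json


def build_query_string(params: dict) -> str:
    """Serialize query parameters to a compact JSON string."""
    return json.dumps(params, separators=(",", ":"), sort_keys=True)


def is_valid_json_object(text: str) -> bool:
    """Return True if text parses as a JSON object (not a list or scalar)."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


# Hypothetical parameter names, for illustration only.
query = build_query_string({
    "hypothetical_keyword": "sales",
    "hypothetical_file_format": "parquet",
})

if is_valid_json_object(query):
    print(query)
```

Validating the string first helps catch a missing brace or quote before the search is run.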

Fields tab

In the Fields tab, you can define properties for the fields of the data type being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

Column | Description
Name | The name of the field.
Type | The data type of the field.
Format | Enter the number format. See Number formats for a description of format symbols.
Position | Required when processing the 'Fixed' file type. The position is zero-based, so the first character is at position 0.
Length | Depends on the field type. Number: total number of significant figures. String: total length of the string. Date: length of the printed output of the string (for example, 4 returns only the year).
Precision | Depends on the field type. Number: number of floating point digits. String, Date, Boolean: unused.
Currency | The symbol used to represent currencies (for example, $ or €).
Decimal | The symbol used to represent a decimal point, either a dot '.' or a comma ',' (for example, 5,000.00 or 5.000,00).
Null if | The field is set to null (empty) if the string is equal to the specified value.
Default | The default value to use when the field in the CSV file is not specified (empty).
CatalogType | The data type as defined in Data Catalog. For example, UTF8.
Get Fields (button) | Click to retrieve a list of fields derived from the source file in Data Catalog.
Minimal width (button) | Click to minimize the field length by removing unnecessary characters. If selected, string fields are no longer padded to their specified length.
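
To make the Position, Length, Decimal, Null if, and Default columns more concrete, here is a minimal Python sketch of how such settings could be applied to one line of a fixed-width file. It is not PDI's implementation: the `FieldSpec` class, the helper names, the trimming of padding, and the order in which Null if and Default are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Union


@dataclass
class FieldSpec:
    """Hypothetical stand-in for one row of the CSV fields table."""
    name: str
    position: int                   # zero-based offset of the first character
    length: int                     # number of characters to read
    null_if: Optional[str] = None   # value that is read as null
    default: Optional[str] = None   # value used when the field is empty
    decimal: Optional[str] = None   # "." or "," for numbers, None for strings


def parse_number(text: str, decimal_symbol: str) -> Decimal:
    """Parse a number; the symbol that is not the decimal is treated as grouping."""
    grouping = "," if decimal_symbol == "." else "."
    normalized = text.replace(grouping, "").replace(decimal_symbol, ".")
    return Decimal(normalized)


def read_field(line: str, spec: FieldSpec) -> Union[str, Decimal, None]:
    """Slice one field out of a fixed-width line and apply the spec."""
    # Assumption: padding spaces are trimmed from the extracted value.
    raw = line[spec.position:spec.position + spec.length].strip()
    if spec.null_if is not None and raw == spec.null_if:
        return None
    if raw == "":
        if spec.default is None:
            return None
        raw = spec.default
    if spec.decimal is not None:
        return parse_number(raw, spec.decimal)
    return raw


# Build a sample line: 10-character name, 8-character amount, 5-character note.
line = "ACME".ljust(10) + "5.000,00" + "N/A".ljust(5)

name = FieldSpec("name", position=0, length=10)
amount = FieldSpec("amount", position=10, length=8, decimal=",")
note = FieldSpec("note", position=18, length=5, null_if="N/A")
region = FieldSpec("region", position=23, length=3, default="EU")

for spec in (name, amount, note, region):
    print(spec.name, read_field(line, spec))
```

In this sample, the `region` field lies past the end of the line, so the slice is empty and the default value is used instead.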

Parquet fields

When Parquet data is read, the table defines the fields to read as input from the Parquet file.

Enter the information for the Catalog Input step fields, as shown in the following table.

Column | Description
Path | The name of the field as it appears in the Parquet data file, and the Parquet data type. An associated PDI field type is provided in parentheses.
Name | The name of the input field.
Type | The type of the input field as detected by PDI.
Format | Specify the date format when the Type specified is Date.
Get Fields (button) | Click to retrieve a list of fields derived from the source file in Data Catalog.

Provide a path to a Parquet data file and click Get Fields. When the fields are retrieved, the Parquet type is converted to an applicable PDI type, as shown in the PDI types table. You can change the type by using the Type drop-down menu or by entering the type manually.

PDI types

The following table shows how each Parquet type maps to a PDI type:

Parquet Type | PDI Type
ByteArray | Binary
Boolean | Boolean
Double | Number
Float | Number
FixedLengthByteArray | Binary
Decimal | BigNumber
Date | Date
Enum | String
Int8 | Integer
Int16 | Integer
Int32 | Integer
Int64 | Integer
Int96 | Timestamp
UInt8 | Integer
UInt16 | Integer
UInt32 | Integer
UInt64 | Integer
UTF8 | String
TimeMillis | Timestamp
TimestampMillis | Timestamp
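
For scripts that need to anticipate which PDI type a Parquet field will receive, the PDI types table can be expressed as a simple lookup. The Python sketch below copies the table's values as-is; the dictionary name `PARQUET_TO_PDI` and the helper `pdi_type_for` are hypothetical names for this example and are not part of PDI.

```python
# Mapping copied from the PDI types table (Parquet type -> PDI type).
PARQUET_TO_PDI = {
    "ByteArray": "Binary",
    "Boolean": "Boolean",
    "Double": "Number",
    "Float": "Number",
    "FixedLengthByteArray": "Binary",
    "Decimal": "BigNumber",
    "Date": "Date",
    "Enum": "String",
    "Int8": "Integer",
    "Int16": "Integer",
    "Int32": "Integer",
    "Int64": "Integer",
    "Int96": "Timestamp",
    "UInt8": "Integer",
    "UInt16": "Integer",
    "UInt32": "Integer",
    "UInt64": "Integer",
    "UTF8": "String",
    "TimeMillis": "Timestamp",
    "TimestampMillis": "Timestamp",
}


def pdi_type_for(parquet_type: str) -> str:
    """Return the PDI type listed for a Parquet type, or raise ValueError."""
    try:
        return PARQUET_TO_PDI[parquet_type]
    except KeyError:
        raise ValueError(
            f"No PDI mapping listed for Parquet type {parquet_type!r}"
        ) from None


print(pdi_type_for("Int96"))
```

Raising an error for unlisted types, rather than guessing a fallback, keeps the lookup faithful to the table.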