Hitachi Vantara Lumada and Pentaho Documentation

Catalog Input

You can use the Catalog Input step to read CSV text files from a Data Catalog resource that is stored in a Hadoop Distributed File System (HDFS) or S3 ecosystem, and then output the data as table rows that can be used by a transformation.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.

Note: The Catalog Input step is supported on the PDI engine but not on the Spark engine. Only CSV text file formats are currently supported.

Before you begin

Before you can use the Catalog Input step, you must complete the following tasks:

  • Establish a Catalog connection to Data Catalog. For details, see Access to Data Catalog.
  • If you want to use S3 storage that is provided by Data Catalog, you must configure S3 as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
  • Establish a PDI connection to one or more clusters that you plan to use. For example, an HDFS driver must be configured as a named connection for your distribution for accessing HDFS. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.

General

The following options are general to the Catalog Input transformation step.

| Option | Description |
| --- | --- |
| Step Name | Specify a unique name for the Catalog Input step. You can customize the name or use the default name. |
| Connection | Select the name of your connection to Data Catalog from the Connection list. See Connecting to Virtual File Systems for details. |

Options

The Catalog Input step includes settings for the metadata search method used and related property definitions.

Input tab

You can use the Input tab to specify the search method that is used to find metadata about data that is read from Data Catalog. Searches are performed on the schema of a supported format type according to the specified search method.

Select Specific Resources if you know the ID of the data resource.

Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.

| Option | Description |
| --- | --- |
| ID | Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search. |
| Add | Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog. |
| Delete | Click Delete to remove one or more selected resources from the Selected Files table. |
| Edit | Click Edit to make a selected resource in the Selected Files table editable. |

Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.

| Column | Description |
| --- | --- |
| Resource Name | Displays the name of the resource in Data Catalog. |
| ID | Displays the Data Catalog resource identifier. |
| Resource Type | Displays the data type that is associated with the resource in Data Catalog. |
| Origin | Displays the origin of the resource in Data Catalog. |

Select Search Criteria if you know only general information about the resource. Results are filtered by the criteria selected.

Use the following options to specify search criteria to use for finding data resources.

Note: If missing or incomplete data is returned, you can use Advanced Search or you might need to change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

| Option | Description |
| --- | --- |
| Limit Number of Entities | Limit the number of entities that are shown per page in the search results. |
| Keyword | Enter a keyword to use for the search. |
| Business Terms | Select one or more business terms to search for from the list. |
| Virtual Folders | Select a virtual folder to search for from the list. |
| Data Sources | Select a data source to search for from the list. |
| Resource Type | Select a resource type to search for from the list. |
| Files Size | Select a file size range to search for from the list. |
| File Format | Select a specific file format to search for from the list. Only CSV text file formats are currently supported. |

Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. Results are filtered by the query parameters in the JSON string.

| Option | Description |
| --- | --- |
| Advanced Search | Enter a JSON string with API-specific query parameters to run against Data Catalog. |
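
As a rough illustration of preparing a string for the Advanced Search field, the following Python sketch builds the query parameters as a dictionary and serializes them to JSON, which guards against hand-typed syntax errors such as missing quotes or trailing commas. The helper `build_query_string` and the parameter names `keyword` and `fileFormat` are hypothetical placeholders, not the documented Data Catalog schema; consult the Lumada Data Catalog REST API reference for the actual query parameters.

```python
import json


def build_query_string(params: dict) -> str:
    """Serialize query parameters to a compact JSON string.

    json.dumps raises TypeError if a value cannot be represented in JSON,
    so malformed input is caught before the string is pasted into the step.
    """
    return json.dumps(params, separators=(",", ":"), sort_keys=True)


# Hypothetical parameter names, used for illustration only.
example_params = {"keyword": "orders", "fileFormat": "CSV"}

query_string = build_query_string(example_params)
print(query_string)
```

Building the string programmatically and pasting the result into the field is one way to keep the JSON well-formed; parsing it back with `json.loads` is a quick check that nothing was lost.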

Fields tab

In the Fields tab, you can define properties for the fields of the data format being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

| Column | Description |
| --- | --- |
| Name | The name of the field. |
| Type | The data type of the field. |
| Format | The number format. See Number formats for a description of format symbols. |
| Position | The position needed to process the 'Fixed' Filetype. The position field is zero-based, so the first character starts at position '0'. |
| Length | The length of the field, according to the field type. Number: total number of significant figures in a number. String: total length of the string. Date: length of printed output of the string (for example, '4' returns only the year). |
| Precision | The value of this field depends on the format. Number: number of floating point digits. String, Date, Boolean: unused. |
| Currency | The currency symbol used by Data Catalog to represent the currency (for example, $ or €). |
| Decimal | A decimal point that is represented as either a dot '.' or a comma ',' (for example, 5,000.00 or 5.000,00). |
| Null if | Used to set as null (empty) if the string is equal to the specified value. |
| Default | The default value that is used if the field in the CSV file is empty. |
| Catalog Type | The data type as defined in Data Catalog. For example, UTF8. |
| Get Fields | You can click Get Fields to retrieve a list of fields derived from the source file in Data Catalog. |
| Minimal width | You can click Minimal width to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length. |
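
To make the Decimal, Null if, and Default settings more concrete, the following Python sketch shows one way such settings could be applied to a raw CSV value. It is an illustration under stated assumptions, not PDI's implementation: the function `parse_number`, its parameter names, the separate `grouping_symbol` argument, and the order in which Default and Null if are applied are all hypothetical.

```python
from decimal import Decimal
from typing import Optional


def parse_number(
    raw: str,
    decimal_symbol: str = ".",
    grouping_symbol: str = ",",  # assumption: thousands separator is configurable
    null_if: Optional[str] = None,
    default: Optional[str] = None,
) -> Optional[Decimal]:
    """Interpret one raw CSV value as a number, or return None for null."""
    # Default: substitute the default value when the CSV field is empty.
    if raw == "" and default is not None:
        raw = default
    # Null if: a value equal to the configured string is treated as null.
    if raw == "" or (null_if is not None and raw == null_if):
        return None
    # Drop grouping symbols first, then normalize the decimal symbol to a dot.
    normalized = raw.replace(grouping_symbol, "").replace(decimal_symbol, ".")
    return Decimal(normalized)


# Dot as decimal symbol, comma as grouping symbol.
print(parse_number("5,000.00"))
# Comma as decimal symbol, dot as grouping symbol.
print(parse_number("5.000,00", decimal_symbol=",", grouping_symbol="."))
# A sentinel string mapped to null.
print(parse_number("N/A", null_if="N/A"))
```

Both numeric calls yield the same `Decimal` value, which is the point of the Decimal setting: the same quantity can be written as 5,000.00 or 5.000,00 depending on locale.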

Use the following options to retrieve and format field information.

| Option | Description |
| --- | --- |
| Get Fields | Click Get Fields to retrieve a list of fields derived from the source file in Data Catalog. |
| Minimal width | Click Minimal width to minimize the field length by removing unnecessary characters. If Minimal width is used, string fields are no longer padded to their specified length. |
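
The padding behavior that Minimal width turns off can be pictured with a short Python sketch. This is only a conceptual illustration, assuming that padding means filling a string with trailing spaces up to its specified Length; the function `render_string` and its parameters are hypothetical and do not correspond to any PDI API.

```python
def render_string(value: str, length: int, minimal_width: bool = False) -> str:
    """Return a string field either padded to `length` or at its natural width."""
    if minimal_width:
        # Minimal width: no padding is added to reach the specified length.
        return value
    # Otherwise pad with trailing spaces up to the specified length.
    # Values already longer than `length` are returned unchanged by ljust.
    return value.ljust(length)


padded = render_string("abc", 6)
unpadded = render_string("abc", 6, minimal_width=True)
print(repr(padded), repr(unpadded))
```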