Hitachi Vantara Lumada and Pentaho Documentation

Catalog Input

 

You can use the Catalog Input step to read a Data Catalog resource in CSV text file format that is stored in a Hadoop Distributed File System (HDFS) or S3 ecosystem, and then output the data as table rows that can be used by a transformation.

You must have role permissions set in Data Catalog to read the data resources.

For more information about accessing Data Catalog in PDI, see PDI and Data Catalog.
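
To make the idea of "CSV text in, table rows out" concrete, the following minimal Python sketch parses CSV text into rows keyed by field name. It only illustrates the concept and is not how PDI implements the step. The sample data and the `csv_to_rows` name are assumptions made for this example.

```python
import csv
import io


def csv_to_rows(csv_text):
    """Parse CSV text (header line first) into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical sample content of a CSV resource.
sample = "id,name,amount\n1,Alice,10.50\n2,Bob,7.25\n"
rows = csv_to_rows(sample)
for row in rows:
    print(row)
```

Every value in this sketch is still a string. Type conversion corresponds to the field properties that are defined in the Fields tab.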

Before you begin

 

Before you can use the Catalog Input step, you must complete the following tasks:

  • Establish a Catalog connection to Data Catalog. For details, see Access to Data Catalog.
  • If you want to use S3 storage that is provided by Data Catalog, you must configure S3 as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
  • Establish a PDI connection to one or more clusters that you plan to use. For example, an HDFS driver must be configured as a named connection for your distribution for accessing HDFS. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.

General

 

The following options are general to the Catalog Input transformation step.

 

Option Description
Step Name Specify a unique name for the Catalog Input step. You can customize the name or use the default name.
Connection Select the name of your connection to Data Catalog from the Connection list. See Connecting to Virtual File Systems for details.

Options

 

The Catalog Input step includes settings for the metadata search method used and related property definitions.

Input tab

 

You can use the Input tab to specify the search method that is used to find metadata about data that is read from Data Catalog. Searches are performed on the schema of a supported format type according to the specified search method.

Select Specific Resources if you know the ID of the data resource.

Specific Resources method

Use the following options to search for specific resources in Data Catalog. The data resource must be profiled in Data Catalog to be added.

Option Description
ID Enter the Data Catalog resource identifier. The resource must first be profiled in Data Catalog before it can be found by a search.
Add Click Add to retrieve the profiled data that is associated with the Resource ID in Data Catalog.
Delete Click Delete to remove one or more selected resources from the Selected Files table.
Edit Click Edit to make a selected resource in the Selected Files table editable.

Results of the search are displayed in the Selected Files table, which provides details about the data resources that were found.

Column Description
Resource Name Displays the name of the resource in Data Catalog.
ID Displays the Data Catalog resource identifier.
Resource Type Displays the data type that is associated with the resource in Data Catalog.
Origin Displays the origin of the resource in Data Catalog.

Select Search Criteria if you know only general information about the resource. Results are filtered by the criteria selected.

Search Criteria method

Use the following options to specify search criteria to use for finding data resources.

Note: If missing or incomplete data is returned, you can use Advanced Search or you might need to change the default limit for returned results. See Data Catalog searches returning incomplete or missing data for information.

Option Description
Limit Number of Entities Limit the number of entities that are shown per page in the search results.
Keyword Enter a keyword to use for the search.
Business Terms Select one or more business terms to search for from the list.
Virtual Folders Select a virtual folder to search for from the list.
Data Sources Select a data source to search for from the list.
Resource Type Select a resource type to search for from the list.
File Size Select a file size range to search for from the list.
File Format Select a specific file format to search for from the list. Only CSV text file formats are currently supported.
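
The way search criteria narrow a result set, and how a result limit can hide matching resources, can be sketched in a few lines of Python. This is an illustration only: the `Resource` record, its field names, and the assumption that all selected criteria must match together are hypothetical and do not describe the Data Catalog implementation.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical record; these fields are illustrative, not the Data Catalog schema.
    name: str
    file_format: str
    size_bytes: int


def search(resources, keyword=None, file_format=None, limit=None):
    """Return resources that match every criterion that is set, up to `limit` items."""
    matches = []
    for res in resources:
        if keyword is not None and keyword.lower() not in res.name.lower():
            continue
        if file_format is not None and res.file_format.lower() != file_format.lower():
            continue
        matches.append(res)
    return matches if limit is None else matches[:limit]


catalog = [
    Resource("sales_2023.csv", "CSV", 2048),
    Resource("sales_2023.parquet", "PARQUET", 4096),
    Resource("customers.csv", "CSV", 1024),
]
found = search(catalog, keyword="sales", file_format="CSV", limit=10)
print([r.name for r in found])
```

Calling `search(catalog, file_format="CSV", limit=1)` in this sketch returns one resource even though two match, which is the kind of incomplete result that a low limit can produce.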

Select Advanced Search to search using a JSON string. For more information, see Lumada Data Catalog REST API. Results are filtered by the query parameters in the JSON string.

Advanced Search method

 

Option Description
Advanced Search Enter a JSON string with API-specific query parameters to run against Data Catalog.
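
Because the Advanced Search option expects a JSON string, it can help to build and validate the string before you paste it into the step. The following Python sketch uses only the standard `json` module. The keys `example_parameter` and `example_limit` are placeholders, not documented query parameters; see the Lumada Data Catalog REST API for the parameters that are actually supported.

```python
import json


def check_json(text):
    """Return the parsed value if `text` is valid JSON, otherwise raise ValueError."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err


# Placeholder query; the key names are hypothetical and must be replaced
# with parameters from the REST API documentation.
query = {"example_parameter": "example value", "example_limit": 25}
query_string = json.dumps(query)

check_json(query_string)  # Raises ValueError if the string is malformed.
print(query_string)
```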

Fields tab

 

In the Fields tab, you can define properties for the fields of the data format being read.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

 

When CSV data is read, the table defines the fields to read as input from the CSV file.

Enter the information for the Catalog Input step fields as shown in the following table.

 

Column Description
Name The name of the field.
Type The data type of the field.
Format The number format. See Number formats for a description of format symbols.
Position The position needed to process the 'Fixed' file type. The position is zero-based, so the first character is at position '0'.
Length The length of the field, according to the following field types:
  • Number: Total number of significant figures in a number.
  • String: Total length of the string.
  • Date: Length of printed output of the string. For example, ‘4’ returns only the year.
Precision The value of this field depends on the field type:
  • Number: Number of floating point digits.
  • String, Date, Boolean: Unused.
Currency The currency symbol used by Data Catalog to represent the currency (for example, $).
Decimal A decimal point that is represented as either a dot '.' or a comma ',' (for example, 5,000.00 or 5.000,00).
Null if Sets the field to null (empty) if the string is equal to the specified value.
Default The default value that is used if the field in the CSV file is empty.
Catalog Type The data type as defined in Data Catalog. For example, UTF8.

Use the following options to retrieve and format field information.

Option Description
Get Fields Click Get Fields to retrieve a list of fields derived from the source file in Data Catalog.
Minimal width Click Minimal width to minimize the field length by removing unnecessary characters. If Minimal width is used, string fields are no longer padded to their specified length.
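
The Position, Length, Decimal, Null if, and Default columns in the CSV fields table can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the PDI implementation: the function names are hypothetical, the Default value is assumed to be applied before the Null if comparison, and an empty field with no default is assumed to become null.

```python
from decimal import Decimal


def slice_fixed(line, position, length):
    """Extract a field from a fixed-width line using a zero-based position."""
    return line[position:position + length]


def parse_number(raw, decimal_symbol=".", null_if=None, default=None):
    """Convert a raw string to a Decimal using Decimal, Null if, and Default settings."""
    if raw == "" and default is not None:
        raw = default  # Default: used when the field is empty (assumed order).
    if raw == "" or raw == null_if:
        return None  # Null if: a matching string becomes null.
    # The grouping symbol is assumed to be the opposite of the decimal symbol.
    grouping = "," if decimal_symbol == "." else "."
    normalized = raw.replace(grouping, "").replace(decimal_symbol, ".")
    return Decimal(normalized)


line = "2024-05-17Alice"
print(slice_fixed(line, 0, 10))   # Date text starts at position 0.
print(slice_fixed(line, 10, 5))   # Name text starts at position 10.

print(parse_number("5,000.00"))                      # Dot as decimal symbol.
print(parse_number("5.000,00", decimal_symbol=","))  # Comma as decimal symbol.
print(parse_number("N/A", null_if="N/A"))            # Becomes null.
print(parse_number("", default="0"))                 # Empty field uses the default.
```

Both number strings in the example resolve to the same value, which shows why the Decimal setting must match the convention used in the source file.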