Catalog Output

Use the Catalog Output step to write data as a CSV text file or in Parquet format, using the schema defined in PDI, to create a new resource or to replace or update an existing resource in Lumada Data Catalog. You can also add metadata to the data. The data is saved in the selected Hadoop or S3 ecosystem and registered as a resource in Data Catalog.

You must have role permissions in Data Catalog to write the data resources. If a new data resource is created, it must be profiled by Data Catalog to be recognized and available for use.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text files and Parquet data formats are currently supported.

Before you begin

Before using the Catalog Output step, be aware of the following conditions:

You must have an established Catalog connection to Data Catalog. For details, see Access to Lumada Data Catalog.
S3 must be configured as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
You must have an established PDI connection to the cluster(s) you plan on using. For example, a Hadoop driver must be configured as a named connection for your distribution for accessing HDFS. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.

General

The following fields are general to this transformation step:

Field	Description
Step Name	Specify the unique name of the Catalog Output step on the canvas. You can customize the name or leave it as the default.
Connection	Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

File tab

The File tab has Operational Type selections that determine how and where the data is written.

Field	Description
Save by location	Select to save your data according to the Save by location settings. Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Overwrite an existing resource	Select to save your data by overwriting the existing resource.

When Save by location is selected, use these fields to determine the data type that is written, and where and how the data is saved.

Field	Description
Virtual Folder	Specify a virtual folder in Data Catalog for the data.
Name	Enter the file name and file extension.
File Format	Specify the file format. Only applicable for CSV and Parquet file types.
Replace	Select to replace an existing file.
Update	Select to update an existing file. Only applicable for CSV file types. Note that this method is currently unavailable when your data source is a MinIO instance. See Connecting to Virtual File Systems for alternate connections.
Create with timestamp	Select to create a new file with a timestamp when the file already exists.

When Overwrite an existing resource is selected, use this field to determine the resource that is overwritten.

Overwrite an existing Resource Id

Field	Description
Resource Id	Specify the Data Catalog resource identifier to use as the location to save the data.

Metadata tab

In the Metadata tab, you can add metadata for the data being written.

Field	Description
Description	Enter a description about the data. This entry will replace any existing description in Data Catalog.
Tags	Use the drop-down menu to select one or more existing tags to associate with the data. Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
ADD (button)	Click to add this metadata to the data being written.

Fields tab

In the Fields tab, you can define properties for the fields of the data type being exported.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

The table below describes options for configuring the name and format of the fields being written for CSV files.

Column	Description
Name	Specify the name of the field.
Type	Select the field's data type from the drop-down menu or enter it manually.
Format	Select the format mask (number type) from the drop-down menu or enter it manually. See Common Formats for information on common valid date and numeric formats you can use in this step.
Length	Specify the length of the field, according to the following field types: Number: Total number of significant figures in the number. String: Total length of the string. Date: Total length of the printed output of the string. For example, `4` returns only the year.
Precision	Specify the number of floating-point digits for number-type fields.
Currency	Specify the symbol used to represent currencies (for example, `$` or `€`).
Decimal	Specify the symbol used to represent a decimal point, either a period (`.`) as in `10,000.00` or a comma (`,`) as in `5.000,00`.
Group	Specify the symbol used to separate units of thousands in numbers of four digits or more, either a comma (`,`) as in `10,000.00` or a period (`.`) as in `5.000,00`.
Trim type	Select the trimming method (none, left, right, or both) to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified.
Null	Specify the string to insert into the output text file when the value of the field is null.
Get Fields (button)	Click to retrieve a list of fields from the input stream.
Minimal width (button)	Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.
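
The interaction between the Precision, Decimal, and Group columns can be illustrated with a short Python sketch. This is only an illustration of the two number styles shown in the table above, not how PDI formats values internally; the function name `format_number` and its parameters are hypothetical.

```python
def format_number(value, precision=2, decimal=".", group=","):
    """Render a number using the chosen decimal and grouping symbols."""
    # Start from Python's built-in style: ',' for groups and '.' for decimals.
    text = f"{value:,.{precision}f}"
    # Swap both symbols in a single pass so they cannot overwrite each other.
    table = str.maketrans({",": group, ".": decimal})
    return text.translate(table)


print(format_number(10000, decimal=".", group=","))  # 10,000.00
print(format_number(5000, decimal=",", group="."))   # 5.000,00
```

Swapping both symbols in one translation pass avoids the bug that two sequential `replace` calls would cause when the Decimal and Group symbols are exchanged.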

Parquet fields

The table below describes options for configuring the properties of the fields being written for Parquet data.

Column	Description
Parquet Path	Specify the name of the column in the Parquet file.
Name	Specify the name of the PDI field.
Parquet Type	Specify the data type used to store the data in the Parquet file.
Precision	Specify the total number of significant digits in the number (only applies to the Decimal Parquet type). The default value is 20.
Scale	Specify the number of digits after the decimal point (only applies to the Decimal Parquet type). The default value is 10.
Default value	Specify the default value of the field if it is null or empty.
Null	Specify if the field can contain null values.
Get Fields (button)	Click to retrieve a list of fields from the input stream.
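
As a rough illustration of how Precision and Scale relate for the Decimal Parquet type, the Python sketch below checks whether a value fits a given precision and scale under the usual fixed-point interpretation (at most `scale` digits after the decimal point and at most `precision - scale` digits before it). The helper `fits_decimal` is hypothetical, and whether the step rounds or rejects values that do not fit is not stated here.

```python
from decimal import Decimal


def fits_decimal(value, precision=20, scale=10):
    """Return True if a finite decimal string fits the given precision and scale."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    fractional_digits = max(0, -exponent)            # digits after the decimal point
    integral_digits = max(0, len(digits) + exponent) # digits before the decimal point
    return fractional_digits <= scale and integral_digits <= precision - scale


print(fits_decimal("1234567890.1234567890"))  # True: 10 integral + 10 fractional digits
print(fits_decimal("12345678901.5"))          # False: 11 integral digits exceed 20 - 10
print(fits_decimal("0.12345678901"))          # False: 11 fractional digits exceed scale 10
```

The defaults mirror the default values listed in the table (precision 20, scale 10). The sketch handles finite values only.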

Note: To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
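
The rule about Default value and Null can be expressed as a simple pre-flight check. The Python sketch below is a hypothetical helper for reviewing a field list outside of PDI; the `ParquetField` type and `missing_defaults` function are assumptions made for illustration and are not part of the step.

```python
from typing import NamedTuple, Optional


class ParquetField(NamedTuple):
    name: str
    nullable: bool          # the Null column: True for Yes, False for No
    default: Optional[str]  # the Default value column


def missing_defaults(fields):
    """Return names of non-nullable fields that have no default value."""
    return [f.name for f in fields if not f.nullable and not f.default]


fields = [
    ParquetField("id", nullable=False, default="0"),
    ParquetField("comment", nullable=True, default=None),
    ParquetField("amount", nullable=False, default=""),
]
print(missing_defaults(fields))  # ['amount']
```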

You can define the fields manually, or you can click Get Fields to automatically populate the fields. When the fields are retrieved, each PDI type is converted into an appropriate Parquet type, as shown in the table below. You can also change the selected Parquet type by using the Parquet Type drop-down menu or by entering the type manually.

PDI Type	Parquet Type
InetAddress	UTF8
String	UTF8
TimeStamp	TimestampMillis
Binary	Binary
BigNumber	Decimal
Boolean	Boolean
Date	Date
Integer	Int64
Number	Double
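
The conversion table above can be restated as a lookup, which may be handy when checking field definitions in scripts. This Python sketch only mirrors the table; the names `PDI_TO_PARQUET` and `default_parquet_type` are hypothetical and do not correspond to any PDI API.

```python
# Default Parquet type chosen for each PDI type, copied from the table above.
PDI_TO_PARQUET = {
    "InetAddress": "UTF8",
    "String": "UTF8",
    "TimeStamp": "TimestampMillis",
    "Binary": "Binary",
    "BigNumber": "Decimal",
    "Boolean": "Boolean",
    "Date": "Date",
    "Integer": "Int64",
    "Number": "Double",
}


def default_parquet_type(pdi_type):
    """Look up the default Parquet type for a PDI type name."""
    try:
        return PDI_TO_PARQUET[pdi_type]
    except KeyError:
        raise ValueError(f"No default Parquet type listed for {pdi_type!r}") from None


print(default_parquet_type("Integer"))    # Int64
print(default_parquet_type("BigNumber"))  # Decimal
```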

Options tab

In the Options tab, you can define properties for the file output.

CSV options

In the Options tab, you can define properties for the CSV file output.

Option	Description
Separator	Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert TAB to place a tab in the Separator field. The default value is semicolon (;).
Enclosure	Specify to enclose fields with a pair of specified strings. It allows for separator characters in fields. This setting is optional and can be left blank. The default value is double quotes (").
Force the enclosure around fields?	Specify to force all field names to be enclosed with the character specified in the Enclosure option.
Disable the enclosure fix?	Specify to disregard enclosures on string fields and separators.
Header	Clear to indicate that the first line in the output file is not a header row.
Footer	Select to specify that the last line in the output file is a footer row. When using the Append option, it is not possible to strip a footer from the file content before appending new rows.
Format	Specify the type of formatting to use. It can be either DOS or UNIX. UNIX files have lines separated by line feeds, while DOS files have lines separated by carriage returns and line feeds. The default value is CR + LF (Windows, DOS).
Compression	Specify the type of compression (.ZIP or Gzip) to use when compressing the output file. Only one file is placed in a single archive. The default value is None.
Encoding	Specify the file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify `UTF-8` or `UTF-16`. On first use, PDI searches your system for available encodings.
Right pad fields	Select to add spaces to the end of the fields (or remove characters at the end) until the length specified in the table under the Fields tab is reached.
Add Ending line of file	Specify an alternate ending row to the output file.
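
To see how the Separator, Enclosure, and Format options shape the output, the following Python sketch uses the standard `csv` module to produce comparable text. It approximates the options with the closest `csv` settings and is not the step's implementation; the function `write_csv` and its parameter names are hypothetical.

```python
import csv
import io


def write_csv(rows, header, separator=";", enclosure='"',
              force_enclosure=False, line_ending="\r\n"):
    """Write rows as delimited text and return the result as a string."""
    buffer = io.StringIO(newline="")  # keep line endings exactly as written
    writer = csv.writer(
        buffer,
        delimiter=separator,
        quotechar=enclosure,
        # QUOTE_ALL encloses every value; QUOTE_MINIMAL only where needed.
        quoting=csv.QUOTE_ALL if force_enclosure else csv.QUOTE_MINIMAL,
        lineterminator=line_ending,
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = write_csv([["1", "a;b"], ["2", "plain"]], header=["id", "label"])
print(repr(text))  # 'id;label\r\n1;"a;b"\r\n2;plain\r\n'
```

The value `a;b` is enclosed because it contains the separator, which is the situation the Enclosure option exists to handle. The defaults here follow the defaults listed in the table: a semicolon separator, double-quote enclosure, and CR + LF line endings.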

Parquet options

In the Options tab, you can define properties for the Parquet file output.

Option	Description
Compression	Specify the codec used to compress the Parquet output file. None: No compression is used (default). Snappy: Uses Google's Snappy compression library, which writes data blocks that are each followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block. GZIP: Uses a compression format that is based on the Deflate algorithm.
Version	Specify the version of Parquet you want to use: Parquet 1.0 or Parquet 2.0.
Row group size (MB)	Specify the group size for the rows. The default value is 0.
Data page size (KB)	Specify the page size for the data. The default value is 0.
Dictionary encoding	Specify the dictionary encoding, which builds a dictionary of the values encountered in a column. The dictionary page is written first, before the data pages of the column. Note that if the dictionary grows larger than the Page size, whether in size or number of distinct values, the encoding method reverts to the plain encoding type.
Page size (KB)	Specify the page size when using dictionary encoding. The default value is 1024.
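
The dictionary encoding behavior described above, including the fallback to plain encoding, can be pictured with a simplified Python model. This is a conceptual sketch only: it does not reproduce the Parquet file layout, it measures dictionary size as the UTF-8 byte length of the distinct values, and the function `encode_column` is hypothetical.

```python
def encode_column(values, page_size_bytes=1024 * 1024):
    """Dictionary-encode a list of strings, falling back to plain encoding."""
    dictionary = {}  # distinct value -> index, in first-seen order
    indices = []
    dictionary_bytes = 0
    for value in values:
        if value not in dictionary:
            dictionary_bytes += len(value.encode("utf-8"))
            if dictionary_bytes > page_size_bytes:
                # The dictionary outgrew the page size: store the raw values.
                return ("plain", list(values))
            dictionary[value] = len(dictionary)
        indices.append(dictionary[value])
    return ("dictionary", list(dictionary), indices)


print(encode_column(["a", "b", "a"]))
# ('dictionary', ['a', 'b'], [0, 1, 0])
print(encode_column(["aa", "bb", "cc"], page_size_bytes=4))
# ('plain', ['aa', 'bb', 'cc'])
```

The default of `1024 * 1024` bytes corresponds to the default Page size of 1024 KB. Columns with few distinct values benefit most, because each repeated value is stored as a small index.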
