Hitachi Vantara Lumada and Pentaho Documentation

Catalog Output

 


Use the Catalog Output step to encode CSV text file types or Parquet data formats using the schema defined in PDI to create a new resource or to replace or update an existing resource in Lumada Data Catalog. Metadata can be added. The data is saved in the selected Hadoop or S3 ecosystem and registered as a resource in Data Catalog.

You must have role permissions in Data Catalog to write the data resources. If a new data resource is created, it must be profiled by Data Catalog to be recognized and available for use.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

 

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text files and Parquet data formats are currently supported.

Before you begin

 

Before using the Catalog Output step, be aware of the following conditions:

General

 

The following fields are general to this transformation step:

Field | Description
Step Name | Specify the unique name of the Catalog Output step on the canvas. You can customize the name or leave it as the default.
Connection | Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

File tab

 

The File tab has Operational Type selections that determine how and where the data is written.


 

Field | Description
Save by location | Select to save your data according to the Save by location settings. Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Overwrite an existing resource | Select to save your data by overwriting the existing resource.

When Save by location is selected, use these fields to determine the data type that is written, and where and how the data is saved.

 

Field | Description
Virtual Folder | Specify a virtual folder in Data Catalog for the data.
Name | Enter the file name and file extension.
File Format | Specify the file format. Only applicable for CSV and Parquet file types.
Replace | Select to replace an existing file.
Update | Select to update an existing file. Only applicable for CSV file types. Note that this method is currently unavailable when your data source is a MinIO instance. See Connecting to Virtual File Systems for alternate connections.
Create with timestamp | Select to create a new file with a timestamp when the file already exists.
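
The effect of Create with timestamp can be pictured with a short Python sketch. This is an illustration only, not PDI's implementation: the helper name `timestamped_name` and the `_YYYYMMDD_HHMMSS` suffix layout are assumptions made for the example, and the suffix PDI actually writes may differ.

```python
from datetime import datetime
from pathlib import PurePosixPath


def timestamped_name(file_name: str, existing: set[str], now: datetime) -> str:
    """Return file_name unchanged if it is free, else add a timestamp suffix.

    Hypothetical helper: the suffix layout is an assumption, not PDI's format.
    """
    if file_name not in existing:
        return file_name
    path = PurePosixPath(file_name)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Insert the timestamp between the base name and the extension.
    return f"{path.stem}_{stamp}{path.suffix}"


existing_files = {"sales.csv"}
fixed_now = datetime(2024, 1, 15, 9, 30, 0)
print(timestamped_name("sales.csv", existing_files, fixed_now))   # sales_20240115_093000.csv
print(timestamped_name("orders.csv", existing_files, fixed_now))  # orders.csv
```

The existing file is left untouched in this sketch, which is the difference from Replace, where the existing file is overwritten.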

When Overwrite an existing resource is selected, use this field to determine the resource that is overwritten.


Field | Description
Resource Id | Specify the Data Catalog resource identifier to use as the location to save the data.

Metadata tab

 


In the Metadata tab, you can add metadata for the data being written.

 

Field | Description
Description | Enter a description of the data. This entry replaces any existing description in Data Catalog.
Tags | Use the drop-down menu to select one or more existing tags to associate with the data. Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
ADD (button) | Click to add this metadata to the data being written.

Fields tab

 

In the Fields tab, you can define properties for the fields of the data type being exported.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields
 


The table below describes options for configuring the name and format of the fields being written for CSV files.

 

Column | Description
Name | Specify the name of the field.
Type | Select the field's data type from the drop-down menu or enter it manually.
Format | Select the format mask (number type) from the drop-down menu or enter it manually. See Common Formats for information on common valid date and numeric formats you can use in this step.
Length | Specify the length of the field according to its type. Number: total number of significant figures in the number. String: total length of the string. Date: total length of the printed output of the string (for example, 4 returns only the year).
Precision | Specify the number of floating-point digits for number-type fields.
Currency | Specify the symbol used to represent currencies (for example, $ or €).
Decimal | Specify the symbol used to represent a decimal point, either a period (.) as in 10,000.00 or a comma (,) as in 5.000,00.
Group | Specify the symbol used to separate units of thousands in numbers of four digits or more, either a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
Trim type | Select the trimming method (none, left, right, or both) to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified.
Null | Specify the string to insert into the output text file when the value of the field is null.
Get Fields (button) | Click to retrieve a list of fields from the input stream.
Minimal width (button) | Click to minimize the field length by removing unnecessary characters. If selected, string fields are no longer padded to their specified length.
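
To make the Decimal and Group settings concrete, the following Python sketch reproduces the two examples from the table (10,000.00 and 5.000,00). It only illustrates the effect of the two symbols. It does not show how PDI applies format masks, and the function name `format_number` is made up for this example.

```python
def format_number(value: float, precision: int, decimal: str = ".", group: str = ",") -> str:
    """Format value with the given decimal and grouping symbols (illustrative helper)."""
    # Format with Python's defaults first: ',' groups thousands, '.' marks decimals.
    text = f"{value:,.{precision}f}"
    # Swap the symbols through a placeholder so one replacement cannot clobber the other.
    return text.replace(",", "\0").replace(".", decimal).replace("\0", group)


print(format_number(10000, 2))                         # 10,000.00
print(format_number(5000, 2, decimal=",", group="."))  # 5.000,00
```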
Parquet fields
 


The table below describes options for configuring the properties of the fields being written for Parquet data.

 

Column | Description
Parquet Path | Specify the name of the column in the Parquet file.
Name | Specify the name of the PDI field.
Parquet Type | Specify the data type used to store the data in the Parquet file.
Precision | Specify the total number of significant digits in the number (only applies to the Decimal Parquet type). The default value is 20.
Scale | Specify the number of digits after the decimal point (only applies to the Decimal Parquet type). The default value is 10.
Default value | Specify the default value of the field if it is null or empty.
Null | Specify if the field can contain null values.
Get Fields (button) | Click to retrieve a list of fields from the input stream.
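
As a rough illustration of how Precision and Scale interact for the Decimal Parquet type, the Python sketch below checks whether a value fits the defaults (precision 20, scale 10), which leave at most 10 digits before the decimal point. The function name `fits_decimal_column` is hypothetical, and what PDI does with a value that does not fit (rounding, truncating, or failing) is not covered here.

```python
from decimal import Decimal


def fits_decimal_column(value: Decimal, precision: int = 20, scale: int = 10) -> bool:
    """Check whether a finite value fits a Decimal(precision, scale) column as is."""
    _sign, digits, exponent = value.as_tuple()
    integer_digits = max(len(digits) + exponent, 0)  # digits before the decimal point
    fraction_digits = max(-exponent, 0)              # digits after the decimal point
    return integer_digits <= precision - scale and fraction_digits <= scale


print(fits_decimal_column(Decimal("123.45")))         # True
print(fits_decimal_column(Decimal("12345678901.5")))  # False: 11 integer digits
print(fits_decimal_column(Decimal("0.12345678901")))  # False: 11 fractional digits
```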

 

Note: To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
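
The rule in this note can be expressed as a simple check. The Python sketch below is an assumption-laden illustration: `ParquetField` and `fields_missing_default` are hypothetical names that model rows of the Parquet fields table, not part of any PDI API.

```python
from dataclasses import dataclass


@dataclass
class ParquetField:
    """Hypothetical stand-in for one row of the Parquet fields table."""
    name: str
    nullable: bool           # the Null column: True for Yes, False for No
    default_value: str = ""  # the Default value column


def fields_missing_default(fields: list[ParquetField]) -> list[str]:
    """Return names of fields that disallow nulls but have no default value."""
    return [f.name for f in fields if not f.nullable and f.default_value == ""]


fields = [
    ParquetField("id", nullable=False, default_value="0"),
    ParquetField("comment", nullable=True),
    ParquetField("amount", nullable=False),
]
print(fields_missing_default(fields))  # ['amount']
```

Any name this check returns is a field to revisit before running the transformation.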

You can define the fields manually, or you can click Get Fields to populate them automatically. When the fields are retrieved, each PDI type is converted into an appropriate Parquet type, as shown in the table below. You can also change the selected Parquet type by using the Parquet Type drop-down menu or by entering the type manually.

 

PDI Type | Parquet Type
InetAddress | UTF8
String | UTF8
TimeStamp | TimestampMillis
Binary | Binary
BigNumber | Decimal
Boolean | Boolean
Date | Date
Integer | Int64
Number | Double
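
The conversion table above can be restated as a lookup, which may be handy when checking a field list outside PDI. The mapping values come from the table. The dictionary name `PDI_TO_PARQUET` and the helper `default_parquet_types` are hypothetical names used only for this sketch.

```python
# Default PDI-to-Parquet type conversion, copied from the table above.
PDI_TO_PARQUET = {
    "InetAddress": "UTF8",
    "String": "UTF8",
    "TimeStamp": "TimestampMillis",
    "Binary": "Binary",
    "BigNumber": "Decimal",
    "Boolean": "Boolean",
    "Date": "Date",
    "Integer": "Int64",
    "Number": "Double",
}


def default_parquet_types(pdi_fields: dict[str, str]) -> dict[str, str]:
    """Map each field name to the default Parquet type for its PDI type."""
    return {name: PDI_TO_PARQUET[pdi_type] for name, pdi_type in pdi_fields.items()}


print(default_parquet_types({"id": "Integer", "price": "BigNumber", "ip": "InetAddress"}))
# {'id': 'Int64', 'price': 'Decimal', 'ip': 'UTF8'}
```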
Options tab
 

In the Options tab, you can define properties for the file output.

CSV options
 


In the Options tab, you can define properties for the CSV file output.

 

Option | Description
Separator | Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert TAB to place a tab in the Separator field. The default value is semicolon (;).
Enclosure | Specify the string used to enclose fields, which allows separator characters inside field values. This setting is optional and can be left blank. The default value is double quotes (").
Force the enclosure around fields? | Specify to force all field names to be enclosed with the character specified in the Enclosure option.
Disable the enclosure fix? | Specify to disregard enclosures on string fields and separators.
Header | Clear to indicate that the first line in the output file is not a header row.
Footer | Select to specify that the last line in the output file is a footer row. When using the Append option, it is not possible to strip a footer from the file content before appending new rows.
Format | Specify the type of formatting to use. It can be either DOS or UNIX. UNIX files have lines separated by line feeds, while DOS files have lines separated by carriage returns and line feeds. The default value is CR + LF (Windows, DOS).
Compression | Specify the type of compression (.ZIP or Gzip) to use when compressing the output file. Only one file is placed in a single archive. The default value is None.
Encoding | Specify the file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, PDI searches your system for available encodings.
Right pad fields | Select to add spaces to the end of the fields (or remove characters at the end) until the length specified in the table under the Fields tab is reached.
Add Ending line of file | Specify an alternate ending row for the output file.
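
For a feel of what the default Separator, Enclosure, and Format values produce, the Python sketch below writes three rows with the standard `csv` module. It is an analogy rather than PDI's writer: Python's quoting modes stand in for the enclosure options, and the sample rows are invented.

```python
import csv
import io

rows = [["id", "note"], [1, "plain"], [2, "has;separator"]]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter=";",              # Separator default
    quotechar='"',              # Enclosure default
    quoting=csv.QUOTE_MINIMAL,  # enclose only when needed; csv.QUOTE_ALL resembles forcing the enclosure
    lineterminator="\r\n",      # Format default: CR + LF
)
writer.writerows(rows)          # the first row acts as the header
print(repr(buffer.getvalue()))
# 'id;note\r\n1;plain\r\n2;"has;separator"\r\n'
```

Only the value that contains the separator is enclosed, which is why an enclosure is needed when field values can contain a semicolon.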
Parquet options
 


In the Options tab, you can define properties for the Parquet file output.

 

Option | Description
Compression | Specify the codec to use to compress the Parquet output file. None: no compression is used (default). Snappy: uses Google's Snappy compression library and writes data blocks that are each followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block. GZIP: uses a compression format that is based on the Deflate algorithm.
Version | Specify the version of Parquet you want to use: Parquet 1.0 or Parquet 2.0.
Row group size (MB) | Specify the group size for the rows. The default value is 0.
Data page size (KB) | Specify the page size for the data. The default value is 0.
Dictionary encoding | Specify the dictionary encoding, which builds a dictionary of values encountered in a column. The dictionary page is written first, before the data pages of the column. Note that if the dictionary grows larger than the Page size, whether in size or number of distinct values, the encoding method reverts to the plain encoding type.
Page size (KB) | Specify the page size when using dictionary encoding. The default value is 1024.
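
The dictionary encoding behavior described above, including the fallback to plain encoding, can be sketched in a few lines of Python. This is a deliberately simplified model, not the Parquet writer that PDI uses: real writers measure sizes and lay out pages differently, and the function name `encode_column` and the byte-based limit are assumptions for the example.

```python
def encode_column(values: list[str], dict_page_size_bytes: int) -> tuple[str, object]:
    """Dictionary-encode values, falling back to plain if the dictionary is too big."""
    dictionary: dict[str, int] = {}
    indices: list[int] = []
    dictionary_bytes = 0
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
            dictionary_bytes += len(value.encode("utf-8"))
            if dictionary_bytes > dict_page_size_bytes:
                # The dictionary outgrew its page: store the raw values instead.
                return "plain", list(values)
        indices.append(dictionary[value])
    # The dictionary (distinct values) comes first, then the per-row indices.
    return "dictionary", (list(dictionary), indices)


print(encode_column(["NY", "LA", "NY", "NY", "LA"], dict_page_size_bytes=1024))
# ('dictionary', (['NY', 'LA'], [0, 1, 0, 0, 1]))
print(encode_column(["alpha", "beta", "gamma"], dict_page_size_bytes=8))
# ('plain', ['alpha', 'beta', 'gamma'])
```

Columns with few distinct values benefit most, because each repeated value is stored once in the dictionary and referenced by a small index.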