Hitachi Vantara Lumada and Pentaho Documentation

Catalog Output

Use the Catalog Output step to encode CSV text file types or Parquet data formats using the schema defined in PDI to create a new resource or to replace or update an existing resource in Lumada Data Catalog. Metadata can be added. The data is saved in the selected Hadoop or S3 ecosystem and registered as a resource in Data Catalog.

You must have role permissions in Data Catalog to write the data resources. If a new data resource is created, it must be profiled by Data Catalog to be recognized and available for use.

For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.

Note: This step is supported on the PDI engine but not on the Spark engine. Only CSV text files and Parquet data formats are currently supported.

Before you begin

Before using the Catalog Output step, be aware of the following conditions:

General

The following fields are general to this transformation step:

Field: Description
Step Name: Specify the unique name of the Catalog Output step on the canvas. You can customize the name or leave it as the default.
Connection: Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details.

File tab

The File tab has Operational Type selections that determine how and where the data is written.

Field: Description
Save by location: Select to save your data according to the Save by location settings.

Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
Overwrite an existing resource: Select to save your data by overwriting the existing resource.

When Save by location is selected, use these fields to determine the data type that is written, and where and how the data is saved.

Field: Description
Virtual Folder: Specify a virtual folder in Data Catalog for the data.
Name: Enter the file name and file extension.
File Format: Specify the file format. Only applicable for CSV and Parquet file types.
Replace: Select to replace an existing file.
Update: Select to update an existing file. Only applicable for CSV file types. Note that this method is currently unavailable when your data source is a MinIO instance. See Connecting to Virtual File Systems for alternate connections.
Create with timestamp: Select to create a new file with a timestamp when the file already exists.
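
The Replace, Update, and Create with timestamp options differ only in what happens when a file with the given name already exists. The following Python sketch illustrates that decision. The function name `resolve_target`, the mode strings, the returned action labels, and the timestamp suffix pattern are all assumptions made for illustration; the step's actual naming scheme is not documented here.

```python
from datetime import datetime
from pathlib import PurePosixPath


def resolve_target(name, existing, mode, now):
    """Return (file_name, action) for a write request.

    Hypothetical helper: mode strings, action labels, and the
    timestamp suffix format are illustrative assumptions.
    """
    if name not in existing:
        # Nothing to collide with, so every mode creates the file.
        return name, "create"
    if mode == "replace":
        return name, "overwrite"
    if mode == "update":
        # The documentation limits Update to CSV file types.
        if PurePosixPath(name).suffix.lower() != ".csv":
            raise ValueError("Update is only applicable for CSV files")
        return name, "update"
    if mode == "create_with_timestamp":
        path = PurePosixPath(name)
        stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed pattern
        return f"{path.stem}_{stamp}{path.suffix}", "create"
    raise ValueError(f"unknown mode: {mode}")
```

For example, calling `resolve_target("sales.csv", {"sales.csv"}, "create_with_timestamp", datetime(2024, 1, 2, 3, 4, 5))` yields a new name with the timestamp inserted before the extension, leaving the existing file untouched.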

When Overwrite an existing resource is selected, use this field to determine the resource that is overwritten.

Overwrite an existing Resource Id

Field: Description
Resource Id: Specify the Data Catalog resource identifier to use as the location to save the data.

Metadata tab

In the Metadata tab, you can add metadata for the data being written.

Field: Description
Description: Enter a description about the data. This entry will replace any existing description in Data Catalog.
Tags: Select an existing tag or tags for association with the data using the drop-down menu.

Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
ADD (button): Click to add this metadata to the data being written.

Fields tab

In the Fields tab, you can define properties for the fields of the data type being exported.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

CSV fields

The table below describes options for configuring the name and format of the fields being written for CSV files.

Column: Description
Name: Specify the name of the field.
Type: Select the field's data type from the drop-down menu or enter it manually.
Format: Select the format mask (number type) from the drop-down menu or enter it manually. See Common Formats for information on common valid date and numeric formats you can use in this step.
Length: Specify the length of the field, according to the following field types:

  • Number: Total number of significant figures in a number.
  • String: Total length of the string.
  • Date: Total length of printed output of the string. For example, 4 only returns the year.
Precision: Specify the number of floating-point digits for number-type fields.
Currency: Specify the symbol used to represent currencies (for example, $ or €).
Decimal: Specify the symbol used to represent a decimal point, either a period (.) as in 10,000.00 or a comma (,) as in 5.000,00.
Group: Specify the symbol used to separate units of thousands in numbers of four digits or larger, either a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
Trim type: Select the trimming method (none, left, right, or both) to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified.
Null: Specify the string to insert into the output text file when the value of the field is null.
Get Fields (button): Click to retrieve a list of fields from the input stream.
Minimal width (button): Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.
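
To make the interaction between the Decimal and Group symbols concrete, here is a minimal Python sketch that renders the same value with either convention. The function `format_number` and its parameter names are hypothetical and only mirror the column names above; this is not how PDI applies format masks internally.

```python
def format_number(value, precision=2, decimal=".", group=","):
    """Format a number with custom decimal and grouping symbols."""
    # Start from Python's built-in style: ',' for groups, '.' for decimals.
    text = f"{value:,.{precision}f}"
    # Swap through a placeholder so replacing one symbol does not
    # clobber the other when the two conventions are mirrored.
    return (
        text.replace(",", "\x00")
        .replace(".", decimal)
        .replace("\x00", group)
    )


print(format_number(10000))                         # period decimal, comma group
print(format_number(5000, decimal=",", group="."))  # comma decimal, period group
```

The two calls produce `10,000.00` and `5.000,00`, matching the examples given for the Decimal and Group columns.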
Parquet fields

The table below describes options for configuring the properties of the fields being written for Parquet data.

Column: Description
Parquet Path: Specify the name of the column in the Parquet file.
Name: Specify the name of the PDI field.
Parquet Type: Specify the data type used to store the data in the Parquet file.
Precision: Specify the total number of significant digits in the number (only applies to the Decimal Parquet type). The default value is 20.
Scale: Specify the number of digits after the decimal point (only applies to the Decimal Parquet type). The default value is 10.
Default value: Specify the default value of the field if it is null or empty.
Null: Specify if the field can contain null values.
Get Fields (button): Click to retrieve a list of fields from the input stream.
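
With the default Precision of 20 and Scale of 10, a Decimal column leaves 20 - 10 = 10 digits for the integer part. The Python sketch below checks whether a value fits within that budget. The function `fits_decimal` is a hypothetical helper for reasoning about the two settings; it checks only the integer part and does not model how the step rounds or rejects extra fractional digits.

```python
from decimal import Decimal


def fits_decimal(value, precision=20, scale=10):
    """Return True if the integer part of value fits DECIMAL(precision, scale).

    Simplification: fractional digits beyond `scale` are ignored here;
    rounding them could still carry into the integer part.
    """
    integer_digits = precision - scale
    # A value fits when its magnitude is below 10 ** integer_digits.
    return abs(Decimal(str(value))) < Decimal(10) ** integer_digits
```

For example, `fits_decimal("1234567890.1234567890")` is true, while `fits_decimal("12345678901.5")` is false because the integer part needs 11 digits.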

Note: To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
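
The rule in this note can be checked before a transformation runs. The following Python sketch assumes each field is described by a dictionary with the hypothetical keys `name`, `nullable`, and `default`; these keys are illustrative and are not part of any PDI API.

```python
def find_missing_defaults(fields):
    """Return the names of fields that disallow nulls but have no default value."""
    return [
        field["name"]
        for field in fields
        if not field["nullable"] and field.get("default") in (None, "")
    ]


fields = [
    {"name": "id", "nullable": False, "default": "0"},
    {"name": "city", "nullable": False, "default": ""},   # violates the rule
    {"name": "note", "nullable": True},                   # nulls allowed, no default needed
]
print(find_missing_defaults(fields))
```

Any name returned by the check identifies a field where Null is No and Default value is empty, which is the condition the note warns about.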

You can define the fields manually, or you can click Get Fields to automatically populate the fields. When the fields are retrieved, each PDI type is converted into an appropriate Parquet type, as shown in the table below. You can also change the selected Parquet type by using the Parquet Type drop-down menu or by entering the type manually.

PDI Type: Parquet Type
InetAddress: UTF8
String: UTF8
TimeStamp: TimestampMillis
Binary: Binary
BigNumber: Decimal
Boolean: Boolean
Date: Date
Integer: Int64
Number: Double
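
The conversion table above can be expressed as a simple lookup. The Python sketch below copies the pairs from the table; the names `PDI_TO_PARQUET` and `default_parquet_types` are hypothetical and only illustrate what Get Fields produces by default.

```python
# Default PDI-to-Parquet type conversion, copied from the table above.
PDI_TO_PARQUET = {
    "InetAddress": "UTF8",
    "String": "UTF8",
    "TimeStamp": "TimestampMillis",
    "Binary": "Binary",
    "BigNumber": "Decimal",
    "Boolean": "Boolean",
    "Date": "Date",
    "Integer": "Int64",
    "Number": "Double",
}


def default_parquet_types(pdi_fields):
    """Map (field_name, pdi_type) pairs to (field_name, parquet_type) pairs."""
    return [(name, PDI_TO_PARQUET[pdi_type]) for name, pdi_type in pdi_fields]


print(default_parquet_types([("id", "Integer"), ("price", "BigNumber")]))
```

An unknown PDI type raises a `KeyError` in this sketch, which makes a missing mapping visible instead of silently guessing a Parquet type.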
Options tab

In the Options tab, you can define properties for the file output.

CSV options

In the Options tab, you can define properties for the CSV file output.

Option: Description
Separator: Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert TAB to place a tab in the Separator field. The default value is semicolon (;).
Enclosure: Specify the string used to enclose fields. Enclosing a field allows it to contain separator characters. This setting is optional and can be left blank. The default value is double quotes (").
Force the enclosure around fields?: Select to force all field names to be enclosed with the character specified in the Enclosure option.
Disable the enclosure fix?: Select to disregard enclosures on string fields and separators.
Header: Clear to indicate that the first line in the output file is not a header row.
Footer: Select to specify that the last line in the output file is a footer row. When using the Append option, it is not possible to strip a footer from the file content before appending new rows.
Format: Specify the type of formatting to use, either DOS or UNIX. UNIX files have lines separated by line feeds, while DOS files have lines separated by carriage returns and line feeds. The default value is CR + LF (Windows, DOS).
Compression: Specify the type of compression (ZIP or GZIP) to use when compressing the output file. Only one file is placed in a single archive. The default value is None.
Encoding: Specify the file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, PDI searches your system for available encodings.
Right pad fields: Select to add spaces to the end of the fields (or remove characters at the end) until the length specified in the table under the Fields tab is reached.
Add Ending line of file: Specify an alternate ending row to the output file.
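
To see how the Separator, Enclosure, and Format defaults shape the output, the sketch below reproduces them with Python's standard `csv` module. This is an analogy only: the function `render_csv` and its parameters are hypothetical, and Python's quoting rules are assumed to approximate the step's behavior, not to match it exactly.

```python
import csv
import io


def render_csv(header, rows, separator=";", enclosure='"',
               force_enclosure=False, line_ending="\r\n"):
    """Render rows as CSV text using the step's documented defaults."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=separator,
        quotechar=enclosure,
        # QUOTE_MINIMAL encloses only fields that contain the separator
        # or the enclosure; QUOTE_ALL encloses every field.
        quoting=csv.QUOTE_ALL if force_enclosure else csv.QUOTE_MINIMAL,
        lineterminator=line_ending,  # "\r\n" for DOS, "\n" for UNIX
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(repr(render_csv(["id", "note"], [[1, "a;b"]])))
```

With the defaults, only the value containing a semicolon is enclosed, which shows why an enclosure is needed whenever field data can contain the separator character.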
Parquet options

In the Options tab, you can define properties for the Parquet file output.

Option: Description
Compression: Specify the codec to use to compress the Parquet Output file:

  • None: No compression is used (default).
  • Snappy: Uses Google's Snappy compression library. Each data block is written followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in that block.
  • GZIP: Uses a compression format that is based on the Deflate algorithm.
Version: Specify the version of Parquet you want to use:

  • Parquet 1.0
  • Parquet 2.0
Row group size (MB): Specify the group size for the rows. The default value is 0.
Data page size (KB): Specify the page size for the data. The default value is 0.
Dictionary encoding: Select to use dictionary encoding, which builds a dictionary of values encountered in a column. The dictionary page is written first, before the data pages of the column. If the dictionary grows larger than the Page size, whether in size or number of distinct values, the encoding method reverts to the plain encoding type.
Page size (KB): Specify the page size when using dictionary encoding. The default value is 1024.
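
The fallback described for Dictionary encoding can be illustrated with a simplified Python sketch. It is not the Parquet writer's implementation: the function `encode_column` is hypothetical, dictionary size is approximated as the total UTF-8 byte length of the distinct values, and the result is a plain Python dictionary instead of real Parquet pages.

```python
def encode_column(values, page_size_bytes=1024 * 1024):
    """Dictionary-encode a string column, reverting to plain encoding
    when the dictionary outgrows the page size (1024 KB by default)."""
    values = list(values)
    dictionary = {}   # distinct value -> index, in first-seen order
    indices = []
    dict_bytes = 0
    for value in values:
        if value not in dictionary:
            dict_bytes += len(value.encode("utf-8"))
            if dict_bytes > page_size_bytes:
                # Dictionary too large: store the raw values instead.
                return {"encoding": "plain", "values": values}
            dictionary[value] = len(dictionary)
        indices.append(dictionary[value])
    return {
        "encoding": "dictionary",
        "dictionary": list(dictionary),
        "indices": indices,
    }


print(encode_column(["a", "b", "a"]))
print(encode_column(["aa", "bb", "cc"], page_size_bytes=4)["encoding"])
```

Columns with few distinct values benefit most, because each repeated value is stored once in the dictionary and referenced by a small index afterward.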