Catalog Output
Use the Catalog Output step to encode data as CSV text or Parquet, using the schema defined in PDI, and then create a new resource or replace or update an existing resource in Lumada Data Catalog. You can also add metadata. The data is saved in the selected Hadoop or S3 ecosystem and registered as a resource in Data Catalog.
You must have role permissions in Data Catalog to write the data resources. If a new data resource is created, it must be profiled by Data Catalog to be recognized and available for use.
For more information about accessing Lumada Data Catalog in PDI, see PDI and Lumada Data Catalog.
Before you begin
Before using the Catalog Output step, be aware of the following conditions:
- You must have an established Catalog connection to Data Catalog. For details, see Access to Lumada Data Catalog.
- S3 must be configured as the Default S3 Connection in VFS Connections to access S3 storage. For details, see Connecting to Virtual File Systems.
- You must have an established PDI connection to the cluster(s) you plan on using. For example, a Hadoop driver must be configured as a named connection for your distribution for accessing HDFS. For information on named connections, see Connecting to a Hadoop cluster with the PDI client.
General
The following fields are general to this transformation step:
Field | Description |
Step Name | Specify the unique name of the Catalog Output step on the canvas. You can customize the name or leave it as the default. |
Connection | Use the list to select the name of your connection to Data Catalog. See Connecting to Virtual File Systems for details. |
File tab
The File tab has Operational Type selections that determine how and where the data is written.
Field | Description |
Save by location | Select to save your data according to the Save by location settings.
Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
|
Overwrite an existing resource | Select to save your data by overwriting the existing resource. |
When Save by location is selected, use these fields to determine the data type that is written, and where and how the data is saved.
Field | Description |
Virtual Folder | Specify a virtual folder in Data Catalog for the data. |
Name | Enter the file name and file extension. |
File Format | Specify the file format. Only the CSV and Parquet file types are supported. |
Replace | Select to replace an existing file. |
Update | Select to update an existing file. Only applicable for CSV file types. Note that this method is currently unavailable when your data source is a MinIO instance. See Connecting to Virtual File Systems for alternate connections. |
Create with timestamp | Select to create a new file with a timestamp when the file already exists. |
When Overwrite an existing resource is selected, use this field to determine the resource that is overwritten.
Field | Description |
Resource Id | Specify the Data Catalog resource identifier to use as the location to save the data. |
Metadata tab
In the Metadata tab, you can add metadata for the data being written.
Field | Description |
Description | Enter a description about the data. This entry will replace any existing description in Data Catalog. |
Tags | Select an existing tag or tags for association with the data using the drop-down menu.
Note: In some cases, if missing or incomplete data is returned, you may need to change the default limit for returned results. See Lumada Data Catalog searches returning incomplete or missing data for information.
|
ADD (button) | Click to add this metadata to the data being written. |
Fields tab
In the Fields tab, you can define properties for the fields of the data type being exported.
See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.
CSV fields
The table below describes options for configuring the name and format of the fields being written for CSV files.
Column | Description |
Name | Specify the name of the field. |
Type | Select the field's data type from the drop-down menu or enter it manually. |
Format | Select the format mask (number type) from the drop-down menu or enter it manually. See Common Formats for information on common valid date and numeric formats you can use in this step. |
Length | Specify the length of the field, according to the field type. |
Precision | Specify the number of floating-point digits for number-type fields. |
Currency | Specify the symbol used to represent currencies (for example, $ or €). |
Decimal | Specify the symbol used to represent a decimal point, either a period (.) as in 10,000.00 or a comma (,) as in 5.000,00. |
Group | Specify the symbol used to separate units of thousands in numbers of four digits or larger, either a comma (,) as in 10,000.00 or a period (.) as in 5.000,00. |
Trim type | Select the trimming method (none, left, right, or both) to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified. |
Null | Specify the string to insert into the output text file when the value of the field is null. |
Get Fields (button) | Click to retrieve a list of fields from the input stream. |
Minimal width (button) | Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length. |
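The effect of the Decimal and Group settings can be illustrated with a short Python sketch. This is not how PDI formats numbers internally; the function name `format_number` and its parameters are hypothetical and exist only to reproduce the two example values from the table.

```python
def format_number(value: float, decimal: str = ".", group: str = ",", places: int = 2) -> str:
    """Format value using the chosen decimal and grouping symbols (illustrative only)."""
    # Start from Python's built-in formatting: "," for groups, "." for decimals.
    text = f"{value:,.{places}f}"
    # Swap the symbols through a placeholder so the two replacements do not collide.
    return text.replace(",", "\0").replace(".", decimal).replace("\0", group)


print(format_number(10000.0))                          # 10,000.00
print(format_number(5000.0, decimal=",", group="."))   # 5.000,00
```

The same stored value produces different text depending on the two symbols, so whatever reads the CSV file needs to be configured with matching settings.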
Parquet fields
The table below describes options for configuring the properties of the fields being written for Parquet data.
Column | Description |
Parquet Path | Specify the name of the column in the Parquet file. |
Name | Specify the name of the PDI field. |
Parquet Type | Specify the data type used to store the data in the Parquet file. |
Precision | Specify the total number of significant digits in the number (only applies to the Decimal Parquet type). The default value is 20. |
Scale | Specify the number of digits after the decimal point (only applies to the Decimal Parquet type). The default value is 10. |
Default value | Specify the default value of the field if it is null or empty. |
Null | Specify if the field can contain null values. |
Get Fields (button) | Click to retrieve a list of fields from the input stream. |
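To make the relationship between Precision and Scale concrete, the following Python sketch checks whether a value fits a decimal type under the usual fixed-point interpretation, where at most `precision - scale` digits may appear before the decimal point. The helper `fits_decimal` is hypothetical, handles finite values only, and is not taken from PDI; the defaults mirror the default values listed in the table.

```python
from decimal import Decimal


def fits_decimal(value: str, precision: int = 20, scale: int = 10) -> bool:
    """Return True if value fits a decimal with the given precision and scale."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    # Digits to the right of the decimal point.
    fraction_digits = max(0, -exponent)
    # Digits to the left of the decimal point.
    integer_digits = max(0, len(digits) + exponent)
    return fraction_digits <= scale and integer_digits <= precision - scale


print(fits_decimal("1234567890.0123456789"))  # True: 10 integer digits, 10 fraction digits
print(fits_decimal("12345678901.5"))          # False: 11 integer digits
print(fits_decimal("0.12345678901"))          # False: 11 fraction digits
```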
You can define the fields manually, or you can click Get Fields to automatically populate the fields. When the fields are retrieved, each PDI type is converted into an appropriate Parquet type, as shown in the table below. You can also change the selected Parquet type by using the Parquet Type drop-down menu or by entering the type manually.
PDI Type | Parquet Type |
InetAddress | UTF8 |
String | UTF8 |
TimeStamp | TimestampMillis |
Binary | Binary |
BigNumber | Decimal |
Boolean | Boolean |
Date | Date |
Integer | Int64 |
Number | Double |
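If you need to predict or validate the default conversion outside of PDI, for example in a script that reviews transformation settings, the table above can be expressed as a simple lookup. This is a minimal sketch; the names `PDI_TO_PARQUET` and `default_parquet_type` are hypothetical and are not part of any PDI API.

```python
# PDI type -> default Parquet type, copied from the conversion table above.
PDI_TO_PARQUET = {
    "InetAddress": "UTF8",
    "String": "UTF8",
    "TimeStamp": "TimestampMillis",
    "Binary": "Binary",
    "BigNumber": "Decimal",
    "Boolean": "Boolean",
    "Date": "Date",
    "Integer": "Int64",
    "Number": "Double",
}


def default_parquet_type(pdi_type: str) -> str:
    """Return the default Parquet type listed for a PDI type."""
    try:
        return PDI_TO_PARQUET[pdi_type]
    except KeyError:
        raise ValueError(f"No default Parquet type listed for PDI type {pdi_type!r}") from None


print(default_parquet_type("BigNumber"))  # Decimal
```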
Options tab
In the Options tab, you can define properties for the file output.
CSV options
In the Options tab, you can define properties for the CSV file output.
Option | Description |
Separator | Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert TAB to place a tab in the Separator field. The default value is semicolon (;). |
Enclosure | Specify the string used to enclose fields, which allows separator characters to appear inside field values. This setting is optional and can be left blank. The default value is a double quote ("). |
Force the enclosure around fields? | Specify to force all field names to be enclosed with the character specified in the Enclosure option. |
Disable the enclosure fix? | Specify to disregard enclosures on string fields and separators. |
Header | Clear to indicate that the first line in the output file is not a header row. |
Footer | Select to specify that the last line in the output file is a footer row. When using the Append option, it is not possible to strip a footer from the file content before appending new rows. |
Format | Specify the type of formatting to use. It can be either DOS or UNIX. UNIX files have lines separated by line feeds, while DOS files have lines separated by carriage returns and line feeds. The default value is CR + LF (Windows, DOS). |
Compression | Specify the type of compression (Zip or GZip) to use when compressing the output file. Only one file is placed in a single archive. The default value is None. |
Encoding | Specify the file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, PDI searches your system for available encodings. |
Right pad fields | Select to add spaces to the end of the fields (or remove characters at the end) until the length specified in the table under the Fields tab is reached. |
Add Ending line of file | Specify an alternate ending row to the output file. |
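The interaction between the Separator, Enclosure, and Format options is easier to see in an example. The following sketch uses Python's standard `csv` module to imitate the default settings described above (semicolon separator, double-quote enclosure, CR + LF line endings). It only illustrates the resulting file layout; it is not PDI's implementation, and the function name `write_rows` and its parameters are hypothetical.

```python
import csv
import io


def write_rows(header, rows, separator=";", enclosure='"',
               force_enclosure=False, line_ending="\r\n"):
    """Write a header and rows to CSV text using the given options."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=separator,
        quotechar=enclosure,
        # QUOTE_MINIMAL encloses only values that contain the separator,
        # the enclosure, or a line break; QUOTE_ALL encloses every value.
        quoting=csv.QUOTE_ALL if force_enclosure else csv.QUOTE_MINIMAL,
        lineterminator=line_ending,
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


sample = write_rows(["id", "note"], [[1, "plain"], [2, "has;separator"]])
print(repr(sample))  # 'id;note\r\n1;plain\r\n2;"has;separator"\r\n'
```

Only the value that contains the separator is enclosed, which is why an enclosure is needed whenever field values can contain the separator character. Note that `force_enclosure=True` in this sketch encloses every value, which is the behavior of Python's `QUOTE_ALL` and may not match PDI exactly.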
Parquet options
In the Options tab, you can define properties for the Parquet file output.
Option | Description |
Compression | Specify the codec to use to compress the Parquet output file. |
Version | Specify the version of Parquet you want to use. |
Row group size (MB) | Specify the group size for the rows. The default value is 0. |
Data page size (KB) | Specify the page size for the data. The default value is 0. |
Dictionary encoding | Select to use dictionary encoding, which builds a dictionary of the values encountered in a column. The dictionary page is written first, before the data pages of the column. If the dictionary grows larger than the Page size, whether in size or in number of distinct values, the encoding method reverts to the plain encoding type. |
Page size (KB) | Specify the page size when using dictionary encoding. The default value is 1024. |