If you are running your transformation on the Pentaho engine, use the following instructions to set up the ORC Output step.
Enter the following information in the transformation step fields:
|Specify the unique name of the ORC Output step on the canvas. You can customize the name or leave it as the default.
|Specify the location and name of the file or folder. Click Browse to display the Open File window and navigate to the destination file or folder. For the supported file system types, see Connecting to Virtual File Systems. The ORC files are created.
|Overwrite existing output file
|Select to overwrite an existing file that has the same file name and extension.
The ORC Output step features two tabs with fields. Each tab is described below.
In the Fields tab, you can define fields that make up the ORC type description created by this step. The table below describes each of the options for configuring the ORC type description.
|Specify the name of the field as it will appear in the ORC data file or files.
|Specify the name of the PDI field.
|Defines the data type of the field.
|Specify the total number of digits in the number (only applies to the Decimal ORC type). The default value is 20.
|Specify the number of digits after the decimal point (only applies to the Decimal ORC type). The default value is 10.
|Specify the default value of the field if it is null or empty.
|Specifies if the field can contain null values.
To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
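The note above, along with the Decimal defaults (20 total digits, 10 after the decimal point), can be checked before the transformation runs. The following is a minimal sketch in plain Python, not part of PDI. The field dictionaries, the `missing_defaults` and `fits_decimal` helpers, and the sample field names are all hypothetical. It also treats extra fractional digits as "does not fit", whereas an actual writer might round them instead.

```python
from decimal import Decimal

# Hypothetical field definitions that mirror the Fields tab columns.
fields = [
    {"name": "id", "default": "0", "nullable": False},
    {"name": "amount", "default": "", "nullable": False},  # no default, nulls not allowed
    {"name": "comment", "default": "", "nullable": True},
]


def missing_defaults(field_defs):
    """Return the names of fields that disallow nulls but have no default value."""
    return [
        f["name"]
        for f in field_defs
        if not f["nullable"] and f.get("default") in (None, "")
    ]


def fits_decimal(value, precision=20, scale=10):
    """Check whether value fits within `precision` total digits, `scale` of them fractional.

    Assumption: at most (precision - scale) digits are allowed before the
    decimal point, and extra fractional digits are rejected rather than rounded.
    """
    d = Decimal(value)
    if not d.is_finite():
        return False
    _, digits, exponent = d.as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale


if __name__ == "__main__":
    print("Fields needing a default value:", missing_defaults(fields))
    print(fits_decimal("1234567890.0123456789"))
    print(fits_decimal("12345678901.5"))
```

With the sample definitions, only the `amount` field is reported, because it disallows nulls and has an empty default value.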
You can define the fields manually, or you can provide a path to the PDI data stream and click Get Fields to populate all the fields. When the fields are retrieved, each PDI type is converted to an applicable ORC type. You can change the selected ORC type by using the Type drop-down or by entering the type manually.
The table below shows how a PDI type is converted to an applicable ORC type.
The following options in the Options tab define how the ORC output file will be created.
Specifies which codec is used to compress the ORC output file:
|Stripe size (MB)
|Defines the stripe size in megabytes. An ORC file has one or more stripes. Each stripe is composed of rows of data, an index of the data, and a footer containing metadata about the stripe’s contents. Large stripe sizes enable efficient reads from HDFS. The default is 64. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC for additional information.
|Compress size (KB)
|Defines the number of kilobytes in each compression chunk. The default is 256.
|If checked, rows are indexed when written for faster filtering and random access on read.
|Rows between entries
|Defines the stride size or number of rows between index entries (must be greater than or equal to 1000). The stride size is the block of data that can be skipped by the ORC reader during a read operation based on the indexes. The default is 10000.
|Include date in file name
|Adds the system date to the filename with format yyyyMMdd (20181231 for example).
|Include time in file name
|Adds the system time to the filename with format HHmmss (235959 for example).
|Specify date time format
|Select to specify the date time format using the drop-down list.
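To make the Options tab settings concrete, the sketch below models them as a small Python data class. It is illustrative only and is not PDI code. The class name `OrcOptions`, its attribute names, the `problems` and `stamp` methods, the default for the row index flag, and the underscore used to join the date and time are all assumptions. The numeric defaults (64, 256, 10000), the minimum of 1000 rows between entries, and the date and time examples come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OrcOptions:
    stripe_size_mb: int = 64           # documented default
    compress_size_kb: int = 256        # documented default
    row_index: bool = True             # assumption: the default is not stated
    rows_between_entries: int = 10000  # documented default
    include_date: bool = False
    include_time: bool = False

    def problems(self):
        """Return a list of human-readable problems with the settings."""
        found = []
        if self.stripe_size_mb <= 0:
            found.append("Stripe size (MB) must be positive")
        if self.compress_size_kb <= 0:
            found.append("Compress size (KB) must be positive")
        if self.row_index and self.rows_between_entries < 1000:
            found.append("Rows between entries must be at least 1000")
        return found

    def stamp(self, now):
        """Build the date and time text that would be added to the file name."""
        parts = []
        if self.include_date:
            parts.append(now.strftime("%Y%m%d"))  # for example 20181231
        if self.include_time:
            parts.append(now.strftime("%H%M%S"))  # for example 235959
        return "_".join(parts)  # the separator is an assumption


if __name__ == "__main__":
    options = OrcOptions(include_date=True, include_time=True)
    print(options.problems())
    print(options.stamp(datetime(2018, 12, 31, 23, 59, 59)))
```

Where the step actually places the date and time within the file name is not described here, so the sketch returns only the stamp text and does not build a full path.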