
Using the ORC Output step on the Spark engine


You can set up the ORC Output step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules.

Because of a Cloudera Distribution Spark (CDS) limitation, the Adaptive Execution Layer (AEL) does not support writing Hive tables containing data files in the ORC format from Spark applications in YARN mode. As an alternative, you can use the Parquet data format for columnar data with Impala.

General

Enter the following information in the transformation step fields:

Step name: Specify the unique name of the ORC Output step on the canvas. You can customize the name or leave it as the default.
Folder/File name: Specify the location and name of the file or folder. Click Browse to display the Open File window and navigate to the destination file or folder. For the supported file system types, see Connecting to Virtual File Systems. A folder is created that may contain multiple ORC files. Partial files will be written to this directory.
Overwrite existing output file: Select to overwrite an existing file that has the same file name and extension.

Options

The ORC Output step has two tabs, Fields and Options. Each tab is described below.

Fields tab


In the Fields tab, you can define the fields that make up the ORC type description created by this step. The options below configure the ORC type description.

ORC path: Specify the name of the field as it will appear in the ORC data file or files.
Name: Specify the name of the PDI field.
ORC type: Defines the data type of the field.
Precision: Specify the total number of digits in the number (only applies to the Decimal ORC type). The default value is 20.
Scale: Specify the number of digits after the decimal point (only applies to the Decimal ORC type). The default value is 10.
Default value: Specify the default value of the field if it is null or empty.
Null: Specifies if the field can contain null values.

To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.

You can define the fields manually, or you can click Get Fields to populate them from the incoming PDI data stream. During the retrieval of the fields, each PDI type is converted into an appropriate ORC type, as shown in the table below. You can also change the selected ORC type by using the ORC type drop-down or by entering the type manually.
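
For reference, the ORC type description that results from these field settings can be expressed with the Apache ORC Java API. The sketch below is illustrative only; the field names ("id", "name", "amount", "created") are hypothetical examples, and this is not the step's internal implementation.

    // Sketch: building an ORC type description with the Apache ORC Java API.
    import org.apache.orc.TypeDescription;

    public class OrcTypeDescriptionSketch {
        public static void main(String[] args) {
            // One struct field per row defined in the Fields tab.
            TypeDescription schema = TypeDescription.createStruct()
                .addField("id", TypeDescription.createInt())
                .addField("name", TypeDescription.createString())
                // Decimal ORC type with the step defaults: precision 20, scale 10.
                .addField("amount", TypeDescription.createDecimal()
                    .withPrecision(20)
                    .withScale(10))
                .addField("created", TypeDescription.createTimestamp());

            // Prints: struct<id:int,name:string,amount:decimal(20,10),created:timestamp>
            System.out.println(schema);
        }
    }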

AEL types

In AEL, the ORC Output step automatically converts an incoming Spark SQL row to a row in the ORC output file, where the Spark types determine the ORC types that are written to the ORC file.

Note: Some ORC types are not supported because there are no equivalent data types for conversion in Spark.

ORC Type Desired    Spark Type Used
Boolean             Boolean
TinyInt             Not supported
SmallInt            Short
Integer             Integer
BigInt              Long
Binary              Binary
Float               Float
Double              Double
Decimal             BigNumber
Char                Not supported
VarChar             Not supported
TimeStamp           TimeStamp
Date                Date
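
The same type mapping can be observed in a standalone Spark job that writes a DataFrame in ORC format. The following Java sketch is illustrative only; the column names and local output path are hypothetical, and in AEL the step performs this conversion for you.

    import java.math.BigDecimal;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SparkOrcTypeMappingSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("orc-type-mapping-sketch")
                .master("local[*]")
                .getOrCreate();

            // Spark SQL types and the ORC column types they become when written:
            // Boolean -> boolean, Short -> smallint, Long -> bigint, Decimal -> decimal.
            StructType schema = new StructType()
                .add("flag", DataTypes.BooleanType)
                .add("small", DataTypes.ShortType)
                .add("big", DataTypes.LongType)
                .add("amount", DataTypes.createDecimalType(20, 10));

            List<Row> rows = Arrays.asList(
                RowFactory.create(true, (short) 1, 100L, new BigDecimal("12.50")));

            Dataset<Row> df = spark.createDataFrame(rows, schema);

            // Writing in ORC format produces columns with the mapped ORC types.
            df.write().mode("overwrite").orc("/tmp/orc-type-mapping-sketch");
            spark.stop();
        }
    }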

Options tab


The following options in the Options tab define how the ORC output file will be created.

Compression: Specifies which codec is used to compress the ORC output file:

  • None

    No compression is used (default).

  • Zlib

    Writes the data blocks using the deflate algorithm, as specified in RFC 1951, and typically implemented using the zlib library.

  • LZO

    Writes the data blocks using LZO encoding, which works well for CHAR and VARCHAR columns that store very long character strings.

  • Snappy

    Writes the data blocks using Google's Snappy compression library. Each block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in that block.

Stripe size (MB): Defines the stripe size in megabytes. An ORC file has one or more stripes. Each stripe is composed of rows of data, an index of the data, and a footer containing metadata about the stripe's contents. Large stripe sizes enable efficient reads from HDFS. The default is 64. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC for additional information.
Compress size (KB): Defines the number of kilobytes in each compression chunk. The default is 256.
Inline Indexes: If checked, rows are indexed when written for faster filtering and random access on read.
Rows between entries: Defines the stride size, or the number of rows between index entries (must be greater than or equal to 1000). The stride is the block of data that the ORC reader can skip during a read operation based on the indexes. The default is 10000.
Include date in file name: Adds the system date to the filename with format yyyyMMdd (20181231, for example).
Include time in file name: Adds the system time to the filename with format HHmmss (235959, for example).
Specify date time format: Select to specify the date time format using the drop-down list.

Important: Due to licensing constraints, ORC does not ship with LZO compression libraries; these must be manually installed on each node if you want to use LZO compression.
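
For reference, these settings correspond closely to the options exposed by the Apache ORC writer API. The following Java sketch is illustrative only, not the step's implementation; the output path and schema are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcWriterOptionsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            TypeDescription schema = TypeDescription.fromString("struct<id:int,name:string>");

            Writer writer = OrcFile.createWriter(new Path("/tmp/options-sketch.orc"),
                OrcFile.writerOptions(conf)
                    .setSchema(schema)
                    .compress(CompressionKind.SNAPPY)   // Compression: Snappy
                    .stripeSize(64L * 1024 * 1024)      // Stripe size (MB): 64
                    .bufferSize(256 * 1024)             // Compress size (KB): 256
                    .rowIndexStride(10000));            // Rows between entries: 10000

            // No rows are written in this sketch; close() still produces a valid, empty file.
            writer.close();
        }
    }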

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.