Parquet Input
The Parquet Input step decodes the Parquet data format and extracts fields using the schema defined in the Parquet source files. The Parquet Input and Parquet Output transformation steps enable you to gather data from various sources and move that data into the Hadoop ecosystem in the Parquet format. You can execute the transformation with PDI, or with the Adaptive Execution Layer (AEL) using Spark as the processing engine.
Before using the Parquet Input step, you must install and configure the correct shim for your distribution, even if your Location is set to Local. For information on configuring a shim for a specific distribution, see Connect to a Hadoop cluster with the PDI client.
AEL considerations
When using the Parquet Input step with the Adaptive Execution Layer, the following factor affects performance and results:
- Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
General
The following fields are general to this transformation step:
| Field | Description |
|---|---|
| Step name | Specify the unique name of the Parquet Input step on the canvas. You can customize the name or use the provided default. |
| Location | Specify the file system or specific cluster where the source file you want to input is located. For the supported file system types, see Using the virtual file system browser in PDI. |
| Folder/File name | Specify the fully qualified URL of the source file or folder name for the input fields. |
| Ignore empty folder | Select to allow the transformation to proceed when the specified source file is not found in the designated location. If not selected, the specified source file must be present in the location for the transformation to proceed. |
Fields
The Fields section contains the following items:

- A Pass through fields from the previous step option that allows you to read the fields from the input file without redefining any of the fields.
- A table defining data about the columns to read from the Parquet file.
The table in the Fields section defines each field to read as input from the Parquet file, the associated PDI field name, and the data type of the field. Enter the information for the Parquet Input step fields as shown in the following table:
| Field | Description |
|---|---|
| Path | Specify the name of the field as it appears in the Parquet data file or files, along with its Parquet data type. |
| Name | Specify the name of the PDI input field. |
| Type | Specify the PDI data type of the input field. |
| Format | Specify the date format when the Type specified is Date. |
You can define the fields manually, or you can provide a path to a Parquet data file and click Get Fields to populate the fields. When the fields are retrieved, the Parquet type is converted to an appropriate PDI type, as shown in the table below. You can preview the data in the Parquet file by clicking Preview. You can change the PDI type by using the Type drop-down or by entering the type manually.
PDI types
The Parquet to PDI data type values are as shown in the table below:
| Parquet Type | PDI Type |
|---|---|
| ByteArray | Binary |
| Boolean | Boolean |
| Double | Number |
| Float | Number |
| FixedLengthByteArray | Binary |
| Decimal | BigNumber |
| Date | Date |
| Enum | String |
| Int8 | Integer |
| Int16 | Integer |
| Int32 | Integer |
| Int64 | Integer |
| Int96 | Timestamp |
| UInt8 | Integer |
| UInt16 | Integer |
| UInt32 | Integer |
| UInt64 | Integer |
| UTF8 | String |
| TimeMillis | Timestamp |
| TimestampMillis | Timestamp |
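The Parquet-to-PDI conversion in the table above is a straightforward lookup. The following sketch expresses it as a plain Python dictionary and mimics the Get Fields behavior on a hypothetical schema; this is purely illustrative and is not PDI's internal implementation (the `get_fields` helper and the tuple-based schema are assumptions made for the example).

```python
# Illustrative sketch only: the Parquet-to-PDI type conversion from the
# table above, expressed as a plain lookup table.
PARQUET_TO_PDI = {
    "ByteArray": "Binary",
    "Boolean": "Boolean",
    "Double": "Number",
    "Float": "Number",
    "FixedLengthByteArray": "Binary",
    "Decimal": "BigNumber",
    "Date": "Date",
    "Enum": "String",
    "Int8": "Integer",
    "Int16": "Integer",
    "Int32": "Integer",
    "Int64": "Integer",
    "Int96": "Timestamp",
    "UInt8": "Integer",
    "UInt16": "Integer",
    "UInt32": "Integer",
    "UInt64": "Integer",
    "UTF8": "String",
    "TimeMillis": "Timestamp",
    "TimestampMillis": "Timestamp",
}

def get_fields(schema):
    """Mimic the 'Get Fields' button: map (path, parquet_type) pairs
    to (path, parquet_type, pdi_type) rows. The tuple-based 'schema'
    argument is a hypothetical stand-in for a real Parquet schema."""
    return [(path, ptype, PARQUET_TO_PDI[ptype]) for path, ptype in schema]

# Example: populate fields from a small hypothetical Parquet schema.
rows = get_fields([("id", "Int64"), ("name", "UTF8"), ("price", "Decimal")])
for path, ptype, pdi in rows:
    print(f"{path}: {ptype} -> {pdi}")
```

Note that all Parquet integer widths, signed or unsigned, collapse to the single PDI Integer type under the Pentaho engine, whereas AEL preserves width distinctions (see the AEL types table below).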
AEL types
In AEL, the Parquet Input step automatically converts Parquet rows to Spark SQL rows. The following table lists the conversion types:
| Parquet Type | Spark Type Output |
|---|---|
| Boolean | Boolean |
| Int8 | Short |
| Int16 | Short |
| Int32 | Integer |
| Int64 | Long |
| Int96 | Timestamp |
| UInt8 | Short |
| UInt16 | Short |
| UInt32 | Integer |
| UInt64 | Long |
| Binary | Binary |
| FixedLengthByteArray | Binary |
| Float | Float |
| Double | Double |
| Decimal | BigNumber |
| UTF8 | String |
| VarChar | String |
| TimeMillis | Timestamp |
| TimestampMillis | Timestamp |
| Date | Date |
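The AEL conversion above can be sketched the same way. This lookup is illustrative only (it restates the table, not Spark's internal code); the comparison at the end highlights where the Spark mapping diverges from the Pentaho engine mapping, which matters if a transformation is moved between engines.

```python
# Illustrative sketch only: the Parquet-to-Spark SQL conversion used when
# the Parquet Input step runs under AEL, restated from the table above.
PARQUET_TO_SPARK = {
    "Boolean": "Boolean",
    "Int8": "Short",
    "Int16": "Short",
    "Int32": "Integer",
    "Int64": "Long",
    "Int96": "Timestamp",
    "UInt8": "Short",
    "UInt16": "Short",
    "UInt32": "Integer",
    "UInt64": "Long",
    "Binary": "Binary",
    "FixedLengthByteArray": "Binary",
    "Float": "Float",
    "Double": "Double",
    "Decimal": "BigNumber",
    "UTF8": "String",
    "VarChar": "String",
    "TimeMillis": "Timestamp",
    "TimestampMillis": "Timestamp",
    "Date": "Date",
}

# Unlike the Pentaho engine, which maps every Parquet integer width to the
# single PDI Integer type, Spark preserves width distinctions:
for ptype in ("Int8", "Int32", "Int64"):
    print(f"{ptype} -> {PARQUET_TO_SPARK[ptype]}")
```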
Metadata injection support
All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.