Parquet Input
The Parquet Input step decodes Parquet data and extracts fields based on the structure defined in the source files. For big data users, the Parquet Input and Parquet Output transformation steps ease the process of gathering raw data from various sources and moving that data into the Hadoop ecosystem to create a useful, summarized data set for analysis. Depending on your setup, you can execute the transformation within PDI or within the Adaptive Execution Layer (AEL), using Spark as the processing engine.
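When the transformation runs in AEL, Spark performs the Parquet read rather than the PDI engine. As an illustration only, and not the AEL mechanism itself, the minimal Java sketch below shows how Spark's DataFrame API reads a Parquet source and surfaces its schema and first rows; the application name, master setting, and file path are hypothetical placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetSparkSketch {
    public static void main(String[] args) {
        // Hypothetical local Spark session; AEL manages the Spark context for you in practice.
        SparkSession spark = SparkSession.builder()
                .appName("parquet-input-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical source path; point this at your own Parquet data.
        Dataset<Row> rows = spark.read().parquet("/tmp/raw_events.parquet");

        rows.printSchema();   // the fields the file defines
        rows.show(10);        // a small preview of the data

        spark.stop();
    }
}
```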
Before using the Parquet Input step, you will need to select and configure the shim for your distribution, even if your Location is set to 'Local'. The Parquet Input step requires the shim classes to read the data correctly. For information on configuring a shim for a specific distribution, see Set Up Pentaho to Connect to a Hadoop Cluster.
Options

Option | Description |
---|---|
Step Name | Specifies the unique name of the Parquet Input step on the canvas. You can customize the name or leave it as the default. |
Location | Indicates the file system or specific cluster where the source file you want to read is located. |
Folder/File name | Specifies the full path and name of the source file for the input fields. |
Fields | Specifies the input fields to read from the source file (see the reading sketch following this table). |
Get Fields | Click this button to insert the list of fields from the input stream into the Fields table. |
Preview | Click this button to preview the rows generated by this step. |
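The Get Fields and Preview buttons surface the schema and the first rows of the source file. As a rough illustration of what such a read involves, using the Apache Parquet Java library directly rather than the PDI step's own implementation, the sketch below prints a file's Avro schema and a handful of records; the file path is a hypothetical placeholder.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetLocalSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local source file; replace with your own Folder/File name value.
        Path source = new Path("/tmp/raw_events.parquet");

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(source).build()) {
            GenericRecord record = reader.read();
            if (record != null) {
                // The Avro schema lists the fields the step would offer via Get Fields.
                System.out.println(record.getSchema().toString(true));
            }
            // Print a handful of records, similar in spirit to the Preview action.
            int shown = 0;
            while (record != null && shown < 10) {
                System.out.println(record);
                record = reader.read();
                shown++;
            }
        }
    }
}
```

In the step itself, this work is handled through the shim classes described above, which is why a shim must be configured even when the Location is 'Local'.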
Metadata Injection Support
All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.