Hadoop File Input


The Hadoop File Input step reads data from a variety of text-file types stored on a Hadoop cluster. The most commonly used formats are comma-separated values (CSV) files generated by spreadsheets and fixed-width flat files.

You can use this step to specify a list of files to read, or a list of directories with wildcards in the form of regular expressions. You can also accept file names from a previous step.
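Because the wildcards are regular expressions rather than shell globs, a pattern such as `.*\.csv` takes the place of `*.csv`. The following Python sketch illustrates the general idea of selecting files from a directory listing with a regular expression. It is not the step's implementation: the `select_files` helper, the `hdfs://namenode/...` paths, and the choice to match the whole file name and skip subdirectories are all assumptions made for illustration.

```python
import re


def select_files(paths, directory, wildcard):
    """Return the paths directly under `directory` whose file name matches `wildcard`.

    Hypothetical helper: `wildcard` is treated as a regular expression that
    must match the entire file name (an assumption, not confirmed behavior).
    """
    pattern = re.compile(wildcard)
    prefix = directory.rstrip("/") + "/"
    selected = []
    for path in paths:
        if not path.startswith(prefix):
            continue  # not inside the requested directory
        name = path[len(prefix):]
        if "/" in name:
            continue  # assumption: files in subdirectories are skipped
        if pattern.fullmatch(name):
            selected.append(path)
    return selected


# Made-up directory listing used only for this example.
listing = [
    "hdfs://namenode/data/sales/2023-01.csv",
    "hdfs://namenode/data/sales/2023-02.csv",
    "hdfs://namenode/data/sales/readme.txt",
    "hdfs://namenode/data/sales/archive/2022-12.csv",
]

matches = select_files(listing, "hdfs://namenode/data/sales", r".*\.csv")
for path in matches:
    print(path)
```

In this sketch only the two CSV files directly under the `sales` directory are selected. A glob-style pattern such as `*.csv` is not a valid regular expression here and would make `re.compile` raise an error.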

Select an engine

You can run the Hadoop File Input step on either the Pentaho engine or the Spark engine. The transformation runs differently depending on the engine you select, so choose one of the following options to see how to set up the step for that engine.

Using the Hadoop File Input step on the Pentaho engine: Learn how to set up this step when using the Pentaho engine.
Using the Hadoop File Input step on the Spark engine: Learn how to set up this step when using the Spark engine.

For instructions on selecting an engine for your transformation, see Run configurations.
