Using the Hadoop File Input step on the Spark engine
You can set up the Hadoop File Input step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules.
General
Enter the following information in the transformation step name field.
- Step Name: Specifies the unique name of the transformation step on the canvas. The Step Name is set to Hadoop File Input by default.
Options
The Hadoop File Input step features several tabs with fields for setting environments and defining results. Each tab is described below.
File tab
In the File tab, you can specify options for the environment and other details for the file you want to input.
Option | Description |
Environment | Select the Hadoop cluster where your file resides. See Connect to a Hadoop cluster with the PDI client for instructions on establishing a connection. |
File/Folder | Specify the location and/or name of the text file to read. Click the Ellipsis (…) button to navigate to the source file or folder in the VFS browser. The Spark engine assumes HDFS. |
Wildcard (RegExp) | Specify the regular expression you want to use to select the files in the directory specified in the File/Folder field. For example, you may want to process all files that have a .txt extension. See Selecting a file using regular expressions for examples of regular expressions. |
Required | Indicate if the file is required. |
Include subfolders | Indicate if subdirectories (subfolders) are included. |
Accepting file names from a previous step
The fields in this section are either not used by the Spark engine or not implemented for Spark on AEL.
Show action buttons
When you have entered information in the File tab fields, select one of the following action buttons:
Button | Description |
Show filename(s) | Select to display a list of all files that are loaded based on the currently selected file definitions. |
Show file content | Select to display the raw content of the selected file. |
Show content from first data line | Select to display the content from the first data line for the selected file. |
Selecting a file using regular expressions
Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.
Folder | Regular Expression | Files Selected |
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt |
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA |
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9) |
/dirA/ | part-.* | Find all the Spark part files under the directory /dirA/ |
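These patterns use standard Java regular-expression syntax (java.util.regex). As a quick illustration, here is a minimal sketch, not part of PDI itself, showing how the patterns above match some hypothetical file names:

```java
import java.util.regex.Pattern;

public class WildcardDemo {
    public static void main(String[] args) {
        // Same syntax as the Wildcard (RegExp) field: java.util.regex
        Pattern userdataTxt = Pattern.compile(".*userdata.*\\.txt");
        Pattern sparkParts  = Pattern.compile("part-.*");

        // Hypothetical file names, for illustration only
        System.out.println(userdataTxt.matcher("export_userdata_2020.txt").matches()); // true
        System.out.println(userdataTxt.matcher("userdata.csv").matches());             // false
        System.out.println(sparkParts.matcher("part-00000").matches());                // true
    }
}
```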
Open file
When you select S3 in the Environment field, and then select the Ellipsis (…) button in the File/Folder field, the Open File dialog box appears. The fields in the Open File dialog box are not used by the Spark engine.
Content tab
In the Content tab, you can specify the format of the text files that are being read.
Option | Description |
Filetype | Select either CSV or Fixed length. Based on this selection, the PDI client launches a different helper GUI when you click Get Fields in the Fields tab. |
Separator | One or more characters that separate the fields in a single line of text. Typically, this is a semicolon ( ; ) or tab. |
Enclosure | Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional. |
Allow breaks in enclosed fields | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Escape | Specify an escape character (or characters) if you have these types of characters in your data. If you have a backslash ( \ ) as an escape character, the text Not the nine o\'clock news (with a single quote [ ' ] as the enclosure) is parsed as Not the nine o'clock news. A parsing sketch follows this table. |
Header & Number of header lines | Select Header if your text file has a header row (first lines in the file), and set Number of header lines to 1 (one). |
Footer & Number of footer lines | These fields are either not used by the Spark engine or not implemented for Spark on AEL. |
Wrapped lines & Number of times wrapped | These fields are either not used by the Spark engine or not implemented for Spark on AEL. |
Paged layout (printout), Number of lines per page, & Document header lines | These fields are either not used by the Spark engine or not implemented for Spark on AEL. |
Compression | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
No empty rows | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Include filename in output? | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Filename fieldname | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Rownum in output? | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Rownum fieldname & Rownum by file? | These fields are either not used by the Spark engine or not implemented for Spark on AEL. |
Format | Select UNIX. |
Encoding & Limit | These fields are either not used by the Spark engine or not implemented for Spark on AEL. |
Be lenient when parsing dates? | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
The date format Locale | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
Add filenames to result | This field is either not used by the Spark engine or not implemented for Spark on AEL. |
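To illustrate how the Separator, Enclosure, and Escape options interact, here is a minimal sketch, not PDI's actual parser, assuming single-character settings for all three options:

```java
import java.util.ArrayList;
import java.util.List;

public class EnclosureDemo {
    // Minimal sketch of separator/enclosure/escape handling; not PDI's parser.
    static List<String> parseLine(String line, char sep, char enc, char esc) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inEnclosure = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == esc && i + 1 < line.length()) {
                current.append(line.charAt(++i));  // keep the escaped character literally
            } else if (c == enc) {
                inEnclosure = !inEnclosure;        // toggle enclosure state
            } else if (c == sep && !inEnclosure) {
                fields.add(current.toString());    // a separator outside an enclosure ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The example from the Escape row: backslash escape, single-quote enclosure
        System.out.println(parseLine("'Not the nine o\\'clock news';10", ';', '\'', '\\'));
        // [Not the nine o'clock news, 10]
    }
}
```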
Error Handling tab
The fields in this tab are either not used by the Spark engine or not implemented for Spark on AEL.
Filters tab
Pentaho Engine: In the Filters tab, you can specify the lines you want to skip in the text file.
Option | Description |
Filter string | The string for which to search. |
Filter position | The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero, the filter string is searched for in the entire string. |
Stop on filter | Enter Y here if you want to stop processing the current text file when the filter string is encountered. |
Positive match | Turns the filter into positive mode when selected: only lines that match this filter are passed. Negative filters take precedence, and lines matching them are immediately discarded. |
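A hypothetical helper (not PDI's implementation) that mirrors the Filter position rule described above: a non-negative position anchors the match at that exact offset, while a negative position searches the entire line:

```java
public class FilterDemo {
    // Hypothetical helper mirroring the Filter position semantics
    static boolean matchesFilter(String line, String filterString, int filterPosition) {
        if (filterPosition < 0) {
            return line.contains(filterString);                    // search the whole line
        }
        return line.startsWith(filterString, filterPosition);      // match at the given offset
    }

    public static void main(String[] args) {
        System.out.println(matchesFilter("# comment line", "#", 0));  // true: '#' at position 0
        System.out.println(matchesFilter("data # note", "#", 0));     // false: '#' not at position 0
        System.out.println(matchesFilter("data # note", "#", -1));    // true: found anywhere in the line
    }
}
```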
Fields tab
In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.
Option | Description |
Name | Name of the field. |
Type | Type of the field can be either String, Date, or Number. |
Format | See Number formats for a complete description of format symbols. |
Position | The position is needed when processing the Fixed filetype. It is zero-based, so the first character starts at position 0. |
Length | The meaning of this field depends on the field type: for Number, the total number of significant figures; for String, the total length of the string; for Date, the length of the printed output (for example, a length of 4 returns only the year). |
Precision | The meaning of this field depends on the field type: for Number, the number of floating point digits; for String and Date, it is not used. |
Currency | Used to interpret numbers such as $10,000.00 or €5.000,00. |
Decimal | A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00. |
Group | A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00. A sketch showing how these symbols interact follows this section. |
Null if | Treat this value as null. |
Default | Default value in case the field in the text file was not specified (empty). |
Trim type | Trim the field value before processing. You can specify one of the following options: none, left, right, or both. |
Repeat | If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N). |
See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.
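The Currency, Decimal, and Group fields behave like the corresponding symbols in Java's java.text.DecimalFormat. The following sketch, with illustrative values only, shows how the same pattern parses US-style and European-style numbers once the decimal and grouping separators are swapped:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;

public class SeparatorDemo {
    public static void main(String[] args) throws ParseException {
        // US style: period as decimal separator, comma as grouping separator
        DecimalFormatSymbols us = new DecimalFormatSymbols();
        us.setDecimalSeparator('.');
        us.setGroupingSeparator(',');
        System.out.println(new DecimalFormat("#,##0.00", us).parse("10,000.00")); // 10000

        // European style: comma as decimal separator, period as grouping separator
        DecimalFormatSymbols eu = new DecimalFormatSymbols();
        eu.setDecimalSeparator(',');
        eu.setGroupingSeparator('.');
        System.out.println(new DecimalFormat("#,##0.00", eu).parse("5.000,00"));  // 5000
    }
}
```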
Number formats
Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.
Symbol | Location | Localized | Meaning |
0 | Number | Yes | Digit. |
# | Number | Yes | Digit, zero shows as absent. |
. | Number | Yes | Decimal separator or monetary decimal separator. |
- | Number | Yes | Minus sign. |
, | Number | Yes | Grouping separator. |
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix. |
; | Subpattern boundary | Yes | Separates positive and negative patterns. |
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage. |
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille. |
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator. |
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock. |
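The table above mirrors the pattern symbols of Java's java.text.DecimalFormat, so the behavior can be tried directly. A few self-contained examples (assuming US-style symbols, set explicitly below):

```java
import java.text.DecimalFormat;
import java.util.Locale;

public class PatternDemo {
    public static void main(String[] args) {
        Locale.setDefault(Locale.US); // assumption: US symbols, so the output matches the comments
        System.out.println(new DecimalFormat("#,##0.00").format(10000));  // 10,000.00
        System.out.println(new DecimalFormat("#%").format(0.53));         // 53%
        System.out.println(new DecimalFormat("'#'#").format(123));        // #123
        System.out.println(new DecimalFormat("# o''clock").format(9));    // 9 o'clock
    }
}
```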
Scientific notation
In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
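For instance, using java.text.DecimalFormat again:

```java
import java.text.DecimalFormat;

public class ScientificDemo {
    public static void main(String[] args) {
        // "0.###E0": one integer digit, up to three fraction digits, then the exponent
        System.out.println(new DecimalFormat("0.###E0").format(1234));    // 1.234E3
        System.out.println(new DecimalFormat("0.###E0").format(0.00123)); // 1.23E-3
    }
}
```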
Date formats
Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.
Letter | Date or Time Component | Presentation | Examples |
G | Era designator | Text | AD |
y | Year | Year | 1996 or 96 |
M | Month in year | Month | July, Jul, or 07 |
w | Week in year | Number | 27 |
W | Week in Month | Number | 2 |
D | Day in year | Number | 189 |
d | Day in month | Number | 10 |
F | Day of week in month | Number | 2 |
E | Day in week | Text | Tuesday or Tue |
a | am/pm marker | Text | PM |
H | Hour in day (0-23) | Number | 0 |
k | Hour in day (1-24) | Number | 24 |
K | Hour in am/pm (0-11) | Number | 0 |
h | Hour in am/pm (1-12) | Number | 12 |
m | Minute in hour | Number | 30 |
s | Second in minute | Number | 55 |
S | Millisecond | Number | 978 |
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00 |
Z | Time zone | RFC 822 time zone | -0800 |
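These letters mirror Java's java.text.SimpleDateFormat. A minimal sketch using a hypothetical timestamp:

```java
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatDemo {
    public static void main(String[] args) throws Exception {
        // Pattern letters from the table: yyyy (year), MM (month in year),
        // dd (day in month), HH (hour 0-23), mm (minute), Z (RFC 822 time zone)
        SimpleDateFormat in  = new SimpleDateFormat("yyyy-MM-dd HH:mm Z", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat("EEE, MMM d, yyyy h:mm a z", Locale.US);
        out.setTimeZone(TimeZone.getTimeZone("GMT-08:00"));

        // Hypothetical timestamp, for illustration only
        System.out.println(out.format(in.parse("1996-07-10 14:30 -0800")));
        // Wed, Jul 10, 1996 2:30 PM GMT-08:00
    }
}
```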
Metadata injection support
All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.