Using the Hadoop File Input step on the Spark engine

Last updated
Save as PDF

You can set up the Hadoop File Input step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules.

General

Enter the following information in the transformation step name field.

Step Name: Specifies the unique name of the transformation step on the canvas. The Step Name is set to Hadoop File Input by default.

Options

The Hadoop File Input step features several tabs with fields for setting environments and defining results. Each tab is described below.

File tab

In the File tab, you can specify options for the environment and other details for the file you want to input.

Option	Description
Environment	Select the Hadoop cluster where your file resides. See Connect to a Hadoop cluster with the PDI client for instructions on establishing a connection.
File/Folder	Specify the location and/or name of the text file to read. Click the Ellipsis (…) button to navigate to the source file or folder in the VFS browser.The Spark engine assumes HDFS.
Wildcard (RegExp)	Specify the regular expression you want to use to select the files in the directory specified in the File/Folder field. For example, you may want to process all files that have a `.txt` output. See Selecting a file using regular expressions for examples of regular expressions.
Required	Indicate if the file is required.
Include subfolders	Indicate if subdirectories (subfolders) are included.

Accepting file names from a previous step

These fields are not used by the Spark engine.

Show action buttons

Action buttons

When you have entered information in the File tab fields, select one of the following action buttons:

Button	Description
Show filename(s)	Select to display a list of all files that are loaded based on the current selected file definitions.
Show file content	Select to display the raw content of the selected file.
Show content from first data line	Select to display the content from the first data line for the selected file.

Selecting a file using regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

File Name	Regular Expression	Files Selected
/dirA/	.userdata.\.txt	Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/	AAA.\*	Find all files in /dirB/ with names that start with AAA
/dirC/	\[ENG:A-Z\]\[ENG:0-9\].\*	Find all files in /dirC/ with names that start with a capital and followed by a digit (A0-Z9)
/dirA/	part-.*	Find all the Spark `part` files under the directory /dirA/

Open file

When you select S3 in the Environment field, and then select the Ellipsis (…) button in the File/Folder field, the Open File dialog box appears. The fields in the Open File dialog box are not used by the Spark engine.

Content tab

In the Content tab, you can specify the format of the text files that are being read.

Option	Description
Filetype	Select either CSV or Fixed length. Based on this selection, the PDI client launches a different helper GUI when you click Get Fields in the Fields tab.
Separator	One or more characters that separate the fields in a single line of text. Typically, this is a semicolon ( ; ) or tab.
Enclosure	Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional.
Allow breaks in enclosed fields	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Escape	Specify an escape character (or characters) if you have these types of characters in your data. If you have a backslash ( / ) as an escape character, the text `Not the nine o\'clock news` (with a single quote \[ ' \] as the enclosure) is parsed as Not the nine o'clock news.
Header & Number of header lines	Select if your text file has a header row (first lines in the file). Set Header & Number of header lines to `1` (one).
Footer & Number of footer lines	These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Wrapped lines & Number of times wrapped	These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Paged layout (printout), Number of lines per page, & Document header lines	These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Compression	This field is either not used by the Spark engine or not implemented for Spark on AEL.
No empty rows	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Include filename in output?	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Filename fieldname	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Rownum in output?	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Rownum fieldname & Rownum by file?	These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Format	Select UNIX.
Encoding & Limit	These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Be lenient when parsing dates?	This field is either not used by the Spark engine or not implemented for Spark on AEL.
The date format Locale	This field is either not used by the Spark engine or not implemented for Spark on AEL.
Add filenames to result	This field is either not used by the Spark engine or not implemented for Spark on AEL.

Error Handling tab

These fields are not used by the Spark engine.

Filters tab

Pentaho Engine: In the Filters tab, you can specify the lines you want to skip in the text file.

Option	Description
Filter string	The string for which to search.
Filter position	The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero, the filter string is searched for in the entire string.
Stop on filter	Enter `Y` here if you want to stop processing the current text file when the filter string is encountered.
Positive match	Turns filters into positive mode when turned on. Only lines that match this filter will be passed. Negative filters take precedence and are immediately discarded.

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

Option	Description
Name	Name of the field.
Type	Type of the field can be either String, Date, or Number.
Format	See Number formats for a complete description of format symbols.
Position	The position is needed when processing the Fixed filetype. It is zero-based, so the first character is starting with position 0.
Length	The value of this field depends on format: Number Total number of significant figures in a number. String Total length of string. Date Total length of printed output of the string. For example, `4` only returns the year.
Precision	The value of this field depends on format: Number Number of floating point digits. String, Date, Boolean Unused.
Currency	Used to interpret numbers such as `$10,000.00` or `E5.000,00`.
Decimal	A decimal point can be a period (`.`) as in `10,000.00` or it can be a comma (`,`) as in `5.000,00`.
Group	A grouping can be a comma (`,`) as in `10,000.00` or a period (`.`) as in `5.000,00`.
Null if	Treat this value as null.
Default	Default value in case the field in the text file was not specified (empty).
Trim type	Trim the type before processing. You can specify one of the following options: None Left Right Both
Repeat	If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol	Location	Localized	Meaning
0	Number	Yes	Digit.
#	Number	Yes	Digit, zero shows as absent.
.	Number	Yes	Decimal separator or monetary decimal separator.
-	Number	Yes	Minus sign.
,	Number	Yes	Grouping separator.
E	Number	Yes	Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
;	Subpattern boundary	Yes	Separates positive and negative patterns.
%	Prefix or suffix	Yes	Multiply by 100 and show as percentage.
‰(/u2030)	Prefix or suffix	Yes	Multiply by 1000 and show as per mille.
¤ (/u00A4)	Prefix or suffix	No	Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
‘	Prefix or suffix	No	Used to quote special characters in a prefix or suffix, for example, '#'# formats `123` to #123. To create a single quote itself, use two in a row: `# o''clock`.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter	Date of Time Component	Presentation	Examples
G	Era designator	Text	`AD`
y	Year	Year	`1996` or `96`
M	Month in year	Month	`July`, `Jul`, or `07`
w	Week in year	Number	`27`
W	Week in Month	Number	`2`
D	Day in year	Number	`189`
d	Day in month	Number	`10`
F	Day of week in month	Number	`2`
E	Day in week	Text	`Tuesday` or `Tue`
a	am/pm marker	Text	`PM`
H	Hour in day (0-23)	Number 0	n/a
k	Hour in day (1-24)	Number 24	n/a
K	Hour in am/pm (0-11)	Number 0	n/a
h	Hour in am/pm (1-12)	Number 12	n/a
m	Minute in hour	Number 30	n/a
s	Second in minute	Number 55	n/a
S	Millisecond	Number 978	n/a
z	Time zone	General time zone	`Pacific Standard Time`, `PST`, or `GMT-08:00`
Z	Time zone	RFC 822 time zone	`-0800`

Metadata injection support

All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.