Using the Text File Input step on the Spark engine

You can set up the Text file input step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules.

Note: If you are using this step to extract data from Amazon Simple Storage Service (S3), browse to the URI of the S3 system or specify the Uri field option in the Additional output fields tab. S3 and S3n are supported.

If you are running your transformation on the Spark engine, use the following instructions to set up the Text File Input step.

General

Enter the following information in the transformation step name field:

  • Step name: Specify the unique name of the Text file input step on the canvas. You can customize the name or leave it as the default.

You can use Preview rows to display the rows generated by this step. The Text file input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.

Options

The Text file input step features several tabs with fields. Each tab is described below.

File tab

Use the File tab to enter the following connection information for your source.

  • File or directory: Specify the source location if the source is not defined in a field. Click Browse to display the Open File window and navigate to the file or folder. For the supported file system types, see Connecting to Virtual File Systems. Click Add to include the source in the Selected files table. If the source location is defined in a field, use the Accept filenames from previous steps options to specify your file name.
  • Regular expression: Specify a regular expression to match filenames within a specified directory.
  • Exclude regular expression: Specify a regular expression to exclude filenames within a specified directory.

Regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

File Name | Regular Expression | Files Selected
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
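
The three patterns described in the table above can be tried out with a short Python sketch. The file names are invented for illustration, and the sketch assumes that a pattern must match the whole file name (which is why `re.fullmatch` is used); the step's own matching rules may differ in detail.

```python
import re

# Patterns as described in the table; the file names are made up for illustration.
examples = {
    r".*userdata.*\.txt": ["my_userdata_2024.txt", "userdata.csv"],
    r"AAA.*": ["AAA_report.txt", "xAAA.txt"],
    r"[A-Z][0-9].*": ["B7_sales.txt", "b7_sales.txt"],
}


def matches(pattern: str, file_name: str) -> bool:
    """Return True when the whole file name matches the pattern (assumed rule)."""
    return re.fullmatch(pattern, file_name) is not None


for pattern, names in examples.items():
    for name in names:
        print(f"{pattern!r:24} {name!r:26} {matches(pattern, name)}")
```

In each pair, the first name is selected and the second is rejected.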

Selected files table

The Selected files table shows files or directories to use as source locations for input. This table is populated by clicking Add after you specify a File or directory. The input step tries to connect to the specified file or directory when you click Add to include it in the table.

The table contains the following columns:

  • File/Directory: The source location added by clicking Add after specifying it in File or directory.
  • Wildcard (RegExp): Specify a regular expression to match filenames within a specified directory.
  • Exclude wildcard: Specify a regular expression to exclude filenames within a specified directory.
  • Required: Whether the source location is required for input.
  • Include subfolders: Whether subfolders are included within the source location.

Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it to the File or directory option.
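
As a rough illustration of how the Wildcard (RegExp), Exclude wildcard, and Include subfolders columns combine, the following Python sketch selects files from a directory. The function `select_files` and the sample file names are hypothetical, and the matching rule (the whole file name must match) is an assumption rather than the step's documented behavior.

```python
import os
import re
import tempfile
from typing import List, Optional


def select_files(directory: str, wildcard: str, exclude: Optional[str] = None,
                 include_subfolders: bool = False) -> List[str]:
    """Return sorted paths whose names match `wildcard` but not `exclude`."""
    include_re = re.compile(wildcard)
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if not include_re.fullmatch(name):
                continue  # name does not match the wildcard
            if exclude_re and exclude_re.fullmatch(name):
                continue  # name matches the exclude wildcard
            selected.append(os.path.join(root, name))
        if not include_subfolders:
            dirs.clear()  # prevents os.walk from descending further
    return sorted(selected)


# Build a small throwaway directory tree to demonstrate the selection.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "archive"))
    for rel in ("userdata_1.txt", "userdata_old.txt", "notes.md",
                os.path.join("archive", "userdata_2.txt")):
        open(os.path.join(tmp, rel), "w").close()

    top_only = [os.path.relpath(p, tmp) for p in
                select_files(tmp, r".*userdata.*\.txt", exclude=r".*_old\.txt")]
    recursive = [os.path.relpath(p, tmp) for p in
                 select_files(tmp, r".*userdata.*\.txt", exclude=r".*_old\.txt",
                              include_subfolders=True)]

print(top_only)
print(recursive)
```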

Accept file names

These fields are not used by the Spark engine.

Show action buttons

When you have entered information in the File tab fields, select an action button if you want to look at the source file names or data content.

  • Show filename(s): Select to display the file names of the sources connected to the step.
  • Show file content: Select to display the raw content of the selected file.
  • Show content from first data line: Select to display the content from the first data line for the selected file.

Content tab

Use the following options in the Content tab to specify the format of the source files.

  • Filetype: Select either CSV or Fixed length. Depending on the file type you select, a corresponding interface appears when you click Get Fields in the Fields tab.
  • Separator: Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert Tab to place a tab in the Separator field. The default value is semicolon (;).
  • Enclosure: Specify an optional character used to enclose a field if that field contains a separator character. The default value is double quotation marks (").
  • Allow breaks in enclosed fields: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Escape: Specify one or more characters that indicate that the following character is part of the regular text. For example, if a backslash (\) is the escape character and a single quote (') is an enclosure or separator character, then the text Not the nine o\'clock news is parsed as Not the nine o'clock news.
  • Header: Select if your text file has a header row (the first lines in the file). Set Header to 1 (one).
  • Footer: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Wrapped lines: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Paged layout (printout): This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Compression: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • No empty rows: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Include filename in output: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Rownum in output: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Format: Select UNIX.
  • Encoding: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Length: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Limit: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Be lenient when parsing dates?: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • The date format Locale: This field is either not used by the Spark engine or not implemented for Spark on AEL.
  • Add filenames to result: This field is either not used by the Spark engine or not implemented for Spark on AEL.
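
To see how Separator, Enclosure, Escape, and Header interact, the following sketch parses a small sample with Python's standard `csv` module. It is only an analogy for the step's parsing, not its implementation: the sample data is invented, and it uses a single quote as the enclosure (instead of the default double quotation mark) to mirror the Escape example above.

```python
import csv
import io

# One header line followed by one data line; the data is illustrative only.
# In the source below, "\\'" is a backslash followed by a single quote.
raw = "id;title;channel\n1;'Not the nine o\\'clock news';BBC\n"

reader = csv.reader(
    io.StringIO(raw),
    delimiter=";",    # Separator
    quotechar="'",    # Enclosure
    escapechar="\\",  # Escape
)

header = next(reader)  # Header set to 1: the first line holds the field names
rows = [dict(zip(header, values)) for values in reader]
print(rows)
```

The escaped quote is kept as part of the title field rather than ending the enclosure.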

Error Handling tab

For the Text file input step to work in the Spark environment, you must select the Ignore errors field. The other fields are ignored by the Spark engine, so these fields can remain empty.

Filters tab

In the Filters tab, you can specify the lines you want to skip in the text file.

  • Filter string: The string for which to search.
  • Filter position: The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero, the filter string is searched for in the entire line.
  • Stop on filter: Enter Y if you want to stop processing the current text file when the filter string is encountered.
  • Positive match: Turns the filter into positive mode when turned on. Only lines that match this filter are passed. Negative filters take precedence: lines that match them are discarded immediately.
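
One possible reading of the filter options is sketched below in Python. The `LineFilter` class and `apply_filters` function are hypothetical names, and the evaluation order (negative filters first, then positive filters) is an assumption drawn from the descriptions above, not a statement of how the step is implemented.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class LineFilter:
    text: str                     # Filter string
    position: int = -1            # Filter position; below zero searches the whole line
    stop_on_filter: bool = False  # Stop on filter
    positive: bool = False        # Positive match

    def matches(self, line: str) -> bool:
        if self.position < 0:
            return self.text in line
        return line.startswith(self.text, self.position)


def apply_filters(lines: Iterable[str], filters: List[LineFilter]) -> Iterator[str]:
    negatives = [f for f in filters if not f.positive]
    positives = [f for f in filters if f.positive]
    for line in lines:
        hit = next((f for f in negatives if f.matches(line)), None)
        if hit is not None:
            if hit.stop_on_filter:
                return  # stop processing the current file
            continue    # negative filters take precedence: skip the line
        if positives and not any(f.matches(line) for f in positives):
            continue    # positive mode: pass only matching lines
        yield line


lines = ["# comment", "DATA;1", "DATA;2", "junk", "END", "DATA;3"]
filters = [
    LineFilter("#", position=0),
    LineFilter("END", position=0, stop_on_filter=True),
    LineFilter("DATA", position=0, positive=True),
]
result = list(apply_filters(lines, filters))
print(result)
```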

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

  • Name: Name of the field.
  • Type: Type of the field. It can be String, Date, or Number.
  • Format: See Number formats for a complete description of format symbols.
  • Position: Required when processing the Fixed filetype. The position is zero-based, so the first character is at position 0.
  • Length: The value of this field depends on the field type:
    • Number: Total number of significant figures in a number.
    • String: Total length of the string.
    • Date: Total length of the printed output of the string. For example, 4 returns only the year.
  • Precision: The value of this field depends on the field type:
    • Number: Number of floating point digits.
    • String, Date, Boolean: Unused.
  • Currency: Used to interpret numbers such as $10,000.00 or €5.000,00.
  • Decimal: A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.
  • Group: A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
  • Null if: Treat this value as null.
  • Default: Default value in case the field in the text file was not specified (empty).
  • Trim type: The type of trimming to apply to the field before processing. You can specify one of the following options:
    • None
    • Left
    • Right
    • Both
  • Repeat: If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).
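
The following Python sketch illustrates how Position, Length, Trim type, Null if, Default, Decimal, and Group could act on one fixed-length line. `FieldSpec`, `read_field`, and `parse_number` are hypothetical helpers, the sample line is invented, and the order in which trimming, null replacement, and defaults are applied is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    name: str
    position: int                  # zero-based start of the field
    length: int                    # number of characters to read
    trim: str = "both"             # none, left, right, or both
    null_if: Optional[str] = None  # value to treat as null
    default: Optional[str] = None  # value to use when the field is empty


def read_field(line: str, spec: FieldSpec) -> Optional[str]:
    raw = line[spec.position:spec.position + spec.length]
    if spec.trim == "left":
        raw = raw.lstrip()
    elif spec.trim == "right":
        raw = raw.rstrip()
    elif spec.trim == "both":
        raw = raw.strip()
    if spec.null_if is not None and raw == spec.null_if:
        return None
    if raw == "":
        return spec.default
    return raw


def parse_number(text: str, decimal: str = ".", group: str = ",") -> float:
    """Drop the grouping symbol, then normalize the decimal symbol to a period."""
    return float(text.replace(group, "").replace(decimal, "."))


# A 23-character line: name (10), price (8), note (5). The region field is missing.
line = "Widget".ljust(10) + "5.000,00" + "N/A".ljust(5)
specs = [
    FieldSpec("name", 0, 10),
    FieldSpec("price", 10, 8),
    FieldSpec("note", 18, 5, null_if="N/A"),
    FieldSpec("region", 23, 4, default="unknown"),
]
record = {spec.name: read_field(line, spec) for spec in specs}
price = parse_number(record["price"], decimal=",", group=".")
print(record, price)
```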

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol | Location | Localized | Meaning
0 | Number | Yes | Digit.
# | Number | Yes | Digit, zero shows as absent.
. | Number | Yes | Decimal separator or monetary decimal separator.
- | Number | Yes | Minus sign.
, | Number | Yes | Grouping separator.
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Subpattern boundary | Yes | Separates positive and negative patterns.
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage.
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille.
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
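
The effect of a pattern such as 0.###E0 can be imitated in Python to make the rule concrete. The helper `format_scientific` below is a hypothetical re-implementation for this one pattern shape (one integer digit, up to three optional fraction digits, half-even rounding assumed); it does not use or reproduce the formatter that the step relies on.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_scientific(value, max_fraction_digits: int = 3) -> str:
    """Imitate a 0.###E0-style pattern: d.ddd mantissa followed by E and exponent."""
    number = Decimal(str(value))
    if number.is_zero():
        return "0E0"
    exponent = number.adjusted()  # position of the most significant digit
    step = Decimal(1).scaleb(-max_fraction_digits)  # for example 0.001
    mantissa = number.scaleb(-exponent).quantize(step, rounding=ROUND_HALF_EVEN)
    if abs(mantissa) >= 10:       # rounding carried over, for example 9.9996
        mantissa = mantissa.scaleb(-1)
        exponent += 1
    digits = format(mantissa.normalize(), "f")  # drops trailing zeros
    return f"{digits}E{exponent}"


print(format_scientific(1234))     # 1.234E3
print(format_scientific(0.00012))  # 1.2E-4
```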

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
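
To experiment with these pattern letters outside the step, the sketch below translates a small subset of them into Python `strptime` directives. The mapping and the function `to_strptime` are illustrative assumptions that cover only common runs such as yyyy, MM, dd, HH, mm, and ss. Quoted literals and the remaining letters are not handled, so any other run of letters raises an error.

```python
import re
from datetime import datetime

# Hypothetical mapping from a subset of pattern-letter runs to strptime directives.
_DIRECTIVES = {
    "yyyy": "%Y", "yy": "%y",
    "MMMM": "%B", "MMM": "%b", "MM": "%m",
    "dd": "%d", "HH": "%H", "hh": "%I",
    "mm": "%M", "ss": "%S", "a": "%p",
    "EEEE": "%A", "EEE": "%a", "Z": "%z",
}


def to_strptime(pattern: str) -> str:
    """Replace each run of one repeated letter with its strptime directive."""
    def replace_run(match):
        run = match.group(0)
        if run not in _DIRECTIVES:
            raise ValueError(f"unsupported pattern run: {run!r}")
        return _DIRECTIVES[run]

    return re.sub(r"([A-Za-z])\1*", replace_run, pattern)


directive = to_strptime("yyyy/MM/dd HH:mm:ss")
parsed = datetime.strptime("1996/07/10 15:08:56", directive)
print(directive, parsed)
```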

Additional output fields tab

The Additional output fields tab contains the following options to specify additional information about the file to process.

  • Short filename field: Specify the field that contains the filename without path information but with an extension.
  • Extension field: Specify the field that contains the extension of the filename.
  • Path field: Specify the field that contains the path in operating system format.
  • Size field: Specify the field that contains the size of the data.
  • Is hidden field: Specify the field indicating if the file is hidden or not (Boolean).
  • Last modification field: Specify the field indicating the date of the last time the file was modified.
  • Uri field: Specify the field that contains the URI. If you are using this step to extract data from Amazon Simple Storage Service (S3), browse to the URI of the S3 system or use this option. S3 and S3n are supported.
  • Root uri field: Specify the URI output field name.
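
As a rough picture of what these fields hold, the Python sketch below derives comparable values for a local file. The function `describe_file` and its dictionary keys are hypothetical, the extension is assumed to exclude the leading dot, and hidden files are detected by a leading dot in the name (a POSIX-style convention); the step itself may compute these values differently.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def describe_file(path: str) -> dict:
    """Collect values comparable to the additional output fields for one file."""
    p = Path(path).resolve()
    info = p.stat()
    return {
        "short_filename": p.name,              # name with extension, no path
        "extension": p.suffix.lstrip("."),     # assumed to exclude the dot
        "path": str(p),                        # operating system format
        "size": info.st_size,                  # size in bytes
        "is_hidden": p.name.startswith("."),   # POSIX-style assumption
        "last_modified": datetime.fromtimestamp(info.st_mtime),
        "uri": p.as_uri(),
    }


# Create a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "userdata.txt"
    sample.write_bytes(b"a;b;c\n")
    fields = describe_file(str(sample))

print(fields)
```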

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.