Using the Text File Input step on the Pentaho engine

If you are running your transformation on the Pentaho engine, use the following instructions to set up the Text File Input step.

General

Enter the following information in the transformation step name field:

Step name: Specify the unique name of the Text file input step on the canvas. You can customize the name or leave it as the default.

You can use Preview rows to display the rows generated by this step. The Text file input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.

Options

The Text file input step features several tabs with fields. Each tab is described below.

File tab

Use the File tab to enter the following connection information for your source.

Option	Description
File or directory	Specify the source location if the source is not defined in a field. Click Browse to display the Open File window and navigate to the file or folder. For the supported file system types, see Connecting to Virtual File Systems. Click Add to include the source in the Selected files table. If the source location is defined in a field, use the Accept filenames from previous steps options to specify your file name.
Regular expression	Specify a regular expression to match filenames within a specified directory.
Exclude regular expression	Specify a regular expression to exclude filenames within a specified directory.

Regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

Directory	Regular Expression	Files Selected
/dirA/	`.*userdata.*\.txt`	Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/	`AAA.*`	Find all files in /dirB/ with names that start with AAA
/dirC/	`[A-Z][0-9].*`	Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
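
As a rough illustration of how patterns of the kind listed in the table select file names, here is a minimal Java sketch. It assumes the expression is applied to the bare file name (not the full path) and must match the whole name, which is how `java.util.regex.Matcher.matches()` behaves. The class name, method name, and sample file names are hypothetical and are not part of the step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WildcardRegexSketch {
    // Keep names that fully match `include` and do not fully match `exclude`.
    // Pass null (or an empty string) for `exclude` to skip the exclusion check.
    static List<String> select(List<String> names, String include, String exclude) {
        Pattern inc = Pattern.compile(include);
        Pattern exc = (exclude == null || exclude.isEmpty()) ? null : Pattern.compile(exclude);
        return names.stream()
                .filter(n -> inc.matcher(n).matches())
                .filter(n -> exc == null || !exc.matcher(n).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical directory listing.
        List<String> names = Arrays.asList(
                "my_userdata_01.txt", "userdata.csv", "AAA_report.txt", "B7.log", "b7.log");

        System.out.println(select(names, ".*userdata.*\\.txt", null)); // [my_userdata_01.txt]
        System.out.println(select(names, "AAA.*", null));              // [AAA_report.txt]
        System.out.println(select(names, "[A-Z][0-9].*", null));       // [B7.log]
        System.out.println(select(names, ".*\\.txt", "AAA.*"));        // [my_userdata_01.txt]
    }
}
```

The last call mirrors using Regular expression together with Exclude regular expression: the include pattern picks every .txt file, and the exclude pattern then drops the names that start with AAA.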

Selected files table

The Selected files table shows files or directories to use as source locations for input. This table is populated by clicking Add after you specify a File or directory. The input step tries to connect to the specified file or directory when you click Add to include it in the table.

The table contains the following columns:

Column	Description
File/Directory	The source location indicated by clicking Add after specifying it in File or directory.
Wildcard (RegExp)	Specify a regular expression to match filenames within a specified directory.
Exclude wildcard	Specify a regular expression to exclude filenames within a specified directory.
Required	Required source location for input.
Include subfolders	Whether subfolders are included within the source location.

Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it to the File or directory option.

Accept file names

You can specify your file name and pass it to the input step, which allows the file name to come from any source, such as a text file or database table.

Option	Description
Accept filenames from previous step	Select to get file names from previous steps.
Pass through fields from previous step	Select to get field information from previous steps.
Step to read file names from	Enter the name of the step from which to read the file names.
Field in the input to use as filename	Enter the name of the field in the input step to determine which file name to use.

Show action buttons

When you have entered information in the File tab fields, select an action button if you want to look at the source file names or data content.

Button	Description
Show filename(s)	Select to display the file names of the sources connected to the step.
Show file content	Select to display the raw content of the selected file.
Show content from first data line	Select to display the content from the first data line for the selected file.

Content tab

Use the following options in the Content tab to specify the format of the source files.

Option	Description
Filetype	Select either CSV or Fixed length. Depending on the file type you select, a corresponding interface appears when you click Get Fields in the Fields tab.
Separator	Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert Tab to place a tab in the Separator field. The default value is semicolon (;).
Enclosure	Specify an optional character used to enclose a field if that field contains a separator character. The default value is double quotation marks (").
Allow breaks in enclosed fields	Not implemented.
Escape	Specify one or more characters to indicate that another character is part of the regular text. For example, if a backslash (\) is the escape character and a single quote (') is an enclosure or separator character, then the text `Not the nine o\'clock news` is parsed as `Not the nine o'clock news`.
Header	Select if your text file has a header row (the first lines in the file). Use Number of header lines to specify how many header lines the file contains.
Footer	Select if your text file has a footer row (the last lines in the file). Use Number of footer lines to specify how many footer lines the file contains.
Wrapped lines	Select if you work with data lines that have wrapped beyond a specific page limit. You can use Number of times wrapped to specify the number of times the line is wrapped. Headers and footers are never considered wrapped.
Paged layout (printout)	Select when other text handling options (above) fail on a text file designed to be output to a line printer. You can use Document header lines to skip introductory texts and Number of lines per page to position the data lines.
Compression	Select if your text file is in a ZIP or GZip archive. Only the first file in the archive is read.
No empty rows	Select if you do not want to send empty rows to the next steps.
Include filename in output	Select if you want the file name to be part of the output, and use Filename fieldname to enter the name of the field that contains the file name.
Rownum in output	Select if you want the row number to be part of the output. You can use Rownum fieldname to enter the name of the field that contains the row number. Select Rownum by file if you want to allow the row number to be reset per file.
Format	Select the file format, which can be either DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.
Encoding	Select the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, the PDI client searches your system for available encodings.
Length	Select how the length of a field is measured: Characters or Bytes.
Limit	Specify a limit on the number of records generated from this step. Specify zero (`0`) for an unlimited number of records.
Be lenient when parsing dates?	Clear the check box if you want strict parsing of date fields. If selected, out-of-range dates roll over to the next valid date. For example, `Jan 32nd` becomes `Feb 1st`.
The date format Locale	Specify the locale to use to parse dates written in full, such as `February 2nd, 2006`. For example, parsing February 2nd, 2006, on a system set to French (fr_FR) would not work because February is called Février in that locale.
Add filenames to result	Select to add file names to a resulting list of file names.
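
To make the interaction between Separator, Enclosure, and Escape more concrete, the following Java sketch splits a single line into fields. It is a simplified illustration and not the step's actual parser: it assumes single-character options, drops enclosure characters from the output, and treats the character after an escape as literal text. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SeparatorEnclosureSketch {
    // Split one line into fields using a separator, an enclosure, and an escape character.
    static List<String> split(String line, char separator, char enclosure, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // escaped character is kept as regular text
            } else if (c == enclosure) {
                enclosed = !enclosed;             // toggle enclosure state, drop the character
            } else if (c == separator && !enclosed) {
                fields.add(current.toString());   // separator outside an enclosure ends the field
                current.setLength(0);
            } else {
                current.append(c);                // includes separators inside an enclosure
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Defaults from the table: semicolon separator, double-quote enclosure.
        System.out.println(split("1;\"Smith; John\";x", ';', '"', '\\'));
        // prints [1, Smith; John, x]

        // Single quote as enclosure, backslash as escape.
        System.out.println(split("'Not the nine o\\'clock news';x", ';', '\'', '\\'));
        // prints [Not the nine o'clock news, x]
    }
}
```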

Error Handling tab

In the Error Handling tab, you can specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, wrong number of fields, and premature line ends. The following table contains options for error handling:

Option	Description
Ignore errors?	Select if you want to ignore errors during parsing.
Skip error files?	Select if you want to skip those files that contain errors. You can generate a file that contains a listing of files where the errors occur. Otherwise, files with errors are not skipped, and the files that have parsing errors are empty (null).
Error file field name	Specify an error file name if you want to add field names where errors occurred.
File error message field name	Specify an error message field name if you want to add field names where errors occurred in the error file.
Skip error lines?	Select if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers where the errors occur. Otherwise, lines with errors are not skipped, and the fields that have parsing errors are empty (null).
Error count fieldname	Specify the field name if you want to add a field containing the number of errors on the line to the output rows.
Error fields fieldname	Specify the field name if you want to add a field containing the names of fields where errors occurred to the output rows.
Error text fieldname	Specify the field name if you want to add a field containing descriptions of the parsing error occurrences to the output rows.
Warning files directory	Specify the location of the directory where warnings are placed if they are generated. The name of the resulting file is <warning dir>/filename.<date_time>.<warning extension>.
Error files directory	Specify the location of the directory where errors are placed if they occur. The name of the resulting file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
Failing line numbers files directory	Specify the location of the directory where parsing errors on a line are placed if they occur. The name of the resulting file is <errorline dir>/filename.<date_time>.<errorline extension>.

Filters tab

The Filters tab contains a table in which you can specify the lines you want to skip in the text file.

Column	Description
Filter string	The string that you want to search for.
Filter position	The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero (0), the filter string is searched for in the entire string.
Stop on filter	Enter `Y` if you want to stop processing the current text file when the filter string is encountered. Enter `N` to continue processing after encountering the string.
Positive match	Enter `Y` if you want to process lines that match the filter string. Enter `N` to ignore matching lines.
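
The following Java sketch shows one way to read the Filter string, Filter position, and Positive match columns for a single filter. It rests on assumptions: the comparison is exact and case-sensitive, a position below zero searches the whole line, and Stop on filter is not modeled. The class and method names are hypothetical, and the step's own matching rules may differ in detail.

```java
class LineFilterSketch {
    // True if the filter string occurs at the given zero-based position,
    // or anywhere in the line when the position is below zero.
    static boolean matches(String line, String filter, int position) {
        if (position < 0) {
            return line.contains(filter);
        }
        return line.startsWith(filter, position);
    }

    // Positive match: keep only matching lines.
    // Otherwise: skip matching lines and keep the rest.
    static boolean keep(String line, String filter, int position, boolean positiveMatch) {
        boolean hit = matches(line, filter, position);
        return positiveMatch ? hit : !hit;
    }

    public static void main(String[] args) {
        // Skip comment lines that start with '#' (position 0, Positive match = N).
        System.out.println(keep("# generated file", "#", 0, false)); // false
        System.out.println(keep("id;name", "#", 0, false));          // true

        // Process only lines containing TOTAL anywhere (position -1, Positive match = Y).
        System.out.println(keep("x;TOTAL;42", "TOTAL", -1, true));   // true
        System.out.println(keep("x;subtotal;7", "TOTAL", -1, true)); // false
    }
}
```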

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

Option	Description
Name	Name of the field.
Type	Type of the field can be either String, Date, or Number.
Format	See Number formats for a complete description of format symbols.
Position	The position is needed when processing the Fixed filetype. It is zero-based, so the first character is at position 0.
Length	Depends on the field type. Number: total number of significant figures in the number. String: total length of the string. Date: total length of the printed output of the string; for example, `4` returns only the year.
Precision	Depends on the field type. Number: number of floating point digits. String, Date, Boolean: unused.
Currency	Used to interpret numbers such as `$10,000.00` or `€5.000,00`.
Decimal	A decimal point can be a period (`.`) as in `10,000.00` or it can be a comma (`,`) as in `5.000,00`.
Group	A grouping can be a comma (`,`) as in `10,000.00` or a period (`.`) as in `5.000,00`.
Null if	Treat this value as null.
Default	Default value in case the field in the text file was not specified (empty).
Trim type	Trim the field value before processing. You can specify one of the following options: None, Left, Right, or Both.
Repeat	If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).
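
As an informal illustration of the Trim type and Repeat options, the Java sketch below trims a value on the chosen side and fills empty values in a column with the last non-empty one. It assumes that trimming removes whitespace only and that a leading empty value with nothing to repeat stays null. The class, enum, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldOptionsSketch {
    enum TrimType { NONE, LEFT, RIGHT, BOTH }

    // Remove whitespace from the chosen side(s) of a value.
    static String trim(String value, TrimType type) {
        switch (type) {
            case LEFT:  return value.replaceAll("^\\s+", "");
            case RIGHT: return value.replaceAll("\\s+$", "");
            case BOTH:  return value.trim();
            default:    return value;
        }
    }

    // Repeat = Y: replace each empty value with the last non-empty value seen.
    static List<String> repeat(List<String> column) {
        List<String> out = new ArrayList<>();
        String last = null;
        for (String value : column) {
            if (value == null || value.isEmpty()) {
                out.add(last);
            } else {
                last = value;
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("[" + trim("  ab  ", TrimType.LEFT) + "]");  // [ab  ]
        System.out.println("[" + trim("  ab  ", TrimType.RIGHT) + "]"); // [  ab]
        System.out.println(repeat(Arrays.asList("EU", "", "", "US", "")));
        // prints [EU, EU, EU, US, US]
    }
}
```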

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol	Location	Localized	Meaning
0	Number	Yes	Digit.
#	Number	Yes	Digit, zero shows as absent.
.	Number	Yes	Decimal separator or monetary decimal separator.
-	Number	Yes	Minus sign.
,	Number	Yes	Grouping separator.
E	Number	Yes	Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
;	Subpattern boundary	Yes	Separates positive and negative patterns.
%	Prefix or suffix	Yes	Multiply by 100 and show as percentage.
‰ (\u2030)	Prefix or suffix	Yes	Multiply by 1000 and show as per mille.
¤ (\u00A4)	Prefix or suffix	No	Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
'	Prefix or suffix	No	Used to quote special characters in a prefix or suffix. For example, `'#'#` formats `123` as `#123`. To create a single quote itself, use two in a row: `# o''clock`.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation. For example, `0.###E0` formats the number 1234 as `1.234E3`.
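
The symbols above correspond to those of Java's `java.text.DecimalFormat`, so a small Java sketch can be used to try out a format mask before entering it in the Fields tab. This assumes the step interprets masks the same way `DecimalFormat` does. The class and helper names are hypothetical, and the locale is pinned to `Locale.US` so the output does not depend on the machine's settings.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class NumberFormatSketch {
    // Format a value with a pattern, using US symbols ('.' decimal, ',' grouping).
    static String format(String pattern, double value) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US)).format(value);
    }

    // Parse text with explicit Decimal and Group symbols; returns null if parsing fails.
    static Double parse(String pattern, char decimal, char group, String text) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);
        symbols.setDecimalSeparator(decimal);
        symbols.setGroupingSeparator(group);
        try {
            return new DecimalFormat(pattern, symbols).parse(text).doubleValue();
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(format("0.###E0", 1234));    // 1.234E3
        System.out.println(format("'#'#", 123));        // #123
        System.out.println(format("#,##0.00", 10000));  // 10,000.00
        System.out.println(format("0.00%", 0.25));      // 25.00%

        // Decimal = ',' and Group = '.', as in 5.000,00
        System.out.println(parse("#,##0.00", ',', '.', "5.000,00"));  // 5000.0
        // Decimal = '.' and Group = ',', as in 10,000.00
        System.out.println(parse("#,##0.00", '.', ',', "10,000.00")); // 10000.0
    }
}
```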

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter	Date or Time Component	Presentation	Examples
G	Era designator	Text	`AD`
y	Year	Year	`1996` or `96`
M	Month in year	Month	`July`, `Jul`, or `07`
w	Week in year	Number	`27`
W	Week in Month	Number	`2`
D	Day in year	Number	`189`
d	Day in month	Number	`10`
F	Day of week in month	Number	`2`
E	Day in week	Text	`Tuesday` or `Tue`
a	am/pm marker	Text	`PM`
H	Hour in day (0-23)	Number	`0`
k	Hour in day (1-24)	Number	`24`
K	Hour in am/pm (0-11)	Number	`0`
h	Hour in am/pm (1-12)	Number	`12`
m	Minute in hour	Number	`30`
s	Second in minute	Number	`55`
S	Millisecond	Number	`978`
z	Time zone	General time zone	`Pacific Standard Time`, `PST`, or `GMT-08:00`
Z	Time zone	RFC 822 time zone	`-0800`

Additional output fields tab

The Additional output fields tab contains the following options to specify additional information about the file to process.

Option	Description
Short filename field	Specify the field that contains the filename without path information but with an extension.
Extension field	Specify the field that contains the extension of the filename.
Path field	Specify the field that contains the path in operating system format.
Size field	Specify the field that contains the size of the data.
Is hidden field	Specify the field indicating if the file is hidden or not (Boolean).
Last modification field	Specify the field indicating the date of the last time the file was modified.
Uri field	Specify the field that contains the URI.
Root uri field	Specify the field that contains the root URI.

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.
