Hitachi Vantara Lumada and Pentaho Documentation

Text File Input

The Text file input step reads data from a variety of text-file types, including formats generated by spreadsheets and fixed-width flat files. With this step you can read from a list of files or directories, use wildcards in the form of regular expressions, and accept file names passed from previous steps.

General

Enter the following information in the transformation step name field:

  • Step name: Specify the unique name of the Text file input step on the canvas. You can customize the name or leave it as the default.

You can use Preview rows to display the rows generated by this step. The Text file input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.

Options

The Text file input step features several tabs with fields. Each tab is described below.

File tab

Use the File tab to enter the following connection information for your source.

Option | Description
File or directory | Specify the source location if the source is not defined in a field. Click Browse to display the Open File window and navigate to the file or folder. For the supported file system types, see Connecting to Virtual File Systems. Click Add to include the source in the Selected files table. If the source location is defined in a field, use the Accept filenames from previous steps options to specify your file name.
Regular expression | Specify a regular expression to match filenames within a specified directory.
Exclude regular expression | Specify a regular expression to exclude filenames within a specified directory.

Regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

File Name | Regular Expression | Files Selected
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
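
The behavior of expressions like these can be checked outside the step. The following is a minimal sketch using the standard java.util.regex package; it assumes that a file is selected only when its whole name matches the expression, and the class name and sample file names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

class WildcardRegexDemo {
    // Assumption: a file is selected only when its whole name matches the expression.
    static boolean selected(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        List<String> names = List.of(
            "old_userdata_2020.txt", "AAA_report.csv", "B7_export.dat", "notes.txt");
        String[] expressions = { ".*userdata.*\\.txt", "AAA.*", "[A-Z][0-9].*" };

        for (String expression : expressions) {
            for (String name : names) {
                if (selected(expression, name)) {
                    System.out.println(expression + " selects " + name);
                }
            }
        }
    }
}
```

Note that the backslash is doubled only because the expression is written as a Java string literal; in the step dialog you would type \.txt with a single backslash.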

Selected files table

The Selected files table shows files or directories to use as source locations for input. This table is populated by clicking Add after you specify a File or directory. The input step tries to connect to the specified file or directory when you click Add to include it in the table.

The table contains the following columns:

Column | Description
File/Directory | The source location indicated by clicking Add after specifying it in File or directory.
Wildcard (RegExp) | Specify a regular expression to match filenames within a specified directory.
Exclude wildcard | Specify a regular expression to exclude filenames within a specified directory.
Required | Required source location for input.
Include subfolders | Whether subfolders are included within the source location.

Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it to the File or directory option.

Accept file names

You can specify your file name and pass it to the input step, which allows the file name to come from any source, such as a text file or database table.

Option | Description
Accept filenames from previous step | Select to get file names from previous steps.
Pass through fields from previous step | Select to get field information from previous steps.
Step to read file names from | Enter the name of the step from which to read the file names.
Field in the input to use as filename | Enter the name of the incoming field that contains the file name to use.

Show action buttons

When you have entered information in the File tab fields, select an action button if you want to look at the source file names or data content.

Button | Description
Show filename(s) | Select to display the file names of the sources connected to the step.
Show file content | Select to display the raw content of the selected file.
Show content from first data line | Select to display the content from the first data line for the selected file.

Content tab

Use the following options in the Content tab to specify the format of the source files.

Option | Description
Filetype | Select either CSV or Fixed length. Depending on the file type you select, a corresponding interface appears when you click Get Fields in the Fields tab.
Separator | Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert Tab to place a tab in the Separator field. The default value is semicolon (;).
Enclosure | Specify an optional character used to enclose a field if that field contains a separator character. The default value is double quotation marks (").
Allow breaks in enclosed fields | Not implemented.
Escape | Specify one or more characters that mark the following character as part of the regular text. For example, if a backslash (\) is the escape character and a single quote (') is an enclosure or separator character, then the text Not the nine o\'clock news is parsed as Not the nine o'clock news.
Header | Select if your text file has a header row (first lines in the file). You can use Number of header lines to specify the number of times the header line appears.
Footer | Select if your text file has a footer row (last lines in the file). You can use Number of footer lines to specify the number of times the footer row appears.
Wrapped lines | Select if you work with data lines that have wrapped beyond a specific page limit. You can use Number of times wrapped to specify the number of times the line is wrapped. Headers and footers are never considered wrapped.
Paged layout (printout) | Select when the other text handling options (above) fail on a text file designed to be output to a line printer. You can use Document header lines to skip introductory texts and Number of lines per page to position the data lines.
Compression | Select if your text file is in a ZIP or GZip archive. Only the first file in the archive is read.
No empty rows | Select if you do not want to send empty rows to the next steps.
Include filename in output | Select if you want the file name to be part of the output, and use Filename fieldname to enter the name of the field that contains the file name.
Rownum in output | Select if you want the row number to be part of the output. You can use Rownum fieldname to enter the name of the field that contains the row number. Select Rownum by file if you want the row number to be reset for each file.
Format | Select the file format, which can be DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines terminated by carriage returns and line feeds. If you specify mixed, no verification is done.
Encoding | Select the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, the PDI client searches your system for available encodings.
Length | Select how the length of a field is measured: Characters or Bytes.
Limit | Specify a limit on the number of records generated from this step. Specify zero (0) for an unlimited number of records.
Be lenient when parsing dates? | Clear the check box if you want strict parsing of date fields. If selected, dates like Jan 32nd become Feb 1st.
The date format Locale | Specify the locale to use to parse dates written in full, such as February 2nd, 2006. For example, parsing February 2nd, 2006, on a system set to French (fr_FR) would not work because February is called Février in that locale.
Add filenames to result | Select to add file names to a resulting list of file names.
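
The effect of lenient parsing and of the date locale can be illustrated with the standard java.text.SimpleDateFormat class. This is a sketch under stated assumptions, not the step's actual code: the class and method names are hypothetical, and the sample dates are written without the ordinal suffix (2 rather than 2nd) to keep the patterns simple.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class DateParsingDemo {
    // Parses a yyyy-MM-dd date and writes it back out in the same pattern.
    // Returns null when the text is rejected.
    static String normalize(String text, boolean lenient) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        format.setLenient(lenient);
        try {
            return format.format(format.parse(text));
        } catch (ParseException e) {
            return null;
        }
    }

    // Reports whether a date with a spelled-out month parses in the given locale.
    static boolean canParse(String text, Locale locale) {
        SimpleDateFormat format = new SimpleDateFormat("MMMM d, yyyy", locale);
        format.setLenient(false);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Lenient parsing rolls January 32nd over to February 1st.
        System.out.println(normalize("2006-01-32", true));
        // Strict parsing rejects the same text.
        System.out.println(normalize("2006-01-32", false));
        // The English month name is only recognized in an English locale.
        System.out.println(canParse("February 2, 2006", Locale.US));
        System.out.println(canParse("February 2, 2006", Locale.FRANCE));
    }
}
```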

Error Handling tab

In the Error Handling tab, you can specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, wrong number of fields, and premature line ends. The following table contains options for error handling:

Option | Description
Ignore errors? | Select if you want to ignore errors during parsing.
Skip error files? | Select if you want to skip those files that contain errors. You can generate a file that contains a listing of files where the errors occur. Otherwise, files with errors are not skipped, and the fields that have parsing errors are empty (null).
Error file field name | Specify a field name if you want to add a field containing the name of the file in which errors occurred.
File error message field name | Specify a field name if you want to add a field containing the error message for the file in which errors occurred.
Skip error lines? | Select if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers where the errors occur. Otherwise, lines with errors are not skipped, and the fields that have parsing errors are empty (null).
Error count fieldname | Specify the field name if you want to add a field containing the number of errors on the line to the output rows.
Error fields fieldname | Specify the field name if you want to add a field containing the names of fields where errors occurred to the output rows.
Error text fieldname | Specify the field name if you want to add a field containing descriptions of the parsing error occurrences to the output rows.
Warning files directory | Specify the location of the directory where warnings are placed if they are generated. The name of the resulting file is <warning dir>/filename.<date_time>.<warning extension>.
Error files directory | Specify the location of the directory where errors are placed if they occur. The name of the resulting file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
Failing line numbers files directory | Specify the location of the directory where parsing errors on a line are placed if they occur. The name of the resulting file is <errorline dir>/filename.<date_time>.<errorline extension>.

Filters tab

The Filters tab contains a table in which you specify the lines you want to skip in the text file.

Column | Description
Filter string | The string that you want to search for.
Filter position | The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero (0), the filter string is searched for in the entire line.
Stop on filter | Enter Y if you want to stop processing the current text file when the filter string is encountered. Enter N to continue processing after encountering the string.
Positive match | Enter Y if you want to process lines that match the filter string. Enter N to ignore matching lines.
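
One way to read the Filter position and Positive match columns is sketched below. This is an interpretation of the descriptions above, not the step's implementation: the class is hypothetical, it models a single filter row, it assumes that with Positive match set to Y only matching lines are processed, and it leaves Stop on filter out.

```java
class LineFilterSketch {
    private final String filterString;
    private final int position;          // below zero means "anywhere in the line"
    private final boolean positiveMatch; // Y = process matching lines, N = ignore them

    LineFilterSketch(String filterString, int position, boolean positiveMatch) {
        this.filterString = filterString;
        this.position = position;
        this.positiveMatch = positiveMatch;
    }

    // True when the filter string is found where the filter expects it.
    boolean matches(String line) {
        if (position < 0) {
            return line.contains(filterString);
        }
        // startsWith with an offset returns false when the offset is past the end.
        return line.startsWith(filterString, position);
    }

    // Assumed reading: a line is processed when "matches" agrees with Positive match.
    boolean keep(String line) {
        return matches(line) == positiveMatch;
    }

    public static void main(String[] args) {
        LineFilterSketch skipComments = new LineFilterSketch("#", 0, false);
        for (String line : new String[] { "# generated file", "a;b;c", "a;#;c" }) {
            System.out.println(line + " -> " + (skipComments.keep(line) ? "process" : "skip"));
        }
    }
}
```

With position 0 only lines that begin with # are skipped; changing the position to a negative value would also skip the line that contains # in its second field.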

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

Option | Description
Name | Name of the field.
Type | Type of the field, such as String, Date, or Number.
Format | See Number formats and Date formats for a complete description of format symbols.
Position | The position is needed when processing the Fixed filetype. It is zero-based, so the first character is at position 0.
Length | Depends on the field type. Number: total number of significant figures in the number. String: total length of the string. Date: total length of the printed output of the string; for example, 4 returns only the year.
Precision | Depends on the field type. Number: number of floating point digits. String, Date, Boolean: unused.
Currency | Used to interpret numbers such as $10,000.00 or €5.000,00.
Decimal | A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.
Group | A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
Null if | Treat this value as null.
Default | Default value in case the field in the text file was not specified (empty).
Trim type | Trim the field value before processing. You can specify one of the following options: None, Left, Right, or Both.
Repeat | If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.
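
The interaction between the Decimal and Group settings can be sketched with the standard java.text.DecimalFormat class. The class and method names below are hypothetical, and the pattern #,##0.00 is only an assumed example of a Format value.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class SeparatorDemo {
    // Parses text using the given decimal and grouping characters.
    static double parse(String text, char decimal, char group) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimal);
        symbols.setGroupingSeparator(group);
        DecimalFormat format = new DecimalFormat("#,##0.00", symbols);
        try {
            return format.parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Period as decimal, comma as grouping.
        System.out.println(parse("10,000.00", '.', ','));
        // Comma as decimal, period as grouping.
        System.out.println(parse("5.000,00", ',', '.'));
    }
}
```

The same text gives a different number when the two characters are swapped, which is why both settings need to agree with the source file.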

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol | Location | Localized | Meaning
0 | Number | Yes | Digit.
# | Number | Yes | Digit, zero shows as absent.
. | Number | Yes | Decimal separator or monetary decimal separator.
- | Number | Yes | Minus sign.
, | Number | Yes | Grouping separator.
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Subpattern boundary | Yes | Separates positive and negative patterns.
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage.
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille.
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
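
These symbols follow the pattern syntax of the standard java.text.DecimalFormat class, so a pattern can be tried out in a few lines of Java. The sketch below is illustrative only; the class and method names are hypothetical, and US symbols are fixed so that the result does not depend on the system locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class NumberPatternDemo {
    // Formats a value with the given pattern using US separators.
    static String format(String pattern, double value) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.US);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format("0.###E0", 1234));       // scientific notation: 1.234E3
        System.out.println(format("'#'#", 123));           // quoted prefix: #123
        System.out.println(format("#,##0.00", 10000));     // grouping and decimals: 10,000.00
        System.out.println(format("#%", 0.25));            // percentage: 25%
        System.out.println(format("0.00;(0.00)", -3.5));   // negative subpattern: (3.50)
    }
}
```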

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
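
The letters above match the pattern letters of the standard java.text.SimpleDateFormat class, so a date mask can be tried out as follows. This is a small sketch with a hypothetical class name; the locale and time zone are fixed to US English and UTC so that the output is reproducible, and the sample instant is simply the start of the Unix epoch.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DatePatternDemo {
    // Formats a date with the given pattern in US English and UTC.
    static String format(String pattern, Date date) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01 00:00:00 UTC, a Thursday

        System.out.println(format("yyyy-MM-dd HH:mm:ss", epoch)); // 1970-01-01 00:00:00
        System.out.println(format("EEE, MMM d, yyyy", epoch));    // Thu, Jan 1, 1970
        System.out.println(format("kk:mm", epoch));               // 24:00 (k counts hours 1-24)
        System.out.println(format("Z", epoch));                   // +0000
    }
}
```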

Additional output fields tab

The Additional output fields tab contains the following options to specify additional information about the file to process.

Option | Description
Short filename field | Specify the field that contains the filename without path information but with an extension.
Extension field | Specify the field that contains the extension of the filename.
Path field | Specify the field that contains the path in operating system format.
Size field | Specify the field that contains the size of the data.
Is hidden field | Specify the field indicating if the file is hidden or not (Boolean).
Last modification field | Specify the field indicating the date of the last time the file was modified.
Uri field | Specify the field that contains the URI.
Root uri field | Specify the field that contains the root URI.

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.