Skip to main content

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

CSV File Input

Parent article

The CSV File Input step reads data from delimited text files into a PDI transformation. While this step is called CSV File Input, you can also use CSV File Input with many other separator types, such as pipes, tabs, and semicolons.

NoteThe semicolon (;) is set as the default separator type for this step.
NoteThe options for this step are a subset of the Text File Input step. This step differs from the Text File Input step in the following ways:
  • NIO

    Non-blocking I/O is used for native system calls to read the file faster, but is limited to local files. It does not support VFS.

  • Parallel Running

    If you configure this step to run in multiple copies (or in a clustered mode) and you enable parallel running, each copy will read a separate block of a single file. You can distribute the reading of a file to several threads or even several slave nodes in a clustered transformation.

  • Lazy Conversion

    If you are reading many fields from a file and many of those fields will not be manipulated but merely passed through the transformation to land in some other text file or a database, lazy conversion can prevent PDI from performing unnecessary work on those fields (such as converting them into objects like strings, dates, or numbers).

An example of a simple CSV input transformation (CSV Input - Reading customer data.ktr) can be found in the data-integration/samples/transformations directory.

Options

CSV Input step

The CSV File Input step has the following options:

OptionDescription
Step nameSpecify the unique name of the CSV File Input step on the canvas. You can customize the name or leave it as the default.
FilenameSpecify the name of the input CSV file or navigate to the input file by clicking Browse.

If your source is from a previous step, the Browse button is hidden. Use the drop-down menu in the text box to select the field to use as the name or names of your CSV file(s).

Include the filename in the output? (Only appears if your source is from a previous step)If your source is from a previous step, select if you want the name of the input source file included in the output.
DelimiterSpecify the file delimiter character used in the source file. Special characters (for example, CHAR HEX01) can be set with the format $[value]. For example, $[01] or $[6F,FF,00,1F].

The default delimiter for the CSV File Input step is a semicolon ;.

EnclosureSpecify the enclosure character used in the source file. Special characters (for example, CHAR HEX01) can be set with the format $[value], such as $[01] or $[6F,FF,00,1F].
NIO buffer sizeSpecify the size of the read buffer, the number of bytes that is read at one time from the source.
Lazy conversion?Indicate if the lazy conversion algorithm may be used to improve performance. The lazy conversion algorithm tries to avoid unnecessary data type conversions if possible. It can result in significant performance improvements. The typical example is reading from a text file and writing back to a text file.
Header row present?Indicate if the source file contains a header row containing column names.
Add filename to resultAdds the CSV source filename(s) to the result of this transformation.
The row number field name (optional)Specify the name of the field that will contain the row number in the output of this step.
Running in parallel?Indicate if you will have multiple instances of this step running (step copies) and if you want each instance to read a separate part of the CSV file(s).

When reading multiple files, the total size of all files is taken into consideration to split the workload. In that specific case, make sure that ALL step copies receive all files that need to be read, otherwise, the parallel algorithm will not work correctly.

Caution: For technical reasons, parallel reading of CSV files is only supported on files that do not have fields with line breaks or carriage returns in them.

New line possible in fields?Indicate if data fields may contain new line characters.
FormatSelect the file format, which can be either DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done
File encodingSpecify the encoding of the source file.

Fields

You can specify what fields to read from your CSV file through the Fields table. Click Get fields to have the step populate the table with fields derived from the source file based on the current specified settings (such as Delimiter or Enclosure). All fields identified by this step are added to the table.

The table contains the following columns:

ColumnDescription
NameName of the field.
TypeType of field (either String, Date, or Number).
FormatAn optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this step.
Length

The length of the field depends on the following field types:

  • Number: Total number of significant figures in a number.
  • String: Total length of string.
  • Date: Length of printed output of the string (for example, four is a length for a year).
PrecisionNumber of floating point digits for number-type fields.
CurrencySymbol used to represent currencies ($5,000.00 or €5.000,00 for example).
DecimalA decimal point can be a "." or "," (5,000.00 or 5.000,00 for example).
GroupA grouping can be a "," or "." (5,000.00 or 5.000,00 for example).
Trim TypeThe trimming method to apply to a string.

Click Preview to view the data coming from the source file.

Metadata injection support

You can use the metadata injection supported fields with ETL metadata injection step to pass metadata to your transformation at runtime. The following Option and Value fields of the CSV File Input step support metadata injection:

  • Options: Filename, Delimiter, Enclosure, NIO Buffer Size, Lazy Conversion, Header Row Present?, Add Filename to Result, The Row Number Field Name, Running in Parallel?, and File Encoding.
  • Values: Name, Length, Decimal, Type, Precision, Group, Format, Currency, and Trim Type.