Skip to main content

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

Microsoft Excel Input

 

Parent article

You can use the Microsoft Excel Input step to read data from Microsoft Excel. The following sections describe each of the available features for configuring this step.

The default spreadsheet type is set to Excel 97-2003 XLS. When you are reading other file types, such as OpenOffice ODS or Excel 2007, and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Files tab accordingly. If you are using password protected worksheets, you must set Spread sheet type (engine) to Excel 2007 XLSX (Apache POI).

General

 

Enter the following information in the transformation step name field:

  • Step name: Specifies the unique name of the Microsoft Excel Input transformation step on the canvas. You can customize the name or leave it as the default.

You can use Preview rows to display the rows generated by this step. The Microsoft Excel Input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.

Options

 

The Microsoft Excel Input step features several tabs with fields. Each tab is described below.

Files tab

 
Microsoft Excel Input Files Tab dialog

Use the Files tab to enter the following options to define the location of the Microsoft Excel source files:

Option Description
Spread sheet type (engine)

Select the spreadsheet type. The following types are supported:

  • Excel 97-2003 XLS: This engine is the default, backward compatible type provided for by the JXL software backend.
  • Excel 2007 XLSX (Apache POI): If you select this spread sheet type you can read all known Excel file types. Functions are provided by the Apache POI project.
    NoteIf you are using password protected worksheets, you must set Spread sheet type (engine) to Excel 2007 XLSX (Apache POI).
  • Excel 2007 XLSX (Apache POI Streaming): This spread sheet type allows you to read large Excel files.
  • Open Office ODS: By selecting this type you can read OpenOffice spreadsheet using the ODFDOM engine.
File or directory Specify the source location if the source is not defined in a field. Click Browse to navigate to your source file or directory. Click Add to include the source in the Selected Files table.
Regular expression Specify a regular expression to match filenames within a specified directory.
Exclude regular expression Specify a regular expression to exclude filenames within a specified directory.
Password Specify the password to open the Excel file when password protection is set.
Accept filenames from previous steps Select the previous step that contains file names and the input field for reading in your data.

Selected files table

 

The Selected files table shows files or directories to use as source locations for input. This table is populated by specifying File or directory, then by clicking Add. The Microsoft Excel Input step tries to connect to the specified file or directory when you click Add to include it into the table.

The table contains the following columns:

Column Description
File/Directory The source location indicated by clicking Add
Wildcard (RegExp) Wildcards as specified in Regular expression
Exclude wildcard Excluded wildcards as specified in Exclude regular expression
Required Required source location for input
Include subfolders Whether subfolders are included within the source location

Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it back to the File or directory option.

Use Show filename(s) to display the file names of the sources successfully connected to the Microsoft Excel Input step.

Sheets tab

 
Microsoft Excel Input Sheets Tab Dialog

Use the table in the Sheets tab to specify which worksheets and grid locations for reading data from the Microsoft Excel source files.

The table contains the following columns:

Column Description
Sheet name The name of the sheet in the Excel workbook to read
Start row The starting row in the sheet to read. The row numbers are zero-based (start at the number 0).
Start column The starting column in the sheet to read. The column numbers are zero-based (start at the number 0).

You can also read all the sheets in a workbook by clearing the table and typing only the start row and column in the first row, which will be used for all sheets. To read all the sheets in a workbook, do not to specify any sheet name (leave Sheet name blank). For this case, the field structure of each sheet needs to be the same.

Click Get sheetname(s) to fill out the table with all the sheets from your source specified by File or directory in the Files tab.

Content tab

 
Microsoft Excel Input Content Tab

Use Content tab to configure which data values to retrieve. The following options are available:

Option Description
Header Select if the sheets specified in the Sheets tab contain a header row to skip.
No empty rows Select if you do not want empty rows to appear in the output of this step.
Stop on empty rows Select to stop reading the current sheet of a file when a empty line is encountered.
Limit Specify a limit on the number of records generated from this step. Results are not limited when set to zero.
Encoding Specify which text file encoding to use. Leave this option blank to use the default system encoding. On first use, PDI searches your system for available encodings. To use Unicode, specify UTF-8 or UTF-16.

Error Handling tab

 
Microsoft Excel Input Error Handling Tab

The Error Handling tab allows you to configure the following properties:

Option Description
Strict types? Select to have PDI report data type errors while reading.
Ignore errors? Select if you want to ignore errors during parsing. These lines can be dumped to a separate file by specifying a path in Warnings file directory, Error files directory, and Failing line numbers files directory. Clear this option to have lines with errors appear as NULL values in the output of this step.
Skip error lines? Select to have PDI skip lines that contain errors.
Warnings file directory Specify the location of the directory where warnings are placed if they are generated. The name of the resulting file is <warning dir>/filename.<date_time>.<warning extension>.
Error files directory Specify the location of the directory where errors are placed if they occur. The name of the resulting file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
Failing line numbers files directory Specify the location of the directory where parsing errors on a line are placed if they occur. The name of the resulting file is <errorline dir>/filename.<date_time>.<errorline extension>.

Fields tab

 
Microsoft Excel Input Additional Output Fields Tab

The Fields tab displays field definitions for extracting values from the Microsoft Excel spreadsheet. The table in this tab contain the following columns:

Column Description
Name Name of field that maps to the corresponding field in the Microsoft Excel Input stream
Type Data type of the input field
Length Length of the field
Precision Number of floating point digits for number-type fields
Trim type The trim method to apply to a string
Repeat The corresponding value from the last row repeated if a row is empty
Format An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this step.
Currency Currency symbol ($ or for example)
Decimal A decimal point can be a "." (5,000.00 for example) or "," (5.000,00 for example)
Group A grouping can be a "," (10,000.00 for example) or "." (5.000,00 for example)

Click Get fields from header row to have the step populate the table with fields derived from the source file. All fields identified by this step will be added to the table.

See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.

Additional output fields tab

 
Microsoft Excel Input Additional Output Fields Tab

The Additional output fields tab contains the following options to specify additional information about the file to process:

Option Description
Full filename field Specify the field that contains the full file name plus the extension.
Sheetname field Specify the field that contains the name of worksheet you want to use.
Sheet row nr field Specify the field that contains the current sheet row number you want to use.
Row nr written field Specify the field that contains the number of rows written.
Short filename field Specify the field that contains the filename without path information but with an extension.
Extension field Specify the field that contains the extension of the filename.
Path field Specify the field that contains the path in operating system format.
Size field Specify the field that contains the size of the data.
Is hidden field Specify the field indicating if the file is hidden or not (Boolean).
Uri field Specify the field that contains the URI.
Root uri field Specify the field that contains only the root part of the URI.

Metadata injection support

 

You can use the metadata injection supported fields with the ETL metadata injection step to pass metadata to your transformation at runtime. The following option fields and values in the Microsoft Excel Input step support metadata injection:

Options:

  • File and Directory
  • Regular Expression
  • Exclude Regular Expression
  • Is file Required
  • Include subfolders
  • Spreadsheet type
  • Sheet name
  • Sheet start row
  • Sheet start col

Values:

  • Name
  • Type
  • Length
  • Precision
  • Trim Type
  • Repeat
  • Format
  • Currency
  • Decimal
  • Grouping