Pentaho Data Integration performance tips
To substantially increase performance in Pentaho Repository transactions, we recommend upgrading to the latest version of Pentaho Data Integration (PDI). Besides upgrading, here are some tips and tricks to improve PDI performance. Most tips involve streamlining jobs and transformations. The following tips may help you to identify and correct performance-related issues associated with PDI transformations:
Step | Tip | Description |
JavaScript | Turn off compatibility mode | Rewriting JavaScript to use a format that is not compatible with previous
versions is, in most instances, easy to do and makes scripts easier to work with and
to read. By default, old JavaScript programs run in compatibility mode. That means
that the step will process like it did in a previous version. You may see a small
performance drop because of the overload associated with forcing compatibility. If you
want make use of the new architecture, disable compatibility mode and change the code
as shown below:
NoteIt is no longer possible to modify data in-place using the value
methods. This was a design decision to ensure that no data with the wrong type would
end up in the output rows of the step. Instead of modifying fields in-place, create
new fields using the table at the bottom of the Modified JavaScript
transformation. |
JavaScript | Combine steps | One large JavaScript step runs faster than three consecutive smaller steps. Combining processes in one larger step helps to reduce overhead. |
JavaScript | Avoid the JavaScript step or write a custom plugin | Remember that while JavaScript is the fastest scripting language for Java, it is still a scripting language. If you do the same amount of work in a native step or plugin, you avoid the overhead of the JS scripting engine. This has been known to result in significant performance gains. It is also the primary reason why the Calculator step was created — to avoid the use of JavaScript for simple calculations. |
JavaScript | Create a copy of a field | No JavaScript is required for this; a Select Values step does the trick. You can specify the same field twice. Once without a rename, once (or more) with a rename. Another trick is to use B=NVL(A,A) in a Calculator step where B is forced to be a copy of A. An explicit "create copy of field A" function has been added to the Calculator. |
JavaScript | Data conversion | Consider performing conversions between data types (dates, numeric data, and so on) in a Select Values step. You can do this in the Metadata tab of the step. |
JavaScript | Variable creation | If you have variables that can be declared once at the beginning of the transformation, make sure you put them in a separate script and mark that script as a startup script (right click on the script name in the tab). JavaScript object creation is time consuming so if you can avoid creating a new object for every row you are transforming, this will translate to a performance boost for the step. |
Not applicable | Launch several copies of a step | There are two important reasons why launching multiple copies of a step may
result in better performance:
|
Not applicable | Manage thread priorities | This feature that is found in the Transformation Settings dialog box under the Misc tab improves performance by reducing the locking overhead in certain situations. This feature is enabled by default for new transformations that are created in recent versions, but for older transformations this can be different. |
Select Value | If possible, don't remove fields in Select Value | Don't remove fields in Select Value unless you must. It's a CPU-intensive task as the engine needs to reconstruct the complete row. It is almost always faster to add fields to a row rather than delete fields from a row. |
Get Variables | Watch your use of Get Variables | May cause bottlenecks if you use it in a high-volume stream (accepting input). To solve the problem, take the Get Variables step out of the transformation (right click, detach) then insert it in with a Join Rows (cart prod) step. Make sure to specify the main step from which to read in the Join Rows step. Set it to the step that originally provided the Get Variables step with data. |
Not applicable | Use new text file input | The new CSV Input or Fixed Input steps provide optimal performance. If you have a fixed width (field/row) input file, you can even read data in parallel. (multiple copies) These new steps have been rewritten using Non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be. |
Not applicable | When appropriate, use lazy conversion | In instances in which you are reading data from a text file and you write the data back to a text file, use Lazy conversion to speed up the process. The principle behind lazy conversion that it delays data conversion in hopes that it isn't necessary (reading from a file and writing it back comes to mind). Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form. This, in turn, helps the internal Kettle engine to perform faster data serialization (sort, clustering, and so on). The Lazy Conversion option is available in the CSV Input and Fixed input text file reading steps. |
Join Rows | Use Join Rows | You need to specify the main step from which to read. This prevents the step from performing any unnecessary spooling to disk. If you are joining with a set of data that can fit into memory, make sure that the cache size (in rows of data) is large enough. This prevents (slow) spooling to disk. |
Not applicable | Review the big picture: database, commit size, row set size and other factors | Consider how the whole environment influences performance. There can be limiting factors in the transformation itself and limiting factors that result from other applications and PDI. Performance depends on your database, your tables, indexes, the JDBC driver, your hardware, speed of the LAN connection to the database, the row size of data and your transformation itself. Test performance using different commit sizes and changing the number of rows in row sets in your transformation settings. Change buffer sizes in your JDBC drivers or database. |
Not applicable | Step Performance Monitoring | You can track the performance of individual steps in a transformation. Step Performance Monitoring is an important tool that allows you identify the slowest step in your transformation. |
Limit in-memory log output
In Spoon, the following parameters control logging:
KETTLE_MAX_LOG_SIZE_IN_LINES
the maximum number of log lines that are kept internally by Kettle. Setting this to 0 (the default) forces PDI to keep all rows.
KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES
the maximum age (in minutes) that a log line should be kept internally by PDI. Setting this to 0 (the default) keeps all rows indefinitely.
KETTLE_MAX_JOB_TRACKER_SIZE
the maximum number of job trackers kept in memory. Default value is: 1000.
KETTLE_MAX_JOB_ENTRIES_LOGGED
the maximum number of job entry results kept in memory for logging purposes. Default value is: 1000.
KETTLE_MAX_LOGGING_REGISTRY_SIZE
the maximum number of logging registry entries kept in memory for logging purposes. Default value is: 1000.
The equivalent parameters to the first two variables, which can be set on each KTR or KJB individually using Kitchen or Pan, are:
- maxloglines
- maxlogtimeout
HBase Output performance considerations
The HBase Output step's Configure connection tab provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (defined in the hbase-default.xml file) is: 2MB When left blank, the buffer is 2MB, auto flush is enabled, and Put operations are executed immediately. This means that each row will be transmitted to HBase as soon as it arrives at the step. Entering a number (even if it is the same as the default) for the size of the write buffer will disable auto flush and will result in incoming rows only being transferred once the buffer is full.
There is also a checkbox for Disable write to WAL that disables writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. However, the tradeoff for error-recovery is speed.
The Create/Edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters for specifying what kind of compression to use, and whether or not to use Bloom filters to speed up lookups. The options for compression are: NONE, GZ and LZO; the options for Bloom filters are: NONE, ROW, and ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called "NewTable" should be created with GZ compression and ROWCOL Bloom filters:
NewTable@GZ@ROWCOL