Pentaho Data Integration performance tips
To substantially increase performance in Pentaho Repository transactions, we recommend upgrading to the latest version of Pentaho Data Integration (PDI). Besides upgrading, here are some tips and tricks to improve PDI performance. Most tips involve streamlining jobs and transformations. The following tips may help you to identify and correct performance-related issues associated with PDI transformations:
versions is, in most instances, easy to do and makes scripts easier to work with and
that the step will process like it did in a previous version. You may see a small
performance drop because of the overload associated with forcing compatibility. If you
want make use of the new architecture, disable compatibility mode and change the code
as shown below:|
|Not applicable||Launch several copies of a step||There are two important reasons why launching multiple copies of a step may
result in better performance:|
|Not applicable||Manage thread priorities||This feature that is found in the Transformation Settings dialog box under the Misc tab improves performance by reducing the locking overhead in certain situations. This feature is enabled by default for new transformations that are created in recent versions, but for older transformations this can be different.|
|Select Value||If possible, don't remove fields in Select Value||Don't remove fields in Select Value unless you must. It's a CPU-intensive task as the engine needs to reconstruct the complete row. It is almost always faster to add fields to a row rather than delete fields from a row.|
|Get Variables||Watch your use of Get Variables||May cause bottlenecks if you use it in a high-volume stream (accepting input). To solve the problem, take the Get Variables step out of the transformation (right click, detach) then insert it in with a Join Rows (cart prod) step. Make sure to specify the main step from which to read in the Join Rows step. Set it to the step that originally provided the Get Variables step with data.|
|Not applicable||Use new text file input||The new CSV Input or Fixed Input steps provide optimal performance. If you have a fixed width (field/row) input file, you can even read data in parallel. (multiple copies) These new steps have been rewritten using Non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.|
|Not applicable||When appropriate, use lazy conversion||In instances in which you are reading data from a text file and you write the data back to a text file, use Lazy conversion to speed up the process. The principle behind lazy conversion that it delays data conversion in hopes that it isn't necessary (reading from a file and writing it back comes to mind). Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form. This, in turn, helps the internal Kettle engine to perform faster data serialization (sort, clustering, and so on). The Lazy Conversion option is available in the CSV Input and Fixed input text file reading steps.|
|Join Rows||Use Join Rows||You need to specify the main step from which to read. This prevents the step from performing any unnecessary spooling to disk. If you are joining with a set of data that can fit into memory, make sure that the cache size (in rows of data) is large enough. This prevents (slow) spooling to disk.|
|Not applicable||Review the big picture: database, commit size, row set size and other factors||Consider how the whole environment influences performance. There can be limiting factors in the transformation itself and limiting factors that result from other applications and PDI. Performance depends on your database, your tables, indexes, the JDBC driver, your hardware, speed of the LAN connection to the database, the row size of data and your transformation itself. Test performance using different commit sizes and changing the number of rows in row sets in your transformation settings. Change buffer sizes in your JDBC drivers or database.|
|Not applicable||Step Performance Monitoring||You can track the performance of individual steps in a transformation. Step Performance Monitoring is an important tool that allows you identify the slowest step in your transformation.|
Limit in-memory log output
In Spoon, the following parameters control logging:
the maximum number of log lines that are kept internally by Kettle. Setting this to 0 (the default) forces PDI to keep all rows.
the maximum age (in minutes) that a log line should be kept internally by PDI. Setting this to 0 (the default) keeps all rows indefinitely.
the maximum number of job trackers kept in memory. Default value is: 1000.
the maximum number of job entry results kept in memory for logging purposes. Default value is: 1000.
the maximum number of logging registry entries kept in memory for logging purposes. Default value is: 1000.
The equivalent parameters to the first two variables, which can be set on each KTR or KJB individually using Kitchen or Pan, are:
HBase Output performance considerations
The HBase Output step's Configure connection tab provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (defined in the hbase-default.xml file) is: 2MB When left blank, the buffer is 2MB, auto flush is enabled, and Put operations are executed immediately. This means that each row will be transmitted to HBase as soon as it arrives at the step. Entering a number (even if it is the same as the default) for the size of the write buffer will disable auto flush and will result in incoming rows only being transferred once the buffer is full.
There is also a checkbox for Disable write to WAL that disables writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. However, the tradeoff for error-recovery is speed.
The Create/Edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters for specifying what kind of compression to use, and whether or not to use Bloom filters to speed up lookups. The options for compression are: NONE, GZ and LZO; the options for Bloom filters are: NONE, ROW, and ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called "NewTable" should be created with GZ compression and ROWCOL Bloom filters: