Hitachi Vantara Lumada and Pentaho Documentation

Pentaho Data Integration Performance Tips

To substantially increase performance in Pentaho Repository transactions, we recommend upgrading to the latest version of Pentaho Data Integration (PDI). Besides upgrading, here are some tips and tricks to improve PDI performance. Most tips involve streamlining jobs and transformations. The following tips may help you to identify and correct performance-related issues associated with PDI transformations:

Step Tip Description
JavaScript Turn off compatibility mode Rewriting JavaScript to use a format that is not compatible with previous versions is, in most instances, easy to do and makes scripts easier to work with and to read. By default, old JavaScript programs run in compatibility mode, which means that the step processes rows as it did in a previous version. You may see a small performance drop because of the overhead associated with forcing compatibility. If you want to make use of the new architecture, disable compatibility mode and change the code as shown below:
  • intField.getInteger() --> intField
  • numberField.getNumber() --> numberField
  • dateField.getDate() --> dateField
  • bigNumberField.getBigNumber() --> bigNumberField
  • and so on...
Instead of Java methods, use the built-in library. Notice that the resulting program code is more intuitive. For example:
  • Checking for null: field.isNull() --> field==null
  • Converting string to date: field.Clone().str2dat() --> str2date(field)
  • and so on...
If you convert your code as shown above, you may get significant performance benefits.
Note: It is no longer possible to modify data in-place using the value methods. This was a design decision to ensure that no data with the wrong type would end up in the output rows of the step. Instead of modifying fields in-place, create new fields using the table at the bottom of the Modified JavaScript transformation.
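As a minimal sketch of the non-compatible style, the script below treats fields as plain JavaScript values. The field names intField and textField are hypothetical. Inside PDI the step supplies the fields for each row; they are declared here only so that the sketch also runs outside PDI.

```javascript
// Hypothetical input fields. In PDI, the step provides these for every row;
// they are declared here only so the sketch runs on its own.
var intField = 42;
var textField = null;

// With compatibility mode off, a field is used directly as a value
// (no intField.getInteger() call).
var doubled = intField * 2;

// Null check in the new style: field == null instead of field.isNull().
var label = (textField == null) ? "unknown" : textField;
```

In the step itself, doubled and label would be listed as new output fields rather than written back into the input fields, in line with the note above.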
JavaScript Combine steps One large JavaScript step runs faster than three consecutive smaller steps. Combining processes in one larger step helps to reduce overhead.
JavaScript Avoid the JavaScript step or write a custom plug-in Remember that while JavaScript is the fastest scripting language for Java, it is still a scripting language. If you do the same amount of work in a native step or plug-in, you avoid the overhead of the JavaScript scripting engine. This has been known to result in significant performance gains. It is also the primary reason why the Calculator step was created: to avoid the use of JavaScript for simple calculations.
JavaScript Create a copy of a field No JavaScript is required for this; a "Select Values" step does the trick. You can specify the same field twice: once without a rename and once (or more) with a rename. Another trick is to use B=NVL(A,A) in a Calculator step, where B is forced to be a copy of A. An explicit "create copy of field A" function has also been added to the Calculator.
JavaScript Data conversion Consider performing conversions between data types (dates, numeric data, and so on) in a "Select Values" step. You can do this in the Metadata tab of the step.
JavaScript Variable creation If you have variables that can be declared once at the beginning of the transformation, make sure you put them in a separate script and mark that script as a startup script (right click on the script name in the tab). JavaScript object creation is time consuming so if you can avoid creating a new object for every row you are transforming, this will translate to a performance boost for the step.
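The difference between creating an object once and creating it for every row can be sketched in plain JavaScript. This is an illustration that runs outside PDI, not PDI's own API: buildLookup, transformRow, and the sample rows are hypothetical names, and the top-level statements stand in for a startup script while transformRow stands in for the per-row script.

```javascript
// Counts how often the expensive object is built (for illustration only).
var creations = 0;

// Hypothetical expensive object: a lookup table of country codes.
function buildLookup() {
  creations += 1;
  return { DE: "Germany", FR: "France", US: "United States" };
}

// "Startup script" part: runs once, before the first row is processed.
var lookup = buildLookup();

// "Per-row script" part: only reuses the object created above.
function transformRow(row) {
  var known = Object.prototype.hasOwnProperty.call(lookup, row.code);
  return { code: row.code, country: known ? lookup[row.code] : "unknown" };
}

// Hypothetical input rows.
var rows = [{ code: "DE" }, { code: "US" }, { code: "XX" }];
var output = [];
for (var i = 0; i < rows.length; i++) {
  output.push(transformRow(rows[i]));
}
```

Had buildLookup() been called inside transformRow, the table would be rebuilt for every row instead of once.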
Not Applicable Launch several copies of a step There are two important reasons why launching multiple copies of a step may result in better performance:
  1. The step uses a lot of CPU resources and you have multiple processor cores in your computer. Example: a JavaScript step
  2. The step waits on network round trips, and launching multiple copies of the step can reduce the average latency per row. If you have a low network latency of, say, 5 ms and each row needs one round trip to the database, a single copy of the step can process at most 200 rows per second (1000 ms / 5 ms), even if the database is running smoothly. You can try to reduce the number of round trips with caching; if that is not possible, try running multiple copies. Example: a database lookup or table output
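The arithmetic behind the 5 ms example can be written out as a short calculation. This is a rough upper bound, assuming one round trip per row and copies that do not slow each other down; the function name maxRowsPerSecond is hypothetical.

```javascript
// Upper bound on rows per second when every row needs one network round trip.
// latencyMs: round-trip time in milliseconds.
// copies: number of step copies running in parallel (assumed independent).
function maxRowsPerSecond(latencyMs, copies) {
  return copies * (1000 / latencyMs);
}

var singleCopy = maxRowsPerSecond(5, 1); // 200 rows per second
var fiveCopies = maxRowsPerSecond(5, 5); // 1000 rows per second
```

In practice the database, the network, and the available CPU cores limit how far this scales, so the right number of copies has to be found by testing.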
Not Applicable Manage thread priorities This feature, found on the Misc tab of the "Transformation Settings" dialog box, improves performance by reducing the locking overhead in certain situations. It is enabled by default for new transformations created in recent versions, but it may not be enabled for older transformations.
Select Value If possible, don't remove fields in Select Value Don't remove fields in Select Value unless you must. It's a CPU-intensive task as the engine needs to reconstruct the complete row. It is almost always faster to add fields to a row rather than delete fields from a row.
Get Variables Watch your use of Get Variables The "Get Variables" step may cause bottlenecks if you use it in a high-volume stream (that is, when it accepts input). To solve the problem, take the "Get Variables" step out of the transformation (right-click, detach), then insert it back in with a "Join Rows (cart prod)" step. Make sure to specify the main step from which to read in the "Join Rows" step. Set it to the step that originally provided the "Get Variables" step with data.
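Why a Cartesian join works here can be sketched in plain JavaScript: joining the main stream with a single row of variable values adds the variable fields to every row without multiplying the row count. This is an illustration only, not how the PDI step is implemented; joinRows, the field names, and the sample values are all hypothetical.

```javascript
// Joins every row of the main stream with every row of the other stream
// (a Cartesian product, that is, a join without a condition).
function joinRows(mainRows, otherRows) {
  var result = [];
  for (var i = 0; i < mainRows.length; i++) {
    for (var j = 0; j < otherRows.length; j++) {
      var merged = {};
      var key;
      for (key in mainRows[i]) { merged[key] = mainRows[i][key]; }
      for (key in otherRows[j]) { merged[key] = otherRows[j][key]; }
      result.push(merged);
    }
  }
  return result;
}

// Hypothetical main stream and a single row holding the variable values.
var mainStream = [{ id: 1 }, { id: 2 }, { id: 3 }];
var variableRow = [{ targetDir: "/tmp/out" }];

// Three rows in, three rows out, each now carrying targetDir as well.
var joined = joinRows(mainStream, variableRow);
```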
Not Applicable Use new text file input The new "CSV Input" or "Fixed Input" steps provide optimal performance. If you have a fixed-width (field/row) input file, you can even read data in parallel by running multiple copies of the step. These new steps have been rewritten using non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.
Not Applicable When appropriate, use lazy conversion When you read data from a text file and write the data back to a text file, use lazy conversion to speed up the process. The principle behind lazy conversion is that it delays data conversion in the hope that it isn't necessary (reading from a file and writing it back comes to mind). Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form. This, in turn, helps the internal Kettle engine to perform faster data serialization (sort, clustering, and so on). The Lazy Conversion option is available in the "CSV Input" and "Fixed Input" text file reading steps.
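The principle can be illustrated with a small plain-JavaScript sketch: each field keeps its raw text and is converted to a number only when a caller actually asks for the number. This is not the Kettle engine's implementation; LazyNumber and the sample line are hypothetical.

```javascript
// Counts conversions so the effect of laziness is visible.
var conversions = 0;

// Wraps a raw text value and converts it to a number only on first use.
function LazyNumber(rawText) {
  this.raw = rawText;
  this.converted = false;
  this.value = undefined;
}

LazyNumber.prototype.get = function () {
  if (!this.converted) {
    this.value = parseFloat(this.raw);
    this.converted = true;
    conversions += 1;
  }
  return this.value;
};

// Read three fields from one line of a text file.
var line = "12.5;7;300";
var fields = line.split(";").map(function (text) {
  return new LazyNumber(text);
});

// Writing the fields straight back needs no conversion at all.
var written = fields.map(function (field) {
  return field.raw;
}).join(";");

// Only the first field is needed as a number, so only it gets converted.
var total = fields[0].get() + 1;
```

The two untouched fields are never parsed, which is the saving lazy conversion aims for when data passes through unchanged.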
Join Rows Use Join Rows You need to specify the main step from which to read. This prevents the step from performing any unnecessary spooling to disk. If you are joining with a set of data that can fit into memory, make sure that the cache size (in rows of data) is large enough. This prevents (slow) spooling to disk.
Not Applicable Review the big picture: database, commit size, row set size and other factors Consider how the whole environment influences performance. There can be limiting factors in the transformation itself and limiting factors that result from other applications and PDI. Performance depends on your database, your tables, indexes, the JDBC driver, your hardware, speed of the LAN connection to the database, the row size of data and your transformation itself. Test performance using different commit sizes and changing the number of rows in row sets in your transformation settings. Change buffer sizes in your JDBC drivers or database.
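When you compare commit sizes, it helps to know how many commits each setting implies for a given load. The calculation below is only that arithmetic, with hypothetical numbers and a hypothetical function name; it does not predict run time, which you still need to measure against your own database.

```javascript
// Number of commits needed to write rowCount rows with a given commit size.
function commitCount(rowCount, commitSize) {
  return Math.ceil(rowCount / commitSize);
}

// Hypothetical load of 100,000 rows with three candidate commit sizes.
var rowCount = 100000;
var candidates = [100, 1000, 10000];
var commits = [];
for (var i = 0; i < candidates.length; i++) {
  commits.push(commitCount(rowCount, candidates[i]));
}
// commits is [1000, 100, 10]
```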
Not Applicable Step Performance Monitoring You can track the performance of individual steps in a transformation. Step Performance Monitoring is an important tool that allows you to identify the slowest step in your transformation.