What's new in Pentaho 9.0

The Pentaho 9.0 Enterprise Edition delivers a variety of features and enhancements, including access to multiple Hadoop clusters and vendor versions, step-level Spark tuning, and Copybook transformation steps. Pentaho 9.0 also adds a Pentaho Server Upgrade Installer, VFS connection enhancements, and minor Data Integration and Business Analytics improvements.

Access to multiple Hadoop clusters from different vendors in PDI

You can now access and process data from multiple Hadoop clusters from different vendors, including different versions, through drivers and named connections within a single session of PDI. Previously, a PDI session could access only a single Hadoop cluster through a shim. Additionally, cluster configuration features a simplified user interface for an improved setup experience.

A named connection is a set of connection details, such as the IP address and port number of the Hadoop cluster, stored under an assigned name for later use. Before you create a named connection, you must add drivers for your Hadoop clusters. See Connecting to a Hadoop cluster with the PDI client to learn more about creating a named connection.
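As a rough illustration of the idea, the sketch below models named connections as records stored and looked up by name. It is not PDI's implementation: the class names, the `driver` field, and the host, port, and driver values are all hypothetical and chosen only to show two clusters from different vendors coexisting in one session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedConnection:
    """Hypothetical stand-in for the details a named connection stores."""
    name: str
    host: str
    port: int
    driver: str  # hypothetical label for the vendor/version driver


class ConnectionRegistry:
    """Stores connections under their assigned names for later lookup."""

    def __init__(self):
        self._by_name = {}

    def add(self, conn):
        if conn.name in self._by_name:
            raise ValueError(f"connection name already in use: {conn.name}")
        self._by_name[conn.name] = conn

    def get(self, name):
        return self._by_name[name]


# Two clusters from different (made-up) vendors in the same session.
registry = ConnectionRegistry()
registry.add(NamedConnection("cluster-a", "10.0.0.5", 8020, "vendor-x-1.0"))
registry.add(NamedConnection("cluster-b", "10.0.1.9", 8020, "vendor-y-2.3"))
```

Looking up `registry.get("cluster-b")` returns the stored details for that cluster, which is the role the assigned name plays when a step or job entry refers to a connection.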

If you run a transformation or job associated with an existing shim connection before you have set up a driver and named connection for your Hadoop cluster, PDI can work with your cluster in a legacy (fallback) mode until you set up your driver (formerly called a shim) and named connection. See Big Data issues for more information on this mode and other Big Data troubleshooting items.

Step-level Spark tuning in PDI transformations

When you run a transformation using the Spark engine, you can now tune Spark parameters on a step in the transformation. Step-level tuning can improve the performance of your PDI transformations executed on the Spark engine. These parameters adjust critical factors such as the number of Spark partitions, driver and executor memory, and persistent storage level to optimize the execution of your PDI transformation. For a list of applicable Spark parameters and PDI steps, see Spark Tuning.
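Conceptually, step-level tuning means that one step can override settings that would otherwise apply to the whole transformation. The sketch below shows that layering with plain dictionaries. The keys (`executor_memory`, `driver_memory`, `num_partitions`, `storage_level`) are hypothetical placeholders rather than PDI's actual parameter names, and the step names are used only as labels; see Spark Tuning for the real parameters and supported steps.

```python
def effective_step_settings(defaults, step_overrides, step_name):
    """Return the defaults with any overrides for step_name layered on top."""
    settings = dict(defaults)  # copy so the defaults stay unchanged
    settings.update(step_overrides.get(step_name, {}))
    return settings


# Hypothetical transformation-wide values.
defaults = {
    "executor_memory": "2g",
    "driver_memory": "1g",
    "num_partitions": 16,
    "storage_level": "MEMORY_ONLY",
}

# Hypothetical overrides for one step only.
step_overrides = {
    "Sort rows": {"num_partitions": 64, "storage_level": "MEMORY_AND_DISK"},
}

sort_settings = effective_step_settings(defaults, step_overrides, "Sort rows")
other_settings = effective_step_settings(defaults, step_overrides, "Filter rows")
```

In this sketch only the tuned step sees the larger partition count and the different storage level; every other step keeps the transformation-wide values.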

Copybook steps in PDI

Pentaho Data Integration now supports simple integration with fixed-length records in mainframe binary data files, so more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. PDI has two transformation steps you can use to read mainframe records from a file and transform them into PDI rows.

Copybook input
Reads the mainframe binary data files that were originally created using the copybook definition file and outputs the converted data to the PDI stream for use in transformations.
Read metadata from Copybook
Reads the metadata of a copybook definition file to use with ETL metadata injection in PDI.

For more information about using copybook steps in PDI, see Copybook steps in PDI.
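To give a sense of what reading a fixed-length mainframe record involves, the sketch below decodes one record by hand. It is not how the Copybook input step is implemented. The record layout (a 10-byte EBCDIC name, a 4-byte big-endian binary integer, and a 3-byte packed decimal with two implied decimal places) is a made-up example, and the EBCDIC code page `cp037` is an assumption; a real copybook definition file would dictate both.

```python
import struct
from decimal import Decimal

# Hypothetical layout: 10-byte text, 4-byte signed int, 3-byte packed decimal.
RECORD = struct.Struct(">10si3s")  # 17 bytes per record, no padding


def unpack_packed_decimal(raw, scale):
    """Decode packed decimal: two digits per byte, final nibble is the sign."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # 0xD marks a negative value
    value = 0
    for digit in nibbles:
        value = value * 10 + digit
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_record(data):
    """Turn one fixed-length binary record into a row of Python values."""
    name_raw, quantity, amount_raw = RECORD.unpack(data)
    name = name_raw.decode("cp037").rstrip()  # assumed EBCDIC code page
    return name, quantity, unpack_packed_decimal(amount_raw, scale=2)


# Build a sample record matching the hypothetical layout.
sample = (
    "ALICE".ljust(10).encode("cp037")
    + struct.pack(">i", 42)
    + bytes([0x12, 0x34, 0x5C])  # digits 12345 with a positive sign nibble
)
row = parse_record(sample)
```

Each record has the same byte length, so a file of such records can be read in 17-byte chunks and passed through `parse_record` one at a time.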

Pentaho Server Upgrade Installer

The new Pentaho Server Upgrade Installer is an easy-to-use tool that automatically applies the new release version to your archive installation of the Pentaho Server. Through its user interface, you can upgrade versions 7.1 and later of the Pentaho Server directly to version 9.0. For instructions on the new upgrade process, see Upgrade the Pentaho Server.

VFS connection enhancements and S3 support

PDI now features a new Open dialog box to access your VFS connections, which includes support for S3, Snowflake staging (read-only), Hitachi Content Platform, and Google Cloud Storage. For S3, Pentaho 9.0 supports S3A and session tokens. You can also use the new Open dialog box with select PDI steps and entries.

For more information, see Connecting to Virtual File Systems.

Minor Data Integration enhancements

Pentaho 9.0 includes the following minor Data Integration improvements:

Snowflake enhancements
The Bulk load into Snowflake job entry features a new Columns option in the Output tab. Use this option to preview the column names and associated data types within your selected database table.
Amazon Redshift enhancements
The Bulk load into Amazon Redshift job entry features a new Options tab and Columns option in the Output tab.
AMQP and Kinesis improvements
You can now choose string or binary as the data format for your streaming records in the Fields tabs of the AMQP Consumer and Kinesis Consumer PDI steps.
PDI expanded metadata injection support
A new metadata injection example is included in the 9.0 Pentaho distribution. See the example in ETL metadata injection for further details. Additionally, the following PDI steps now support metadata injection:
JMS Consumer improvements
The JMS Consumer PDI step now includes the following fields for more efficient processing: MessageID, JMS Timestamp, and JMS Redelivered.
Text File Input and Output improvements
You can set up the Text File Input step to run on the Spark engine using AEL. Additionally, the Header option in the Text file output step now works on Spark. For more information, see Using the Text File Output step on the Spark engine.

Minor Business Analytics enhancements

Pentaho 9.0 includes the following minor Business Analytics improvements:

Analyzer improvements
You can now display column totals at the top and row totals on the left for Analyzer reports. See Set Analyzer report options for details.
Dashboard Designer improvements
You can now export Analyzer reports as PDFs, CSVs, or Excel workbooks. See Pentaho Dashboard Designer for instructions.
