What's New in Pentaho 8.1
The Pentaho 8.1 Enterprise Edition delivers a wide range of features and improvements, from new streaming and Spark capabilities in PDI to enhanced big data and cloud data functionality and security. Pentaho 8.1 also continues to improve the overall Pentaho platform experience.
Improved Streaming Steps in PDI
Pentaho Data Integration (PDI) features several improvements to its suite of streaming steps, including the addition of two new steps.
MQTT Consumer and Producer Steps. The PDI client can now pull streaming data from an MQTT broker or MQTT clients through an MQTT transformation. The MQTT Consumer step runs a child transformation that executes according to the message batch size or duration, allowing you to process a continuous stream of records in near real-time. The MQTT Producer step allows you to publish messages in near real-time to an MQTT broker.
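To illustrate the kind of data these steps consume, the following sketch uses the Eclipse Paho Java client (not part of PDI) to publish messages that an MQTT Consumer transformation could subscribe to; the broker URL, topic, and payload are placeholder values.

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;

// Minimal sketch: publish a message to an MQTT broker so that a PDI MQTT
// Consumer transformation subscribed to the same topic could pick it up.
// Broker URL, topic, and payload below are placeholders.
public class MqttPublishExample {
    public static void main(String[] args) throws Exception {
        MqttClient client = new MqttClient("tcp://broker.example.com:1883",
                                           MqttClient.generateClientId());
        client.connect();

        MqttMessage message = new MqttMessage("sensor-reading,42.7".getBytes());
        message.setQos(1);                        // at-least-once delivery
        client.publish("factory/sensors", message);

        client.disconnect();
    }
}
```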
JMS Consumer and Producer Step Improvements. The JMS Consumer and Producer steps now support IBM MQ middleware, allowing you to build streaming data pipelines with legacy messaging systems such as IBM MQ. Like our other streaming steps, the JMS Consumer step now operates as a parent transformation that runs a child transformation executing according to the message batch size or duration, processing a continuous stream of records in near real-time.
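For context, the sketch below shows what a plain Java consumer of an IBM MQ queue looks like using the IBM MQ classes for JMS; the JMS Consumer step connects to the same kind of queue without any coding. The host, port, channel, queue manager, queue name, and credentials are placeholder values, and the IBM MQ client library is assumed to be on the classpath.

```java
import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import com.ibm.mq.jms.MQConnectionFactory;
import com.ibm.msg.client.wmq.WMQConstants;

// Minimal sketch: read one message from an IBM MQ queue over JMS.
// All connection details below are placeholders for your own environment.
public class IbmMqConsumeExample {
    public static void main(String[] args) throws Exception {
        MQConnectionFactory factory = new MQConnectionFactory();
        factory.setHostName("mq.example.com");
        factory.setPort(1414);
        factory.setQueueManager("QM1");
        factory.setChannel("DEV.APP.SVRCONN");
        factory.setTransportType(WMQConstants.WMQ_CM_CLIENT);

        Connection connection = factory.createConnection("app", "password");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("DEV.QUEUE.1"));

        connection.start();
        TextMessage message = (TextMessage) consumer.receive(5000); // wait up to 5 seconds
        System.out.println(message != null ? message.getText() : "no message");
        connection.close();
    }
}
```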
Safely Stop Streaming Transformations. You can now safely stop streaming transformations without losing records. The safe stop is also available for batch transformations and can be initiated from Spoon, Carte, or the Abort step.
Increased Spark Capabilities in PDI
Spark Engine Supported on PDI Transformation Steps. You can now run PDI transformations with the Spark engine using the following improved steps:
- Group By. To learn about the differences when running this step in Spark, see the section on Use Group By with Spark.
- Unique Rows (Hashset). Use this step with the Spark processing engine to help overcome memory constraint issues.
- Unique Rows.
Run Sub-transformations with Spark. You can now run sub-transformations with Spark on AEL using the Transformation Executor step, allowing you to design more complex pipelines in PDI and execute them in Spark.
Spark History Server. You can now configure Spark event logging so that events are captured and viewed using the Spark History Server. See Configure Event Logging for more information.
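For reference, Spark event logging is controlled by standard Spark properties; the sketch below sets them programmatically in Java, though they are more commonly placed in spark-defaults.conf on the Spark nodes. The HDFS log directory shown is only an example path, not a Pentaho default.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: enable Spark event logging so the History Server can display
// the run. The log directory is an example path, not a Pentaho default.
public class EventLogExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("event-log-example")
                .setMaster("local[*]")
                .set("spark.eventLog.enabled", "true")            // capture application events
                .set("spark.eventLog.dir", "hdfs:///spark-logs"); // directory the History Server reads

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.parallelize(Arrays.asList(1, 2, 3)).count();       // trivial job so an event log is written
        }
    }
}
```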
Google Cloud Data Enhancements
Pentaho 8.1 gives you the ability to seamlessly connect to Google Cloud Storage using a VFS browser and to import and export data to and from Google Drive. With the addition of the new Google BigQuery Loader job entry, you can now use BigQuery as a data source with the Pentaho User Console or PDI client, set up your JDBC connections using a Simba driver, and create ETL pipelines to access, enrich, and store data with Google Cloud big data services.
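As a rough illustration of the JDBC side of this workflow, the following sketch opens a BigQuery connection the way a Simba JDBC driver is typically configured; the driver class name, URL format, project, service account, and key path shown here are assumptions and placeholders rather than Pentaho-specific values.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of a Simba-style BigQuery JDBC connection. The driver class,
// URL format, and OAuth parameters follow Simba's documented conventions but
// are assumptions here; project, account, and key path are placeholders.
public class BigQueryJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.simba.googlebigquery.jdbc42.Driver");

        String url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
                + "ProjectId=my-project;"
                + "OAuthType=0;"
                + "OAuthServiceAcctEmail=etl@my-project.iam.gserviceaccount.com;"
                + "OAuthPvtKeyPath=/path/to/key.json;";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```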
Increased AWS S3 Security
PDI can now assume IAM role permissions and provide secure read/write access to S3 without hardcoded credentials in every step. This added flexibility accommodates different AWS security scenarios, lowers the credential management burden, and reduces the security risk of exposed credentials. The revised S3 CSV Input and S3 File Output transformation steps enable PDI to extract data from Amazon Web Services with the necessary security enhancements. Both steps can seamlessly pick up IAM security keys from environment variables, your machine's home directory, or the EC2 instance profile.
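The sketch below shows the same credential lookup order using the AWS SDK for Java default provider chain, which is the behavior these steps mirror: environment variables, then the credentials file in your home directory, then the EC2 instance profile. The bucket name and region are placeholders.

```java
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Minimal sketch (AWS SDK for Java v1): the default provider chain resolves
// credentials from environment variables, ~/.aws/credentials, or the EC2
// instance profile, so nothing is hardcoded. Bucket and region are placeholders.
public class S3CredentialChainExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withCredentials(new DefaultAWSCredentialsProviderChain())
                .build();

        s3.listObjectsV2("my-bucket").getObjectSummaries()
          .forEach(o -> System.out.println(o.getKey()));
    }
}
```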
New and Updated Big Data Steps
ORC Input and Output Added. Optimized Row Columnar (ORC) Input and Output transformation steps have been added, enabling PDI to read and write this indexed, columnar data serialization format and easing the development of pipelines that handle it. Native handling of ORC files through input and output steps is available from any standard storage system and is also accessible through Virtual File System (VFS) drivers. To improve performance, native execution of the steps can occur in the Pentaho engine or in Spark using AEL. See the ORC Input and ORC Output steps.
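For readers unfamiliar with the format, this short sketch reads an ORC file directly with the Apache ORC core library (outside PDI) to show its columnar, batch-oriented layout; the file path and the assumption that the first column is a LONG are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

// Minimal sketch: read an ORC file batch by batch. The path and the assumption
// that column 0 holds LONG values are placeholders for illustration only.
public class OrcReadExample {
    public static void main(String[] args) throws Exception {
        Reader reader = OrcFile.createReader(new Path("/tmp/data.orc"),
                                             OrcFile.readerOptions(new Configuration()));

        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
            LongColumnVector firstColumn = (LongColumnVector) batch.cols[0];
            for (int r = 0; r < batch.size; r++) {
                System.out.println(firstColumn.vector[r]);   // values arrive column by column
            }
        }
        rows.close();
    }
}
```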
New Options for ORC, Avro, and Parquet. The following formatting options have been added to the ORC, Avro, and Parquet input and output steps:
- Option to append the date, time, or a timestamp to output file names.
- Overwrite existing files.
- Data type conversion, enabling you to change the data types in each of these steps.
See the Avro Input, Avro Output, Parquet Input, and Parquet Output steps.
Additional Big Data Updates:
Cassandra. These steps are updated to support Cassandra version 3.11 and DataStax version 5.1. See the following reference articles for more information on specific PDI steps: Cassandra Input, Cassandra Output, and SSTable Output.
HBase. In the HBase Input and HBase Output steps, you can delete rows by using a mapping key. A new option enables you to create a mapping template for extracting tuples from and writing tuples to HBase.
MongoDB. As a security enhancement, the MongoDB Input and MongoDB Output steps now support SSL connections. The MongoDB driver has also been upgraded to version 3.6.3, which supports MongoDB versions 3.4 and 3.6. See the following reference articles for more information: MongoDB Input and MongoDB Output.
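A minimal sketch of an SSL-enabled connection with the MongoDB Java driver 3.x, the same kind of secured connection the steps can now make, is shown below; host, port, database, and collection names are placeholders.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import org.bson.Document;

// Minimal sketch: connect to MongoDB over SSL and read one document.
// Host, port, database, and collection names are placeholders.
public class MongoSslExample {
    public static void main(String[] args) {
        MongoClientOptions options = MongoClientOptions.builder()
                .sslEnabled(true)   // encrypt traffic to the server
                .build();

        MongoClient client = new MongoClient(new ServerAddress("mongo.example.com", 27017), options);
        Document first = client.getDatabase("sales")
                               .getCollection("orders")
                               .find()
                               .first();
        System.out.println(first);
        client.close();
    }
}
```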
Splunk. The Splunk steps have been upgraded to support Splunk version 7.0. See the following reference articles for more information: Splunk Input and Splunk Output.
Data Integration Improvements
New Data Lineage Analyzers. Data Lineage now features JSON Input and Output analyzers. To learn how to create analyzers, see Contribute Additional Step and Job Entry Analyzers to the Pentaho Metaverse.
Metadata Injection Support Added to Table Input Step. The Connection field in the Table Input step now supports metadata injection. You can use this step to save injected transformations into the Pentaho Repository.
Generic Database Connection. When you set up a generic database connection, you can use the Dialect setting to define a custom JDBC driver class and URL for a specific database dialect.
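For context, the driver class and URL you supply in a generic connection are the same two pieces a plain JDBC program needs; the sketch below uses a hypothetical driver class and URL purely for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Minimal sketch of the two pieces a generic connection asks for: a JDBC driver
// class and a connection URL. Both values below are hypothetical placeholders
// for whatever your database vendor documents.
public class GenericJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.example.jdbc.ExampleDriver");          // custom JDBC driver class
        String url = "jdbc:example://db.example.com:5432/sales";  // custom connection URL

        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```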
New Select Filter in Data Explorer. Use the Select filter to search a list of values and choose which ones to filter on while you are inspecting your data within the PDI client. See Use Filters to Explore Your Data for more information.
Clear Step and Entry Search. In the Explore pane of the PDI client, you can now clear your current transformation step or job entry search by clicking the "X" next to the search field.
PDI Repository Improved. You will now experience improved performance when opening files, saving files, and exploring your Pentaho Repository. See Use Pentaho Repositories in PDI for more information.
Salesforce Transformation Steps Improved. PDI 8.1 uses API version 41.0 for the Salesforce Webservice URL in all Salesforce steps, which have been updated accordingly.
CSV File Step Improved. The CSV File Input transformation step now supports milliseconds in the date format field. File handling options also give you better control over the maximum number of simultaneously open files and the time between file flushes.
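PDI date masks generally follow Java's SimpleDateFormat patterns, so millisecond precision is expressed with .SSS; a minimal sketch follows (the sample value is made up).

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Minimal sketch of a millisecond-precision date mask like the one the CSV File
// Input format field can now accept. The sample value is made up.
public class MillisecondDateExample {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        Date parsed = format.parse("2018-05-16 09:30:15.250");
        System.out.println(parsed.getTime());   // epoch milliseconds, including the .250
    }
}
```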
Logging Improvements. We have added the PDI.log file to capture the execution of transformations and jobs. Additionally, you can now scroll through the log output and copy sections of the log text. See PDI Logging for more information.
Administrator Permission Improvements. Administrators can now better manage content in the Pentaho Repository Explorer. When individual users delete transformations, jobs, and database connections, administrators can permanently empty those users' trash folders. This option is helpful when users leave an organization and their deleted files need to be permanently cleared.
Business Analytics Improvements
Continuous Axis for Time Dimensions in Visualizations. Line, Area, and Chart visualizations now use a continuous display of data for time dimensions. The data points are now spaced in proportion to the time duration, giving a more visually accurate representation of data trends. Previously, the time axis used equally spaced, discrete data points.
Timestamp in Reports. You can now append a timestamp to the generated content when you run a report in the background or schedule a report in the User Console. Report types include Report Designer reports (.prpt), Analyzer reports (.xanalyzer), and Interactive Reports.