Hitachi Vantara Lumada and Pentaho Documentation

Use Hadoop with Pentaho

Pentaho provides a complete big data analytics solution that supports the entire big data analytics process. From big data aggregation, preparation, and integration, to interactive visualization, analysis, and prediction, Pentaho allows you to harvest the meaningful patterns buried in big data stores. Analyzing your big data sets gives you the ability to identify new revenue sources, develop loyal and profitable customer relationships, and run your organization more efficiently and cost-effectively.

Note: To work with big data in the PDI client, you must install the Pentaho Data Integration Hadoop add-on. For more information, see Install the PDI tools and plugins.

Pentaho, big data, and Hadoop

The term big data applies to very large, complex, or dynamic datasets that need to be stored and managed over a long time. To derive benefits from big data, you need the ability to access, process, and analyze data as it is being created. However, the size and structure of big data make it very inefficient to maintain and process using traditional relational databases.

Big data solutions re-engineer the components of traditional databases (data storage, retrieval, query, and processing) and scale them massively.

Pentaho big data overview

Pentaho enables speed-of-thought analysis against even the largest big data stores by focusing on the features that deliver performance.

  • Instant access

    Pentaho provides visual tools to make it easy to define the sets of data that are important to you for interactive analysis. These data sets and associated analytics can be easily shared with others, and as new business questions arise, new views of data can be defined for interactive analysis.

  • High performance platform

    Pentaho is built on a modern, lightweight, high performance platform. This platform takes full advantage of the 64-bit, multi-core processors and large memory spaces of contemporary hardware.

  • Extreme-scale, in-memory caching

    Pentaho is unique in leveraging external data grid technologies, such as Infinispan and Memcached, to load vast amounts of data into memory so that it is instantly available for speed-of-thought analysis.

  • Federated data integration

    Data can be extracted from multiple sources, including big data and traditional data stores, integrated, and then fed directly into reports, without needing an enterprise data warehouse or data mart.

About Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

A Hadoop platform consists of a Hadoop kernel, a MapReduce model, a distributed file system, and often a number of related projects, such as Apache Hive, Apache HBase, and others.

The Hadoop Distributed File System, commonly referred to as HDFS, is a Java-based, distributed, scalable, and portable file system for the Hadoop framework.
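
To make the MapReduce model mentioned above more concrete, the following is a minimal in-process sketch in Python. It is not Hadoop code and does not use any Hadoop or Pentaho API; the function names (map_phase, shuffle, reduce_phase, map_reduce) are hypothetical and chosen only for illustration. It shows the three stages a MapReduce framework runs across a cluster: mapping each record to key/value pairs, grouping values by key, and reducing each group to one result.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record: one pair per word."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster, the map and reduce functions run on many nodes at once and the input is read from a distributed file system such as HDFS; here everything runs in a single process so the data flow is easy to follow.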

Get started with Hadoop and PDI

Pentaho Data Integration (PDI) can operate in two distinct modes: job orchestration and data transformation. Within PDI they are called jobs and transformations.

PDI jobs sequence a set of entries that encapsulate actions. An example of a PDI big data job would be to check for new log files, copy the new files to HDFS, execute a MapReduce task to aggregate the weblog into a click stream, and stage that click stream data in an analytic database.
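
The example job above can be sketched as an ordered list of entries that run one after another and stop at the first failure. This is an illustrative Python model only, not how PDI defines or runs jobs; JobEntry, run_job, and the four placeholder actions are hypothetical names, and the actions only manipulate an in-memory dictionary instead of touching real files, HDFS, or a database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JobEntry:
    name: str
    action: Callable[[Dict], bool]  # returns True on success


def run_job(entries: List[JobEntry], context: Dict) -> List[str]:
    """Run entries in order; stop at the first one that fails.

    Returns the names of the entries that completed successfully.
    """
    completed = []
    for entry in entries:
        if not entry.action(context):
            break
        completed.append(entry.name)
    return completed


# Placeholder actions (assumptions for illustration, not PDI job entries).
def check_for_new_logs(ctx: Dict) -> bool:
    ctx["new_files"] = [f for f in ctx["incoming"] if f not in ctx["seen"]]
    return bool(ctx["new_files"])  # fail the job if there is nothing to do


def copy_to_hdfs(ctx: Dict) -> bool:
    ctx["hdfs"] = list(ctx["new_files"])  # stand-in for a real HDFS copy
    return True


def run_mapreduce(ctx: Dict) -> bool:
    # Stand-in for a MapReduce task that aggregates weblogs.
    ctx["click_stream"] = {"files_aggregated": len(ctx["hdfs"])}
    return True


def stage_in_database(ctx: Dict) -> bool:
    ctx["staged"] = ctx["click_stream"]  # stand-in for a database load
    return True


WEBLOG_JOB = [
    JobEntry("Check for new log files", check_for_new_logs),
    JobEntry("Copy files to HDFS", copy_to_hdfs),
    JobEntry("Run MapReduce aggregation", run_mapreduce),
    JobEntry("Stage click stream", stage_in_database),
]
```

Calling run_job(WEBLOG_JOB, {"incoming": ["a.log", "b.log"], "seen": {"a.log"}}) runs all four entries in order. If no new files are found, the first entry fails and the remaining entries are skipped.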

PDI transformations consist of a set of steps that execute in parallel and operate on a stream of data. With the default Pentaho engine, data usually flows from a source system through steps where new columns are calculated or values are looked up and added to the stream. The data stream is then sent to a receiving system like a Hadoop cluster, a database, or the Pentaho Reporting engine. PDI job entries and transformation steps are described in Transformation step reference and Job entry reference.
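
As a rough illustration of the data flow described above, the following Python sketch chains three generator functions so that each record passes through a calculation step and a lookup step before it reaches the output. It is an assumption-laden model, not PDI code: the step names (read_rows, add_total, lookup_region), the field names, and the use of dictionaries as records are all hypothetical, and Python generators stream records lazily in a single thread, whereas PDI steps execute in parallel.

```python
def read_rows(rows):
    """Input step: emit a copy of each source record."""
    for row in rows:
        yield dict(row)


def add_total(rows):
    """Calculation step: add a new 'total' column to each record."""
    for row in rows:
        row["total"] = row["quantity"] * row["unit_price"]
        yield row


def lookup_region(rows, regions):
    """Lookup step: add a 'region' column looked up by country code."""
    for row in rows:
        row["region"] = regions.get(row["country"], "unknown")
        yield row


def run_transformation(rows, regions):
    """Wire the steps together and collect the resulting stream."""
    stream = read_rows(rows)
    stream = add_total(stream)
    stream = lookup_region(stream, regions)
    return list(stream)
```

In this sketch the final list stands in for the receiving system; in a transformation the stream would instead be written to a target such as a Hadoop cluster or a database.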

Before you begin

PDI contains all the job entries and transformation steps required for working with Hadoop, Cassandra, and MongoDB. Your cluster administrator can set up the Pentaho Server to communicate with most Hadoop distributions. See Set up the Pentaho Server to connect to a Hadoop cluster for more information. For a list of supported big data technology, including which configurations of Hadoop are currently supported, see the Components Reference.

Manage Hadoop configurations in PDI

Your cluster administrator can edit a PDI properties file to manage a Hadoop configuration and include or exclude Hadoop classes or packages from the configuration.

Hadoop connection and access information list

To access the Hadoop cluster and its services, you need permissions and connection information. Your cluster administrator can provide this information.

Connect to your Hadoop clusters in the PDI client

You can establish connections in the PDI client to multiple Hadoop clusters and versions through drivers that act as adapters between Pentaho and your clusters.

Use PDI outside and inside the Hadoop cluster

After you establish connections to one or more clusters, you can run PDI both outside of your Hadoop clusters and within the nodes of the clusters.

Advanced topics

The following topics help to extend your knowledge of using Hadoop with Pentaho beyond basic setup and use:

Troubleshoot the Pentaho system

See our list of common problems and resolutions.
