Hitachi Vantara Lumada and Pentaho Documentation

Set up Pentaho to connect to a Hadoop cluster


If you are an IT administrator for a team working with big data, you will need to configure Pentaho to connect to a Hadoop cluster.

Pentaho can connect to Cloudera Distribution for Hadoop (CDH), Hortonworks Data Platform (HDP), Microsoft Azure HDInsight (HDI), Amazon Elastic MapReduce (EMR), or MapR. Pentaho also supports many related services such as HDFS, HBase, Oozie, Zookeeper, and Spark. You can connect to clusters and services from these Pentaho components: the PDI client (Spoon), the Pentaho Server, Analyzer, Pentaho Interactive Reports, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

The Pentaho Server connects to a Hadoop cluster through an adaptive big data layer called a shim. You must modify the shim's properties and configuration files before you can connect to a Hadoop cluster. Pentaho regularly develops and releases shims, even between product releases, so that customers can keep up with the latest versions of each distribution. To see which shims are supported for this version of Pentaho, see the Components Reference.
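As a rough sketch of what selecting a shim involves, the big data plugin typically ships its shims as subdirectories and reads an active.hadoop.configuration property to decide which one to load. The directory name, file layout, and shim identifiers below are illustrative placeholders, not the exact paths of your installation; consult the distribution-specific instructions linked at the end of this article for the real locations and values.

```shell
# Illustrative only: PLUGIN_DIR stands in for the big data plugin
# directory inside your Pentaho Server or PDI client install.
PLUGIN_DIR=./pentaho-big-data-plugin
mkdir -p "$PLUGIN_DIR"

# A plugin.properties file names the shim that is currently active
# (here "hdp30" is a placeholder shim identifier).
printf 'active.hadoop.configuration=hdp30\n' > "$PLUGIN_DIR/plugin.properties"

# Switching to another shim means pointing the property at a different
# shim directory name (here "cdh61", again a placeholder).
sed -i.bak 's/^active\.hadoop\.configuration=.*/active.hadoop.configuration=cdh61/' \
  "$PLUGIN_DIR/plugin.properties"

# Confirm the change took effect.
grep '^active.hadoop.configuration' "$PLUGIN_DIR/plugin.properties"
```

After changing the active shim, you would normally also copy the cluster's own site configuration files (such as core-site.xml and hdfs-site.xml) into the shim directory and restart the Pentaho component, as described in the distribution-specific articles.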

If the Hadoop distribution that you want to use is not listed, see Configuring Pentaho for your Hadoop Distro and Version. A previous version of Pentaho might support older Hadoop distributions.

To learn how to configure a shim for a specific distribution, click one of the following links: