
Hitachi Vantara Lumada and Pentaho Documentation

Set up the Pentaho Server to connect to a Hadoop cluster

This article is for IT administrators who need to configure Pentaho to connect to a Hadoop cluster for teams working with Big Data.

Pentaho can connect to Amazon Elastic MapReduce (EMR), Azure HDInsight (HDI), Cloudera Distribution for Hadoop (CDH) and Cloudera Data Platform (CDP), Google Dataproc, and Hortonworks Data Platform (HDP). Pentaho also supports related services such as HDFS, HBase, Hive, Oozie, Pig, Sqoop, Yarn/MapReduce, ZooKeeper, and Spark. You can connect to clusters and services from these Pentaho components:

  • PDI client (Spoon), along with Kitchen and Pan command line tools
  • Pentaho Server
  • Analyzer (PAZ)
  • Pentaho Interactive Reports (PIR)
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)

You can configure the Pentaho Server to connect to a Hadoop cluster through a compatibility layer called a driver. Pentaho regularly develops and releases new drivers, so you can stay up-to-date with the latest technological developments. To view which drivers are supported for this version of Pentaho, see the Components Reference.

When drivers for new Hadoop versions are released, you can download them from the Hitachi Vantara Lumada and Pentaho Support Portal and then add them to Pentaho to connect to the new Hadoop distributions. For more information about downloading and adding a new driver, see Adding a new driver.

Note: Pentaho ships with a generic Apache Hadoop driver. For vendor-specific drivers, visit the Hitachi Vantara Lumada and Pentaho Support Portal to download them.

Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to.

To learn about additional configurations for a specific distribution, see the configuration article for that distribution.

Before you begin

Before you connect to the Pentaho Server, set the connection path to the metastore, which is where these types of connections are stored.

Procedure

  1. Navigate to the pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin directory and open the plugin.properties file with any text editor.

  2. Locate the hadoop.configurations.path property and set the value to the metastore directory. For example, /home/devuser/.pentaho/metastore.

  3. Save and close the plugin.properties file.

  4. Restart the Pentaho Server.
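The edit in steps 1 through 3 can be sketched from the command line. In this sketch, a demo file stands in for the real plugin.properties so the commands are runnable as written; in practice, point the variable at the file in the pentaho-big-data-plugin directory. The metastore path is the example from step 2.

```shell
# Demo stand-in for plugin.properties; replace with the real path under
# pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin.
demo=$(mktemp -d)
printf 'hadoop.configurations.path=hadoop-configurations\n' > "$demo/plugin.properties"

# Step 2: point hadoop.configurations.path at the metastore directory (example path).
sed -i 's|^hadoop.configurations.path=.*|hadoop.configurations.path=/home/devuser/.pentaho/metastore|' \
  "$demo/plugin.properties"

# Confirm the new value before restarting the server.
grep '^hadoop.configurations.path=' "$demo/plugin.properties"
```

After saving the change to the real file, restart the Pentaho Server as in step 4.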

Install a driver for the Pentaho Server

Perform the following steps to install a driver for the Pentaho Server.

Before you begin

This task assumes that you have downloaded your driver from the Hitachi Vantara Lumada and Pentaho Support Portal or that you are using the Apache Hadoop driver that is shipped with Pentaho.

Procedure

  1. Verify that you are connected to a repository.

  2. In the PDI client, select the View tab of your transformation or job.

  3. Right-click the Hadoop clusters folder and click Add driver.

    The Add driver dialog box appears.
  4. Click Browse.

    The Choose File to Upload dialog box appears.
  5. Navigate to the directory where you downloaded your .kar file from the Lumada and Pentaho Support Portal.

  6. Select the driver (.kar file) you want to add, click Open, and then click Next.

    The selected file name appears in the Browse text field. The .kar file names contain the vendor distribution abbreviations, as shown below:
    • Amazon EMR (emr)
    • Azure HDInsight (hdi)
    • Cloudera (cdh)
    • Cloudera Data Platform (cdp)
    • Google Dataproc (dataproc)
    • Hortonworks (hdp)
  7. Click Next.

    The Congratulations dialog box appears, notifying you that you must restart the Pentaho Server and the PDI client. The installed driver is now available for selection in the Driver field in the New cluster and Import cluster dialog boxes.

Manually install a driver for the Pentaho Server

You can manually install a driver for the Pentaho Server, even when you are not connected to the Pentaho Server with the PDI client. This task assumes that you have downloaded your driver from the Hitachi Vantara Lumada and Pentaho Support Portal or that you are using the Apache Hadoop driver that is shipped with Pentaho.

Perform the following steps to manually install a driver for the Pentaho Server:

Procedure

  1. Navigate to the directory where you downloaded your .kar file from the Lumada and Pentaho Support Portal.

  2. Select the driver (.kar file) you want to add and copy it to the <pentaho home>/server/pentaho-server/pentaho-solutions/drivers directory on the machine with the Pentaho Server.

    The .kar file names contain the vendor distribution abbreviations, as shown below:
    • Amazon EMR (emr)
    • Azure HDInsight (hdi)
    • Cloudera (cdh)
    • Cloudera Data Platform (cdp)
    • Google Dataproc (dataproc)
    • Hortonworks (hdp)
  3. Restart the Pentaho Server.
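The copy in step 2 can be sketched from a shell. The temp directories below are stand-ins for your actual download location and the drivers directory, so the commands are runnable as written, and the .kar file name is illustrative, not a real driver file name.

```shell
# Stand-ins for the real locations; substitute your download directory and
# <pentaho home>/server/pentaho-server/pentaho-solutions/drivers.
download_dir=$(mktemp -d)
drivers_dir=$(mktemp -d)

# Illustrative driver file name; use the .kar file you downloaded.
touch "$download_dir/cdh-driver.kar"

# Step 2: copy the driver into the drivers directory on the Pentaho Server machine.
cp "$download_dir"/*.kar "$drivers_dir"/
ls "$drivers_dir"

# Step 3: restart the Pentaho Server so it picks up the new driver.
```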