Hitachi Vantara Lumada and Pentaho Documentation

Set up the Pentaho Server to connect to a Hadoop cluster

This article is for IT administrators who need to configure Pentaho to connect to a Hadoop cluster for teams working with Big Data.

Pentaho can connect to Amazon Elastic MapReduce (EMR), Cloudera Distribution for Hadoop (CDH) and Cloudera Data Platform (CDP), Google Dataproc, and Hortonworks Data Platform (HDP). Pentaho also supports related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark. You can connect to clusters and services from these Pentaho components:

  • PDI client (Spoon)
  • Pentaho Server
  • Analyzer
  • Pentaho Interactive Reports
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)

You can configure the Pentaho Server to connect to a Hadoop cluster through a compatibility layer called a driver. Pentaho regularly develops and releases new drivers to support new Hadoop distributions and versions. To view which drivers are supported for this version of Pentaho, see the Components Reference.

When drivers for new Hadoop versions are released, you can download them from the Pentaho Customer Support Portal and then add them to Pentaho to connect to the new Hadoop distributions. For more information about downloading and adding a new driver, see Adding a new driver.

Pentaho ships with drivers for Amazon EMR, Cloudera, Google Dataproc, and Hortonworks that you can install for the Pentaho Server. Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to.

Note: If you are using the Pentaho Metadata Editor or Pentaho Report Designer, the drivers are already installed.

Before you begin

Before you connect to the Pentaho Server, set the connection path to the metastore, which is where these types of connections are stored.

Procedure

  1. Navigate to the pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin directory and open the plugin.properties file with any text editor.

  2. Locate the hadoop.configurations.path property and set the value to the metastore directory. For example, /home/devuser/.pentaho/metastore.

  3. Save and close the plugin.properties file.

  4. Restart the Pentaho Server.
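
If you prefer to script steps 1 through 3, the following is a minimal Python sketch rather than a Pentaho-provided tool. The helper name set_property is hypothetical, and the sketch assumes that plugin.properties uses plain key=value lines. It replaces the value of an existing key or appends the key if it is missing, and leaves all other lines untouched.

```python
from pathlib import Path


def set_property(properties_file, key, value):
    """Set key=value in a Java-style .properties file.

    Assumption: entries are written as key=value (no ':' separators or
    line continuations). Returns True if an existing entry was replaced,
    False if the entry was appended.
    """
    path = Path(properties_file)
    # latin-1 maps every byte to one character, so untouched lines round-trip.
    lines = path.read_text(encoding="latin-1").splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and comments ('#' or '!') alone.
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    path.write_text("\n".join(lines) + "\n", encoding="latin-1")
    return replaced
```

For example, calling set_property with the path to plugin.properties, the key hadoop.configurations.path, and the value /home/devuser/.pentaho/metastore mirrors step 2. You still need to restart the Pentaho Server afterward.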

Install a driver for the Pentaho Server

Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to. This task assumes that you have downloaded your driver from the Pentaho Customer Support Portal or that you are using a driver for Amazon EMR, Cloudera, Google Dataproc, or Hortonworks that is shipped with Pentaho.

Perform the following steps to install a driver for the Pentaho Server.

Procedure

  1. Verify that you are connected to a repository.

  2. In the PDI client, select the View tab of your transformation or job.

  3. Right-click the Hadoop clusters folder and click Add driver.

    The Add driver dialog box appears.
  4. Click Browse.

    The Choose File to Upload dialog box appears.
  5. Navigate to the <pentaho home>/server/pentaho-server/pentaho-solutions/ADDITIONAL-FILES/drivers directory, where <pentaho home> is the directory where Pentaho is installed.

  6. Select the driver (.kar file) you want to add, click Open, and then click Next.

    The selected file name appears in the Browse text field. Each .kar file name contains the abbreviation of its vendor distribution, as shown below:
    • Amazon EMR (emr)
    • Cloudera (cdh)
    • Cloudera Data Platform (cdp)
    • Google Dataproc (dataproc)
    • Hortonworks (hdp)
  7. Click Next.

    The Congratulations dialog box appears, notifying you that you must restart the Pentaho Server and the PDI client. After you restart them, the installed driver is available for selection in the Driver field in the New cluster and Import cluster dialog boxes.
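
The abbreviations listed in step 6 can also be checked programmatically when you have several .kar files to sort through. The following Python sketch is illustrative only. The function name vendor_for_driver is hypothetical, and the sketch assumes that the abbreviation appears in the file name as its own token, optionally followed by version digits.

```python
import re

# Abbreviations taken from the list in the procedure above.
VENDOR_ABBREVIATIONS = {
    "emr": "Amazon EMR",
    "cdh": "Cloudera",
    "cdp": "Cloudera Data Platform",
    "dataproc": "Google Dataproc",
    "hdp": "Hortonworks",
}

# Match an abbreviation that is not embedded in a longer run of letters.
_ABBREVIATION = re.compile(
    r"(?<![a-z])(" + "|".join(VENDOR_ABBREVIATIONS) + r")(?![a-z])",
    re.IGNORECASE,
)


def vendor_for_driver(file_name):
    """Return the vendor name for a .kar driver file name, or None."""
    if not file_name.lower().endswith(".kar"):
        return None
    match = _ABBREVIATION.search(file_name)
    if match is None:
        return None
    return VENDOR_ABBREVIATIONS[match.group(1).lower()]
```

A made-up name such as example-shim-cdh61.kar would resolve to Cloudera. Actual driver file names may follow a different pattern, so treat a None result as a prompt to check the name by hand.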

Manually install a driver for the Pentaho Server

You can manually install a driver for the Pentaho Server, even when you are not connected to the Pentaho Server with the PDI client. This task assumes that you have downloaded your driver from the Pentaho Customer Support Portal or that you are using a driver for Amazon EMR, Cloudera, Google Dataproc, or Hortonworks that is shipped with Pentaho.

Perform the following steps to manually install a driver for the Pentaho Server.

Procedure

  1. Navigate to the <pentaho home>/server/pentaho-server/pentaho-solutions/ADDITIONAL-FILES/drivers directory, where <pentaho home> is the directory where Pentaho is installed.

  2. Select the driver (.kar file) you want to add and copy it to the <pentaho home>/server/pentaho-server/pentaho-solutions/drivers directory on the machine with the Pentaho Server.

    Each .kar file name contains the abbreviation of its vendor distribution, as shown below:
    • Amazon EMR (emr)
    • Cloudera (cdh)
    • Cloudera Data Platform (cdp)
    • Google Dataproc (dataproc)
    • Hortonworks (hdp)
  3. Restart the Pentaho Server.
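
Steps 1 and 2 amount to a file copy, which can be scripted on the machine with the Pentaho Server. The following Python sketch assumes the directory layout described above. The function name install_driver is hypothetical, and creating the target drivers directory when it is missing is an assumption of this sketch, not documented Pentaho behavior.

```python
import shutil
from pathlib import Path


def install_driver(pentaho_home, kar_name):
    """Copy a driver from ADDITIONAL-FILES/drivers to the drivers directory.

    Returns the path of the copied file.
    """
    solutions = Path(pentaho_home) / "server" / "pentaho-server" / "pentaho-solutions"
    source = solutions / "ADDITIONAL-FILES" / "drivers" / kar_name
    target_dir = solutions / "drivers"

    if source.suffix.lower() != ".kar":
        raise ValueError(f"Not a .kar driver file: {kar_name}")
    if not source.is_file():
        raise FileNotFoundError(f"Driver not found: {source}")

    # Assumption: create the target directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps file timestamps and returns the destination path.
    return Path(shutil.copy2(source, target_dir))
```

The copy does not replace step 3. Restart the Pentaho Server after the file is in place.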