
Hitachi Vantara Lumada and Pentaho Documentation

Connecting to a Hadoop cluster with the PDI client


To connect to a Hadoop cluster, you must add and install a driver, create a named connection, and then configure and test the connection. A named connection stores the information used to connect to the Hadoop cluster, such as the IP address and port number, under the name you assign so you can reuse it later. You can create named connections to any supported vendor cluster and vendor version.

After you have a named connection set up, you can edit or duplicate that connection. For example, if you want to use a configuration with different security credentials, you can duplicate a connection, then edit the security settings on the copy. Named connections are useful when you move your jobs and transformations from a development server to a production server because you only need to update the connection information for your cluster name in the Hadoop Clusters dialog box. The jobs and transformations use the new connection information from the named connection.

Audience and prerequisites

The audience for this article is ETL developers, data engineers, and data analysts.

Before you begin, verify that your Hadoop administrator has set up your user account on the cluster and granted permissions to access the applicable HDFS directories. You need access to your home directory and any other directories required for your tasks.

Pentaho ships with supported versions of drivers for Amazon EMR, Cloudera, and Hortonworks that you can install on the PDI client. You must have a driver for each vendor and version of Hadoop for connecting to each cluster. You must install a driver before it is available for selection when you add a new connection to the cluster.

Note: If you are using the Pentaho Metadata Editor or Pentaho Report Designer, the drivers are already installed.

When drivers for new Hadoop versions are released, you can download them from the Pentaho Customer Support Portal and add them to Pentaho to connect to the new Hadoop distributions. Install them using the following procedure.

Verify that your Hadoop administrator has configured the Pentaho Server to connect to the Hadoop cluster. For more information, see Set Up Pentaho to Connect to a Hadoop Cluster. Ask your Hadoop administrator for a copy of the site.xml files from the cluster and for the following information:

  • Distribution and version of the cluster (for example, Cloudera Distribution 6.1).
  • IP addresses and port numbers for HDFS, JobTracker, and Zookeeper (if used).
  • Kerberos and cluster credentials if you are connecting to a secured cluster.
  • Oozie URL (if used).

Install a driver for the PDI client

Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to. This task assumes that you have downloaded your driver from the Pentaho Customer Support Portal or that you are using a driver for Amazon EMR, Cloudera, Google Dataproc, or Hortonworks that ships with Pentaho.
Note: If you are using the Pentaho Metadata Editor or Pentaho Report Designer, the drivers are already installed.

Perform the following steps to install a driver for the PDI client:

Procedure

  1. In the PDI client, select the View tab of your transformation or job.

  2. Right-click the Hadoop clusters folder and click Add driver.

    The Add driver dialog box appears.
  3. Click Browse.

    The Choose File to Upload dialog box appears.
  4. Navigate to the <pentaho home>/design-tools/data-integration/ADDITIONAL-FILES/drivers directory, where <pentaho home> is the directory where Pentaho is installed.

  5. Select the driver (.kar file) you want to add, click Open, and then click Next.

    The selected file name appears in the Browse text field. The vendor distribution files contain their abbreviations in the .kar file names as shown below:
    • Amazon EMR (emr)
    • Cloudera (cdh)
    • Cloudera Data Platform (cdp)
    • Google Dataproc (dataproc)
    • Hortonworks (hdp)
  6. Click Next.

    The Congratulations dialog box appears, notifying you that you must restart the Pentaho Server and the PDI client. The Driver field in the New cluster and Import cluster dialog boxes now displays the driver you have added.
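As a quick illustration of the vendor abbreviations listed in step 5, a small script can guess the distribution from a driver file name. This helper and the example file names are hypothetical (actual .kar names vary by release), so treat it as a sketch rather than part of PDI:

```python
import re

# Vendor abbreviations that appear in driver .kar file names (see step 5).
VENDOR_ABBREVIATIONS = {
    "emr": "Amazon EMR",
    "cdh": "Cloudera",
    "cdp": "Cloudera Data Platform",
    "dataproc": "Google Dataproc",
    "hdp": "Hortonworks",
}

def vendor_from_kar(filename):
    """Return the distribution guessed from a .kar file name, or None."""
    # Split the name on common separators and look for a token that starts
    # with a known abbreviation (versions are often appended, e.g. "cdh61").
    tokens = re.split(r"[-_.]", filename.lower())
    for token in tokens:
        for abbrev in sorted(VENDOR_ABBREVIATIONS, key=len, reverse=True):
            if token.startswith(abbrev):
                return VENDOR_ABBREVIATIONS[abbrev]
    return None
```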

Adding a cluster connection

You can add named connections manually or by importing them. If you are using high availability (HA) clusters, you must manually add the connection information in the New cluster dialog box to create your connection.

If you are connected to the Pentaho Repository when you add a new cluster connection, you and other users can reuse the connection. If you are not connected to the Pentaho Repository when you create the connection, then only you can reuse the connection.

Note: Security is set up on a per-user basis. Security information is not stored in the repository.

Add a cluster connection by import

You can create a new cluster by importing the site.xml files from an existing cluster. Perform the following steps to create a cluster by import.

Procedure

  1. In the PDI client, create a new job or transformation or open an existing one.

  2. Click the View tab and then right-click the Hadoop Clusters folder.

  3. Click Import cluster.

    The Hadoop Clusters dialog box appears.
  4. In the Cluster name field, enter the name you want to assign to the cluster connection.

    Note: Valid cluster names may include uppercase and lowercase letters, numbers, and hyphens, but the name cannot end with a hyphen. Do not use any other symbols, punctuation characters, or blank spaces.
    After you create the connection, you can locate this named connection in the View tab of the PDI client.
  5. Use the Driver and Version options to select the distribution of Hadoop on your cluster and its version number. Pentaho ships with drivers for supported versions of Amazon EMR, Cloudera, Google Dataproc, and Hortonworks that you can install.

  6. Click Browse to add file(s) and browse to the directory containing the site.xml files that were provided to you by your cluster administrator.

    The required files include:
    • hive-site.xml
    • mapred-site.xml
    • yarn-site.xml
    • core-site.xml
    • hbase-site.xml
    • hdfs-site.xml
    • oozie-site.xml (if you are using Oozie in your configuration)
  7. Click Open.

    The Site XML files section displays the files you selected.
  8. Enter your user name and password in the HDFS section if you are connecting to a secure cluster.

  9. Click Next and specify the security option for your cluster.
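Before importing, it can save a round trip to confirm that the directory your administrator gave you actually contains the site files listed in step 6. The following pre-flight check is an illustrative sketch, not part of PDI:

```python
from pathlib import Path

# Site files required for an import (see step 6 above).
REQUIRED = [
    "hive-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
    "core-site.xml",
    "hbase-site.xml",
    "hdfs-site.xml",
]
OPTIONAL = ["oozie-site.xml"]  # only needed if Oozie is part of your configuration

def missing_site_files(directory):
    """Return the required site.xml files absent from `directory`, in order."""
    d = Path(directory)
    return [name for name in REQUIRED if not (d / name).is_file()]
```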

Add a cluster connection manually

To add a cluster connection manually, you need access to the location of the required site.xml files, which are typically provided by your cluster administrator. If you are using high availability (HA) clusters, you must manually add the connection information using this method.

This task assumes you are in the PDI client.

Perform the following steps to manually add a named connection in the Hadoop Clusters dialog box.

Procedure

  1. In the PDI client, create a new job or transformation or open an existing one.

  2. Click the View tab and then right-click the Hadoop Clusters folder.

  3. From the menu that displays, click New cluster.

    The Hadoop Clusters dialog box appears.
  4. Enter the connection information from your cluster administrator in the Hadoop Clusters dialog box.

    Note: As a best practice, use Kettle variables for each connection parameter value to reduce the risks of running jobs and transformations in environments that are disconnected from the repository.
    • Cluster Name: Enter the name you want to assign to the cluster connection.
      Note: Valid cluster names may include uppercase and lowercase letters, numbers, and hyphens, but the name cannot end with a hyphen. Do not use any other symbols, punctuation characters, or blank spaces.
      After you create the connection, you can locate this named connection in the View tab of the PDI client.
    • Driver and Version: Select the distribution of Hadoop on your cluster and its version number. Pentaho ships with drivers for supported versions of Amazon EMR, Cloudera, Google Dataproc, and Hortonworks that you can install.
    • Where are your site XML files? (Optional): Enter the location of the site.xml files provided by your cluster administrator. Click Browse to select the directory containing your site.xml files. Pentaho creates the applicable directory on the machine where the PDI client is located and copies the site.xml files to it. If you leave this option blank, Pentaho creates the directory for the distribution and version of Hadoop you selected in the Driver and Version options, and you must then copy the site.xml files to that directory yourself.
    • Hostname (HDFS): Enter the hostname for the HDFS node in your Hadoop cluster.
    • Port (HDFS): Enter the port for the HDFS node in your Hadoop cluster.
      Note: If your cluster is enabled for high availability (HA), you do not need a port number. Clear this field.
    • Username (HDFS) and Password (HDFS): Enter the user name and password for the HDFS node, as provided by your cluster administrator.
    • Hostname (JobTracker) and Port (JobTracker): Enter the hostname and port for the JobTracker node in your Hadoop cluster. If you have a separate JobTracker node, enter its hostname here.
    • Hostname (ZooKeeper) and Port (ZooKeeper): Enter the hostname and port for the ZooKeeper node in your Hadoop cluster. Supply these options only if you want to connect to a ZooKeeper service.
    • URL (Oozie): Enter the Oozie client address. Supply this address only if you want to connect to the Oozie service.
    • Bootstrap servers (Kafka): Enter the host/port pair(s) for the initial connection to the Kafka cluster. Use a comma-separated list for multiple servers, for example, host1:port1,host2:port2. You do not need to include every server in the Kafka cluster, but listing more than one protects you if a server is down.
  5. Click Next and specify the security option for your cluster.
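Two of the fields above follow simple textual rules: the cluster name (letters, numbers, and hyphens, not ending with a hyphen) and the Kafka bootstrap server list (comma-separated host:port pairs). The helpers below sketch those rules for pre-checking values in your own scripts; they are illustrative and not part of the PDI API:

```python
import re

# Cluster name rule described above: letters, numbers, and hyphens only,
# and the name must not end with a hyphen.
_CLUSTER_NAME = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_cluster_name(name):
    """Return True if `name` satisfies the documented cluster-name rules."""
    return bool(_CLUSTER_NAME.match(name)) and not name.endswith("-")

def parse_bootstrap_servers(value):
    """Split a 'host1:port1,host2:port2' list into (host, port) pairs."""
    pairs = []
    for entry in value.split(","):
        host, _, port = entry.strip().rpartition(":")
        if not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % entry)
        pairs.append((host, int(port)))
    return pairs
```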

Add security to cluster connections

If you have a secured Hadoop cluster, your security options depend on your driver. All drivers have the Kerberos option. If you are using a Hortonworks driver, you can also select Knox as your security type. If you are connected to a Pentaho Repository, you can specify additional Kerberos options for secure impersonation. See Kerberos authentication versus secure impersonation for further information on secure impersonation.

If you are not sure what security type is set up for your Hadoop cluster, contact your cluster administrator for the correct credentials.

Specify Kerberos security

Perform the following steps to specify the credentials for your Kerberos security.

Procedure

  1. Select Kerberos as your security type.

  2. Click Next to select your Kerberos security method.

  3. Choose one of the following security methods and specify the Kerberos credentials you obtained from your cluster administrator:

    • Password: Specify the Authentication username and Password options.

      If you are connected to the Pentaho Repository and are using secure impersonation, specify the additional Impersonation username and Password options. See Manual and advanced secure impersonation configuration if your environment requires advanced settings, your server is on Windows, or you are using a Cloudera Impala database for secure impersonation.

    • Keytab: Specify the Authentication username and Authentication Keytab options. Click Browse to navigate to your keytab file.

      If you are connected to the Pentaho Repository and are using secure impersonation, specify the additional Impersonation username and Impersonation Keytab options. See Manual and advanced secure impersonation configuration if your environment requires advanced settings, your server is on Windows, or you are using a Cloudera Impala database for secure impersonation.

  4. Click Next to test your connection. See Test the cluster connection for more information.

Results

After you specify your security credentials, PDI tests the connection to your Hadoop cluster. If no errors occur during the connection, PDI is successfully connected to your Hadoop cluster.
Note: You can define different principal users for each of your named connections only if all the clusters for these connections are in the same Kerberos realm. See the MIT Kerberos Documentation for more information about Kerberos realms.

Next steps

If you have problems, see Security Issues and Big Data Issues in the troubleshooting documentation to resolve the errors, then test your connection again.

Specify Knox security

Perform the following steps to specify the credentials for your Knox security, which is only available for the Hortonworks driver.

Procedure

  1. Select Knox as your security type.

  2. Click Next to specify the Knox credentials you obtained from your cluster administrator.

  3. Specify the Gateway URL for the Knox server.

  4. Specify the Gateway Username and Gateway Password for the Knox server.

  5. Click Next to test your connection. See Test the cluster connection for more information.

Results

After you specify your security credentials, PDI tests the connection to your Hadoop cluster. If no errors occur during the connection, PDI is successfully connected to your Hadoop cluster.

Next steps

If you have problems, see Security Issues and Big Data Issues in the troubleshooting documentation to resolve the errors, then test your connection again.

Test the cluster connection

After you have created a new cluster manually or by import, the Test results dialog box appears. Each component in the dialog box displays one of the following three icons:
  • A green checkmark indicates that the connection to the cluster service was successful.
  • A red "no" symbol indicates that the connection failed. Check your connection information. If you suspect a different issue, see Big Data Issues or consult your cluster administrator.
  • A yellow warning symbol indicates that the cluster service information was not supplied, so the test for that component was skipped.

Perform the following steps to test a connection:

Procedure

  1. In the View tab tree, right-click the cluster you want to test and click Test cluster.

  2. Review the results in the Test results window.

    Note: You can click the drop-down arrow in the Hadoop file system test for more details on the file system test.
    If you have errors, see Big Data Issues to resolve them or consult your cluster administrator, then test again. When no error messages are returned, the connection is properly configured.
  3. Click Close in the Hadoop Cluster Test dialog box.
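If a component fails the test, it can help to rule out basic network problems before digging into configuration. Independently of PDI, a quick TCP check against the service's host and port (for example, the HDFS or ZooKeeper endpoints you entered) distinguishes an unreachable address from a misconfigured connection. This helper is a generic sketch, not a PDI feature:

```python
import socket

def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```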

Managing Hadoop cluster connections

After cluster connections are added, you can edit, copy, and test them as needed. Once a connection is no longer required, you can delete that named connection.

Edit Hadoop cluster connections

How updates occur depends on whether you are connected to the repository.
  • If you are connected to a repository

    Hadoop cluster connection changes are registered by all transformations and jobs in the repository. The Hadoop cluster connection information is loaded during execution unless it cannot be found.

  • If you are not connected to a repository

    Hadoop cluster connection changes are registered by your local (file system) transformations and jobs. Note that changes to a Hadoop cluster connection are not stored in existing transformations or jobs for fallback purposes unless you save them again.

Perform the following steps to edit a Hadoop cluster connection:

Procedure

  1. Click the Hadoop Clusters folder in the View tab.

  2. Right-click the existing connection, then select Edit. Optionally, you can double-click the existing connection.

    The Edit cluster dialog box appears.
  3. Make your changes, then click Next.

  4. For your security type, select None and click Next or see Add security to cluster connections to add or edit security.

    The Test results dialog box displays.
  5. Click Close to save your changes.

Duplicate a Hadoop cluster connection

You can duplicate a cluster connection. This task is useful if you want to test a change to a named connection without affecting your existing setup or if you want to add different security permissions.

To duplicate a cluster connection, perform the following steps:

Procedure

  1. Click the Hadoop clusters folder in the View tab.

  2. Right-click an existing connection and select Duplicate cluster.

    The Hadoop clusters (Edit cluster) dialog box appears.
  3. Enter a different name in the Cluster Name field.

  4. Click Browse to add file(s). Use the file browser to select the site.xml files you want to import.

    Note: Duplicating a cluster connection copies the existing site.xml files to a new metastore directory. If you select site.xml files in this step, these files replace the copied site.xml files.
  5. Click Next.

  6. For your security type, select None and click Next or see Add security to cluster connections to add or edit security.

  7. Click Edit cluster to open the Edit cluster dialog box.

  8. Make the applicable changes to your cluster configuration values, then click Next.

    The Congratulations dialog box appears.
  9. Click Close.

Delete a Hadoop cluster connection

Caution: If you delete a named connection, the deleted connection cannot be restored. You must recreate the connection.
To delete a Hadoop cluster connection in a transformation or job, perform the following steps:

Procedure

  1. Click the Hadoop clusters folder in the View tab.

  2. Right-click the Hadoop cluster connection you want to delete and select Delete cluster.

    A message appears asking you to confirm the deletion.
  3. Click Yes, Delete.

    Your cluster connection is deleted, including all security credentials.