Advanced settings for connecting to a Hortonworks cluster

This article explains advanced settings for configuring the Pentaho Server to connect to a working Hortonworks Data Platform (HDP) cluster.

Before you begin

Before you set up Pentaho to connect to an HDP cluster, you must perform the following tasks:

Procedure

  1. Check the Components Reference to verify that your Pentaho version supports your version of the HDP cluster.

  2. Prepare your HDP cluster by performing the following tasks:

    1. Configure an HDP cluster.

      See the Hortonworks documentation if you need help.
    2. Install any required services and service client tools.

    3. Test the cluster.

  3. From your Hadoop administrator, get the connection information for the cluster and services that you intend to use. Some of this information may be available from Ambari or other cluster management tools.

  4. Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find the dfs.permissions.superusergroup property in the hdfs-site.xml file on your cluster or in the cluster management application (see the example after this list).

  5. Read the Notes section to review special configuration instructions for your version of HDP.

  6. Set up the Pentaho Server to connect to a Hadoop cluster. You need to install the driver for your version of HDP.
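
For reference, here is a minimal sketch of the dfs.permissions.superusergroup property from step 4 as it might appear in hdfs-site.xml. The value supergroup is the Hadoop default; your cluster may define a different group, and the yarn user must be a member of whichever group is configured.

    <property>
      <name>dfs.permissions.superusergroup</name>
      <!-- The yarn user must belong to this group; supergroup is the Hadoop default -->
      <value>supergroup</value>
    </property>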

Set up a secured cluster

If you are connecting to an HDP cluster that is secured with Kerberos, you must also perform the following tasks.

Procedure

  1. Configure Kerberos security on the cluster, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.

  2. Configure the name, data, secondary name, job tracker, and task tracker nodes to accept remote connection requests.

  3. If you have deployed Hadoop using an enterprise-level program, set up Kerberos for the name, data, secondary name, job tracker, and task tracker nodes.

  4. Add the user account credentials to the Kerberos database for each PDI client user who should have access to the Hadoop cluster (see the sketch after this procedure).

  5. Verify that an operating system user account on each node in the Hadoop cluster exists for each user who you want to add to the Kerberos database.

    Add operating system user accounts if necessary.
    Note: The user account UIDs must be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
  6. Set up Kerberos on your Pentaho machines. For instructions, see Set Up Kerberos for Pentaho.
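
The following is a minimal sketch of steps 4 and 5, assuming an MIT Kerberos KDC and a Linux cluster node; the admin principal, user name, and realm are placeholders.

    # On the KDC: add a principal for a PDI client user (placeholder name and realm)
    kadmin -p admin/admin@EXAMPLE.COM -q "addprinc jdoe@EXAMPLE.COM"

    # On each cluster node: confirm that a matching OS account exists
    # and that its UID is above min.user.id (usually 1000)
    id -u jdoe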

Edit the configuration files for users

Your cluster administrators must download the configuration files from the cluster for the applications your teams are using, and then edit the files to include Pentaho-specific and user-specific parameters. These files must be copied to the user's directory: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>. This directory and the config.properties file are created when you set up a named connection.
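
For example, for a hypothetical user jdoe with a named connection called hdp31, the edited files would go in jdoe/.pentaho/metastore/pentaho/NamedCluster/Configs/hdp31.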

The following files must be modified and provided to your users:

  • config.properties
  • hbase-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

Edit HBase site XML file

If you are using HBase, you must edit the location of the temporary directory in the hbase-site.xml file to create an HBase local storage directory.

Perform the following steps to edit the hbase-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the hbase-site.xml file.

  2. Add the following value:

    Parameter: hbase.tmp.dir
    Value: /tmp/hadoop/hbase (see the example property block after this procedure)
  3. Save and close the file.
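
For reference, a minimal sketch of this addition as a property block in hbase-site.xml; /tmp/hadoop/hbase is the value given above and can be changed to another local directory.

    <property>
      <name>hbase.tmp.dir</name>
      <!-- Local storage/temporary directory for HBase, per the value above -->
      <value>/tmp/hadoop/hbase</value>
    </property>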

Edit Hive site XML file

If you are using Hive, follow these instructions to set the location of the Hive metastore in the hive-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the hive-site.xml file.

  2. Add the following value:

    Parameter: hive.metastore.uris
    Value: Set this to the location of your Hive metastore (see the example property block after this procedure).
  3. Save and close the file.
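
A minimal sketch of this setting as it might appear in hive-site.xml. The hostname metastore-host is a placeholder, and 9083 is the usual default port for the Hive metastore Thrift service; use the values your administrator provides.

    <property>
      <name>hive.metastore.uris</name>
      <!-- Placeholder host; 9083 is the common Hive metastore Thrift port -->
      <value>thrift://metastore-host:9083</value>
    </property>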

Next steps

See Hive for further configuration information when using Hive with Spark on AEL.

Edit Mapred site XML file

If you are using MapReduce, you must edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Perform the following steps to edit the mapred-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.

  2. Add the following values:

    Parameter: mapreduce.jobhistory.address
    Value: Set this to the address (host and port) of the MapReduce JobHistory server, where the job history logs are stored (see the example property block after this procedure).

    Parameter: mapreduce.application.classpath
    Value: Add classpath information. Here is an example:

    <property>
      <name>mapreduce.application.classpath</name>
      <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*
        :$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*
        :$PWD/mr-framework/hadoop/share/hadoop/common/*
        :$PWD/mr-framework/hadoop/share/hadoop/common/lib/*
        :$PWD/mr-framework/hadoop/share/hadoop/yarn/*
        :$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*
        :$PWD/mr-framework/hadoop/share/hadoop/hdfs/*
        :$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*
        :/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar
        :/etc/hadoop/conf/secure
      </value>
    </property>

    Parameter: mapreduce.application.framework.path
    Value: Set the framework path. Here is an example:

    <property>
      <name>mapreduce.application.framework.path</name>
      <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
    </property>
  3. Verify the mapreduce.app-submission.cross-platform property is in the mapred-site.xml file. If it is not in the file, add it as follows.

    Parameter: mapreduce.app-submission.cross-platform
    Value: Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms.

    <property>
      <name>mapreduce.app-submission.cross-platform</name>
      <value>true</value>
    </property>
  4. Save and close the file.
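
A minimal sketch of the mapreduce.jobhistory.address property from step 2. The hostname historyserver-host is a placeholder, and 10020 is the usual default port for the MapReduce JobHistory server; use the values from your cluster.

    <property>
      <name>mapreduce.jobhistory.address</name>
      <!-- Placeholder host; 10020 is the common JobHistory server port -->
      <value>historyserver-host:10020</value>
    </property>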

Edit YARN site XML file

If you are using YARN, you must verify that the following parameters are set in the yarn-site.xml file.

Perform the following steps to edit the yarn-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the yarn-site.xml file.

  2. Add these values:

    Parameter: yarn.application.classpath
    Value: Add the classpaths needed to run YARN applications, as shown in the following example. Use commas to separate multiple paths.

    <property>
      <name>yarn.application.classpath</name>
      <value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,
    /usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,
    /usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,
    /usr/hdp/current/hadoop-yarn-client/lib/*</value>
    </property>

    Parameter: yarn.resourcemanager.hostname
    Value: Update the hostname for your environment or use the default: sandbox.hortonworks.com (see the example property blocks after this procedure).

    Parameter: yarn.resourcemanager.address
    Value: Update the hostname and port for your environment.

    Parameter: yarn.resourcemanager.admin.address
    Value: Update the hostname and port for your environment.
  3. Save and close the file.
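
For reference, a minimal sketch of the three resource manager settings as property blocks. The hostname is the default from the table above, and the ports are placeholders (8032 and 8033 are common defaults for the resource manager and its admin interface); use the values from your cluster.

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>sandbox.hortonworks.com</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address</name>
      <!-- Placeholder host and port; 8032 is a common default -->
      <value>sandbox.hortonworks.com:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address</name>
      <!-- Placeholder host and port; 8033 is a common default -->
      <value>sandbox.hortonworks.com:8033</value>
    </property>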

Oozie configuration

If you are using Oozie on a cluster, you must configure the cluster and the server. For instructions, see Using Oozie.

Windows configuration for a secured cluster

If you are on a Windows machine, perform the following steps to edit the configuration properties:

Procedure

  1. Navigate to the server/pentaho-server directory and open the start-pentaho.bat file with any text editor.

  2. Set the CATALINA_OPTS environment variable so that the java.security.krb5.conf system property points to the location of the krb5.conf or krb5.ini file on your system, as shown in the following example:

    set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
    
  3. Save and close the file.

Connect to a Hadoop cluster with the PDI client

After you have set up the Pentaho Server to connect to a cluster, you must configure and test the connection to the cluster. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.

Connect other Pentaho components to the Hortonworks cluster

The following sections explain how to create and test a connection to the cluster in the Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME). Creating and testing a connection to the cluster involves the tasks described in the following sections.

Create and test connections

For each Pentaho component, create and test the connection as described in the following list.

  • Pentaho Server for DI

    Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA

    Create a connection to the cluster in the Data Source Wizard.

  • PME

    Create a connection to the cluster in PME.

  • PRD

    Create a connection to the cluster in PRD.

After you have connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from machines that are properly configured to connect to the cluster.

Here is what users need to connect:

  • Hadoop Distribution and version of the cluster
  • Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
  • Oozie URL (if used)

Users also require the permissions to access the directories they need on HDFS, such as their home directory and any other required directories.

Users might also require more information for specific job entries, transformation steps, and services. Here's a more detailed list of information that your users might need from you.

Notes

The following sections are special notes for the Hortonworks Data Platform.

Simba Spark SQL driver support

If you are using Pentaho 7.0 or later, the Pentaho HDP drivers support the Simba Spark SQL driver. You need to download, install, and configure the Simba Spark SQL driver to use Simba Spark SQL with PDI.

Procedure

  1. Download the Simba Spark SQL driver.

  2. Extract the ZIP file, and then copy the following 3 files into the lib/ directory of the Pentaho HDP driver:

    • SparkJDBC41.jar
    • TCLIServiceClient.jar
    • QI.jar
  3. In the Database Connection window, select the SparkSQL option.

    The default port for the Spark thrift server is 10015.
  4. For secure connections, set the following additional parameters on the JDBC URL through the Options tab (see the sketch after this procedure):

    • KrbServiceName
    • KrbHostFQDN
    • KrbRealm
  5. For unsecured connections, if your Spark SQL configuration specifies hive.server2.authentication=NONE, then include an appropriate User Name in the Database Connection window.

    Otherwise, the connection is assumed to use NOSASL authentication, which causes a connection failure after a timeout.
  6. Stop and restart the component.
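
As a rough illustration only, the Kerberos options from step 4 might appear appended to the connection URL as shown below. The host, realm, and service name are placeholders, and the exact URL scheme and option handling depend on your Simba driver version; rely on the Options tab and your administrator's values rather than this sketch.

    jdbc:spark://sparkhost.example.com:10015/default;KrbServiceName=spark;KrbHostFQDN=sparkhost.example.com;KrbRealm=EXAMPLE.COM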

HDP 3.1 notes

The following note addresses issues related to HDP 3.1.

Using the 3.0 driver for HDP 3.1 clusters

You can use the HDP 3.0 driver to connect to your HDP 3.1 cluster by updating the PDI config.properties file.

Perform the following steps to update your java.system.hdp.version driver configuration parameter to HDP 3.1:

Procedure

  1. On your HDP cluster, use the hdp-select command to determine the full version of your cluster, such as '3.1.0.0-78'.

  2. In the Pentaho distribution, open the config.properties file located in the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.

  3. Change the java.system.hdp.version parameter from the existing version to the full version of your cluster, which you obtained by running the hdp-select command in Step 1. For example, the existing version of '3.0.0.0-1634' might be changed to '3.1.0.0-78'.

  4. Save and close the config.properties file.
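
For example, using the version mentioned in step 3, the updated line in config.properties might look like the following; your own cluster's full version string will differ.

    java.system.hdp.version=3.1.0.0-78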

Results

Your HDP 3.0 driver now works with your HDP 3.1 cluster after you restart your PDI client.