
Advanced settings for connecting to Azure HDInsight

This article explains advanced settings for configuring the Pentaho Server to connect to Azure HDInsight (HDI).

Before you begin

Before you begin setting up Pentaho to connect to HDI, perform the following tasks:

Procedure

  1. Check the Components Reference to verify that your Pentaho version supports your version of HDI.

  2. Prepare your HDI instance by performing the following tasks:

    1. Configure your Azure HDInsight instance.

      View the HDI documentation if you need help.
    2. Install any required services and service client tools.

    3. Test the platform.

  3. Contact your platform administrator for the connection information for HDI and any services that you intend to use. Some of this information may be available from the Azure management tools or other cluster management tools. You also need to supply some of this information to your users after you are finished.

  4. Add the YARN user on the platform to the group defined by the dfs.permissions.superusergroup property, which is located in the hdfs-site.xml file on your platform (see the example after this procedure).

  5. Set up the Pentaho Server to connect to a Hadoop cluster. You need to install the driver for your version of HDI.
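
For reference, the dfs.permissions.superusergroup setting from step 4 appears in hdfs-site.xml on the platform as shown below. This is a minimal sketch: "supergroup" is the Hadoop default group name, and your platform may define a different group.

      <!-- hdfs-site.xml on the HDI platform: members of this group have HDFS superuser rights.
           "supergroup" is the Hadoop default; substitute the group name used on your platform. -->
      <property>
            <name>dfs.permissions.superusergroup</name>
            <value>supergroup</value>
      </property>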

Set up a secured instance

If you are connecting to HDI secured with Kerberos, you must also perform the following tasks.

Procedure

  1. Configure Kerberos security on the platform, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server (see the illustrative settings after this procedure).

  2. Configure the following nodes to accept remote connection requests:

    • Name nodes
    • Data nodes
    • Secondary name nodes
    • Job tracker nodes
    • Task tracker nodes
  3. If you have deployed HDI using an enterprise-level program, set up Kerberos for name, data, secondary name, job tracker, and task tracker nodes.

  4. Add user account credentials to the Kerberos database for each Pentaho user that needs access to HDI.

  5. Verify that an operating system user account exists on each node in HDI for each user you want to add to the Kerberos database. Add operating system user accounts if necessary.

    Note: The user account UIDs should be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
  6. Set up Kerberos on your Pentaho machines. For instructions, see Set Up Kerberos for Pentaho.
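
As a point of reference for step 1, Kerberos security on a Hadoop platform is typically reflected in core-site.xml by the standard Hadoop security properties shown below. This is only an illustrative sketch; the exact settings are managed on the HDI platform itself.

      <!-- Illustrative core-site.xml entries on a Kerberos-secured platform.
           These are standard Hadoop property names; actual values are managed by the platform. -->
      <property>
            <name>hadoop.security.authentication</name>
            <value>kerberos</value>
      </property>
      <property>
            <name>hadoop.security.authorization</name>
            <value>true</value>
      </property>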

Edit configuration files for users

Ask your Azure administrator to download the configuration files from the platform for the applications your teams are using, and then edit these files to include Pentaho-specific and user-specific parameters. Copy the files to the user's directory: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>. This directory and the config.properties file are created when you create a named connection.

Modify the following files and provide them to your users:

  • core-site.xml (if you are using a secured instance of HDI)
  • hbase-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

Edit Core site XML file

If you are using a secured instance of HDI, follow these instructions to update the core-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the core-site.xml file.

  2. Add the following values for your storage type. (A consolidated WASB example follows this procedure.)

    • If you are using WASB storage:

      fs.AbstractFileSystem.wasb.impl
      Add this property to associate your WASB storage with the file system.

      <property>
            <name>fs.AbstractFileSystem.wasb.impl</name>
            <value>org.apache.hadoop.fs.azure.Wasb</value>
      </property>

      pentaho.runtime.fs.default.name
      Add this property to specify the container and storage account names.

      <property>
            <name>pentaho.runtime.fs.default.name</name>
            <value>wasb://<container name associated with cluster>@<storage account name>.blob.core.windows.net</value>
      </property>

    • If you are using ADLS storage:

      pentaho.runtime.fs.default.name
      Add this property to specify the container and storage account names.

      <property>
            <name>pentaho.runtime.fs.default.name</name>
            <value>abfs://<container name associated with cluster>@<storage account name>.dfs.core.windows.net</value>
      </property>
  3. Save and close the file.
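
As a consolidated reference for step 2, a core-site.xml configured for WASB storage might contain entries like the following. The container name "mycontainer" and storage account name "mystorage" are hypothetical placeholders; substitute the values for your cluster.

      <!-- Illustrative WASB entries in core-site.xml. "mycontainer" and "mystorage"
           are placeholder names for the cluster container and storage account. -->
      <property>
            <name>fs.AbstractFileSystem.wasb.impl</name>
            <value>org.apache.hadoop.fs.azure.Wasb</value>
      </property>
      <property>
            <name>pentaho.runtime.fs.default.name</name>
            <value>wasb://mycontainer@mystorage.blob.core.windows.net</value>
      </property>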

Edit HBase site XML file

If you are using HBase, edit the location of the temporary directory in the hbase-site.xml file to create an HBase local storage directory.

Perform the following steps to edit the hbase-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the hbase-site.xml file.

  2. Add the following value (see the example entry after this procedure):

    Parameter: hbase.tmp.dir
    Value: /tmp/hadoop/hbase
  3. Save and close the file.
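
The entry referenced in step 2 takes the same <property> form as the other settings in this article:

      <!-- hbase-site.xml: local temporary storage directory for HBase, as listed above. -->
      <property>
            <name>hbase.tmp.dir</name>
            <value>/tmp/hadoop/hbase</value>
      </property>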

Edit Hive site XML file

If you are using Hive, follow these instructions to set the location of the Hive metastore in the hive-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the hive-site.xml file.

  2. Add the following values (see the sketch after this procedure):

    hive.metastore.uris
    Set this property to the location of your Hive metastore if it differs from what is already on your instance of HDI.

    fs.azure.account.keyprovider.eastorageacct2.blob.core.windows.net
    Add this property for security.

    <property>
          <name>fs.azure.account.keyprovider.eastorageacct2.blob.core.windows.net</name>
          <value>hive/<Kerberos principal realm@example.com></value>
    </property>
  3. Save and close the file.
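
For hive.metastore.uris in step 2, the value is normally a thrift:// URI. The following is a minimal sketch: the host is a placeholder, and 9083 is the default Hive metastore Thrift port, which may differ in your environment.

    <!-- Illustrative hive-site.xml entry. Replace the placeholder host with your
         Hive metastore host; 9083 is the default metastore Thrift port. -->
    <property>
          <name>hive.metastore.uris</name>
          <value>thrift://<metastore host>:9083</value>
    </property>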

Next steps

See Hive for further configuration information when using Hive with Spark on AEL.

Edit Mapred site XML file

If you are using MapReduce, edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Perform the following steps to edit the mapred-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.

  2. Verify that the mapreduce.jobhistory.address and mapreduce.job.hdfs-servers properties are in the mapred-site.xml file. If they are not in the file, add them as follows.

    mapreduce.jobhistory.address
    Set this property to the location where job history logs are stored, as shown in the following example.

    <property>
       <name>mapreduce.jobhistory.address</name>
       <value><active node name in the cluster>:10020</value>
    </property>

    mapreduce.job.hdfs-servers
    Add this property for YARN.

    <property>
       <name>mapreduce.job.hdfs-servers</name>
       <value>hdfs://<active node name in the cluster>:8020</value>
    </property>
  3. Save and close the file.
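
The cross-platform behavior mentioned at the start of this section is commonly controlled by the standard Hadoop property mapreduce.app-submission.cross-platform. The following entry is an assumption rather than part of the property list above; add it only if your environment requires submitting MapReduce jobs from a client operating system that differs from the cluster operating system (for example, a Windows client).

    <!-- Assumed entry: enables MapReduce job submission from a client OS that differs
         from the cluster OS. Not listed in the procedure above; verify before using. -->
    <property>
       <name>mapreduce.app-submission.cross-platform</name>
       <value>true</value>
    </property>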

Edit YARN site XML file

If you are using YARN, set the following parameters in the yarn-site.xml file.

Perform the following steps to edit the yarn-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the yarn-site.xml file.

  2. Add the following values. (A consolidated example follows this procedure.)

    yarn.resourcemanager.hostname
    Set this property to the hostname of the resource manager in your environment, as shown in the following example.

    <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>hdfs://<active node name in the cluster></value>
    </property>

    yarn.resourcemanager.address
    Set this property to the hostname and port for your environment. For example, <value>hdfs://<active node name in the cluster>:8050</value>

    yarn.resourcemanager.admin.address
    Set this property to the hostname and port for your environment. For example, <value>hdfs://<active node name in the cluster>:8141</value>
  3. Save and close the file.
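
A consolidated sketch of the three entries above, using the placeholder values from step 2:

    <!-- Illustrative yarn-site.xml entries; substitute your resource manager host. -->
    <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>hdfs://<active node name in the cluster></value>
    </property>
    <property>
          <name>yarn.resourcemanager.address</name>
          <value>hdfs://<active node name in the cluster>:8050</value>
    </property>
    <property>
          <name>yarn.resourcemanager.admin.address</name>
          <value>hdfs://<active node name in the cluster>:8141</value>
    </property>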

Oozie configuration

If you are using Oozie on HDI, you must configure the platform and the server. For instructions, see Using Oozie.

Windows configuration for a secured cluster

If you are on a Windows machine, perform the following steps to edit the configuration properties:

Procedure

  1. Navigate to the server/pentaho-server directory and open the start-pentaho.bat file with any text editor.

  2. Set the CATALINA_OPTS environment variable to the location of the krb5.conf or krb5.ini file on your system, as shown in the following example:

    set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
    
  3. Save and close the file.

Connect to HDI with the PDI client

After you have set up the Pentaho Server to connect to HDI, configure and test the connection to the platform. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.

Connect other Pentaho components to HDI

The following sections explain how to create and test a connection to HDI in the Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME). Creating and testing a connection to HDI in each of these components is covered in the following section.

Create and test connections for other Pentaho components

For each Pentaho component, create and test the connection as described in the following list.

  • Pentaho Server for DI

    Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA

    Create a connection to HDI in the Data Source Wizard.

  • PME

    Create a connection to HDI in PME.

  • PRD

    Create a connection to HDI in PRD.

After you have properly connected to HDI and its services, provide the connection information to your users who need access to the platform and its services.

Users need the following information and permissions to connect:

  • Distribution and version of HDI
  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
  • Oozie URL (if used)
  • Permissions to access the directories that they need on HDFS, including their home directory and any other required directories.

Additionally, users might need more information depending on the transformation steps, job entries, and services they use. See Hadoop connection and access information list.