Advanced settings for connecting to Cloudera Data Platform
This article explains advanced settings for configuring the Pentaho Server to connect to Cloudera Data Platform (CDP).
Before you begin
Procedure
Check the Components Reference to verify that your Pentaho version supports your version of CDP.
Prepare your CDP by performing the following tasks:
Configure your Cloudera Data Platform.
See CDP's documentation if you need help.
Install any required services and service client tools.
Test the platform.
Contact your platform administrator for connection information to CDP and services that you intend to use. Some of this information may be from Cloudera Manager or other management tools. You also need to supply some of this information to users after you are finished.
Add the YARN user on the platform to the group defined by the dfs.permissions.superusergroup property. The dfs.permissions.superusergroup property can be found in the hdfs-site.xml file on your platform or in the Cloudera Manager.
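For reference, the property in hdfs-site.xml looks like the following sketch; supergroup is the common default group name, but your platform may use a different value:

```xml
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```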
Set up the Pentaho Server to connect to a Hadoop cluster. You need to install the driver for your version of CDP.
Set up a secured instance of CDP
Procedure
Configure Kerberos security on the platform, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.
Configure the following items to accept remote connection requests:
- Name
- Data
- Secondary
- Job tracker
- Task tracker nodes
If you have deployed CDP using an enterprise-level program, set up Kerberos for name, data, secondary name, job tracker, and task tracker nodes.
Add user account credentials to the Kerberos database for each Pentaho user that needs access to CDP.
Verify that an operating system user account exists on each node in CDP for each user you want to add to the Kerberos database. Add operating system user accounts if necessary.
Note: The user account UIDs should be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
Set up Kerberos on your Pentaho machines. For instructions, see Set Up Kerberos for Pentaho.
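As a quick sanity check before adding an account to the Kerberos database, a small shell helper like the following (a hypothetical sketch, assuming the default min.user.id of 1000) can verify that an account's UID clears the threshold:

```shell
#!/bin/sh
# Check whether a UID clears the cluster's min.user.id threshold.
# 1000 is the common default; pass your cluster's actual value as the
# second argument if it differs.
check_uid() {
  uid="$1"
  min="${2:-1000}"
  if [ "$uid" -ge "$min" ]; then
    echo "ok"
  else
    echo "below-min"
  fi
}

# Example: check the current user's UID against the default threshold.
check_uid "$(id -u)"
```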
Edit configuration files for users
Your Cloudera administrators must download the configuration files from the platform for the applications your teams are using, and then edit these files to include Pentaho-specific and user-specific parameters. These files must be copied to the user's directory: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>. This directory and the config.properties file are created when you create a named connection.
The following files must be modified and provided to your users:
- config.properties
- core-site.xml (if you are using a secured instance of CDP)
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
Edit Core site XML file
Procedure
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the core-site.xml file.
Add the following values:
- hadoop.proxyuser.oozie.hosts: Set to any Oozie hosts on your instance of CDP.
- hadoop.proxyuser.oozie.groups: Set to any Oozie groups on your instance of CDP.
- hadoop.proxyuser.<security_service>.hosts: Set to any other proxy user hosts on your instance of CDP.
- hadoop.proxyuser.<security_service>.groups: Set to any other proxy user groups on your instance of CDP.
- fs.s3a.access.key: Set to your S3 access key if you are accessing S3 elements on your instance of CDP.
- fs.s3a.secret.key: Set to your S3 secret key if you are accessing S3 elements on your instance of CDP.
Save and close the file.
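Put together, the core-site.xml additions might look like the following sketch; the host, group, and key values here are placeholders that you must replace with values from your environment:

```xml
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>oozie-users</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_S3_SECRET_KEY</value>
</property>
```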
Edit Hive site XML file
Procedure
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the hive-site.xml file.
Add the following values:
- hive.metastore.uris: Set this to the location of your Hive metastore if it differs from what is on your instance of CDP.
- hive.server2.enable.impersonation: Add this property if you are using impersonation for security.
  <property>
    <name>hive.server2.enable.impersonation</name>
    <value>true</value>
  </property>
- hive.server2.enable.doAs: Add this property if you are using impersonation for security.
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
- tez.lib.uris: Add this property if you are using Hive 3 on Tez.
  <property>
    <name>tez.lib.uris</name>
    <value>/user/tez/0.9.1.7.1.4.0-203/tez.tar.gz</value>
  </property>
Save and close the file.
Edit Mapred site XML file
Perform the following steps to edit the mapred-site.xml file:
Procedure
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.
Verify the mapreduce.jobhistory.address and mapreduce.app-submission.cross-platform properties are in the mapred-site.xml file. If they are not in the file, add them as follows.
- mapreduce.jobhistory.address: Set this to the location where job history logs are stored.
- mapreduce.app-submission.cross-platform: Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms.
  <property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
  </property>
Save and close the file.
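As a sketch, the two properties together might look like the following; the history-server address shown here is a placeholder (10020 is the usual default port, but check your cluster):

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
```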
Edit YARN site XML file
Perform the following steps to edit the yarn-site.xml file:
Procedure
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the yarn-site.xml file.
Add the following values:
- yarn.application.classpath: Add the classpaths you need to run YARN applications. Use commas to separate multiple paths.
- yarn.resourcemanager.hostname: Change to the hostname of the resource manager in your environment.
- yarn.resourcemanager.address: Change to the hostname and port for your environment.
- yarn.resourcemanager.admin.address: Change to the hostname and port for your environment.
- yarn.resourcemanager.proxy-user-privileges.enabled: Add this property if you are using a proxy user for security.
  <property>
    <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
    <value>true</value>
  </property>
Save and close the file.
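For example, the resource manager entries might look like the following sketch; the hostname is a placeholder, and the ports shown (8032 and 8033) are the YARN defaults, which may differ in your environment:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemanager.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>resourcemanager.example.com:8033</value>
</property>
```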
Oozie configuration
If you are using Oozie on CDP, you must configure the platform and the server. For instructions, see Using Oozie.
Windows configuration for a secured cluster
If you are on a Windows machine, perform the following steps to edit the configuration properties:
Procedure
Navigate to the server/pentaho-server directory and open the start-pentaho.bat file with any text editor.
Set the CATALINA_OPTS environment variable to the location of the krb5.conf or krb5.ini file on your system, as shown in the following example:
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
Save and close the file.
Connect to CDP with the PDI client
After you have set up the Pentaho Server to connect to CDP, you must configure and test the connection to the platform. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.
Connect other Pentaho components to CDP
The following sections explain how to create and test a connection to CDP in the Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME). Creating and testing a connection to CDP in these components involves two tasks:
- Install a driver for the Pentaho Server. See Set up the Pentaho Server to connect to a Hadoop cluster.
- Create and test the CDP connections to other Pentaho components.
Create and test connections for other Pentaho components
For each Pentaho component, create and test the connection as described in the following list.
Pentaho Server for DI
Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA
Create a connection to CDP in the Data Source Wizard.
PME
Create a connection to CDP in PME.
PRD
Create a connection to CDP in PRD.
After you have properly connected to CDP and its services, provide connection information to your users who need access to the platform and its services.
Users need the following information and permissions to connect:
- Distribution and version of CDP
- Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
- Oozie URL (if used)
- Permissions to access the directories they need on HDFS, including their home directory and any other required directories.
Additionally, users might need more information depending on the transformation steps, job entries, and services they use. Here's a more detailed list of information that your users might need from you.