Set up Pentaho to Connect to an Amazon EMR Cluster
Overview
Learn how to configure Pentaho to connect to an Amazon Elastic MapReduce (EMR) cluster.
These instructions explain how to configure Pentaho's Amazon EMR shim so Pentaho can connect to a working EMR cluster.
Before You Begin
Do these things before you configure the shim.
Task | Description |
---|---|
Verify Support | Check the Component Reference to verify that your Pentaho version supports your version of the EMR cluster. |
Set Up an EMR Cluster | Pentaho can connect to an EMR cluster. |
Install PDI on EMR | Install PDI on an EC2 instance that is within the same VPC as the EMR Hadoop Cluster. |
Get Connection Information | Get connection information for the cluster and services that you will use. You can get this information from your Hadoop administrator or from a cluster management tool. |
Add Yarn User to Superuser Group | Add the yarn user on the cluster to the group defined by the dfs.permissions.superusergroup property. This property can be found in the hdfs-site.xml file on your cluster or in the cluster management application. A sketch of this property appears after this table. |
Review the Version-Specific Notes Section | Read the Version-Specific Notes section to review special configuration instructions for your version of EMR. |
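The Add Yarn User to Superuser Group task above refers to the dfs.permissions.superusergroup property. As a point of reference, here is a minimal sketch of how that property typically appears in the cluster's hdfs-site.xml; the value supergroup is the Hadoop default and is only an assumption, so check your own cluster's setting before adding the yarn user to that group.

```xml
<!-- hdfs-site.xml on the cluster (sketch only).
     "supergroup" is the Hadoop default value and is shown here as an
     assumption; your cluster may define a different group. -->
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
```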
Edit Configuration Files on Cluster
There are no Pentaho-specific edits that need to be made to the *-site.xml configuration files on the cluster.
Configure Pentaho Component Shims
You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster.
- Spoon (PDI Client)
- Pentaho Data Integration (DI) Server
- Business Analytics (BA) Server (including Analyzer and Pentaho Interactive Reporting).
- Pentaho Report Designer (PRD)
- Pentaho Metadata Editor (PME)
As a best practice, configure the shim in Spoon first. Spoon has handy features that will help you test your configuration. Then, copy the tested Spoon configuration files to other components, making changes if necessary.
You can also opt to go through these instructions for each Pentaho component individually, without copying the shim files from Spoon. If you do not plan to connect to the cluster from Spoon, you can configure the shim in another component first instead.
Here are the shim configuration steps.
- Locate the Shim Directories
- Select the Correct Shim
- Download the Shim from the Support Portal (Optional Step)
- Copy Configuration Files from the Cluster to the Shim
- Edit the Shim Configuration Files
- Connect to the EMR Cluster from Spoon
Locate the Pentaho Big Data Plugin and Shim Directories
Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the location of this directory in each component to complete shim configuration and testing tasks.
<pentaho home> is the directory where Pentaho is installed.
Components | Location of Pentaho Big Data Plugin Directory |
---|---|
Spoon | <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin |
DI Server | <pentaho home>/server/data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin |
BA Server | <pentaho home>/server/biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin |
Pentaho Report Designer | <pentaho home>/design-tools/report-designer/plugins/pentaho-big-data-plugin |
Pentaho Metadata Editor | <pentaho home>/design-tools/metadata-editor/plugins/pentaho-big-data-plugin |
Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three or four letter Hadoop Distribution abbreviation followed by the Hadoop Distribution's version number. The version number does not contain a decimal point. For example, the shim directory named cdh54 is the shim for the CDH (Cloudera Distribution for Hadoop), version 5.4. Here is a list of the shim directory abbreviations.
Abbreviation | Shim |
---|---|
cdh | Cloudera's Distribution of Apache Hadoop |
emr | Amazon Elastic Map Reduce |
hdp | Hortonworks Data Platform |
mapr | MapR |
Select the Correct Shim
For the location of the pentaho-big-data-plugin directory referenced in these instructions, see Locate the Pentaho Big Data Plugin and Shim Directories.
Although Pentaho often supports one or more versions of a Hadoop distribution, the Pentaho BA suite download contains only the latest, supported, Pentaho-certified version of the shim. Other supported shim versions can be downloaded from the Pentaho Customer Support Portal.
Before you begin, verify that the shim you want is supported by your version of Pentaho shown in the Components Reference.
- In a shell tool, go to the pentaho-big-data-plugin/hadoop-configurations directory. Shim directories are listed there.
- If the shim you want to use is there, you can go to the next step: Copy the Configuration Files from Cluster to Shim.
- Go to the Pentaho Customer Support Portal Knowledge Base's Downloads page. You are prompted to log in if you have not done so already.
- Enter the name of the shim you want in the search box. Select the shim from the search results.
- Read the instructions, then download the shim. You might need to scroll down to see the download link.
- Unzip the downloaded shim package to the pentaho-big-data-plugin/hadoop-configurations directory.
- Go to Copy the Configuration Files from Cluster to Shim.
Copy the Configuration Files from Cluster to Shim
If you are using a cluster, copying configuration files from the cluster to the shim keeps the configuration files in sync and helps prevent configuration errors that are hard to troubleshoot.
The location of the pentaho-big-data-plugin directory listed in these instructions is referenced in the Locate the Shim Directories section of this document.
- Back up the existing EMR shim files in the pentaho-big-data-plugin/hadoop-configurations/emrxx directory.
- Copy the following configuration files from the EMR cluster to pentaho-big-data-plugin/hadoop-configurations/emrxx. They will overwrite the existing files.
- core-site.xml
- hdfs-site.xml
- emrfs-site.xml
- httpfs-site.xml
- mapred-site.xml
- yarn-site.xml
Edit the Shim Configuration Files
The location of the pentaho-big-data-plugin directory listed in these instructions is referenced in the Locate the Shim Directories section of this document.
You need to verify or change authentication, Oozie, Hive, MapReduce, and YARN settings. Changes are made in these shim configuration files:
- core-site.xml
- mapred-site.xml
Edit core-site.xml
If you plan to run MapReduce jobs on an EMR cluster, make sure you have read, write, and execute access to the S3 buffer directories specified in the core-site.xml file on the EMR cluster.
Edit the core-site.xml file to add your AWS Access Key ID and secret access key. You will also need to indicate your LZO compression setting. A consolidated sketch of the resulting edits appears after these steps.
- Go to pentaho-big-data-plugin/hadoop-configurations/emrxx and open core-site.xml.
- Add these values.
Parameter | Values |
---|---|
fs.s3.awsAccessKeyId | Value of your S3 AWS Access Key ID. <property> <name>fs.s3.awsAccessKeyId</name> <value>[INSERT YOUR VALUE HERE]</value> </property> |
fs.s3.awsSecretAccessKey | Value of your AWS secret access key. <property> <name>fs.s3.awsSecretAccessKey</name> <value>[INSERT YOUR VALUE HERE]</value> </property> |
- If needed, enter the AWS Access Key ID and secret access key for S3N like this:
Parameter | Values |
---|---|
fs.s3n.awsAccessKeyId | Value of your S3N AWS Access Key ID. <property> <name>fs.s3n.awsAccessKeyId</name> <value>[INSERT YOUR VALUE HERE]</value> </property> |
fs.s3n.awsSecretAccessKey | Value of your S3N AWS secret access key. <property> <name>fs.s3n.awsSecretAccessKey</name> <value>[INSERT YOUR VALUE HERE]</value> </property> |
- Modify these values.
Parameter | Values |
---|---|
fs.s3n.impl | <property> <name>fs.s3n.impl</name> <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value> </property> |
fs.s3.impl | <property> <name>fs.s3.impl</name> <value>org.apache.hadoop.fs.s3.S3FileSystem</value> </property> |
- LZO is a compression format that EMR supports. If you want to configure LZO compression, you will need to download a jar file. If you do not, you will need to remove a parameter from the core-site.xml file.
- If you are not going to use LZO compression: Remove any com.hadoop.compression.lzo.LzoCodec references from the io.compression.codecs parameter in the core-site.xml file.
- If you are going to use LZO compression: Download the LZO jar and add it to the pentaho-big-data-plugin/hadoop-configurations/emr3x/lib directory. The LZO jar can be found here: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/
- Save and close the file.
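As referenced above, here is a minimal sketch of how the edited portion of the shim's core-site.xml might look after these steps, assuming both S3 and S3N credentials are needed and LZO compression is not used. The placeholder values are not real keys; substitute your own.

```xml
<!-- Sketch of the edited portion of
     pentaho-big-data-plugin/hadoop-configurations/emrxx/core-site.xml.
     Placeholder values are assumptions; substitute your own keys. -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>[INSERT YOUR S3 ACCESS KEY ID HERE]</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>[INSERT YOUR S3 SECRET ACCESS KEY HERE]</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>[INSERT YOUR S3N ACCESS KEY ID HERE]</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>[INSERT YOUR S3N SECRET ACCESS KEY HERE]</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<!-- If you are not using LZO compression, also remove any
     com.hadoop.compression.lzo.LzoCodec entry from the
     io.compression.codecs value elsewhere in this file. -->
```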
Edit mapred-site.xml
Make changes to indicate where job history logs are stored and to allow MapReduce jobs to run on different platforms. A sketch of the edited file appears after these steps.
- Go to pentaho-big-data-plugin/hadoop-configurations/emrxx and open mapred-site.xml.
- Make the following changes.
Parameter | Value |
---|---|
mapreduce.app-submission.cross-platform | This property allows MapReduce jobs submitted from a Windows client to run on a Linux cluster, and vice versa. <property> <name>mapreduce.app-submission.cross-platform</name> <value>true</value> </property> |
- Save and close the file.
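Here is a minimal sketch of how the edited portion of mapred-site.xml might look after this step. The mapreduce.jobhistory.address property is included only as an illustration of where a job history setting would go; the property is standard Hadoop, but the placeholder hostname and whether you need to set it at all are assumptions that depend on the configuration files copied from your cluster.

```xml
<!-- Sketch of the edited portion of
     pentaho-big-data-plugin/hadoop-configurations/emrxx/mapred-site.xml. -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
<!-- Assumption for illustration only: if the file copied from your cluster
     does not already point at the job history server, a property like this
     is typically used. Replace the placeholder hostname with your cluster's
     actual job history server address. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>[YOUR JOB HISTORY SERVER HOSTNAME]:10020</value>
</property>
```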
Next Step
See Connect Pentaho Components to EMR Cluster for instructions on how to configure and test your connection.
Connect Pentaho Components to EMR Cluster
Creating a connection to the cluster involves setting an active shim, then configuring and testing the connection to the cluster. Making a shim active means it is used by default when you access a cluster. When you initially install Pentaho, no shim is active by default. You must choose a shim to make active before you can connect to a cluster. Only one shim can be active at a time. The way you make a shim active, as well as the way you configure and test the cluster connection differs by Pentaho component.
- If you want to create a connection in Spoon, see Create and Test a Connection to the Cluster in Spoon.
- If you want to create and test a connection in the other components, see Create and Test a Connection to the DI Server, BA Server, PRD, and PME.
Create and Test a Connection to the Cluster in Spoon
Connecting to the EMR cluster from Spoon involves two tasks:
- Set the Active Shim in Spoon
- Configure and Test the Cluster Connection
Set the Active Shim in Spoon
Set the active shim when you want to connect to a Hadoop cluster the first time, or when you want to switch clusters. Only one shim can be active at a time.
- Start Spoon.
- Select Hadoop Distribution... from the Tools menu.
- In the Hadoop Distribution window, select the Hadoop distribution you want.
- Click OK.
- Stop, then restart Spoon.
Configure and Test the Cluster Connection
Provide connection details for the cluster and services you will use, such as the hostname for HDFS or the URL for Oozie. Then, you can use a built-in tool to test your configuration to find and troubleshoot common configuration issues, such as wrong hostnames and user permission errors.
Connection settings are set in the Hadoop cluster window. You can get to the settings from several places, but in these instructions you will open the Hadoop cluster window from the View tab in a transformation or job.
Copy Spoon Shim Files to Other Pentaho Components
Once your connection has been properly configured in Spoon, copy the configuration files to the shim directories in the other Pentaho components.
The location of the pentaho-big-data-plugin directory listed in these instructions is referenced in the Locate the Shim Directories section of this document.
- Copy the following configuration files from the pentaho-big-data-plugin/hadoop-configurations/emrxx directory in Spoon to the pentaho-big-data-plugin/hadoop-configurations/emrxx directory on the DI Server, BA Server, PRD, or PME.
- core-site.xml
- hdfs-site.xml
- emrfs-site.xml
- httpfs-site.xml
- mapred-site.xml
- yarn-site.xml
- Complete the tasks in the Connect Other Components to EMR Cluster section to connect and test.
Connect Other Components to EMR Cluster
These instructions explain how to connect the DI Server, BA Server, PRD, and PME to the EMR Cluster.
- Set the Active Shim on PRD, PME, and the DI and BA Servers
- Create and test the cluster connections.
Set the Active Shim on PRD, PME, and the DI and BA Servers
Modify a properties file to set the active shim for the DI Server, BA Server, PRD, and PME.
The location of the pentaho-big-data-plugin directory listed in these instructions is referenced in the Locate the Shim Directories section of this document.
- Stop the component.
- Locate the pentaho-big-data-plugin directory for your component.
- Go to the hadoop-configurations directory. For more information on directory names, see Locate the Pentaho Big Data Plugin and Shim Directories.
- Go back to the pentaho-big-data-plugin directory and open the plugin.properties file.
- Set the active.hadoop.configuration property to the directory name of the shim you want to make active. Here is an example:
active.hadoop.configuration=cdh54
- Save and close the plugin.properties file.
- Restart the component.
Create and Test Connections
Connection tests appear in the following table.
Component | Test |
---|---|
DI Server | Create a transformation in Spoon and run it remotely. |
BA Server | Create a connection to the cluster in the Data Source Wizard. |
PME | Create a connection to the cluster in PME. |
PRD | Create a connection to the cluster in PRD. |
Once you've connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from computers that have been properly configured to connect to the cluster.
Here is what they need to connect.
- Hadoop Distribution and Version of the Cluster
- HDFS, JobTracker, Zookeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
- Oozie URL (if used)
- Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.
They might also need more information depending on the job entries, transformation steps, and services they use. Here's a more detailed list of information that your users might need from you.
General Notes
Sqoop "Unsupported major.minor version" Error
If you are using Pentaho 6.0 and the Java version on your cluster is older than the Java version that Pentaho uses, you must change Pentaho's JDK so it is the same major version as the JDK on the cluster. The JDK that you install for Pentaho must meet the requirements in the Supported Components matrix. To learn how to download and install the JDK, read this article.
Version-Specific Notes
Impala and HBase Shim Support
Version | HBase | Impala |
---|---|---|
EMR 3.4 | Not supported | Supported |
EMR 3.10 | Not supported | Supported |
EMR 4.1 | Not supported | Not supported |
EMR 4.1
The following issue can occur in the EMR 4.1 shim.
Cannot Create S3 Buffer Directory Error
This error can occur if you use PDI to run a Pentaho MapReduce job on the EMR cluster. If you get this error, the MapReduce job fails when you try to run it. This error usually occurs on Windows computers only.
- To resolve the error, make sure the person who is running the PDI job has read, write, and execute access on EMR's S3 buffer directories. These directories are specified in the core-site.xml file on the EMR cluster, but are usually /mnt/s3 and /mnt1/s3.
- If you still get this error, comment out these lines in the core-site.xml file in the EMR shim directory on the computers where Spoon and the DI Server are installed:
<!--
<property>
  <name>fs.s3.buffer.dir</name>
  <value>/mnt/s3,/mnt1/s3</value>
</property>
-->
To learn where the EMR shim directory is located, see Locate the Pentaho Big Data Plugin and Shim Directories.
- Save and close the core-site.xml files on Spoon and the DI Server.
- Stop and restart Spoon and the DI Server.
- Run the job again.
Troubleshoot Cluster and Service Configuration Issues
General Configuration Problems
This section explains how to resolve common configuration problems.
Shim and Configuration Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
No shim | | |
Shim doesn't load | | |
The file system's URL does not match the URL in the configuration file. | Configuration files (*-site.xml files) were not configured properly. | Verify that the configuration files, especially core-site.xml, are configured correctly. See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to an Apache Hadoop Cluster article for details. |
Connection Problems
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Hostname incorrect or not resolving properly. | | |
Port name is incorrect. | | |
Can't connect. | | |
Directory Access or Permissions Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't access directory. | | |
Can't create, read, update, or delete files or directories | Authorization and/or authentication issues. | |
Test file cannot be overwritten. | Pentaho test file is already in the directory. | |
Oozie Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't connect to Oozie. | | |
Zookeeper Problems
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't connect to Zookeeper. | | |
Zookeeper hostname or port not found or doesn't resolve properly. | | |