Set up Pentaho to connect to an Amazon EMR cluster

Before you begin

Complete the following tasks before you configure Pentaho to connect to the cluster.

Procedure

Check the Components Reference to verify that your Pentaho version supports your version of the Amazon EMR cluster.
Set up an Amazon EMR cluster.
To prepare an Amazon EMR cluster that Pentaho can connect to:
1. Configure an Amazon EC2 cluster.
  See Amazon's documentation if you need help.
2. Install any required services and service client tools.
3. Test the cluster.
Install PDI on an Amazon EC2 instance that is within the same Amazon Virtual Private Cloud (VPC) as the Amazon EMR cluster.
Get the connection information for the cluster and services that you will use from your Hadoop administrator, or from a cluster management tool. You'll also need to supply some of this information to users once you are finished.
Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find this property in the hdfs-site.xml file on your cluster or in the cluster management application.
Read the Notes section to review special configuration instructions for your version of Amazon EMR.

Next steps

Note: There are no Pentaho-specific edits that need to be made to the *-site.xml configuration files on the cluster.

Configure Pentaho component shims

You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster:

PDI client (Spoon)
Pentaho Server
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)

As a best practice, configure the shim in the PDI client first. The PDI client has features that will help you test your configuration. Then copy the tested PDI client configuration files to other components, making changes if necessary.

You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from the PDI client. If you do not plan to connect to the cluster from the PDI client, you can configure the shim in another component first instead.

Step 1: Locate the Pentaho Big Data Plugin and shim directories

Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the locations of this directory, for each component, to complete shim configuration and testing tasks.

Note: <pentaho home> is the directory where Pentaho is installed.

Components	Location of Pentaho Big Data Plugin Directory
PDI client	`<pentaho home>`/design-tools/data-integration/plugins/pentaho-big-data-plugin
Pentaho Server	`<pentaho home>`/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
Pentaho Report Designer	`<pentaho home>`/design-tools/report-designer/plugins/pentaho-big-data-plugin
Pentaho Metadata Editor	`<pentaho home>`/design-tools/metadata-editor/plugins/pentaho-big-data-plugin
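The locations in the table above can be captured in a small lookup, which may help when scripting later steps. This is a minimal sketch: the component keys and the `plugin_dir` function are hypothetical names chosen for illustration, and the paths assume a default installation layout that uses forward slashes.

```python
from pathlib import PurePosixPath

# Relative locations copied from the table above. The dictionary keys are
# illustrative names for this sketch, not identifiers used by Pentaho.
PLUGIN_SUBDIRS = {
    "pdi-client": "design-tools/data-integration/plugins/pentaho-big-data-plugin",
    "pentaho-server": "server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin",
    "report-designer": "design-tools/report-designer/plugins/pentaho-big-data-plugin",
    "metadata-editor": "design-tools/metadata-editor/plugins/pentaho-big-data-plugin",
}


def plugin_dir(pentaho_home: str, component: str) -> PurePosixPath:
    """Return the Pentaho Big Data Plugin directory for one component."""
    return PurePosixPath(pentaho_home) / PLUGIN_SUBDIRS[component]
```

For example, `plugin_dir("/opt/pentaho", "pdi-client")` builds the PDI client path under a hypothetical `/opt/pentaho` installation directory.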

Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the distribution's version number. The version number does not contain a decimal point. For example, the shim directory named cdh54 is the shim for CDH (Cloudera Distribution for Hadoop), version 5.4. Here is a list of the shim directory abbreviations.

Abbreviation	Shim
cdh	Cloudera's Distribution of Apache Hadoop
emr	Amazon Elastic MapReduce
hdi	Microsoft Azure HDInsight
hdp	Hortonworks Data Platform
mapr	MapR
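The naming convention described above can be checked with a short parser. This is a sketch only: `parse_shim_dir` is a hypothetical helper, and it returns the version digits unchanged because the directory name alone does not say where the decimal point belongs.

```python
import re
from typing import Tuple

# Abbreviations from the table above.
KNOWN_ABBREVIATIONS = {"cdh", "emr", "hdi", "hdp", "mapr"}

# Three or four lowercase letters followed by one or more digits.
_SHIM_NAME = re.compile(r"^([a-z]{3,4})(\d+)$")


def parse_shim_dir(name: str) -> Tuple[str, str]:
    """Split a shim directory name into (abbreviation, version digits)."""
    match = _SHIM_NAME.match(name)
    if match is None or match.group(1) not in KNOWN_ABBREVIATIONS:
        raise ValueError(f"not a recognized shim directory name: {name!r}")
    return match.group(1), match.group(2)
```

Calling `parse_shim_dir("cdh54")` yields the abbreviation `cdh` and the digits `54`, which the text above reads as version 5.4.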

Step 2: Select the correct shim

Although Pentaho often supports one or more versions of a Hadoop distribution, the download of the Pentaho Suite only contains the latest, supported, Pentaho-certified version of the shim. The other supported versions of shims can be downloaded from the Pentaho Customer Support Portal.

Note: Before you begin, verify that the shim you want is supported by your version of Pentaho shown in the Components Reference.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories.
If the shim you want to use is already there, you can go to Step 3: Copy the configuration files from cluster to shim.
On the Customer Portal home page, sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet.
In the search box, enter the name of the shim you want, then select the shim from the search results.
(Optional) You can browse the shims by version on the Downloads page.
Read all prerequisites, warnings, and instructions.
On the bottom of the page in the Box widget, click the shim ZIP file to download it.
Unzip the downloaded shim package to the pentaho-big-data-plugin/hadoop-configurations directory.

Step 3: Copy the configuration files from cluster to shim

Copying configuration files from the cluster to the shim keeps the configuration files in sync and reduces troubleshooting errors.

Procedure

Back up the existing Amazon EMR shim files in the pentaho-big-data-plugin/hadoop-configurations/emrxx directory.
Copy the following configuration files from the EMR cluster to pentaho-big-data-plugin/hadoop-configurations/emrxx (overwrite the existing files):
- core-site.xml
- hdfs-site.xml
- emrfs-site.xml
- httpfs-site.xml
- mapred-site.xml
- yarn-site.xml
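The backup and copy steps can be scripted. The sketch below assumes the cluster's configuration files have already been transferred to a local directory; `refresh_shim` and its parameter names are hypothetical, and it backs up only the files it is about to overwrite rather than the whole shim directory.

```python
import shutil
from pathlib import Path

# The six configuration files listed above.
CONFIG_FILES = [
    "core-site.xml", "hdfs-site.xml", "emrfs-site.xml",
    "httpfs-site.xml", "mapred-site.xml", "yarn-site.xml",
]


def refresh_shim(cluster_conf: Path, shim_dir: Path, backup_dir: Path) -> list:
    """Back up the shim's current files, then overwrite them with the cluster's copies.

    Returns the names of the files that were copied. Files that are missing
    from cluster_conf are skipped rather than treated as an error.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in CONFIG_FILES:
        source = cluster_conf / name
        target = shim_dir / name
        if not source.is_file():
            continue
        if target.is_file():
            # Keep the previous shim version before overwriting it.
            shutil.copy2(target, backup_dir / name)
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```

Comparing the returned list against `CONFIG_FILES` shows which files were not found in the local copy of the cluster configuration.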

Step 4: Edit the shim configuration files

You need to verify or change settings in authentication, Oozie, Hive, MapReduce, and YARN in these shim configuration files:

core-site.xml
mapred-site.xml

Verify or edit Core Site XML file

Note: If you plan to run MapReduce jobs on an Amazon EMR cluster, make sure you have read, write, and execute access to the S3 buffer directories specified in the core-site.xml file on the EMR cluster.

Edit the core-site.xml file to add your AWS access key ID and secret access key. You will also need to indicate your LZO compression setting.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/emrxx directory and open the core-site.xml file.

Add the following values:

Parameter	Values
fs.s3.awsAccessKeyId	Value of your S3 AWS Access Key ID. <property> <name>fs.s3.awsAccessKeyId</name> <value>[INSERT YOUR VALUE HERE]</value> </property>
fs.s3.awsSecretAccessKey	Value of your AWS secret access key. <property> <name>fs.s3.awsSecretAccessKey</name> <value>[INSERT YOUR VALUE HERE]</value> </property>

If needed, enter the AWS Access Key ID and Access Key for S3N like this:

Parameter	Values
fs.s3n.awsAccessKeyId	Value of your S3N AWS Access Key ID. <property> <name>fs.s3n.awsAccessKeyId</name> <value>[INSERT YOUR VALUE HERE]</value> </property>
fs.s3n.awsSecretAccessKey	Value of your S3N AWS secret access key. <property> <name>fs.s3n.awsSecretAccessKey</name> <value>[INSERT YOUR VALUE HERE]</value> </property>

Add the following values:

Parameter	Values
fs.s3n.impl	<property> <name>fs.s3n.impl</name> <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value> </property>
fs.s3.impl	<property> <name>fs.s3.impl</name> <value>org.apache.hadoop.fs.s3.S3FileSystem</value> </property>

LZO is a compression format that Amazon EMR supports. If you want to configure for LZO compression, you will need to download a JAR file. If you do not, you will need to remove a parameter from the core-site.xml file.
- If you are not going to use LZO compression, remove any references to the LZO codec, com.hadoop.compression.lzo.LzoCodec, from the io.compression.codecs parameter in the core-site.xml file.
- If you are going to use LZO compression, download the LZO JAR and add it to pentaho-big-data-plugin/hadoop-configurations/emr3x/lib directory. The LZO JAR can be found here: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/.
Save and close the file.
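If you prefer to script the property edits instead of making them by hand, the following sketch shows one way to add or update a property in a Hadoop site file. The function names are hypothetical. Note that Python's ElementTree parser discards XML comments and processing instructions by default, so edit by hand if the file contains any you want to keep.

```python
import xml.etree.ElementTree as ET


def set_property(root: ET.Element, name: str, value: str) -> None:
    """Set <value> for the named property, adding the <property> if absent."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return
    # No existing entry: append a new <property> element.
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value


def update_site_file(path: str, settings: dict) -> None:
    """Apply each name/value pair to the Hadoop site file at path."""
    tree = ET.parse(path)
    for name, value in settings.items():
        set_property(tree.getroot(), name, value)
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

A call such as `update_site_file("core-site.xml", {"fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem"})` would apply one of the values from the tables above. Avoid hard-coding access keys in scripts that are shared or checked into version control.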

Edit Mapred Site XML file

Edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/emrxx directory and open the mapred-site.xml file.

Add the following values:

Parameter	Value
mapreduce.app-submission.cross-platform	This property allows MapReduce jobs to run on either Windows client or Linux server platforms. <property> <name>mapreduce.app-submission.cross-platform</name> <value>true</value> </property>

Save and close the file.
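After saving, you may want to confirm that the setting is present. The sketch below reads the contents of a site file and checks the property; `get_property` and `cross_platform_enabled` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_property(xml_text: str, name: str) -> Optional[str]:
    """Return the <value> of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def cross_platform_enabled(xml_text: str) -> bool:
    """True when mapreduce.app-submission.cross-platform is set to true."""
    value = get_property(xml_text, "mapreduce.app-submission.cross-platform")
    return value is not None and value.strip().lower() == "true"
```

Pass in the text of the shim's mapred-site.xml file; a result of `False` means the property is missing or set to a value other than true.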

Connect to a Hadoop cluster with the PDI client

Once you have set up your shim, you must make it active, then configure and test the connection to the cluster. For details on setting up the connection, see the article Connect to a Hadoop cluster with the PDI client.

Copy the PDI client shim files to other Pentaho components

Once your connection has been properly configured on the PDI client, you can copy the configuration files to the shim directories in the other Pentaho components. Copy the following configuration files from the pentaho-big-data-plugin/hadoop-configurations/emrxx directory on the PDI client to the pentaho-big-data-plugin/hadoop-configurations/emrxx directory on the Pentaho Server, PRD, or PME:

core-site.xml
hdfs-site.xml
emrfs-site.xml
httpfs-site.xml
mapred-site.xml
yarn-site.xml

Connect other Pentaho components to the Amazon EMR cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in these components involves two tasks:

Setting the active shim on PRD, PME, and the Pentaho Server
Creating and testing the cluster connections

Set the active shim on PRD, PME, and Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server, PRD, and PME.

Procedure

Stop the component.
Locate the pentaho-big-data-plugin directory for your component.
Navigate to the hadoop-configurations directory and note the name of the shim directory that you want to make active.
Navigate to the pentaho-big-data-plugin directory and open the plugin.properties file.
Set the active.hadoop.configuration property to the directory name of the shim you want to make active.
Here is an example:
```
active.hadoop.configuration=emr46
```
Save and close the plugin.properties file.
Restart the component.
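The property edit in this procedure can also be applied programmatically, which may be convenient when several components need the same change. This is a minimal sketch: `set_active_shim` is a hypothetical helper that works on the text of plugin.properties and leaves reading and writing the file to the caller.

```python
import re

ACTIVE_KEY = "active.hadoop.configuration"


def set_active_shim(properties_text: str, shim_name: str) -> str:
    """Return properties_text with active.hadoop.configuration set to shim_name."""
    # Match an uncommented line that assigns the key, on a single line only.
    pattern = re.compile(rf"^[ \t]*{re.escape(ACTIVE_KEY)}[ \t]*[=:].*$", re.MULTILINE)
    new_line = f"{ACTIVE_KEY}={shim_name}"
    if pattern.search(properties_text):
        return pattern.sub(lambda _match: new_line, properties_text, count=1)
    # Key not present: append it, keeping the file newline-terminated.
    needs_newline = bool(properties_text) and not properties_text.endswith("\n")
    return f"{properties_text}{chr(10) if needs_newline else ''}{new_line}\n"
```

Stop the component before writing the result back to plugin.properties, and restart it afterward, as described in the procedure.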

Create and test connections

Connection tests appear in the following table:

Component	Test
Pentaho Server for DI	Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA	Create a connection to the cluster in the Data Source Wizard.
PME	Create a connection to the cluster in PME.
PRD	Create a connection to the cluster in PRD.

Once you have connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from computers that have been properly configured to connect to the cluster.

Here is what they need to connect:

Hadoop distribution and version of the cluster
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
Oozie URL (if used)
Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.
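Before handing out hostnames and port numbers, it can be useful to confirm that each one is reachable from a user's computer. The sketch below only tests whether a TCP connection can be opened; it does not verify that the expected service is listening or that credentials are valid. `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it against each host and port you plan to give to users, substituting the values for your own cluster.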

They might also need more information depending on the job entries, transformation steps, and services they use. Here is a more detailed list of information that your users might need from you.

Notes

Impala and HBase shim support

EMR Version	HBase	Impala
3.4	Not supported	Supported
3.10	Not supported	Supported
4.1	Not supported	Not supported

Amazon EMR 4.1

The following issue can occur in the Amazon EMR 4.1 shim.

Cannot create S3 buffer directory error

This error can occur if you use PDI to run a Pentaho MapReduce job on the Amazon EMR cluster. If you get this error, the MapReduce job fails when you try to run it. This error usually occurs on Windows computers only.

Procedure

To resolve the error, make sure the person who is running the PDI job has read, write, and execute access on Amazon EMR's S3 buffer directories. These directories are specified in the core-site.xml file on the Amazon EMR cluster, but are usually /mnt/s3 and /mnt1/s3.
If you still get this error, comment out these lines in the core-site.xml file in the Amazon EMR shim directory on the computers where the PDI client and the Pentaho Server are installed:
```

```
Save and close the core-site.xml files on the PDI client and the Pentaho Server.
Stop and restart the PDI client and the Pentaho Server.
Run the job again.
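The access check in the first step of this procedure can be sketched as follows. It must run as the user who runs the PDI job, on the machine where the buffer directories exist. `missing_access` is a hypothetical helper, and `os.access` gives only a rough answer on Windows, where the error is most common.

```python
import os


def missing_access(path: str) -> list:
    """Return which of read/write/execute the current user lacks on path."""
    if not os.path.isdir(path):
        return ["missing directory"]
    checks = [("read", os.R_OK), ("write", os.W_OK), ("execute", os.X_OK)]
    return [label for label, mode in checks if not os.access(path, mode)]
```

An empty list means the current user has all three kinds of access to the directory, for example one of the S3 buffer directories named in core-site.xml.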

Next steps

For troubleshooting cluster and service configuration issues, refer to Big Data issues.
