Connect to a Hadoop Cluster in Spoon
Overview
This article explains how to connect Pentaho to a Hadoop cluster from Spoon.
To connect Pentaho to a Hadoop cluster you will need to do two things:
- Set the active shim
- Create and test the connection
A shim is a bit like an adapter that enables Pentaho to connect to a Hadoop distribution, like Cloudera Distribution for Hadoop (CDH). The active shim is used by default when you run big data transformations, jobs, and reports. When you first install Pentaho, no shim is active, so this is the first thing you need to do before you try to connect to a Hadoop cluster.
After the active shim is set, you must configure and then test the connection. Spoon has built-in tools to help you do this.
Before You Begin
Before you begin, make sure that your Hadoop Administrator has granted you permission to access the HDFS directories you need. This typically includes your home directory as well as any other directories you need to do your work. Your Hadoop Administrator should also have already configured the Pentaho installation on your computer to connect to the Hadoop cluster. For details, see the Set Up Pentaho to Connect to an Apache Hadoop Cluster article. You also need to know these things:
- Distribution and version of the cluster (e.g. Cloudera Distribution 5.4)
- IP addresses and port numbers for HDFS, the JobTracker, and Zookeeper (if used)
- Oozie URL (if used)
Set the Active Shim in Spoon
Set the active shim when you want to connect to a Hadoop cluster the first time, or when you want to switch clusters. Only one shim can be active at a time.
- Start Spoon.
- Select Hadoop Distribution... from the Tools menu.
- In the Hadoop Distribution window, select the Hadoop distribution you want.
- Click OK.
- Stop, then restart Spoon.
Configure and Test the Cluster Connection
Configured connection information is available for reuse in other steps and entries. Whether you are connected to the DI repository when you create the connection determines who can reuse it.
- If you are connected to the DI repository when you create the connection, you and other users can reuse the connection.
- If you are not connected to the DI repository when you create the connection, only you can reuse the connection.
Open the Hadoop Cluster Window
Connection settings are set in the Hadoop cluster window. You can get to the settings from these places:
- Steps and Entries
- View tab in a transformation or job
- Repository Explorer window
Steps and Entries
- Create a new job or transformation or open an existing one.
- Add a step or entry that can connect to a Hadoop cluster to the Spoon canvas.
- Click the New button next to the Hadoop Cluster field. The Hadoop cluster window appears.
- Configure and Test the Hadoop Cluster connection.
View Tab
- In Spoon, create a new job or transformation or open an existing one.
- Click the View tab.
- Right-click the Hadoop cluster folder, then click New. The Hadoop cluster window appears.
- Configure and Test the Hadoop Cluster connection.
Repository Explorer
- In Spoon, connect to the repository where you want to store the transformation or job.
- Select Repository from the Tools menu.
- Select Explore to open the Repository Explorer window.
- Click the Hadoop clusters tab.
- Click the New button. The Hadoop Cluster window appears.
- Configure and Test the Hadoop Cluster connection.
Configure and Test Connection
Once you have opened the Hadoop cluster window from a step or entry, the View tab, or the Repository Explorer window, configure the connection.
- Enter information in the Hadoop cluster window. You can get most of the information you need from your Hadoop Administrator.
As a best practice, use Kettle variables for each connection parameter value to mitigate risks associated with running jobs and transformations in environments that are disconnected from the repository.
Option | Definition |
---|---|
Cluster Name | Name that you assign the cluster connection. |
Use MapR Client | Indicates that this connection is for a MapR cluster. If this box is checked, the fields in the HDFS and JobTracker sections are disabled because those parameters are not needed to configure MapR. |
Hostname (in HDFS section) | Hostname for the HDFS node in your Hadoop cluster. |
Port (in HDFS section) | Port for the HDFS node in your Hadoop cluster. |
Username (in HDFS section) | Username for the HDFS node. |
Password (in HDFS section) | Password for the HDFS node. |
Hostname (in JobTracker section) | Hostname for the JobTracker node in your Hadoop cluster. If you have a separate job tracker node, type in the hostname here. Otherwise use the HDFS hostname. |
Port (in JobTracker section) | Port for the JobTracker node in your Hadoop cluster. This cannot be the same as the HDFS port number. |
Hostname (in Zookeeper section) | Hostname for the Zookeeper node in your Hadoop cluster. Supply this only if you want to connect to a Zookeeper service. |
Port (in Zookeeper section) | Port for the Zookeeper node in your Hadoop cluster. Supply this only if you want to connect to a Zookeeper service. |
URL (in Oozie section) | Oozie client address. Supply this only if you want to connect to the Oozie service. |
- Click the Test button. Test results appear in the Hadoop Cluster Test window. If you have problems, see Troubleshoot Connection Issues to resolve the issues, then test again.
- If there are no more errors, congratulations! The connection is properly configured. Click the Close button to close the Hadoop Cluster Test window.
- When complete, click the OK button to close the Hadoop cluster window.
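As a sketch of the Kettle-variable best practice mentioned above, you could define one variable per connection parameter in your kettle.properties file (in the .kettle directory under your home directory) and reference them in the Hadoop cluster window as ${HDFS_HOSTNAME}, and so on. The variable names, hostnames, and port values below are hypothetical examples; substitute the values your Hadoop Administrator provides:

```properties
# Hypothetical kettle.properties entries for a Hadoop cluster connection.
# Reference these in the Hadoop cluster window fields as ${HDFS_HOSTNAME}, etc.
HDFS_HOSTNAME=namenode.example.com
HDFS_PORT=8020
JOBTRACKER_HOSTNAME=jobtracker.example.com
JOBTRACKER_PORT=8021
ZOOKEEPER_HOSTNAME=zookeeper.example.com
ZOOKEEPER_PORT=2181
OOZIE_URL=http://oozie.example.com:11000/oozie
```

Because the variables resolve at run time, a job or transformation exported to another environment only needs a different kettle.properties file, not edited connection settings.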
Troubleshoot Connection Issues
General Configuration Problems
This section explains how to resolve common configuration problems.
Shim and Configuration Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
No shim | | |
Shim doesn't load | | |
The file system's URL does not match the URL in the configuration file. | Configuration files (*-site.xml files) were not configured properly. | Verify that the configuration files were configured correctly. Verify that the core-site.xml file is configured correctly. See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to an Apache Hadoop Cluster article for details. |
Connection Problems
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Hostname incorrect or not resolving properly. | | |
Port number is incorrect. | | |
Can't connect. | | |
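When a connection test fails, it can help to first confirm that each hostname resolves and that each port accepts TCP connections from the machine where Spoon runs. The following sketch is not part of Pentaho; it is a generic Python check, and the host and port in the usage comment are placeholder values:

```python
import socket

def check_endpoint(host, port, timeout=5.0):
    """Return (resolves, reachable): does the hostname resolve,
    and does the port accept a TCP connection?"""
    try:
        socket.gethostbyname(host)  # DNS / hosts-file lookup
    except socket.gaierror:
        return (False, False)
    try:
        # Attempt a plain TCP connect to the service port.
        with socket.create_connection((host, port), timeout=timeout):
            return (True, True)
    except OSError:
        return (True, False)

# Example (placeholder values): check the HDFS NameNode endpoint.
# resolves, reachable = check_endpoint("namenode.example.com", 8020)
```

If the hostname does not resolve, check DNS or your hosts file; if it resolves but the port is unreachable, check the port number, the service status, and any firewalls between Spoon and the cluster.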
Directory Access or Permissions Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't access directory. | | |
Can't create, read, update, or delete files or directories. | Authorization and/or authentication issues. | |
Test file cannot be overwritten. | Pentaho test file is already in the directory. | |
Oozie Issues
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't connect to Oozie. | | |
Zookeeper Problems
Symptoms | Common Causes | Common Resolutions |
---|---|---|
Can't connect to Zookeeper. | | |
Zookeeper hostname or port not found or doesn't resolve properly. | | |
Manage Existing Hadoop Cluster Connections
Once cluster connections have been created, you can manage them.
- Edit Hadoop Cluster Connections
- Duplicate Hadoop Cluster Connections
- Delete Hadoop Cluster Connections
Edit Hadoop Cluster Connections
How updates occur depends on whether you are connected to the repository.
- If you are connected to a repository: Hadoop Cluster connection changes are picked up by all transformations and jobs in the repository. The Hadoop Cluster connection information is loaded during execution unless it cannot be found. If the connection information cannot be found, the connection values that were stored when the transformation or job was saved are used instead.
- If you are not connected to a repository: Hadoop Cluster connection changes are only picked up by your local (file system) transformations and jobs. If you run these transformations and jobs outside of Kettle, they will not have access to the Hadoop Cluster connection, so a copy of the connection is saved as a fallback. Note that the fallback copy stored in a transformation or job is not updated until you save that transformation or job again.
You can edit Hadoop cluster connections in three places:
- Steps and entries
- View tab
- Repository Explorer window
Steps and Entries
To edit a Hadoop cluster connection in a step or entry, complete these steps.
- Open the Hadoop cluster window in a step or entry.
- Make changes, then click Test.
- Click the OK button.
View Tab
To edit a Hadoop cluster connection from the transformation or job View tab, complete these steps.
- Click the Hadoop Clusters folder in the View tab.
- Right-click a connection, then select Edit. The Hadoop cluster window appears.
- Make changes, then click Test.
- Click the OK button.
Repository Explorer
To edit a Hadoop cluster connection from the Repository Explorer window, do the following.
- Click the Hadoop Clusters tab in the Repository Explorer window.
- Select a connection, then click Edit. The Hadoop cluster window appears.
- Make changes, then click Test.
- Click the OK button.
Duplicate a Hadoop Cluster Connection
You can only duplicate or clone a Hadoop Cluster connection in the Spoon View tab.
- Click the Hadoop clusters folder in the View tab.
- Right-click a connection and select Duplicate.
- The Hadoop cluster window appears. Enter a different name in the Cluster Name field.
- Make changes, then click Test.
- Click the OK button.
Delete a Hadoop Cluster Connection
Deleted connections cannot be restored. However, you can still run transformations and jobs that reference them, because deleted connection details are stored in the transformation and job metadata files.
You can delete Hadoop cluster connections in two places:
- View tab
- Repository Explorer window
View Tab
To delete a Hadoop cluster connection in a transformation or job, complete these steps.
- Click the Hadoop clusters folder in the View tab.
- Right-click a Hadoop cluster connection and select Delete.
- A message appears asking whether you really want to delete the connection. Click Yes.
Repository Explorer
To delete Hadoop cluster connections from the Repository Explorer window, do the following.
- Connect to the Repository Explorer.
- Click the Hadoop Clusters tab.
- Select a Hadoop cluster connection, then click Delete.
- A message appears asking if you really want to delete the Hadoop cluster connection. Click Yes.