Big Data Issues
Follow the suggestions in these topics to help resolve common issues when working with Big Data:
- General Configuration Problems
- Cannot Access Cluster with Kerberos Enabled
- Cannot Access a Hive Cluster
- Cannot use Keytab File to Authenticate Access to PMR Cluster
- HBase Get Master Failed Error
- Sqoop Import into Hive Fails
- Pig Job Not Executing after Kerberos Authentication Fails
- Cannot Start Any Pentaho Components after Setting MapR as Active Hadoop Configuration
- Kettle Cluster on YARN Will Not Start
- Spark Issues
See Pentaho Troubleshooting articles for additional topics.
General Configuration Problems
The issues in this section explain how to resolve common configuration problems.
Shim and Configuration Issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| No shim | | |
| Shim does not load | | |
| The file system's URL does not match the URL in the configuration file. | | |
| Sqoop "Unsupported major.minor version" Error | In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses. | |
Connection Problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Hostname does not resolve. | | |
| Port number does not resolve. | | |
| Cannot connect to the cluster. | | |
| Windows failure message: "java.io.FileNotFoundException: HADOOP_HOME" and "hadoop.home.dir are unset" | | Follow the instructions at https://wiki.apache.org/hadoop/WindowsProblems and set the %HADOOP_HOME% environment variable to the directory path containing WINUTILS.EXE. |
| Cannot access a Hive database (Secured Clusters Only) | | |
Directory Access or Permissions Issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot access directory | | |
| Cannot create, read, update, or delete files or directories | | |
| Test file cannot be overwritten. | | |
Oozie Issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot connect to Oozie | | |
Zookeeper Problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot connect to Zookeeper | | |
| Zookeeper hostname or port not found or does not resolve properly. | | |
Kafka Problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Can't connect to Kafka. | | |
Cannot Access Cluster with Kerberos Enabled
If a step or entry cannot access a Kerberos-authenticated cluster, review the steps in Use Impersonation to Access a MapR Cluster.
If this issue persists, verify that the username, password, UID, and GID for each impersonated or spoofed user are the same on every node. When a user is deleted and recreated, that user may be assigned different UID and GID values, which causes this issue.
Cannot Access a Hive Cluster
If you cannot use Kerberos impersonation to authenticate and access a Hive cluster, review the steps in Use Impersonation to Access a MapR Cluster.
If this issue persists, copy the hive-site.xml file from the Hive server into the MapR distribution in these directories:
- Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
- PDI Client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
If this still does not work, disable pooled connections for Hive.
Cannot use Keytab File to Authenticate Access to PMR Cluster
If you cannot authenticate and gain access to the PMR cluster, copy the keytab file to each task tracker node on the PMR cluster.
HBase Get Master Failed Error
If the HBase cannot negotiate the authenticated portion of the connection error occurs, copy the hbase-site.xml file from the HBase server into the MapR distribution in these directories:
- Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
- PDI Client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
Sqoop Import into Hive Fails
If a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation's configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.
Verify that the Hadoop connection information used by the local Hive installation matches the connection information configured in the Sqoop job entry.
Cannot Start Any Pentaho Components after Setting MapR as Active Hadoop Configuration
If you set MapR as your active Hadoop configuration but cannot start any Pentaho component (Pentaho Server, Spoon, Report Designer, or the Metadata Editor), verify that you have configured MapR properly.
As you review the configuration instructions for MapR, make sure that you have copied the required JAR files into the pentaho-big-data-plugin/hadoop-configurations/mapr3x folder for each component listed. For details, see the MapR configuration instructions for your Pentaho components.
Pig Job Not Executing after Kerberos Authentication Fails
After Kerberos authentication fails, your Pig job will not execute until you restart PDI. While PDI may continue to generate new Kerberos tickets and other Hadoop components may keep working, Pig continues to fail until PDI is restarted.
Problem
For authentication with Pig, Pentaho uses the UserGroupInformation wrapper around a JAAS Subject with a username and password, which is used for impersonation. The UserGroupInformation instance is stored by the KMSClientProvider constructor. When the Kerberos ticket expires, a new UserGroupInformation instance is created, but the instance held by the KMSClientProvider is not updated. The Pig job then fails because Pig cannot obtain the delegation tokens needed to authenticate the job at execution time.
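The following is a simplified illustration of the caching behavior described above. The class and method names are hypothetical and do not represent the actual Hadoop or Pentaho source; the sketch only shows why credentials captured at construction time go stale once the original ticket expires.

```java
import java.time.Instant;

// Hypothetical stand-in for a set of Kerberos credentials with a fixed lifetime.
class CachedCredentials {
    private final Instant expiresAt;

    CachedCredentials(Instant expiresAt) {
        this.expiresAt = expiresAt;
    }

    boolean isValid() {
        return Instant.now().isBefore(expiresAt);
    }
}

// Hypothetical client that captures credentials once, mirroring how the
// UserGroupInformation instance is held after the KMSClientProvider is constructed.
class KeyProviderClient {
    private final CachedCredentials cached;

    KeyProviderClient(CachedCredentials credentials) {
        this.cached = credentials;
    }

    void requestDelegationToken() {
        if (!cached.isValid()) {
            // Even if a fresh ticket exists elsewhere in the application,
            // this client still holds the expired copy, so the request fails.
            throw new IllegalStateException("Cached credentials have expired");
        }
        // ...the delegation token would be obtained here...
    }
}
```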
Solution
To resolve this issue, set the key.provider.cache.expiry time to a value equal to or less than the lifetime of the Kerberos ticket. By default, the key.provider.cache.expiry time is set to 10 days.
This solution assumes you are using Hortonworks 2.5 (HDP 2.5).
- Navigate to the hdfs-site.xml file location:
  - In the PDI client, navigate to data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25.
  - For the Pentaho Server, navigate to pentaho-server\pentaho-solutions\system\kettle\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25.
- Open the hdfs-site.xml file in a text editor.
- Adjust the key.provider.cache.expiry value (in milliseconds) so that it is less than the lifetime of the Kerberos ticket. You can view the Kerberos ticket lifetime in the krb5.conf file. For example, a 10-hour ticket lifetime corresponds to 36000000 milliseconds.
```xml
<property>
  <name>dfs.client.key.provider.cache.expiry</name>
  <value>410000</value>
</property>
```
The 'Group by' Step is not Supported in a Single Threaded Transformation Engine
If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occur if:
- An entire set of rows sharing the same grouping key is filtered out of the transformation before the Group by step.
- The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.
To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.
Kettle Cluster on YARN Will Not Start
When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.
Verify that the Default FS setting matches the configured hostname for the HDFS NameNode, and then try starting the Kettle cluster again.
Spark Issues
Follow the suggestions in these topics to help resolve common issues when running transformations with Spark.
Steps Cannot Run in Parallel
Some steps cannot run in parallel (on multiple nodes in a cluster) and produce unexpected results if they do. However, these steps can run as a coalesced dataset on a single node in the cluster. To enable a step to run as a coalesced dataset, add the step ID as a property value in the configuration file for the Spark engine.
Get the Step ID
Each PDI step has a step ID, a globally unique identifier of the step. Use either of the following two methods to get the ID of a step:
Method 1: Retrieve the ID from the log
You can retrieve a step ID through the PDI client with the following steps:
- In the PDI client, create a new transformation and add the step to it. For example, if you need to know the ID for the Select values step, add that step to the new transformation.
- Set the log level to Debug.
- Execute the transformation using the Spark engine. The step ID displays in the Logging tab of the Execution Results pane. For example, the log displays Selected the SelectValues step to run in parallel as a GenericSparkOperation, where SelectValues is the step ID.
Method 2: Retrieve the ID from the PDI plugin registry
If you are a developer, you can retrieve the step ID from the PDI plugin registry as described in Building Transformations Dynamically.
If you have created your own PDI transformation step plugin, the step ID is one of the annotation attributes that you supply.
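For reference, the following is a minimal sketch of one way to list the step IDs registered with PDI programmatically. It assumes the standard Kettle plugin-registry classes (KettleEnvironment, PluginRegistry, StepPluginType) are on the classpath; verify the calls against the PDI version you are running.

```java
import java.util.List;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.plugins.PluginInterface;
import org.pentaho.di.core.plugins.PluginRegistry;
import org.pentaho.di.core.plugins.StepPluginType;

public class ListStepIds {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment so the plugin registry is populated.
        KettleEnvironment.init();

        PluginRegistry registry = PluginRegistry.getInstance();
        List<PluginInterface> steps = registry.getPlugins(StepPluginType.class);
        for (PluginInterface step : steps) {
            // getIds() returns the step ID(s); getName() is the display name
            // shown in the PDI client, for example "Select values".
            System.out.println(step.getIds()[0] + " -> " + step.getName());
        }
    }
}
```

For a custom step plugin, the step ID is the id attribute supplied in the step's @Step annotation.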
Add the Step ID to the Configuration File
The configuration file, org.pentaho.pdi.engine.spark.cfg, contains the forceCoalesceSteps property. The property is a pipe-delimited listing of all the IDs for the steps that should run with a coalesced dataset. Pentaho supplies a default set to which you can add IDs for steps that generate errors.
Perform the following steps to add another step ID to the configuration file:
- Navigate to the data-integration/system/karaf/etc folder and open the org.pentaho.pdi.engine.spark.cfg file.
- Append your step ID to the forceCoalesceSteps property value list, using a pipe character as the separator between step IDs. For example, to add the Select values step, append |SelectValues to the end of the existing value.
- Save and close the file.
Table Input Step Fails
If you run a transformation that uses the Table Input step with a large database, the step does not complete. Use one of the following methods to resolve the issue:
Method 1: Load the data into HDFS before running the transformation
- Run a different transformation using the Pentaho engine to move the data to the HDFS cluster.
- Then run the transformation using the Spark engine, using HDFS Input to read the data.
Method 2: Increase the driver-side memory configuration
- Navigate to the data-integration/adaptive-execution/config folder and open the application.properties file.
- Increase the value of the sparkDriverMemory parameter, then save and close the file.
User ID Below Minimum Allowed
If you are using the Spark engine in a secured cluster and an error about the minimum user ID occurs, the user ID of the proxy user is below the minimum user ID required by the cluster. See the Cloudera documentation for details.
To resolve this issue, change the ID of the proxy user to be higher than the minimum user ID specified for the cluster.