Big Data issues

General configuration problems

The issues in this section explain how to resolve common configuration problems.

Shim and configuration issues

Symptoms	Common Causes	Common Resolutions
No shim	Active shim was not selected. Shim was installed in the wrong place. Shim name was not entered correctly in the plugin.properties file.	Verify that the plugin name that is in the plugin.properties file matches the directory name in the pentaho-big-data-plugin/hadoop-configurations directory. Verify the shim is installed in the correct place. Check the instructions for your Hadoop distribution in the Set up the Pentaho Server to connect to a Hadoop cluster article for more details on how to verify the plugin name and shim installation directory.
Shim does not load	Required licenses are not installed. You tried to load a shim that is not supported by your version of Pentaho. If you are using MapR, the client might not have been installed correctly. Configuration file changes were made incorrectly.	Verify the required licenses are installed and have not expired. Verify that the shim is supported by your version of Pentaho. Find your version of Pentaho, then look for the corresponding Components Reference for more details. Verify that configuration file changes were made correctly. Contact your Hadoop Administrator or see the Set up the Pentaho Server to connect to a Hadoop cluster article. If you are connecting to MapR, verify that the client was properly installed. See MapR documentation for details. Restart the PDI client (also known as Spoon), then test again. If this error continues to occur, files might be corrupted. Download a new copy of the shim from the Pentaho Customer Support Portal.
The file system's URL does not match the URL in the configuration file.	Configuration files (*-site.xml files) were not configured properly.	Verify that the configuration files were configured correctly. Verify that the core-site.xml file is configured correctly. See the instructions for your Hadoop distribution in the Set up the Pentaho Server to connect to a Hadoop cluster article for details.
Sqoop Unsupported major.minor version Error	In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses.	Verify that the JDK meets the requirements in the supported components matrix. Verify that the JDK on the Pentaho Server is the same major version as the JDK on your cluster.

Connection problems

Symptoms	Common Causes	Common Resolutions
Hostname does not resolve	No hostname has been specified. Hostname/IP Address is incorrect. Hostname is not resolving properly in the DNS.	Verify that the Hostname/IP address is correct. Check the DNS to make sure the Hostname is resolving properly.
Port number does not resolve	Port number is incorrect. Port number is not numeric. The port number is not necessary for HA clusters. No port number has been specified.	Verify that the port number is correct. Determine whether your cluster has been enabled for high availability (HA). If it has, then you do not need a port number. Clear the port number and retest the connection.
Cannot connect to the cluster	Firewall is a barrier to connecting. Other networking issues are occuring.	Verify that a firewall is not impeding the connection and there are no other network issues.
Windows failure message: “java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset	A configuration setting is required. Windows cannot locate the %HADOOP_HOME%\bin\winutils.exe file and the Hadoop_Home directory is not set.	Please follow the instructions at https://wiki.apache.org/hadoop/WindowsProblems and set the environment variable `%HADOOP_HOME%` to point to the directory path containing WINUTILS.EXE
Cannot access a Hive database (Secured Clusters Only)	To access Hive, you need to set two database connection parameters using the PDI client.	Open the hive-site.xml file on the Hive server host. Note the values for the kerberos.principal and the sasl.qop parameters, then close the file. Start the PDI client and open the Database Connection window. Click Options, and add the sasl.qop and principal parameters. Set them to the same values as they are in the hive-site.xml file. Save and close the file. NoteThe principal typically has a mapr prefix before the name, for example, `mapr/mapr31.pentaho@mydomain`.

Directory access or permissions issues

Symptoms	Common Causes	Common Resolutions
Cannot access directory	Authorization and/or authentication issues. Directory is not on the cluster.	Verify the user has been granted read, write, and execute access to the directory. Verify security settings for the cluster and shim allow access. Verify the hostname and port number are correct for the Hadoop File System's namenode.
Cannot create, read, update, or delete files or directories	Authorization and/or authentication issues.	Verify the user has been authorized execute access to the directory. Verify security settings for the cluster and shim allow access. Verify that the hostname and port number are correct for the Hadoop File System's namenode.
Test file cannot be overwritten	Pentahotest file is already in the directory.	The test was run, but the file was not deleted. You will need to manually delete the test file. Check the log for the test file name. A file with the same name as the Pentaho test file is already in the directory. The test file is used to make sure that the user can create, write, and delete in the user's home directory.

Oozie issues

Symptoms	Common Causes	Common Resolutions
Cannot connect to Oozie	Firewall issue. Other networking issues. Oozie URL is incorrect.	Verify that the Oozie URL was correctly entered. Verify that a firewall is not impeding the connection.

Zookeeper problems

Symptoms	Common Causes	Common Resolutions
Cannot connect to Zookeeper	Firewall is impeding connection with the Zookeeper service. Other networking issues.	Verify that a firewall is not impeding the connection.
Zookeeper hostname or port not found or does not resolve properly	Hostname/IP Address and Port name is missing or is incorrect.	Try to connect to the Zookeeper nodes using ping or another method. Verify that the Hostname/IP Address and Port numbers are correct.

Kafka problems

Symptoms	Common Causes	Common Resolutions
Cannnot connect to Kafka	Bootstrap server information is incorrect. Specified bootstrap server is down. Firewall issue.	Verify that the bootstrap server was correctly entered. Verify that the bootstrap server is running. Verify that a firewall is not blocking the connection.

Cannot access cluster with Kerberos enabled

If a step or entry cannot access a Kerberos authenticated cluster, review the steps in Use Kerberos with MapR.

If this issue persists, verify the username, password, UID, and GID for each impersonated or spoofed user is the same on each node. When a user is deleted and recreated, it may then have different UIDs and GIDs causing this issue.

Cannot access a Hive cluster

If you cannot use Kerberos impersonation to authenticate and access a Hive cluster, review the steps in Use Kerberos with MapR

If this issue persists, copy the hive-site.xml file on the Hive server to the MapR distribution in these directories:

Pentaho Server:
pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
PDI client:
data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]

If this still does not work, disable pooled connections for Hive.

Cannot use keytab file to authenticate access to PMR cluster

If you cannot authenticate and gain access to the PMR cluster, copy the keytab file to each task tracker node on the PMR cluster.

HBase Get Master Failed error

If the HBase cannot negation the authenticated portion of the connection error occurs, copy the hbase-site.xml file from the HBase server to the MapR distribution in these directories:

Pentaho Server:
pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
PDI client:
data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]

Sqoop import into Hive fails

If executing a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.

Verify the Hadoop connection information used by the local Hive installation is configured the same as the Sqoop job entry.

Cannot start any Pentaho components after setting MapR as active Hadoop configuration

If you set MapR to be your active Hadoop configuration, but you cannot start any Pentaho component (Pentaho Server, Spoon, Report Designer, or the Metadata Editor), make sure that you have completed proper configuration of MapR.

As you review the instructions for configuring MapR, make sure that you have copied the required JAR files to the pentaho-big-data-plugin/hadoop-configurations/mapr3x folders for each component listed. For information on how to configure MapR, see the following references:

Pig job not executing after Kerberos authentication fails

Your Pig job will not execute after the Kerberos authentication fails until you restart PDI. While PDI may continue to generate new Kerberos tickets and other Hadoop components may work, Pig continues to fail until PDI is restarted.

For authentication with Pig, Pentaho uses the UserGroupInformation wrapper around a JAAS Subject with username and password which is used for impersonation. The UserGroupInformation wrapper is stored in the KMSClientProvider constructor. When the Kerberos ticket expires, a new UserGroupInformation is created, but the instance stored in the KMSClientProvider constructor does not update. The Pig job fails when Pig cannot obtain delegation tokens to authenticate the job at execution time.

To resolve, set the key.provider.cache.expiry time to a value equal to or less than the duration time of the Kerberos ticket. By default, the key.provider.cache.expiry time is set to a value of: 10 days

NoteThis solution assumes you are using HortonWorks 2.5.

Procedure

Navigate to the hdfs-site.xml file location.
- In the PDI client, navigate to: data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25
- For the Pentaho Server, navigate to: pentaho-server\pentaho-solutions\system\kettle\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25
Open the hdfs-site.xml file in a text editor.
Adjust the key.provider.cache.expiry value (in milliseconds) so that it is less than the duration time of the Kerberos ticket.

NoteYou can view the Kerberos ticket duration time in the krb5.conf file.
```
<property>
<name>dfs.client.key.provider.cache.expiry</name>
<value>410000</value>
</property>
```

Group By step is not supported in a single threaded transformation engine

If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occurs if:

An entire set of rows sharing the same grouping key are filtered from the transformation before the Group By step.

The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.

To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.

Kettle cluster on YARN will not start

When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.

Verify in the File System Path (in the Files tab) that the Default FS setting matches the configured hostname for the HDFS Name node, then try starting the kettle cluster again.

Hadoop on Windows

If you have problems accessing the Hadoop file system on a Windows machine, see the Problems running Hadoop on Windows article on the Hadoop Wiki site.

Spark issues

Follow the suggestions in these topics to help resolve common issues with running transformations with Spark.

Steps cannot run in parallel
Table Input step fails
User ID below minimum allowed

Steps cannot run in parallel

Some steps cannot run in parallel (on multiple nodes in a cluster), and will produce unexpected results. However, these steps can run as a coalesced dataset on a single node in a cluster. To enable a step to run as a coalesced dataset, add the step ID as a property value in the configuration file for using the Spark engine.

Get the step ID

Each PDI step has a step ID, a globally unique identifier of the step. Use either of the following two methods to get the ID of a step:

Method 1: Retrieve the ID from the log

You can retrieve a step ID though the PDI client with the following steps:

Procedure

In the PDI client, create a new transformation and add the step to the transformation. For example, if you needed to know the ID for the Select values step, you would add that step to the new transformation.
Set the log level to debug.
Execute the transformation using the Spark engine.
The step ID will display in the Logging tab of the Execution Results pane. For example, the log will display:
```
Selected the SelectValues step to run in parallel as a GenericSparkOperation,
```
where SelectValues is the step ID.

Method 2: Retrieve the ID from the PDI plugin registry

If you are a developer, you can retrieve the step ID from the PDI plugin registry as described in: Dynamically build transformations.

If you have created your own PDI transformation step plugin, the step ID is one of the annotation attributes that the developer supplies.

Add the step ID to the configuration file

The configuration file, org.pentaho.pdi.engine.spark.cfg, contains the forceCoalesceSteps property. The property is a pipe-delimited listing of all the IDs for the steps that should run with a coalesced dataset. Pentaho supplies a default set to which you can add IDs for steps that generate errors.

Perform the following steps to add another step ID to the configuration file:

Procedure

Navigate to the data-integration/system/karaf/etc folder and open the org.pentaho.pdi.engine.spark.cfg file.
Append your step ID to the forceCoalesceSteps property value list, using a pipe character separator between the step IDs.
Save and close the file.

Table Input step fails

If you run a transform using the Table Input step with a large database, the step will not complete. Use one of the following methods to resolve the issue:

Method 1: Load the data to HDFS before running the transform

Run a different transformation using the Pentaho engine to move the data to the HDFS cluster.
Then use HDFS Input to run the transformation using the Spark engine.

Method 2: Increase the driver side memory configuration

Navigate to the data-integration/adaptive-execution/config folder and open the application.properties file.
Increase the value of the sparkDriverMemory parameter, then save and close the file.

User ID below minimum allowed

If you are using the Spark engine in a secured cluster and an error about minimum user ID occurs, the user ID of the proxy user is below the minimum user ID required by the cluster. See Cloudera documentation for details.

To resolve, change the ID of the proxy user to be higher than the minimum user ID specified for the cluster

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.

General configuration problems

Shim and configuration issues

Connection problems

Directory access or permissions issues

Oozie issues

Zookeeper problems

Kafka problems

Cannot access cluster with Kerberos enabled

Cannot access a Hive cluster

Cannot use keytab file to authenticate access to PMR cluster

HBase Get Master Failed error

Sqoop import into Hive fails

Cannot start any Pentaho components after setting MapR as active Hadoop configuration

Pig job not executing after Kerberos authentication fails

Group By step is not supported in a single threaded transformation engine

Kettle cluster on YARN will not start

Hadoop on Windows

Spark issues

Steps cannot run in parallel

Get the step ID

Method 1: Retrieve the ID from the log

Method 2: Retrieve the ID from the PDI plugin registry

Add the step ID to the configuration file

Table Input step fails

Method 1: Load the data to HDFS before running the transform

Method 2: Increase the driver side memory configuration

User ID below minimum allowed