Big Data issues
Follow the suggestions in these topics to help resolve common issues when working with Big Data:
- General configuration problems
- Cannot access cluster with Kerberos enabled
- Cannot access the Hive service on a cluster
- HBase Get Master Failed error
- Sqoop export fails
- Sqoop import into Hive fails
- Kettle cluster on YARN will not start
- Group By step is not supported in a single threaded transformation engine
- Hadoop on Windows
- Legacy mode activated when named cluster configuration cannot be located
- Unable to read or write files to HDFS on the Amazon EMR cluster
- Use YARN with S3
- Data Catalog searches returning incomplete or missing data
See Pentaho Troubleshooting articles for additional topics.
General configuration problems
The issues in this section explain how to resolve common configuration problems.
When updating to Pentaho 9.0, you must perform a one-time update of your cluster configurations to use the multiple cluster features. For more information, see Set up the Pentaho Server to connect to a Hadoop cluster. If you do not update your cluster configurations, the legacy configuration from the Big Data plugin is used.
Driver and configuration issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Could not find cluster configuration file config.properties for the cluster in expected metastore locations or a legacy shim configuration. | | |
| Could not find service for interface associated with named cluster. | | |
| No driver. | | |
| Driver does not load. | | |
| The file system's URL does not match the URL in the configuration file. | | |
| Sqoop Unsupported major.minor version Error. | In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses. | |
Connection problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Hostname does not resolve. | | |
| Port number does not resolve. | | |
| Cannot connect to the cluster. | | |
| Windows failure message: "java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset" | | Follow the instructions at https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems and set the environment variable %HADOOP_HOME% to point to the directory path containing WINUTILS.EXE. |
| Cannot access a Hive database (Secured Clusters Only). | | |
Directory access or permissions issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Access error when trying to reach the User Home Directory. | | |
| Cannot access directory. | | |
| Cannot create, read, update, or delete files or directories. | | |
| Test file cannot be overwritten. | | |
Oozie issues
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot connect to Oozie. | | |
Zookeeper problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot connect to Zookeeper. | | |
| Zookeeper hostname or port not found or does not resolve properly. | | |
Kafka problems
| Symptoms | Common Causes | Common Resolutions |
| --- | --- | --- |
| Cannot connect to Kafka. | | |
Cannot access cluster with Kerberos enabled
If this issue persists, verify that the username, password, UID, and GID for each impersonated or spoofed user are the same on each node. When a user is deleted and recreated, the recreated user may be assigned different UIDs and GIDs, which causes this issue.
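A quick way to compare these values is to run the id command for the impersonated user on every node; the hostnames and username below are placeholders for your environment, not values from this article.
# Placeholder node names and user; substitute your own.
for node in node1.example.com node2.example.com node3.example.com; do
  ssh "$node" id exampleuser
done
The uid and gid values reported by every node should be identical.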
Cannot access the Hive service on a cluster
If you cannot use Kerberos impersonation to authenticate and access the Hive service on a cluster, review the steps in Set Up Kerberos for Pentaho.
If this issue persists, copy the hive-site.xml file from the Hive server to the configuration directory of the named cluster connection in one of these directories:
- Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
- PDI client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
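As a sketch only, the file can be copied to the PDI client configuration directory with scp; the Hive server hostname, source path, and distribution folder name (hdp30) are assumptions for your environment.
# Placeholder host, source path, and distribution folder; adjust for your cluster.
scp hiveserver.example.com:/etc/hive/conf/hive-site.xml \
  data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp30/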
If the problem still persists, disable pooled connections for Hive.
HBase Get Master Failed error
If you encounter this error, copy the hbase-site.xml file from the HBase server to the configuration directory of the named cluster connection in one of these directories:
- Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
- PDI client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
Sqoop export fails
If a Sqoop export job generates the following error because a file already exists at the destination, then Sqoop failed to clear the compile directory:
Could not rename
\tmp\sqoop-devuser\compile\1894e2403c37a663c12c752ab11d8e6a\aggregatehdfs.java to
C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java. Error:
Destination 'C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java'
already exists.
Despite the error message, the job that generated it completes successfully. To prevent the error, add a Delete step to the job to remove the compile directory before the Sqoop export step runs.
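If you prefer to clean up outside of the job, the compile directory can also be removed manually before the export runs; the path below is an assumption based on the error above (on a Linux node the equivalent location is typically /tmp/sqoop-<user>/compile) and is only illustrative.
# Remove Sqoop's per-user compile directory before running the export (path is an assumption).
rm -rf "/tmp/sqoop-$USER/compile"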
Sqoop import into Hive fails
Verify that the Hadoop connection information used by the local Hive installation matches the Hadoop connection information configured in the Sqoop job entry.
Group By step is not supported in a single threaded transformation engine
This error can occur if:
- An entire set of rows sharing the same grouping key is filtered from the transformation before the Group By step.
- The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.
To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.
Kettle cluster on YARN will not start
Verify in the File System Path (in the Files tab) that the Default FS setting matches the configured hostname for the HDFS NameNode, then try starting the Kettle cluster again.
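To double-check the value the cluster configuration resolves to, you can query it with the standard Hadoop client command hdfs getconf and compare the output with the Default FS entry in the Files tab.
# Prints the configured default file system, for example hdfs://namenode.example.com:8020
hdfs getconf -confKey fs.defaultFS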
Hadoop on Windows
If you are using Hadoop on Windows, you may get an "unexpected error" message. This message indicates that multiple cluster support across different versions of Hadoop is not available on Windows.
You are limited to using the same version of Hadoop for multiple cluster use on Windows. If you have problems accessing the Hadoop file system on a Windows machine, see the Problems running Hadoop on Windows article on the Hadoop Wiki site.
Legacy mode activated when named cluster configuration cannot be located
If you run a transformation or job for which PDI cannot locate and load a named cluster configuration, then PDI activates a legacy mode. This legacy, or fallback, mode is only available in Pentaho 9.0 and later.
When the legacy mode is activated, PDI attempts to run the transformation by finding any existing cluster configuration you have set up in the PDI Big Data plugin. PDI then migrates the existing configuration to the latest PDI instance that you are currently running.
Legacy mode is helpful for transformations that were built with previous versions of PDI and include individual steps not associated with a named cluster. You can run such a transformation in legacy mode without revising the cluster configuration in each individual step. For information about setting up a named cluster, see Connecting to a Hadoop cluster with the PDI client.
When legacy mode is active, the transformation log displays the following message:
Could not find cluster configuration file {0} for cluster {1} in expected metastore
locations or a legacy shim configuration.
If the Big Data plugin is present and PDI accesses it to successfully activate legacy mode, the transformation log displays the following message:
Cluster configuration not found in expected location; trying legacy
configuration location.
For more information about working with clusters, see Get started with Hadoop and PDI.
Unable to read or write files to HDFS on the Amazon EMR cluster
When running a transformation on an EMR cluster, the transformation may appear to run successfully while only an empty file is written to the cluster. If PDI is not installed on the Amazon EC2 instance where you run the transformation, you cannot read or write files on HDFS; any files written to the cluster are empty.
To resolve this issue, perform the following steps to edit the hdfs-site.xml file on the PDI client:
Procedure
- Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.
- Open the hdfs-site.xml file with any text editor.
- Add the following code:
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
- Save and close the file.
Use YARN with S3
When you use the Start a PDI cluster on YARN and Stop a PDI cluster on YARN job entries to run a transformation that reads data from an Amazon S3 bucket, the transformation fails because the Pentaho metastore is not accessible to PDI on the cluster.
Perform the following steps to make the Pentaho metastore accessible to PDI:
Procedure
- Navigate to the <user>/.pentaho/metastore directory on the machine with the PDI client.
- On the cluster where the YARN server is located, create a new directory in the design-tools/data-integration/plugins/pentaho-big-data-plugin directory, then copy the metastore directory into this location. This directory is the <NEW_META_FOLDER_LOCATION> variable.
- Navigate to the design-tools/data-integration directory and open the carte.sh file with any text editor.
- Add the line OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>" immediately before the export OPT line, then save and close the file.
- Create a zip file containing the contents of the data-integration directory.
- In your Start a PDI cluster on YARN job entry, go to the Files tab of the Properties window, then locate the PDI Client Archive field. Enter the filepath for the zip file.
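A minimal shell sketch of the copy, carte.sh edit, and zip steps above, run on the node that hosts the YARN server; every path and name below is an assumption for your environment.
# All paths and names are assumptions; adjust to your install locations.
mkdir -p design-tools/data-integration/plugins/pentaho-big-data-plugin/pdi-metastore
cp -r ~/.pentaho/metastore design-tools/data-integration/plugins/pentaho-big-data-plugin/pdi-metastore/
# In design-tools/data-integration/carte.sh, add this line immediately before the export OPT line:
#   OPT="$OPT -DPENTAHO_METASTORE_FOLDER=plugins/pentaho-big-data-plugin/pdi-metastore"
# Archive the contents of data-integration; pdi-client.zip is a placeholder name.
(cd design-tools/data-integration && zip -r ../pdi-client.zip .)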
Results
Transformations run on the PDI cluster on YARN can now read from and write to S3 using the following steps:
- Avro Input
- Avro Output
- Orc Input
- Orc Output
- Parquet Input
- Parquet Output
- Text File Input
- Text File Output
Data Catalog searches returning incomplete or missing data
If you have a transformation that contains the Catalog Input, Catalog Output, Read Metadata, or Write Metadata steps, there may be instances when a complete search of the records in Lumada Data Catalog (LDC) is not performed. This error can occur if:
- The default limit, which prevents PDI from exceeding memory limits or encountering connection timeouts to LDC, is too low for your environment.
To resolve this issue:
- Design your transformation.
- Right-click on the canvas to open the Transformation properties dialog box.
- In the Parameters tab, add the catalog-result-limit parameter.
- In the Default Value column, enter a number greater than the default value of 25, for example 500.
- Run your transformation.
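If you launch the transformation from the command line rather than the PDI client, the same parameter can be supplied at run time with Pan's -param option; the transformation filename below is hypothetical.
# catalog_search.ktr is a placeholder filename; 500 overrides the default limit of 25.
./pan.sh -file=catalog_search.ktr -param:catalog-result-limit=500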