
Big Data issues

 


Follow the suggestions in these topics to help resolve common issues when working with Big Data.

See Pentaho Troubleshooting articles for additional topics.

General configuration problems

 

The issues in this section explain how to resolve common configuration problems.

When updating to Pentaho 9.0, you are required to perform a one-time operation to update your cluster configurations to use multiple cluster features. For more information, see Set up the Pentaho Server to connect to a Hadoop cluster. If you do not update your cluster configurations, then the legacy configuration from the Big Data plugin is used.

Driver and configuration issues

 
Symptom: Could not find cluster configuration file config.properties for the cluster in expected metastore locations or a legacy shim configuration.
Common causes:
  • An incorrect cluster name was entered.
  • The named cluster configuration is missing addresses, port numbers, or security settings (if applicable).
  • The driver version setup is incorrect.
  • The Big Data plugin configuration is not valid for legacy mode.
  • The named cluster is missing cluster configuration files.
Common resolutions:
  • Verify that the cluster name is correct.
  • Verify that the cluster configuration is complete, with addresses, port numbers, and security settings (if applicable).
  • Verify that the driver version setup is correct.
  • If failing in legacy mode, update your transformations and jobs to use a named cluster definition.
  • Verify that the cluster configuration (*-site.xml) files are in the correct location. Refer to Set up the Pentaho Server to connect to a Hadoop cluster.

Symptom: Could not find service for interface associated with named cluster.
Common causes:
  • An incorrect cluster name was entered.
  • The cluster driver is not installed.
  • After updating to Pentaho 9.0, your old shims cannot be used.
  • If using a cluster driver, you may need to install a newer version.
Common resolutions:
  • Verify that the cluster name is correct.
  • Install a Pentaho 9.0 version driver for the cluster. You can find the supported versions in the Components Reference.
  • Edit your Hadoop cluster information to add the vendor and driver version that you need.
  • If using a cluster driver, update to a newer version.
  • If you are using HDP 2.6, or CDH and older versions, you cannot run Pentaho 9.0 until you update to the required cluster driver version.

Symptom: No driver.
Common causes:
  • The driver is installed in the wrong location.

Symptom: Driver does not load.
Common causes:
  • Required licenses are not installed.
  • You tried to load a shim that is not supported by your version of Pentaho.
  • Configuration file changes were made incorrectly.

Symptom: The file system's URL does not match the URL in the configuration file.
Common causes:
  • The configuration files (*-site.xml files) were not configured properly.

Symptom: Sqoop "Unsupported major.minor version" error.
Common causes:
  • In Pentaho 6.0, the Java version on your cluster is older than the Java version that Pentaho uses.
Common resolutions:
  • Verify that the JDK meets the requirements in the supported components matrix.
  • Verify that the JDK on the Pentaho Server is the same major version as the JDK on your cluster.
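
For example, you can compare the JDK major versions reported on the Pentaho Server machine and on a cluster node; the hostname below is a placeholder.

    # On the machine running the Pentaho Server or PDI client
    java -version

    # On a cluster node (example hostname); both should report the same major version
    ssh user@clusternode.example.com "java -version"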

Connection problems

 
Symptom: Hostname does not resolve.
Common causes:
  • No hostname has been specified.
  • The hostname or IP address is incorrect.
  • The hostname is not resolving properly in the DNS.
Common resolutions:
  • Verify that the hostname or IP address is correct.
  • Check the DNS to make sure the hostname is resolving properly.

Symptom: Port number does not resolve.
Common causes:
  • The port number is incorrect.
  • The port number is not numeric.
  • The port number is not necessary for HA clusters.
  • No port number has been specified.
Common resolutions:
  • Verify that the port number is correct.
  • Determine whether your cluster has been enabled for high availability (HA). If it has, then you do not need a port number. Clear the port number and retest the connection.

Symptom: Cannot connect to the cluster.
Common causes:
  • A firewall is blocking the connection.
  • Other networking issues are occurring.
  • A *-site.xml file is invalid.
  • The address or port number information is incorrect for the cluster or service you are trying to connect to.
Common resolutions:
  • Verify that a firewall is not impeding the connection and that there are no other network issues.
  • Verify that all *-site.xml files are well-formed, valid XML.
  • Verify that the addresses and port numbers for the cluster and services are correct.
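
For example, a quick way to confirm that the site files are well-formed XML is to run them through xmllint; the file names below are examples, so point it at the files in your named cluster's configuration directory.

    # Prints nothing for well-formed files and reports a parse error otherwise
    xmllint --noout core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml
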
Symptom: Windows failure message: “java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset”.
Common causes:
  • A configuration setting is required. Windows cannot locate the %HADOOP_HOME%\bin\winutils.exe file because the HADOOP_HOME environment variable is not set.
Common resolutions:
  • Follow the instructions at https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems and set the %HADOOP_HOME% environment variable to point to the directory whose bin folder contains WINUTILS.EXE.
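
For example, from a Windows Command Prompt you can set the variable and confirm that winutils.exe is where the error expects it; C:\hadoop is a placeholder for wherever you placed the Hadoop binaries.

    REM Set HADOOP_HOME for the current user, then open a new Command Prompt
    setx HADOOP_HOME "C:\hadoop"

    REM Confirm the variable resolves and the binary exists
    echo %HADOOP_HOME%
    dir "%HADOOP_HOME%\bin\winutils.exe"
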
Symptom: Cannot access a Hive database (secured clusters only).
Common resolutions:
  To access Hive, you need to set two database connection parameters using the PDI client:
  1. Open the hive-site.xml file on the Hive server host. Note the values of the kerberos.principal and sasl.qop parameters, then close the file.
  2. Start the PDI client and open the Database Connection window.
  3. Click Options, and add the sasl.qop and principal parameters. Set them to the same values as in the hive-site.xml file, then save the connection and close the window.
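
For reference, the corresponding entries in hive-site.xml often look similar to the following sketch; the full property names and values vary by distribution and security setup, so always copy the values from your own file.

    <!-- Illustrative example only; the values are placeholders -->
    <property>
      <name>hive.server2.authentication.kerberos.principal</name>
      <value>hive/_HOST@EXAMPLE.COM</value>
    </property>
    <property>
      <name>hive.server2.thrift.sasl.qop</name>
      <value>auth-conf</value>
    </property>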

Directory access or permissions issues

 
Symptom: Access error when trying to reach the user home directory.
Common causes:
  • Authorization or authentication issues.
Common resolutions:
  • Verify that you have a user account on the cluster that you are attempting to connect to.
  • Verify that the name of the cluster user account is the same as the name of the OS user account that runs Pentaho.

Symptom: Cannot access a directory.
Common causes:
  • Authorization or authentication issues.
  • The directory is not on the cluster.
Common resolutions:
  • Verify that the user has been granted read, write, and execute access to the directory.
  • Verify that the security settings for the cluster and driver allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's NameNode.

Symptom: Cannot create, read, update, or delete files or directories.
Common causes:
  • Authorization or authentication issues.
Common resolutions:
  • Verify that the user has been authorized execute access to the directory.
  • Verify that the security settings for the cluster and driver allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's NameNode.

Symptom: Test file cannot be overwritten.
Common causes:
  • The Pentaho test file is already in the directory. The test was run, but the file was not deleted. (The test file is used to verify that the user can create, write, and delete files in the user's home directory.)
  • A file with the same name as the Pentaho test file is already in the directory.
Common resolutions:
  • Manually delete the test file. Check the log for the test file name.
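
For example, you can check ownership and permissions on the user's home directory, and confirm that the account can create and delete a file there, from any host with a configured Hadoop client; the user name below is a placeholder.

    # Show owner, group, and permissions of the home directory
    hdfs dfs -ls -d /user/pdiuser

    # Confirm the account can create and then delete a file in its home directory
    hdfs dfs -touchz /user/pdiuser/pentaho-access-check
    hdfs dfs -rm /user/pdiuser/pentaho-access-check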

Oozie issues

 
Symptom: Cannot connect to Oozie.
Common causes:
  • Firewall issue.
  • Other networking issues.
  • The Oozie URL is incorrect.
Common resolutions:
  • Verify that the Oozie URL was correctly entered.
  • Verify that a firewall is not impeding the connection.
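
For example, you can confirm that the Oozie URL is reachable with a simple HTTP request; the host and port below are placeholders, so use the URL from the job entry.

    curl http://oozie.example.com:11000/oozie/v1/admin/status
    # A reachable Oozie server typically responds with {"systemMode":"NORMAL"}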

Zookeeper problems

 
Symptom: Cannot connect to Zookeeper.
Common causes:
  • A firewall is impeding the connection with the Zookeeper service.
  • Other networking issues.
Common resolutions:
  • Verify that a firewall is not impeding the connection.

Symptom: Zookeeper hostname or port not found or does not resolve properly.
Common causes:
  • The hostname/IP address or port number is missing or incorrect.
Common resolutions:
  • Try to connect to the Zookeeper nodes using ping or another method.
  • Verify that the hostname/IP address and port numbers are correct.
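
For example, you can test basic reachability of a Zookeeper node from the machine running PDI or the Pentaho Server; the hostname below is a placeholder, and 2181 is the default client port.

    ping -c 3 zookeeper1.example.com
    nc -vz zookeeper1.example.com 2181

    # Optional: the 'ruok' four-letter command answers 'imok' when the service is running
    # (recent Zookeeper releases require it to be whitelisted via 4lw.commands.whitelist)
    echo ruok | nc zookeeper1.example.com 2181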

Kafka problems

 
Symptom: Cannot connect to Kafka.
Common causes:
  • The bootstrap server information is incorrect.
  • The specified bootstrap server is down.
  • Firewall issue.
Common resolutions:
  • Verify that the bootstrap server was correctly entered.
  • Verify that the bootstrap server is running.
  • Verify that a firewall is not blocking the connection.
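
For example, you can verify that a bootstrap server is reachable and responding; the address below is a placeholder, and the second command assumes the Kafka command-line tools are installed on the machine where you run it.

    nc -vz broker1.example.com 9092
    kafka-broker-api-versions.sh --bootstrap-server broker1.example.com:9092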

Cannot access cluster with Kerberos enabled

 
If a step or entry cannot access a Kerberos authenticated cluster, review the steps in Set Up Kerberos for Pentaho.

If this issue persists, verify that the username, password, UID, and GID for each impersonated or spoofed user are the same on each node. When a user is deleted and recreated, the recreated account may have different UID and GID values, which causes this issue.
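
For example, you can compare the UID and GID of an impersonated user by running the same command on each node; the user name and output below are illustrative.

    id exampleuser
    # uid=1501(exampleuser) gid=1501(examplegroup) groups=1501(examplegroup),1001(hadoop)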

Cannot access the Hive service on a cluster

 

If you cannot use Kerberos impersonation to authenticate and access the Hive service on a cluster, review the steps in Set Up Kerberos for Pentaho.

If this issue persists, copy the hive-site.xml file on the Hive server to the configuration directory of the named cluster connection in these directories:

  • Pentaho Server

    pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

  • PDI client

    data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]
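
For example, you might copy the file to the Pentaho Server location with a command similar to the following; the Hive host and source path are assumptions, and the typical source location varies by distribution. Replace [cluster distribution] with your directory name.

    scp hiveserver.example.com:/etc/hive/conf/hive-site.xml \
        pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]/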

If the problem continues to persist, disable pooled connections for Hive.

HBase Get Master Failed error

 
If HBase cannot establish the authenticated portion of the connection, then copy the hbase-site.xml file from the HBase server to the configuration directory of the named cluster connection in these directories:
  • Pentaho Server:

    pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

  • PDI client:

    data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[cluster distribution]

Sqoop export fails

 

If a Sqoop export job generates the following error because a file already exists at the destination, then Sqoop failed to clear its compile directory:

Could not rename \tmp\sqoop-devuser\compile\1894e2403c37a663c12c752ab11d8e6a\aggregatehdfs.java to C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java. Error: Destination 'C:\Builds\pdi-ee-client-9.0.0.0-MS-550\data-integration\.\aggregatehdfs.java' already exists.

Despite the error message, the job that generated it ended successfully. To prevent this error message, you can add a Delete step to the job to remove the compile directory before the Sqoop export step runs.

Sqoop import into Hive fails

 
If a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation's configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.

Verify that the Hadoop connection information used by the local Hive installation is configured the same as in the Sqoop job entry.

Pig job not executing after Kerberos authentication fails

 
Your Pig job will not execute after the Kerberos authentication fails until you restart PDI. While PDI may continue to generate new Kerberos tickets and other Hadoop components may work, Pig continues to fail until PDI is restarted.

For authentication with Pig, Pentaho uses the UserGroupInformation wrapper around a JAAS Subject with username and password which is used for impersonation. The UserGroupInformation wrapper is stored in the KMSClientProvider constructor. When the Kerberos ticket expires, a new UserGroupInformation is created, but the instance stored in the KMSClientProvider constructor does not update. The Pig job fails when Pig cannot obtain delegation tokens to authenticate the job at execution time.

To resolve this issue, set the key.provider.cache.expiry time to a value equal to or less than the duration of the Kerberos ticket. By default, the key.provider.cache.expiry time is set to 10 days.

Note: This solution assumes you are using Hortonworks (HDP) 3.0.

Procedure

  1. Navigate to the hdfs-site.xml file location.

    • In the PDI client, navigate to: data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25
    • For the Pentaho Server, navigate to: pentaho-server\pentaho-solutions\system\kettle\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp25
  2. Open the hdfs-site.xml file in a text editor.

  3. Adjust the key.provider.cache.expiry value (in milliseconds) so that it is less than the duration time of the Kerberos ticket.

    Note: You can view the Kerberos ticket duration in the krb5.conf file.
    <property>
    <name>dfs.client.key.provider.cache.expiry</name>
    <value>410000</value>
    </property>
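
To confirm the actual lifetime of the ticket rather than reading it from the krb5.conf file, you can also inspect an active ticket with klist; the principal and timestamps below are illustrative.

    klist
    # Default principal: pentaho@EXAMPLE.COM
    # Valid starting       Expires              Service principal
    # 04/01/2024 09:00:00  04/01/2024 19:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM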

Group By step is not supported in a single threaded transformation engine

 
If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occur if:
  • An entire set of rows sharing the same grouping key are filtered from the transformation before the Group By step.
  • The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.

To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.

Kettle cluster on YARN will not start

 
When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.

Verify in the File System Path (in the Files tab) that the Default FS setting matches the configured hostname for the HDFS NameNode, then try starting the Kettle cluster again.
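
The Default FS value typically mirrors the fs.defaultFS property in the cluster's core-site.xml; the hostname and port in the following example are placeholders.

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>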

Hadoop on Windows

 

If you are using Hadoop on Windows, you may get an "unexpected error" message. This message indicates that multiple cluster support across different versions of Hadoop is not available on Windows.

You are limited to using the same version of Hadoop for multiple cluster use on Windows. If you have problems accessing the Hadoop file system on a Windows machine, see the Problems running Hadoop on Windows article on the Hadoop Wiki site.

Legacy mode activated when named cluster configuration cannot be located

 

If you run a transformation or job for which PDI cannot locate and load a named cluster configuration, then PDI activates a legacy mode. This legacy, or fallback, mode is only available in Pentaho 9.0 and later.

When the legacy mode is activated, PDI attempts to run the transformation by finding any existing cluster configuration you have set up in the PDI Big Data plugin. PDI then migrates the existing configuration to the latest PDI instance that you are currently running.

 

Note: In legacy mode, you cannot connect to more than one cluster.

Legacy mode is helpful for transformations that were built with previous versions of PDI and that include individual steps not associated with a named cluster. You can run such a transformation in legacy mode without revising the cluster configuration in each individual step. For information about setting up a named cluster, see Connecting to a Hadoop cluster with the PDI client.

When legacy mode is active, the transformation log displays the following message:

Could not find cluster configuration file {0} for cluster {1} in expected metastore locations or a legacy shim configuration.

If the Big Data plugin is present and PDI accesses it to successfully activate legacy mode, the transformation log displays the following message:

Cluster configuration not found in expected location; trying legacy configuration location.

For more information about working with clusters, see Get started with Hadoop and PDI.

Unable to read or write files to HDFS on the Amazon EMR cluster

 

When PDI is not installed on the Amazon EC2 instance where you are running your transformation, you cannot read or write files to HDFS on the EMR cluster. The transformation appears to run successfully, but any files written to the cluster are empty.

To resolve this issue, perform the following steps to edit the hdfs-site.xml file on the PDI client:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.

  2. Open the hdfs-site.xml file with any text editor.

  3. Add the following code:

    <property>
         <name>dfs.client.use.datanode.hostname</name>
         <value>true</value>
    </property>
    
  4. Save and close the file.

Use YARN with S3

 

When you use the Start a PDI cluster on YARN and Stop a PDI cluster on YARN job entries to run a transformation that reads data from an Amazon S3 bucket, the transformation fails because the Pentaho metastore is not accessible to PDI on the cluster.

Perform the following steps to make the Pentaho metastore accessible to PDI:

Procedure

  1. Navigate to the <user>/.pentaho/metastore directory on the machine with the PDI client.

  2. On the cluster where the Yarn server is located, create a new directory in the design-tools/data-integration/plugins/pentaho-big-data-plugin directory, then copy the metastore directory into this location. This directory is the <NEW_META_FOLDER_LOCATION> variable.

  3. Navigate to the design-tools/data-integration directory and open the carte.sh file with any text editor.

  4. Add the following line before the export OPT line: OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>". Then save and close the file (see the sketch after these steps).

  5. Create a zip file containing the contents of the data-integration directory.

  6. In your Start a PDI cluster on YARN job entry, go to the Files tab of the Properties window, then locate the PDI Client Archive field. Enter the filepath for the zip file.
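
As a sketch, after step 4 the relevant portion of carte.sh looks similar to the following; the surrounding lines are illustrative, and <NEW_META_FOLDER_LOCATION> is the directory you created in step 2.

    # ...existing carte.sh contents...
    OPT="$OPT -DPENTAHO_METASTORE_FOLDER=<NEW_META_FOLDER_LOCATION>"
    export OPT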

Results

This task resolves S3 access issues for the following transformation steps:
  • Avro Input
  • Avro Output
  • Orc Input
  • Orc Output
  • Parquet Input
  • Parquet Output
  • Text File Input
  • Text File Output