Troubleshooting AEL

Follow the suggestions in these topics to help resolve common issues with running transformations with the Adaptive Execution Layer.

Steps cannot run in parallel

If you are using the Spark engine to run a transformation that contains a step that cannot run in parallel, errors are generated in the log.

Some steps cannot run in parallel (on multiple nodes in a cluster), and will produce unexpected results. However, these steps can run as a coalesced dataset on a single node in a cluster. To enable a step to run as a coalesced dataset, add the step ID as a property value in the configuration file for using the Spark engine.

Get the step ID

Each PDI step has a step ID, a globally unique identifier of the step. Use one of the following methods to get the ID of a step:

Method 1: Retrieve the ID from the PDI client

You can retrieve a step ID through the PDI client with the following steps:

Procedure

  1. From the menu bar in the PDI client, select Tools > Show plugin information.

    The Plugin browser appears.
  2. Select Step in the Plugin type menu to filter by step name, and find your step name in the table to obtain the related ID.

Method 2: Retrieve the ID from the log

You can retrieve a step ID through the PDI client logs with the following steps:

Procedure

  1. In the PDI client, create a new transformation and add the step to the transformation.

    For example, if you needed to know the ID for the Select values step, you would add that step to the new transformation.
  2. Set the log level to debug.

  3. Execute the transformation using the Spark engine.

    The step ID displays in the Logging tab of the Execution Results pane. For example, the log displays Selected the SelectValues step to run in parallel as a GenericSparkOperation, where SelectValues is the step ID.

Method 3: Retrieve the ID from the PDI plugin registry

If you are a developer, you can retrieve the step ID from the PDI plugin registry as described in Dynamically build transformations.

Note: If you have created your own PDI transformation step plugin, the step ID is one of the annotation attributes that you supply as the plugin developer.

Add the step ID to the configuration file

The configuration file, org.pentaho.pdi.engine.spark.cfg, contains the forceCoalesceSteps property. The property is a pipe-delimited listing of all the IDs for the steps that should run with a coalesced dataset. Pentaho supplies a default set to which you can add IDs for steps that generate errors.

Perform the following steps to add another step ID to the configuration file:

Procedure

  1. Navigate to the data-integration/system/karaf/etc folder on the edge node running the AEL daemon and open the org.pentaho.pdi.engine.spark.cfg file.

  2. Append your step ID to the forceCoalesceSteps property value list, using a pipe character separator between the step IDs, as shown in the example after this procedure.

  3. Save and close the file.
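
For example, if you appended the SelectValues step ID mentioned earlier, the edited property might look like the following sketch, where ExistingStepId1 and ExistingStepId2 stand in for the default IDs that Pentaho supplies:

# org.pentaho.pdi.engine.spark.cfg (the IDs shown are placeholders)
# Step IDs are pipe-delimited; the new ID is appended at the end.
forceCoalesceSteps=ExistingStepId1|ExistingStepId2|SelectValues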

Force coalesce and Spark tuning

Any step ID added to the forceCoalesceSteps property in the org.pentaho.pdi.engine.spark.cfg configuration file is forced to run as a coalesced dataset. If the stepTuningOverrideForceCoalesceList setting in the application.properties file is set to true, then step tuning takes precedence over force coalesce.
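
For example, a minimal sketch of the override setting in the application.properties file, assuming you want step-level tuning to take precedence over the force coalesce list:

# application.properties on the AEL daemon
# When true, step tuning overrides the force coalesce list.
stepTuningOverrideForceCoalesceList=true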

Table Input step fails

If you run a transformation using the Table Input step with a large database, the step does not complete. Use one of the following methods to resolve the issue:

Method 1: Load the data to HDFS before running the transformation

  1. Run a different transformation using the Pentaho engine to move the data to the HDFS cluster.

  2. Then use HDFS Input to run the transformation using the Spark engine.

Method 2: Increase the driver side memory configuration

Procedure

  1. Navigate to the config/ folder and open the application.properties file.

  2. Increase the value of the sparkDriverMemory parameter, then save and close the file.
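
For example, the entry might look like the following sketch; the 8g value is only illustrative, so size it for your data volume and cluster resources:

# application.properties (illustrative value)
sparkDriverMemory=8g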

Method 3: Adjust JDBC tuning options

  1. Right-click on the Table Input step in the PDI client canvas, then select Spark tuning parameters from the menu.

    The Spark tuning parameters dialog box appears. See Setting PDI step Spark tuning options for instructions.
  2. Adjust the JDBC tuning options as needed.

    See JDBC tuning options for details.

User ID below minimum allowed

If you are using the Spark engine in a secured cluster and an error about minimum user ID occurs, the user ID of the proxy user is below the minimum user ID required by the cluster. See Cloudera documentation for details.

To resolve, change the ID of the proxy user to be higher than the minimum user ID specified for the cluster.
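
For example, you can check the numeric ID of the proxy user with the id command; the user name shown here is illustrative:

# Display the numeric user ID of the proxy user
id -u devuser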

Hadoop version conflict

On an HDP cluster, if you receive the following message, your Hadoop library is in conflict and the AEL daemon along with the PDI client might stop working:

command hdp-select is not found, please manually export HDP_VERSION in spark-env.sh or current environment

To resolve the issue, you must export the HDP_VERSION variable using a command like the following example:

export HDP_VERSION=${HDP_VERSION:-2.6.0.3-8}

The HDP version number should match the HDP version number of the distribution on the cluster. You can check your HDP version with the hdp-select status hadoop-client command.
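
For example, you might check the installed version and then export a matching value before starting the AEL daemon; the version number shown is only illustrative:

# Check the HDP version of the hadoop-client on the cluster
hdp-select status hadoop-client
# Export a matching HDP_VERSION (illustrative value)
export HDP_VERSION=${HDP_VERSION:-2.6.0.3-8}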

Hadoop libraries are missing

The Spark libraries packaged with the EMR, Cloudera, and Hortonworks distributions do not include the Hadoop libraries, so you must add the Hadoop libraries to the classpath with the SPARK_DIST_CLASSPATH environment variable. For EMR, these libraries are required to access S3 resources.

Add the class path

The following command will add the libraries to the classpath:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

You can add this command to the daemon.sh file so you do not have to run it every time you start the AEL daemon.
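
For example, a minimal sketch of the addition to daemon.sh, assuming the hadoop command is on the PATH of the user that starts the daemon:

# In data-integration/adaptive-execution/daemon.sh (placement is illustrative)
# Add the Hadoop libraries to the Spark classpath each time the daemon starts.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)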

Set Spark home variable

If you are using the Spark client from the Cloudera or Hortonworks Hadoop distributions, you may also receive the following log error:
Exception in thread "main" java.lang.NoSuchFieldError: TOKEN_KIND

If you received this log error, you must also complete the following steps for your Hadoop distribution:

Procedure

  1. Download the Spark client for your Hadoop cluster distribution (Cloudera or Hortonworks).

  2. Navigate to the adaptive-execution/config directory and open the application.properties file.

  3. Set the sparkHome location to where Spark 2 is located on your machine.

Example for Cloudera:

sparkHome=/opt/cloudera/parcels/SPARK2/lib/spark2

Example for Hortonworks:

sparkHome=/opt/horton/SPARK2/lib/spark2

Spark libraries conflict with Hadoop libraries

In some cases, library versions contained in JARs from PDI, Spark, Hadoop, AEL, and/or Kettle plugins may conflict with one another, causing general problems where Spark libraries conflict with Hadoop libraries and potentially creating AEL-specific problems. To read more about this issue, including how to address it, see the article AEL and Spark Library Conflicts on the Pentaho Community Wiki.

Failed to find AVRO files

If you are using the Spark engine with an EMR cluster, you may receive the following error message when trying to access AVRO files:

Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages
        at http://spark.apache.org/third-party-projects.html

The libraries needed for accessing AVRO files on an EMR cluster are not included in the default Spark classpath. You must add them to the AEL daemon extra/ directory.

To resolve the issue, copy the vendor-supplied data source JAR libraries, such as spark-avro_2.11_2.4.2.jar, from the /usr/lib/spark/external/lib/ directory to the AEL extra/ directory on the daemon, as shown in the following example:

cp /usr/lib/spark/external/lib/spark-avro_2.11_2.4.2.jar
        $AEL_DAEMON_DIRECTORY/data-integration/adaptive-execution/extra/

Unable to access Google Cloud Storage resources

You might receive an error message when trying to access Google Cloud Storage (GCS) resources. URIs starting with gs://, such as gs://mybucket/myobject.parquet, require specific cluster or AEL configurations.

To resolve the issue, see Google Cloud Storage for instructions.

Unable to access AWS S3 resources

You might receive an error message when trying to access AWS S3 resources. URIs starting with s3://, s3n://, or s3a://, such as s3://mybucket/myobject.parquet, require specific cluster configurations.

To resolve the issue for an EMR cluster, see Hadoop libraries are missing for instructions.

To resolve the issue for a Cloudera or Hortonworks cluster, see the vendor-specific cluster documentation for details.

JAR file conflict in Kafka Consumer step

When using the Kafka Consumer step with HDP 3.x on AEL Spark, there is a known conflict with the JAR file /usr/hdp/3.x/hadoop-mapreduce/kafka-clients-0.8.2.1.jar.

Use one of the following solutions to resolve the JAR conflict.

  • On HDP 3.x, do not set the SPARK_DIST_CLASSPATH variable before running the Adaptive Execution Layer daemon. Otherwise, there may be issues in other AEL components.
  • Exclude the JAR file from the path on SPARK_DIST_CLASSPATH with the spark-dist-classpath.sh script. Create the script with any text editor and include the following code:
    #!/bin/sh
    ##
    ## helper script for setting up SPARK_DIST_CLASSPATH for AEL
    ## removes conflicting JAR files existing in HDP 3.x
    ## Usage: call this the same way you use the hadoop classpath command, i.e.:
    ## export SPARK_DIST_CLASSPATH=$(spark-dist-classpath.sh)
    
    # grab hadoop classpath
    HCP=`hadoop classpath`
    
    ## expand it to grab all jar files
    (
      for entry in `echo "$HCP" | sed -e 's/:/\n/g'` ; do
         ## clean up dirs ending with *
         entryCleaned=`echo "$entry" | sed -e 's/\*$//'`
         ## if dir, expand it
         if test -d "$entryCleaned" ; then
           find "$entryCleaned"
         else
           echo "$entry"
         fi 
      done 
    ) | grep -v kafka-clients-0.8.2.1.jar |  paste -s -d: 
    
    exit
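
After saving the script, you might use it as shown in the following sketch; the script location is illustrative, and the daemon start command is the same one shown later in this article:

# Make the helper script executable (illustrative path)
chmod +x /opt/pentaho/spark-dist-classpath.sh
# Build the classpath without the conflicting Kafka client JAR
export SPARK_DIST_CLASSPATH=$(/opt/pentaho/spark-dist-classpath.sh)
# Start the AEL daemon from data-integration/adaptive-execution
./daemon.sh start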
    

Internet Address data type fails

When running an AEL transformation using an input step with the data type 'Internet Address' selected for a URL field, your transformation may not complete properly.

When you are using the Spark engine to run an AEL transformation, do not use the data type 'Internet Address' when entering a URL in a step. Instead, use the data type 'String' for the URL.

Message size exceeded

If you are using the Spark engine to run an AEL transformation and an error is generated indicating a decoded message was too big for the output buffer, you need to increase the maximum size (2 MB by default) of the message buffers for your AEL environment.

Perform the following steps to increase the message buffer limit:

Procedure

  1. Stop the AEL daemon.

  2. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file using a text editor.

  3. Enter the following incoming WebSocket message buffer properties, setting the same value for each property:

    Property: daemon.websocket.maxMessageBufferSize
    Value: The maximum size (in bytes) for the message buffer on the AEL daemon. For example, to allocate a 4 MB limit, set the value as shown here:
    daemon.websocket.maxMessageBufferSize=4000000

    Property: driver.websocket.maxMessageBufferSize
    Value: The maximum size (in bytes) for the message buffer on the AEL Spark driver. For example, to allocate a 4 MB limit, set the value as shown here:
    driver.websocket.maxMessageBufferSize=4000000
  4. Save and close the file.

  5. Restart the AEL daemon.
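
Assuming daemon.sh supports a stop option in addition to the start and status options shown elsewhere in this article, stopping and restarting the daemon might look like the following sketch:

# From the AEL installation on the daemon node
cd data-integration/adaptive-execution
./daemon.sh stop
# ...edit config/application.properties as described above, then start the daemon again...
./daemon.sh start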

Results

When the AEL daemon submits the AEL Spark driver application, it passes the driver’s maximum message buffer size value as part of the submit; then, when the driver application is started, it receives the maximum buffer size value sent by the daemon.

Spark SQL catalyst errors using the Merge or Group By steps

If you are using the Spark engine with the Merge Rows (diff), Merge Join, or Group By step, you might receive an error similar to the following message:

org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)

Field names for join keys (values to compare or group) cannot contain special characters, such as whitespace characters or dashes.

To resolve the issue, remove the special characters from the field names within your transformation.

Performance or memory issues

If you experience performance or memory issues while running your PDI transformation on the Spark engine, your transformation may not be efficiently using Spark execution resources.

To resolve or minimize the issue, apply and adjust application and PDI step Spark tuning parameters. See About Spark tuning in PDI for details.

Multiple steps in a transformation cannot generate files to the same location

If your transformation contains multiple steps that generate output files to the same destination folder, the files or the data within them might be missing.

Spark requires unique names for the files and folders generated by each step. To resolve this issue, send the files from each step to unique folders with unique filenames.

Cannot read footer in a Spark file

You might receive a “Could not read footer for file” error message from Spark when trying to access your data file while running on the Spark engine. This error occurs when Spark does not have an option for reading footer information from an input file. See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#csv-java.lang.String- for more information.

Perform the following steps to work around this issue:

Procedure

  1. Create a new PDI transformation containing either the Hadoop File Input and Hadoop File Output steps or the Text File Input and Text File Output steps.

  2. Use the footer option in the Content tab of the Hadoop File Input or Text File Input step to specify your footer data. See Using the Hadoop File Input step on the Pentaho engine or Using the Text File Input step on the Pentaho engine for details on the Content tab.

  3. Verify the Footer option in the Content tab of the Hadoop File Output or Text File Output step has been cleared so the data is not written out as a footer. See Using the Hadoop File Output step on the Pentaho engine or Using the Text File Output step on the Pentaho engine for details on the Content tab.

  4. Save the transformation and run it on the Pentaho engine.

Results

You can now read the file resulting from the output step in either Spark or AEL.

Errors when using Hive and AEL on a Hortonworks cluster

You may receive “Class not found” and “Hive database does not exist” error messages when running Spark with AEL and using Hive on a Hortonworks cluster.

Class not found exception

The "ClassNotFoundException" may occur in the AEL daemon log when using Hive and AEL on Hortonworks. This exception occurs when the HiveWarehouseSession class is not recognized by the daemon.

[2019-12-02 18:56:38.038] [INFO ] org.pentaho.di.engine.api.remote.ExecutionException:
[2019-12-02 18:56:38.038] [INFO ] java.lang.ClassNotFoundException: com.hortonworks.hwc.HiveWarehouseSession
[2019-12-02 18:56:38.038] [INFO ] com.hortonworks.hwc.HiveWarehouseSession

To resolve this issue, create a symbolic link in the extra/ directory to the Hive Warehouse Connector Assembly and restart the daemon, as shown in the following example:

cd data-integration/adaptive-execution/extra
ln -s /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar

Hive database does not exist

The "Hivedb does not exist" exception may occur in the AEL daemon log when using Hive and AEL on Hortonworks. To clear this exception, you must set extra properties for the AEL daemon.

2019/12/04 16:42:01 - Table input.0 - ERROR (version 9.0.0.0-332, build 9.0.0.0-332 from
      2019-11-25 11.19.55 by buildguy) : org.pentaho.di.engine.api.remote.ExecutionException: hivedb
      does not exist. Check your Hive in application.properties.

To resolve this issue, set the required properties as shown in the following example:

# AEL Spark Hive Property Settings
enableHiveConnection=true
spark.driver.extraClassPath=/usr/hdp/current/spark2-client/conf/
spark.executor.extraClassPath=/usr/hdp/current/spark2-client/conf/
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://hito31-n3.cs1cloud.internal:2181,hito31-n2.cs1cloud.internal:2181,hito31-n1.cs1cloud.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
spark.datasource.hive.warehouse.metastoreUri=thrift://hito31-n2.cs1cloud.internal:9083
spark.datasource.hive.warehouse.load.staging.dir=/user/devuser/tmp
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=hito31-n3.cs1cloud.internal:2181,hito31-n2.cs1cloud.internal:2181,hito31-n1.cs1cloud.internal:2181

You can find the value for each of these keys in the Ambari configuration for your cluster.

Errors when using Hive and AEL on an Amazon EMR cluster

You may receive date format, execution, or missing Hive database errors when running Spark with AEL while using Hive on an Amazon EMR cluster.

Date format does not work with the Table Output step

If an error occurs while you are trying to insert date data into a Hive table, the data values may not be in the correct format.

To resolve this issue, verify that your date values are in the YYYY-MM-DD format.

Transformation does not complete execution

If you are using the Hive 2/3 connector with both the Table Input step and the Table Output step in the same transformation, your transformation does not complete its execution.

To resolve this issue, split the original transformation into two new transformations, where one contains the Table Input step and the other contains the Table Output step. You can control the order of execution with a PDI job. If you try to use a child transformation to control the execution order, the same issue occurs.

See Using Table input to Table output steps with AEL for managed tables in Hive for further instructions.

Hive database does not exist

If the following exception occurs in the AEL daemon log when using Hive and AEL on Amazon EMR, you must set extra properties for the AEL daemon.

2019/12/04 16:42:01 - Table input.0 - ERROR (version 9.0.0.0-332, build 9.0.0.0-332 from
      2019-11-25 11.19.55 by buildguy) : org.pentaho.di.engine.api.remote.ExecutionException: hivedb
      does not exist. Check your Hive in application.properties.

To resolve this issue, set the enableHiveConnection property in the application.properties file to true and verify that the extraClassPath property is set as shown in the following example:

# AEL Spark Hive Property Settings
enableHiveConnection=true
spark.driver.extraClassPath=/etc/spark/conf.dist/
spark.executor.extraClassPath=/etc/spark/conf.dist/

Restart the AEL daemon after you have saved the above changes.

Driver timeout and deployment errors with AEL on secured clusters

You may receive driver timeout and web socket deployment error messages when running Spark with AEL on a secured cluster.

Before trying to address any issues while running Spark with AEL on a secured cluster, verify the following:

  • The cluster has been secured by your cluster administrator.
  • The user associated with the AEL daemon has a valid Kerberos ticket.
If the expiration date and time returned by the klist command is earlier than the date and time returned by the date command, you must obtain a new ticket by using the kinit command. Contact your cluster administrator to determine what kinit approach was set up on your system.
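
For example, you might compare the ticket expiration against the current time and renew the ticket if needed; the principal shown is illustrative, and your site may use a keytab instead:

# Show the cached Kerberos ticket and its expiration time
klist
# Show the current date and time for comparison
date
# Obtain a new ticket if the cached one has expired (illustrative principal)
kinit devuser@EXAMPLE.COM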

Driver session timeout

The AEL daemon log may contain the error "Server not found in Kerberos database", and the PDI client log may contain the following text:

2020/01/22 13:33:00 - HiveSmokeTest - Finalizing execution: Driver Session Timeout Expired
2020/01/22 13:33:00 - Spoon - The transformation has finished!!

This issue occurs because the websocketURL property is not set to a fully qualified host name for the node running the AEL daemon. To resolve this issue, obtain the fully qualified host name by using the hostname command, as shown in the following example:

[devuser@hito31-n2 adaptive-execution]$ hostname -f
hito31-n2.cs1cloud.internal

Then, set the resulting host name to the websocketURL property, as shown in the following example:

websocketURL=ws://hito31-n2.cs1cloud.internal:${ael.unencrypted.port}

Web socket deployment exception

You might execute a transformation and it completes immediately with the following error:
2020/01/22 14:40:01 - Spoon - Started the transformation execution.
2020/01/22 14:40:02 - Spoon - The transformation has finished!!
2020/01/22 14:40:02 - Spoon - ERROR (version 9.0.0.0-387, build 9.0.0.0-387 from 2020-01-09 11.20.10 by buildguy) : Error starting step threads
2020/01/22 14:40:02 - Spoon - ERROR (version 9.0.0.0-387, build 9.0.0.0-387 from 2020-01-09 11.20.10 by buildguy) : org.pentaho.di.core.exception.KettleException: 
2020/01/22 14:40:02 - Spoon - javax.websocket.DeploymentException: Connection failed.
2020/01/22 14:40:02 - Spoon - Connection failed.

This issue occurs because you do not have access to the AEL daemon, the daemon is not running, or the Spark host URL setting in your run configuration is not correct.

Perform the following workflow to resolve this issue.

Procedure

  1. Verify that you can access the node running the AEL daemon.

    If you cannot access the node, contact your cluster administrator for further instructions.
  2. If you can access the node, verify that the AEL daemon is active (running) by using the status option while executing the daemon command from the data-integration/adaptive-execution directory, as shown in the following example:

    ./daemon.sh status
  3. If the daemon is not active, enter the start option of the daemon command, as shown in the following example:

    ./daemon.sh start
  4. If you can access the node and the daemon is active, but the error still occurs, verify that the port number specified in the Spark host URL option of your run configuration for the Spark engine matches the ael.unencrypted.port property in the application.properties file.

    Also, verify that the hostname or IP address in the URL matches the hostname property in the application.properties file.
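
For example, on the node running the AEL daemon, you might confirm both values with a quick search of application.properties; the grep pattern is only a convenience:

# From the data-integration directory on the AEL daemon node
grep -E 'ael\.unencrypted\.port|hostname' adaptive-execution/config/application.properties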

Steps cannot run with Spark on AEL

Certain Flow category transformation steps do not run with Spark on AEL. When you include one of these steps in a transformation or job with Spark on AEL, the KTR or KJB may fail to execute without logging an error.

To avoid this issue, do not use these orchestration steps with Spark on AEL. If you need to organize the flow of your PDI transformations with Spark, the best practice is to use the Pentaho Server to orchestrate the KTRs using a Pentaho engine job.