HBase setup for Spark
The HBase Input and HBase Output steps can run on Spark with the Adaptive Execution Layer (AEL). These steps can be used with the supported versions of Cloudera Distribution for Hadoop (CDH) and Hortonworks Data Platform (HDP). To read or write data to HBase, you must have an HBase target table on the cluster. If one does not exist, you can create one using HBase shell commands.
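For example, you can create a simple target table from the HBase shell on a cluster node. The table name pentaho_target and column family cf below are placeholders; substitute the names your transformation expects:
hbase shell
create 'pentaho_target', 'cf'
exit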
This article explains how you can set up the Pentaho Server to run these steps.
Set up the application properties file
Perform the following steps to set up the application.properties file:
Procedure
Navigate to the design-tools/data-integration/adaptive-execution/config folder and open the application.properties file with any text editor.
Set the value of the hbaseConfDir property to the location of your hbase-site.xml file.
Set the value of the extraLib property to the location of the vendor-specific JARs.
The default value is ./extra.
Save and close the file.
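After these edits, the relevant lines in application.properties might look like the following. The hbase-site.xml path is only an example; use the path that matches your cluster configuration:
hbaseConfDir=/etc/hbase/conf
extraLib=./extra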
Set up the vendor-specific JARs
Perform the following steps to set up the vendor-specific JARs:
Procedure
Navigate to the design-tools/data-integration/adaptive-execution/extra directory and delete the three HBase JAR files.
Navigate to the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations directory and locate your CDH or HDP distribution folder.
Locate the lib/pmr directory in your distribution folder.
Copy the six HBase JAR files, along with the metrics-core file, to the design-tools/data-integration/adaptive-execution/extra folder.
To complete your setup, you must restart the AEL daemon.
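The copy steps above can also be scripted. The following sketch is illustrative rather than definitive: it assumes you start from the design-tools/data-integration directory, that your distribution folder is named cdh61 (substitute your own CDH or HDP folder name), and that the JAR names match the glob patterns shown:
cd adaptive-execution/extra
rm hbase-*.jar    # remove the three stock HBase JARs
cp ../../plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/lib/pmr/hbase-*.jar .    # copy the six HBase JARs
cp ../../plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/lib/pmr/metrics-core*.jar .    # copy the metrics-core JAR
After copying the files, restart the AEL daemon as described above.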
Using HBase steps with Amazon EMR 5.21
To use the HBase Input and HBase Output steps with EMR 5.21, you must add the following parameter:
spark.hadoop.validateOutputSpecs=false
You can use any of these methods to set the parameter:
- Specify the parameter in the properties file
- Specify the parameter in Transformation properties
- Specify the parameter as an environment variable in PDI
For more information about the properties file and processing Spark parameters, see Specify additional Spark properties.
Specify the parameter in the properties file
Perform the following steps to specify the parameter in the application.properties file:
Procedure
Navigate to the design-tools/data-integration/adaptive-execution/config folder and open the application.properties file with any text editor.
Find the section labeled #Base Configuration.
Add the following parameter:
spark.hadoop.validateOutputSpecs=false
Save and close the file.
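After this edit, that section of the file might look like the following. The sparkMaster line is shown only as an example of an existing entry in that section; the contents of your file will differ:
#Base Configuration
sparkMaster=yarn
spark.hadoop.validateOutputSpecs=false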
Next steps
For more information about the application.properties file, see Specify additional Spark properties.
Specify the parameter in Transformation properties
Procedure
Double-click anywhere on the transformation canvas.
The Transformation properties dialog box appears.
Click the Parameters tab and enter the following information:
In the Parameter column, type spark.hadoop.validateOutputSpecs.
In the Default Value column, type false.
(Optional) Add a descriptive note about why the parameter is included.
Click OK to activate the parameter.
You can verify it is active in the transformation logging.
Specify the parameter as an environment variable in PDI
Procedure
From the Edit menu, select Set Environment Variables.
The Set Environment Variables table appears.
Enter the following information:
In the Name column, type spark.hadoop.validateOutputSpecs.
In the Value column, type false.
Click OK to activate the parameter.
You can verify it is active in the transformation logging.
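As an alternative to setting the variable in each session, PDI also loads variables from the kettle.properties file in the .kettle directory at startup, so you can make the setting persistent there. The path below assumes the default location in your home directory:
# ~/.kettle/kettle.properties
spark.hadoop.validateOutputSpecs=false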