HBase Setup for Spark
The HBase Input and HBase Output steps can run on Spark with the Adaptive Execution Layer (AEL). These steps can be used with the supported versions of Cloudera Distribution for Hadoop (CDH) and Hortonworks Data Platform (HDP). To read or write data to HBase, you must have an HBase target table on the cluster. If one does not exist, you can create one using HBase shell commands.
Due to a Cloudera limitation, the HBase Input step fails when Spark runs in YARN mode with Kerberos.
This article explains how you can set up the Pentaho Server to run these steps.
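To use the HBase steps with Spark, perform the two tasks described in the following sections: set up the application.properties file, and then set up the vendor-specific JARs.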
Set Up the Application Properties File
You must set up the application.properties file so that Spark jobs on AEL can access the hbase-site.xml file from the HDFS cluster. This setup enables Spark jobs to connect to HBase from the Spark executors. You must also specify the location of the vendor-specific JARs described below so they can be loaded on the classpath.
Perform the following steps to set up the application.properties file:
- Navigate to the design-tools/data-integration/adaptive-execution/config folder and open the application.properties file with any text editor.
- Set the value of the hbaseConfDir property to the location of your hbase-site.xml file.
- Set the value of the extraLib property to the location of the vendor-specific JARs. The default value is ./extra.
- Save and close the file.
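If you prefer to script these edits, the following is a minimal Python sketch rather than an official Pentaho utility. It rewrites property values in the text of a properties file and assumes simple key=value lines. The function name set_properties and the path /path/to/hbase/conf are hypothetical placeholders; only the hbaseConfDir and extraLib property names come from the steps above.

```python
def set_properties(text, updates):
    """Return properties text with each key in `updates` set to its new value.

    Assumes simple `key=value` lines. Comments and unrelated lines are kept
    unchanged, and keys that are not already present are appended at the end.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Only lines containing "=" are treated as property assignments.
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else None
        if key in remaining and not stripped.startswith(("#", "!")):
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any requested keys that were not found in the file.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file content and placeholder path, for illustration only.
    sample = "# AEL settings\nhbaseConfDir=\nextraLib=./extra\n"
    print(set_properties(sample, {"hbaseConfDir": "/path/to/hbase/conf"}))
```

To apply it to the real file, you would read application.properties into a string, pass it through the function, and write the result back after making a backup copy.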
Set Up the Vendor-Specific JARs
Each vendor handles byte conversion for HBase differently, so you must use the JAR files that match your Hadoop distribution.
Vendor-specific JARs for HBase are not shipped with Spark or HDFS.
Perform the following steps to set up the vendor-specific JARs:
- Navigate to the design-tools/data-integration/adaptive-execution/extra directory and delete the three hbase JAR files.
- Navigate to the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations directory and locate your CDH or HDP distribution folder.
- Locate the lib/pmr directory in your distribution folder.
- Copy the six hbase files, along with the metrics-core file, to the design-tools/data-integration/adaptive-execution/extra folder.
- To complete your setup, restart the AEL daemon.
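The delete and copy steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: the function name swap_hbase_jars is hypothetical, and the file name patterns hbase*.jar and metrics-core*.jar are assumptions about how the JARs are named in your distribution. Verify the actual file names in the lib/pmr directory before relying on it. The sketch does not restart the AEL daemon.

```python
import shutil
from pathlib import Path


def swap_hbase_jars(pmr_dir, extra_dir):
    """Replace the hbase JARs in extra_dir with the vendor JARs from pmr_dir.

    Returns the sorted names of the files that were copied.
    The glob patterns below are assumed naming conventions.
    """
    pmr, extra = Path(pmr_dir), Path(extra_dir)

    # Delete the hbase JARs currently in the extra folder.
    for old in list(extra.glob("hbase*.jar")):
        old.unlink()

    # Copy the vendor hbase JARs and the metrics-core JAR.
    copied = []
    for pattern in ("hbase*.jar", "metrics-core*.jar"):
        for jar in sorted(pmr.glob(pattern)):
            shutil.copy2(jar, extra / jar.name)
            copied.append(jar.name)
    return sorted(copied)
```

You would call it with the lib/pmr directory of your CDH or HDP distribution folder as the first argument and the adaptive-execution/extra folder as the second. Checking that the returned list holds the expected seven file names is a quick sanity test before restarting the daemon.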