Hitachi Vantara Lumada and Pentaho Documentation

Using a Transformation Step to Load Data into HBase

To follow along with this tutorial, you will need:
  • Hadoop
  • Pentaho Data Integration
  • HBase

This tutorial shows how to load data from a sample flat file into an HBase table using a PDI transformation. For brevity, you will use a prepared sample dataset and a simple transformation to prepare your data for loading into HBase.

If they are not already running, start Hadoop, PDI, and HBase. Unzip the sample data files and put them in a convenient location.

  1. Create an HBase table.
    1. Open the HBase shell by entering hbase shell at the command line.
    2. Create the table in HBase by entering create 'weblogs', 'pageviews' in the HBase shell. This creates a table named weblogs with a single column family named pageviews.
    3. Close the HBase shell by entering quit.
  2. From within Spoon, create a new transformation by selecting File > New > Transformation.
  3. Identify the source from which the transformation will get its data. For this tutorial, your source is a text file (.txt). From the Input folder of the Design palette on the left, add a Text file input step to the transformation by dragging it onto the canvas.
  4. Edit the properties of the Text file input step by double-clicking the icon. The Text file input dialog box appears.
  5. From the File tab, in the File or Directory field, click Browse and navigate to the weblog_hbase.txt file. Click Add.

    The file appears in the Selected files pane.

  6. Configure the contents of the file by switching to the Content tab.
    1. For Separator, clear the contents and click Insert TAB.
    2. Check the Header checkbox.
    3. For Format, select Unix from the drop-down menu.
  7. Configure the input fields.
    1. From the Fields tab, select Get Fields to populate the list of available fields.
    2. A dialog box appears asking for Number of sample lines. Enter 100 and click OK.
    3. Change the Type of the field named key to String and set the Length to 20.

    Click OK to close the window.
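
The settings chosen for the Text file input step (tab separator, header row, string-typed key) can be pictured with a minimal Python sketch. This is only an illustration of the parsing rules, not how PDI reads the file. Only the key field is named in this tutorial; the other columns and all sample values below are hypothetical, as is the function name read_tab_separated.

```python
import csv
import io

# Hypothetical sample data: only the "key" field is named in the tutorial;
# the other columns and values are invented for illustration.
SAMPLE = (
    "key\tpage\thits\n"
    "row-0001\t/index.html\t12\n"
    "row-0002\t/about.html\t3\n"
)


def read_tab_separated(handle):
    """Return the data lines as dicts, using the header line for field names."""
    # delimiter="\t" mirrors the TAB separator; DictReader consumes the
    # first line as the header, like the Header checkbox.
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


rows = read_tab_separated(io.StringIO(SAMPLE))
for row in rows:
    # Every value, including the key, is read as a string.
    print(row["key"], row)
```

To try this on a real file, pass an open file handle instead of the in-memory sample, for example open(path, newline="").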

  8. On the Design palette, under Big Data, drag the HBase Output step onto the canvas. Create a hop to connect your input step to the HBase Output step by hovering over the input step and clicking the output connector, then dragging the connector arrow to the HBase Output step.
  9. Edit the HBase Output step by double-clicking it. You must now enter your Zookeeper host(s) and port number.
    1. For the Zookeeper host(s) field, enter a comma-separated list of your HBase Zookeeper hosts. For a local single-node cluster, use localhost.
    2. For Zookeeper port, enter the port for your Zookeeper hosts. By default this is 2181.
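
As a rough illustration of how the two connection fields fit together, the sketch below combines a comma-separated host list with a single port into host:port pairs. It is not how PDI processes these fields internally; the function name zookeeper_quorum and the host names zk1 and zk2 are hypothetical.

```python
def zookeeper_quorum(hosts, port=2181):
    """Split a comma-separated host list and pair each host with the port."""
    # Ignore surrounding whitespace and empty entries such as a trailing comma.
    names = [name.strip() for name in hosts.split(",") if name.strip()]
    return [f"{name}:{port}" for name in names]


# Local single-node cluster with the default port.
print(zookeeper_quorum("localhost"))
# Hypothetical two-node ensemble on a non-default port.
print(zookeeper_quorum("zk1, zk2", 2182))
```
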
  10. Create an HBase mapping to tell Pentaho how to store the data in HBase by switching to the Create/Edit mappings tab and changing these options.
    1. For HBase table name, select weblogs.
    2. For Mapping name, enter pageviews.
    3. Click Get incoming fields.
    4. For the alias key, change the Key column to Y, clear the Column family and Column name fields, and set the Type field to String. Click Save mapping.
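
The effect of this mapping can be sketched in Python: the field marked as the key becomes the HBase row key and therefore needs no column family or column name, while the remaining fields become cells. The sketch assumes each non-key field is stored in the pageviews column family under its own field name. The function name row_to_cells and the sample fields are hypothetical and are not part of PDI.

```python
def row_to_cells(row, key_field="key", family="pageviews"):
    """Split one incoming row into an HBase row key and its cells."""
    # The key field is stored as the row key, not as a column.
    row_key = row[key_field]
    # Assumption: every other field goes to "<family>:<field name>".
    cells = {
        f"{family}:{name}": value
        for name, value in row.items()
        if name != key_field
    }
    return row_key, cells


# Hypothetical incoming row; only "key" is named in the tutorial.
sample_row = {"key": "row-0001", "page": "/index.html", "hits": "12"}
row_key, cells = row_to_cells(sample_row)
print(row_key)
print(cells)
```
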
  11. Configure the HBase Output step to use the mapping you just created.
    1. Go back to the Configure connection tab and click Get table names.
    2. For HBase table name, enter weblogs.
    3. Click Get mappings for the specified table.
    4. For Mapping name, select pageviews. Click OK to close the window.
    Save the transformation by selecting Save as from the File menu. Enter load_hbase.ktr as the file name within a folder of your choice.
  12. Run the transformation by clicking the green Run button on the transformation toolbar, or by choosing Action > Run from the menu. The Execute a transformation window opens. Click Launch.

    An Execution Results panel opens at the bottom of the Spoon interface and displays the progress of the transformation as it runs. After a few seconds the transformation finishes successfully.

    If any errors occur, the transformation step that failed is highlighted in red, and you can use the Logging tab to view the error messages.

  13. Verify the data was loaded by querying HBase.
    1. From the command line, open the HBase shell by entering this command.
      hbase shell
    2. Query HBase by entering this command.
      scan 'weblogs', {LIMIT => 10}

Ten rows of data are returned.