Hitachi Vantara Lumada and Pentaho Documentation

Using the HBase Output step on the Spark engine

You can set up the HBase Output step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules. For specific instructions on using this step with Spark, see HBase setup for Spark.

General

Enter the following information in the transformation step name field.

  • Step name: Specifies the unique name of the transformation step on the canvas. The Step name is set to HBase Output by default.

Options

The HBase Output step features two tabs with fields. Each tab is described below.

Configure connection tab

This tab contains HBase connection information. You can configure a connection in one of two ways:

  • Using the Hadoop cluster properties, or
  • By using an hbase-site.xml and (an optional) hbase-default.xml configuration file.

Below the connection details are fields to specify which target HBase table to write to, along with a mapping by which to encode incoming field values.

This tab includes the following fields:

  • Hadoop cluster: Click the Hadoop Cluster drop-down menu to select an existing Hadoop cluster configuration.

  • URL to hbase-site.xml: Address of the hbase-site.xml file.

  • URL to hbase-default.xml: Address of the hbase-default.xml file.

  • HBase table name: The target HBase table you want to write data into. Click Get table names to populate the drop-down list of possible table names.

  • Mapping name: A mapping to decode and interpret column values. Click Get mappings for the specified table to populate the drop-down list of available mappings.

  • Store mapping info in step meta: Specifies whether to store mapping information in the step's metadata instead of loading it from HBase when it runs.

  • Delete rows by mapping key: Select to instruct HBase to delete rows using the row key from the mapped input field.

  • Disable write to WAL: Disables writing to the Write Ahead Log (WAL).

    The WAL is used as a failsafe to restore the status quo if the server goes down while data is being inserted. Disabling the WAL increases performance at the cost of this recovery capability.

    Not available when Delete rows by mapping key is selected.

  • Size of write buffer (bytes): The size of the write buffer used to transfer data to HBase.

    A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls.

    If you leave this field empty, the step uses the default value specified in the hbase-default.xml file, which is 2MB (2097152 bytes).

Create/Edit mappings tab

This tab creates or edits a mapping for a given HBase table. A mapping defines metadata about the values that are stored in the table. Since most information is stored as raw bytes in HBase, mapping allows PDI to decode values and execute meaningful comparisons for column-based result set filtering.

Before a value can be written to HBase, you must define to the step which column family the value belongs to and what its type is. You must also specify type information about the key of the table.

The names of fields entering the step must match the aliases of fields defined in the mapping. All incoming fields must have a matching counterpart in the mapping. There may be fewer incoming fields than defined in the mapping. If there are more incoming fields, then an error will occur. One of the incoming fields must match the key defined in the mapping.
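
The matching rules above can be restated as a small check. The following Python sketch is illustrative only: `validate_incoming_fields`, its parameters, and the sample field names are hypothetical and are not part of PDI.

```python
def validate_incoming_fields(incoming, mapping_aliases, key_alias):
    """Return a list of rule violations (empty if the fields fit the mapping).

    Hypothetical helper for illustration; not part of PDI.
    """
    problems = []
    mapped = set(mapping_aliases)
    # Every incoming field needs a counterpart (alias) in the mapping.
    unmatched = [name for name in incoming if name not in mapped]
    if unmatched:
        problems.append("no mapping alias for: " + ", ".join(unmatched))
    # One of the incoming fields must supply the table key.
    if key_alias not in incoming:
        problems.append("missing key field: " + key_alias)
    return problems


# Aliases defined in a made-up mapping; "row_id" is the key alias.
mapping = ["row_id", "name", "amount", "created"]

# Fewer incoming fields than the mapping defines is acceptable.
ok = validate_incoming_fields(["row_id", "name"], mapping, "row_id")

# An unmapped field ("region") and a missing key are both violations.
bad = validate_incoming_fields(["name", "region"], mapping, "row_id")
```

Here `ok` is an empty list, while `bad` reports both the unmapped field and the missing key.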

This tab operates in a similar manner to the one in the HBase Input step, with the exception that the HBase Output step allows the target HBase table to be created if it does not already exist. Furthermore, you can use the fields coming into the step to define a mapping.

Select a table to populate the Mapping name drop-down box with the names of any mappings that exist for the table. If there are no mappings defined for the selected table, enter the name of a new mapping.

Enter information about the columns in the HBase table that you want to map. Selecting the name of an existing mapping will load the fields defined in that mapping into the fields area of the display.

Alternatively, you can create a new HBase table and mapping for it simultaneously by configuring the fields of the mapping and entering the name of a table that does not exist in the HBase table name drop-down box.

This tab includes the following fields:

  • HBase table name: Displays a list of table names. Connection information in the previous tab must be valid and complete for this drop-down list to populate. See the note in Performance considerations for more options.

  • Mapping name: Names of any mappings that exist for the table. This box is empty when there are no mappings defined for the selected table.

    Note: You can define multiple mappings on the same HBase table using different subsets of columns.

  • #: The order of the mapping operation.

  • Alias: The name you want to assign to the HBase table key. This is required for the table key column, but optional for non-key columns.

  • Key: Indicates whether or not the field is the table's key.

  • Column family: The column family in the HBase source table that the field belongs to. Non-key columns must specify a column family and column name.

  • Column name: The name of the column in the HBase table.

  • Type: Data type of the column.

    When Key is set to Y, the following key column types display in the drop-down list:

      • String
      • Integer
      • UnsignedInteger
      • Long
      • UnsignedLong
      • Date
      • UnsignedDate
      • Binary

    When Key is set to N, the following non-key column types display in the drop-down list:

      • String
      • Integer
      • Long
      • Float
      • Double
      • Boolean
      • Date
      • BigNumber
      • Serializable
      • Binary

  • Indexed values: Enter comma-separated data in this field to define legal values for string columns.

  • Get incoming fields (button): Retrieves a field list using the given HBase table and mapping names.

  • Create a tuple template (button): Select to create a mapping template to write tuples to HBase.

  • Save mapping (button): Saves the mapping. If there is any missing information in the mapping definition, you will be prompted to correct the mapping definition before the mapping is saved.

  • Delete mapping (button): Deletes the current named mapping in the current named table from the mapping table. Note that this does not delete the actual HBase table.

A valid mapping must define metadata for the key of the source HBase table. The key must have an Alias specified because there is no name given to the key of an HBase table. Non-key columns must specify the Column family that they belong to and the Column name. An Alias is optional for non-key columns; if none is supplied, the column name is used. All fields must have type information supplied.

For keys to sort properly in HBase, you must note the distinction between signed and unsigned numbers. Because of the way that HBase stores integer and long data internally, the sign bit must be flipped before storing the signed number so that positive numbers will sort after negative numbers. Unsigned integer and unsigned long data can be stored directly without inverting the sign.
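
The effect of flipping the sign bit can be seen in a short Python sketch. This illustrates the general technique, assuming keys are stored as big-endian two's-complement bytes and compared byte by byte; the function name `encode_signed_long_key` and the sample values are hypothetical, and this is not PDI's or HBase's actual encoding code.

```python
import struct


def encode_signed_long_key(value):
    """Encode a signed 64-bit integer as 8 bytes with the sign bit flipped."""
    raw = struct.pack(">q", value)           # big-endian two's complement
    return bytes([raw[0] ^ 0x80]) + raw[1:]  # flip the most significant bit


values = [3, -5, 0, 2**40, -1, -(2**40)]

# Sorting on the raw two's-complement bytes puts negatives after positives.
raw_order = sorted(values, key=lambda v: struct.pack(">q", v))

# Sorting on the flipped encoding matches numeric order.
flipped_order = sorted(values, key=encode_signed_long_key)
```

With the raw bytes, negative numbers sort after positive ones because their leading byte is 0xFF; with the flipped encoding, byte order and numeric order agree.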

  • String columns

    May optionally have a set of legal values defined for them by entering comma-separated data into the Indexed values column in the fields table.

  • Date keys

    Can be stored as either signed or unsigned long data types, with epoch-based timestamps. If you have a date key mapped as a String type, PDI can change the type to Date for manipulation in the transformation. No distinction is made between signed and unsigned numbers for the Date type because HBase only sorts on the key.

  • Boolean values

    May be stored in HBase as 0/1 integer/long or as strings (Y/N, yes/no, true/false, T/F).

  • BigNumber

    May be stored as either a serialized BigDecimal object or in string form (that is, a string that can be parsed by BigDecimal's constructor).

  • Serializable

    Any serialized Java object.

  • Binary

    A raw array of bytes.

To speed up the creation of a mapping, you can use the fields entering the step as its basis. Click Get incoming fields to populate the mapping table with information from those fields. The Alias and Column name of each mapping field are set to the name of an incoming field, and the type information is filled in automatically. The Column family is set to the name of the first column family defined in the table if the table already exists; otherwise, it is set to a default value (Family1), which you can change to define your own families when the target table is created.

Note: The step does not support adding new column families to an existing table.

Important: The names of fields entering the step are expected to match the aliases of fields defined in the mapping. All incoming fields must have a matching counterpart in the mapping. There may be fewer incoming fields than defined in the mapping, but if there are more incoming fields, an error is raised. Furthermore, one of the incoming fields must match the key defined in the mapping.

Performance considerations

The Configure connection tab of the HBase Output step provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory on both the client and the server, but results in fewer remote procedure calls. The default, defined in the hbase-default.xml file, is 2MB. When the field is left blank, the buffer is 2MB, auto flush is enabled, and Put operations are executed immediately, so each row is transmitted to HBase as soon as it arrives at the step. Entering a number, even one equal to the default, disables auto flush, so incoming rows are transferred only once the buffer is full.
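
To get a rough feel for the difference between the two modes, the following Python sketch counts transfers under a deliberately simplified model. It assumes that a blank field means one transfer per row and that a set buffer size means one transfer each time the buffered bytes reach that size, plus a final transfer for any remainder. The function `count_transfers` and the workload are hypothetical; the real HBase client batches and sizes requests differently.

```python
def count_transfers(row_sizes, buffer_bytes=None):
    """Count transfers to HBase under a simplified model.

    buffer_bytes=None models a blank field: auto flush is on, so each
    row is sent on its own. Otherwise rows accumulate and are sent once
    the buffered bytes reach buffer_bytes, with a final transfer for
    any remainder.
    """
    if buffer_bytes is None:
        return len(row_sizes)
    transfers = 0
    buffered = 0
    for size in row_sizes:
        buffered += size
        if buffered >= buffer_bytes:
            transfers += 1
            buffered = 0
    if buffered:
        transfers += 1
    return transfers


rows = [1024] * 10_000  # made-up workload: 10,000 rows of 1 KiB each

immediate = count_transfers(rows)                        # blank field
buffered = count_transfers(rows, buffer_bytes=2097152)   # 2MB entered explicitly
```

In this model, the blank field produces 10,000 transfers, while an explicit 2MB buffer produces 5.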

There is also a checkbox for disabling writing to the Write Ahead Log (WAL). The WAL is used as a failsafe to restore the status quo if the server goes down while data is being inserted. This error recovery comes at the cost of write speed, so disabling the WAL improves performance but removes the safeguard.

The Create/Edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters that specify what kind of compression to use and whether to use Bloom filters to speed up lookups. The compression options are NONE, GZ, and LZO; the Bloom filter options are NONE, ROW, and ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called NewTable should be created with GZ compression and ROWCOL Bloom filters:

NewTable@GZ@ROWCOL

Note: Due to licensing constraints, HBase does not ship with LZO compression libraries; these must be manually installed on each node if you want to use LZO compression.
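
The table name suffix format can be illustrated with a small parser. This Python sketch assumes the positional order shown in the example (name, then compression, then Bloom filter) and treats any missing part as NONE. The function `parse_table_spec` is hypothetical and is not how the step itself is implemented, and how the step treats a single suffix is not stated here.

```python
COMPRESSION = {"NONE", "GZ", "LZO"}
BLOOM = {"NONE", "ROW", "ROWCOL"}


def parse_table_spec(spec):
    """Split 'name@compression@bloom' into its parts, defaulting to NONE.

    Hypothetical helper; the positional order is an assumption taken
    from the NewTable@GZ@ROWCOL example.
    """
    name, *options = spec.split("@")
    if not name or len(options) > 2:
        raise ValueError("expected name[@compression[@bloom]]: " + spec)
    compression = options[0] if len(options) > 0 else "NONE"
    bloom = options[1] if len(options) > 1 else "NONE"
    if compression not in COMPRESSION:
        raise ValueError("unknown compression: " + compression)
    if bloom not in BLOOM:
        raise ValueError("unknown Bloom filter type: " + bloom)
    return name, compression, bloom


with_options = parse_table_spec("NewTable@GZ@ROWCOL")
name_only = parse_table_spec("NewTable")
```

Here `with_options` holds the table name with GZ compression and ROWCOL Bloom filters, and `name_only` falls back to NONE for both.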

Metadata injection support

All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.