Using the HBase Input step on the Pentaho engine

General

Enter the following information in the transformation step name field.

Step name: Specifies the unique name of the HBase Input step on the canvas. You can customize the name or leave it as the default.

Options

The HBase Input step features several tabs with fields. Each tab is described below.

Configure query tab

Before a value can be read from HBase, you must specify the type and column family of the value, and the type of the table key. You must define a mapping to use a source table. You can output some or all of the fields defined in the mapping. To select a subset of the fields, delete the rows you do not need from the Key fields table. Clearing all rows from the table indicates that all fields defined in the mapping should be output.

This tab contains connection details and basic query information. You can configure a connection by using the Hadoop cluster properties, or by using an hbase-site.xml file and, optionally, an hbase-default.xml configuration file.

This tab includes the following fields:

Option	Description
Hadoop Cluster	Click the Hadoop Cluster drop-down menu to select an existing Hadoop cluster configuration. Click Edit to modify an existing Hadoop cluster configuration. Click New to add a new Hadoop cluster configuration. Refer to Connecting to a Hadoop cluster with the PDI client for information on creating and editing a Hadoop cluster.
URL to hbase-site.xml	Specify the address of the hbase-site.xml file by entering its path or clicking Browse.
URL to hbase-default.xml	Specify the address of the hbase-default.xml file by entering its path or clicking Browse.
HBase table name	Select the name of the source HBase table you want to read.
Get mapped table names (button)	Click to retrieve a list of all existing table names for the HBase table name field. Only table names that have been mapped are retrieved. If you enter the `namespace:tablename` in the HBase table name field, then click Get mapped table names, only the mapped table names in that namespace display. If you do not enter a namespace, all HBase tables across all namespaces are displayed. See Namespaces.
Mapping name	A mapping you can use to decode and interpret column values. Click Get mappings for the specified table to populate the drop-down list of available mappings.
Store mapping info in step meta data	Select this option to store mapping information in the step's metadata instead of loading it from HBase at runtime.
Start key value (inclusive) for table scan	Specifies the starting key value of a partial scan, including the value entered.
Stop key value (exclusive) for table scan	Specifies the stopping key value of a partial scan, excluding the value entered. The start key and stop key fields may be left blank. If the stop key field is left blank, then all rows beginning with and including the start key will be returned.
Scanner row cache size	The number of rows to cache each time a fetch request is made. See the Performance considerations section below for more information.
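
The inclusive start key and exclusive stop key describe a half-open range over the sorted row keys. The following is a minimal sketch of that behavior in plain Python, not a call into HBase or PDI; the function name `scan_range` and the sample keys are hypothetical.

```python
from typing import Iterable, List, Optional


def scan_range(sorted_keys: Iterable[bytes],
               start: Optional[bytes] = None,
               stop: Optional[bytes] = None) -> List[bytes]:
    """Return keys k with start <= k < stop; a missing bound is open-ended."""
    return [
        k for k in sorted_keys
        if (start is None or k >= start) and (stop is None or k < stop)
    ]


# Hypothetical row keys, already in sorted byte order.
keys = [b"row-001", b"row-002", b"row-003", b"row-004"]

partial = scan_range(keys, start=b"row-002", stop=b"row-004")  # stop key excluded
to_end = scan_range(keys, start=b"row-003")                    # blank stop key
```

Leaving both bounds blank in this sketch returns every key, which mirrors a full table scan.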

Key fields table

This table displays the metadata for the selected table.

Option	Description
#	The order of query limitation fields.
Alias	The name that the field will be given in the output stream.
Key	Indicates whether a field is the table's key field or not.
Column family	The column family in the HBase source table that the field belongs to.
Column name	The name of the column in the HBase table. The column family plus the column name uniquely identifies a column in an HBase table.
Type	The PDI data type for the field.
Format	Applies a formatting mask to the field. A formatting string must be provided for date values involved in a range scan (and optionally for numbers). There are two ways to provide this information in the dialog box: (1) Configure the step with fields from the mapping, include the key in the fields to output from the step, then enter a formatting string in the Format cell in the row corresponding to the key field. (2) If you have not opted to output the key from the step, or have opted to output all fields in the mapping by leaving the fields table blank, supply formatting information for the start and stop key values independently by suffixing the start or stop key value with the formatting string, following a '@' separator character. For example, a date start key value of 1969-08-28 can be specified as `1969-08-28@yyyy-MM-dd`.
Indexed values	An optional set of values you can define for string columns by entering comma-separated data in this field.
Get Key/Fields Info	Populates the field list and displays the name of the key as defined in the mapping when the connection information is complete and valid.
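
The `value@format` convention for start and stop keys can be illustrated with a short sketch. This is not PDI's parser: splitting at the last '@', the three-token pattern translation, and the use of UTC are all assumptions made for the example, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical, deliberately tiny translation of Java-style date pattern
# tokens to Python strptime directives; only these three tokens are covered.
_PATTERN_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def split_key_and_format(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'value@format' at the last '@'; format is None when absent."""
    value, sep, fmt = raw.rpartition("@")
    return (value, fmt) if sep else (raw, None)


def date_key_to_epoch_millis(raw: str) -> int:
    """Parse a suffixed date key into an epoch-based timestamp (assumes UTC)."""
    value, fmt = split_key_and_format(raw)
    if fmt is None:
        raise ValueError("a date key needs a formatting string")
    for java_token, py_token in _PATTERN_TOKENS.items():
        fmt = fmt.replace(java_token, py_token)
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


value, fmt = split_key_and_format("1969-08-28@yyyy-MM-dd")
```

A key without a suffix comes back unchanged with a format of `None`, which corresponds to the case where the format is taken from the Format cell instead.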

Create/Edit mappings tab

Use the fields on this tab to create or edit mappings for an HBase table. The mapping defines metadata about the values that are stored in the table. Because data is stored as raw bytes in HBase, PDI uses this metadata to decode values and execute comparisons for column-based result set filtering. Use the fields area of the tab to enter information about the columns in the HBase table that you want to map. Selecting the name of an existing mapping loads the fields defined in that mapping into the fields area of the display.

A valid mapping must define metadata for the key of the source HBase table. The key must have a value specified in the Alias column because a name is not given to the key of an HBase table. Non-key columns must specify the Column family and the Column name that they belong to. Non-key columns can have an optional alias; if one is not supplied, then the column name is used as an alias. All fields must have type information supplied.
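
The rules above can be sketched as a small validation routine. This is an illustration only: `MappingField`, `validate_mapping`, and `effective_alias` are hypothetical names rather than PDI classes, and the check for exactly one key field is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MappingField:
    # Hypothetical stand-in for one row of the fields area; not a PDI class.
    is_key: bool
    data_type: Optional[str] = None
    alias: Optional[str] = None
    column_family: Optional[str] = None
    column_name: Optional[str] = None


def validate_mapping(fields: List[MappingField]) -> List[str]:
    """Return a list of problems; an empty list means the mapping looks valid."""
    problems = []
    # Assumption: a mapping defines exactly one key field.
    if sum(f.is_key for f in fields) != 1:
        problems.append("exactly one key field is required")
    for i, f in enumerate(fields, start=1):
        if not f.data_type:
            problems.append(f"row {i}: type is required")
        if f.is_key:
            if not f.alias:
                problems.append(f"row {i}: the key needs an alias")
        elif not (f.column_family and f.column_name):
            problems.append(f"row {i}: column family and column name are required")
    return problems


def effective_alias(f: MappingField) -> Optional[str]:
    # Non-key columns fall back to the column name when no alias is given.
    return f.alias or f.column_name
```

Returning a list of problems instead of raising on the first one matches the way the dialog prompts you to correct an incomplete definition before saving.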

This tab includes the following fields:

Option	Description
HBase table name	Displays a list of table names. Connection information in the previous tab must be valid and complete for this drop-down list to populate. Selecting a table here populates the Mapping name drop-down box with the names of available mappings for that table.
Get table names (button)	Click to retrieve a list of all existing table names for the HBase table name field, even if they do not have Pentaho mappings. If you enter a `namespace:tablename` in the HBase table name field and click Get mapped table names, only mapped table names display in the list. If you do not enter a namespace, all HBase tables across all namespaces are displayed. See Namespaces.
Mapping name	Names of any mappings that exist for the table. This box is empty when there are no mappings defined for the selected table. Note: You can define multiple mappings on the same HBase table using different subsets of columns.

Fields

Use these fields to specify values for the fields.

Field	Description
#	The order of the mapping operation.
Alias	The name you want to assign to the field. An alias is required for the table key, but optional for non-key columns.
Key	Indicates whether the field is the table's key. The values are Y and N.
Column family	The column family in the HBase source table that the field belongs to. Non-key columns must specify a column family and column name.
Column name	The name of the column in the HBase table.
Type	Data type of the column. When Key is set to Y, the drop-down list displays the key column types: String, Integer, UnsignedInteger, Long, UnsignedLong, Date, UnsignedDate, and Binary. When Key is set to N, the drop-down list displays the non-key column types: String, Integer, Long, Float, Double, Boolean, Date, BigNumber, Serializable, and Binary.
Indexed values	Enter comma-separated data in this field to define values for string columns.
Save mapping (button)	Saves the mapping. If there is any missing information in the mapping definition, you will be prompted to correct the mapping definition before the mapping is saved.
Delete mapping (button)	Deletes the current named mapping in the current named table from the mapping table. Note that this does not delete the actual HBase table.
Create a tuple template (button)	Select to create a mapping template to extract tuples from HBase.

Additional notes on data types

For keys to sort properly in HBase, you must note the distinction between signed and unsigned numbers. Because of the way that HBase stores integer and long data internally, the sign bit must be flipped before storing the signed number so that positive numbers will sort after negative numbers. Unsigned integer and unsigned long data can be stored directly without inverting the sign.
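
The effect of flipping the sign bit can be seen in a short sketch. It assumes 4-byte big-endian integers and uses hypothetical function names; it illustrates the idea and is not PDI's encoder.

```python
import struct


def encode_signed_int_key(n: int) -> bytes:
    """4-byte big-endian two's complement with the sign bit flipped."""
    raw = struct.pack(">i", n)
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def decode_signed_int_key(b: bytes) -> int:
    """Undo the sign-bit flip and read the value back as a signed integer."""
    raw = bytes([b[0] ^ 0x80]) + b[1:]
    return struct.unpack(">i", raw)[0]


def encode_unsigned_int_key(n: int) -> bytes:
    """Unsigned values are stored directly, with no sign-bit inversion."""
    return struct.pack(">I", n)


values = [5, -3, 0, -100, 42]

# Sorting by the flipped encoding matches numeric order.
by_flipped_bytes = sorted(values, key=encode_signed_int_key)

# Sorting by plain two's complement bytes puts negatives after positives.
by_plain_bytes = sorted(values, key=lambda n: struct.pack(">i", n))
```

Comparing `by_flipped_bytes` with `by_plain_bytes` shows why a signed key type and an unsigned key type cannot be treated interchangeably in a mapping.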

String columns
May optionally have a set of legal values defined for them by entering comma-separated data into the Indexed values column in the fields table.
Date keys
Can be stored as either signed or unsigned long data types, with epoch-based timestamps. If you have a date key mapped as a String type, PDI can change the type to Date for manipulation in the transformation. No distinction is made between signed and unsigned numbers for the Date type because HBase only sorts on the key.
Boolean values
May be stored in HBase as 0/1 integer/long or as strings (Y/N, yes/no, true/false, T/F).
BigNumber
May be stored as either a serialized BigDecimal object or in string form (that is, a string that can be parsed by BigDecimal's constructor).
Serializable
Any serialized Java object.
Binary
A raw array of bytes.
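
As an illustration of the Boolean encodings listed above, the following sketch interprets raw bytes as either a recognized string or a 0/1 number. The function name, the order of the checks, and the assumption of big-endian 4-byte and 8-byte numbers are choices made for the example; this is not the comparator that ships with the step.

```python
import struct
from typing import Optional

_TRUE_STRINGS = {"y", "yes", "true", "t"}
_FALSE_STRINGS = {"n", "no", "false", "f"}


def decode_boolean(raw: bytes) -> Optional[bool]:
    """Interpret raw bytes as a Boolean, or return None if unrecognized."""
    try:
        text = raw.decode("ascii").strip().lower()
    except UnicodeDecodeError:
        text = None
    if text in _TRUE_STRINGS:
        return True
    if text in _FALSE_STRINGS:
        return False
    if len(raw) == 4:        # assumed 4-byte big-endian integer
        number = struct.unpack(">i", raw)[0]
    elif len(raw) == 8:      # assumed 8-byte big-endian long
        number = struct.unpack(">q", raw)[0]
    else:
        return None
    # Only 0 and 1 are meaningful; anything else is left undecided.
    return {0: False, 1: True}.get(number)
```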

Filter result set tab

Use the Filter result set tab fields to further refine the set of rows returned by specifying filtering operations on the values of columns other than the key. You can enter row-filtering criteria against one or more columns defined in the mapping.

This tab includes the following fields:

Option	Description
Match all / Match any	When multiple column filters have been defined, you have the option of returning only those rows that match all filters, or those that match any single filter. You can set bounded ranges on a single numeric column by defining upper bound and lower bound filters and selecting the Match all option. Open-ended ranges can be defined by selecting the Match any option.
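
The difference between the two options can be sketched with ordinary predicates. The column name `price`, the sample rows, and `apply_filters` are hypothetical; the sketch shows only the combining logic, not how HBase evaluates filters.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Filter = Callable[[Row], bool]


def apply_filters(rows: List[Row], filters: List[Filter], match_all: bool) -> List[Row]:
    """Keep rows that satisfy all filters (match_all) or at least one filter."""
    combine = all if match_all else any
    return [row for row in rows if combine(f(row) for f in filters)]


rows = [{"price": 5}, {"price": 15}, {"price": 25}]

# Bounded range 10..20: lower and upper bound combined with Match all.
bounded = apply_filters(
    rows,
    [lambda r: r["price"] >= 10, lambda r: r["price"] <= 20],
    match_all=True,
)

# Open-ended ranges (below 10 or above 20) combined with Match any.
outside = apply_filters(
    rows,
    [lambda r: r["price"] < 10, lambda r: r["price"] > 20],
    match_all=False,
)
```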

Fields

Use the fields in this table to filter your results.

Option	Description
#	The order of the filter operation.
Alias	A drop-down menu of column alias names derived from the mapping.
Type	Data type of the column. This field is automatically populated when you choose the alias.
Operator	A drop-down menu that contains equality/inequality operators for numeric, date, and Boolean fields, or substring and regular expression operators for string fields.
Comparison value	A comparison constant to use in conjunction with the operator.
Format	A formatting mask to apply to the field.
Signed comparison	Specifies whether the comparison constant and/or field values involve negative numbers for non-string fields. Because HBase stores numbers in two's complement form, the Filter result set tab includes the Signed comparison column for indicating whether the comparison involves signed numbers. If the field values and comparison constants are only positive for a given filter, then HBase's native lexicographical byte-based comparisons will produce accurate results. If the field contains negative numbers, then column values must be deserialized from bytes to actual numbers before performing the comparison.

Note: HBase Input ships with a custom comparator for deserializing column values before performing a comparison. This needs to be installed on each HBase node before signed comparisons will work correctly. Similarly, a special comparator for Boolean values is provided to implement deserializing and interpreting Boolean values from numbers and various string encodings.
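
The need for a signed comparison can be demonstrated in a few lines. The sketch assumes 8-byte big-endian two's complement values and hypothetical helper names; it contrasts a byte-based comparison with a comparison made after deserializing, and does not reproduce the shipped comparator.

```python
import struct


def to_bytes(n: int) -> bytes:
    """8-byte big-endian two's complement, as a stored long is assumed to look."""
    return struct.pack(">q", n)


def bytewise_less_than(value: int, constant: int) -> bool:
    # Lexicographic comparison of the raw bytes, with no deserialization.
    return to_bytes(value) < to_bytes(constant)


def signed_less_than(value_bytes: bytes, constant: int) -> bool:
    # Deserialize the bytes to a number first, then compare numerically.
    return struct.unpack(">q", value_bytes)[0] < constant


positive_ok = bytewise_less_than(3, 10)           # both positive: bytes agree with numbers
negative_wrong = bytewise_less_than(-1, 10)       # -1 starts with 0xFF, so it sorts last
negative_right = signed_less_than(to_bytes(-1), 10)
```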

Namespaces

You can use namespaces in the HBase table name field to create a logical grouping of your tables. For example, you can use one namespace for your development environment and another namespace for your production environment.

You must create a namespace before you can write to it. If you do not enter a namespace when creating a mapping, Pentaho uses the default namespace, which is named `default`. See https://hbase.apache.org/book.html#_namespace for information on creating namespaces.

You can also use a variable for a namespace, which provides an easy way to move a transformation from your development environment to your production environment without having to change anything except the parameters. You can use transformation-level or system-level variables for a namespace.

The variable format is ${nsvarname}:

Note: Every namespace has a pentaho_mappings table that stores the mappings metadata for the columns. This table is created automatically when you create mappings.
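
To show how a namespace variable lets the same table name resolve differently per environment, here is a minimal sketch using Python's `string.Template`, which happens to share the `${name}` syntax. It does not use PDI's variable engine; the table name `orders`, the values `dev` and `prod`, and `resolve_table_name` are hypothetical.

```python
from string import Template
from typing import Dict, Tuple


def resolve_table_name(raw: str, variables: Dict[str, str]) -> Tuple[str, str]:
    """Substitute ${...} variables, then split into (namespace, table)."""
    resolved = Template(raw).substitute(variables)
    namespace, sep, table = resolved.partition(":")
    # With no namespace prefix, fall back to the namespace named "default".
    return (namespace, table) if sep else ("default", resolved)


dev = resolve_table_name("${nsvarname}:orders", {"nsvarname": "dev"})
prod = resolve_table_name("${nsvarname}:orders", {"nsvarname": "prod"})
plain = resolve_table_name("orders", {})
```

Only the variable value changes between the two environments; the table name entered in the step stays the same.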

Performance considerations

In addition to the standard HBase server configuration and tuning options, two HBase Input factors can also affect performance. The first is the scanner row cache size setting on the Configure query tab. No caching is performed when this field is blank (default); one row is returned per fetch request. Setting a value in this field results in faster scans, but will consume more memory.
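
A rough back-of-the-envelope calculation shows why the cache size matters. The sketch below only counts fetch requests under the simplifying assumption that each request returns a full batch until the rows run out; `fetch_requests` is a hypothetical helper and does not measure real HBase behavior.

```python
import math
from typing import Optional


def fetch_requests(total_rows: int, cache_size: Optional[int] = None) -> int:
    """Number of fetch requests needed to scan total_rows rows."""
    # A blank cache size field means one row is returned per fetch request.
    rows_per_fetch = cache_size if cache_size else 1
    return math.ceil(total_rows / rows_per_fetch)


uncached = fetch_requests(10_000)               # one request per row
cached = fetch_requests(10_000, cache_size=500)
```

The trade-off is that each cached batch must be held in memory, so larger values reduce round trips at the cost of a larger memory footprint.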

The second involves the selection of columns from the specified mapping to return from a query. Specifying fields in the Key fields table on the Configure query tab results in scans that return just those columns, requiring HBase to check each row to see if it contains a specific column. Checking each row creates more lookups, resulting in reduced speed. Enabling and using Bloom filters on the table can reduce the number of lookups. If you leave the Key fields table on the Configure query tab blank, the scan returns every column of every row, not just the columns defined in the mapping. However, HBase Input will only output those columns that are defined in the mapping as being used. When all columns are returned, HBase does not have to do any lookups.

Metadata injection support

All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.
