Hitachi Vantara Lumada and Pentaho Documentation

ElasticSearch Bulk Insert (deprecated)

 


Important: This documentation applies to an earlier version of the step that is based on the Elasticsearch transport client, version 6.4.2, which is deprecated. The step remains compatible with transformations created in Pentaho version 9.2 and earlier, but you should use the Elasticsearch REST Bulk Insert step in new transformations.

Elastic is a platform that consists of products that search, analyze, and visualize data. The Elastic platform includes ElasticSearch, which is a Lucene-based, multi-tenant capable, and distributed search and analytics engine. The ElasticSearch Bulk Insert step sends one or more batches of records to an ElasticSearch server for indexing. Because you can specify the size of a batch, you can use this step to send one, a few, or many records to ElasticSearch for indexing.
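The effect of the batch size can be illustrated with a minimal Python sketch. It is not the step's implementation; the function name `batches` and its parameters are hypothetical and exist only to show how a stream of records is grouped before each group is sent for indexing.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most batch_size records each.

    Hypothetical helper: it only models how a batch size groups records.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The last batch may hold fewer records than batch_size.
        yield batch


if __name__ == "__main__":
    for group in batches(["r1", "r2", "r3", "r4", "r5"], batch_size=2):
        print(group)
```

With a batch size of 1, every group holds a single record, so each record would be sent on its own.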

Use this step if you have records that you want to submit to an ElasticSearch server to be indexed. When record data flows out of the ElasticSearch Bulk Insert step, PDI sends it to ElasticSearch along with metadata that you indicate such as the index and type. This step is commonly used when you want to send a batch of data to an ElasticSearch server and create new indexes of a certain type (category). It is also used when you want to add a batch of data to an index or category.

Because this is an output step, it is often placed at the end of the transformation.

Note: Because ElasticSearch has a REST web interface, you can also use the REST Client step to send data to an ElasticSearch server and to perform other REST functions.
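If you take the REST route, the ElasticSearch 6.x `_bulk` endpoint expects a newline-delimited JSON body in which each document is preceded by an action line that names the index, the type, and optionally the ID. The following Python sketch only builds such a body and sends nothing. The function name `build_bulk_body`, its parameters, and the sample row are hypothetical, and the sketch does not describe how this step communicates with the server, since the step uses the transport client.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def build_bulk_body(
    rows: Iterable[Dict[str, Any]],
    index: str,
    doc_type: str,
    id_field: Optional[str] = None,
) -> str:
    """Build a newline-delimited JSON body for the REST _bulk endpoint.

    Hypothetical helper: each row becomes an "index" action line followed
    by the document itself. Nothing is sent over the network.
    """
    lines: List[str] = []
    for row in rows:
        meta: Dict[str, Any] = {"_index": index, "_type": doc_type}
        if id_field is not None and id_field in row:
            meta["_id"] = str(row[id_field])
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(row))
    if not lines:
        return ""
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"id": 1, "text": "first tweet"}]  # made-up sample row
    print(build_bulk_body(sample, "twitter", "tweet", id_field="id"), end="")
```

A body like this would be posted to the server's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.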

Before you begin

 

You need the following items:

  • A working server that has ElasticSearch version 6.4.2 already installed. You should be able to connect to ElasticSearch from the computer that you are running PDI on.
  • Insert, Update, and Create privileges for the directories on the ElasticSearch server that you need to access.
  • Files or data you want ElasticSearch to index.

General

 

Enter the following information in the transformation step field.

  • Step Name: Specifies the unique name of the ElasticSearch Bulk Insert step on the canvas. You can customize the name or leave it as the default.

Options

 

The ElasticSearch Bulk Insert step consists of four tabs: General, Servers, Fields, and Settings.

General tab

 
ElasticSearch Bulk Insert General Tab
  • Index: Specifies the name of the index you want to add data to. If an index with that name does not yet exist in ElasticSearch, one is created.
  • Type: Indicates the category the data should be placed in. You define the category. In general practice, the type describes the data. For example, if the index is "twitter", the type might be "tweet".
  • Test Index: Checks whether the index exists in ElasticSearch.
  • Batch Size: Indicates the number of items in the batch. (If the batch size is set to one, it is not a bulk insert; any higher number is.)
  • Stop on Error: Stops processing if there is an error, such as a problem with adding the document or with the bulk push to the index, or if the JSON is not well-formed. If this option is not selected and an error occurs, the row is not processed, but the transformation keeps running so that other rows are processed.
  • Batch Timeout: Indicates how long a batch should be processed before it times out and processing ends.
  • ID Field: Indicates the name of the ID field in the file.
  • Overwrite if exists: Allows the output to be overwritten if the output file already exists because this transformation was run before.
  • Output Rows: Sends the rows that are successfully processed by ElasticSearch to the next step (or the output). If Stop on Error is selected, the rows that were successful up until the error occurred are sent to the next step (or the output). Otherwise, all rows successfully processed by ElasticSearch are sent to the next step (or the output).
  • ID Output Field: Indicates the name of the ID field that is in the output. If this is left blank, the value in the ID Field is used instead.
  • JSON Input: Indicates whether the input is a JSON file.
  • JSON Field: Indicates the JSON node from which processing should begin.
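The interaction between Stop on Error and Output Rows can be modeled with a short Python sketch. This is an illustration of the behavior described above, not the step's code: `process_rows`, the `index_row` callback, and the failing sample row are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def process_rows(
    rows: Iterable[Row],
    index_row: Callable[[Row], None],
    stop_on_error: bool,
) -> Tuple[List[Row], bool]:
    """Return (rows passed to the next step, whether processing stopped early).

    Hypothetical model: index_row stands in for sending one row to the
    server and raises an exception when the row cannot be indexed.
    """
    succeeded: List[Row] = []
    for row in rows:
        try:
            index_row(row)
        except Exception:
            if stop_on_error:
                # Rows that succeeded before the error are still output.
                return succeeded, True
            # Otherwise the failed row is skipped and processing continues.
            continue
        succeeded.append(row)
    return succeeded, False


def fake_index(row: Row) -> None:
    """Stand-in indexer that rejects rows flagged as bad (made-up rule)."""
    if row.get("bad"):
        raise ValueError("document rejected")


if __name__ == "__main__":
    data = [{"id": 1}, {"id": 2, "bad": True}, {"id": 3}]
    print(process_rows(data, fake_index, stop_on_error=True))
    print(process_rows(data, fake_index, stop_on_error=False))
```

In this model, selecting Stop on Error outputs only the rows indexed before the failure, while clearing it skips the failed row and outputs every other successful row.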

Servers tab

 
ElasticSearch Bulk Insert Servers Tab
  • #: Number of the server entry.
  • Address: IP address of the server you want to connect to.
  • Port: Port number for the server you want to connect to.
  • Test Connection: Verifies that the connection can be made to the servers listed in this tab.
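Before filling in this tab, it can help to confirm that each address and port is reachable from the machine running PDI. The Python sketch below performs a plain TCP connection attempt for each entry. It is only a reachability check under the assumption that a successful TCP connect is a useful first signal. It is not what the Test Connection button does internally, and the name `check_servers` and the sample address are hypothetical.

```python
import socket
from typing import Dict, Iterable, Tuple

Server = Tuple[str, int]


def check_servers(servers: Iterable[Server], timeout: float = 2.0) -> Dict[Server, bool]:
    """Try a TCP connection to each (address, port) pair.

    Hypothetical helper: True means the port accepted a connection,
    False means the attempt failed or timed out.
    """
    results: Dict[Server, bool] = {}
    for address, port in servers:
        try:
            with socket.create_connection((address, port), timeout=timeout):
                results[(address, port)] = True
        except OSError:
            results[(address, port)] = False
    return results


if __name__ == "__main__":
    # Made-up entry; replace it with the addresses and ports from the Servers tab.
    print(check_servers([("127.0.0.1", 9300)]))
```

The sample uses port 9300 because that is ElasticSearch's default transport port; use whatever port your server is actually configured with.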

Fields tab

 
ElasticSearch Bulk Insert Fields Tab
  • #: Number of the field entry.
  • Name: Name from the input.
  • Target Name: Output field name.
  • Get Fields: Retrieves the fields from the input.
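The Name and Target Name pairs amount to a rename map applied to each input row. The Python sketch below illustrates that idea only. The function `map_fields` and the sample row are hypothetical, and treating an empty Target Name as "keep the input name" is an assumption, not documented behavior.

```python
from typing import Any, Dict, Iterable, Tuple


def map_fields(row: Dict[str, Any], field_map: Iterable[Tuple[str, str]]) -> Dict[str, Any]:
    """Build an output document from (name, target_name) pairs.

    Hypothetical helper. Assumption: an empty target name keeps the input
    name, and input fields that are not listed are left out.
    """
    document: Dict[str, Any] = {}
    for name, target_name in field_map:
        if name in row:
            document[target_name or name] = row[name]
    return document


if __name__ == "__main__":
    sample_row = {"user_name": "kim", "msg": "hello", "tmp": 1}  # made-up row
    print(map_fields(sample_row, [("user_name", "user"), ("msg", "")]))
```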

Settings tab

 
ElasticSearch Bulk Insert Settings Tab
  • #: Number of the settings entry.
  • Setting: Name of the batch.
  • Value: Value for the batch.

Reference information

 

Elastic, which is the company that makes ElasticSearch, has an API as well as user documentation that can give you more background on the fields in this step.