ElasticSearch Bulk Insert

Elastic is a platform of products that search, analyze, and visualize data. The platform includes ElasticSearch, a distributed, multi-tenant-capable search and analytics engine based on Lucene. The ElasticSearch Bulk Insert step sends one or more batches of records to an ElasticSearch server for indexing. Because you can specify the batch size, you can use this step to send one, a few, or many records to ElasticSearch for indexing.

Use this step if you have records that you want to submit to an ElasticSearch server to be indexed. When record data flows out of the ElasticSearch Bulk Insert step, PDI sends it to ElasticSearch along with metadata that you specify, such as the index and type. This step is commonly used to send a batch of data to an ElasticSearch server and create a new index of a certain type (category). It is also used to add a batch of data to an existing index or category.
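To make the index and type metadata concrete, the sketch below builds the newline-delimited JSON body that the ElasticSearch 6.x REST Bulk API accepts. It only illustrates how records and their metadata travel together; the step itself may use a different client internally. The function name `bulk_body` and the sample field names are hypothetical.

```python
import json


def bulk_body(rows, index, doc_type, id_field=None):
    """Build an NDJSON bulk body: one action line plus one document line per row.

    `index` and `doc_type` correspond to the Index and Type metadata;
    `id_field` (optional) names the row field to use as the document ID.
    """
    lines = []
    for row in rows:
        action = {"_index": index, "_type": doc_type}
        if id_field is not None:
            # Use the value of the chosen field as the document ID.
            action["_id"] = str(row[id_field])
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(row))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Each record produces two lines: an action line carrying the index, type, and optional ID, followed by the document itself.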

Because this is an output step, it is often placed at the end of the transformation.

Note: Since ElasticSearch has a REST web interface, you can also use the REST Client step to send data to an ElasticSearch server and to perform other REST functions.
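As a rough illustration of what such REST calls look like, the sketch below prepares two requests against the ElasticSearch 6.x REST interface: one that indexes a single document and one that checks whether an index exists. The base URL, index, type, and document values are assumptions for the example, and the helper names are hypothetical. The requests are only constructed here, not sent.

```python
import json
import urllib.request


def index_document_request(base_url, index, doc_type, doc_id, doc):
    """Prepare PUT /{index}/{type}/{id}, which indexes one JSON document."""
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_exists_request(base_url, index):
    """Prepare HEAD /{index}, which reports whether the index exists."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/{index}", method="HEAD")


# Assumed server location; 9200 is the default ElasticSearch HTTP port.
put_req = index_document_request(
    "http://localhost:9200", "twitter", "tweet", 1, {"message": "hello"}
)
head_req = index_exists_request("http://localhost:9200", "twitter")
# To send either request: urllib.request.urlopen(put_req)
```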

Before you begin

You need the following items:

  • A working server that has ElasticSearch version 6.4.2 already installed. You should be able to connect to ElasticSearch from the computer that you are running PDI on.
  • Insert, Update, and Create privileges for the directories on the ElasticSearch server that you need to access.
  • Files or data you want ElasticSearch to index.

General

Enter the following information in the transformation step field.

  • Step Name: Specifies the unique name of the ElasticSearch Bulk Insert step on the canvas. You can customize the name or leave it as the default.

Options

The ElasticSearch Bulk Insert step consists of four tabs: General, Servers, Fields, and Settings.

General tab

ElasticSearch Bulk Insert General Tab

  • Index: Specifies the name of the index you want to add data to. If an index with that name does not yet exist in ElasticSearch, one is created.
  • Type: Indicates the category the data should be placed in. You define the category. In general practice, the type describes the data. For example, if the index is "twitter", the type might be "tweet".
  • Test Index: Checks whether the index exists in ElasticSearch.
  • Batch Size: Indicates the number of items in the batch. A batch size of one is not a bulk insert; any higher number is.
  • Stop on Error: Stops processing if there is an error, such as a problem with adding the document, a problem with the bulk push to the index, or JSON that is not well-formed. If this option is not selected and an error occurs, the row is not processed, but the transformation keeps running so that other rows are processed.
  • Batch Timeout: Indicates how long a batch can be processed before it times out and processing ends.
  • ID Field: Indicates the name of the ID field in the file.
  • Overwrite if exists: Allows the output to be overwritten if the output file already exists because this transformation was run before.
  • Output Rows: Sends the rows that are successfully processed by ElasticSearch to the next step (or the output). If Stop on Error is selected, only the rows that succeeded before the error occurred are sent. Otherwise, all rows successfully processed by ElasticSearch are sent.
  • ID Output Field: Indicates the name of the ID field in the output. If this is left blank, the value in ID Field is used instead.
  • JSON Input: Indicates whether the input is a JSON file.
  • JSON Field: Indicates the JSON node from which processing should begin.
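The way Batch Size, Stop on Error, and Output Rows interact can be sketched as follows. This is a simplified model based only on the option descriptions, not the step's actual implementation. `send_batch` is a hypothetical callable that stands in for the bulk request and returns one success flag per row.

```python
def process_rows(rows, batch_size, send_batch, stop_on_error):
    """Send rows in batches and return the rows that would be passed on.

    send_batch(batch) is assumed to return a list of booleans, one per row,
    telling whether ElasticSearch accepted that row.
    """
    output_rows = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results = send_batch(batch)
        for row, ok in zip(batch, results):
            if ok:
                output_rows.append(row)
            elif stop_on_error:
                # Stop immediately; only rows that succeeded so far are output.
                return output_rows
            # Otherwise the failed row is skipped and processing continues.
    return output_rows
```

With five rows, a batch size of two, and a failure on the third row, this model outputs the first two rows when Stop on Error is selected and four rows when it is not.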

Servers tab

ElasticSearch Bulk Insert Servers Tab

  • #: Number of the server entry.
  • Address: IP address of the server you want to connect to.
  • Port: Port number for the server you want to connect to.
  • Test Connection: Verifies that the connection can be made to the servers listed in this tab.
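If you want to check reachability outside PDI before filling in this tab, a plain TCP check along the following lines may help. It is only a sketch: it confirms that something accepts connections on each address and port, not that the listener is ElasticSearch, and it is not how Test Connection itself is implemented. The function names are hypothetical.

```python
import socket


def can_connect(address, port, timeout=3.0):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_servers(servers, timeout=3.0):
    """Map each (address, port) pair to whether it accepted a connection."""
    return {(address, port): can_connect(address, port, timeout)
            for address, port in servers}
```

For example, `check_servers([("10.0.0.5", 9300)])` returns a dictionary keyed by the address and port pair; the address and port shown are placeholders.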

Fields tab

ElasticSearch Bulk Insert Fields Tab

  • #: Number of the fields entry.
  • Name: Name from the input.
  • Target Name: Output field name.
  • Get Fields: Retrieves the fields from the input.
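The Name to Target Name mapping can be pictured as a simple rename applied to each row before it is indexed. The sketch below is illustrative only; the function name is hypothetical, and falling back to the input name when the target name is blank is an assumption, not documented behavior.

```python
def map_fields(row, field_map):
    """Build the document to index from an input row.

    field_map maps an input field Name to its Target Name. Fields not listed
    in field_map are left out of the document.
    """
    document = {}
    for name, target_name in field_map.items():
        if name in row:
            # Assumption: a blank target name keeps the input field name.
            document[target_name or name] = row[name]
    return document
```

For instance, mapping `usr` to `user` and leaving the target for `msg` blank turns the row `{"usr": "kimchy", "msg": "hi", "tmp": 1}` into `{"user": "kimchy", "msg": "hi"}`.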

Settings tab

ElasticSearch Bulk Insert Settings Tab

  • #: Number of the settings entry.
  • Setting: Name of the batch.
  • Value: Value for the batch.

Reference information

Elastic, the company that makes ElasticSearch, provides API and user documentation that gives more background on the fields in this step.