Transformation Step Reference
Steps extend and expand the functionality of PDI transformations. This page contains the list of supported steps.
Steps: A - F
Name | Category | Description |
---|---|---|
Abort | Flow | Abort a transformation. |
Add a checksum | Transform | Add a checksum column for each input row. |
Add constants | Transform | Add one or more constants to the input rows. |
Add sequence | Transform | Get the next value from a sequence. |
Add value fields changing sequence | Transform | Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence. |
Add XML | Transform | Encode several fields into an XML fragment. |
Analytic query | Statistics | Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST). |
AMQP Consumer | Streaming | Pull streaming data from an AMQP broker or clients through an AMQP transformation. |
AMQP Producer | Streaming | Publish messages in near-real-time to an AMQP broker. |
Append streams | Flow | Append two streams in an ordered way. |
ARFF output | Data Mining | Write data in ARFF format to a file. |
Automatic Documentation Output | Output | Generate documentation automatically based on input in the form of a list of transformations and jobs. |
Avro input | Big Data | Decode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields. |
Avro input (deprecated) | Deprecated | Replaced by Avro input. |
Avro output | Big Data | Serialize data from the PDI data stream into Avro binary or JSON format, then write it to a file. |
Block this step until steps finish | Flow | Block this step until selected steps finish. |
Blocking step | Flow | Block the flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step. |
Calculator | Transform | Create new fields by performing simple calculations. |
Call DB Procedure | Lookup | Get back information by calling a database procedure. |
Call Endpoint | Pentaho Server | Call API endpoints from the Pentaho Server within a PDI transformation. |
Cassandra Input | Big Data | Read from a Cassandra column family. |
Cassandra Output | Big Data | Write to a Cassandra column family. |
Change file encoding | Utility | Change file encoding and create a new file. |
Check if a column exists | Lookup | Check if a column exists in a table on a specified connection. |
Check if file is locked | Lookup | Check if a file is locked by another process. |
Check if webservice is available | Lookup | Check if a webservice is available. |
Clone row | Utility | Clone a row as many times as needed. |
Closure Generator | Transform | Generate a closure table using parent-child relationships. |
Combination lookup/update | Data Warehouse | Update a junk dimension in a data warehouse, or look up information in this dimension. In a junk dimension, all fields together form the primary key. |
Concat Fields | Transform | Concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step. |
Copy rows to result | Job | Write rows to the executing job. The information will then be passed to the next entry in this job. |
CouchDB Input | Big Data | Retrieve all documents from a given view in a given design document from a given database. |
Credit card validator | Validation | Determine whether a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on); see the sketch after this table. |
CSV file input | Input | Read from a simple CSV file input. |
Data Grid | Input | Enter rows of static data in a grid, usually for testing, reference, or demo purposes. |
Data Validator | Validation | Validate passing data based on a set of rules. |
Database join | Lookup | Execute a database query using stream values as parameters. |
Database lookup | Lookup | Look up values in a database using field values. |
De-serialize from file | Input | Read rows of data from a data cube. |
Delay row | Utility | Output each input row after a delay. |
Delete | Output | Delete data in a database table based upon keys. |
Detect empty stream | Flow | Output one empty row if the input stream is empty (i.e., when the input stream does not contain any rows). |
Dimension lookup/update | Data Warehouse | Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension. |
Dummy (do nothing) | Flow | Do nothing. It is useful, however, when testing things or in situations where you want to split streams. |
Dynamic SQL row | Lookup | Execute a dynamic SQL statement built in a previous field. |
Edi to XML | Utility | Convert an EDIFACT message to XML to simplify data extraction (available in PDI 4.4, already present in CI trunk builds). |
ElasticSearch Bulk Insert | Bulk loading | Perform bulk inserts into ElasticSearch. |
Email messages input | Input | Connect to a POP3/IMAP server and retrieve messages. |
ESRI Shapefile Reader | Input | Read shape file data from an ESRI shape file and linked DBF file. |
ETL Metadata Injection | Flow | Inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions. |
Example step (deprecated) | Deprecated | An example of a plugin test step. |
Execute a process | Utility | Execute a process and return the result. |
Execute Row SQL Script | Scripting | Execute an SQL statement or file for every input row. |
Execute SQL script | Scripting | Execute an SQL script, optionally parameterized using input rows. |
File exists | Lookup | Check if a file exists. |
Filter Rows | Flow | Filter rows using simple equations. |
Fixed file input | Input | Read from a fixed file input. |
Formula | Scripting | Calculate a formula using Pentaho's libformula. |
Fuzzy match | Lookup | Find approximate matches to a string using matching algorithms. Read a field from the main stream and output the approximate matching value from the lookup stream. |
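The Credit card validator entry above refers to the LUHN10 (MOD-10) algorithm. The following is a minimal, illustrative Java sketch of that check only, not the step's actual implementation; the class and method names are made up for this example.

```java
// Minimal MOD-10 (Luhn) check of the kind the Credit card validator step
// describes; illustrative only, not PDI's implementation.
public class LuhnCheck {

    static boolean isLuhnValid(String number) {
        int sum = 0;
        boolean doubleIt = false;
        // Walk the digits right to left, doubling every second digit.
        for (int i = number.length() - 1; i >= 0; i--) {
            char c = number.charAt(i);
            if (!Character.isDigit(c)) {
                return false;              // reject non-digit input
            }
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;                // same as summing the two digits
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;              // valid when the total is a multiple of 10
    }

    public static void main(String[] args) {
        System.out.println(isLuhnValid("4111111111111111")); // true (well-known test number)
        System.out.println(isLuhnValid("4111111111111112")); // false
    }
}
```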
Steps: G - L
Name | Category | Description |
---|---|---|
Generate random credit card numbers | Input | Generate random, valid (Luhn-checked) credit card numbers. |
Generate random value | Input | Generate random values. |
Generate Rows | Input | Generate a number of empty or equal rows. |
Get data from XML | Input | Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field. |
Get File Names | Input | Get file names from the operating system and send them to the next step. |
Get files from result | Job | Read filenames used or generated in a previous entry in a job. |
Get Files Rows Count | Input | Count the number of rows in one or more files. |
Get ID from slave server | Transform | Retrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file. |
Get records from stream | Streaming | Return records that were previously generated by another transformation in a job. |
Get repository names | Input | List detailed information about transformations and/or jobs in a repository. |
Get rows from result | Job | Read rows from a previous entry in a job. |
Get Session Variables | Pentaho Server | Retrieve the value of a session variable. |
Get SubFolder names | Input | Read a parent folder and return all subfolders. |
Get System Info | Input | Get information from the system like system date, arguments, etc. |
Get table names | Input | Get table names from a database connection and send them to the next step. |
Get Variables | Job | Determine the values of certain (environment or Kettle) variables and put them in field values. |
Google Analytics | Input | Fetch data from a Google Analytics account. |
Greenplum Bulk Loader (deprecated) | Deprecated | Bulk load Greenplum data. Replacement step is Greenplum Load. |
Greenplum Load | Bulk loading | Bulk load Greenplum data. |
Group by | Statistics | Build aggregates in a group by fashion. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly (see the sketch after this table). |
GZIP CSV Input | Input | Read in parallel from a GZIP CSV file. |
Hadoop File Input | Big Data | Read data from a variety of different text-file types stored on a Hadoop cluster. |
Hadoop File Output | Big Data | Write data to a variety of different text-file types stored on a Hadoop cluster. |
HBase Input | Big Data | Read from an HBase column family. |
HBase Output | Big Data | Write to an HBase column family. |
HBase Row Decoder | Big Data | Decode an incoming key and HBase result object according to a mapping. |
HL7 Input | Input | Read data from HL7 data streams. |
HTTP client | Lookup | Call a web service over HTTP by supplying a base URL, optionally with parameters set dynamically. |
HTTP Post | Lookup | Post to a web service over HTTP by supplying a base URL, optionally with parameters set dynamically. |
IBM Websphere MQ Consumer (deprecated) | Deprecated | Receive messages from any IBM Websphere MQ Server. |
IBM Websphere MQ Producer (deprecated) | Deprecated | Send messages to any IBM Websphere MQ Server. |
Identify last row in a stream | Flow | Mark the last row. |
If field value is null | Utility | Set a field value to a constant if it is null. |
Infobright Loader | Bulk loading | Load data to an Infobright database table. |
Ingres VectorWise Bulk Loader | Bulk loading | Interface with the Ingres VectorWise Bulk Loader "COPY TABLE" command. |
Injector | Inline | Inject rows into the transformation through the Java API. |
Insert / Update | Output | Update or insert rows in a database based upon keys. |
Java Filter | Flow | Filter rows using Java code. |
JMS Consumer | Streaming | Receive messages from a JMS server. |
JMS consumer (deprecated) | Deprecated | Replaced by JMS consumer. |
JMS Producer | Streaming | Send messages to a JMS server. |
JMS producer (deprecated) | Deprecated | Replaced by JMS producer. |
Job Executor | Flow | Run a PDI job and pass parameters and rows to it. |
Join Rows (cartesian product) | Joins | Output the cartesian product of the input streams. The number of output rows is the product of the number of rows in the input streams. |
JSON Input | Input | Extract relevant portions out of JSON structures (file or incoming field) and output rows. |
JSON output | Output | Create a JSON block and output it in a field or to a file. |
Kafka Consumer | Streaming | Run a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time. |
Kafka Producer | Streaming | Publish messages in near-real-time across worker nodes where multiple, subscribed members have access. |
Knowledge Flow | Data Mining | Execute a Knowledge Flow data mining process. |
LDAP Input | Input | Read data from an LDAP host. |
LDAP Output | Output | Perform insert, upsert, update, add, or delete operations on records based on their DN (Distinguished Name). |
LDIF Input | Input | Read data from LDIF files. |
Load file content in memory | Input | Load file content in memory. |
LucidDB streaming loader (deprecated) | Deprecated | Load data into LucidDB by using Remote Rows UDX. |
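The Group by entry above notes that unsorted input is only aggregated per run of consecutive duplicate keys. The toy Java sketch below illustrates that caveat only (it is not the step's implementation): the same keys produce one group per key when the rows are sorted, and fragmented groups when they are not.

```java
import java.util.*;

// Toy illustration of the "sorted input" requirement: like the Group by step,
// this loop starts a new group whenever the group key changes between rows.
public class ConsecutiveGroupBy {

    static List<String> groupCounts(List<String> keys) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String k : keys) {
            if (!k.equals(current)) {
                if (current != null) {
                    out.add(current + "=" + count);   // emit the finished group
                }
                current = k;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            out.add(current + "=" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sorted input: one group per key.
        System.out.println(groupCounts(Arrays.asList("A", "A", "B", "B", "B")));
        // -> [A=2, B=3]

        // Unsorted input: the same keys produce fragmented groups.
        System.out.println(groupCounts(Arrays.asList("A", "B", "A", "B", "B")));
        // -> [A=1, B=1, A=1, B=2]
    }
}
```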
Steps: M - R
Name | Category | Description |
---|---|---|
Mail | Utility | Send e-mail. |
Mail Validator | Validation | Check if an email address is valid. |
Mapping (sub-transformation) | Mapping | Run a mapping (sub-transformation), using MappingInput and MappingOutput to specify the fields interface. |
Mapping input specification | Mapping | Specify the input interface of a mapping. |
Mapping output specification | Mapping | Specify the output interface of a mapping. |
MapReduce Input | Big Data | Enter Key Value pairs from Hadoop MapReduce. |
MapReduce Output | Big Data | Exit Key Value pairs, then push into Hadoop MapReduce. |
MaxMind GeoIP Lookup | Lookup | Look up an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization. |
Memory Group by | Statistics | Build aggregates in a group by fashion. This step doesn't require sorted input. |
Merge Join | Joins | Join two streams on a given key and output a joined set. The input streams must be sorted on the join key. |
Merge Rows (diff) | Joins | Merge two streams of rows, sorted on a certain key. The two streams are compared and the equal, changed, deleted, and new rows are flagged. |
Metadata structure of stream | Utility | Read the metadata of the incoming stream. |
Microsoft Access Input | Input | Read data from a Microsoft Access file. |
Microsoft Access Output | Output | Store records into an MS-Access database table. |
Microsoft Excel Input | Input | Read data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS). |
Microsoft Excel Output | Output | Store records into an Excel (XLS) document with formatting information. |
Microsoft Excel Writer | Output | Write or append data to an Excel file. |
Modified Java Script Value | Scripting | Run JavaScript programs (and much more). |
Mondrian Input | Input | Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian). |
MonetDB Agile Mart | Agile | |
MonetDB Bulk Loader | Bulk loading | Load data into MonetDB by using their bulk load command in streaming mode. |
MongoDB Input | Big Data | Read all entries from a MongoDB collection in the specified database. |
MongoDB Output | Big Data | Write to a MongoDB collection. |
MQTT Consumer | Streaming | Pull streaming data from an MQTT broker or clients through an MQTT transformation. |
MQTT Producer | Streaming | Publish messages in near-real-time to an MQTT broker. |
Multiway Merge Join | Joins | Join multiple streams. This step supports INNER and FULL OUTER joins. |
MySQL Bulk Loader | Bulk loading | Load data over a named pipe (not available on MS Windows). |
Null if... | Utility | Set a field value to null if it is equal to a constant value. |
Number range | Transform | Create ranges based on a numeric field. |
OLAP Input | Input | Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j. |
OpenERP object delete (deprecated) | Deprecated | Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function. |
OpenERP object input (deprecated) | Deprecated | Retrieve data from the OpenERP server using the XMLRPC interface with the 'read' function. |
OpenERP object output (deprecated) | Deprecated | Update data on the OpenERP server using the XMLRPC interface and the 'import' function. |
Oracle Bulk Loader | Bulk loading | Use Oracle Bulk Loader to load data. |
ORC Input | Big Data | Read field data from ORC files into a PDI data stream. |
ORC Output | Big Data | Serialize data from the PDI data stream into the ORC file format and write it to a file. |
Output steps metrics | Statistics | Return metrics for one or several steps. |
Palo cell input (deprecated) | Deprecated | Retrieve all cell data from a Palo cube. |
Palo cell output (deprecated) | Deprecated | Update cell data in a Palo cube. |
Palo dim input (deprecated) | Deprecated | Return elements from a dimension in a Palo database. |
Palo dim output (deprecated) | Deprecated | Create/update dimension elements and element consolidations in a Palo database. |
Parquet Input | Input | Decode Parquet data formats and extract fields from the structure it defines. |
Parquet Output | Output | Map fields within data files and choose where you want to process those files. |
Pentaho Reporting Output | Output | Execute an existing report file (.prpt). |
PostgreSQL Bulk Loader | Bulk loading | Bulk load PostgreSQL data. |
Prioritize streams | Flow | Prioritize streams in an ordered way. |
Process files | Utility | Process one file per row (copy, move, or delete). This step only accepts filenames as input. |
Properties Output | Output | Write data to a properties file. |
Property Input | Input | Read data (key, value) from properties files. |
Python Executor | Scripting | Map upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that kicks off your Python script. |
R script executor | Statistics | Execute an R script within a PDI transformation. |
Regex Evaluation | Scripting | Evaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups (see the sketch after this table). |
Replace in string | Transform | Replace all occurrences of a word in a string with another word. |
Reservoir Sampling | Statistics | Sample a fixed number of rows from the incoming stream. |
REST Client | Lookup | Consume RESTful services. REpresentational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which the web services are viewed as resources and can be identified by their URLs. |
Row Denormaliser | Transform | Denormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields. |
Row Flattener | Transform | Flatten consecutive rows based on the order in which they appear in the input stream. |
Row Normaliser | Transform | Normalise de-normalised information. |
RSS Input | Input | Read RSS feeds. |
RSS Output | Output | Write RSS streams. |
Rule Executor | Scripting | Execute a rule against each row (using Drools). |
Rule Accumulator | Scripting | Execute a rule against a set of rows (using Drools). |
Run SSH commands | Utility | Run SSH commands and return the result. |
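The Regex Evaluation entry above mentions extracting new fields with capturing groups. The following is a plain java.util.regex sketch of that idea; the field value and pattern are invented for illustration and do not come from the step itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of field extraction with capturing groups: one input field is
// matched and each group becomes a new output field.
public class RegexFieldExtract {
    public static void main(String[] args) {
        // Hypothetical input field holding "code-year" values such as "INV-2023".
        String field = "INV-2023";

        // Two capturing groups: the alphabetic prefix and the numeric year.
        Pattern p = Pattern.compile("([A-Z]+)-(\\d{4})");
        Matcher m = p.matcher(field);

        if (m.matches()) {
            String prefix = m.group(1);   // would become a new output field
            String year = m.group(2);     // would become another output field
            System.out.println("prefix=" + prefix + ", year=" + year);
        } else {
            System.out.println("no match");
        }
    }
}
```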
Steps: S - Z
Name | Category | Description |
---|---|---|
S3 CSV Input | Input | Read from an S3 CSV file. |
S3 File Output | Output | Export data to a text file on an Amazon Simple Storage Service (S3). |
Salesforce Delete | Output | Delete records in a Salesforce module. |
Salesforce Input | Input | Read information from SalesForce. |
Salesforce Insert | Output | Insert records in a Salesforce module. |
Salesforce Update | Output | Update records in a Salesforce module. |
Salesforce Upsert | Output | Insert or update records in a Salesforce module. |
Sample rows | Statistics | Filter rows based on the line number. |
SAP input (deprecated) | Deprecated | Read data from SAP ERP, optionally with parameters. |
SAS Input | Input | Read files in sas7bdat (SAS) native format. |
Script | Experimental | Calculate values by scripting in Ruby, Python, Groovy, Javascript, and other scripting languages. |
Select values | Transform | Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision. |
Send message to Syslog | Utility | Send message to Syslog server. |
Serialize to file | Output | Write rows of data to a data cube. |
Set Field Value | Transform | Replace value of a field with another value field. |
Set Field Value to a Constant | Transform | Replace value of a field to a constant. |
Set files in result | Job | Set filenames in the result of this transformation. Subsequent job entries can then use this information. |
Set Session Variables | Pentaho Server | Set the value of a session variable. |
Set Variables | Job | Set environment variables based on a single input row. |
SFTP Put | Experimental | Upload a file or a stream file to a remote host via SFTP. |
Simple Mapping | Mapping | Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation). |
Single Threader | Flow | Execute a sequence of steps in a single thread. |
Socket reader | Inline | Read a socket. A socket client that connects to a server (Socket Writer step). |
Socket writer | Inline | Write a socket. A socket server that can send rows of data to a socket reader. |
Sort rows | Transform | Sort rows based upon field values (ascending or descending). |
Sorted Merge | Joins | Merge rows coming from multiple input steps providing these rows are sorted themselves on the given key fields. |
Split field to rows | Transform | Split a single string field by delimiter and create a new row for each split term. |
Split Fields | Transform | Split a single field into more than one. |
Splunk Input | Transform | Read data from Splunk. |
Splunk Output | Transform | Write data to Splunk. |
SQL File Output | Output | Output SQL INSERT statements to a file. |
Stream lookup | Lookup | Look up values coming from another stream in the transformation. |
SSTable Output | Big Data | Write to a filesystem directory as a Cassandra SSTable. |
String operations | Transform | Apply certain operations like trimming, padding and others to string value. |
Strings cut | Transform | Cut a portion of a substring. |
Switch / Case | Flow | Switch a row to a certain target step based on the case value in a field. |
Synchronize after merge | Output | Perform insert/update/delete in one go based on the value of a field. |
Table Agile Mart | Agile | |
Table Compare | Utility | Compare the data from two tables (provided they have the same layout). It finds differences between the data in the two tables and logs them. |
Table exists | Lookup | Check if a table exists on a specified connection. |
Table Input | Input | Read information from a database table. |
Table Output | Output | Write information to a database table. |
Teradata Fastload Bulk Loader | Bulk loading | Bulk load Teradata Fastload data. |
Teradata TPT Insert Upsert Bulk Loader | Bulk loading | Bulk load via TPT using the tbuild command. |
Text file input | Input | Read data from a text file in several formats. This data can then be passed to your next step(s). |
Text file input (deprecated) | Deprecated | Replaced by Text file input. |
Text file output | Output | Write rows to a text file. |
Text file output (deprecated) | Deprecated | Replaced by Text file output. |
Transformation Executor | Flow | Run a Pentaho Data Integration transformation, set parameters, and pass rows. |
Unique rows | Transform | Remove duplicate rows and leave only unique occurrences. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly. |
Unique rows (HashSet) | Transform | Remove duplicate rows and leave only unique occurrences by using a HashSet. |
Univariate Statistics | Statistics | Compute some simple stats based on a single input field. |
Update | Output | Update data in a database table based upon keys. |
User Defined Java Class | Scripting | Program a step using Java code. |
User Defined Java Expression | Scripting | Calculate the result of a Java Expression using Janino (see the sketch after this table). |
Value Mapper | Transform | Map values of a certain field from one value to another. |
Vertica Bulk Loader | Bulk loading | Bulk load data into a Vertica table using their high performance COPY feature. |
Web services lookup | Lookup | Look up information using web services (WSDL). |
Write to log | Utility | Write data to log. |
XBase input | Input | Read records from an XBase type of database file (DBF). |
XML Input Stream (StAX) | Input | Process very large and complex XML files very fast. |
XML Join | Joins | Join a stream of XML-Tags into a target XML string. |
XML Output | Output | Write data to an XML file. |
XSD Validator | Validation | Validate XML source (files or streams) against XML Schema Definition. |
XSL Transformation | Transform | Transform XML stream using XSL (eXtensible Stylesheet Language). |
Yaml Input | Input | Read a YAML source (file or stream), parse it, convert it to rows, and write these to one or more outputs. |
Zip File | Utility | Create a standard ZIP archive from the data stream fields. |
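The User Defined Java Expression entry above is based on Janino. As a rough illustration of compiling and evaluating a one-line Java expression with Janino's ExpressionEvaluator (the field names and the expression below are made up for this sketch, and this is not the step's own code):

```java
import org.codehaus.janino.ExpressionEvaluator;

// Minimal Janino sketch in the spirit of the User Defined Java Expression step:
// compile a one-line Java expression once, then evaluate it per row.
public class JaninoExpressionDemo {
    public static void main(String[] args) throws Exception {
        ExpressionEvaluator ee = new ExpressionEvaluator();

        // Declare the "fields" the expression may reference, with their types
        // (hypothetical names for this example).
        ee.setParameters(new String[] { "quantity", "price" },
                         new Class[] { int.class, double.class });
        ee.setExpressionType(double.class);

        // Compile the expression once.
        ee.cook("quantity * price * 1.21");

        // Evaluate it for a couple of "rows".
        Object total1 = ee.evaluate(new Object[] { 3, 9.99 });
        Object total2 = ee.evaluate(new Object[] { 1, 100.0 });
        System.out.println(total1 + " " + total2);
    }
}
```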