
Pentaho MapReduce

This job entry executes transformations as part of a Hadoop MapReduce job in place of a traditional Hadoop Java class. A Hadoop MapReduce job is made up of any combination of the following types of transformations:

  • The Mapper transformation takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). It performs filtering and sorting (such as sorting students by first name into queues, one queue for each name). It applies a given function to each element of a list, returning a list of results in the same order.
  • The Combiner transformation summarizes the map output records with the same key, which helps to reduce the amount of data written to disk, and transmitted over the network.
  • The Reducer transformation performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). It analyzes a recursive data structure and, through use of a given combining operation, recombines the results of recursively processing its constituent parts, building up a return value.
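
For readers who know the traditional Hadoop Java API that this entry replaces, the classic word-count example shows the same three roles in plain Java. The sketch below is illustrative only and is not PDI code; in Pentaho MapReduce each role is implemented as a transformation instead of a Java class, and the class and field names here are hypothetical:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

    // Mapper role: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

    // Reducer role: sum the counts emitted for each word. Because summing is
    // associative, this same class is commonly reused as the combiner.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }
  }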

Note: This entry was formerly known as Hadoop Transformation Job Executor.

With the Pentaho MapReduce entry, you specify PDI transformations to use for the mapper, combiner, and/or reducer through their related tabs. The mapper transformation is required. The combiner and reducer transformations are optional. See Pentaho MapReduce workflow for details on how PDI works with Hadoop clusters.

Note: The Hadoop job name field in the Cluster tab is required and must be specified for the Pentaho MapReduce entry to work.

General

Use the Entry Name field to specify the unique name of the job entry on the canvas. The Entry Name is set to Pentaho MapReduce by default.

Options

The Pentaho MapReduce job entry features several tabs for defining your transformations and setting up the connection with the Hadoop cluster. Each tab is described below.

Mapper tab

The following table describes the options for defining a mapper transformation, which is required by this entry:

Transformation

Specify the transformation that will perform the mapping functions for this job by entering its path or clicking Browse.

If you select a transformation that has the same root path as the current transformation, the variable ${Internal.Entry.Current.Directory} is automatically inserted in place of the common root path. For example, if the current transformation's path is /home/admin/transformation.ktr and you select the transformation /home/admin/path/sub.ktr, then the path is automatically converted to ${Internal.Entry.Current.Directory}/path/sub.ktr.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Note: Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name: Specify the name of the step that receives mapping data from Hadoop. It must be a MapReduce Input step.

Output step name: Specify the name of the step that passes mapping output back to Hadoop. It must be a MapReduce Output step.
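
As an illustration, a word-count mapper transformation might be laid out as follows. The step names and layout are hypothetical; the only hard requirement from this entry is that the flow begins with a MapReduce Input step and ends with a MapReduce Output step:

  MapReduce Input --> Split field to rows (split each line into words) --> Add constants (count = 1) --> MapReduce Output (key = word, value = count)

In that case, Input step name would be set to MapReduce Input and Output step name to MapReduce Output.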

Combiner tab

The following table describes the options for defining a combiner transformation:

Transformation

Specify the transformation that will perform the combiner functions for this job by entering its path or clicking Browse.

You can use any internal variable to specify the path. For example, if you select a transformation that is located in the same folder as the current transformation, you can use the ${Internal.Entry.Current.Directory} internal variable to define the path.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Note: Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name: Specify the name of the step that receives combiner data from Hadoop. It must be a MapReduce Input step.

Output step name: Specify the name of the step that passes combiner output back to Hadoop. It must be a MapReduce Output step.

Use single threaded transformation engine: Select to run the combiner transformation with the Single Threaded transformation execution engine, which reduces overhead when processing many small groups of output. If not selected, the normal multi-threaded transformation engine is used.

Reducer tab

The following table describes the options for defining a reducer transformation:

Transformation

Specify the transformation that will perform the reducer functions for this job by entering its path or clicking Browse.

You can use any internal variable to specify the path. For example, if you select a transformation that is located in the same folder as the current transformation, you can use the ${Internal.Entry.Current.Directory} internal variable to define the path.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Note: Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name: Specify the name of the step that receives reducer data from Hadoop. It must be a MapReduce Input step.

Output step name: Specify the name of the step that passes reducer output back to Hadoop. It must be a MapReduce Output step.

Use single threaded transformation engine: Select to run the reducer transformation with the Single Threaded transformation execution engine, which reduces overhead when processing many small groups of output. If not selected, the normal multi-threaded transformation engine is used.
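
Continuing the same illustration, a word-count reducer transformation might group the incoming pairs by key and sum the values (step names are hypothetical):

  MapReduce Input --> Group by (group on the key field, sum the value field) --> MapReduce Output

Because summing is associative, the same transformation could also serve on the Combiner tab to shrink map output before it is transmitted over the network.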

Job Setup tab

The following table describes the options for setting up the inputs and outputs of the job:

Input path: Enter the path of the input directory on your Hadoop cluster where the source data for the MapReduce job is stored, such as /wordcount/input. A comma-separated list can be used for multiple input directories.
Output path

Enter the path of the directory, such as /wordcount/output, on your Hadoop cluster where you want the output from the MapReduce job to be stored.

Note: The output directory cannot exist prior to running the MapReduce job.

Remove output path before job: Select to remove the specified output path before the MapReduce job is scheduled.

Input format: Enter the Apache Hadoop class name that describes the input specification for the MapReduce job. See InputFormat for more information.

Output format: Enter the Apache Hadoop class name that describes the output specification for the MapReduce job. See OutputFormat for more information.

Ignore output of map key: Select to ignore the key output from the mapper transformation and replace it with NullWritable.

Ignore output of map value: Select to ignore the value output from the mapper transformation and replace it with NullWritable.

Ignore output of reduce key: Select to ignore the key output from the combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation to be used, not the Identity Reducer.

Ignore output of reduce value: Select to ignore the value output from the combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation to be used, not the Identity Reducer.
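
Both format fields expect fully qualified class names; org.apache.hadoop.mapred.TextInputFormat and org.apache.hadoop.mapred.TextOutputFormat are a commonly used pair for plain text, though the classes available depend on your Hadoop distribution. As a hypothetical shell session using the /wordcount paths above, you might stage input data and clear a previous run's output like this (the Remove output path before job option performs the same cleanup automatically):

  hadoop fs -mkdir -p /wordcount/input
  hadoop fs -put sales-data.txt /wordcount/input
  hadoop fs -rm -r /wordcount/output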

Cluster tab

The following table describes the options for setting up configurations for the Hadoop cluster connection:

Hadoop job name: Enter the name of the Hadoop job you are running. It is required for the Pentaho MapReduce entry to work.

Hadoop Cluster: Specify the configuration of your Hadoop cluster through the following options:
  • Select an existing configuration. If your configuration does not appear in this list, create it with the New button.
  • Click Edit to use the Hadoop cluster dialog box to modify an existing configuration. See the Hadoop cluster configuration section for further details on this dialog box.
  • Click New to use the Hadoop cluster dialog box to create a new configuration. See the Hadoop cluster configuration section for further details on this dialog box.

See Use Hadoop with Pentaho for general information on Hadoop cluster configurations.

Number of Mapper Tasks: Enter the number of mapper tasks you want to assign to this job. The size of the inputs should determine the number of mapper tasks. Typically, there should be between 10 and 100 maps per node, though you can specify a higher number for mapper tasks that are not CPU-intensive.

Number of Reducer Tasks: Enter the number of reducer tasks you want to assign to this job. Lower numbers mean that the reduce operations can launch immediately and start transferring map outputs as the maps finish. The higher the number, the quicker the nodes will finish their first round of reduces and launch a second round. Increasing the number of reduce operations increases the Hadoop framework overhead, but improves load balancing (see the sizing example after this table).

Note: If this is set to 0, then no reduce operation is performed, and the output of the mapper becomes the output of the entire job. Combiner operations will also not be performed.

Logging Interval: Enter the number of seconds between log messages.

Enable Blocking: Select to force the job to wait until each step completes before continuing to the next step. This is the only way for PDI to be aware of a Hadoop job's status.

Note: If this option is not selected, the Hadoop job blindly executes, and PDI will move on to the next job entry. Error handling and routing will not work unless this option is selected.
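
As a rough sizing reference for Number of Reducer Tasks, the Apache Hadoop tutorial suggests 0.95 or 1.75 times (number of nodes × maximum reduce slots per node). On a hypothetical 10-node cluster with 2 reduce slots per node, 0.95 × 20 = 19 reducers lets every reduce launch in a single wave as the maps finish, while 1.75 × 20 = 35 gives faster nodes a second wave and better load balancing.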

Hadoop cluster configuration

When you click the Edit or New buttons next to the Hadoop Cluster field, the Hadoop cluster dialog box appears. Use this dialog box to specify configuration details such as host names and ports for HDFS, Job Tracker, and other big data cluster components. These configuration options are reused in the related transformation steps and job entries that support big data features.

Cluster Name: Enter the name that you assign to the cluster configuration.

Hostname (in HDFS section): Enter the hostname for the HDFS node in your Hadoop cluster.

Port (in HDFS section): Enter the port for the HDFS node in your Hadoop cluster.

Username (in HDFS section): Enter the username for the HDFS node.

Password (in HDFS section): Enter the password for the HDFS node.

Hostname (in JobTracker section): Enter the hostname for the JobTracker node in your Hadoop cluster. If you have a separate job tracker node, type in the hostname here. Otherwise, use the HDFS hostname.

Port (in JobTracker section): Enter the port for the JobTracker in your Hadoop cluster. The job tracker port number cannot be the same as the HDFS port number.

Hostname (in ZooKeeper section): Enter the hostname for the ZooKeeper node in your Hadoop cluster.

Port (in ZooKeeper section): Enter the port for the ZooKeeper node in your Hadoop cluster.

URL (in Oozie section): Enter the URL of a valid Oozie location.
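
For illustration only, a completed configuration might look like the following. The hostnames are placeholders, and the ports shown are common defaults in many distributions rather than universal values; confirm them with your cluster administrator:

  Cluster Name: my-cluster
  HDFS: hostname namenode.example.com, port 8020
  JobTracker: hostname jobtracker.example.com, port 8021
  ZooKeeper: hostname zookeeper.example.com, port 2181
  Oozie: URL http://oozie.example.com:11000/oozie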

After you have finished setting these configuration options, perform the following steps:

Procedure

  1. Click Test to try your configurations on the Hadoop cluster. If you are unable to connect, see Connecting to a Hadoop cluster with the PDI client for further details on Hadoop cluster connections.

  2. Click OK to return to the Cluster tab.

User Defined tab

The following table describes the options for defining user-defined parameters and variables:

Name: Enter the name of the user-defined parameter or variable that you want to set. To set a Java system variable, prefix the variable name with java.system (for example, java.system.SAMPLE_VARIABLE).

Kettle variables that are set here override the Kettle variables set in the kettle.properties file. For more information on how to set a Kettle variable, see Kettle Variables.

Value: Enter the value of the user-defined parameter or variable that you want to set.
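
For example, two hypothetical entries in this tab might look like this:

  Name: java.system.SAMPLE_VARIABLE    Value: true
  (sets a Java system property visible to the job's JVM)

  Name: INPUT_DIR    Value: /wordcount/input
  (sets a Kettle variable named INPUT_DIR, overriding any value from kettle.properties)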