Skip to main content
Hitachi Vantara Lumada and Pentaho Documentation

Group By

This step groups rows from a source, based on a specified field or collection of fields. A new row is generated for each group. It can also generate one or more aggregate values for the groups. Common uses are calculating the average sales per product and counting the number of an item you have in stock. 

The Group By step is designed for sorted inputs. If your input is not sorted, only double consecutive rows are grouped correctly. If you sort the data outside of PDI, the case sensitivity of the data in the fields may produce unexpected grouping results. 

You can use the Memory Group By step to handle non-sorted input.

General

Enter the information in the transformation step name field as shown in the following table:

Option     Description
Step Name Specifies the unique name of the Group By step on the canvas. You can customize the name or leave it as the default.
Include all rows?

Select if you want to include all rows in the output, instead of only the aggregate rows. The following options are not available unless the Include all rows option is selected:

  • Temporary files directory
  • TMP-file prefix
  • Add line number, restart in each group
  • Line number field name
Temporary files directory Specify the directory where the temporary files are stored. The default is the standard temporary directory for the system. You must specify a directory when the Include all rows option is selected and the number of grouped rows exceeds 5000 rows.
TMP-file prefix Specifies the file prefix for naming temporary files.
Add line number, restart in each group Adds a line number that restarts at 1 in each group. When Include all rows and this option are both selected, all rows are included in the output and have a line number for each row.
Line number field name Specifies the name of the field where you want to add line numbers for each new group.
Always give back a result row Returns a result row, even when there is no input row.  When there are no input rows, this option returns a count of zero (0).

    PDI_Group_By_Dlg.png
This step features two tables with fields for setting environments and defining results. Each table is described below.

The Fields that Make up the Group Table

Use this table to specify the fields you want to group. Click Get Fields to add all fields from the input stream(s). Right-click a row in the table to edit that row or all rows in the table. 

You can calculate aggregate functions over the whole dataset by leaving this table blank.

Aggregates Table

The Aggregates table specifies the group you want to aggregate, the aggregation method, and the name of the resulting new field. 

The Aggregates table options are:
 

Field Description
Name The name of the aggregate field.
Subject The subject on which you want to use an aggregation method.
Type

The aggregate method.

The aggregation methods are:

  • Sum
  • Average (Mean)
  • Median
  • Percentile
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only)
  • Cumulative average (all rows option only)
  • Standard deviation
  • Concatenate strings that are separated by the character specified in the Value column
  • Number of distinct values 
  • Number of rows (without field argument).
Value The aggregate value

Examples

Examples included in the design-tools\data-integration\samples\transformations directory are:

  • Calculate median and percentiles using the group by steps.ktr
  • General - Repeat fields - Group by - Denormalize.ktr
  • Group By - Calculate standard deviation.ktr
  • Group By - include all rows without a grouping.ktr
  • Group by - include all rows and calculations .ktr.

 Use Group By with Spark

The following differences occur when the Group By step is used with Spark:

  • The field names cannot contain spaces, dashes, or special characters, and must start with a letter.
  • The rows are sorted by the grouping fields.
  • The Sort step before the Group By step is optional. Existing transformations that contain a Sort step before the Group By step will successfully run.
  • The Group By and the Memory Group By steps work the same.
  • If you select the Include All Rows option, you can not use the Number of distinct values aggregate type.
  • The Temporary files directory and TMP-file prefix options do not apply. Temporary files are not created when run with Spark.

Before you can use the Group By step with Spark, you must set up the Adaptive Execution Layer (AEL).

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime. 

The metadata injection values for the aggregation type are: 

  • SUM
  • AVERAGE
  • MEDIAN
  • PERCENTILE
  • MIN
  • MAX
  • COUNT_ALL
  • CONCAT_COMMA
  • FIRST
  • LAST
  • FIRST_INCL_NULL
  • LAST_INCL_NULL
  • CUM_SUM
  • CUM_AVG
  • STD_DEV
  • CONCAT_STRING
  • COUNT_DISTINCT
  • COUNT_ANY

Metadata injection is not supported for steps running on the Adaptive Execution Layer (AEL)