Skip to main content

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

Using the Group By step on the Pentaho engine

Parent article

If you are running your transformation on the Pentaho engine, use the following instructions to set up the Group By step.

General

Group By dialog

Enter the information in the options as shown in the following table:

Option Description
Step nameSpecifies the unique name of the Group By step on the canvas. You can customize the name or leave it as the default.
Include all rows?

Select this check box if you want to include all rows in the output. Clear this check box if you only want to output the aggregate rows.

The following options are not available unless the Include all rows option is selected:

  • Temporary files directory
  • TMP-file prefix
  • Add line number, restart in each group
  • Line number field name
Temporary files directorySpecify the directory where the temporary files are stored. The default is the standard temporary directory for the system. You must specify a directory when the Include all rows option is selected and the number of grouped rows exceeds 5000 rows.
TMP-file prefixSpecifies the file prefix for naming temporary files.
Add line number, restart in each groupAdds a line number that restarts at 1 in each group. When both Include all rows and this option are selected, all rows are included in the output with a line number for each row.
Line number field nameSpecifies the name of the field where you want to add line numbers for each new group.
Always give back a result rowSelect this check box to return a result row, even when there is no input row. When there are no input rows, this option returns a count of zero (0). Clear this check box if you only want to output a result row when there is an input row.

The fields that make up the group table

Use The fields that make up the group table to specify the fields you want to group. Click Get Fields to add all fields from the input stream(s) to the table. Right-click a row in the table to edit that row or all rows in the table.

NoteYou can calculate aggregate functions over the whole dataset by leaving this table blank.

Aggregates table

The Aggregates table specifies the group you want to aggregate, the aggregation method, and the name of the resulting new field.

The Aggregates table contains the following columns:

ColumnDescription
NameThe name of the aggregate field.
SubjectThe subject on which you want to use an aggregation method.
Type

The aggregate method.

The aggregation methods are:

  • Sum
  • Average (Mean)
  • Median
  • Percentile (linear interpolation)
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only)
  • Cumulative average (all rows option only)
  • Standard deviation (Population)
  • Concatenate strings that are separated by the character specified in the Value column
  • Number of distinct values
  • Number of rows (without field argument)
  • Standard deviation (sample)
  • Percentile (nearest-rank method).
ValueThe aggregate value.

Examples

Examples included in the design-tools\data-integration\samples\transformations directory are:

  • Calculate median and percentiles using the group by steps.ktr
  • General - Repeat fields - Group by - Denormalize.ktr
  • Group By - Calculate standard deviation.ktr
  • Group By - include all rows without a grouping.ktr
  • Group by - include all rows and calculations .ktr.

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.

The metadata injection values for the aggregation type are:

  • SUM
  • AVERAGE
  • MEDIAN
  • PERCENTILE
  • MIN
  • MAX
  • COUNT_ALL
  • CONCAT_COMMA
  • FIRST
  • LAST
  • FIRST_INCL_NULL
  • LAST_INCL_NULL
  • CUM_SUM
  • CUM_AVG
  • STD_DEV
  • CONCAT_STRING
  • COUNT_DISTINCT
  • COUNT_ANY