Merge rows (diff)
The Merge rows (diff) step compares and merges data within two rows of data. This step is useful for comparing data collected at two different times. For example, the source system of your data warehouse might not contain a timestamp of the last data update. You could use this step to compare the two data streams and merge the dates and timestamps in the rows.
Based on keys for comparison, this step merges reference rows (previous data) with compare rows (new data) and creates merged output rows. A flag in the row indicates how the values were compared and merged. Flag values include:
identical
The key was found in both rows, and the compared values are identical.
changed
The key was found in both rows, but one or more compared values are different.
new
The key was not found in the reference rows.
deleted
The key was not found in the compare rows.
If the rows are flagged as deleted
, the merged output rows are created based
upon the original reference rows stream.
For identical
, new
, or changed
rows, the
merged output rows are created based upon the original compare rows stream.
You can also send values from the merged and flagged rows to a subsequent step in your transformation, such as the Switch-Case step or the Synchronize after merge step. In the subsequent step, you can use the flag field generated by Merge rows (diff) to control updates/inserts/deletes on a target table.
Select an Engine
You can run the Merge rows (diff) step on the Pentaho engine or on the Spark engine. Depending on your selected engine, the transformation runs differently. Select one of the following options to view how to set up the Merge rows (diff) step for your selected engine.
- Using Merge rows (diff) on the Pentaho engine: Learn how to set up this step when using the Pentaho engine.
- Using Merge rows (diff) on the Spark engine: Learn how to set up this step when using the Spark engine.
For instructions on selecting an engine from your transformation, see Run configurations.