Merge & SCD2¶
Prerequisites · Destination is a lakehouse (Delta or Iceberg). Keys exist on both source and destination. End state · Incremental merges working correctly; SCD Type 2 history tracked.
merge_upsert — standard upsert¶
{
"destination": {
"connection": "bronze",
"schema_name": "sales",
"table": "customers",
"load_type": "merge_upsert",
"merge_keys": ["customer_id"]
}
}
Matched rows are updated, unmatched rows are inserted. Nothing is deleted. Ideal for CDC-style incremental loads where the source delivers only changed rows.
merge_overwrite — rolling window¶
For every (event_date, device_id) in the source, destination rows with the
same keys are deleted and the source rows are inserted. Use when the
source always holds the full current state for its window.
scd2 — Type 2 with history¶
{
"destination": {
"load_type": "scd2",
"merge_keys": ["customer_id"],
"configure": {
"scd2_effective_column": "updated_at"
}
}
}
scd2_effective_column names a source column whose value is taken as the row's
business validity start. The SCD2ColumnAdder transformer (order 60) adds
three framework columns before write:
__valid_from— copied fromscd2_effective_column__valid_to— NULL on new rows__is_current—trueon new rows
On write the engine's scd2_* method runs a two-step MERGE:
- Close step — for every source row whose
merge_keysmatch a target row where__is_current = trueand the source__valid_fromis later than the target__valid_from(late-arrival guard), set__valid_to = source.__valid_fromand__is_current = false. - Append step — insert all source rows as new versions.
There is no stored hash column. Change detection is driven by the
effective-date column you nominate — if the source delivers a row with a
newer updated_at than the current version, a new version is created. If it
delivers a row with an equal or older updated_at, the late-arrival guard
skips the close step.
Reading "current state only"¶
Reading "point-in-time"¶
SELECT * FROM customers_scd2
WHERE __valid_from <= CAST('2026-01-01' AS TIMESTAMP)
AND COALESCE(__valid_to, CAST('9999-12-31' AS TIMESTAMP)) > CAST('2026-01-01' AS TIMESTAMP)
Which strategy to pick¶
See the decision tree in Concepts · Load strategies.