# Destination & load patterns
Prerequisites · Completed Source patterns.
End state · A correct `destination` block with the right `load_type` and any
required supporting fields.
The `destination` block decides where data is written and how existing
data is treated. In DataCoolie, those are separate decisions:

- destination connection/format = physical target type
- `load_type` = write semantics against that target
- `partition_columns` and `write_options` = storage behavior modifiers
## Destination at a glance
"destination": {
"connection_name": "silver",
"schema_name": "sales",
"table": "orders",
"load_type": "merge_upsert",
"merge_keys": ["order_id"],
"partition_columns": [
{ "column": "order_date", "expression": "CAST(created_at AS DATE)" }
],
"configure": {
"write_options": { "mergeSchema": true }
}
}
| Field | Required | Meaning |
|---|---|---|
| `connection_name` | yes | Must match a destination connection |
| `schema_name` | no | Output namespace / folder |
| `table` | yes | Output table or folder name |
| `load_type` | yes | `append`, `overwrite`, `full_load`, `merge_upsert`, `merge_overwrite`, `scd2` |
| `merge_keys` | conditional | Required for merge and SCD2 strategies |
| `partition_columns` | no | Partition output by existing or derived columns |
| `configure.write_options` | no | Per-dataflow write-option overrides |
| `configure.scd2_effective_column` | conditional | Required for `scd2` |

`connection.configure.write_options` and `destination.configure.write_options`
are merged, with destination overrides winning.
## Which destination types are built in?
Built-in writers today support these destination families:
| Destination family | Supported formats | Addressing style | Supported load types | Maintenance |
|---|---|---|---|---|
| Flat-file output | `parquet`, `csv`, `json`, `jsonl`, `avro` | Path-based | `append`, `overwrite`, `full_load` | No |
| Lakehouse table | `delta`, `iceberg` | Path-based or catalog/database/table | All load types | Yes |
!!! note "What is not built in"
    - Excel is not a writable destination
    - database, API, and function destinations require custom destination plugins
## Choose your strategy first
```mermaid
flowchart TD
    A["Do you need full change history?"] -->|yes| SCD2["scd2"]
    A -->|no| B["Can the source deliver the full current state?"]
    B -->|"Full table every run"| OW["overwrite"]
    B -->|"Only new/changed rows"| C["Are duplicates acceptable?"]
    C -->|yes| APP["append"]
    C -->|"No — need latest-wins upsert"| MU["merge_upsert"]
    C -->|"No — need full refresh of matched keys"| MO["merge_overwrite"]
```
## `append` — Add rows only, never touch existing
When to use: event streams, log ingestion, any scenario where duplicates in the destination are acceptable or impossible.
"destination": {
"connection_name": "bronze",
"schema_name": "events",
"table": "clicks",
"load_type": "append"
}
No extra fields required. Every run adds the rows returned by the source.
Works for both flat-file outputs and lakehouse tables.
## `overwrite` — Replace everything
When to use: daily snapshots, reference tables, aggregates that are always rebuilt from scratch.
"destination": {
"connection_name": "silver",
"schema_name": "sales",
"table": "daily_totals",
"load_type": "overwrite"
}
!!! note "full_load is an alias"
    `"load_type": "full_load"` is equivalent. Use `overwrite` for new
    metadata.
This is the safest whole-table strategy for file outputs. If your destination
format is `parquet`, `csv`, `json`, `jsonl`, or `avro`, then `append`,
`overwrite`, and `full_load` are the only built-in load types.
## `merge_upsert` — Upsert by key (SCD1 / CDC)
When to use: incremental CDC-style loads, dimension tables that change over time but do not need history, customer/product master tables.
Rows matching `merge_keys` are updated; rows with no match are inserted.
Nothing is deleted.
"destination": {
"connection_name": "silver",
"schema_name": "sales",
"table": "customers",
"load_type": "merge_upsert",
"merge_keys": ["customer_id"]
}
| Field | Required | Notes |
|---|---|---|
| `merge_keys` | yes | List of column names that uniquely identify a row. Can be composite: `["order_id", "line_item_id"]` |
!!! note "Deduplication is often paired with merge_upsert"
    If your source can deliver duplicate rows for the same key, add
    `transform.deduplicate_columns` and `transform.latest_data_columns` to
    keep only the latest row before the merge, as sketched below. See
    Transform patterns.
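A minimal sketch of that pairing, assuming the dataflow metadata nests `transform` next to `destination` and that `updated_at` is the ordering column (both are illustrative; see Transform patterns for the authoritative shape):

```json
{
  "transform": {
    "deduplicate_columns": ["customer_id"],
    "latest_data_columns": ["updated_at"]
  },
  "destination": {
    "connection_name": "silver",
    "schema_name": "sales",
    "table": "customers",
    "load_type": "merge_upsert",
    "merge_keys": ["customer_id"]
  }
}
```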
!!! info "First load falls back to overwrite"
    If the target table does not exist yet, DataCoolie performs an initial overwrite-style write and only uses merge semantics on later runs.
## `merge_overwrite` — Rolling overwrite by key
When to use: nightly snapshot that always holds the full current state for a
rolling window.

Existing target rows matching `merge_keys` are deleted, then re-inserted from
the source. Unlike `merge_upsert`, this handles deletions in the source
naturally.
"destination": {
"connection_name": "silver",
"schema_name": "logistics",
"table": "active_shipments",
"load_type": "merge_overwrite",
"merge_keys": ["shipment_id"]
}
!!! info "First load falls back to overwrite"
    Like `merge_upsert`, this strategy writes a brand-new table with overwrite
    semantics when the destination does not exist yet.
## `scd2` — Slowly Changing Dimension Type 2
When to use: dimension tables where you need full change history — e.g.
customers, products, employees where you need to know what value was
current at any past point in time.
SCD2 stores one row per version of each entity. The framework automatically adds three audit columns to every version row:
| Column | Meaning |
|---|---|
| `__valid_from` | When this version became current (copied from your date column) |
| `__valid_to` | When this version ended — `NULL` means it is still current |
| `__is_current` | `true` for the active version |
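To make the audit columns concrete, a customer whose city changed once would be stored as two version rows (all values here are invented for illustration):

| customer_id | city | __valid_from | __valid_to | __is_current |
|---|---|---|---|---|
| 42 | Lyon | 2024-01-05 | 2025-03-10 | false |
| 42 | Paris | 2025-03-10 | NULL | true |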
"destination": {
"connection_name": "gold",
"schema_name": "dims",
"table": "customer",
"load_type": "scd2",
"merge_keys": ["customer_id"],
"configure": { "scd2_effective_column": "updated_at" }
}
| Field | Required | Notes |
|---|---|---|
| `merge_keys` | yes | The natural/business key of the entity |
| `configure.scd2_effective_column` | yes | The source column that timestamps when this version became effective |
!!! warning "SCD2 tables grow over time"
    Each run appends new versions for changed rows. Plan your storage and run maintenance (vacuum/optimize) regularly.
!!! info "First load falls back to overwrite"
    On the first run, DataCoolie creates the destination table first and then switches to SCD2 versioning on subsequent runs.
## `partition_columns` — Partition the output table
Any destination strategy can optionally write partitioned data. This is not a load type on its own — it is an addition to the destination block:
"destination": {
"connection_name": "silver",
"schema_name": "sales",
"table": "orders",
"load_type": "overwrite",
"partition_columns": [
{ "column": "order_date", "expression": "CAST(created_at AS DATE)" }
]
}
`expression` is evaluated as SQL before the write. The computed column
(`order_date`) is added to the DataFrame by `PartitionHandler` (transformer
order 80) and used as the partition key.
Partition by an existing column — no expression needed:
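```json
"partition_columns": [
  { "column": "country" }
]
```

Here `country` stands in for any column that already exists in the DataFrame.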
Multi-level partitioning:
"partition_columns": [
{ "column": "order_year", "expression": "EXTRACT(YEAR FROM order_date)" },
{ "column": "order_month", "expression": "EXTRACT(MONTH FROM order_date)" }
]
!!! note "Partition columns extend merge keys internally"
    For merge-style destinations, DataCoolie automatically appends destination partition columns to the internal merge-key set when they are not already present. This keeps merge semantics aligned with the physical partitioning. An example follows this note.
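For example, in the destination below (reusing fields from the earlier examples), the internal merge-key set effectively becomes `["order_id", "order_date"]` even though only `order_id` is declared:

```json
"destination": {
  "connection_name": "silver",
  "schema_name": "sales",
  "table": "orders",
  "load_type": "merge_upsert",
  "merge_keys": ["order_id"],
  "partition_columns": [
    { "column": "order_date", "expression": "CAST(created_at AS DATE)" }
  ]
}
```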
!!! tip "Use CAST, not date(...), for Polars portability"
    For partition expressions, prefer `CAST(created_at AS DATE)` over
    `date(created_at)` so the metadata works in Polars as well as Spark.
See How-to · Partitioning & sanitization for full detail.
## Flat-file outputs with date folders
Flat-file destinations have one more path-shaping option on the connection:
```json
{
  "name": "curated_parquet",
  "connection_type": "file",
  "format": "parquet",
  "configure": {
    "base_path": "data/output/curated",
    "date_folder_partitions": "{year}/{month}/{day}"
  }
}
```
This writes under a dated subfolder such as
`data/output/curated/sales/orders/2026/05/09`.
Use this when the partitioning is based on load time rather than a column in the DataFrame.
!!! note "partition_columns wins over date_folder_partitions"
    For flat-file destinations, if you configure both, DataCoolie uses
    `partition_columns` and ignores the date-folder pattern, as sketched below.
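A sketch of the conflicting configuration, reusing the `curated_parquet` connection above; the connection's `date_folder_partitions` pattern is ignored because the destination declares `partition_columns`:

```json
"destination": {
  "connection_name": "curated_parquet",
  "schema_name": "sales",
  "table": "orders",
  "load_type": "overwrite",
  "partition_columns": [
    { "column": "order_date", "expression": "CAST(created_at AS DATE)" }
  ]
}
```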
## Write options
Put write-engine options at the connection level when most dataflows should use them, and at the destination level when only one dataflow needs them.
Connection-level defaults:
```json
{
  "name": "silver",
  "connection_type": "lakehouse",
  "format": "delta",
  "configure": {
    "base_path": "data/output/silver",
    "write_options": {
      "mergeSchema": true
    }
  }
}
```
Destination-level override:
"destination": {
"connection_name": "silver",
"schema_name": "sales",
"table": "orders",
"load_type": "append",
"configure": {
"write_options": {
"compression": "zstd"
}
}
}
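Taken together, the two blocks above yield these effective write options for the `orders` dataflow; the merged object is shown only to illustrate the merge rule, it is not a field you configure:

```json
{
  "mergeSchema": true,
  "compression": "zstd"
}
```

If both levels had set the same key, the destination value would win.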
## Advanced lakehouse registration options
These options matter mostly for Delta/Iceberg deployments with metastore or AWS catalog integration:
| Option | Where | What it does |
|---|---|---|
| `catalog` / `database` | connection field or `configure` | Registers or addresses the destination by qualified name instead of path only |
| `athena_output_location` | `connection.configure` | After Delta writes and maintenance, registers a native Delta table through Athena DDL |
| `generate_manifest` | `connection.configure` | Generates `_symlink_format_manifest/` after writes and maintenance |
| `register_symlink_table` | `connection.configure` | Registers a Glue symlink table; implies manifest generation |
| `symlink_database_prefix` | `connection.configure` | Prefix for the generated symlink database name |
When `catalog` or `database` is present, DataCoolie identifies the physical
destination by qualified table name; otherwise it identifies it by path.
That distinction matters for maintenance deduplication and fan-in
orchestration. A sketch combining several of these options follows.
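A hedged sketch of a catalog-registered Delta connection; the catalog, database, and S3 paths are invented for illustration:

```json
{
  "name": "gold",
  "connection_type": "lakehouse",
  "format": "delta",
  "catalog": "main",
  "database": "gold_db",
  "configure": {
    "base_path": "s3://my-bucket/gold",
    "athena_output_location": "s3://my-bucket/athena-results/",
    "generate_manifest": true
  }
}
```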
## Maintenance support
- Flat-file destinations do not support maintenance.
- Delta and Iceberg destinations do.
- Maintenance is dispatched per physical destination, not per metadata row, so duplicate dataflows targeting the same table/path are deduplicated (see the example below).
See How-to · Maintenance (vacuum/optimize) for the operational workflow.
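For example, if two dataflows both declare the destination below (a hypothetical fan-in), vacuum/optimize is dispatched once for `silver.sales.orders`, not once per dataflow:

```json
"destination": {
  "connection_name": "silver",
  "schema_name": "sales",
  "table": "orders",
  "load_type": "append"
}
```

Any other dataflow resolving to the same connection/schema/table (or the same path) shares that single maintenance dispatch.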
## Common mistakes
| Symptom | Likely cause | Fix |
|---|---|---|
| `FileWriter only supports ['append', 'full_load', 'overwrite']` | Tried `merge_upsert`, `merge_overwrite`, or `scd2` on a flat-file destination | Use a Delta/Iceberg destination for merge semantics, or switch to `append` / `overwrite` |
| `merge_keys` required error | Used `merge_upsert` or `scd2` without `merge_keys` | Add `"merge_keys": ["your_key_column"]` |
| SCD2 columns not added | `scd2_effective_column` missing from `configure` | Add `"configure": { "scd2_effective_column": "updated_at" }` in `destination` |
| Full table replaced when you wanted upsert | `load_type` is `overwrite` instead of `merge_upsert` | Change `load_type` |
| Duplicate rows in destination after merge | Source delivers multiple rows for same key; dedup not configured | Add `transform.deduplicate_columns` — see Transform patterns |
| Partition expression fails on Polars | Used `date(col)` or another unsupported SQL function | Use `CAST(col AS DATE)` or `EXTRACT(...)` |
| Maintenance skipped or fails on file outputs | Flat-file destinations do not implement maintenance | Run maintenance only on Delta/Iceberg destinations |