Skip to content

Partitioning & column sanitization

Prerequisites · Destination is a lakehouse or file format that supports partitioning. End state · Partition columns derived via SQL expressions; column names normalised for all downstream tools.

Declarative partition columns

"destination": {
  "partition_columns": [
    {"column": "order_date", "expression": "date(updated_at)"},
    {"column": "region"}
  ]
}

Two shapes:

  • {"column": "x"} — use the existing column x as a partition column.
  • {"column": "x", "expression": "..."} — derive x from a SQL expression, then partition on it.

The PartitionHandler transformer (order 80) runs after SystemColumnAdder (order 70), so you can partition on framework audit columns like __updated_at:

{"column": "etl_date", "expression": "CAST(__updated_at AS DATE)"}

SQL expression portability

Expressions go through engine.add_column(df, name, expression) which uses:

  • Spark: F.expr(expression) — full Spark SQL surface.
  • Polars: pl.sql_expr(expression)subset of ANSI SQL.

Polars limitations to know

Spark works Polars equivalent
current_timestamp() "CAST('2026-01-01' AS TIMESTAMP)" — use a literal or the SystemColumnAdder's __updated_at.
NOW() Same as above.
year(col) EXTRACT(YEAR FROM col)
date_format(col, "yyyy-MM-dd") CAST(col AS DATE)

Pick expressions that work on both engines if you plan to swap.

Column name sanitization

Runs last (order 90) via ColumnNameSanitizer. Actions:

  • Lowercase by default (ColumnCaseMode.LOWER).
  • Replace spaces, dots, dashes with underscores.
  • Strip leading digits or invalid characters.

Override by subclassing DataCoolieDriver and setting self._column_name_mode.

Gotchas

Symptom Likely cause Fix
pl.sql_expr error "unknown function current_timestamp" Polars engine Use __updated_at from SystemColumnAdder.
Partition column not present at write Expression references a source-only column dropped by ColumnAdder Move expression into additional_columns instead.
"yyyy-MM-dd" format error Java-style format Remove format string; Polars auto-detects ISO dates via CAST(... AS DATE).