
Metadata schema

DataCoolie's metadata contract lives in datacoolie.core.models as CompatModel-backed dataclasses. This page is generated at docs-build time from those models — field descriptions, defaults, and validation rules come straight from source. Treat the models as the source of truth and this page as the rendered view.

Top-level run configuration

DataCoolieRunConfig dataclass

DataCoolieRunConfig()

Validated execution parameters for a DataCoolie run.

Connection

Connection dataclass

Connection()

Endpoint configuration for a data source or destination.

The configure JSON field stores type-specific settings (host, port, read_options, write_options, etc.). Frequently-used values are surfaced as computed properties.
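Illustrative connection configure entry (values are placeholders; keys other than those documented on this page are assumptions)::

configure:
  host: db.internal.example
  port: 5432
  database_type: postgresql
  driver: org.postgresql.Driver
  read_options:
    fetchsize: 10000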

base_path property

base_path: Optional[str]

Base storage path (e.g. abfss://container@storage/).

database_type property

database_type: Optional[str]

Database type (mysql, mssql, postgresql, oracle, sqlite).

url property

url: Optional[str]

Explicit URL / connection string from configure.

driver property

driver: Optional[str]

JDBC driver class name.

athena_output_location property

athena_output_location: Optional[str]

S3 path for Athena DDL query results.

When set, the writer always registers a native Delta table via Athena DDL (DROP + CREATE EXTERNAL TABLE ... TBLPROPERTIES ('table_type'='DELTA')) after every write and maintenance.
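Example (YAML / destination connection configure; the bucket name is a placeholder)::

configure:
  athena_output_location: s3://my-bucket/athena-query-results/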

generate_manifest property

generate_manifest: bool

Generate _symlink_format_manifest/ after writes and maintenance.

register_symlink_table property

register_symlink_table: bool

Register a SymlinkTextInputFormat table in Glue after writes.

Implies generate_manifest.

symlink_database_prefix property

symlink_database_prefix: str

Prefix for the symlink Glue database name. Default: "symlink_".

date_backward property

date_backward: Optional[Dict[str, Any]]

Backward look-back offset for date-folder partition discovery.

Reads backward_days, backward_months, or backward_hours as top-level keys from configure, or a nested backward dict.

Strategies:

Fixed offset — subtract days / months / hours from watermark::

config:
  backward_days: 7
  # or
  backward: {days: 7, months: 1}

Closing-day — monthly period boundary based on current date::

config:
  backward: {closing_day: 10}

refresh_from_configure

refresh_from_configure() -> None

Unconditionally sync database and catalog from configure.

Unlike the model validator (which only sets empty fields at construction time), this always overwrites — call after secret resolution when configure values have been resolved from vault keys to real values.
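The difference between the construction-time validator and this method can be sketched with a simplified stand-in model (MiniConnection, its fields, and its logic are illustrative only, not the real Connection):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MiniConnection:
    # Simplified stand-in for Connection; names and behaviour are assumptions.
    configure: Dict[str, Any] = field(default_factory=dict)
    database: Optional[str] = None
    catalog: Optional[str] = None

    def __post_init__(self) -> None:
        # Model-validator behaviour: fill only fields that are still empty.
        if self.database is None:
            self.database = self.configure.get("database")
        if self.catalog is None:
            self.catalog = self.configure.get("catalog")

    def refresh_from_configure(self) -> None:
        # Unconditional sync: always overwrite, e.g. after vault keys in
        # configure have been resolved to real values.
        self.database = self.configure.get("database")
        self.catalog = self.configure.get("catalog")
```

After secret resolution mutates configure, only refresh_from_configure propagates the new values to the already-populated fields.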

Source

Source dataclass

Source()

Read-side pipeline configuration.

namespace property

namespace: Optional[str]

Namespace without the table: catalog.database.schema.

read_options property

read_options: Dict[str, Any]

Merged read options: connection defaults + source overrides.
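The merge can be sketched as a plain dict update (assumed semantics: source-level keys win over connection-level defaults):

```python
from typing import Any, Dict


def merged_read_options(connection_opts: Dict[str, Any],
                        source_opts: Dict[str, Any]) -> Dict[str, Any]:
    # Start from connection defaults, then layer source overrides on top.
    merged = dict(connection_opts)
    merged.update(source_opts)
    return merged
```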

date_backward property

date_backward: Optional[Dict[str, Any]]

Backward look-back offset; the source-level value overrides the connection-level one.

Reads from configure (same keys as Connection.date_backward). If no source-level config is present, falls back to the connection's value.

Example (YAML / source configure)::

configure:
  backward_days: 7         # overrides connection setting
  # or
  backward: {months: 1}
  # or closing-day strategy
  backward: {closing_day: 10}

Destination

Destination dataclass

Destination()

Write-side pipeline configuration.

namespace property

namespace: Optional[str]

Namespace without the table: catalog.database.schema.

destination_key property

destination_key: str

Stable identity for this destination as a physical object.

Two destinations that resolve to the same physical object share the same key. Useful for orchestration concerns like deduplicating fan-in writes or scheduling maintenance at most once per object.

Identity priority:

  1. Fully-qualified table name when catalog or database is set on the connection — this matches how Databricks Unity Catalog, Fabric Lakehouse, and AWS Glue address tables.
  2. Storage path otherwise — covers unregistered Delta tables (local dev / tests).

Results are prefixed ("table:" / "path:") to prevent a path string from colliding with a qualified name, and lowercased for case-insensitive equivalence.

Raises:

Type Description
ConfigurationError

When the destination has neither a catalog/database registration nor a storage path.
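A minimal sketch of those identity rules (the real logic lives on Destination; the parameter names here are assumptions):

```python
from typing import Optional


def destination_key(catalog: Optional[str], database: Optional[str],
                    schema: Optional[str], table: Optional[str],
                    path: Optional[str]) -> str:
    # 1. Catalog/database registration wins: use the qualified table name.
    if catalog or database:
        parts = [p for p in (catalog, database, schema, table) if p]
        return ("table:" + ".".join(parts)).lower()
    # 2. Otherwise fall back to the storage path.
    if path:
        return ("path:" + path).lower()
    raise ValueError("neither a catalog/database registration nor a storage path")
```

The "table:" / "path:" prefixes keep the two namespaces disjoint, and lowercasing makes equivalence case-insensitive.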

write_options property

write_options: Dict[str, Any]

Merged write options: connection defaults + destination overrides.

merge_keys_extended property

merge_keys_extended: List[str]

Return merge keys extended with partition columns.
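The extension can be sketched as an order-preserving union (assumed semantics):

```python
from typing import List


def merge_keys_extended(merge_keys: List[str],
                        partition_columns: List[str]) -> List[str]:
    # Append partition columns that are not already merge keys.
    return merge_keys + [c for c in partition_columns if c not in merge_keys]
```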

scd2_effective_column property

scd2_effective_column: Optional[str]

SQL expression used as __valid_from for SCD2 loads.

Read from destination.configure["scd2_effective_column"]. Returns None when not set (non-SCD2 destinations).
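Example (YAML / destination configure; the column expression is a placeholder)::

destination:
  configure:
    scd2_effective_column: "coalesce(updated_at, created_at)"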

Transform

Transform dataclass

Transform()

Transformation rules applied between source read and destination write.

convert_timestamp_ntz property

convert_timestamp_ntz: bool

Whether to convert timestamp_ntz columns to timestamp.

Reads convert_timestamp_ntz from configure. Defaults to True.

Example (YAML / metadata)::

transform:
  configure:
    convert_timestamp_ntz: false

deduplicate_by_rank property

deduplicate_by_rank: bool

Whether to use RANK-based deduplication instead of ROW_NUMBER.

Reads deduplicate_by_rank from configure. Defaults to False.

Example (YAML / metadata)::

transform:
  configure:
    deduplicate_by_rank: true

deduplicate_column_names

deduplicate_column_names(merge_keys: List[str] | None = None) -> List[str]

Return dedup columns, falling back to merge_keys.

DataFlow

DataFlow dataclass

DataFlow()

Complete ETL pipeline configuration.

Composes Source, Destination, and Transform.

order_columns property

order_columns: List[str]

Columns used to order rows during deduplication.

Returns transform.latest_data_columns when set, otherwise falls back to source.watermark_columns.
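Example (YAML / metadata; column names are placeholders)::

transform:
  latest_data_columns: [updated_at, id]

When latest_data_columns is absent, source.watermark_columns is used instead.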

Supporting models

SchemaHint dataclass

SchemaHint()

Column-level type hint for schema conversion.

PartitionColumn dataclass

PartitionColumn()

Partition column definition.

expression is an optional SQL expression used to derive the partition value (e.g. "year(event_date)").
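Example (YAML / metadata; the name field and list key are assumptions)::

partition_columns:
  - name: event_year
    expression: "year(event_date)"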

AdditionalColumn dataclass

AdditionalColumn()

Computed column added during the transform phase.

Enums

LoadType

Supported load (write) strategies.

Format

Supported data formats.

ConnectionType

Connection endpoint categories.

ProcessingMode

ETL processing modes.

DataFlowStatus

Dataflow execution statuses.