Metadata schema¶
DataCoolie's metadata contract lives in datacoolie.core.models as
CompatModel-backed dataclasses. This page is generated at docs-build time
from those models — field descriptions, defaults, and validation rules come
straight from source. Treat the models as the source of truth; this page as the
rendered view.
Top-level run configuration¶
DataCoolieRunConfig
dataclass
¶
Validated execution parameters for a DataCoolie run.
Connection¶
Connection
dataclass
¶
Endpoint configuration for a data source or destination.
The configure JSON field stores type-specific settings (host, port,
read_options, write_options, etc.). Frequently-used values are
surfaced as computed properties.
database_type
property
¶
Database type (mysql, mssql, postgresql, oracle, sqlite).
athena_output_location
property
¶
S3 path for Athena DDL query results.
When set, the writer always registers a native Delta table via
Athena DDL (DROP + CREATE EXTERNAL TABLE ... TBLPROPERTIES
('table_type'='DELTA')) after every write and maintenance.
generate_manifest
property
¶
Generate _symlink_format_manifest/ after writes and maintenance.
register_symlink_table
property
¶
Register a SymlinkTextInputFormat table in Glue after writes.
Implies `generate_manifest`.
symlink_database_prefix
property
¶
Prefix for symlink Glue database name. Default "symlink_".
date_backward
property
¶
Backward look-back offset for date-folder partition discovery.
Reads backward_days, backward_months, backward_hours as
top-level keys from config, or a nested backward dict.
Strategies:

Fixed offset: subtract days / months / hours from the watermark:

```yaml
config:
  backward_days: 7
  # or
  backward: {days: 7, months: 1}
```

Closing-day: monthly period boundary based on the current date:

```yaml
config:
  backward: {closing_day: 10}
```
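The fixed-offset strategy can be sketched in plain Python. This is a minimal illustration, not the model's implementation: `resolve_backward` and `apply_fixed_offset` are hypothetical names, the closing-day branch is omitted because its exact boundary rule lives in the model, and month arithmetic clamps the day to the 28th for simplicity.

```python
from datetime import datetime, timedelta

def resolve_backward(config: dict) -> dict:
    """Collect offsets from top-level keys or a nested `backward` dict."""
    nested = config.get("backward", {})
    return {
        "days": config.get("backward_days", nested.get("days", 0)),
        "months": config.get("backward_months", nested.get("months", 0)),
        "hours": config.get("backward_hours", nested.get("hours", 0)),
    }

def apply_fixed_offset(watermark: datetime, offset: dict) -> datetime:
    """Subtract days/hours directly; months via calendar arithmetic."""
    shifted = watermark - timedelta(days=offset["days"], hours=offset["hours"])
    total_months = shifted.year * 12 + (shifted.month - 1) - offset["months"]
    year, month = divmod(total_months, 12)
    # clamp the day so e.g. Mar 31 minus one month stays a valid date
    return shifted.replace(year=year, month=month + 1, day=min(shifted.day, 28))
```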
refresh_from_configure
¶
Unconditionally sync database and catalog from configure.
Unlike the model validator (which only sets empty fields at
construction time), this always overwrites — call after secret
resolution when configure values have been resolved from vault
keys to real values.
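The contrast with the construction-time validator can be sketched as follows. `Conn` is a hypothetical stand-in for the real Connection model, shown only to illustrate the always-overwrite behaviour:

```python
from dataclasses import dataclass

@dataclass
class Conn:
    """Minimal stand-in for the real Connection model."""
    configure: dict
    database: str = ""
    catalog: str = ""

def refresh_from_configure(conn: Conn) -> None:
    """Overwrite database/catalog from configure even when already set,
    unlike the validator, which only fills empty fields at construction."""
    if "database" in conn.configure:
        conn.database = conn.configure["database"]
    if "catalog" in conn.configure:
        conn.catalog = conn.configure["catalog"]
```

Calling this after secret resolution replaces any stale vault placeholders with the resolved values.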
Source¶
Source
dataclass
¶
Read-side pipeline configuration.
read_options
property
¶
Merged read options: connection defaults + source overrides.
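The precedence can be illustrated with a dict spread. A sketch, assuming both levels store plain key/value option dicts (`merged_read_options` is a hypothetical helper, not the model's API):

```python
def merged_read_options(connection_opts: dict, source_opts: dict) -> dict:
    """Connection-level defaults first; source-level keys override them."""
    return {**connection_opts, **source_opts}
```

A source-level key beats the connection default, while keys the source does not set pass through unchanged.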
date_backward
property
¶
Backward look-back offset; the source-level value overrides the connection-level one.
Reads from configure (same keys as `Connection.date_backward`). If no
source-level config is present, falls back to the connection's value.
Example (YAML / source configure):

```yaml
configure:
  backward_days: 7   # overrides connection setting
  # or
  backward: {months: 1}
  # or closing-day strategy
  backward: {closing_day: 10}
```
Destination¶
Destination
dataclass
¶
Write-side pipeline configuration.
destination_key
property
¶
Stable identity for this destination as a physical object.
Two destinations that resolve to the same physical object share the same key. Useful for orchestration concerns like deduplicating fan-in writes or scheduling maintenance at most once per object.
Identity priority:

1. Fully-qualified table name when `catalog` or `database` is set on the connection; this matches how Databricks Unity Catalog, Fabric Lakehouse, and AWS Glue address tables.
2. Storage path otherwise; covers unregistered Delta tables (local dev / tests).
Results are prefixed ("table:" / "path:") to prevent a
path string from colliding with a qualified name, and lowercased
for case-insensitive equivalence.
Raises:

| Type | Description |
|---|---|
| `ConfigurationError` | When the destination has neither a catalog/database registration nor a storage path. |
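The priority rules can be sketched as a small function. The signature and the ValueError (standing in for ConfigurationError) are illustrative assumptions, not the model's code:

```python
def destination_key(catalog, database, table, path):
    """Stable identity: qualified table name when registered, else storage path."""
    if catalog or database:
        qualified = ".".join(p for p in (catalog, database, table) if p)
        return f"table:{qualified}".lower()  # prefix avoids path/name collisions
    if path:
        return f"path:{path}".lower()  # lowercased for case-insensitive equivalence
    # stand-in for ConfigurationError
    raise ValueError("destination has neither a registration nor a storage path")
```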
write_options
property
¶
Merged write options: connection defaults + destination overrides.
merge_keys_extended
property
¶
Return merge keys extended with partition columns.
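A sketch of the extension logic, assuming list-of-name inputs (the helper name mirrors the property, but the body is illustrative):

```python
def merge_keys_extended(merge_keys: list, partition_cols: list) -> list:
    """Merge keys plus any partition columns not already listed, order preserved."""
    return merge_keys + [c for c in partition_cols if c not in merge_keys]
```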
scd2_effective_column
property
¶
SQL expression used as __valid_from for SCD2 loads.
Read from destination.configure["scd2_effective_column"].
Returns None when not set (non-SCD2 destinations).
Transform¶
Transform
dataclass
¶
Transformation rules applied between source read and destination write.
convert_timestamp_ntz
property
¶
Whether to convert timestamp_ntz columns to timestamp.
Reads `convert_timestamp_ntz` from `configure`.
Defaults to True.
Example (YAML / metadata):

```yaml
transform:
  configure:
    convert_timestamp_ntz: false
```
deduplicate_by_rank
property
¶
Whether to use RANK-based deduplication instead of ROW_NUMBER.
Reads `deduplicate_by_rank` from `configure`.
Defaults to False.
Example (YAML / metadata):

```yaml
transform:
  configure:
    deduplicate_by_rank: true
```
deduplicate_column_names
¶
Return dedup columns, falling back to merge_keys.
DataFlow¶
DataFlow
dataclass
¶
Complete ETL pipeline configuration.
Composes `Source`, `Destination`, and `Transform`.
order_columns
property
¶
Columns used to order rows during deduplication.
Returns transform.latest_data_columns when set, otherwise
falls back to source.watermark_columns.
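The fallback reads naturally as a short-circuit. A sketch under the assumption that both attributes are plain lists:

```python
def order_columns(latest_data_columns: list, watermark_columns: list) -> list:
    """Transform-level ordering columns win; otherwise use the source watermark."""
    return latest_data_columns or watermark_columns
```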
Supporting models¶
PartitionColumn
dataclass
¶
Partition column definition.
expression is an optional SQL expression used to derive the partition
value (e.g. "year(event_date)").
Enums¶
LoadType
¶
Supported load (write) strategies.
Format
¶
Supported data formats.
ConnectionType
¶
Connection endpoint categories.
ProcessingMode
¶
ETL processing modes.
DataFlowStatus
¶
Dataflow execution statuses.