Metadata guide for new users¶
If you are a new Data Engineer or Data Analyst, have just installed DataCoolie, and are not sure how to configure your first pipeline, start here.
DataCoolie is driven entirely by metadata — a JSON (or YAML, or Excel) document that tells the framework what to read, how to transform it, where to write it, and when to re-run incrementally. You do not write Python for each pipeline; you fill in a structured document.
This guide walks through that document from zero, but it also covers the cases that usually appear right after the first successful run: incremental loads, query-based sources, API pagination, function sources, partitioning, merge strategies, secrets, and metadata-provider differences.
Sequence: read in this order¶
New user path
Work through the five pages in order on your first day. After that, the individual pages work as standalone references.
| Step | Page | You will learn |
|---|---|---|
| 1 | Build your first metadata file | Minimum valid document — two connections, one dataflow |
| 2 | Source patterns | How to configure any source type (file, database, API, Delta, Iceberg, function) |
| 3 | Destination & load patterns | Which load_type to pick and what extra fields it needs |
| 4 | Transform patterns | Cast types, deduplicate, add computed columns, partition output |
| 5 | Validation checklist | Catch mistakes before the first run |
Coverage map¶
| Area | Covered in this guide | Key cases |
|---|---|---|
| Metadata shape | Yes | connections[], dataflows[], optional orchestration fields, configure, secrets_ref |
| Metadata backends | Yes | JSON, YAML, Excel, database provider, API provider |
| Source types | Yes | File, Delta, Iceberg, database table/query, REST API, Python function |
| Destination types | Yes | File outputs, Delta, Iceberg, partitioned writes, lakehouse registration |
| Load strategies | Yes | append, overwrite, full_load, merge_upsert, merge_overwrite, scd2 |
| Transform features | Yes | schema_hints, deduplication, computed columns, partition expressions, SCD2/system columns |
| Validation & safety | Yes | dry_run, smoke tests, secret resolution, common errors |
Important edge cases
This guide covers the real behavior of the current framework, including:
- `connection_type` can be derived automatically from `format`
- Excel is a supported source format, not a writable destination
- flat-file destinations support `append`, `overwrite`, and `full_load`, but not merge or SCD2 (sketched below)
- `connection_type: "streaming"` exists in the model but has no supported formats yet
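A hedged sketch of that flat-file restriction: a CSV output connection shaped like the input connection in the quick example below, written with append. The connection name and path are placeholders; connection_type is omitted because it is derived from format, and asking this destination for merge_upsert or scd2 would be rejected.

{
  "name": "csv_output",
  "format": "csv",
  "configure": { "base_path": "data/output/extracts" }
}

paired with a destination such as { "connection_name": "csv_output", "table": "daily_orders", "load_type": "append" }.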
Where metadata lives¶
DataCoolie supports three metadata backends. Choose one:
| Backend | Good for | Operational note | How-to |
|---|---|---|---|
| JSON / YAML / Excel file | Local dev, small teams, single-machine runs | JSON should stay canonical; YAML/Excel are alternative views or generated siblings | Configure file metadata |
| Relational database | Shared team configuration, multi-workspace governance | Rows are workspace-scoped via `workspace_id` | Configure database metadata |
| REST API | Enterprise ops, Git-backed or approval-gated config | Endpoints are workspace-scoped under `/workspaces/{workspace_id}/...` | Configure API metadata |
Recommendation for beginners
Start with a JSON file. The file backend requires no database and no
service — just create a .json file and point FileProvider at it.
You can migrate to the database or API backend later without changing a
single field in your metadata.
What metadata tells the framework¶
metadata.json
├── connections[] ← WHERE to read from and write to
│ ├── name / format / configure
│ ├── catalog / database / base_path
│ └── secrets_ref / is_active / workspace_id
└── dataflows[] ← HOW to move data
├── name / stage / description
├── group_number / execution_order / processing_mode / is_active
├── source ← which connection + table/query/function to read
├── destination ← which connection + table + load_type to write
└── transform ← schema hints, dedup, computed columns (optional)
Start with connections and dataflows. The rest is optional and you can
add it incrementally.
If you are unsure where a field belongs, use this rule:
- `Connection.configure` = reusable endpoint defaults
- `source.configure` / `destination.configure` = per-dataflow overrides (sketched below)
- `transform.configure` = transformer behavior flags
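A hedged sketch of that override rule in practice: the connection sets a default base_path, and one dataflow's source.configure points the same connection at a different folder for a one-off backfill. Whether base_path specifically can be overridden per dataflow is an assumption here; the point is the structure, with defaults on the connection and overrides on the source.

Connection-level default:

{
  "name": "csv_input",
  "connection_type": "file",
  "format": "csv",
  "configure": { "base_path": "data/input" }
}

Per-dataflow override inside one dataflow's source:

"source": {
  "connection_name": "csv_input",
  "table": "orders",
  "configure": { "base_path": "data/backfill" }
}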
Quick example (30 seconds)¶
{
"connections": [
{
"name": "csv_input",
"connection_type": "file",
"format": "csv",
"configure": { "base_path": "data/input" }
},
{
"name": "bronze",
"format": "delta",
"configure": { "base_path": "data/output/bronze" }
}
],
"dataflows": [
{
"name": "orders_to_bronze",
"stage": "ingest",
"source": { "connection_name": "csv_input", "table": "orders" },
"destination": { "connection_name": "bronze", "schema_name": "sales", "table": "orders", "load_type": "append" }
}
]
}
This reads data/input/orders (a folder of CSV files) and appends to a Delta
table at data/output/bronze/sales/orders. In this example DataCoolie derives
connection_type: "lakehouse" from format: "delta".
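When you are ready to add the optional transform block, a hedged sketch of what it could look like, reusing the feature names from the coverage map (schema_hints, deduplication, partitioning). The nested key names and value shapes here are illustrative assumptions, not documented keys; the Transform patterns page has the real ones.

"transform": {
  "schema_hints": { "order_id": "bigint", "order_date": "date" },
  "deduplicate": { "keys": ["order_id"] },
  "configure": { "partition_by": ["order_date"] }
}

The block slots into the orders_to_bronze dataflow alongside source and destination.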
→ Next: Build your first metadata file