Metadata guide for new users¶
If you are a new Data Engineer or Data Analyst, have just installed DataCoolie, and are not sure how to configure your first pipeline, start here.
DataCoolie is driven entirely by metadata — a JSON (or YAML, or Excel) document that tells the framework what to read, how to transform it, where to write it, and when to re-run incrementally. You do not write Python for each pipeline; you fill in a structured document.
This guide walks through that document from zero, but it also covers the cases that usually appear right after the first successful run: incremental loads, query-based sources, API pagination, function sources, partitioning, merge strategies, secrets, and metadata-provider differences.
Sequence: read in this order¶
New user path
Work through the five pages in order on your first day. After that, the individual pages work as standalone references.
| Step | Page | You will learn |
|---|---|---|
| 1 | Build your first metadata file | Minimum valid document — two connections, one dataflow |
| 2 | Source patterns | How to configure any source type (file, database, API, Delta, Iceberg, function) |
| 3 | Destination & load patterns | Which load_type to pick and what extra fields it needs |
| 4 | Transform patterns | Cast types, deduplicate, add computed columns, partition output |
| 5 | Validation checklist | Catch mistakes before the first run |
Coverage map¶
| Area | Covered in this guide | Key cases |
|---|---|---|
| Metadata shape | Yes | connections[], dataflows[], optional orchestration fields, configure, secrets_ref |
| Metadata backends | Yes | JSON, YAML, Excel, database provider, API provider |
| Source types | Yes | File, Delta, Iceberg, database table/query, REST API, Python function |
| Destination types | Yes | File outputs, Delta, Iceberg, partitioned writes, lakehouse registration |
| Load strategies | Yes | append, overwrite, full_load, merge_upsert, merge_overwrite, scd2 |
| Transform features | Yes | schema_hints, deduplication, computed columns, partition expressions, SCD2/system columns |
| Validation & safety | Yes | dry_run, smoke tests, secret resolution, common errors |
Important edge cases
This guide covers the real behavior of the current framework, including:
- `connection_type` can be derived automatically from `format`
- Excel is a supported source format, not a writable destination
- flat-file destinations support `append`, `overwrite`, and `full_load`, but not merge or SCD2 (sketched below)
- `connection_type: "streaming"` exists in the model but has no supported formats yet
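A hedged sketch of that flat-file restriction: a CSV output connection shaped like the input connection in the quick example below, written with append. The connection name and path are placeholders; connection_type is omitted because it is derived from format, and asking this destination for merge_upsert or scd2 would be rejected.

{
  "name": "csv_output",
  "format": "csv",
  "configure": { "base_path": "data/output/extracts" }
}

paired with a destination such as { "connection_name": "csv_output", "table": "daily_orders", "load_type": "append" }.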
Where metadata lives¶
DataCoolie supports three metadata backends. Choose one:
| Backend | Good for | Operational note | How-to |
|---|---|---|---|
| JSON / YAML / Excel file | Local dev, small teams, single-machine runs | JSON should stay canonical; YAML/Excel are alternative views or generated siblings | Configure file metadata |
| Relational database | Shared team configuration, multi-workspace governance | Rows are workspace-scoped via `workspace_id` | Configure database metadata |
| REST API | Enterprise ops, Git-backed or approval-gated config | Endpoints are workspace-scoped under `/workspaces/{workspace_id}/...` | Configure API metadata |
Recommendation for beginners
Start with a JSON file. The file backend requires no database and no
service — just create a .json file and point FileProvider at it.
You can migrate to the database or API backend later without changing a
single field in your metadata.
What metadata tells the framework¶
metadata.json
├── connections[] ← WHERE to read from and write to
│ ├── name / format / configure
│ ├── catalog / database / base_path
│ └── secrets_ref / is_active / workspace_id
└── dataflows[] ← HOW to move data
├── name / stage / description
├── group_number / execution_order / processing_mode / is_active
├── source ← which connection + table/query/function to read
├── destination ← which connection + table + load_type to write
└── transform ← schema hints, dedup, computed columns (optional)
Start with connections and dataflows. The rest is optional and you can
add it incrementally.
If you are unsure where a field belongs, use this rule:
- `Connection.configure` = reusable endpoint defaults
- `source.configure` / `destination.configure` = per-dataflow overrides (sketched below)
- `transform.configure` = transformer behavior flags
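A hedged sketch of that override rule in practice: the connection sets a default base_path, and one dataflow's source.configure points the same connection at a different folder for a one-off backfill. Whether base_path specifically can be overridden per dataflow is an assumption here; the point is the structure, with defaults on the connection and overrides on the source.

Connection-level default:

{
  "name": "csv_input",
  "connection_type": "file",
  "format": "csv",
  "configure": { "base_path": "data/input" }
}

Per-dataflow override inside one dataflow's source:

"source": {
  "connection_name": "csv_input",
  "table": "orders",
  "configure": { "base_path": "data/backfill" }
}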
Quick example (30 seconds)¶
{
"connections": [
{
"name": "csv_input",
"connection_type": "file",
"format": "csv",
"configure": { "base_path": "data/input" }
},
{
"name": "bronze",
"format": "delta",
"configure": { "base_path": "data/output/bronze" }
}
],
"dataflows": [
{
"name": "orders_to_bronze",
"stage": "ingest",
"source": { "connection_name": "csv_input", "table": "orders" },
"destination": { "connection_name": "bronze", "schema_name": "sales", "table": "orders", "load_type": "append" }
}
]
}
This reads data/input/orders (a folder of CSV files) and appends to a Delta
table at data/output/bronze/sales/orders. In this example DataCoolie derives
connection_type: "lakehouse" from format: "delta".
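When you are ready to add the optional transform block, a hedged sketch of what it could look like, reusing the feature names from the coverage map (schema_hints, deduplication, partitioning). The nested key names and value shapes here are illustrative assumptions, not documented keys; the Transform patterns page has the real ones.

"transform": {
  "schema_hints": { "order_id": "bigint", "order_date": "date" },
  "deduplicate": { "keys": ["order_id"] },
  "configure": { "partition_by": ["order_date"] }
}

The block slots into the orders_to_bronze dataflow alongside source and destination.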
→ Next: Build your first metadata file