Build your first metadata file

Prerequisites · DataCoolie installed (pip install datacoolie).
End state · A valid metadata.json you can pass to FileProvider and run.

This page uses JSON because it is the easiest format to learn and the recommended canonical source. The same model also works in YAML, Excel, database-backed metadata, and API-backed metadata. Only the storage backend changes; the connection, source, destination, and transform fields stay the same.

Step 1 — Create the file skeleton

Create a file called metadata.json in your project folder. The document always has two top-level arrays:

{
  "connections": [],
  "dataflows":   []
}

connections describes where data lives.
dataflows describes how to move it.

You can stop there for a minimum-valid file. As your project grows, the same top-level document can also carry optional fields such as workspace_id, description, group_number, execution_order, processing_mode, is_active, and per-block configure overrides.
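For a feel of what that growth looks like, here is a sketch of a document carrying two of those optional fields at the top level, as the paragraph above describes; the workspace id and description values are placeholders, and you only add the fields your project actually uses:

{
  "workspace_id": "ws_example",
  "description":  "Ingest raw sales files into bronze",
  "connections":  [],
  "dataflows":    []
}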

Step 2 — Add a source connection

A connection has four required fields in the beginner case:

{
  "name":            "my_source",
  "connection_type": "file",
  "format":          "csv",
  "configure":       { "base_path": "data/input" }
}
Field Required What to put there
name yes A unique label you pick — used to reference this connection from a dataflow
connection_type usually One of file, lakehouse, database, api, function. If omitted, DataCoolie derives it from format
format yes The data format — must match the connection type (see table below)
configure yes A JSON object of backend-specific settings. base_path is the root folder for file connections
workspace_id no Used by database/API metadata providers and in structured logging
catalog / database no Use for registered lakehouse tables (Databricks, Fabric, Glue) instead of path-only addressing
secrets_ref no Tells DataCoolie which configure fields must be resolved from a secret source
is_active no Set false to disable a connection without deleting it
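To make the optional rows concrete, here is a sketch of a file connection that also carries workspace_id and is_active; the values are placeholders. The exact shape of secrets_ref depends on your secret backend, so it is omitted here:

{
  "name":            "my_source",
  "connection_type": "file",
  "format":          "csv",
  "workspace_id":    "ws_example",
  "is_active":       true,
  "configure":       { "base_path": "data/input" }
}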

Valid connection_type → format pairs

connection_type Allowed format values
file csv, parquet, json, jsonl, avro, excel
lakehouse delta, iceberg
database sql
api api
function function

Mismatched type + format

DataCoolie validates connection_type against format at load time. For example, connection_type: "file" with format: "delta" raises an error. Use connection_type: "lakehouse" for Delta/Iceberg.

connection_type can be derived

This is also valid:

{
  "name": "bronze",
  "format": "delta",
  "configure": { "base_path": "data/output/bronze" }
}

Because format: "delta" belongs to the lakehouse family, DataCoolie derives connection_type: "lakehouse" automatically.

Streaming is reserved, not active

The model defines connection_type: "streaming", but no formats are wired to it yet. Treat DataCoolie metadata today as batch-first: batch, microbatch, and streaming may appear as processing_mode values on a dataflow, but the connection-side streaming type is not yet usable.

Step 3 — Add a destination connection

Add a second connection for where you want to write:

{
  "name":            "bronze",
  "connection_type": "lakehouse",
  "format":          "delta",
  "configure":       { "base_path": "data/output/bronze" }
}

Your connections array now has two entries:

"connections": [
  {
    "name": "my_source", "connection_type": "file", "format": "csv",
    "configure": { "base_path": "data/input" }
  },
  {
    "name": "bronze", "connection_type": "lakehouse", "format": "delta",
    "configure": { "base_path": "data/output/bronze" }
  }
]

Step 4 — Add a dataflow

A dataflow is one read → transform → write unit. It refers to your connections by name:

{
  "name":  "orders_to_bronze",
  "stage": "ingest",
  "source":      { "connection_name": "my_source", "schema_name": "sales", "table": "orders" },
  "destination": { "connection_name": "bronze",    "schema_name": "sales", "table": "orders", "load_type": "append" }
}

Key dataflow fields

Field Required What to put there
name yes Unique label for this dataflow
stage yes Free string — the filter you pass to driver.run(stage="ingest"). Group logically related dataflows under the same stage name
source.connection_name yes Must match a name in your connections array
source.schema_name no Subdirectory or schema; combined with table to build the path. Omit if not needed
source.table usually The folder name (file sources) or table name (database/lakehouse). Omit when a query or python_function supplies the data instead (see the source block section below)
destination.connection_name yes Must match a name in your connections array
destination.schema_name no Output subdirectory or schema
destination.table yes Output folder or table name
destination.load_type yes How to write — append, overwrite, merge_upsert, merge_overwrite, or scd2. See Destination & load patterns

Common optional dataflow fields

These are the fields teams typically add after the first successful run:

Field What it does
description Free-text description of the pipeline's business purpose
workspace_id Important for database/API metadata backends and workspace-scoped logging
group_number Groups dataflows for orchestration; lower groups run first
execution_order Orders dataflows within a group
processing_mode batch by default; microbatch and streaming are model values for future/specialized flows
is_active Set false to keep the metadata but skip execution
configure Arbitrary per-dataflow settings for custom readers/writers/extensions
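Putting those together, a sketch of a dataflow that uses the optional fields above; every value is a placeholder you would replace with your own:

{
  "name":  "orders_to_bronze",
  "stage": "ingest",
  "description":     "Land raw order files in bronze",
  "workspace_id":    "ws_example",
  "group_number":    1,
  "execution_order": 10,
  "processing_mode": "batch",
  "is_active":       true,
  "source":      { "connection_name": "my_source", "table": "orders" },
  "destination": { "connection_name": "bronze", "table": "orders", "load_type": "append" }
}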

Source block: more than connection_name + table

The minimum source block uses just a connection name and a table. The full model supports several additional cases:

Source field Use it when
schema_name The source is under a folder / schema namespace
watermark_columns You want incremental reads
query The source is a SQL query instead of a table
python_function The source is a custom Python loader, for example mypkg.loaders.load_orders
configure.read_options You need to override connection-level read options for only one dataflow
configure.endpoint, configure.params, configure.pagination_* The source is an API and the per-dataflow endpoint differs

Examples:

Database query source:

"source": {
  "connection_name": "warehouse",
  "query": "SELECT * FROM sales.orders WHERE status = 'OPEN'",
  "watermark_columns": ["updated_at"]
}

Python function source:

"source": {
  "connection_name": "python_src",
  "python_function": "mypkg.loaders.load_orders",
  "watermark_columns": ["updated_at"]
}
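API source (a sketch; the connection name, endpoint path, and params are illustrative, and the pagination keys from the table above are left out because their exact names depend on your connector):

"source": {
  "connection_name": "orders_api",
  "configure": {
    "endpoint": "/v1/orders",
    "params":   { "status": "OPEN" }
  }
}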

Destination block: extra fields appear as complexity grows

Destination field Use it when
merge_keys merge_upsert, merge_overwrite, or scd2 needs a business key
partition_columns You want partitioned writes
configure.write_options You need one dataflow to override connection-level write options
configure.scd2_effective_column Required for scd2

Example destination for a merge:

"destination": {
  "connection_name": "silver",
  "schema_name": "sales",
  "table": "orders",
  "load_type": "merge_upsert",
  "merge_keys": ["order_id"],
  "configure": {
    "write_options": { "mergeSchema": true }
  }
}
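And a sketch of an scd2 destination, assuming an updated_at column drives the effective date; the table, key, and partition column names are illustrative:

"destination": {
  "connection_name": "silver",
  "schema_name": "sales",
  "table": "customers",
  "load_type": "scd2",
  "merge_keys": ["customer_id"],
  "partition_columns": ["country"],
  "configure": {
    "scd2_effective_column": "updated_at"
  }
}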

How base_path + schema_name + table become a path

For file and lakehouse connections:

{base_path}/{schema_name}/{table}

Example:  data/output/bronze / sales / orders
Result:   data/output/bronze/sales/orders

If you omit schema_name, the path is {base_path}/{table}.

For catalog-registered tables, DataCoolie uses a qualified name instead of a path. Examples:

  • Databricks Unity Catalog: `catalog`.`database`.`table` when you leave schema_name empty
  • Fabric / Glue / other metastore-backed layouts: `catalog`.`database`.`schema_name`.`table` when all levels are used
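For example, a sketch of a Unity Catalog-style connection; the catalog and database names are placeholders. With schema_name left empty, a dataflow writing table orders through this connection would resolve to `main`.`sales`.`orders` rather than a path:

{
  "name":            "uc_bronze",
  "connection_type": "lakehouse",
  "format":          "delta",
  "catalog":         "main",
  "database":        "sales",
  "configure":       {}
}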

Step 5 — Complete file and run

Here is the complete minimal document:

{
  "connections": [
    {
      "name": "my_source",
      "connection_type": "file",
      "format": "csv",
      "configure": { "base_path": "data/input" }
    },
    {
      "name": "bronze",
      "connection_type": "lakehouse",
      "format": "delta",
      "configure": { "base_path": "data/output/bronze" }
    }
  ],
  "dataflows": [
    {
      "name":  "orders_to_bronze",
      "stage": "ingest",
      "source":      { "connection_name": "my_source", "schema_name": "sales", "table": "orders" },
      "destination": { "connection_name": "bronze",    "schema_name": "sales", "table": "orders", "load_type": "append" }
    }
  ]
}

Run it with:

from datacoolie.engines.polars_engine import PolarsEngine
from datacoolie.metadata.file_provider import FileProvider
from datacoolie.orchestration.driver import DataCoolieDriver
from datacoolie.platforms.local_platform import LocalPlatform

platform  = LocalPlatform()
engine    = PolarsEngine(platform=platform)
metadata  = FileProvider(config_path="metadata.json", platform=platform)

with DataCoolieDriver(engine=engine, metadata_provider=metadata) as driver:
    result = driver.run(stage="ingest")

print(result)
# ExecutionResult(total=1, succeeded=1, failed=0, skipped=0)
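If you want the script to fail loudly on errors, a minimal sketch, assuming ExecutionResult exposes the counters from its repr (total, succeeded, failed, skipped) as attributes; verify against your version:

# Assumed attribute access based on the repr above; adjust if your
# ExecutionResult exposes these counters differently.
if result.failed:
    raise SystemExit(f"{result.failed} of {result.total} dataflows failed")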

Adding more dataflows

You can have as many dataflows as you need. Add them to the dataflows array. They can share connections and run in the same stage or different stages:

"dataflows": [
  { "name": "orders_to_bronze",   "stage": "ingest", ... },
  { "name": "customers_to_bronze","stage": "ingest", ... },
  { "name": "orders_to_silver",   "stage": "transform", ... }
]

Run one stage at a time:

driver.run(stage="ingest")
driver.run(stage="transform")

Or run multiple at once:

driver.run(stage=["ingest", "transform"])

If stage order matters, use group_number and execution_order explicitly instead of relying on array position alone.

"dataflows": [
  {
    "name": "orders_to_bronze",
    "stage": "daily",
    "group_number": 1,
    "execution_order": 10,
    "source": { "connection_name": "raw", "table": "orders" },
    "destination": { "connection_name": "bronze", "table": "orders", "load_type": "append" }
  },
  {
    "name": "orders_to_silver",
    "stage": "daily",
    "group_number": 2,
    "execution_order": 10,
    "source": { "connection_name": "bronze", "table": "orders" },
    "destination": { "connection_name": "silver", "table": "orders", "load_type": "merge_upsert", "merge_keys": ["order_id"] }
  }
]

Same metadata model in other backends

Once the JSON version works, the same logical metadata can live in other providers:

Backend What changes What stays the same
YAML file Syntax only Same fields and nesting
Excel file Nested objects become JSON cells or flattened columns Same connection/dataflow semantics
Database metadata Rows are stored in SQL tables and filtered by workspace_id Same model fields
API metadata Objects are served over /workspaces/{workspace_id}/... endpoints Same model fields

The practical rule: learn the JSON structure first, then move it to the provider your team needs.
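For instance, the minimal document from Step 5 translates to YAML purely syntactically (a sketch; key order and quoting are flexible in YAML, and the fields and nesting are unchanged):

connections:
  - name: my_source
    connection_type: file
    format: csv
    configure:
      base_path: data/input
  - name: bronze
    connection_type: lakehouse
    format: delta
    configure:
      base_path: data/output/bronze
dataflows:
  - name: orders_to_bronze
    stage: ingest
    source:
      connection_name: my_source
      schema_name: sales
      table: orders
    destination:
      connection_name: bronze
      schema_name: sales
      table: orders
      load_type: append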

Goal Next page
Configure a database or API source Source patterns
Use merge or SCD2 instead of append Destination & load patterns
Cast types, deduplicate, or add columns Transform patterns
Check your file is correct before running Validation checklist