Skip to content

Configure file metadata

Prerequisites · datacoolie[excel] or openpyxl if you want .xlsx generation · pyyaml if you want .yaml generation · a directory you control for metadata files. End state · Working FileProvider reading JSON (canonical) with optional generated YAML/Excel siblings.

The canonical format is JSON

Keep your source of truth as JSON. One file per use case:

metadata/
└── file/
    ├── orders_csv_to_parquet_full_load.json    ← canonical
    ├── orders_csv_to_parquet_full_load.yaml    ← generated
    └── orders_csv_to_parquet_full_load.xlsx    ← generated

See the usecase-sim file metadata folder for production-shape examples.

Minimal JSON

{
  "connections": [
    {
      "name": "src",
      "connection_type": "file",
      "format": "csv",
      "configure": {"base_path": "data/input"}
    },
    {
      "name": "bronze",
      "connection_type": "lakehouse",
      "format": "delta",
      "configure": {"base_path": "data/output/bronze"}
    }
  ],
  "dataflows": [
    {
      "name": "orders_to_bronze",
      "stage": "ingest2bronze",
      "source":      { "connection_name": "src", "schema_name": "sales", "table": "orders" },
      "destination": { "connection_name": "bronze", "schema_name": "sales", "table": "orders", "load_type": "append" }
    }
  ]
}

Minimal YAML

connections:
  - name: src
    connection_type: file
    format: csv
    configure:
      base_path: data/input

  - name: bronze
    connection_type: lakehouse
    format: delta
    configure:
      base_path: data/output/bronze

dataflows:
  - name: orders_to_bronze
    stage: ingest2bronze
    source:
      connection_name: src
      schema_name: sales
      table: orders
    destination:
      connection_name: bronze
      schema_name: sales
      table: orders
      load_type: append

Minimal Excel

Use a workbook with connections and dataflows sheets. schema_hints is optional for the minimal case.

connections sheet:

name connection_type format configure
src file csv { "base_path": "data/input" }
bronze lakehouse delta { "base_path": "data/output/bronze" }

dataflows sheet:

name stage source_connection_name source_schema_name source_table destination_connection_name destination_schema_name destination_table destination_load_type
orders_to_bronze ingest2bronze src sales orders bronze sales orders append

For a short workbook, keep nested JSON in the configure cell. The parser also accepts configure_* and transform_* columns when you need flatter editing.

Loading

from datacoolie.metadata.file_provider import FileProvider
from datacoolie.platforms.local_platform import LocalPlatform

platform = LocalPlatform()
provider = FileProvider(config_path="metadata/file/orders_csv_to_parquet_full_load.json", platform=platform)

FileProvider detects format by extension (.json, .yaml, .xlsx). Pass a single metadata file — one config_path per FileProvider instance.

Generating YAML + Excel from JSON

Use the usecase-sim setup script and target a single canonical JSON file:

python usecase-sim/scripts/setup_metadata.py --json usecase-sim/metadata/file/local_use_cases.json --targets file

It reads the JSON file passed via --json and emits sibling .yaml and .xlsx files with the same stem next to it. Regenerate after each JSON edit.

If pyyaml is not installed, YAML output is skipped with a warning. If openpyxl is not installed, XLSX output is skipped with a warning.

Gotchas

Symptom Cause Fix
All rows load as inactive Excel is_active treated as False when blank Leave is_active blank = unset (falls back to True); generator preserves this.
.yaml or .xlsx sibling was not created Required emitter dependency is missing Install pyyaml for YAML and openpyxl or datacoolie[excel] for XLSX, then rerun setup_metadata.py --targets file.
YAML or XLSX no longer matches JSON Generated siblings are not auto-synced after JSON edits Treat JSON as canonical and rerun setup_metadata.py --targets file after each JSON change.
Excel parse error in nested fields A JSON cell such as configure, secrets_ref, source_configure, destination_configure, or transform contains invalid JSON Fix the cell to valid JSON. configure_* and transform_* columns are supported, but any JSON cell must still be valid JSON.