Metadata providers

TL;DR: Pick FileProvider for small, fixed projects; DatabaseProvider for shared team metadata; APIClient when you already run a metadata service. All three implement the same BaseMetadataProvider contract, so you can swap them without changing pipeline code.

The contract

BaseMetadataProvider exposes:

get_connections() / get_connection_by_name(name)
get_dataflows(stage=..., active_only=True, attach_schema_hints=True)
get_watermark(dataflow_id: str) -> Optional[str] — raw JSON, not parsed
update_watermark(dataflow_id, watermark_value, *, job_id, dataflow_run_id)

get_watermark returns raw JSON on purpose: WatermarkManager does the deserialisation, so providers need no datetime handling of their own. See Watermarks and ADR-0004.
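As a rough illustration of the contract, here is a minimal sketch. `MetadataProviderSketch` is a stand-in for BaseMetadataProvider, not its real definition. The return types, the dict-based records, the hypothetical `InMemoryProvider`, and the `{"value": ...}` watermark payload are all assumptions made for the example. `parse_watermark` only mimics the kind of caller-side deserialisation that WatermarkManager is described as doing.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any, Dict, List, Optional


class MetadataProviderSketch(ABC):
    """Stand-in for BaseMetadataProvider; return types are assumptions."""

    @abstractmethod
    def get_connections(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def get_connection_by_name(self, name: str) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def get_dataflows(self, stage: Optional[str] = None, active_only: bool = True,
                      attach_schema_hints: bool = True) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def get_watermark(self, dataflow_id: str) -> Optional[str]: ...

    @abstractmethod
    def update_watermark(self, dataflow_id: str, watermark_value: str, *,
                         job_id: str, dataflow_run_id: str) -> None: ...


class InMemoryProvider(MetadataProviderSketch):
    """Hypothetical provider that keeps everything in plain dicts."""

    def __init__(self, connections, dataflows):
        self._connections = connections
        self._dataflows = dataflows
        self._watermarks: Dict[str, str] = {}

    def get_connections(self):
        return list(self._connections)

    def get_connection_by_name(self, name):
        return next((c for c in self._connections if c["name"] == name), None)

    def get_dataflows(self, stage=None, active_only=True, attach_schema_hints=True):
        flows = self._dataflows
        if stage is not None:
            flows = [f for f in flows if f.get("stage") == stage]
        if active_only:
            # Assumption: an unset is_active counts as inactive in this sketch.
            flows = [f for f in flows if f.get("is_active")]
        return list(flows)  # schema hints are ignored here

    def get_watermark(self, dataflow_id):
        return self._watermarks.get(dataflow_id)  # raw JSON string, or None

    def update_watermark(self, dataflow_id, watermark_value, *, job_id, dataflow_run_id):
        self._watermarks[dataflow_id] = watermark_value


def parse_watermark(raw: Optional[str]) -> Optional[datetime]:
    """Caller-side parsing; the {"value": ISO-8601} payload shape is assumed."""
    if raw is None:
        return None
    return datetime.fromisoformat(json.loads(raw)["value"])
```

Because the provider only stores and returns strings, a different backend can replace `InMemoryProvider` without touching the parsing step.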

Built-ins

| Provider | Backend | Install | Good for |
| --- | --- | --- | --- |
| `FileProvider` | JSON · YAML · Excel | core + `[excel]` for .xlsx | Small projects, SCM-versioned metadata, demos |
| `DatabaseProvider` | Any SQLAlchemy dialect | `[db]` | Multi-team, mutable metadata, centralised governance |
| `APIClient` | REST | `[api]` | Existing metadata service, RBAC on metadata |

File provider

Canonical source is JSON; YAML and Excel are generated equivalents.
One file per use case is the convention (orders_csv_to_parquet.{json,yaml,xlsx}).
Blank is_active in Excel means unset (not False). Generators preserve this nuance.
Watermarks default to {config_dir}/watermarks/{stage}_{name}_{dataflow_id}/watermark.json. If stage or name is missing, the folder falls back to {dataflow_id}.
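The watermark path rule and the blank-cell rule above can be sketched as two small helpers. Both function names are hypothetical, and the accepted spellings of a true value in `parse_is_active` are an assumption. FileProvider's actual parsing may differ.

```python
from pathlib import Path
from typing import Optional


def default_watermark_path(config_dir: str, dataflow_id: str,
                           stage: Optional[str] = None,
                           name: Optional[str] = None) -> Path:
    """Build {config_dir}/watermarks/{stage}_{name}_{dataflow_id}/watermark.json."""
    if stage and name:
        folder = f"{stage}_{name}_{dataflow_id}"
    else:
        # Fall back to the bare dataflow id when stage or name is missing.
        folder = dataflow_id
    return Path(config_dir) / "watermarks" / folder / "watermark.json"


def parse_is_active(cell: object) -> Optional[bool]:
    """Map a spreadsheet cell to True, False, or None (unset)."""
    if cell is None or (isinstance(cell, str) and not cell.strip()):
        return None  # blank means unset, not False
    if isinstance(cell, str):
        # Assumed spellings of "true"; anything else is treated as False.
        return cell.strip().lower() in {"true", "1", "yes"}
    return bool(cell)
```

Returning `None` for a blank cell keeps "unset" distinguishable from an explicit `False` when the metadata is converted between formats.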

Database provider

SQLAlchemy tables: dc_framework_connections, dc_framework_dataflows, dc_framework_watermarks, dc_framework_schema_hints.
All tables are workspace-scoped (workspace_id column) and honour soft-delete (deleted_at IS NULL).
Concurrency-safe writes: DatabaseProvider opens one short-lived connection per operation and does not hold a session across run() boundaries.
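The sketch below illustrates the three properties listed above: workspace scoping, the soft-delete filter, and one short-lived connection per operation. To stay dependency-free it uses the standard-library `sqlite3` module rather than SQLAlchemy, so it is not how DatabaseProvider is written. The `dataflow_id` and `watermark_value` columns and all function names are assumptions. Only the table name, `workspace_id`, and `deleted_at` come from the text.

```python
import sqlite3
from contextlib import closing
from typing import Optional

TABLE = "dc_framework_watermarks"


def init_db(db_path: str) -> None:
    # closing() closes the connection; the inner `conn` context commits.
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {TABLE} ("
            "workspace_id TEXT NOT NULL, dataflow_id TEXT NOT NULL, "
            "watermark_value TEXT, deleted_at TEXT, "
            "PRIMARY KEY (workspace_id, dataflow_id))"
        )


def write_watermark(db_path: str, workspace_id: str, dataflow_id: str,
                    watermark_value: str) -> None:
    # One connection per operation: open, write, commit, close.
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            f"INSERT OR REPLACE INTO {TABLE} "
            "(workspace_id, dataflow_id, watermark_value, deleted_at) "
            "VALUES (?, ?, ?, NULL)",
            (workspace_id, dataflow_id, watermark_value),
        )


def read_watermark(db_path: str, workspace_id: str,
                   dataflow_id: str) -> Optional[str]:
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            f"SELECT watermark_value FROM {TABLE} "
            "WHERE workspace_id = ? AND dataflow_id = ? "
            "AND deleted_at IS NULL",  # soft-deleted rows stay invisible
            (workspace_id, dataflow_id),
        ).fetchone()
    return row[0] if row else None


def soft_delete(db_path: str, workspace_id: str, dataflow_id: str) -> None:
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            f"UPDATE {TABLE} SET deleted_at = datetime('now') "
            "WHERE workspace_id = ? AND dataflow_id = ?",
            (workspace_id, dataflow_id),
        )
```

No connection outlives a single function call, so nothing is held open between operations, which is the behaviour the text describes for run() boundaries.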

API provider

APIClient calls a REST service. usecase-sim/docker/pg_api_metadata_server.py is a reference implementation of that service and publishes its OpenAPI contract.
All endpoints are scoped under /workspaces/{workspace_id}/.
A read-through cache is enabled by default (enable_cache=True) so that parallel execution does not hammer the service.
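To make the read-through idea concrete, here is a minimal sketch. `ReadThroughCache`, `workspace_url`, and the injected `fetch` callable are hypothetical. They show the pattern only and are not APIClient's internals. Only the `enable_cache` flag and the /workspaces/{workspace_id}/ prefix come from the text.

```python
import threading
from typing import Any, Callable, Dict


def workspace_url(base_url: str, workspace_id: str, resource: str) -> str:
    """Scope a resource under /workspaces/{workspace_id}/."""
    return f"{base_url.rstrip('/')}/workspaces/{workspace_id}/{resource.lstrip('/')}"


class ReadThroughCache:
    """Serve repeated reads from memory; call the fetcher only on a miss."""

    def __init__(self, fetch: Callable[[str], Any], enable_cache: bool = True):
        self._fetch = fetch
        self._enabled = enable_cache
        self._store: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, url: str) -> Any:
        if not self._enabled:
            return self._fetch(url)
        # Holding the lock during the fetch serialises misses, which keeps
        # the sketch simple and guarantees one fetch per URL across threads.
        with self._lock:
            if url not in self._store:
                self._store[url] = self._fetch(url)
            return self._store[url]
```

In this sketch the cache never expires entries. A real client would also need an invalidation rule for writes such as watermark updates.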

Picking a provider

flowchart TD
    A[Team size?] -->|solo / small| B[FileProvider]
    A -->|multi-team| C[Need RBAC on metadata?]
    C -->|yes| D[APIClient]
    C -->|no| E[Metadata mutability?]
    E -->|mostly read-only| B
    E -->|frequent updates| F[DatabaseProvider]
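The same decision tree can be written as a small function. The function name and its boolean parameters are hypothetical. The branches mirror the flowchart one to one.

```python
def pick_provider(multi_team: bool, needs_rbac: bool = False,
                  frequent_updates: bool = False) -> str:
    """Return the provider name the flowchart would lead to."""
    if not multi_team:
        return "FileProvider"      # solo / small team
    if needs_rbac:
        return "APIClient"         # RBAC on metadata
    if frequent_updates:
        return "DatabaseProvider"  # mutable, shared metadata
    return "FileProvider"          # multi-team but mostly read-only
```

For example, `pick_provider(multi_team=True, frequent_updates=True)` follows the multi-team, no-RBAC, frequent-updates path.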