Skip to content

ADR-0004 — Metadata provider returns raw JSON watermark text

Status · Accepted

Context

Watermark payloads use engine-specific sentinels (__datetime__, __date__, __time__) that must round-trip through a deserialiser to become Python objects. Early API of BaseMetadataProvider.get_watermark returned a dict — the provider did the parsing.

Problems:

  • Three providers duplicated the same sentinel decoder.
  • Tests for watermark decoding proliferated across provider test files.
  • Custom providers had to know the sentinel format.

Decision

BaseMetadataProvider.get_watermark(dataflow_id: str) -> Optional[str] returns the raw JSON string (or None if none is persisted). WatermarkManager owns the decoding — it's the single place that knows the sentinel format.

Writing is symmetric: update_watermark(dataflow_id, watermark_value, *, job_id, dataflow_run_id) accepts a serialised string.

Consequences

  • One decoder, one encoder, one place to add new sentinels.
  • Custom metadata providers are simpler: just store/retrieve a string.
  • Existing databases store watermarks as TEXT / JSON-typed columns either way — no migration needed.