Logging layout

DataCoolie produces two independent log streams.

Directory layout

<base_log_path>/
├── etl_logs/
│   ├── debug_json/
│   │   └── job_run_log/
│   │       └── __run_date=yyyy-mm-dd/job_<stem>.jsonl
│   └── analyst/
│       ├── job_run_log/
│       │   └── __run_date=yyyy-mm-dd/job_<stem>.parquet
│       └── dataflow_run_log/
│           └── __run_date=yyyy-mm-dd/dataflow_<stem>.parquet
└── system_logs/
    └── __run_date=yyyy-mm-dd/system_log_<ts>_<job_num>_<job_index>_<job_id>.jsonl
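
System logs are JSONL with one event per line (see the table below), so a run date's worth can be inspected with nothing beyond the standard library. A minimal sketch, assuming a local base path (substitute your own <base_log_path>):

from pathlib import Path
import json

# Hypothetical base path; the __run_date partition matches the layout above.
run_dir = Path("/var/datacoolie/logs/system_logs/__run_date=2026-01-03")

for log_file in sorted(run_dir.glob("system_log_*.jsonl")):
    with log_file.open() as fh:
        for line in fh:
            event = json.loads(line)  # one JSON event per line
            print(event)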

Two loggers, two purposes

ETLLogger
  Written by: Driver, Stage, DataFlow, Watermark manager
  Format:     Debug JSONL plus analyst Parquet
  Purpose:    Execution analytics, dashboards, troubleshooting
  Retention:  Long-term (feeds dashboards)

SystemLogger
  Written by: Everywhere (platform, engines, sources, destinations, transformers)
  Format:     JSONL, one event per line
  Purpose:    Operational debugging
  Retention:  Short-term (rotate aggressively)

Partitioning

ETL logs are partitioned by purpose and log type, then by run date:

etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/dataflow_<stem>.parquet

Query them directly with Spark / Polars / Athena.
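
For example, a minimal Polars sketch that loads one run date of dataflow run logs (the bucket comes from the configuration example below; reading from S3 requires Polars' cloud dependencies):

import polars as pl

# One run date of analyst dataflow run logs, read straight off the partitioned layout.
df = pl.read_parquet(
    "s3://my-bucket/logs/etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/*.parquet"
)
print(df.schema)  # inspect the available columns
print(df.head())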

Configuring

driver = DataCoolieDriver(
    engine=engine,
    metadata_provider=metadata,
    base_log_path="s3://my-bucket/logs",  # or local path
)

Use log_config=LogConfig(...) when you need to override the partition pattern, flush interval, or temporary storage mode.
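
The exact LogConfig signature depends on your DataCoolie version; the sketch below wires up the three overrides this section names, but the keyword spellings are assumptions, not the authoritative API:

from datacoolie import DataCoolieDriver, LogConfig  # import path assumed

driver = DataCoolieDriver(
    engine=engine,
    metadata_provider=metadata,
    base_log_path="s3://my-bucket/logs",
    log_config=LogConfig(
        # Field names are illustrative; check LogConfig itself before relying on them.
        partition_pattern="__run_date=%Y-%m-%d",
        flush_interval_seconds=30,
        temp_storage_mode="local",
    ),
)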

Debug mode

When ETL logging is enabled, debug JSONL is written under the debug_json purpose folder (LogPurpose.DEBUG.value == "debug_json").
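
Only DEBUG's value is spelled out here, but the purpose folders in the directory layout suggest the enum's overall shape. A reconstruction for illustration (the ANALYST member is inferred from the analyst/ folder, not quoted from the source):

from enum import Enum

class LogPurpose(Enum):
    DEBUG = "debug_json"   # documented above
    ANALYST = "analyst"    # inferred from the directory layout

# Debug JSONL therefore lands under <base_log_path>/etl_logs/debug_json/...
assert LogPurpose.DEBUG.value == "debug_json"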

Downstream use

  • Build a dashboard from etl_logs/analyst/dataflow_run_log/ and etl_logs/analyst/job_run_log/.
  • Alert on dataflow_run_log.status = "failed" (a query sketch follows this list).
  • If you deliberately run negative tests, exclude them via your own scenario or job naming convention; the current runner does not automatically mark a failure as expected.
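
A minimal alerting sketch for the second bullet, again in Polars (the status column is named above; the notification hook is a placeholder):

import polars as pl

RUN_LOGS = "s3://my-bucket/logs/etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/*.parquet"

failed = pl.read_parquet(RUN_LOGS).filter(pl.col("status") == "failed")

if failed.height > 0:
    # Replace the print with your pager or chat webhook of choice.
    print(f"{failed.height} failed dataflow run(s) on 2026-01-03")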