Logging layout

DataCoolie produces two independent log streams.

Directory layout

<base_log_path>/
├── etl_logs/
│   ├── debug_json/
│   │   └── job_run_log/
│   │       └── __run_date=yyyy-mm-dd/job_<stem>.jsonl          ← appended
│   └── analyst/
│       ├── job_run_log/
│       │   └── __run_date=yyyy-mm-dd/job_run_log.jsonl         ← shared daily file, appended
│       └── dataflow_run_log/
│           └── __run_date=yyyy-mm-dd/dataflow_<stem>.parquet   ← per-run file
└── system_logs/
    └── __run_date=yyyy-mm-dd/system_log_<job_id>.log          ← plain text, appended
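As a rough illustration of the tree above, the following sketch assembles the four file locations for one run date. The helper `log_paths` and its parameters are hypothetical and not part of DataCoolie; only the folder and file name patterns are taken from the layout, and the `stem` and `job_id` values are placeholders.

```python
from datetime import date


def log_paths(base_log_path: str, run_date: date, stem: str, job_id: str) -> dict:
    """Build the expected log locations for one run (hypothetical helper)."""
    # Plain string joins keep URI schemes such as "s3://" intact.
    base = base_log_path.rstrip("/")
    partition = f"__run_date={run_date.isoformat()}"
    etl = f"{base}/etl_logs"
    return {
        "debug_job_run_log": f"{etl}/debug_json/job_run_log/{partition}/job_{stem}.jsonl",
        "analyst_job_run_log": f"{etl}/analyst/job_run_log/{partition}/job_run_log.jsonl",
        "analyst_dataflow_run_log": f"{etl}/analyst/dataflow_run_log/{partition}/dataflow_{stem}.parquet",
        "system_log": f"{base}/system_logs/{partition}/system_log_{job_id}.log",
    }


# Placeholder values for illustration only.
paths = log_paths("s3://my-bucket/logs", date(2026, 1, 3), stem="example", job_id="job-1")
for name, path in paths.items():
    print(name, path)
```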

Two loggers, two purposes

|            | ETLLogger | SystemLogger |
|------------|-----------|--------------|
| Written by | Driver, Stage, DataFlow, Watermark manager | Everywhere: platform, engines, sources, destinations, transformers |
| Format     | Debug JSONL + analyst JSONL/Parquet | Plain text .log, one line per record |
| Purpose    | Execution analytics, dashboards, troubleshooting | Operational debugging |
| Retention  | Long-term (feeds dashboards) | Short-term (rotate aggressively) |
| Flush      | Periodic append_file + final on close | Periodic append_file (timer) + final on close |

SystemLogger levels

SystemLogger supports two independent log levels:

  • log_level (default INFO) — what is printed to the console. Set by the Driver configuration.
  • file_level (default DEBUG) — what is captured to the .log file. Captures all framework messages regardless of the console level, acting as a "black box recorder" for post-mortem diagnosis.
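The two-level idea can be sketched with Python's standard `logging` module: one logger, two handlers, each with its own threshold. This is only an analogy under the assumption that SystemLogger behaves similarly; it is not DataCoolie's implementation, and `build_logger` is a hypothetical name.

```python
import logging
import os
import tempfile


def build_logger(name: str, log_file: str,
                 log_level: int = logging.INFO,
                 file_level: int = logging.DEBUG) -> logging.Logger:
    """Console handler at log_level, file handler at file_level (illustrative only)."""
    logger = logging.getLogger(name)
    # The logger itself must pass everything the most verbose handler wants.
    logger.setLevel(min(log_level, file_level))
    logger.propagate = False

    console = logging.StreamHandler()
    console.setLevel(log_level)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(file_level)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log_file = os.path.join(tempfile.mkdtemp(), "system_log_example.log")
logger = build_logger("example", log_file)
logger.debug("only in the file")   # below INFO, so the console handler drops it
logger.info("console and file")
for handler in logger.handlers:
    handler.flush()
```

With the defaults, the DEBUG record reaches only the file, which is the "black box recorder" behaviour described above.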

Analyst outputs

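| Log type         | Format           | File per …             | Query |
|------------------|------------------|------------------------|-------|
| job_run_log      | JSONL            | Day (shared, appended) | Read one file per day for job history |
| dataflow_run_log | Parquet (Snappy) | Job run                | Scan with Spark / Polars / Athena |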

job_run_log.jsonl is a shared daily file: every job run on the same date appends its summary line to it. This makes it efficient to query recent job history without listing many small per-run files. The folder name is also Hive-partition compatible (__run_date=yyyy-mm-dd), so Spark / Polars can discover the partition column automatically.
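A minimal sketch of reading one day's job history with the standard library follows. The record fields used in the demo (`job_id`, `status`) are made up for illustration; the real job_run_log schema may differ.

```python
import json
import os
import tempfile


def read_job_history(path: str) -> list:
    """Parse a JSONL file into a list of dicts, one per job run."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Demo: simulate two runs appending to the same daily file.
day_dir = os.path.join(tempfile.mkdtemp(), "__run_date=2026-01-03")
os.makedirs(day_dir)
path = os.path.join(day_dir, "job_run_log.jsonl")
demo_records = (
    {"job_id": "job-1", "status": "succeeded"},  # hypothetical fields
    {"job_id": "job-2", "status": "failed"},
)
for record in demo_records:
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")

history = read_job_history(path)
print(len(history), "job runs on this date")
```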

Partitioning

ETL logs are partitioned by purpose and log type, then by run date:

etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/dataflow_<stem>.parquet
etl_logs/analyst/job_run_log/__run_date=2026-01-03/job_run_log.jsonl

Query them directly with Spark / Polars / Athena.
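Engines with Hive-partition support normally prune by run date on their own. To make the path convention concrete, here is a small standard-library sketch that selects files by the date embedded in the `__run_date=` folder. The function names are hypothetical, and the file stems in the demo are placeholders.

```python
import re
from datetime import date
from typing import List, Optional

RUN_DATE = re.compile(r"__run_date=(\d{4}-\d{2}-\d{2})")


def run_date_of(path: str) -> Optional[date]:
    """Extract the run date from a partitioned path, or None if absent."""
    match = RUN_DATE.search(path)
    return date.fromisoformat(match.group(1)) if match else None


def select_partitions(paths: List[str], start: date, end: date) -> List[str]:
    """Keep only paths whose run date falls within [start, end]."""
    selected = []
    for path in paths:
        run_date = run_date_of(path)
        if run_date is not None and start <= run_date <= end:
            selected.append(path)
    return selected


candidate_paths = [
    "etl_logs/analyst/dataflow_run_log/__run_date=2026-01-02/dataflow_a.parquet",
    "etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/dataflow_b.parquet",
    "etl_logs/analyst/job_run_log/__run_date=2026-01-04/job_run_log.jsonl",
]
selected = select_partitions(candidate_paths, date(2026, 1, 3), date(2026, 1, 4))
```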

Configuring

driver = DataCoolieDriver(
    engine=engine,
    metadata_provider=metadata,
    base_log_path="s3://my-bucket/logs",  # or local path
)

Pass log_config=LogConfig(...) when you need to override the partition pattern, the flush interval, the temporary storage mode, or the SystemLogger file_level.

Debug mode

When ETL logging is enabled, debug JSONL is written under the debug_json purpose folder. The folder name comes from the purpose enum: LogPurpose.DEBUG.value == "debug_json".

Downstream use

  • Build a dashboard from etl_logs/analyst/dataflow_run_log/ and etl_logs/analyst/job_run_log/.
  • Alert on dataflow_run_log.status = "failed".
  • If you run negative tests on purpose, suppress them with your own scenario or job naming convention; the current runner does not automatically mark a failure as expected.
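The alerting and suppression points above could be combined along these lines. This is a sketch under assumptions: only the `status` field and the value "failed" come from this page, while the `job_name` field and the `negative_test_` prefix are a hypothetical naming convention you would replace with your own.

```python
def failures_to_alert(records, expected_failure_prefixes=("negative_test_",)):
    """Return failed runs, skipping jobs named as intentional negative tests."""
    prefixes = tuple(expected_failure_prefixes)
    alerts = []
    for record in records:
        if record.get("status") != "failed":
            continue
        # "job_name" is an assumed field; adapt to your actual log schema.
        if record.get("job_name", "").startswith(prefixes):
            continue
        alerts.append(record)
    return alerts


runs = [
    {"job_name": "daily_orders", "status": "failed"},
    {"job_name": "negative_test_bad_schema", "status": "failed"},
    {"job_name": "daily_customers", "status": "succeeded"},
]
alerts = failures_to_alert(runs)
```

Keeping the suppression rule in your own alerting code matches the note above that the runner does not mark failures as expected.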