Logging layout
DataCoolie produces two independent log streams.
Directory layout
<base_log_path>/
├── etl_logs/
│   ├── debug_json/
│   │   └── job_run_log/
│   │       └── __run_date=yyyy-mm-dd/job_<stem>.jsonl         ← appended
│   └── analyst/
│       ├── job_run_log/
│       │   └── __run_date=yyyy-mm-dd/job_run_log.jsonl        ← shared daily file, appended
│       └── dataflow_run_log/
│           └── __run_date=yyyy-mm-dd/dataflow_<stem>.parquet  ← per-run file
└── system_logs/
    └── __run_date=yyyy-mm-dd/system_log_<job_id>.log          ← plain text, appended
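As a rough illustration of the tree above, the sketch below assembles the four file locations for one run date. The helper log_paths and its parameters are hypothetical and not part of DataCoolie; only the folder and file name patterns are taken from the layout, and the stem and job id values are made up.

```python
import posixpath
from datetime import date


def log_paths(base_log_path: str, run_date: date, stem: str, job_id: str) -> dict:
    """Build the file locations from the tree for one run date.

    Hypothetical helper for illustration only; DataCoolie derives
    these paths itself.
    """
    partition = f"__run_date={run_date.isoformat()}"
    etl = posixpath.join(base_log_path, "etl_logs")
    return {
        "debug_job": posixpath.join(
            etl, "debug_json", "job_run_log", partition, f"job_{stem}.jsonl"
        ),
        "analyst_job": posixpath.join(
            etl, "analyst", "job_run_log", partition, "job_run_log.jsonl"
        ),
        "analyst_dataflow": posixpath.join(
            etl, "analyst", "dataflow_run_log", partition, f"dataflow_{stem}.parquet"
        ),
        "system": posixpath.join(
            base_log_path, "system_logs", partition, f"system_log_{job_id}.log"
        ),
    }


# Example values ("abc", "job42") are placeholders, not real identifiers.
paths = log_paths("s3://my-bucket/logs", date(2026, 1, 3), "abc", "job42")
```

Note that only the analyst job_run_log file name is independent of the stem, which is what makes it a shared daily file.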
Two loggers, two purposes
| | ETLLogger | SystemLogger |
|---|---|---|
| Written by | Driver, Stage, DataFlow, Watermark manager | Everywhere: platform, engines, sources, destinations, transformers |
| Format | Debug JSONL + analyst JSONL/Parquet | Plain text .log, one line per record |
| Purpose | Execution analytics, dashboards, troubleshooting | Operational debugging |
| Retention | Long-term (feeds dashboards) | Short-term (rotate aggressively) |
| Flush | Periodic append_file + final on close | Periodic append_file (timer) + final on close |
SystemLogger levels
SystemLogger supports two independent log levels:
- log_level (default INFO): what is printed to the console. Set by the Driver configuration.
- file_level (default DEBUG): what is captured to the .log file. Captures all framework messages regardless of the console level, acting as a "black box recorder" for post-mortem diagnosis.
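The two-level idea can be sketched with Python's standard logging module: one logger, two handlers, each with its own threshold. This is an analogy under stated assumptions, not SystemLogger's actual implementation; in-memory streams stand in for the console and the .log file so the effect is easy to inspect.

```python
import io
import logging

console_stream = io.StringIO()  # stands in for the console
file_stream = io.StringIO()     # stands in for the .log file

logger = logging.getLogger("system_logger_sketch")
logger.setLevel(logging.DEBUG)  # let every record reach the handlers
logger.propagate = False        # keep the sketch isolated from the root logger

console_handler = logging.StreamHandler(console_stream)
console_handler.setLevel(logging.INFO)   # plays the role of log_level (default INFO)

file_handler = logging.StreamHandler(file_stream)
file_handler.setLevel(logging.DEBUG)     # plays the role of file_level (default DEBUG)

logger.addHandler(console_handler)
logger.addHandler(file_handler)

logger.debug("reading watermark")  # reaches the file handler only
logger.info("job started")         # reaches both handlers

console_lines = console_stream.getvalue().splitlines()
file_lines = file_stream.getvalue().splitlines()
```

Here console_lines holds only the INFO message while file_lines holds both, which is the "black box recorder" behaviour described above.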
Analyst outputs
| Log type | Format | File per … | Query |
|---|---|---|---|
| job_run_log | JSONL | Day (shared, appended) | Read one file per day for job history |
| dataflow_run_log | Parquet (Snappy) | Job run | Scan with Spark / Polars / Athena |
The job_run_log.jsonl is a shared daily file: every job run on the same
date appends its summary line to the same file. This makes it efficient to
query recent job history without listing many small per-run files. The path is also
Hive-partition compatible (__run_date=yyyy-mm-dd), so Spark / Polars can
discover the __run_date partition column automatically.
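A minimal sketch of the shared daily file pattern, using only the standard library: each run appends one JSON line, and a reader loads the whole day from a single file. The helper names and the summary fields (job_id, status) are assumptions for illustration; the real job_run_log schema is not shown in this page.

```python
import json
import tempfile
from pathlib import Path


def append_summary(path: Path, summary: dict) -> None:
    """Append one job summary as a single JSON line (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(summary) + "\n")


def read_summaries(path: Path) -> list:
    """Read every summary line from one daily file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# A temporary directory stands in for <base_log_path>.
root = Path(tempfile.mkdtemp())
daily = (
    root / "etl_logs" / "analyst" / "job_run_log"
    / "__run_date=2026-01-03" / "job_run_log.jsonl"
)

# Two runs on the same date append to the same file.
append_summary(daily, {"job_id": "a", "status": "succeeded"})
append_summary(daily, {"job_id": "b", "status": "failed"})

history = read_summaries(daily)
```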
Partitioning
ETL logs are partitioned by purpose and log type, then by run date:
etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/dataflow_<stem>.parquet
etl_logs/analyst/job_run_log/__run_date=2026-01-03/job_run_log.jsonl
Query them directly with Spark / Polars / Athena.
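When a query engine is not at hand, the partition value can also be recovered from the path itself. The sketch below parses the __run_date=yyyy-mm-dd segment and keeps only files inside a date range; the function names and the example stems are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

# Matches a whole path segment of the form __run_date=yyyy-mm-dd.
_RUN_DATE = re.compile(r"(?:^|/)__run_date=(\d{4}-\d{2}-\d{2})(?:/|$)")


def run_date_of(path: str) -> Optional[date]:
    """Return the run date encoded in the path, or None if there is none."""
    match = _RUN_DATE.search(path)
    return date.fromisoformat(match.group(1)) if match else None


def select_partitions(paths: List[str], start: date, end: date) -> List[str]:
    """Keep paths whose run date falls within [start, end], inclusive."""
    selected = []
    for path in paths:
        run_date = run_date_of(path)
        if run_date is not None and start <= run_date <= end:
            selected.append(path)
    return selected


# Stems "a" and "b" are placeholders.
log_files = [
    "etl_logs/analyst/dataflow_run_log/__run_date=2026-01-02/dataflow_a.parquet",
    "etl_logs/analyst/dataflow_run_log/__run_date=2026-01-03/dataflow_b.parquet",
    "etl_logs/analyst/job_run_log/__run_date=2026-01-03/job_run_log.jsonl",
]
recent = select_partitions(log_files, date(2026, 1, 3), date(2026, 1, 3))
```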
Configuring
driver = DataCoolieDriver(
    engine=engine,
    metadata_provider=metadata,
    base_log_path="s3://my-bucket/logs",  # or local path
)
Use log_config=LogConfig(...) when you need to override partition pattern,
flush interval, temporary storage mode, or the file_level for SystemLogger.
Debug mode
When ETL logging is enabled, debug JSONL is written under the debug_json
purpose folder. The folder name is the value of the LogPurpose.DEBUG enum
member, that is, LogPurpose.DEBUG.value == "debug_json".
Downstream use
- Build a dashboard from etl_logs/analyst/dataflow_run_log/ and etl_logs/analyst/job_run_log/.
- Alert on dataflow_run_log.status = "failed".
- If you run negative tests on purpose, suppress them with your own scenario or job naming convention; the current runner does not automatically mark a failure as expected.
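One possible shape for the alert rule in the list above, as a plain Python filter over already-loaded dataflow_run_log records. Only the status field and the "failed" value come from this page; the job_name field and the negtest_ prefix are assumed naming conventions that you would replace with your own.

```python
from typing import Dict, List


def failures_to_alert(
    records: List[Dict], expected_failure_prefix: str = "negtest_"
) -> List[Dict]:
    """Return failed records, skipping intentional negative tests.

    The job_name field and the prefix convention are assumptions;
    the runner itself does not mark failures as expected.
    """
    alerts = []
    for record in records:
        if record.get("status") != "failed":
            continue  # only failures are alert candidates
        if str(record.get("job_name", "")).startswith(expected_failure_prefix):
            continue  # suppressed by the naming convention
        alerts.append(record)
    return alerts


sample = [
    {"job_name": "daily_orders", "status": "failed"},
    {"job_name": "daily_customers", "status": "succeeded"},
    {"job_name": "negtest_bad_schema", "status": "failed"},
]
alerts = failures_to_alert(sample)
```

Keeping the suppression rule in the alerting layer leaves the logs themselves unchanged, so intentional failures remain visible in dashboards.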