
Operations

Practical guidance for running DataCoolie pipelines reliably in production environments, from understanding log output to diagnosing failures.

Use this section after you already have a pipeline running. If you still need a first successful local run, go back to Getting started.

Start with the question you have

  • I need to understand what DataCoolie wrote to disk: Logging layout
  • I need help choosing Polars vs Spark for production workloads: Benchmarks
  • A run is failing and I need likely causes: Troubleshooting
  • I am changing or adding framework behavior and need test guidance: Testing strategy

What's in this section

  • Logging layout — How the ETL logger writes debug JSONL and analyst Parquet files, what the LogPurpose values mean, and how output is partitioned under <output_path>/<purpose>/<log_type>/ (a small reading sketch follows this list).
  • Benchmarks — Polars vs Spark throughput and latency numbers from the reference usecase-sim testbed. Helps you choose the right engine for your row-count and latency targets.
  • Troubleshooting — Common failure patterns and how to diagnose them: watermark staleness, metadata provider errors, merge key mismatches, platform credential issues, and partition path conflicts.
  • Testing strategy — How the DataCoolie test suite is structured, coverage gates, mock engine patterns, and how to add tests for custom plugins.
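
For orientation, here is a minimal sketch of reading the analyst Parquet logs back from the partition layout described above. The output path and the "analyst"/"row_counts" directory names are assumptions chosen for illustration; substitute the purpose and log type your pipeline actually writes.

```python
# Hypothetical example: inspect analyst logs written under
# <output_path>/<purpose>/<log_type>/. Paths and names here are illustrative.
from pathlib import Path
import polars as pl

output_path = Path("/data/datacoolie/logs")           # assumed logging output path
log_dir = output_path / "analyst" / "row_counts"      # assumed <purpose>/<log_type>

# Collect every Parquet file under this log type, including nested partitions.
frames = [pl.read_parquet(p) for p in sorted(log_dir.rglob("*.parquet"))]
if frames:
    logs = pl.concat(frames)
    print(logs.head())
else:
    print(f"no Parquet logs found under {log_dir}")
```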

Quick checklist

Before running a pipeline in a new environment, work through the checks below (a scripted sketch follows the list):

  1. Verify the platform (LocalPlatform, AWSPlatform, etc.) can reach its file paths and resolve secrets.
  2. Check that the metadata provider is reachable and returns at least one active dataflow.
  3. Confirm the engine has the required extras installed (polars, spark, etc.).
  4. Review the logging output path and ensure the destination directory is writable and can accept the expected partition structure.
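
The checklist can be turned into a rough preflight probe. The sketch below is an assumption-laden illustration, not part of DataCoolie's API: every path, URL, and module name is a placeholder, and it only uses the standard library to approximate steps 2 through 4.

```python
# Minimal preflight sketch for the checklist above. All values are assumptions;
# adapt them to your deployment. This does not call DataCoolie itself.
import importlib.util
import os
import urllib.request

OUTPUT_PATH = "/data/datacoolie/logs"                # assumed logging output path
METADATA_URL = "http://metadata.internal/dataflows"  # assumed metadata provider endpoint
REQUIRED_EXTRAS = ["polars"]                         # or ["pyspark"] for the Spark engine


def check_writable(path: str) -> bool:
    """Step 4: the logging destination must exist and be writable."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def check_extras(modules: list[str]) -> list[str]:
    """Step 3: return any engine extras that are not importable."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def check_metadata(url: str) -> bool:
    """Step 2: the metadata provider should answer a simple HTTP probe."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    print("output path writable:", check_writable(OUTPUT_PATH))
    print("missing extras:", check_extras(REQUIRED_EXTRAS) or "none")
    print("metadata reachable:", check_metadata(METADATA_URL))
```

Step 1 (platform reachability and secret resolution) is deliberately left out of the sketch, since it depends on which platform class and secret backend you configure.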