Operations¶
Practical guidance for running DataCoolie pipelines reliably in production environments, from understanding log output to diagnosing failures.
Use this section after you already have a pipeline running. If you still need a first successful local run, go back to Getting started.
Start with the question you have¶
- I need to understand what DataCoolie wrote to disk: Logging layout
- I need help choosing Polars vs Spark for production workloads: Benchmarks
- A run is failing and I need likely causes: Troubleshooting
- I am changing or adding framework behavior and need test guidance: Testing strategy
What's in this section¶
- Logging layout — How the ETL logger writes debug JSONL and analyst Parquet files, what the `LogPurpose` values mean, and how output is partitioned under `<output_path>/<purpose>/<log_type>/`.
- Benchmarks — Polars vs Spark throughput and latency numbers from the reference `usecase-simtestbed`. Helps you choose the right engine for your row-count and latency targets.
- Troubleshooting — Common failure patterns and how to diagnose them: watermark staleness, metadata provider errors, merge key mismatches, platform credential issues, and partition path conflicts.
- Testing strategy — How the DataCoolie test suite is structured, coverage gates, mock engine patterns, and how to add tests for custom plugins.
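The partitioning scheme mentioned under Logging layout can be sketched as a small path helper. Only the `<output_path>/<purpose>/<log_type>/` layout comes from this page; the specific `purpose` and `log_type` values below are illustrative assumptions, so consult the Logging layout page for the real `LogPurpose` enumeration.

```python
from pathlib import Path


def log_partition_path(output_path: str, purpose: str, log_type: str) -> Path:
    """Build the partition directory <output_path>/<purpose>/<log_type>/.

    The purpose/log_type values used below are hypothetical examples;
    DataCoolie's actual LogPurpose values are documented in Logging layout.
    """
    return Path(output_path) / purpose / log_type


# Example: a debug JSONL partition and an analyst Parquet partition
# (both value pairs are assumptions for illustration only).
debug_dir = log_partition_path("/data/logs", "debug", "jsonl")
analyst_dir = log_partition_path("/data/logs", "analyst", "parquet")
```

Knowing the layout ahead of time makes it easy to point downstream readers (e.g. a Parquet scan) at exactly one purpose/log-type slice instead of globbing the whole output tree.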
Quick checklist¶
Before running a pipeline in a new environment:
- Verify the platform (`LocalPlatform`, `AWSPlatform`, etc.) can reach its file paths and resolve secrets.
- Check that the metadata provider is reachable and returns at least one active dataflow.
- Confirm the engine has the required extras installed (`polars`, `spark`, etc.).
- Review the logging output path and ensure the destination directory is writable with the expected partition structure.