DataCoolie vs Airflow / Prefect: ETL Framework vs Orchestrator
DataCoolie and Airflow (or Prefect) operate at different levels of the data stack. Airflow and Prefect are workflow orchestrators — they schedule tasks, manage dependencies, and handle retries across an entire pipeline graph. DataCoolie is an ETL execution framework — it handles the read → transform → write → watermark lifecycle inside each individual task.
This post explains the difference, when to use each, and how they work together.
What Airflow and Prefect Do
Apache Airflow and Prefect are workflow orchestration platforms. You define tasks as a directed acyclic graph (DAG), and the orchestrator:
- Schedules runs on a cron or event trigger
- Orders tasks based on declared dependencies
- Retries failed tasks with configurable backoff
- Tracks run history and SLAs, and sends alerts on failures
Orchestrators do not care what each task does. A task might run a SQL query, call an API, train a model, or execute an ETL pipeline. The orchestrator manages when and in what order tasks run.
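To make the orchestrator's job concrete, here is a minimal sketch of dependency ordering and retry with backoff, using only the Python standard library. It is not how Airflow or Prefect are implemented; `run_with_retries`, `run_dag`, and the task names are hypothetical names chosen for illustration.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_with_retries(task: Callable[[], None], retries: int = 2, base_delay: float = 0.0) -> int:
    """Run a task, retrying on any exception with exponential backoff.

    Returns the number of attempts used; re-raises once retries are exhausted.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            task()
            return attempt
        except Exception:
            if attempt > retries:
                raise  # retries exhausted: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_dag(tasks: Dict[str, Callable[[], None]], upstream: Dict[str, Set[str]]) -> List[str]:
    """Run each task only after all of its upstream tasks; return the execution order."""
    order = list(TopologicalSorter(upstream).static_order())
    for name in order:
        run_with_retries(tasks[name])
    return order


# Illustration: the first task fails once, is retried, then the second task runs.
attempts = {"bronze_to_silver": 0}


def flaky_bronze() -> None:
    attempts["bronze_to_silver"] += 1
    if attempts["bronze_to_silver"] == 1:
        raise RuntimeError("transient failure")


def silver() -> None:
    pass  # the orchestrator does not care what the task body does


order = run_dag(
    tasks={"bronze_to_silver": flaky_bronze, "silver_to_gold": silver},
    upstream={"silver_to_gold": {"bronze_to_silver"}},
)
print(order)
```

The task bodies are opaque callables here, which mirrors the point above: the orchestrator decides when and in what order they run, not what they do.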
What DataCoolie Does
DataCoolie handles what happens inside a single ETL task. Given a dataflow defined in metadata, it:
- Reads from a source (connection)
- Applies transforms — schema hints, deduplication, computed columns, filters
- Writes to a destination using a load strategy — append, merge, SCD2
- Updates the watermark so the next run picks up only new data
DataCoolie does not schedule jobs, manage cross-task dependencies, or send alerts. It is the engine that runs inside a scheduled task.
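The four steps above can be sketched in plain Python. This is a generic illustration of a watermark-driven incremental load, not DataCoolie's API or implementation; `IncrementalState`, `run_dataflow`, and the `id` / `updated_at` columns are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, int]


@dataclass
class IncrementalState:
    watermark: int = 0  # highest `updated_at` value already loaded
    destination: List[Row] = field(default_factory=list)


def run_dataflow(source: List[Row], state: IncrementalState) -> int:
    """Run one read -> transform -> write -> watermark cycle; return rows written."""
    # Read: only rows newer than the stored watermark.
    new_rows = [r for r in source if r["updated_at"] > state.watermark]

    # Transform: deduplicate on `id`, keeping the most recent version of each row.
    latest: Dict[int, Row] = {}
    for row in sorted(new_rows, key=lambda r: r["updated_at"]):
        latest[row["id"]] = row
    batch = list(latest.values())

    # Write: append the batch to the destination.
    state.destination.extend(batch)

    # Watermark: advance only after the write, so a failed run is re-read next time.
    if batch:
        state.watermark = max(r["updated_at"] for r in batch)
    return len(batch)


source = [
    {"id": 1, "updated_at": 10},
    {"id": 1, "updated_at": 12},  # newer version of id 1
    {"id": 2, "updated_at": 11},
]
state = IncrementalState()
first_run = run_dataflow(source, state)   # loads ids 1 and 2
second_run = run_dataflow(source, state)  # nothing newer than the watermark
source.append({"id": 3, "updated_at": 15})
third_run = run_dataflow(source, state)   # picks up only id 3
```

Because the watermark moves only after a successful write, calling `run_dataflow` again with unchanged source data writes nothing.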
Key Differences
| Aspect | Airflow / Prefect | DataCoolie |
|---|---|---|
| Scope | Workflow orchestration (DAG scheduling) | ETL execution (read → transform → write) |
| Scheduling | Built-in cron, sensors, event triggers | None — runs when called |
| Task dependencies | DAG-based ordering across tasks | Single-task execution |
| Data processing | Delegates to external tools | Native DataFrame processing (Polars / Spark) |
| Retry model | Task-level retry with backoff | Idempotent re-execution from watermark |
| Load strategies | Not applicable | append, full_load, merge_upsert, merge_overwrite, scd2 |
| Multi-engine | Not applicable | Same metadata on Polars and Spark |
| Alerting | Built-in | External (relies on orchestrator or logging) |
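The retry and load-strategy rows interact: a retried task may replay a batch it has already written. The sketch below shows the general semantics of two of the strategies named in the table, under the assumption that `append` simply adds rows and `merge_upsert` replaces rows that share a key. It illustrates the idea only and is not DataCoolie's implementation.

```python
from typing import Dict, List

Row = Dict[str, object]


def append(destination: List[Row], batch: List[Row]) -> List[Row]:
    """Add every incoming row, whether or not it already exists."""
    return destination + batch


def merge_upsert(destination: List[Row], batch: List[Row], key: str) -> List[Row]:
    """Insert new keys and overwrite existing ones with the incoming row."""
    merged = {row[key]: row for row in destination}
    merged.update({row[key]: row for row in batch})
    return list(merged.values())


destination = [{"id": 1, "name": "a"}]
batch = [{"id": 1, "name": "a2"}, {"id": 2, "name": "b"}]

# Replaying the same batch: upsert converges, append duplicates.
upsert_once = merge_upsert(destination, batch, key="id")
upsert_twice = merge_upsert(upsert_once, batch, key="id")
append_twice = append(append(destination, batch), batch)
```

Here `upsert_twice` equals `upsert_once`, while `append_twice` contains the batch twice. That is why an append-style load depends on the watermark to avoid re-reading rows that were already written.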
Using Them Together
DataCoolie is designed to run inside an orchestrator. The most common pattern:
- Airflow (or Prefect) schedules the DAG and manages dependencies
- DataCoolie executes each ETL stage as a task within the DAG
- DataCoolie's watermarks handle incremental state; Airflow handles scheduling and alerting
Example: DataCoolie Inside an Airflow DAG
```python
# dags/datacoolie_pipeline.py (pseudocode; adapt to your environment)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_datacoolie_stage(stage: str) -> None:
    """Execute a DataCoolie stage."""
    # Imported inside the callable so the DAG file parses without loading DataCoolie.
    from datacoolie.engines.polars_engine import PolarsEngine
    from datacoolie.platforms.local_platform import LocalPlatform
    from datacoolie.metadata.file_provider import FileProvider
    from datacoolie.orchestration.driver import DataCoolieDriver

    platform = LocalPlatform()
    engine = PolarsEngine(platform=platform)
    provider = FileProvider(
        config_path="/opt/pipelines/metadata.json",
        platform=platform,
    )

    with DataCoolieDriver(engine=engine, metadata_provider=provider) as driver:
        result = driver.run(stage=stage)

    # Raise so that Airflow marks the task as failed.
    if result.failed > 0:
        raise RuntimeError(
            f"Stage {stage}: {result.failed}/{result.total} dataflows failed"
        )


with DAG(
    "datacoolie_pipeline",
    schedule="0 6 * * *",  # daily at 06:00
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    bronze = PythonOperator(
        task_id="bronze_to_silver",
        python_callable=run_datacoolie_stage,
        op_args=["bronze2silver"],
    )
    silver = PythonOperator(
        task_id="silver_to_gold",
        python_callable=run_datacoolie_stage,
        op_args=["silver2gold"],
    )

    bronze >> silver  # silver runs after bronze completes
```
This DAG runs two DataCoolie stages in sequence. Airflow handles scheduling, plus retries and alerting once you configure them (for example, retries in default_args). DataCoolie handles the actual data processing: reading, transforming, writing, and tracking watermarks.
Note: the example above is simplified pseudocode. In production, you would configure the platform and engine based on your deployment environment (Fabric, Databricks, AWS) and pass metadata paths via Airflow Variables or connections.
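One way to keep environment-specific values out of the task code is to resolve them from a key-value mapping at run time. The sketch below is an assumption-heavy illustration: `StageSettings`, `resolve_settings`, and the keys `DATACOOLIE_ENGINE` and `DATACOOLIE_METADATA_PATH` are hypothetical names, not part of DataCoolie or Airflow. The mapping could be `os.environ` or a dict populated from your orchestrator's variable store.

```python
from dataclasses import dataclass
from typing import Mapping

SUPPORTED_ENGINES = {"polars", "spark"}


@dataclass(frozen=True)
class StageSettings:
    metadata_path: str
    engine: str


def resolve_settings(env: Mapping[str, str]) -> StageSettings:
    """Build stage settings from a mapping, falling back to local defaults."""
    engine = env.get("DATACOOLIE_ENGINE", "polars").lower()  # hypothetical key
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported engine: {engine!r}")
    return StageSettings(
        # Hypothetical key; the default mirrors the path used in the example above.
        metadata_path=env.get("DATACOOLIE_METADATA_PATH", "/opt/pipelines/metadata.json"),
        engine=engine,
    )


dev = resolve_settings({})
prod = resolve_settings({
    "DATACOOLIE_ENGINE": "Spark",
    "DATACOOLIE_METADATA_PATH": "/mnt/prod/metadata.json",
})
```

The task callable would then choose its platform and engine classes from `StageSettings` rather than hard-coding them.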
When to Use Airflow / Prefect Alone
An orchestrator without DataCoolie is fine when:
- Your tasks are simple SQL queries or API calls that do not need DataFrame processing
- You use dbt for transformations and just need scheduling
- Each task is a self-contained script with no shared metadata model
When to Add DataCoolie
Add DataCoolie when:
- You need multi-source ingestion (files, databases, APIs) with consistent load strategies
- You want engine portability — same pipeline on Polars for dev, Spark for prod
- You need merge/upsert or SCD2 load strategies with automatic watermark tracking
- You want a declarative metadata model instead of imperative task code
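For readers unfamiliar with SCD2 (slowly changing dimension, type 2), the sketch below shows the general technique: when a tracked attribute changes, the current row is closed and a new version is opened, so history is preserved. It is a generic illustration with hypothetical names (`Version`, `apply_scd2`, `valid_from`, `valid_to`), not DataCoolie's implementation, and it ignores deletes.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class Version:
    key: int
    attrs: Dict[str, str]
    valid_from: int
    valid_to: Optional[int] = None  # None marks the current version


def apply_scd2(history: List[Version], incoming: Dict[int, Dict[str, str]], as_of: int) -> List[Version]:
    """Close current versions whose attributes changed and open new ones."""
    result: List[Version] = []
    for version in history:
        changed = (
            version.valid_to is None
            and version.key in incoming
            and incoming[version.key] != version.attrs
        )
        result.append(replace(version, valid_to=as_of) if changed else version)

    # Keys that still have an open version are unchanged and need no new row.
    still_current = {v.key for v in result if v.valid_to is None}
    for key, attrs in incoming.items():
        if key not in still_current:
            result.append(Version(key=key, attrs=attrs, valid_from=as_of))
    return result


history = apply_scd2([], {1: {"city": "Hanoi"}}, as_of=1)
history = apply_scd2(history, {1: {"city": "Hue"}, 2: {"city": "Hoi An"}}, as_of=2)
replayed = apply_scd2(history, {1: {"city": "Hue"}, 2: {"city": "Hoi An"}}, as_of=3)
```

After the second call, key 1 has a closed "Hanoi" version and a current "Hue" version. Replaying the same input leaves the history unchanged, which is the property that makes retries safe.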
Summary
| Question | Answer |
|---|---|
| Do they compete? | No — different layers (orchestration vs execution) |
| Can I use both? | Yes — Airflow schedules, DataCoolie executes |
| Do I need Airflow? | Not required — DataCoolie runs standalone or inside any scheduler |
| Do I need DataCoolie? | Not required — Airflow works with any task code |