DataCoolie vs Airflow / Prefect — ETL Framework vs Orchestrator¶

DataCoolie and Airflow (or Prefect) operate at different levels of the data stack. Airflow and Prefect are workflow orchestrators — they schedule tasks, manage dependencies, and handle retries across an entire pipeline graph. DataCoolie is an ETL execution framework — it handles the read → transform → write → watermark lifecycle inside each individual task.

This post explains the difference, when to use each, and how they work together.

What Airflow and Prefect Do¶

Apache Airflow and Prefect are workflow orchestration platforms. You define tasks as a directed acyclic graph (DAG), and the orchestrator:

Schedules runs on a cron or event trigger
Orders tasks based on declared dependencies
Retries failed tasks with configurable backoff
Monitors run history, SLAs, and alerting

Orchestrators do not care what each task does. A task might run a SQL query, call an API, train a model, or execute an ETL pipeline. The orchestrator manages when and in what order tasks run.

What DataCoolie Does¶

DataCoolie handles what happens inside a single ETL task. Given a dataflow defined in metadata, it:

Reads from a source (connection)
Applies transforms — schema hints, deduplication, computed columns, filters
Writes to a destination using a load strategy — append, merge, SCD2
Updates the watermark so the next run picks up only new data

DataCoolie does not schedule jobs, manage cross-task dependencies, or send alerts. It is the engine that runs inside a scheduled task.

Key Differences¶

Aspect	Airflow / Prefect	DataCoolie
Scope	Workflow orchestration (DAG scheduling)	ETL execution (read → transform → write)
Scheduling	Built-in cron, sensors, event triggers	None — runs when called
Task dependencies	DAG-based ordering across tasks	Single-task execution
Data processing	Delegates to external tools	Native DataFrame processing (Polars / Spark)
Retry model	Task-level retry with backoff	Idempotent re-execution from watermark
Load strategies	Not applicable	append, full_load, merge_upsert, merge_overwrite, scd2
Multi-engine	Not applicable	Same metadata on Polars and Spark
Alerting	Built-in	External (relies on orchestrator or logging)

Using Them Together¶

DataCoolie is designed to run inside an orchestrator. The most common pattern:

Airflow (or Prefect) schedules the DAG and manages dependencies
DataCoolie executes each ETL stage as a task within the DAG
DataCoolie's watermarks handle incremental state; Airflow handles scheduling and alerting

Example: DataCoolie Inside an Airflow DAG¶

# dags/datacoolie_pipeline.py  (pseudocode — adapt to your environment)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime


def run_datacoolie_stage(stage: str) -> None:
    """Execute a DataCoolie stage."""
    from datacoolie.engines.polars_engine import PolarsEngine
    from datacoolie.platforms.local_platform import LocalPlatform
    from datacoolie.metadata.file_provider import FileProvider
    from datacoolie.orchestration.driver import DataCoolieDriver

    platform = LocalPlatform()
    engine = PolarsEngine(platform=platform)
    provider = FileProvider(
        config_path="/opt/pipelines/metadata.json",
        platform=platform,
    )

    with DataCoolieDriver(engine=engine, metadata_provider=provider) as driver:
        result = driver.run(stage=stage)
        if result.failed > 0:
            raise RuntimeError(
                f"Stage {stage}: {result.failed}/{result.total} dataflows failed"
            )


with DAG(
    "datacoolie_pipeline",
    schedule="0 6 * * *",  # daily at 06:00
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    bronze = PythonOperator(
        task_id="bronze_to_silver",
        python_callable=run_datacoolie_stage,
        op_args=["bronze2silver"],
    )

    silver = PythonOperator(
        task_id="silver_to_gold",
        python_callable=run_datacoolie_stage,
        op_args=["silver2gold"],
    )

    bronze >> silver  # silver runs after bronze completes

This DAG runs two DataCoolie stages in sequence. Airflow handles scheduling, retries, and alerting. DataCoolie handles the actual data processing — reading, transforming, writing, and tracking watermarks.

Pseudocode

The example above is simplified. In production, you would configure the platform and engine based on your deployment environment (Fabric, Databricks, AWS) and pass metadata paths via Airflow Variables or connections.

When to Use Airflow / Prefect Alone¶

An orchestrator without DataCoolie is fine when:

Your tasks are simple SQL queries or API calls that do not need DataFrame processing
You use dbt for transformations and just need scheduling
Each task is a self-contained script with no shared metadata model

When to Add DataCoolie¶

Add DataCoolie when:

You need multi-source ingestion (files, databases, APIs) with consistent load strategies
You want engine portability — same pipeline on Polars for dev, Spark for prod
You need merge/upsert or SCD2 load strategies with automatic watermark tracking
You want a declarative metadata model instead of imperative task code

Summary¶

Question	Answer
Do they compete?	No — different layers (orchestration vs execution)
Can I use both?	Yes — Airflow schedules, DataCoolie executes
Do I need Airflow?	Not required — DataCoolie runs standalone or inside any scheduler
Do I need DataCoolie?	Not required — Airflow works with any task code