# Benchmarks
DataCoolie ships a reproducible benchmark harness under `usecase-sim/runner/run_perf_benchmark.py`.
## Running
```shell
python datacoolie/usecase-sim/runner/run_perf_benchmark.py --engine polars
python datacoolie/usecase-sim/runner/run_perf_benchmark.py --engine spark
```
Each run writes JSON results to `datacoolie/benchmark_results/`:

- `polars_results.json`
- `spark_results.json`
- `perf_report.md`: markdown summary, regenerated on each run
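Because the results are plain JSON, downstream tooling can consume them directly. A minimal sketch of reading a run record and deriving a throughput figure; the field names (`rows_read`, `duration_s`) are illustrative assumptions, not the harness's actual schema:

```python
import json

# Hypothetical per-run record; the harness's real field names may differ.
sample = {"engine": "polars", "rows_read": 1_000_000, "duration_s": 2.5}

def read_throughput(run: dict) -> float:
    """Rows read per second for a single benchmark run."""
    return run["rows_read"] / run["duration_s"]

# Round-trip through JSON, as the harness persists results to disk.
record = json.loads(json.dumps(sample))
print(f"{read_throughput(record):,.0f} rows/s")  # 400,000 rows/s
```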
## What it measures
The harness exercises each load type and each format against synthetic data sized to match typical ingestion workloads. Per-run metrics:
- Rows read / written
- Wall-clock duration
- Read throughput (rows/s)
- Write throughput (rows/s)
- Peak memory (Linux only, via the standard-library `resource` module)
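On Linux, peak resident memory can be read with the standard-library `resource` module. A sketch of the kind of measurement involved (the harness's exact usage may differ):

```python
import resource
import sys

def peak_memory_mb() -> float:
    """Peak resident set size of the current process, in MiB.

    On Linux, ru_maxrss is reported in kilobytes; on macOS, in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor

print(f"peak memory: {peak_memory_mb():.1f} MiB")
```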
## Published report
See `benchmark_results/perf_report.md` for the last committed run.
## Interpretation caveats
- Numbers depend on disk type, CPU, and Spark cluster size — treat them as relative comparisons, not absolute guarantees.
- Polars is single-node; Spark numbers come from a `local[*]` driver, which is not representative of cluster performance.
- Iceberg and Delta have different merge characteristics at scale; the harness runs both.