Installation¶
DataCoolie is published to PyPI. The base package stays light; engines, cloud SDKs, and metadata backends are opt-in extras so you only install what you need.
Recommended first install¶
For most new users, start with the smallest setup that can run a real pipeline:
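```bash
pip install "datacoolie[polars,deltalake]"
```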
If you already know Spark is your main runtime, install Spark + Delta instead:
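```bash
pip install "datacoolie[spark,delta-spark]"
```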
Use `datacoolie[all]` for contributor machines or broad local experimentation. A bare `pip install datacoolie` is mainly useful for extension work, API exploration, or environments where another package layer supplies the runtime dependencies.
Quick decision table¶
| What you want to do first | Install |
|---|---|
| Fastest first success on one machine | `pip install "datacoolie[polars,deltalake]"` |
| Spark/Fabric/Databricks-style local validation | `pip install "datacoolie[spark,delta-spark]"` |
| Try many engines, platforms, and metadata backends locally | `pip install "datacoolie[all]"` |
| Only inspect APIs or develop extensions | `pip install datacoolie` |
Pick your extras¶
```bash
# Minimal (no engine extras; rarely useful on its own)
pip install datacoolie

# Most common: one engine + one table format
pip install "datacoolie[polars,deltalake]"
pip install "datacoolie[spark,delta-spark]"

# Everything
pip install "datacoolie[all]"
```
Extras reference¶
| Extra | Installs | Use when |
|---|---|---|
| `spark` | `pyspark>=3.5` | You want the Spark engine. |
| `polars` | `polars>=1.0` | You want the Polars engine. |
| `delta-spark` | `delta-spark>=3.0` | Spark + Delta Lake. |
| `deltalake` | `deltalake>=0.15` | Polars + Delta Lake (delta-rs). |
| `iceberg` | `pyiceberg>=0.6` | Apache Iceberg tables (any engine). |
| `aws` | `boto3>=1.28` | AWSPlatform (S3, Secrets Manager, Glue). |
| `api` | `httpx>=0.24` | APIReader and the API metadata provider. |
| `db` | `sqlalchemy>=2.0` | DatabaseProvider for metadata stored in an RDBMS. |
| `excel` | `fastexcel`, `openpyxl` | Reading Excel files as sources or metadata. |
| `fabric` | `spark` + `delta-spark` + `polars` + `deltalake` | Microsoft Fabric notebooks / Spark pools. |
| `databricks` | `spark` + `delta-spark` + `polars` + `deltalake` | Databricks Runtime. |
| `all` | Everything above | Kitchen-sink local dev. |
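Extras compose like any pip extras, so one command can pull together an engine, a table format, and whatever platform or metadata backends you need. For example, a hypothetical Polars-on-AWS setup with Iceberg tables:

```bash
pip install "datacoolie[polars,iceberg,aws]"
```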
System requirements¶
| Component | Minimum | Notes |
|---|---|---|
| Python | 3.11 | Strict — the framework uses PEP 604 unions and TypeVar defaults. |
| Java | 17 for Spark | Only required when using SparkEngine. |
| RAM | 4 GB (Polars) · 8 GB (Spark) | Per executor; scale with your data. |
| Disk | — | Depends on your lakehouse layout. |
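A quick way to confirm the interpreter and JVM requirements before installing (standard commands, nothing DataCoolie-specific):

```bash
python --version   # needs 3.11+
java -version      # needs 17+, and only if you plan to use the Spark engine
```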
Windows timezones
On Windows, Python's `zoneinfo` needs `tzdata` to resolve IANA zones. `tzdata` is pulled in automatically via the `sys_platform == 'win32'` marker in `pyproject.toml`. If you vendor a custom wheel, install `tzdata` explicitly.
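To confirm that IANA zones actually resolve on your machine, a quick standard-library check (no DataCoolie APIs involved):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# ZoneInfo(...) raises ZoneInfoNotFoundError if tzdata is missing on Windows
print(datetime.now(ZoneInfo("Europe/Amsterdam")))
```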
Verify¶
```python
import datacoolie

print(datacoolie.__version__)

# Listing registered plugins proves the install wired up entry points.
print(sorted(datacoolie.engine_registry.names()))
print(sorted(datacoolie.platform_registry.names()))
```
Expected output (with `[all]`):
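Something along these lines, where the version number and the exact registry names are illustrative rather than guaranteed; at minimum the two engines from the extras table should appear:

```text
0.x.y
['polars', 'spark']
['aws', 'databricks', 'fabric']
```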
If one of your engines is missing from the output, its extra is not installed; see the extras reference above.
Common beginner trap¶
If `import datacoolie` works but your quickstart still cannot create an engine or read/write a table, you almost always installed the base package without the engine or table-format extra you need.
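The fastest diagnosis is to compare the registered engine names against the extras reference above; whichever name is missing needs its extra installed. A minimal check, reusing the verify snippet:

```python
import datacoolie

# Expect to see the engine you installed, e.g. "polars" or "spark".
# If it is absent, install the matching extra, for example:
#   pip install "datacoolie[polars,deltalake]"
print(sorted(datacoolie.engine_registry.names()))
```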
Next¶
- Most new users: Quickstart · Polars
- Spark-first users: Quickstart · Spark