# CLI

DataCoolie ships as a library; there is no `datacoolie` console script. The
canonical runner scripts live in the `usecase-sim` testbed and are the
template you should copy into your own repo.
## Runners (from usecase-sim/runner/)

| Script | Purpose |
|---|---|
| `run.py` | Unified ETL runner for engine × metadata-source × platform combinations. Wraps `DataCoolieDriver.run`. |
| `maintenance.py` | Runs `OPTIMIZE` / `VACUUM` via `DataCoolieDriver.run_maintenance`. |
| `run_scenario.py` | Dispatches to `run.py` or `maintenance.py` based on `scenarios.json`. |
| `run_perf_benchmark.py` | Generates a performance report comparing engines. |
### run.py

Required flags:

| Flag | Values | Notes |
|---|---|---|
| `--engine` | `polars` · `spark` | required |
| `--metadata-source` | `file` · `database` · `api` | required |
| `--stage` | string or comma-separated list | required; pass `""` to run all loaded dataflows |
Source-specific flags:

| Metadata source | Required flags |
|---|---|
| `file` | `--metadata-path` |
| `database` | `--metadata-db-connection-string` + `--metadata-workspace-id` |
| `api` | `--metadata-api-url` + `--metadata-workspace-id` |
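For example, a minimal sketch of an `api`-source run; the URL and workspace id below are placeholders, not real defaults, and `--stage ""` runs every loaded dataflow:

```powershell
# Hypothetical values: point at your own metadata API and workspace
python usecase-sim/runner/run.py `
  --engine polars `
  --metadata-source api `
  --metadata-api-url http://localhost:8000 `
  --metadata-workspace-id demo `
  --stage ""
```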
Common optional flags include `--platform`, `--column-name-mode`, `--dry-run`,
`--storage-options KEY=VALUE`, `--log-path`, `--max-workers`,
`--skip-api-sources`, `--catalog-preset`, `--iceberg-catalog-uri`,
`--uc-token`, and `--uc-credential`.

Spark-only optional flags: `--app-name` and the repeatable `--spark-config KEY=VALUE`.
Typical invocation:

```powershell
# From datacoolie/
python usecase-sim/runner/run.py `
  --engine polars `
  --metadata-source file `
  --metadata-path usecase-sim/metadata/file/orders_csv_to_parquet_full_load.json `
  --stage ingest2bronze
```
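A Spark run against a `database` metadata source looks similar; the connection string, workspace id, app name, and Spark setting below are illustrative placeholders:

```powershell
# Hypothetical values: substitute your own metadata DB and Spark settings
python usecase-sim/runner/run.py `
  --engine spark `
  --metadata-source database `
  --metadata-db-connection-string "postgresql://user:pass@localhost:5432/metadata" `
  --metadata-workspace-id demo `
  --stage ingest2bronze `
  --app-name datacoolie-demo `
  --spark-config spark.sql.shuffle.partitions=8
```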
### maintenance.py

Required flags:

| Flag | Values | Notes |
|---|---|---|
| `--engine` | `polars` · `spark` | required |
| `--metadata-path` | file path | required |
Common optional flags include `--platform`, `--connection`, `--retention-hours`,
`--dry-run`, `--storage-options KEY=VALUE`, `--log-path`,
`--skip-api-sources`, `--catalog-preset`, `--iceberg-catalog-uri`,
`--uc-token`, and `--uc-credential`.

Maintenance behavior toggles: `--no-compact` disables compaction; `--no-cleanup` disables cleanup.

Spark-only optional flags: `--app-name` and the repeatable `--spark-config KEY=VALUE`.
Typical invocation:

```powershell
python usecase-sim/runner/maintenance.py `
  --engine polars `
  --metadata-path usecase-sim/metadata/file/local_use_cases.json `
  --connection local_bronze `
  --retention-hours 168
```
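To preview maintenance without touching data, a sketch combining the behavior toggles with `--dry-run` (assuming `--dry-run` only reports planned actions, as its name suggests):

```powershell
# Skip compaction entirely; report what cleanup would do without changing anything
python usecase-sim/runner/maintenance.py `
  --engine polars `
  --metadata-path usecase-sim/metadata/file/local_use_cases.json `
  --connection local_bronze `
  --no-compact `
  --dry-run
```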
### run_scenario.py

Dispatches named scenarios from `usecase-sim/scenarios/scenarios.json`.

Selection flags are mutually exclusive:

- `--scenario <name>`
- `--all`
- `--priority P0|P1|P2`

Optional flag: `--scenarios-path` to point at a different scenario catalog.
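For example (the scenario name below is a made-up placeholder; use a name that actually exists in your `scenarios.json`):

```powershell
# Run one named scenario
python usecase-sim/runner/run_scenario.py --scenario orders_full_load

# Run everything tagged P0
python usecase-sim/runner/run_scenario.py --priority P0
```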
### run_perf_benchmark.py

Performance benchmark runner for the large perf metadata set.

Key flags:

- `--engine polars|spark` is required unless you use `--report-only`.
- `--metadata-path` defaults to the perf metadata file.
- `--stages` accepts a comma-separated stage list.
- `--max-size` caps the largest dataset size.
- `--output-dir` chooses where JSON results and the markdown report are written.
- `--reset`, `--report-only`, and `--no-iceberg` control benchmark behavior.
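As a sketch: the stage name `ingest2bronze` comes from the `run.py` example above, while the size cap and output directory are illustrative values, not documented defaults:

```powershell
# Benchmark polars on one stage, capping dataset size (values illustrative)
python usecase-sim/runner/run_perf_benchmark.py `
  --engine polars `
  --stages ingest2bronze `
  --max-size 1000000 `
  --output-dir perf-results

# Rebuild the markdown report from previously written JSON results
python usecase-sim/runner/run_perf_benchmark.py --report-only
```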
## Scripts (one-shot helpers)

Under `usecase-sim/scripts/`:

| Script | Purpose |
|---|---|
| `setup_platform.py` | Bring up / tear down the Docker stack. |
| `setup_metadata.py` | Seed metadata into file, DB dialects, and the API. |
| `generate_data.py` | Produce sample inputs. |
| `generate_perf_data.py` | Generate a larger dataset for benchmarks. |
| `reset_data.py`, `reset_perf_data.py`, `reset_watermarks.py` | Clean slate between runs. |
See the `usecase-sim` README for the full set and invocation examples.
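As a rough sketch of a first-time local bootstrap (argument-free invocations are an assumption here; the README documents each script's real flags):

```powershell
python usecase-sim/scripts/setup_platform.py
python usecase-sim/scripts/setup_metadata.py
python usecase-sim/scripts/generate_data.py
```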