CLI¶

DataCoolie ships as a library; there is no `datacoolie` console script. The canonical runner scripts live in the `usecase-sim` testbed and serve as templates to copy into your own repo.

Runners (from `usecase-sim/runner/`)¶

| Script | Purpose |
| --- | --- |
| `run.py` | Unified ETL runner for `engine × metadata-source × platform` combinations. Wraps `DataCoolieDriver.run`. |
| `maintenance.py` | Run `OPTIMIZE` / `VACUUM` via `DataCoolieDriver.run_maintenance`. |
| `run_scenario.py` | Dispatches to `run.py` or `maintenance.py` based on `scenarios.json`. |
| `run_perf_benchmark.py` | Generates a perf report comparing engines. |

`run.py`¶

Required flags:

| Flag | Values | Notes |
| --- | --- | --- |
| `--engine` | `polars` · `spark` | required |
| `--metadata-source` | `file` · `database` · `api` | required |
| `--stage` | string or comma-list | required; pass `""` to run all loaded dataflows |
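
The `--stage` semantics in the table can be illustrated with a minimal sketch. The helper name `parse_stages` and the stage name `bronze2silver` are hypothetical, and the whitespace trimming is an assumption; `run.py` may parse the value differently.

```python
def parse_stages(raw: str) -> list[str]:
    """Split a --stage value into stage names.

    An empty list stands for "run all loaded dataflows" (the `""` case).
    Trimming whitespace around names is an assumption of this sketch.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


print(parse_stages("ingest2bronze"))                 # single stage
print(parse_stages("ingest2bronze, bronze2silver"))  # comma-list
print(parse_stages(""))                              # empty list: all stages
```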

Source-specific flags:

| Metadata source | Required flags |
| --- | --- |
| `file` | `--metadata-path` |
| `database` | `--metadata-db-connection-string` + `--metadata-workspace-id` |
| `api` | `--metadata-api-url` + `--metadata-workspace-id` |
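
If you copy `run.py` into your own repo, one way to enforce the source-specific requirements above is a post-parse check with `argparse`. This is a sketch under assumptions: the flag names come from the tables above, but the helper names (`build_parser`, `missing_flags`, `parse_args`) are hypothetical and the real script may validate differently.

```python
import argparse

# Flags each metadata source needs, taken from the table above.
REQUIRED_BY_SOURCE = {
    "file": ["metadata_path"],
    "database": ["metadata_db_connection_string", "metadata_workspace_id"],
    "api": ["metadata_api_url", "metadata_workspace_id"],
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--engine", choices=["polars", "spark"], required=True)
    parser.add_argument("--metadata-source", choices=sorted(REQUIRED_BY_SOURCE), required=True)
    parser.add_argument("--stage", required=True)
    # Source-specific flags are optional at parse time and checked afterwards.
    parser.add_argument("--metadata-path")
    parser.add_argument("--metadata-db-connection-string")
    parser.add_argument("--metadata-api-url")
    parser.add_argument("--metadata-workspace-id")
    return parser


def missing_flags(args: argparse.Namespace) -> list[str]:
    """Return the flags the chosen metadata source still needs."""
    needed = REQUIRED_BY_SOURCE[args.metadata_source]
    return ["--" + name.replace("_", "-") for name in needed if getattr(args, name) is None]


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = missing_flags(args)
    if missing:
        # parser.error prints usage and exits with status 2.
        parser.error(f"--metadata-source {args.metadata_source} requires: {', '.join(missing)}")
    return args
```

Checking after parsing keeps the error message specific to the selected source, which `required=True` alone cannot express.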

Common optional flags include `--platform`, `--column-name-mode`, `--dry-run`, `--storage-options KEY=VALUE`, `--log-path`, `--max-workers`, `--skip-api-sources`, `--catalog-preset`, `--iceberg-catalog-uri`, `--uc-token`, and `--uc-credential`.

Spark-only optional flags: `--app-name` and repeatable `--spark-config KEY=VALUE`.
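
A repeatable `KEY=VALUE` flag such as `--spark-config` can be collected into a dict with `argparse`'s `append` action. The sketch below is illustrative only: the `key_value` helper is hypothetical, and the runner's actual parsing (including how `--storage-options` is handled) is not confirmed here.

```python
import argparse


def key_value(text: str) -> tuple[str, str]:
    """Split 'KEY=VALUE' on the first '='; later '=' characters stay in the value."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


parser = argparse.ArgumentParser()
parser.add_argument(
    "--spark-config",
    action="append",   # each occurrence adds one (key, value) pair
    type=key_value,
    default=[],
    metavar="KEY=VALUE",
)

args = parser.parse_args([
    "--spark-config", "spark.sql.shuffle.partitions=8",
    "--spark-config", "spark.ui.enabled=false",
])
spark_conf = dict(args.spark_config)
print(spark_conf)
```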

Typical invocation (PowerShell; the trailing backtick continues the line):

# From datacoolie/
python usecase-sim/runner/run.py `
    --engine polars `
    --metadata-source file `
    --metadata-path usecase-sim/metadata/file/orders_csv_to_parquet_full_load.json `
    --stage ingest2bronze
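
The backtick line continuation above is PowerShell-specific. As a shell-neutral alternative, the same command can be assembled in Python and launched with `subprocess`. This is a sketch: `build_run_command` and `launch` are hypothetical helpers, and only the flags shown in the invocation above are used.

```python
import shlex
import subprocess
import sys


def build_run_command(engine: str, metadata_path: str, stage: str) -> list[str]:
    """Build the argv list for run.py with a file metadata source."""
    return [
        sys.executable, "usecase-sim/runner/run.py",
        "--engine", engine,
        "--metadata-source", "file",
        "--metadata-path", metadata_path,
        "--stage", stage,
    ]


def launch(cmd: list[str]) -> None:
    """Run the command from the datacoolie/ directory; raises on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = build_run_command(
    "polars",
    "usecase-sim/metadata/file/orders_csv_to_parquet_full_load.json",
    "ingest2bronze",
)
print(shlex.join(cmd))  # inspect the command before calling launch(cmd)
```

Passing an argument list rather than a single string avoids shell quoting differences between PowerShell, cmd, and POSIX shells.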

`maintenance.py`¶

Required flags:

| Flag | Values | Notes |
| --- | --- | --- |
| `--engine` | `polars` · `spark` | required |
| `--metadata-path` | file path | required |

Common optional flags include `--platform`, `--connection`, `--retention-hours`, `--dry-run`, `--storage-options KEY=VALUE`, `--log-path`, `--skip-api-sources`, `--catalog-preset`, `--iceberg-catalog-uri`, `--uc-token`, and `--uc-credential`.

Maintenance behavior toggles:

- `--no-compact` disables compaction.
- `--no-cleanup` disables cleanup.
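
Negative toggles like these are commonly modelled with `argparse`'s `store_false` action, so both steps run unless switched off. The following is a sketch, not the script's actual code; the `planned_steps` helper and the `compact` / `cleanup` attribute names are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(prog="maintenance.py")
# store_false flags default to True, so both steps are enabled by default.
parser.add_argument("--no-compact", dest="compact", action="store_false")
parser.add_argument("--no-cleanup", dest="cleanup", action="store_false")


def planned_steps(args: argparse.Namespace) -> list[str]:
    """List the maintenance steps that remain enabled, in execution order."""
    steps = []
    if args.compact:
        steps.append("compact")
    if args.cleanup:
        steps.append("cleanup")
    return steps


print(planned_steps(parser.parse_args([])))
print(planned_steps(parser.parse_args(["--no-compact"])))
```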

Spark-only optional flags: `--app-name` and repeatable `--spark-config KEY=VALUE`.

Typical invocation (PowerShell; the trailing backtick continues the line):

python usecase-sim/runner/maintenance.py `
    --engine polars `
    --metadata-path usecase-sim/metadata/file/local_use_cases.json `
    --connection local_bronze `
    --retention-hours 168

`run_scenario.py`¶

Dispatches named scenarios from `usecase-sim/scenarios/scenarios.json`.

Selection flags are mutually exclusive:

- `--scenario <name>`
- `--all`
- `--priority P0|P1|P2`

Optional flag: `--scenarios-path` to point at a different scenario catalog.
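
Mutually exclusive selection flags map directly onto `argparse`'s mutually exclusive groups. The sketch below makes several assumptions: that exactly one selector is required, that the catalog is a list of entries with `name` and `priority` keys (a hypothetical schema, since the format of `scenarios.json` is not described here), and that the default catalog path matches the one mentioned above.

```python
import argparse

parser = argparse.ArgumentParser(prog="run_scenario.py")
# Assumption: one of the three selectors must be given.
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--scenario", metavar="NAME")
group.add_argument("--all", action="store_true")
group.add_argument("--priority", choices=["P0", "P1", "P2"])
parser.add_argument("--scenarios-path", default="usecase-sim/scenarios/scenarios.json")


def select(scenarios: list[dict], args: argparse.Namespace) -> list[dict]:
    """Pick scenarios from a catalog; the dict keys are a hypothetical schema."""
    if args.all:
        return list(scenarios)
    if args.scenario is not None:
        return [s for s in scenarios if s["name"] == args.scenario]
    return [s for s in scenarios if s["priority"] == args.priority]


# Hypothetical catalog entries, for illustration only.
catalog = [
    {"name": "orders_full_load", "priority": "P0"},
    {"name": "orders_incremental", "priority": "P1"},
]
print(select(catalog, parser.parse_args(["--priority", "P0"])))
```

Combining two selectors, for example `--all --scenario orders_full_load`, makes `argparse` exit with a usage error.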

`run_perf_benchmark.py`¶

Performance benchmark runner for the large perf metadata set.

Key flags:

- `--engine polars|spark` is required unless you use `--report-only`.
- `--metadata-path` defaults to the perf metadata file.
- `--stages` accepts a comma-separated stage list.
- `--max-size` caps the largest dataset size.
- `--output-dir` chooses where JSON results and the markdown report are written.
- `--reset`, `--report-only`, and `--no-iceberg` control benchmark behavior.
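
The rule that `--engine` is required unless `--report-only` is given cannot be expressed with `required=True` alone; a common pattern is to validate after parsing. This is a sketch under assumptions: `parse_args` and `split_stages` are hypothetical helpers, and only a subset of the flags listed above is modelled.

```python
import argparse


def split_stages(text: str) -> list[str]:
    """Turn a comma-separated --stages value into a list, dropping empty parts."""
    return [part for part in text.split(",") if part]


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="run_perf_benchmark.py")
    parser.add_argument("--engine", choices=["polars", "spark"])
    parser.add_argument("--report-only", action="store_true")
    parser.add_argument("--stages", type=split_stages, default=[])
    args = parser.parse_args(argv)
    # Conditional requirement: an engine is needed only when benchmarks actually run.
    if not args.report_only and args.engine is None:
        parser.error("--engine is required unless --report-only is given")
    return args


print(parse_args(["--report-only"]).engine)
print(parse_args(["--engine", "polars", "--stages", "ingest2bronze"]).stages)
```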

Scripts (one-shot helpers)¶

Under `usecase-sim/scripts/`:

| Script | Purpose |
| --- | --- |
| `setup_platform.py` | Bring up / down the Docker stack. |
| `setup_metadata.py` | Seed metadata into file, DB dialects, and the API. |
| `generate_data.py` | Produce sample inputs. |
| `generate_perf_data.py` | Larger dataset for benchmarks. |
| `reset_data.py`, `reset_perf_data.py`, `reset_watermarks.py` | Clean slate between runs. |

See the `usecase-sim` README for the full set and invocation examples.
