CLI¶

DataCoolie ships as a library — there is no datacoolie console script. The canonical runner scripts live in the usecase-sim testbed and are the template you should copy into your own repo.

Runners (from `usecase-sim/runner/`)¶

Script	Purpose
`run.py`	Unified ETL runner for `engine × metadata-source × platform` combinations. Wraps `DataCoolieDriver.run`.
`maintenance.py`	Run `OPTIMIZE` / `VACUUM` via `DataCoolieDriver.run_maintenance`.
`run_scenario.py`	Dispatches to `run.py` or `maintenance.py` based on `scenarios.json`.
`run_perf_benchmark.py`	Generates a perf report comparing engines.

`run.py`¶

Required flags:

Flag	Values	Notes
`--engine`	`polars` · `spark`	required
`--metadata-source`	`file` · `database` · `api`	required
`--stage`	string or comma-list	required; pass `""` to run all loaded dataflows

Source-specific flags:

Metadata source	Required flags
`file`	`--metadata-path`
`database`	`--metadata-db-connection-string` + `--metadata-workspace-id`
`api`	`--metadata-api-url` + `--metadata-workspace-id`

Common optional flags include --platform, --column-name-mode, --dry-run, --storage-options KEY=VALUE, --log-path, --max-workers, --skip-api-sources, --catalog-preset, --iceberg-catalog-uri, --uc-token, and --uc-credential.

Optional flag	Values / Notes
`--platform`	`local` (default) · `aws`
`--column-name-mode`	`lower` (default) · `snake`
`--dry-run`	Flag — plan without reading or writing
`--storage-options KEY=VALUE`	Repeatable — e.g. `AWS_REGION=us-east-1`
`--log-path`	Output directory for ETL and system logs
`--max-workers`	Integer — overrides `DataCoolieRunConfig.max_workers`
`--skip-api-sources`	Flag — skip dataflows whose source `connection_type` is `api`
`--catalog-preset`	`local` (default) · `unity_catalog`
`--iceberg-catalog-uri`	Catalog URI for Iceberg
`--uc-token`	Unity Catalog access token
`--uc-credential`	Alternative Unity Catalog credential

Replay mode flags¶

When --replay-start and --replay-end are provided, run.py calls driver.run_replay() instead of driver.run().

Flag	Notes
`--replay-start`	Inclusive lower bound — ISO date/datetime string or integer
`--replay-end`	Exclusive upper bound — ISO date/datetime string or integer
`--replay-chunk-interval KEY=VALUE`	Repeatable — e.g. `days=1` or `months=1`
`--replay-save-watermark`	Flag — save watermark after each chunk (crash-resume)
`--replay-chunk-column`	Override auto-resolved chunk column

Example — replay Q1 2025 in monthly chunks:

python usecase-sim/runner/run.py `
    --engine polars `
    --metadata-source file `
    --metadata-path usecase-sim/metadata/file/local_use_cases.json `
    --stage bronze2silver `
    --replay-start 2025-01-01 `
    --replay-end 2025-04-01 `
    --replay-chunk-interval months=1

See How-to · Replay & backfill.

Spark-only optional flags: --app-name and repeatable --spark-config KEY=VALUE.

Typical invocation:

# From datacoolie/
python usecase-sim/runner/run.py `
    --engine polars `
    --metadata-source file `
    --metadata-path usecase-sim/metadata/file/orders_csv_to_parquet_full_load.json `
    --stage ingest2bronze

`maintenance.py`¶

Required flags:

Flag	Values	Notes
`--engine`	`polars` · `spark`	required
`--metadata-path`	file path	required

Common optional flags include --platform, --connection, --retention-hours, --dry-run, --storage-options KEY=VALUE, --log-path, --skip-api-sources, --catalog-preset, --iceberg-catalog-uri, --uc-token, and --uc-credential.

Maintenance behavior toggles:

--no-compact disables compaction.
--no-cleanup disables cleanup.

Spark-only optional flags: --app-name and repeatable --spark-config KEY=VALUE.

Typical invocation:

python usecase-sim/runner/maintenance.py `
    --engine polars `
    --metadata-path usecase-sim/metadata/file/local_use_cases.json `
    --connection local_bronze `
    --retention-hours 168

`run_scenario.py`¶

Dispatches named scenarios from usecase-sim/scenarios/scenarios.json.

Selection flags are mutually exclusive:

--scenario <name>
--all
--priority P0|P1|P2

Optional flag: --scenarios-path to point at a different scenario catalog.

`run_perf_benchmark.py`¶

Performance benchmark runner for the large perf metadata set.

Key flags:

--engine polars|spark is required unless you use --report-only.
--metadata-path defaults to the perf metadata file.
--stages accepts a comma-separated stage list.
--max-size caps the largest dataset size.
--output-dir chooses where JSON results and the markdown report are written.
--reset, --report-only, and --no-iceberg control benchmark behavior.

Scripts (one-shot helpers)¶

Under usecase-sim/scripts/:

Script	Purpose
`setup_platform.py`	Bring up / down the Docker stack.
`setup_metadata.py`	Seed metadata into file, DB dialects, and the API.
`generate_data.py`	Produce sample inputs.
`generate_perf_data.py`	Larger dataset for benchmarks.
`reset_data.py`, `reset_perf_data.py`, `reset_watermarks.py`	Clean slate between runs.

See the usecase-sim README for the full set and invocation examples.

CLI¶

Runners (from usecase-sim/runner/)¶

run.py¶