Use your own data after the quickstart

This page is for the most common next step: the sample pipeline worked, and now you want to point DataCoolie at your own data without changing engines, platforms, or orchestration code.

Start from what already works

Keep these the same

  • Your chosen engine (PolarsEngine or SparkEngine)
  • Your platform (LocalPlatform in the quickstarts)
  • Your run.py structure

Change only three parts first

  1. The input connection
  2. The source selector
  3. The destination behavior

That is enough for most first real runs.

1. Put your input in a predictable folder

Start simple. For file sources, DataCoolie builds the read path from base_path, optional schema_name, and table.

my-pipeline/
  data/
    input/
      customers/
        customers.csv
    output/
  metadata/
    customers.json
  run.py
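
As a concrete illustration of that path rule: if you later set schema_name on the source, expect the engine to look one folder deeper. With base_path data/input, schema_name sales, and table customers, the read path would be data/input/sales/customers, which corresponds to a layout like this (a sketch of the stated rule; the quickstart layout above omits schema_name):

my-pipeline/
  data/
    input/
      sales/
        customers/
          customers.csv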

If your input is not CSV, keep the same folder layout and change the connection format to match the file type you are reading.
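
For example, a Parquet input keeps the same layout (a customers.parquet file under data/input/customers/) and needs only a different format value in the connection you will define in the next step. A sketch, assuming parquet is an accepted format string for file connections:

{
  "name": "local_input",
  "connection_type": "file",
  "format": "parquet",
  "configure": {
    "base_path": "data/input"
  }
}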

2. Update the input connection

Use the same shape as the quickstart and change only what is yours:

{
  "name": "local_input",
  "connection_type": "file",
  "format": "csv",
  "configure": {
    "base_path": "data/input"
  }
}

If you later need reader-specific options such as headers or delimiter changes, add them under configure.read_options.
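For example, reading a semicolon-delimited CSV with a header row might look like the following. The keys inside read_options (separator and has_header here) are passed through to your engine's reader, so the exact names below are illustrative assumptions, not confirmed DataCoolie options:

{
  "name": "local_input",
  "connection_type": "file",
  "format": "csv",
  "configure": {
    "base_path": "data/input",
    "read_options": {
      "separator": ";",
      "has_header": true
    }
  }
}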

3. Point the source at your folder or table

The smallest useful source block is:

{
  "connection_name": "local_input",
  "table": "customers"
}

That reads from data/input/customers when the connection base_path is data/input.

If you have a reliable incremental column, add it now:

{
  "connection_name": "local_input",
  "table": "customers",
  "watermark_columns": ["updated_at"]
}

If you do not have a reliable timestamp or sequence column yet, skip watermarks for the first run and add them later.

4. Pick the simplest destination first

When moving from the sample to your own data, prefer the easiest load behavior that matches what you know:

  • Use overwrite when you just need a clean successful run.
  • Use merge_upsert only when you already have stable business keys.
  • For flat-file destinations such as CSV, Parquet, JSON, or JSONL, stay with append, overwrite, or full_load (see the flat-file sketch at the end of this step).

Example destination for a first real run:

{
  "connection_name": "local_bronze",
  "schema_name": "sales",
  "table": "customers",
  "load_type": "overwrite"
}

When you are ready for keyed incremental merges:

{
  "connection_name": "local_bronze",
  "schema_name": "sales",
  "table": "customers",
  "load_type": "merge_upsert",
  "merge_keys": ["customer_id"]
}
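
If your bronze destination is itself a flat file, stay with the simple load types from the list above. A minimal append version, assuming local_bronze is a file connection configured like local_input:

{
  "connection_name": "local_bronze",
  "schema_name": "sales",
  "table": "customers",
  "load_type": "append"
}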

5. Keep the runner unchanged

Do not redesign run.py yet. Reuse the quickstart runner and only point FileProvider at the new metadata file.

# Same imports and setup as the quickstart run.py
platform = LocalPlatform()
engine = PolarsEngine(platform=platform)
# Only config_path changes: it now points at your own metadata file.
provider = FileProvider(config_path="metadata/customers.json", platform=platform)

with DataCoolieDriver(engine=engine, metadata_provider=provider) as driver:
    result = driver.run(stage="ingest2bronze")

If your quickstart already ran, keeping the runner fixed makes the next failure much easier to diagnose because only the metadata changed.

6. Add transforms only when you need them

Common first upgrades, combined in a sketch after this list:

  • Add deduplicate_columns and latest_data_columns when the source can send duplicate business keys.
  • Add schema_hints when decimals, timestamps, or IDs need stable typing.
  • Add additional_columns only for business columns you derive before the write.
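
Together, the first two upgrades might look like the source sketch below. The field names come from the list above, but the value shapes (for example, whether schema_hints maps column names to type strings) are illustrative assumptions to verify against the detailed guides:

{
  "connection_name": "local_input",
  "table": "customers",
  "watermark_columns": ["updated_at"],
  "deduplicate_columns": ["customer_id"],
  "latest_data_columns": ["updated_at"],
  "schema_hints": {
    "customer_id": "string",
    "updated_at": "timestamp"
  }
}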

Use the detailed guides when you hit those needs.

Common mistakes on the first real run

  • Using merge_upsert before you have stable merge_keys
  • Expecting flat-file outputs to support merge-style loads
  • Adding watermarks before you have a reliable incremental column
  • Changing the engine, platform, and metadata all at once
