Validation checklist¶
Prerequisites · You have authored a metadata JSON file.
End state · Confidence that your metadata is correct before you press run.
Use this checklist before the first run on a new pipeline. You can also return to it whenever a run fails with an unexpected error.
1. Document / provider preflight¶
- If you use the file provider, JSON is your canonical source and any YAML or Excel sibling has been regenerated after the latest edit.
- If you use the database or API provider, you know which `workspace_id` the run should target.
- Every connection has a unique `name` within the metadata set.
- Every dataflow has a unique `name` within the metadata set.
- Any nested JSON stored in Excel cells (`configure`, `secrets_ref`, `source_configure`, `destination_configure`, `transform`) is valid JSON (a quick check follows this list).
- You are not expecting `connection_type: "streaming"` to work yet; that model value exists, but no built-in formats are mapped to it.
2. Connection basics¶
- `connection_type` and `format` are a valid pair:

| `connection_type` | Valid `format` |
|---|---|
| `file` | `csv`, `parquet`, `json`, `jsonl`, `avro`, `excel` |
| `lakehouse` | `delta`, `iceberg` |
| `database` | `sql` |
| `api` | `api` |
| `function` | `function` |

- If `connection_type` is omitted, `format` alone still identifies the intended connection family.
- `configure.base_path` exists on disk (or the cloud path is reachable) for `file` and `lakehouse` connections (spot-check below).
- Lakehouse connections using metastore registration have the right `catalog`/`database` values.
- Database connections have either `configure.url` or a valid combination of `database_type`, `host`, `port`, and `database`.
- API connections use `configure.base_url` rather than `configure.url` (spot-check below).
- `secrets_ref` only lists field names that actually exist in `configure`.
- No `configure` field appears under two different `secrets_ref` sources.
Quick database connectivity check:
```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchone())
```
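Similar spot-checks for file and API connections; the path and URL below are assumed placeholders for your own `configure.base_path` and `configure.base_url`, and `requests` is just one way to probe reachability:

```python
from pathlib import Path

import requests

# File / lakehouse connection: configure.base_path must exist before the run.
base_path = Path("/data/landing/orders")        # assumed base_path
print(base_path, "exists:", base_path.exists())

# API connection: the root URL in configure.base_url should at least respond.
base_url = "https://api.example.com/v1"         # assumed base_url
print(base_url, "->", requests.get(base_url, timeout=10).status_code)
```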
3. Dataflow envelope¶
- Every `source.connection_name` matches a `name` in `connections` (see the cross-check after this list).
- Every `destination.connection_name` matches a `name` in `connections`.
- `stage` is set: it is the filter you pass to `driver.run(stage=…)`. All dataflows in the same logical step should share the same `stage` string.
- If execution order matters, `group_number` and `execution_order` are set explicitly instead of relying on file order.
- `processing_mode` is left as `batch` unless you intentionally need a specialized mode.
- `is_active` was not accidentally set to `false` on the dataflow.
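A quick cross-check of the envelope can be done on the raw JSON before handing it to a provider. A minimal sketch, assuming top-level `connections` and `dataflows` arrays in `metadata.json` (adjust the keys to your actual layout):

```python
import json

with open("metadata.json") as f:
    meta = json.load(f)

known = {c["name"] for c in meta.get("connections", [])}
for flow in meta.get("dataflows", []):
    for side in ("source", "destination"):
        ref = flow.get(side, {}).get("connection_name")
        if ref not in known:
            print(f"{flow.get('name')}: {side}.connection_name '{ref}' has no match")
    if flow.get("is_active") is False:
        print(f"{flow.get('name')}: is_active is false and will be skipped")
```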
4. Source¶
- Each source uses the right selector style:
    - file / lakehouse / database table mode → `source.table`
    - database query mode → `source.query`
    - function source → `source.python_function`
- If the source is in a sub-folder/schema, `source.schema_name` is set.
- If you want incremental loads, `source.watermark_columns` is set and the column actually exists in the source data.
- For database sources: the SQL schema (`source.schema_name`) and table (`source.table`) exist in the target database.
- For database query sources: `source.query` runs successfully by itself.
- For API sources: `connection.configure.base_url` and `source.configure.endpoint` together form the correct URL.
- For API sources: pagination keys (`pagination_type`, `page_size`, `cursor_path`, `next_link_path`, `total_path`) match the actual response.
- For function sources: `source.python_function` is a dotted path like `mypkg.loaders.load_orders` and is allowed by runtime prefix rules if you use `allowed_function_prefixes` (importability check below).
- Any `source.configure.read_options` override is intentional and engine-valid.
- If the file source uses `date_folder_partitions` or backward replay, you have verified the folder layout matches the pattern.
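For function sources you can confirm the dotted path is importable before the run; `mypkg.loaders.load_orders` is the example path from the checklist above, not a real module:

```python
from importlib import import_module

dotted_path = "mypkg.loaders.load_orders"  # example path from the checklist
module_path, _, func_name = dotted_path.rpartition(".")

# Raises ImportError / AttributeError if the path does not resolve.
func = getattr(import_module(module_path), func_name)
print(func)
```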
5. Destination¶
- The destination format is supported by a built-in writer: `parquet`, `csv`, `json`, `jsonl`, `avro`, `delta`, or `iceberg`.
- `destination.load_type` is set to one of: `append`, `overwrite`, `full_load`, `merge_upsert`, `merge_overwrite`, `scd2`.
- If the destination is a flat-file writer (`parquet`, `csv`, `json`, `jsonl`, `avro`), the load type is only `append`, `overwrite`, or `full_load`.
- If `load_type` is `merge_upsert`, `merge_overwrite`, or `scd2`:
    - `destination.merge_keys` is set and is a list.
    - Every column in `merge_keys` exists in the source data (see the sample check after this list).
- If `load_type` is `scd2`:
    - `destination.configure.scd2_effective_column` is set.
    - The column named in `scd2_effective_column` exists in the source data.
- If `destination.partition_columns` are used:
    - Each `column` either already exists in the source data, or its `expression` references columns that do.
- If `connection.configure.date_folder_partitions` is used for a flat-file destination, you understand that `partition_columns` takes precedence when both are present.
- Any `destination.configure.write_options` override is intentional and engine-valid.
- If you use `catalog`/`database` registration, the resulting qualified name resolves to the intended lakehouse table.
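To verify merge keys and the SCD2 effective column against real data, read a small sample and compare column names. A sketch using Polars, with an assumed file path and assumed key names:

```python
import polars as pl

merge_keys = ["order_id", "line_number"]   # assumed destination.merge_keys
scd2_effective_column = "updated_at"       # assumed scd2_effective_column

# Read one file (or a LIMITed query result) rather than the full dataset.
sample = pl.read_parquet("/data/landing/orders/part-0.parquet")  # assumed path
missing = [c for c in merge_keys + [scd2_effective_column]
           if c not in sample.columns]
print("missing columns:", missing or "none")
```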
6. Transform¶
- `schema_hints` entries use a supported `data_type`: `int`, `long`, `float`, `double`, `decimal`, `string`, `boolean`, `date`, `timestamp`, `binary`.
- `decimal` hints have both `precision` and `scale` set.
- `source.connection.use_schema_hint` is not disabled if you expect schema hints to take effect.
- `deduplicate_columns` and `latest_data_columns` reference columns that actually exist in the source data.
- If you rely on deduplication but left `latest_data_columns` empty, you intentionally want ordering to fall back to `source.watermark_columns`.
- SQL expressions in `additional_columns` are valid for your engine (you can exercise an expression with the snippet after this list):
    - Polars: use `EXTRACT(YEAR FROM col)`, not `year(col)`.
    - Polars: use `CAST(col AS DATE)`, not `date(col)`.
    - Both: standard SQL arithmetic, `CASE WHEN`, and string concatenation work.
- You have not configured `__created_at`, `__updated_at`, or `__updated_by` in `additional_columns`; these are added automatically.
- You are not trying to reference system columns inside `additional_columns`; they are added later in the pipeline.
- `transform.configure.convert_timestamp_ntz` and `transform.configure.deduplicate_by_rank` are only set when you want those behaviors.
- Downstream expectations account for final lowercase column names after `ColumnNameSanitizer` runs.
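You can exercise an `additional_columns` expression against a tiny frame before putting it in metadata. A minimal sketch using Polars' SQL interface (assumes a recent Polars version; the table and column names are made up):

```python
import polars as pl

# One-row sample with a timestamp column to test the expression against.
df = pl.DataFrame({"order_ts": ["2024-01-15 10:30:00"]}).with_columns(
    pl.col("order_ts").str.to_datetime()
)

ctx = pl.SQLContext(orders=df)
print(
    ctx.execute(
        "SELECT EXTRACT(YEAR FROM order_ts) AS order_year, "
        "CAST(order_ts AS DATE) AS order_date FROM orders"
    ).collect()
)
```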
7. Secrets¶
- Credentials are not hardcoded in `configure.url` or `configure.password`.
- `secrets_ref` lists the correct field names from `configure` (see the check below).
- The current value of each `configure` field listed in `secrets_ref` is a vault key or environment variable name, not the real credential.
- The vault/environment variables are available in the execution environment.
- The same `configure` field is not listed under two different secret sources.
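A minimal sketch of the cross-check between `secrets_ref` and `configure`, assuming `secrets_ref` maps each secret source to a list of `configure` field names (adjust to the shape your schema actually uses):

```python
import json

with open("metadata.json") as f:
    meta = json.load(f)

for conn in meta.get("connections", []):
    configure = conn.get("configure") or {}
    seen = {}
    for source, fields in (conn.get("secrets_ref") or {}).items():
        for field in fields:
            if field not in configure:
                print(f"{conn['name']}: '{field}' is in secrets_ref but not in configure")
            if field in seen:
                print(f"{conn['name']}: '{field}' listed under '{seen[field]}' and '{source}'")
            seen[field] = source
```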
Quick check for environment-variable secrets:

```python
import os

# Every variable referenced indirectly by secrets_ref must be set:
print(os.environ.get("DC_POSTGRES_URL"))  # should not be None
```

See Concepts · Secrets for the full `secrets_ref` schema.
8. Load and run a quick smoke test¶
Before a full production run, test with a small subset. The quickest way is `dry_run=True`, which makes the driver validate and plan without writing anything:
```python
from datacoolie.core.models import DataCoolieRunConfig
from datacoolie.orchestration.driver import DataCoolieDriver

with DataCoolieDriver(engine=engine, metadata_provider=metadata) as driver:
    result = driver.run(stage="ingest", run_config=DataCoolieRunConfig(dry_run=True))
    print(result)
    # ExecutionResult(total=1, succeeded=0, failed=0, skipped=1)
    # Skipped = dry run; no write attempted.
```
Or validate the metadata load alone:
```python
from datacoolie.metadata.file_provider import FileProvider
from datacoolie.platforms.local_platform import LocalPlatform

provider = FileProvider(config_path="metadata.json", platform=LocalPlatform())

flows = provider.get_dataflows(stage="ingest")
print(flows)  # list of DataFlow objects; inspect fields here
conns = provider.get_connections()
print(conns)  # list of Connection objects
```
If either call raises, the error message points directly at the invalid field.
For merge-style destinations, remember that the first successful run may create the table with overwrite-style behavior before later runs switch to true merge semantics.
9. Common errors quick-reference¶
| Error message | Root cause | Fix |
|---|---|---|
| `Format 'delta' is not valid for connection_type 'file'` | Type/format mismatch | Use the valid-pairs table in section 2 |
| `APIReader requires 'base_url' in connection.configure` | API connection used the wrong key | Put the root URL in `connection.configure.base_url` |
| `PythonFunctionReader requires source.python_function` | Function path is missing or was put on the connection | Put a dotted path on `source.python_function` |
| `connection 'X' not found` | `connection_name` typo in source or destination | Check spelling against `connections[].name` |
| `Field 'url' listed in secrets_ref is missing from configure` | `secrets_ref` points at a non-existent config field | Add `configure.url` first, then resolve it via `secrets_ref` |
| `MergeUpsertStrategy requires merge_keys` | Merge load type requires business keys | Add `"merge_keys": [...]` to the destination |
| `SCD2Strategy requires scd2_effective_column` | SCD2 without an effective date | Add `"configure": {"scd2_effective_column": "..."}` to the destination |
| `FileWriter only supports ['append', 'full_load', 'overwrite']` | Merge or SCD2 was configured on a flat-file destination | Use Delta/Iceberg for merge-style writes or switch the load type |
| `Column not found: updated_at` | Watermark or dedup column doesn't exist | Check actual column names in the source data |
| `year()` / `date()` fails on Polars | Unsupported SQL helper was used in metadata expressions | Use `EXTRACT(...)` or `CAST(... AS DATE)` |
| `JSONDecodeError` in Excel cell | `configure` cell contains invalid JSON | Fix the JSON in that cell; ensure it's a valid object |
| All dataflows skipped / 0 loaded | `is_active` is `false`, the stage filter did not match, or the source legitimately returned zero rows | Check `is_active`, `driver.run(stage=...)`, and the source query/path |
Ready to run¶
If all boxes above are checked, run your first stage:
```python
with DataCoolieDriver(engine=engine, metadata_provider=metadata) as driver:
    result = driver.run(stage="ingest")
    assert result.failed == 0, f"Pipeline failed: {result}"
    print(f"Processed {result.total} dataflows, {result.succeeded} succeeded")
```
→ Back to Metadata guide overview