Deploy to Databricks¶
Prerequisites

- Databricks workspace with Unity Catalog
- A cluster or SQL Warehouse
- `pip install datacoolie` on the cluster

End state

- A DataCoolie pipeline running as a Databricks job with `DatabricksPlatform`, DBFS / UC Volumes I/O, and Databricks secrets
1. Cluster library¶
Install `datacoolie` on the cluster via the Libraries tab, or `%pip install` it in a notebook.
Add optional dependencies individually only when needed. Common examples:

- `sqlalchemy` for database metadata.
- `httpx` for API metadata.
- `openpyxl` for Excel metadata files.
- `pyiceberg` if your pipeline uses PyIceberg-based operations.
Databricks already provides the Spark runtime, so a large platform bundle is often unnecessary.
2. Notebook / job code¶
```python
from datacoolie.engines.spark_engine import SparkEngine
from datacoolie.platforms.databricks_platform import DatabricksPlatform
from datacoolie.metadata.database_provider import DatabaseProvider
from datacoolie.orchestration.driver import DataCoolieDriver

engine = SparkEngine(spark, platform=DatabricksPlatform())
metadata = DatabaseProvider(
    connection_string="jdbc:postgresql://…",
    workspace_id="your-workspace-id",
)

with DataCoolieDriver(
    engine=engine,
    metadata_provider=metadata,
    base_log_path="/Volumes/main/logs/datacoolie",
) as driver:
    driver.run(stage="ingest2bronze")
```
3. Paths¶
- Prefer UC Volumes: `/Volumes/<catalog>/<schema>/<volume>/...`
- DBFS still works: `dbfs:/Users/…` or `/dbfs/…`

`DatabricksPlatform` normalises both.
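The exact normalisation rules live inside `DatabricksPlatform`; the function below is only an illustration of the idea (its name and behaviour are assumptions, not the library's API). It maps the `dbfs:/` URI scheme onto the `/dbfs/` mount path and passes UC Volumes paths through unchanged:

```python
def normalize_path(path: str) -> str:
    """Illustrative only: map DBFS-style URIs onto the /dbfs mount,
    leaving UC Volumes and already-mounted paths untouched."""
    if path.startswith("/Volumes/"):
        return path  # UC Volumes: already a plain POSIX path
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path  # /dbfs/… and other paths pass through as-is

# Both DBFS spellings end up as the same local-style path:
print(normalize_path("dbfs:/Users/alice/logs"))  # /dbfs/Users/alice/logs
print(normalize_path("/dbfs/Users/alice/logs"))  # /dbfs/Users/alice/logs
print(normalize_path("/Volumes/main/logs/x"))    # /Volumes/main/logs/x
```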
4. Qualified table names¶
Unity Catalog is three-level (`catalog.schema.table`). In DataCoolie metadata, that maps to `catalog` + `database` + `table`, so prefer leaving `schema_name` empty for Databricks connections. The checked-in usecase-sim metadata follows that pattern:
```json
{
  "connection_type": "lakehouse",
  "format": "delta",
  "catalog": "workspace",
  "database": "default",
  "configure": {
    "base_path": "/Volumes/workspace/default/datacoolie_sim/delta"
  }
}
```
With a destination table like `orders_appended`, DataCoolie resolves the target as `workspace.default.orders_appended`. If you set `schema_name`, the generic qualified-name builder will include it and produce a four-part name, which is usually not what you want for Unity Catalog.
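That resolution rule can be sketched as a small helper. This is hypothetical, not DataCoolie's actual builder, and the position of `schema_name` between `database` and `table` is an assumption; the point is only that a non-empty `schema_name` yields a four-part name:

```python
def qualified_name(catalog: str, database: str, table: str,
                   schema_name: str = "") -> str:
    """Join the non-empty parts with dots, mimicking a generic
    qualified-name builder (illustrative only)."""
    parts = [catalog, database, schema_name, table]
    return ".".join(p for p in parts if p)

# Empty schema_name -> the three-part name Unity Catalog expects:
print(qualified_name("workspace", "default", "orders_appended"))
# workspace.default.orders_appended

# Setting schema_name produces a four-part name, usually wrong for UC:
print(qualified_name("workspace", "default", "orders_appended",
                     schema_name="sales"))
# workspace.default.sales.orders_appended
```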
5. Secrets¶
`DatabricksPlatform._fetch_secret` always uses `dbutils.secrets.get(scope, key)`; there is no separate native secret backend on Databricks. Put the secret key name in `configure`, then map the Databricks scope in `secrets_ref`:
```json
{
  "configure": {
    "password": "sample-db-password"
  },
  "secrets_ref": {
    "datacoolie-scope": ["password"]
  }
}
```
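On a cluster, `dbutils.secrets.get` performs the actual lookup; the mapping logic itself can be sketched as below. This is a hypothetical stand-in, not DataCoolie's `resolve_secrets`, with the secret getter injected so the sketch runs outside Databricks:

```python
def resolve_secrets(configure: dict, secrets_ref: dict, get_secret) -> dict:
    """For each (scope, keys) pair in secrets_ref, replace the secret
    key names stored in configure with values fetched from that scope."""
    resolved = dict(configure)
    for scope, keys in secrets_ref.items():
        for key in keys:
            # configure holds the secret *key name*, not the value
            secret_key = configure[key]
            resolved[key] = get_secret(scope, secret_key)
    return resolved

# Outside Databricks, inject a fake getter; on a cluster you would
# pass dbutils.secrets.get instead.
fake_store = {("datacoolie-scope", "sample-db-password"): "s3cr3t"}
resolved = resolve_secrets(
    configure={"password": "sample-db-password"},
    secrets_ref={"datacoolie-scope": ["password"]},
    get_secret=lambda scope, key: fake_store[(scope, key)],
)
print(resolved["password"])  # s3cr3t
```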
The checked-in `sample_databricks_secrets.ipynb` notebook validates both direct provider access and `resolve_secrets(...)` resolution without printing raw secret values.
6. Workflow Job Setup¶
The checked-in usecase-sim assets are notebook-based, so this repo verifies the Workflow / notebook-task path rather than a raw Jobs API payload. In practice, wrap the notebook as a Databricks Workflow job with:
- Databricks Runtime ≥ 13.3 LTS
- Python 3.11 (Databricks DBR 14+)
- `datacoolie` as a cluster library, plus only the extra Python packages your metadata source or table format actually needs
If you later provision jobs through the Databricks Jobs API, mirror the same notebook path, runtime, and library list used by the Workflow job. There is no repo-specific Jobs API JSON example checked in today.
Reference workspace¶
Use the Databricks platform guide in usecase-sim for the current sample notebooks, metadata file, and setup notes: