Deploy to AWS Glue¶
Prerequisites · AWS account with Glue, S3, Secrets Manager · datacoolie installed in the Glue job's Python library path.
End state · DataCoolie pipeline running as a Glue Spark job writing Delta/Iceberg to S3 with Secrets Manager-backed credentials.
1. Package the library¶
Glue lets you provide a wheel via --additional-python-modules:
Then add only the extra Python packages your specific job needs, for example:
sqlalchemyfor database metadata.httpxfor API metadata.openpyxlfor Excel metadata files.pyicebergfor PyIceberg-based Iceberg workflows.
Glue already provides the Spark runtime, so treat DataCoolie extras as optional convenience, not the default install path.
2. Job script¶
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from datacoolie.engines.spark_engine import SparkEngine
from datacoolie.platforms.aws_platform import AWSPlatform
from datacoolie.metadata.file_provider import FileProvider
from datacoolie.orchestration.driver import DataCoolieDriver
spark = GlueContext(SparkContext.getOrCreate()).spark_session
platform = AWSPlatform(region="us-east-1")
engine = SparkEngine(spark, platform=platform)
metadata = FileProvider(config_path="s3://my-bucket/metadata/orders.json", platform=platform)
with DataCoolieDriver(engine=engine, metadata_provider=metadata,
base_log_path="s3://my-bucket/logs") as driver:
driver.run(stage="ingest2bronze")
3. Delta + Athena handoff¶
Connection.configure options that matter on AWS:
| Option | Purpose |
|---|---|
athena_output_location |
S3 path for Athena DDL results. When set, DataCoolie registers a native Delta table via Athena CREATE EXTERNAL TABLE ... TBLPROPERTIES('table_type'='DELTA') after every write and maintenance. |
generate_manifest |
Write a _symlink_format_manifest/ after writes (Presto/Athena-compatible). |
register_symlink_table |
Register a Glue SymlinkTextInputFormat table after writes. Implies generate_manifest. |
symlink_database_prefix |
Prefix for the symlink Glue database. Default "symlink_". |
The checked-in aws_glue_use_cases.json uses this Delta path: Delta
destinations carry athena_output_location, and AWSPlatform.register_delta_table(...)
does the Glue Catalog registration through Athena after Spark writes.
4. Secrets¶
AWSPlatform._fetch_secret hits AWS Secrets Manager:
{
"configure": {
"username": "db_user",
"password": "db_pass"
},
"secrets_ref": {
"arn:aws:secretsmanager:ap-southeast-1:123456789012:secret:de-dev-0001/datacoolie/rds": [
"username",
"password"
]
}
}
username and password are the configure fields. Their current values
(db_user, db_pass) are the JSON keys fetched from the Secrets Manager
secret. This is the same pattern used by aws_secrets_example_source in the
usecase-sim AWS metadata.
If you prefer a shorter form in your own metadata, use the actual secret name
you created, such as de-dev-0001/datacoolie/rds. A bare value like
datacoolie/rds only works when that is literally the secret's name. Use the
full ARN when you want an unambiguous reference across environments or accounts.
5. Delta Lake config on Glue¶
Glue Spark needs Delta Lake extensions activated at session start. The
repo-backed sample_aws_glue_spark.py uses explicit Spark properties, so that
is the documented baseline:
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Glue also offers job arguments such as --datalake-formats delta, but this
repo does not rely on them anywhere. Prefer the explicit Spark properties plus
the matching Delta runtime/JARs, because that is the path exercised by the
checked-in Glue sample.
6. Iceberg alternative¶
If you would rather use Iceberg than Delta on Glue, switch the connection to
fmt="iceberg", set catalog="glue_catalog", and keep a real Glue database
name (the AWS sample metadata uses database="datacoolie"). This skips the
Athena-native Delta registration path entirely and uses Glue Catalog-backed
Iceberg tables instead.
For Spark jobs, include the Glue Iceberg runtime/JARs. For Python Shell jobs,
the checked-in sample_aws_glue_polars.py uses pyiceberg, and Glue catalog
support requires PYICEBERG_CATALOG__GLUE__TYPE=glue plus the appropriate Glue
permissions.
Reference assets¶
Use the AWS platform guide in usecase-sim for the current sample scripts, metadata file, and setup notes: