refactor(jobs): run maintenance jobs on serverless compute #30
Merged
The five maintenance jobs are pure psycopg + SDK (no SparkSession), so run them on serverless instead of requiring a provisioned cluster.
- bundle: drop existing_cluster_id and the cluster_id variable; declare dependencies (psycopg[binary], databricks-sdk) per job in an `environments` block (reused via a YAML anchor) and reference it from each task via environment_key; task `libraries` no longer apply on serverless
- partition_manager: fix stale docstring (job parameter / LAKETS_INSTANCE, not spark.conf.get)
- docs/CHANGELOG: note jobs run on serverless compute
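For the partition_manager item, a minimal sketch of how an instance name could be resolved from a job parameter or LAKETS_INSTANCE without touching Spark. This is not the code in the PR: the function name `resolve_instance_name`, the precedence (job parameter first, environment variable second), and the error message are assumptions made for illustration.

```python
from typing import Mapping, Sequence

ENV_VAR = "LAKETS_INSTANCE"  # environment variable named in the commit message


def resolve_instance_name(argv: Sequence[str], environ: Mapping[str, str]) -> str:
    """Return the instance name from the job parameter, else from the env var.

    Assumption: the job parameter arrives as the first positional argument
    (argv[1]) and takes precedence over LAKETS_INSTANCE.
    """
    # Job parameter: first positional argument after the script name.
    if len(argv) > 1 and argv[1].strip():
        return argv[1].strip()

    # Fallback: environment variable.
    value = environ.get(ENV_VAR, "").strip()
    if value:
        return value

    # Neither source is set: exit with a readable message instead of a traceback.
    raise SystemExit(
        f"instance name required: pass it as a job parameter or set {ENV_VAR}"
    )
```

In a job script this would typically be called as `resolve_instance_name(sys.argv, os.environ)`; passing both in as arguments keeps the function easy to test without patching globals.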
Summary
The five LakeTS maintenance jobs are pure psycopg + Databricks SDK (no `SparkSession`; `cold_rollup_refresh` uses a SQL warehouse via the SDK, not Spark). This moves them off a provisioned cluster onto serverless compute.
Changes
Bundle (`databricks/bundles/databricks.yml`)
- Removed `existing_cluster_id` from every task and the now-unused `cluster_id` variable (and its dev/prod overrides).
- Dependencies (`psycopg[binary]`, `databricks-sdk`) moved from task `libraries` (which don't apply on serverless) into a per-job `environments` block, reused across all five jobs via a YAML anchor and referenced from each task via `environment_key: lakets_env`.
- `run_as` service principal (prod) and `sync.paths` are unchanged.
Source
- `partition_manager.py`: fixed a stale docstring that referenced `spark.conf.get(...)`; the instance name comes from the job parameter (`sys.argv[1]`) or `LAKETS_INSTANCE`.
Docs / CHANGELOG
- Noted that the jobs run on serverless compute, with dependencies declared in `environments`.
Test plan
- `tests/test_python_patterns.py` passes (11/11)
- `environments` spec, every task has `environment_key: lakets_env`, no `existing_cluster_id`, no `cluster_id` var, no task `libraries`
- `databricks bundle deploy -t prod --var="service_principal_name=<sp>"` provisions the jobs on serverless and a manual run of each completes against a Lakebase instance
- `client` version ("3") is valid in the target workspace; bump if needed
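The structural checks in the test plan could be automated along these lines. This is a hedged sketch, not a test from the repository: `serverless_violations` is a hypothetical helper, and it assumes the bundle has already been parsed into a plain dict (for example with a YAML loader that resolves anchors). The job name and task key in `EXAMPLE_BUNDLE` are made up for illustration, and only the top-level `variables` block is inspected, not per-target overrides.

```python
from typing import Any, Dict, List

ENV_KEY = "lakets_env"  # environment key named in the PR description


def serverless_violations(bundle: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the bundle passes."""
    problems: List[str] = []

    # The cluster_id variable should be gone (top-level variables only).
    if "cluster_id" in bundle.get("variables", {}):
        problems.append("variables: cluster_id is still declared")

    jobs = bundle.get("resources", {}).get("jobs", {})
    for name, job in jobs.items():
        # Each job must declare an environment with the shared key.
        env_keys = {e.get("environment_key") for e in job.get("environments", [])}
        if ENV_KEY not in env_keys:
            problems.append(f"{name}: no environments spec for {ENV_KEY}")

        for task in job.get("tasks", []):
            label = f"{name}.{task.get('task_key', '?')}"
            # Each task must reference the shared environment.
            if task.get("environment_key") != ENV_KEY:
                problems.append(f"{label}: environment_key is not {ENV_KEY}")
            # Cluster pinning and task-level libraries should be absent.
            for banned in ("existing_cluster_id", "libraries"):
                if banned in task:
                    problems.append(f"{label}: {banned} must not be set")
    return problems


# Hypothetical parsed bundle: job name and task key are illustrative only.
EXAMPLE_BUNDLE = {
    "resources": {
        "jobs": {
            "example_job": {
                "environments": [
                    {
                        "environment_key": ENV_KEY,
                        "spec": {
                            "client": "3",
                            "dependencies": ["psycopg[binary]", "databricks-sdk"],
                        },
                    }
                ],
                "tasks": [{"task_key": "run", "environment_key": ENV_KEY}],
            }
        }
    }
}
```

Returning a list of problems rather than raising on the first one means a single run reports every task that still pins a cluster or carries `libraries`.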