skills(pipelines): tighten databricks-pipelines for correctness and density #102
dustinvannoy-db
left a comment
Rebase needed, then address a few comments. It's looking really good, though.
Let's remove this before we merge the PR.
…ness and density
Rework of the databricks-pipelines skill: fixes correctness/consistency issues introduced during the initial port from ai-dev-kit, restructures the workflow narrative around DAB-vs-CLI iteration, and compresses ~40% of the line count without losing any concept. See REVIEW-NOTES-databricks-pipelines.md at the repo root for the full rationale and per-category change log.
Co-authored-by: Isaac
5d98735 to 75984e4
@dustinvannoy-db thanks for the review — all four addressed in the force-pushed commit (75984e4):
Rebase took one conflict on …
I think this one is ready to merge based on my review and the test results. I ran all levels of skill eval using ground_truth from https://github.com/databricks-solutions/ai-dev-kit/tree/main/.test/skills/databricks-spark-declarative-pipelines and that gave good results, though the ground truth was a bit overspecific, so I have since modified it. Composite score: 83%
dustinvannoy-db
left a comment
LGTM, let's get this one approved and merged and sync other versions of the skill as needed.

Why this PR exists
The `databricks-pipelines` skill is the agent-facing reference for building Lakeflow Spark Declarative Pipelines. The initial port from `ai-dev-kit` got the content into the repo; this PR is the second pass that turns it into a tight, internally consistent tool an LLM can rely on in a fresh session: every claim grounded in the current Databricks docs, every workflow with one clear entry point, every concept stated once.

What this PR improves

One authoritative story per topic. SDP/LDP naming, the legacy-DLT migration map, the `start-update` polling pattern with `error.exceptions[0].message` extraction, and the dev/iteration canonical create JSON each now live in exactly one place and are referenced from everywhere else that needs them. The DAB workflows (A and B) and the no-bundle CLI workflow (C) have dedicated detail files; the SKILL.md entry point gives a one-liner per option and links out. The Common Traps / Common Issues split is now intent-driven: Traps cover design-time decisions, Issues cover concrete error → fix mappings agents will copy-paste.

Verified against the current Databricks docs. Where official guidance is nuanced (`CREATE TEMPORARY VIEW` doesn't support `CONSTRAINT` clauses, so `CREATE LIVE VIEW` is retained as the only path for expectations on a temp view), that nuance is captured explicitly with a citation. `STREAM(table)` (function form, with parens) is normalized everywhere for table sources; `STREAM read_files(...)` (no extra parens) for function sources, matching the docs. `@dp.view` is correctly marked as legacy per the official `temporary_view` reference. `sequence_by` is documented as accepting both string and `Column` per the API.

Dense and skimmable. Roughly 40% shorter (~5,700 → ~3,400 lines) with zero concept removed. SQL is canonical where SQL/Python differ only mechanically, with a one-line Python equivalent noted inline. Pattern-variation sections that repeated the same SQL with one different clause each are now single examples with the variations listed. Verbose boilerplate (full `pyproject.toml`, identical second/third examples of the same pattern) leans on agent world-knowledge instead of restating it. Less context to load, less drift between repeated explanations.

Tuned defaults for the loop agents actually run. The canonical `pipelines create` JSON ships with `development: true` and the retry overrides set to `"0"`, so a doomed update fails fast (~30s) instead of retrying for 10+ min on serverless. A clearly-labelled "drop these for production" instruction is right next to the snippet so the iteration defaults can't quietly leak into prod pipelines.

Summary of changes

- … `1-project-initialization-with-dab.md` (Workflows A + B), `2-rapid-iteration-with-cli.md` (Workflow C). Running-a-pipeline and refresh-mode guidance now explicitly distinguish bundle-deploy vs CLI-driven paths.
- … `FROM STREAM(table)` for table sources, `FROM STREAM read_files(...)` for function sources, `CREATE OR REFRESH` on every example. Fixed the broken `jq` for FAILED-update debugging so it extracts the actual exception body. Resolved contradictions around `@dp.view`, `apply_changes` aliases, `CREATE LIVE VIEW`, `sequence_by`, and `partition_cols`.
- … `import dlt`, decorators, reads, `apply_changes`, `LIVE.` prefix, `CREATE LIVE TABLE`, `partition_cols`, `input_file_name()`, and the `target=` parameter, with the `CREATE LIVE VIEW`-for-expectations carve-out explicitly noted.
- … `pyproject.toml` and option-table boilerplate into pointer-style content. ~40% line reduction overall, zero concept lost.
- … `streaming-table.md`, `expectations.md`, etc.); SKILL.md API tables now link straight at the `-python.md` / `-sql.md` siblings, with no broken subdirectory paths.
- … `CREATE OR REPLACE` rejection (standard SQL ≠ SDP), `dbfs:` prefix required for UC Volume paths, `CLUSTER BY` column-type rules with the `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` mapping, the `error.exceptions[0].message` debug-extraction pattern, the upstream-trace protocol for validation failures, and the Gold-layer preserve-dimensions guidance.
- … `development: true` + retry overrides for fast-fail iteration, alongside a clearly-labelled prod conversion. Mirrored consistently in the SDK alternative and the workflow file.

Reviewer aid

`REVIEW-NOTES-databricks-pipelines.md` at the repo root walks through every category of change with file/line citations and links to the Databricks docs sections that motivated each correctness fix.

This pull request and its description were written by Isaac.
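
For readers who have not opened the skill files, here is a minimal Python sketch of two ideas from the description: the fast-fail iteration defaults with their production conversion, and reading `error.exceptions[0].message` from a failed-update event. It is an illustration under stated assumptions, not the snippet shipped in the skill. The PR does not name the retry overrides, so the two configuration keys below are an assumption. The function names and the event shape (a plain dict following the `error.exceptions[0].message` path) are hypothetical.

```python
import copy
from typing import Optional

# ASSUMPTION: the PR only says "retry overrides set to "0"" and does not name
# the keys. These two pipeline configuration properties are a guess at them.
RETRY_OVERRIDES = {
    "pipelines.numUpdateRetryAttempts": "0",
    "pipelines.maxFlowRetryAttempts": "0",
}


def iteration_spec(name: str) -> dict:
    """Build a hypothetical create spec tuned for fast-fail dev iteration."""
    return {
        "name": name,
        "development": True,
        "configuration": dict(RETRY_OVERRIDES),
    }


def to_production(spec: dict) -> dict:
    """Return a copy of the spec with the iteration-only defaults dropped."""
    prod = copy.deepcopy(spec)
    prod.pop("development", None)
    config = prod.get("configuration", {})
    for key in RETRY_OVERRIDES:
        config.pop(key, None)
    if not config:
        # Nothing left besides the overrides, so drop the empty map too.
        prod.pop("configuration", None)
    return prod


def first_exception_message(event: dict) -> Optional[str]:
    """Safely read error.exceptions[0].message from one event dict."""
    exceptions = (event.get("error") or {}).get("exceptions") or []
    if not exceptions:
        return None
    return exceptions[0].get("message")
```

Keeping the overrides in one named constant means the production conversion removes exactly what the iteration spec added, which is one way to keep dev defaults from leaking into a production pipeline. `first_exception_message` returns `None` instead of raising when an event carries no error, so it can be applied to every event in a polling loop.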