skills(pipelines): tighten databricks-pipelines for correctness and density #102

Open
QuentinAmbard wants to merge 1 commit into databricks:main from QuentinAmbard:skills/databricks-pipelines-density-and-correctness
Conversation

@QuentinAmbard

Why this PR exists

The databricks-pipelines skill is the agent-facing reference for building Lakeflow Spark Declarative Pipelines. The initial port from ai-dev-kit got the content into the repo — this PR is the second pass that turns it into a tight, internally consistent tool an LLM can rely on in a fresh session: every claim grounded in the current Databricks docs, every workflow with one clear entry point, every concept stated once.

What this PR improves

One authoritative story per topic. SDP/LDP naming, the legacy-DLT migration map, the start-update polling pattern with error.exceptions[0].message extraction, the dev/iteration canonical create JSON — each now lives in exactly one place and is referenced from everywhere else that needs it. The DAB workflows (A and B) and the no-bundle CLI workflow (C) have dedicated detail files; the SKILL.md entry point gives a one-liner per option and links out. The Common Traps / Common Issues split is now intent-driven: Traps cover design-time decisions, Issues cover concrete error → fix mappings agents will copy-paste.
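As a rough illustration of the start-update polling pattern and the error.exceptions[0].message extraction mentioned above, here is a minimal Python sketch. It is not the skill's own snippet: `wait_for_update`, `first_exception_message`, the injected `get_state` callable, and the set of terminal states are assumptions made for this example. Only the error.exceptions[0].message path comes from the PR text.

```python
import time
from typing import Any, Callable, Optional

# Assumed terminal update states; verify against the Pipelines API reference.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_update(
    get_state: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 1800.0,
) -> str:
    """Poll get_state() until the update reaches a terminal state.

    get_state is a hypothetical callable supplied by the caller, for
    example a thin wrapper around a CLI or SDK "get update" call.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still {state} after {timeout_s}s")
        time.sleep(interval_s)


def first_exception_message(event: dict[str, Any]) -> Optional[str]:
    """Return event["error"]["exceptions"][0]["message"], or None if absent."""
    exceptions = (event.get("error") or {}).get("exceptions") or []
    return exceptions[0].get("message") if exceptions else None
```

Injecting `get_state` keeps the loop independent of whichever client fetches the update status, so the same sketch works for the bundle and CLI-driven paths.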

Verified against the current Databricks docs. Where official guidance is nuanced — CREATE TEMPORARY VIEW doesn't support CONSTRAINT clauses, so CREATE LIVE VIEW is retained as the only path for expectations on a temp view — that nuance is captured explicitly with a citation. STREAM(table) (function form, with parens) is normalized everywhere for table sources; STREAM read_files(...) (no extra parens) for function sources, matching the docs. @dp.view is correctly marked as legacy per the official temporary_view reference. sequence_by is documented as accepting both string and Column per the API.
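The STREAM normalization rule can be sketched as a small text rewrite. This is a hypothetical helper written for illustration, not something shipped in the skill: it rewrites the bareword form `FROM STREAM name` to the function form `FROM STREAM(name)` and leaves `STREAM read_files(...)` and any other function source untouched. The regex is deliberately simple and does not parse SQL.

```python
import re

# Matches `STREAM <identifier>` when the identifier is not followed by "(",
# i.e. a table name rather than a function call such as read_files(...).
_BAREWORD_STREAM = re.compile(
    r"\bSTREAM\s+(?!read_files\b)([A-Za-z_][\w.]*)(?![\w.])(?!\s*\()",
    re.IGNORECASE,
)


def normalize_stream(sql: str) -> str:
    """Rewrite `STREAM name` to `STREAM(name)` for table sources only."""
    return _BAREWORD_STREAM.sub(lambda m: f"STREAM({m.group(1)})", sql)


print(normalize_stream("SELECT * FROM STREAM bronze.orders"))
print(normalize_stream("SELECT * FROM STREAM read_files('/Volumes/raw', format => 'json')"))
```

The first call prints the parenthesized form; the second prints its input unchanged because the source is a function call.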

Dense and skimmable. Roughly 40% shorter (~5,700 → ~3,400 lines) with zero concept removed. SQL is canonical where SQL/Python differ only mechanically, with a one-line Python equivalent noted inline. Pattern-variation sections that repeated the same SQL with one different clause each are now single examples with the variations listed. Verbose boilerplate (full pyproject.toml, identical second/third examples of the same pattern) leans on agent world-knowledge instead of restating it. Less context to load, less drift between repeated explanations.

Tuned defaults for the loop agents actually run. The canonical pipelines create JSON ships with development: true and the retry overrides set to \"0\", so a doomed update fails fast (~30s) instead of retrying for 10+ min on serverless. A clearly-labelled "drop these for production" instruction is right next to the snippet so the iteration defaults can't quietly leak into prod pipelines.
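To make the dev-versus-prod split concrete, here is a hedged Python sketch of an iteration spec and a helper that strips the iteration defaults. The PR text does not name the retry override keys; `pipelines.numUpdateRetryAttempts` and `pipelines.maxFlowRetryAttempts` are assumed here, and the pipeline name and the `to_production` helper are hypothetical. The skill's canonical JSON may differ.

```python
import copy
import json

# Assumed property names for the retry overrides (not named in the PR text).
RETRY_OVERRIDES = {
    "pipelines.numUpdateRetryAttempts": "0",
    "pipelines.maxFlowRetryAttempts": "0",
}

# Iteration-friendly spec: development mode on, retries disabled.
dev_spec = {
    "name": "my_pipeline_dev",  # hypothetical name
    "serverless": True,
    "development": True,
    "configuration": dict(RETRY_OVERRIDES),
}


def to_production(spec: dict) -> dict:
    """Return a copy of spec without the fast-fail iteration defaults."""
    prod = copy.deepcopy(spec)
    prod.pop("development", None)
    config = prod.get("configuration", {})
    for key in RETRY_OVERRIDES:
        config.pop(key, None)
    return prod


print(json.dumps(to_production(dev_spec), indent=2))
```

Working on a deep copy means the dev spec stays intact for the next iteration while the production variant is derived from it.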

Summary of changes

| Area | What changed |
| --- | --- |
| Workflow narrative | Split into a self-contained A/B/C entry block in SKILL.md, plus dedicated detail files: 1-project-initialization-with-dab.md (Workflows A + B), 2-rapid-iteration-with-cli.md (Workflow C). Running-a-pipeline and refresh-mode guidance now explicitly distinguish bundle-deploy vs CLI-driven paths. |
| Correctness | Normalized SQL syntax against the current Databricks docs: FROM STREAM(table) for table sources, FROM STREAM read_files(...) for function sources, CREATE OR REFRESH on every example. Fixed the broken jq for FAILED-update debugging so it extracts the actual exception body. Resolved contradictions around @dp.view, apply_changes aliases, CREATE LIVE VIEW, sequence_by, and partition_cols. |
| API tables in SKILL.md | Added a Description column so every row tells the reader what the feature is, not just the syntax. Dropped the Python/SQL deprecation columns. Added a single canonical "Legacy DLT Syntax — always migrate" table covering import dlt, decorators, reads, apply_changes, LIVE. prefix, CREATE LIVE TABLE, partition_cols, input_file_name(), and the target= parameter, with the CREATE LIVE VIEW-for-expectations carve-out explicitly noted. |
| Density | Removed near-duplicate examples (same SQL with one different filter value), "Complete Example" sections that re-piled patterns shown individually above, and parallel SQL+Python code blocks for mechanically-translatable patterns. Compressed full pyproject.toml and option-table boilerplate into pointer-style content. ~40% line reduction overall, zero concept lost. |
| File layout | Deleted ten redirect-only "parent stub" files (streaming-table.md, expectations.md, etc.); SKILL.md API tables now link straight at the -python.md / -sql.md siblings, with no broken subdirectory paths. |
| High-value additions from a-d-k | Merged in SDP-specific gotchas that weren't in the original DAS port: CREATE OR REPLACE rejection (standard SQL ≠ SDP), dbfs: prefix required for UC Volume paths, CLUSTER BY column-type rules with the DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED mapping, the error.exceptions[0].message debug-extraction pattern, the upstream-trace protocol for validation failures, and the Gold-layer preserve-dimensions guidance. |
| Dev defaults | Canonical create JSON now ships with development: true + retry overrides for fast-fail iteration, alongside a clearly-labelled prod conversion. Mirrored consistently in the SDK alternative and the workflow file. |
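
The legacy-syntax list in the table above lends itself to a quick text scan. The following is a hypothetical sketch, not part of the skill: it flags a handful of the legacy markers named in the table and skips CREATE LIVE VIEW, in line with the carve-out for expectations on temporary views. The patterns are plain regexes and will produce false positives on comments or string literals.

```python
import re

# A subset of the legacy markers listed in the summary table.
LEGACY_PATTERNS = {
    "import dlt": re.compile(r"^\s*import\s+dlt\b", re.MULTILINE),
    "LIVE. prefix": re.compile(r"\bLIVE\.\w+"),
    "CREATE LIVE TABLE": re.compile(
        r"\bCREATE\s+(?:OR\s+REFRESH\s+)?LIVE\s+TABLE\b", re.IGNORECASE
    ),
    "input_file_name()": re.compile(r"\binput_file_name\s*\("),
    "partition_cols": re.compile(r"\bpartition_cols\b"),
}


def find_legacy_syntax(source: str) -> list[str]:
    """Return the names of legacy markers found in a SQL or Python source string."""
    return [name for name, pattern in LEGACY_PATTERNS.items() if pattern.search(source)]


print(find_legacy_syntax("CREATE LIVE TABLE t AS SELECT * FROM LIVE.src"))
print(find_legacy_syntax("CREATE LIVE VIEW v AS SELECT 1"))
```

The helper only reports which markers appear; what each one should be migrated to is left to the skill's own migration table.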

Reviewer aid

REVIEW-NOTES-databricks-pipelines.md at the repo root walks through every category of change with file/line citations and links to the Databricks docs sections that motivated each correctness fix.

This pull request and its description were written by Isaac.

Collaborator

@dustinvannoy-db left a comment

Rebase needed then address a few comments, it's looking really good though.

Comment thread REVIEW-NOTES-databricks-pipelines.md Outdated
Collaborator

Let's remove this before we merge PR.

Comment thread skills/databricks-pipelines/SKILL.md Outdated
Comment thread skills/databricks-pipelines/SKILL.md Outdated
Comment thread manifest.json
…ness and density

Rework of the databricks-pipelines skill: fixes correctness/consistency
issues introduced during the initial port from ai-dev-kit, restructures
the workflow narrative around DAB-vs-CLI iteration, and compresses
~40% of the line count without losing any concept.

See REVIEW-NOTES-databricks-pipelines.md at the repo root for the full
rationale and per-category change log.

Co-authored-by: Isaac
@QuentinAmbard force-pushed the skills/databricks-pipelines-density-and-correctness branch from 5d98735 to 75984e4 on June 4, 2026 11:31
@QuentinAmbard requested a review from a team as a code owner on June 4, 2026 11:31
@QuentinAmbard
Author

@dustinvannoy-db thanks for the review — all four addressed in the force-pushed commit (75984e4):

  • REVIEW-NOTES-databricks-pipelines.md — removed from the repo root.
  • SKILL.md:143 (Reading Data APIs row): SELECT ... FROM STREAM name → SELECT ... FROM STREAM(name). Also fixed the same bareword form on line 182 (the LIVE. legacy-migration row) for consistency.
  • SKILL.md:176 (@dlt.view row) — split into its own row pointing at @dp.temporary_view(...), with an explicit "the modern API has no view decorator, only temporary_view" note. The other decorators stayed in the "same decorator name on dp.*" row since that rule does hold for them.
  • manifest.json format — rebased onto current upstream/main, regenerated with python3 scripts/skills.py generate. The "version": "2" header is gone; manifest now matches the canonical format.

Rebase took one conflict on manifest.json (took upstream's version, then regenerated cleanly). python3 scripts/skills.py validate passes; zero broken links across the skill.

@dustinvannoy-db
Collaborator

I think this one is ready to merge based on my review and the test results.

I ran all levels of skill eval using ground_truth from https://github.com/databricks-solutions/ai-dev-kit/tree/main/.test/skills/databricks-spark-declarative-pipelines and that gave good results, though ground truth was a bit overspecific so I have since modified that.

Composite score: 83%
L5 results don't seem completely accurate for this run, we will keep improving evals.
[Image: databricks_pipelines_eval_summary]

@dustinvannoy-db self-requested a review on June 7, 2026 08:25
Collaborator

@dustinvannoy-db left a comment

LGTM, let's get this one approved and merged and sync other versions of the skill as needed.
