skills(pipelines): tighten databricks-pipelines for correctness and density by QuentinAmbard · Pull Request #102 · databricks/databricks-agent-skills

QuentinAmbard · 2026-05-27T15:43:59Z

Why this PR exists

The databricks-pipelines skill is the agent-facing reference for building Lakeflow Spark Declarative Pipelines. The initial port from ai-dev-kit got the content into the repo — this PR is the second pass that turns it into a tight, internally consistent tool an LLM can rely on in a fresh session: every claim grounded in the current Databricks docs, every workflow with one clear entry point, every concept stated once.

What this PR improves

One authoritative story per topic. SDP/LDP naming, the legacy-DLT migration map, the start-update polling pattern with error.exceptions[0].message extraction, the dev/iteration canonical create JSON — each now lives in exactly one place and is referenced from everywhere else that needs it. The DAB workflows (A and B) and the no-bundle CLI workflow (C) have dedicated detail files; the SKILL.md entry point gives a one-liner per option and links out. The Common Traps / Common Issues split is now intent-driven: Traps cover design-time decisions, Issues cover concrete error → fix mappings agents will copy-paste.
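As an illustration of the `error.exceptions[0].message` extraction mentioned above, here is a minimal Python sketch. It assumes only that a failed-update event can be loaded as a dictionary with that nested path. The helper name `first_exception_message` and the sample payload are hypothetical and are not taken from the skill files.

```python
from typing import Any, Optional


def first_exception_message(event: dict[str, Any]) -> Optional[str]:
    """Return event["error"]["exceptions"][0]["message"], or None if any level is missing."""
    error = event.get("error") or {}
    exceptions = error.get("exceptions") or []
    if not exceptions:
        return None
    return exceptions[0].get("message")


# Hypothetical event payload, shaped only after the path named in the PR text.
sample_event = {
    "event_type": "update_progress",
    "error": {"exceptions": [{"message": "Table 'bronze_orders' not found"}]},
}

print(first_exception_message(sample_event))
```

Returning `None` instead of raising keeps a polling loop simple: an event without an error body is skipped rather than treated as a failure.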

Verified against the current Databricks docs. Where official guidance is nuanced — CREATE TEMPORARY VIEW doesn't support CONSTRAINT clauses, so CREATE LIVE VIEW is retained as the only path for expectations on a temp view — that nuance is captured explicitly with a citation. STREAM(table) (function form, with parens) is normalized everywhere for table sources; STREAM read_files(...) (no extra parens) for function sources, matching the docs. @dp.view is correctly marked as legacy per the official temporary_view reference. sequence_by is documented as accepting both string and Column per the API.
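The `STREAM` normalization rule described above can be sketched as a small text rewrite. This is an illustrative, regex-based approximation and not the tooling used in the PR: the function name `normalize_stream` is hypothetical, and the pattern handles only plain (optionally dotted) identifiers, not quoted or backticked names.

```python
import re

# STREAM followed by whitespace and a bare identifier that is NOT a function call.
# The two lookaheads stop the identifier from being cut short and skip `name(`.
_BAREWORD_STREAM = re.compile(
    r"\b(STREAM)\s+([A-Za-z_][\w.]*)(?![\w.])(?!\s*\()",
    re.IGNORECASE,
)


def normalize_stream(sql: str) -> str:
    """Rewrite `STREAM name` to `STREAM(name)`; leave `STREAM read_files(...)` alone."""
    return _BAREWORD_STREAM.sub(lambda m: f"{m.group(1)}({m.group(2)})", sql)


print(normalize_stream("SELECT * FROM STREAM orders"))
print(normalize_stream("SELECT * FROM STREAM read_files('/Volumes/raw/orders')"))
```

Table sources end up in the function form with parentheses, while function sources such as `read_files(...)` keep the form without extra parentheses.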

Dense and skimmable. Roughly 40% shorter (~5,700 → ~3,400 lines) with zero concept removed. SQL is canonical where SQL/Python differ only mechanically, with a one-line Python equivalent noted inline. Pattern-variation sections that repeated the same SQL with one different clause each are now single examples with the variations listed. Verbose boilerplate (full pyproject.toml, identical second/third examples of the same pattern) leans on agent world-knowledge instead of restating it. Less context to load, less drift between repeated explanations.

Tuned defaults for the loop agents actually run. The canonical pipelines create JSON ships with development: true and the retry overrides set to "0", so a doomed update fails fast (~30s) instead of retrying for 10+ min on serverless. A clearly-labelled "drop these for production" instruction is right next to the snippet so the iteration defaults can't quietly leak into prod pipelines.
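A hedged sketch of what "iteration defaults plus a production conversion" could look like when the create payload is built in Python. The PR text does not name the retry settings, so the two configuration keys below are an assumption, as are the helper names `dev_pipeline_spec` and `to_production`. The snippet in the skill itself is the authoritative version.

```python
import copy
import json

# Assumed key names: the PR does not spell out which retry settings it overrides.
RETRY_OVERRIDES = {
    "pipelines.numUpdateRetryAttempts": "0",
    "pipelines.maxFlowRetryAttempts": "0",
}


def dev_pipeline_spec(name: str) -> dict:
    """Build a create payload tuned for fast-fail iteration."""
    return {
        "name": name,
        "serverless": True,
        "development": True,
        "configuration": dict(RETRY_OVERRIDES),
    }


def to_production(spec: dict) -> dict:
    """Return a copy with the iteration-only defaults dropped."""
    prod = copy.deepcopy(spec)
    prod["development"] = False
    config = prod.get("configuration", {})
    for key in RETRY_OVERRIDES:
        config.pop(key, None)
    if not config:
        # Nothing left to configure, so drop the empty block entirely.
        prod.pop("configuration", None)
    return prod


dev_spec = dev_pipeline_spec("orders_pipeline")
print(json.dumps(to_production(dev_spec), indent=2))
```

Keeping the overrides in one named constant makes the "drop these for production" step mechanical, so the iteration settings are removed as a group rather than by hand.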

Summary of changes

| Area | What changed |
| --- | --- |
| Workflow narrative | Split into a self-contained A/B/C entry block in SKILL.md, plus dedicated detail files: `1-project-initialization-with-dab.md` (Workflows A + B), `2-rapid-iteration-with-cli.md` (Workflow C). Running-a-pipeline and refresh-mode guidance now explicitly distinguish bundle-deploy vs CLI-driven paths. |
| Correctness | Normalized SQL syntax against the current Databricks docs: `FROM STREAM(table)` for table sources, `FROM STREAM read_files(...)` for function sources, `CREATE OR REFRESH` on every example. Fixed the broken `jq` for FAILED-update debugging so it extracts the actual exception body. Resolved contradictions around `@dp.view`, `apply_changes` aliases, `CREATE LIVE VIEW`, `sequence_by`, and `partition_cols`. |
| API tables in SKILL.md | Added a Description column so every row tells the reader what the feature is, not just the syntax. Dropped the Python/SQL deprecation columns. Added a single canonical "Legacy DLT Syntax — always migrate" table covering `import dlt`, decorators, reads, `apply_changes`, `LIVE.` prefix, `CREATE LIVE TABLE`, `partition_cols`, `input_file_name()`, and the `target=` parameter, with the `CREATE LIVE VIEW`-for-expectations carve-out explicitly noted. |
| Density | Removed near-duplicate examples (same SQL with one different filter value), "Complete Example" sections that re-piled patterns shown individually above, and parallel SQL+Python code blocks for mechanically-translatable patterns. Compressed full `pyproject.toml` and option-table boilerplate into pointer-style content. ~40% line reduction overall, zero concept lost. |
| File layout | Deleted ten redirect-only "parent stub" files (`streaming-table.md`, `expectations.md`, etc.); SKILL.md API tables now link straight at the `-python.md` / `-sql.md` siblings, with no broken subdirectory paths. |
| High-value additions from a-d-k | Merged in SDP-specific gotchas that weren't in the original DAS port: `CREATE OR REPLACE` rejection (standard SQL ≠ SDP), `dbfs:` prefix required for UC Volume paths, `CLUSTER BY` column-type rules with the `DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED` mapping, the `error.exceptions[0].message` debug-extraction pattern, the upstream-trace protocol for validation failures, and the Gold-layer preserve-dimensions guidance. |
| Dev defaults | Canonical create JSON now ships with `development: true` + retry overrides for fast-fail iteration, alongside a clearly-labelled prod conversion. Mirrored consistently in the SDK alternative and the workflow file. |

Reviewer aid

REVIEW-NOTES-databricks-pipelines.md at the repo root walks through every category of change with file/line citations and links to the Databricks docs sections that motivated each correctness fix.

This pull request and its description were written by Isaac.

dustinvannoy-db

A rebase is needed, and then a few comments need addressing. It's looking really good, though.

dustinvannoy-db · 2026-06-02T21:25:19Z

Let's remove this before we merge the PR.

…ness and density Rework of the databricks-pipelines skill: fixes correctness/consistency issues introduced during the initial port from ai-dev-kit, restructures the workflow narrative around DAB-vs-CLI iteration, and compresses ~40% of the line count without losing any concept. See REVIEW-NOTES-databricks-pipelines.md at the repo root for the full rationale and per-category change log. Co-authored-by: Isaac

QuentinAmbard · 2026-06-04T11:33:09Z

@dustinvannoy-db thanks for the review — all four addressed in the force-pushed commit (75984e4):

REVIEW-NOTES-databricks-pipelines.md — removed from the repo root.
SKILL.md:143 (Reading Data APIs row) — SELECT ... FROM STREAM name → SELECT ... FROM STREAM(name). Also fixed the same bareword form on line 182 (the LIVE. legacy-migration row) for consistency.
SKILL.md:176 (@dlt.view row) — split into its own row pointing at @dp.temporary_view(...), with an explicit "the modern API has no view decorator, only temporary_view" note. The other decorators stayed in the "same decorator name on dp.*" row since that rule does hold for them.
manifest.json format — rebased onto current upstream/main, regenerated with python3 scripts/skills.py generate. The "version": "2" header is gone; manifest now matches the canonical format.

Rebase took one conflict on manifest.json (took upstream's version, then regenerated cleanly). python3 scripts/skills.py validate passes; zero broken links across the skill.
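The decorator rule described in the SKILL.md:176 item can be sketched as a two-step text rewrite. This is an illustration only: `migrate_decorators` is a hypothetical helper, and it covers just the decorator prefix rule stated above (`@dlt.view` becomes `@dp.temporary_view`, other decorators keep their name under `dp`), not imports or any other legacy syntax.

```python
import re


def migrate_decorators(source: str) -> str:
    """Rewrite legacy @dlt.* decorators to their dp.* form."""
    # Special case first: the view decorator is renamed, not just re-prefixed.
    source = re.sub(r"@dlt\.view\b", "@dp.temporary_view", source)
    # Remaining decorators keep the same name under the dp namespace.
    return re.sub(r"@dlt\.", "@dp.", source)


legacy = "@dlt.view(name='recent_orders')\ndef recent_orders():\n    ...\n"
print(migrate_decorators(legacy))
```

The order of the two substitutions matters: running the generic prefix rewrite first would turn `@dlt.view` into `@dp.view`, which is the form the modern API does not provide.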

dustinvannoy-db · 2026-06-07T08:24:33Z

I think this one is ready to merge based on my review and the test results.

I ran all levels of skill eval using ground_truth from https://github.com/databricks-solutions/ai-dev-kit/tree/main/.test/skills/databricks-spark-declarative-pipelines and that gave good results, though ground truth was a bit overspecific so I have since modified that.

Composite score: 83%
L5 results don't seem completely accurate for this run; we will keep improving the evals.

dustinvannoy-db

LGTM, let's get this one approved and merged and sync other versions of the skill as needed.

QuentinAmbard requested review from a team, camielstee-db, lennartkats-db and simonfaltum as code owners May 27, 2026 15:44

dustinvannoy-db requested changes Jun 2, 2026

QuentinAmbard force-pushed the skills/databricks-pipelines-density-and-correctness branch from 5d98735 to 75984e4 Compare June 4, 2026 11:31

QuentinAmbard requested a review from a team as a code owner June 4, 2026 11:31

dustinvannoy-db self-requested a review June 7, 2026 08:25

dustinvannoy-db approved these changes Jun 7, 2026

QuentinAmbard wants to merge 1 commit into databricks:main from QuentinAmbard:skills/databricks-pipelines-density-and-correctness
