experimental: add databricks-lakeflow-connect skill #103
Initial scope-first commit for a draft PR. GA-first deep coverage, PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports, ServiceNow, GA4, HubSpot, Confluence; all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem + gateway pattern intro; GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC, query-based databases, Foreign Catalog query-based, SFTP) are production-supported and listed in SKILL.md; deep coverage will be added incrementally as PuPr connectors stabilize. SharePoint/Google Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors are not first-class in this skill. references/3-file-and-streaming-connectors.md will be created when SFTP + SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Talked with @jralfonsog; he will add the GA connector references he has been working on to this PR before we review and finalize.
Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Deep coverage for SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, and gateway-specific gotchas. Brief pointer to Public Preview database connectors (Postgres/MySQL CDC, query-based, Foreign Catalog) pending deep coverage as they stabilize. Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers (public docs hub, connector reference, workspace support). Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Skillforge full evaluation (L1 - L5): composite 0.76, PASS
After all four content commits landed, ran the full Skillforge pyramid via the
Pyramid summary
L3 audit trajectory across commits
L5 classification (95 checks across 8 test cases)
Per-case L5 response score
The two lowest-scoring cases (
L1+L2+L3 quick-eval baseline (run separately before L4/L5): composite 0.85 (L1=0.72, L2=1.00, L3=0.84).
Got some things to change and others to review, such as whether the GA connector examples should really use PREVIEW rather than CURRENT. Overall looking good.
Also, any links to other skills should match what stable skills are using, which is just the name without a link.
SUGGESTION: Align with the dominant convention. In Related Skills and inline mentions:
REPLACE: **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
WITH: **databricks-pipelines**
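The replacement above could be applied mechanically across the skill files. A minimal sketch, assuming every cross-skill reference has the exact shape shown in the REPLACE line (bold text wrapping a link that ends in `SKILL.md`); the function name `unlink_skill_references` is hypothetical and not part of `scripts/skills.py`:

```python
import re

# Matches a bold markdown link to another skill's SKILL.md, for example
# **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
# Assumption: cross-skill links always point at a path ending in SKILL.md.
_BOLD_SKILL_LINK = re.compile(r"\*\*\[([^\]]+)\]\([^)]*SKILL\.md\)\*\*")


def unlink_skill_references(markdown: str) -> str:
    """Replace bold skill links with the plain bold skill name."""
    return _BOLD_SKILL_LINK.sub(r"**\1**", markdown)


if __name__ == "__main__":
    sample = (
        "See **[databricks-pipelines]"
        "(../../skills/databricks-pipelines/SKILL.md)** for details."
    )
    print(unlink_skill_references(sample))
```

Links that are not bold, or that point elsewhere, are left untouched, so the pass is safe to run over files that mix both kinds.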
@jralfonsog can you attach the generated report.html as well?
Per @dustinvannoy-db review:
- Add frontmatter parent / compatibility / metadata.version (aligns with databricks#105)
- Drop channel from GA examples; GA connectors default to CURRENT (docs: ServiceNow/Salesforce GA omit channel, Confluence Beta uses PREVIEW). Reframe guidance to PREVIEW-only-for-Public-Preview/Beta.
- Cross-skill references use plain bold names (stable-skill convention)
- Remove the GA target date column and other specific dates (avoid maintenance churn)
- Drop Zerobus from the connector catalog (separate skill; kept as a related-skill and decision-tree cross-reference)
- Consolidate the SKILL.md Common Issues table into a pointer to the troubleshooting reference
- Remove the top Documentation block (duplicated in Resources)
- Fix openai.yaml default_prompt grammar and casing

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Thanks for the thorough review. Pushed 13c5fc6 addressing everything. Summarizing here rather than replying on each thread, hope it's OK @dustinvannoy-db :
@auschoi96 attached the Skillforge report.html (zipped, since GitHub does not accept raw .html). I redacted the workspace URL and one local path the L5 run captured; the scores and judge feedback are intact.
dustinvannoy-db
left a comment
LGTM, let's confirm with @auschoi96 and someone from eng before we merge.
@auschoi96 Added you as reviewer since I'd like confirmation that the L4 and L5 results aren't anything to worry about here. We can always address some things in a follow-up PR if needed.
simonfaltum
left a comment
We should use the up-to-date product names where possible
… Automation Bundles)
Per @simonfaltum review: rename Databricks Asset Bundles to Declarative Automation Bundles in Required Tools and the bundle authoring section. Keeps the DAB acronym and the databricks-dabs skill reference.
Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Good call, thanks. Updated to "Declarative Automation Bundles" in both spots (Required Tools and the bundle authoring section) in afc23f7. Kept the "DAB" acronym and the databricks-dabs skill reference as-is since the acronym still fits.
Went through the skill end to end. Overall it's in good shape: it lands governed UC Delta tables, leans on SDP + Auto-CDC + UC, and makes DABs the production authoring path. The highest-leverage improvements are in completing the authoring loop and removing the one dead-end, not adding scope. Ranked:

1. Add the run + poll commands so the workflow actually completes
Add the run + poll commands, anchored on the DAB path (also the recommended path):

    databricks bundle deploy -t dev
    databricks pipelines run salesforce_ingestion # KEY = pipeline name in YAML; waits by default
    databricks pipelines run salesforce_ingestion --no-wait
    databricks pipelines get-update --pipeline-id <id> # poll run status

Worth stating the asymmetry explicitly:
Small, additive, no eval needed.

2. Turn the OAuth U2M "plan for a human step" into a concrete hand-off + resume
Give the agent an exact hand-off plus a resume check:

    databricks connections get <connection_name> # poll until "status": "READY"

This closes a loop the skill already half-opens (the
Small, additive.

3. Add the missing good/bad anti-pattern blocks
The troubleshooting tables are good, but there are no side-by-side correct-vs-wrong code blocks, and three are already in the skill as prose warnings that just need converting:

4. Move a compact "is Lakeflow Connect the right tool?" gate up into SKILL.md
The Lakeflow-Connect-vs-Auto-Loader-vs-Federation-vs-Delta-Sharing tree in
This changes routing, so validate with an eval / activation run before merging (routing tweaks are easy to regress).

5. Anchor "Lakeflow Spark Declarative Pipelines" to DLT

6. Tighten the description
The frontmatter

Minor

Nothing here is blocking. #1 and #2 are the two I'd prioritize: they turn a skill that describes the workflow into one that completes it.
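The "poll until READY" resume check in item 2 is a plain poll-with-timeout loop. A minimal sketch, assuming the caller supplies a `fetch_status` callable (hypothetical; in practice it might wrap `databricks connections get <connection_name>` and read the `status` field) and that `"READY"` is the terminal value as quoted in the comment above. The timeout and interval defaults are illustrative, not taken from the skill:

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str = "READY",
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_status() until it returns target or the timeout elapses.

    Returns True if the target status was observed, False on timeout.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated connection that becomes READY on the third poll.
    statuses = iter(["PENDING", "PENDING", "READY"])
    print(wait_for_status(lambda: next(statuses), sleep=lambda _s: None))
```

Returning a boolean instead of raising on timeout lets an agent fall back to the human hand-off message when the OAuth step has not been completed yet.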
simonfaltum
left a comment
Skill review: databricks-lakeflow-connect
Reviewed the full PR head in a worktree (not from memory), verifying load-bearing commands and fields against the Go SDK and live CLI. Verdict: fix-then-merge. 2 inline P1s, the rest P2/nit. No P0s.
The API homework is the strong point and I verified these are all real: gateway_definition, ingestion_definition, gateway_storage_{catalog,schema,name}, ingestion_gateway_id, connection_name, source_/destination_*, channel; the DAB pipeline resource supports ingestion_definition/gateway_definition; and databricks pipelines create --json + databricks connections create exist. 'Declarative Automation Bundles' is the current official name (confirmed via databricks bundle --help, the databricks-dabs skill, and PR #34) and is correct, not a regression.
Fix before merge (P1):
- SKILL.md:80: CLI version (v1.0.0+) contradicts the frontmatter (v0.294.0).
- SKILL.md:161 + references/4-ingestion-decision-tree.md:60: `applyAsChangesFrom` is not a real identifier.
Worth your eyes (P2, inline): channel semantics (described as connector-GA, but it is the SDP runtime channel; please verify against connector docs, could be P1); db_owner grant ordering; continuous: false 'cron block'; description scope/length.
Checked and clean: python3 scripts/skills.py validate passes against the PR tree; all relative links resolve and all 6 referenced sibling skills exist; databricks-sdk>=0.85.0 is real (SDK at 0.105.0); strategy is strong (governed Delta tables via managed pipelines + DABs, idiomatic primitives, concrete escape hatches in the decision tree, all-public surface); patterns present (troubleshooting tables, decision table + 'do not use when', defer-to-docs, separation of concerns, gateway diagram, length parity). Eval evidence in the PR (stf 8.2 to 8.7, Skillforge 0.76 PASS) is exactly what this repo asks for.
Deferred (already tracked): 3-file-and-streaming-connectors.md + PuPr deep coverage.
| ## Required Tools | ||
| - **Databricks CLI v1.0.0+** for `databricks pipelines create` and `databricks connections create`. Verify with `databricks --version`. |
P1 (CR-5/CR-12): contradicts the frontmatter. Line 4 sets compatibility: ... >= v0.294.0, but this says v1.0.0+. An agent reads top-down and will tell the user to install v1.0.0+ right after passing the v0.294.0 gate. >= v0.294.0 is the repo's most common compatibility value; v1.0.0+ is the outlier. Both commands (pipelines create --json, connections create) are verified working and long predate v1.0.0, so the real minimum is far lower. Fix: change this to v0.294.0+ to match the frontmatter (or drop the inline number).
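To make the version gate concrete, the floor should be compared numerically, not as a string. A minimal sketch, assuming `databricks --version` prints a line containing a `vMAJOR.MINOR.PATCH` token (the exact output format is an assumption); `parse_cli_version` and `meets_floor` are hypothetical helper names:

```python
import re
from typing import Tuple


def parse_cli_version(text: str) -> Tuple[int, ...]:
    """Extract the first MAJOR.MINOR.PATCH triple from text as integers."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_floor(version_output: str, floor: str = "0.294.0") -> bool:
    """True if the reported CLI version is at or above the floor.

    Tuples compare element by element, so 0.294.0 correctly sorts above
    0.30.0, which a plain string comparison would get wrong.
    """
    return parse_cli_version(version_output) >= parse_cli_version(floor)


if __name__ == "__main__":
    print(meets_floor("Databricks CLI v0.294.0"))
    print(meets_floor("Databricks CLI v0.205.2"))
```

With a single floor constant, the frontmatter value and the Required Tools line cannot drift apart the way v0.294.0 and v1.0.0+ did.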
| - **UC `CONNECTION` is the credential anchor** — every Lakeflow Connect pipeline points at a UC connection. The connection owns the auth; the pipeline references it by name. | ||
| - **Serverless ingestion pipeline + (optional) classic gateway** — SaaS connectors are pure serverless. Database connectors split into a customer-network gateway (classic) and a serverless ingestion pipeline (Delta-bound). | ||
| - **CDC and schema evolution are built in** — for sources that support change tracking or CDC, the connector applies changes incrementally and evolves the target schema. Data-type changes typically require a full snapshot reload. | ||
| - **Streaming Delta output** — destination tables are governed Delta tables with `applyAsChangesFrom` semantics for CDC sources. Compatible with downstream materialized views and Spark streaming. |
P1 (CR-4): applyAsChangesFrom is not a real identifier. Verified against the SDK and the databricks-pipelines skill: the real APIs are apply_changes() / apply_changes_from_snapshot() / create_auto_cdc_flow() (Python) and APPLY CHANGES INTO / AUTO CDC INTO (SQL). Your own error code APPLY_CHANGES_FROM_SNAPSHOT_ERROR points at the real concept. It is a code-shaped token in backticks, so an agent may emit it and fail. Fix e.g.: 'applied with CDC semantics (APPLY CHANGES / AUTO CDC; apply_changes_from_snapshot for snapshot sources)'. Same fix needed at references/4-ingestion-decision-tree.md:60.
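A check like this could be automated so that invented code-shaped tokens do not slip back in. A rough sketch, assuming the allowlist is exactly the identifiers named in the comment above; the function name and the "CDC-shaped" regex are hypothetical, and this is not an existing check in `scripts/skills.py`:

```python
import re
from typing import List

# Identifiers named in the review comment as real (Python and SQL forms),
# plus the error code that the skill already references.
KNOWN_CDC_TOKENS = {
    "apply_changes",
    "apply_changes_from_snapshot",
    "create_auto_cdc_flow",
    "APPLY CHANGES",
    "APPLY CHANGES INTO",
    "AUTO CDC",
    "AUTO CDC INTO",
    "APPLY_CHANGES_FROM_SNAPSHOT_ERROR",
}

_BACKTICKED = re.compile(r"`([^`]+)`")
# Heuristic (an assumption): a token "looks like" a CDC identifier if it
# contains apply...changes or auto...cdc, ignoring case and separators.
_CDC_SHAPED = re.compile(r"apply[\w ]*changes|auto[\w ]*cdc", re.IGNORECASE)


def unknown_cdc_tokens(markdown: str) -> List[str]:
    """Return backticked CDC-shaped tokens that are not on the allowlist."""
    unknown = []
    for token in _BACKTICKED.findall(markdown):
        name = token.strip().rstrip("()")  # treat `apply_changes()` as the name
        if _CDC_SHAPED.search(name) and name not in KNOWN_CDC_TOKENS:
            unknown.append(token)
    return unknown


if __name__ == "__main__":
    print(unknown_cdc_tokens("Delta tables with `applyAsChangesFrom` semantics"))
```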
| - You need a governed Delta copy in your lakehouse for performance, ML training, or downstream pipelines. | ||
| - Query volume against the source data is high. | ||
| - The source is performance-sensitive (you don't want to add query load to your production OLTP). | ||
| - You need point-in-time history (CDC into a Delta table with `applyAsChangesFrom`). |
P1 (CR-4): same invented identifier as SKILL.md:161. applyAsChangesFrom is not real; use apply_changes_from_snapshot / APPLY CHANGES here.
| - `ingestion_definition.connection_name` — the UC connection name (not URL, not ID). | ||
| - `objects[].table` — one entry per source table. Use `objects[].schema` to ingest a whole source schema in one block. | ||
| - `channel` — omit for GA connectors (defaults to `CURRENT`). Set `channel: PREVIEW` only for connectors still in Public Preview / Beta. |
P2 (CR-5/CR-10), please verify. The SDK documents channel as the SDP runtime release channel ('SDP Release Channel that specifies which version to use'), not a per-connector GA indicator. Framing it as 'omit for GA connectors / set PREVIEW for preview connectors / switch to CURRENT once the connector is GA' gives the agent a wrong mental model (e.g. that CURRENT 'promotes' a connector). The same framing recurs at line 133 and references/5-troubleshooting-and-monitoring.md:40. Fix: describe channel as the runtime channel; if preview connectors require the preview runtime, say 'some preview connectors are only available in the PREVIEW runtime channel'. If preview connectors do not actually require channel: PREVIEW, this is wrong in 3 places and becomes P1, worth confirming against the connector docs.
I'm not sure about this one. @jralfonsog Please check if valid and disregard if it's incorrect here.
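One way to keep the `channel` framing honest is to treat it as an optional runtime setting on the pipeline spec, not a property of the connector. A minimal sketch of a payload for `databricks pipelines create --json`, assuming the `objects[].table` keys `source_schema`, `source_table`, `destination_catalog`, and `destination_schema` (the thread only confirms `source_*` / `destination_*` in general); the builder name and all example values are hypothetical, and the payload is not validated against the API:

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def build_ingestion_pipeline_spec(
    name: str,
    connection_name: str,
    tables: List[Tuple[str, str]],
    destination_catalog: str,
    destination_schema: str,
    channel: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a pipeline spec dict with an ingestion_definition block.

    channel is left out entirely unless the caller passes it, so the
    pipeline runs on the default runtime channel. Pass "PREVIEW" only when
    a connector is documented as requiring the preview runtime.
    """
    objects = [
        {
            "table": {
                "source_schema": source_schema,
                "source_table": source_table,
                "destination_catalog": destination_catalog,
                "destination_schema": destination_schema,
            }
        }
        for source_schema, source_table in tables
    ]
    spec: Dict[str, Any] = {
        "name": name,
        "ingestion_definition": {
            "connection_name": connection_name,  # UC connection name, not URL or ID
            "objects": objects,
        },
    }
    if channel is not None:
        spec["channel"] = channel
    return spec


if __name__ == "__main__":
    spec = build_ingestion_pipeline_spec(
        name="salesforce_ingestion",
        connection_name="my_salesforce_connection",  # hypothetical
        tables=[("objects", "Account")],             # hypothetical
        destination_catalog="main",
        destination_schema="salesforce",
    )
    print(json.dumps(spec, indent=2))
```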
| Prerequisites: | ||
| 1. **SQL Server 2012+** (cloud-managed: Azure SQL DB, Azure SQL MI, RDS for SQL Server). | ||
| 2. **A dedicated database user** with `db_owner` on the source database, or the minimum grants for CT/CDC (see the connector reference). |
P2 (CR-8): leads with a broad grant. Secure-by-default wants least privilege first. This also conflates two roles: one-time CDC enablement (may legitimately need elevated rights, done by a DBA) vs the connector's ongoing read user (should be minimal). Fix: lead with the minimum CT/CDC grants for the runtime user; mention db_owner only as the broad-but-simple option, noted as needed transiently to enable CDC.
| - **OAuth U2M** connections (Salesforce, ServiceNow, HubSpot, Confluence) must be created in Catalog Explorer — the OAuth handshake requires a browser. CLI and DAB cannot bootstrap U2M. | ||
| - **API-key / basic / refresh-token** connections (Workday Reports, GA4 via service account, ServiceNow basic) can be created with `databricks connections create` or a DAB resource. | ||
| 2. **Create the ingestion pipeline** with `databricks pipelines create --json` (or DAB). The pipeline carries the `ingestion_definition` block that names the connection and lists the source objects to land. | ||
| 3. **Schedule the pipeline**. Lakeflow Connect supports triggered runs only — schedule with a Jobs `pipeline_task` or with the pipeline's own `continuous: false` cron block. |
P2 (CR-5): continuous: false is not a 'cron block'. It only selects triggered mode and carries no schedule; scheduling is via a Jobs pipeline_task (your row at references/2-database-connectors.md:130 states this correctly). As written, an agent may think setting continuous: false schedules the pipeline and skip the Jobs task, leaving a pipeline that never runs on cadence. Same at SKILL.md:142. Fix: drop 'cron block'; say triggered pipelines are scheduled by a Jobs pipeline_task.
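Since `continuous: false` carries no schedule, the cadence has to live on a job that wraps the pipeline. A minimal sketch of such a job spec, assuming the Jobs API field names `schedule.quartz_cron_expression`, `schedule.timezone_id`, and `tasks[].pipeline_task.pipeline_id`; the builder name, job name, task key, and cron expression are hypothetical:

```python
import json
from typing import Any, Dict


def build_scheduled_job_spec(
    job_name: str,
    pipeline_id: str,
    quartz_cron: str,
    timezone_id: str = "UTC",
) -> Dict[str, Any]:
    """Build a job spec that triggers one pipeline update on a cron cadence.

    The schedule belongs to the job; the pipeline itself only needs to be
    in triggered mode (continuous: false).
    """
    return {
        "name": job_name,
        "schedule": {
            "quartz_cron_expression": quartz_cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
        "tasks": [
            {
                "task_key": "run_ingestion",  # hypothetical task key
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }


if __name__ == "__main__":
    job = build_scheduled_job_spec(
        job_name="salesforce_ingestion_hourly",  # hypothetical
        pipeline_id="<pipeline-id>",
        quartz_cron="0 0 * * * ?",               # top of every hour (Quartz syntax)
    )
    print(json.dumps(job, indent=2))
```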
| @@ -0,0 +1,186 @@ | |||
| --- | |||
| name: databricks-lakeflow-connect | |||
| description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing." | |||
P2 (CR-11/ST-6): two small things on the routing description. (1) It headlines 'file sources (SharePoint, Google Drive, SFTP)', but the skill defers all file connectors (SFTP brief, SharePoint/Drive Beta, no 3-... ref), so it fires for an intent it punts to docs. It is honest (routes to the connector reference), but consider trimming file sources from the primary triggers. (2) At ~612 chars it is the 2nd-longest description in the repo (target is <=200; several siblings are also 480-762, so not anomalous). Trigger-rich, so low priority; trim if convenient.
| ```sql | ||
| SELECT timestamp, level, message, error | ||
| FROM event_log("<pipeline-id-or-name>") |
Nit: the event_log() TVF takes a pipeline ID (or a TABLE(<streaming_table>) reference), not a pipeline name; the arg shown as <pipeline-id-or-name> may mislead. Worth confirming the name form is accepted; if not, use <pipeline-id>.
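If the ID-only form is what the skill settles on, a small guard can reject pipeline names before the query is sent. A minimal sketch, assuming pipeline IDs are UUID-formatted strings (an assumption worth confirming); the `ORDER BY` and `LIMIT` clauses are additions for illustration and are not in the snippet under review, and `event_log_query` is a hypothetical helper name:

```python
import uuid


def event_log_query(pipeline_id: str, limit: int = 100) -> str:
    """Build an event_log() query for a pipeline ID.

    Raises ValueError if pipeline_id is not a well-formed UUID, which
    catches the common mistake of passing a pipeline name instead.
    """
    canonical = str(uuid.UUID(pipeline_id))  # normalizes to hyphenated lowercase
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT timestamp, level, message, error\n"
        f'FROM event_log("{canonical}")\n'
        "ORDER BY timestamp DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(event_log_query("123e4567-e89b-12d3-a456-426614174000", limit=20))
```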
…nnel reframe, complete authoring loop)

P1 (must-fix before merge):
- CLI version floor v1.0.0+ -> v0.294.0+ to match the compatibility frontmatter
- Replace invented `applyAsChangesFrom` with real CDC identifiers (`APPLY CHANGES` / AUTO CDC, `apply_changes_from_snapshot`) in SKILL.md and the decision tree

Channel reframe (P2, both reviewers; verified against docs):
- Describe `channel` as the SDP runtime release channel, not a connector-GA switch, in 4 places. Preview connectors require PREVIEW because they run on the preview runtime; GA connectors run on CURRENT (omit)

Other P2 + nit:
- Lead SQL Server grants with least-privilege CT/CDC; `db_owner` only as the broad transient option
- Drop 'cron block' wording: triggered pipelines are scheduled by a Jobs `pipeline_task`
- Trim the description to its routing triggers (regenerate manifest)
- `event_log()` takes a pipeline ID, not a name

Additive (Simon-ranked):
- Add run + poll commands (`bundle run`, `start-update`/`get-update`)
- Concrete OAuth U2M hand-off + `connections get` resume check
- Good/bad anti-pattern blocks (FROM CONNECTION, libraries, continuous:true)
- Anchor SDP to 'formerly Delta Live Tables / DLT' on first mention
- Downstream next-step pointing to databricks-pipelines for the medallion build

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Hi team, thanks again for the careful review. Pushed 1ad42f6 addressing the round. Summarizing here rather than per thread:
P1 (both fixed)
channel (P2, verified)
Other P2 + nit
Your ranked improvements
Deferred to a follow-up PR
Let me know if anything else needs a pass.
… gate
Lift a compact use/do-not-use decision gate to the top of SKILL.md (files -> Auto Loader, zero-copy -> Lakehouse Federation, push -> Zerobus, partner share -> Delta Sharing), keeping the pointer to the full ingestion decision tree. Trim the file-source triggers from the description since the skill defers file connectors. Addresses Simon's review item databricks#4. Validated with a full Skillforge re-eval in authoring mode (composite 0.76 -> 0.83, L4 thinking 0.68 -> 0.94, gaps clean, skill-invocation judge pass), so the routing change is regression-checked. Regenerate manifest for the trimmed description.
Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Summary
New experimental skill `databricks-lakeflow-connect` for managed ingestion pipelines. GA-first deep coverage; PuPr connectors are listed in `SKILL.md` as production-supported with deep coverage planned as they stabilize. No `databricks-pipelines` overlap: Lakeflow Connect pipelines reuse the pipelines API surface via `ingestion_definition`, and this skill cross-links to `skills/databricks-pipelines/` from the decision tree and Related Skills.

Changes
- `experimental/databricks-lakeflow-connect/SKILL.md` (~200 lines): routing + 3-tier catalog (GA / PuPr / Beta-PrPr) + workflow + key concepts + common issues.
- `experimental/databricks-lakeflow-connect/references/1-saas-connectors.md` (~135 lines): six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas.
- `experimental/databricks-lakeflow-connect/references/2-database-connectors.md` (~145 lines): SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, gateway-specific gotchas, brief pointer to PuPr database connectors.
- `experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md` (~130 lines): Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches. Cross-links to the Auto Loader work in databricks-solutions/ai-dev-kit#539.
- `experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md` (~50 lines): event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers.
- `experimental/databricks-lakeflow-connect/agents/openai.yaml` + `assets/databricks.{svg,png}`: auto-generated via `scripts/skills.py generate`.
- `manifest.json`: updated by `scripts/skills.py generate` to register the new skill and its references.

SharePoint / Google Drive (Beta as of May 2026; GA target Jun 1) are not first-class in v1; they appear in the Beta/PrPr note in `SKILL.md`. `databricks-zerobus-ingest` is pointed to from the catalog and decision tree (push-vs-pull dichotomy), not re-covered.

To follow
- `references/3-file-and-streaming-connectors.md`: created when SFTP + SharePoint/Drive get deep coverage

Cross-repo
- #ai-dev-kit-team Slack on 2026-05-27; maintainers signed off on Databricks Agent Skills `experimental/` as the destination.

Test plan
- `python3 scripts/skills.py generate` clean.
- `python3 scripts/skills.py validate` passes (`Everything is up to date.`).
- `skills/databricks-pipelines/`, `skills/databricks-dabs/`, `skills/databricks-jobs/`, `experimental/databricks-zerobus-ingest/`, `experimental/databricks-unity-catalog/`).
- `stf audit` L3 trajectory across commits: 8.2 → 8.3 → 8.5 → 8.7 (all dimensions PASS at final).
- `stf generate -n 8 --difficulty mixed`, hand-curated tool-agnostic. See PR comment for L5 classification + per-case breakdown.
- `stf audit` per-dimension (L3, after all references)
- `3-file-and-streaming-connectors.md` and PuPr deep coverage)

This pull request was AI-assisted by Isaac.