experimental: add databricks-lakeflow-connect skill #103

Merged
dustinvannoy-db merged 8 commits into databricks:main from jralfonsog:experimental/lakeflow-connect on Jun 8, 2026

@jralfonsog jralfonsog commented May 27, 2026

Contributor

Summary

New experimental skill databricks-lakeflow-connect for managed ingestion pipelines. GA-first deep coverage; Public Preview (PuPr) connectors are listed in SKILL.md as production-supported, with deep coverage planned as they stabilize. No overlap with databricks-pipelines: Lakeflow Connect pipelines reuse the pipelines API surface via ingestion_definition, and this skill cross-links to skills/databricks-pipelines/ from the decision tree and Related Skills.

Changes

  • experimental/databricks-lakeflow-connect/SKILL.md (~200 lines) — routing + 3-tier catalog (GA / PuPr / Beta-PrPr) + workflow + key concepts + common issues.
  • experimental/databricks-lakeflow-connect/references/1-saas-connectors.md (~135 lines) — six GA SaaS connectors (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection + pipeline + schedule pattern, per-connector auth and limits, DAB stub, and common gotchas.
  • experimental/databricks-lakeflow-connect/references/2-database-connectors.md (~145 lines) — SQL Server (cloud and on-prem): the gateway pattern, change tracking vs CDC, DAB stub with both gateway and ingestion pipelines, on-prem private networking, gateway-specific gotchas, brief pointer to PuPr database connectors.
  • experimental/databricks-lakeflow-connect/references/4-ingestion-decision-tree.md (~130 lines) — Lakeflow Connect vs Auto Loader vs Lakehouse Federation vs Delta Sharing vs Zerobus + cost considerations + escape hatches. Cross-links to the Auto Loader work in databricks-solutions/ai-dev-kit#539.
  • experimental/databricks-lakeflow-connect/references/5-troubleshooting-and-monitoring.md (~50 lines) — event log queries (SaaS and database pipelines), nine common error / expected-behavior rows with resolutions, and escalation pointers.
  • experimental/databricks-lakeflow-connect/agents/openai.yaml + assets/databricks.{svg,png} — auto-generated via scripts/skills.py generate.
  • manifest.json — updated by scripts/skills.py generate to register the new skill and its references.

SharePoint / Google Drive (Beta as of May 2026; GA target Jun 1) are not first-class in v1 — they appear in the Beta/PrPr note in SKILL.md. databricks-zerobus-ingest is pointed to from the catalog and decision tree (push-vs-pull dichotomy), not re-covered.

To follow

| Commit | Content |
| --- | --- |
| (v2) | PuPr deep coverage (NetSuite, Dynamics 365, PG/MySQL CDC, SFTP, query-based databases, Foreign Catalog query) as connectors stabilize |
| (v2) | references/3-file-and-streaming-connectors.md, created when SFTP + SharePoint/Drive get deep coverage |

Cross-repo

  • Tracking issue: databricks-solutions/ai-dev-kit#499.
  • Companion ai-dev-kit PR: databricks-solutions/ai-dev-kit#539 ships the Auto Loader reference in the SDP skill that this skill's decision tree cross-links to.
  • Scope-checked in #ai-dev-kit-team Slack on 2026-05-27; maintainers signed off on Databricks Agent Skills experimental/ as the destination.

Test plan

  • python3 scripts/skills.py generate clean.
  • python3 scripts/skills.py validate passes (Everything is up to date.).
  • All cross-skill links resolve against the DAS layout (skills/databricks-pipelines/, skills/databricks-dabs/, skills/databricks-jobs/, experimental/databricks-zerobus-ingest/, experimental/databricks-unity-catalog/).
  • stf audit L3 trajectory across commits: 8.2 → 8.3 → 8.5 → 8.7 (all dimensions PASS at final).
  • Full Skillforge pyramid L1 - L5 — composite 0.76, PASS. Per-level: L1=0.72 (36 checks), L2=1.00 (3 checks), L3=0.83 (15 checks), L4=0.68 (40 checks), L5=0.58 (95 checks). Ground truth = 8 cases generated with stf generate -n 8 --difficulty mixed, hand-curated tool-agnostic. See PR comment for L5 classification + per-case breakdown.
  • CI green.

stf audit per-dimension (L3, after all references)

| Dimension | Score | Status |
| --- | --- | --- |
| tool_accuracy | 10.0 | PASS |
| examples_valid | 10.0 | PASS |
| no_conflicts | 9.0 | PASS |
| llm_navigable | 9.0 | PASS |
| scoped_clearly | 10.0 | PASS |
| security | 9.0 | PASS |
| actionable_instructions | 8.0 | PASS |
| error_handling | 8.0 | PASS |
| no_hallucination_triggers | 7.0 | PASS |
| self_contained | 7.0 | PASS (climbed from 6.0 baseline as references landed; remaining headroom is the deferred 3-file-and-streaming-connectors.md and PuPr deep coverage) |

This pull request was AI-assisted by Isaac.

Initial scope-first commit for a draft PR. GA-first deep coverage,
PuPr listed but deferred to follow-up commits.

This commit includes:
- SKILL.md (routing + 3-tier catalog: GA / PuPr / Beta-PrPr + workflow
  + key concepts + common issues)
- references/4-ingestion-decision-tree.md (LFC vs Auto Loader vs
  Lakehouse Federation vs Delta Sharing vs Zerobus + cost
  considerations + escape hatches)
- agents/openai.yaml + assets/ via scripts/skills.py generate
- manifest.json updated

To follow in subsequent commits:
- references/1-saas-connectors.md (Salesforce, Workday Reports,
  ServiceNow, GA4, HubSpot, Confluence — all GA)
- references/2-database-connectors.md (SQL Server cloud + on-prem +
  gateway pattern intro — GA)
- references/5-troubleshooting-and-monitoring.md (GA-focused)

Public Preview connectors (NetSuite, Dynamics 365, PG/MySQL CDC,
query-based databases, Foreign Catalog query-based, SFTP) are
production-supported and listed in SKILL.md; deep coverage will be
added incrementally as PuPr connectors stabilize. SharePoint/Google
Drive (Beta currently, GA Jun 1 target) and other Beta/PrPr connectors
are not first-class in this skill.

references/3-file-and-streaming-connectors.md will be created when SFTP
+ SharePoint/Drive get deep coverage (post-v1).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@dustinvannoy-db

Collaborator

Talked with @jralfonsog and he will add in the GA connector references he has been working on as part of this PR before we review and finalize.

Deep coverage for the six GA SaaS connectors (Salesforce, Workday Reports,
ServiceNow, Google Analytics 4, HubSpot, Confluence): unified UC connection
+ pipeline + schedule pattern, per-connector auth and limits, DAB stub, and
common gotchas.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Deep coverage for SQL Server (cloud and on-prem): the gateway pattern,
change tracking vs CDC, DAB stub with both gateway and ingestion
pipelines, on-prem private networking, and gateway-specific gotchas.
Brief pointer to Public Preview database connectors (Postgres/MySQL CDC,
query-based, Foreign Catalog) pending deep coverage as they stabilize.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
Event log queries (SaaS and database pipelines), nine common
error / expected-behavior rows with resolutions, and escalation
pointers (public docs hub, connector reference, workspace support).

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@jralfonsog

Contributor Author

Skillforge full evaluation (L1 - L5) — composite 0.76, PASS

After all four content commits landed, ran the full Skillforge pyramid via the ~/voodoo/skillforge/SKILL.md orchestrator pattern. Ground truth was bootstrapped with stf generate -n 8 --difficulty mixed, then hand-curated tool-agnostic.

Pyramid summary

| Level | Score | Checks | Status |
| --- | --- | --- | --- |
| L1 (unit / built-in) | 0.72 | 36 | PASS |
| L2 (integration / connectivity) | 1.00 | 3 | PASS |
| L3 (static / LLM judge) | 0.83 | 15 | PASS |
| L4 (thinking) | 0.68 | 40 | PASS |
| L5 (output WITH vs WITHOUT) | 0.58 | 95 | PASS |
| Composite | 0.76 | | PASS |
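
The summary does not say how the composite is derived. One reading that is consistent with the reported numbers is an unweighted mean of the five level scores. The short Python check below works under that assumption only; it is not a statement about how Skillforge actually computes the value.

```python
from statistics import mean

# Per-level scores copied from the pyramid summary above.
level_scores = {"L1": 0.72, "L2": 1.00, "L3": 0.83, "L4": 0.68, "L5": 0.58}

# Assumption (not stated in the PR): composite = unweighted mean of the levels.
composite = mean(level_scores.values())

print(f"composite = {composite:.3f}")  # 0.762, which rounds to the reported 0.76
```

The same assumption also reproduces the separately reported quick-eval baseline: the mean of 0.72, 1.00 and 0.84 is about 0.853, reported as 0.85.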

L3 audit trajectory across commits

| After commit | Overall | self_contained | actionable | scoped_clearly | security |
| --- | --- | --- | --- | --- | --- |
| 1 (SKILL.md + decision tree) | 8.2 | 6.0 | 7.0 | 9.0 | 8.0 |
| 2 (+ SaaS connectors ref) | 8.3 | 7.0 | 7.0 | 9.0 | 8.0 |
| 3 (+ database connectors ref) | 8.5 | 7.0 | 7.0 | 9.0 | 9.0 |
| 4 (+ troubleshooting ref) | 8.7 | 7.0 | 8.0 | 10.0 | 9.0 |

L5 classification (95 checks across 8 test cases)

| Classification | Count | % |
| --- | --- | --- |
| POSITIVE (skill taught the agent something useful) | 34 | 36% |
| NEUTRAL (agent already knew it; skill not needed here) | 33 | 35% |
| NEEDS_SKILL (both WITH and WITHOUT missed; coverage gap) | 22 | 23% |
| REGRESSION (skill made the agent worse) | 4 | 4% |
| UNTAGGED | 2 | 2% |

Per-case L5 response score

| Case ID | Difficulty | Score | Notes |
| --- | --- | --- | --- |
| saas_oauth_u2m_d9e3 | intermediate | 0.83 | OAuth U2M cannot be automated in DAB |
| dab_authoring_h6c5 | hard | 0.72 | DAB conversion for Salesforce pipeline |
| federation_vs_connect_e5f7 | intermediate | 0.68 | Snowflake: Federation vs Connect |
| too_many_tables_f2a4 | intermediate | 0.65 | 400-table partition workaround |
| salesforce_basic_a3f1 | easy | 0.63 | Salesforce pipeline create |
| continuous_mode_error_c4d8 | easy | 0.63 | Triggered-only constraint |
| sqlserver_cdc_gateway_b7c2 | hard | 0.20 | Agent run cut by --timeout 300 mid-tool-use |
| no_data_flowing_g8b1 | intermediate | 0.02 | Agent run cut by --timeout 300 mid-tool-use |

The two lowest-scoring cases (sqlserver_cdc_gateway_b7c2, no_data_flowing_g8b1) were cut off by the 5-minute agent timeout while still in active tool_use. The 4 REGRESSIONs and 13 of the 22 NEEDS_SKILL are concentrated in those two truncated runs. Re-running them with a longer timeout is the obvious next step if reviewers want to clear the marginal pass.
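
To make the arithmetic behind that claim explicit, the sketch below recomputes the classification shares from the reported counts and shows what is left outside the two truncated runs. All inputs are copied from this comment. The per-case check totals for the truncated runs are not given, so the sketch does not attempt to rescale the percentages.

```python
# L5 classification counts as reported (95 checks across 8 cases).
counts = {
    "POSITIVE": 34,
    "NEUTRAL": 33,
    "NEEDS_SKILL": 22,
    "REGRESSION": 4,
    "UNTAGGED": 2,
}
total = sum(counts.values())

# Whole-number percentage share for each classification.
shares = {label: round(100 * n / total) for label, n in counts.items()}

# Findings attributed to the two timed-out cases in the paragraph above.
in_truncated_runs = {"NEEDS_SKILL": 13, "REGRESSION": 4}
remaining = {label: counts[label] - n for label, n in in_truncated_runs.items()}

print(total)      # 95
print(shares)     # POSITIVE 36, NEUTRAL 35, NEEDS_SKILL 23, REGRESSION 4, UNTAGGED 2
print(remaining)  # NEEDS_SKILL 9, REGRESSION 0
```

Read this way, the six runs that completed account for 9 NEEDS_SKILL findings and no REGRESSIONs.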

L1+L2+L3 quick-eval baseline (run separately before L4/L5): composite 0.85 (L1=0.72, L2=1.00, L3=0.84).

@dustinvannoy-db dustinvannoy-db left a comment

Collaborator

Got some things to change and others to review, such as whether GA connector examples should really use PREVIEW rather than CURRENT. Overall looking good.

Also, any links to other skills should match what stable skills are using, which is just the name without a link.

SUGGESTION: Align with the dominant convention. In Related Skills and inline mentions:
REPLACE: **[databricks-pipelines](../../skills/databricks-pipelines/SKILL.md)**
WITH: **databricks-pipelines**

Comment thread experimental/databricks-lakeflow-connect/agents/openai.yaml Outdated
Comment thread experimental/databricks-lakeflow-connect/SKILL.md Outdated
Comment thread experimental/databricks-lakeflow-connect/SKILL.md Outdated
Comment thread experimental/databricks-lakeflow-connect/SKILL.md Outdated
Comment thread experimental/databricks-lakeflow-connect/SKILL.md
Comment thread experimental/databricks-lakeflow-connect/SKILL.md Outdated
Comment thread experimental/databricks-lakeflow-connect/references/1-saas-connectors.md Outdated
Comment thread experimental/databricks-lakeflow-connect/references/1-saas-connectors.md Outdated
Comment thread experimental/databricks-lakeflow-connect/references/2-database-connectors.md Outdated
Comment thread experimental/databricks-lakeflow-connect/references/2-database-connectors.md Outdated
@auschoi96


@jralfonsog can you attach the report.html that's generated as well?

Per @dustinvannoy-db review:
- Add frontmatter parent / compatibility / metadata.version (aligns with databricks#105)
- Drop channel from GA examples; GA connectors default to CURRENT (docs:
  ServiceNow/Salesforce GA omit channel, Confluence Beta uses PREVIEW).
  Reframe guidance to PREVIEW-only-for-Public-Preview/Beta.
- Cross-skill references use plain bold names (stable-skill convention)
- Remove the GA target date column and other specific dates (avoid
  maintenance churn)
- Drop Zerobus from the connector catalog (separate skill; kept as a
  related-skill and decision-tree cross-reference)
- Consolidate the SKILL.md Common Issues table into a pointer to the
  troubleshooting reference
- Remove the top Documentation block (duplicated in Resources)
- Fix openai.yaml default_prompt grammar and casing

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@jralfonsog

Contributor Author

Thanks for the thorough review. Pushed 13c5fc6 addressing everything. Summarizing here rather than replying on each thread, hope it's OK @dustinvannoy-db :

  • Cross-skill links: plain bold names now (no file-path links), matching the stable-skill convention.
  • PREVIEW vs CURRENT: good catch. Verified against the docs (ServiceNow and Salesforce GA examples omit channel, Confluence Beta uses PREVIEW), so I dropped channel from all GA examples (they default to CURRENT) and reframed the guidance to "PREVIEW only for Public Preview / Beta connectors."
  • GA target column: removed, along with the other specific dates (Status line, "triggered only" notes), to avoid the maintenance churn.
  • Zerobus row: removed from the connector catalog. Kept only as a Related Skill and a push-vs-pull row in the decision tree.
  • Common Issues: consolidated into a one-line pointer to the troubleshooting reference.
  • Documentation block: removed from the top (same links are in Resources at the bottom).
  • Frontmatter: added parent / compatibility / metadata.version to match #105 (experimental: backfill metadata.version + parent + compatibility frontmatter).
  • openai.yaml default_prompt: applied your suggestion.

@jralfonsog

jralfonsog commented May 30, 2026

Contributor Author

@auschoi96 attached the Skillforge report.html (zipped, since GitHub does not accept raw .html). I redacted the workspace URL and one local path the L5 run captured; the scores and judge feedback are intact.

lakeflow-connect-report-sanitized.html

@dustinvannoy-db dustinvannoy-db left a comment

Collaborator

LGTM, let's confirm with @auschoi96 and someone from eng before we merge.

@dustinvannoy-db

Collaborator

> @auschoi96 attached the Skillforge report.html (zipped, since GitHub does not accept raw .html). I redacted the workspace URL and one local path the L5 run captured; the scores and judge feedback are intact.
>
> lakeflow-connect-report-sanitized.html

@auschoi96 Added you as reviewer since I'd like confirmation the L4 and L5 aren't anything to worry about here. We can always address some things in a follow up PR if needed.

@simonfaltum simonfaltum left a comment

Member

We should use the up-to-date product names where possible

Comment thread experimental/databricks-lakeflow-connect/references/1-saas-connectors.md Outdated
Comment thread experimental/databricks-lakeflow-connect/SKILL.md Outdated
… Automation Bundles)

Per @simonfaltum review: rename Databricks Asset Bundles to Declarative
Automation Bundles in Required Tools and the bundle authoring section.
Keeps the DAB acronym and the databricks-dabs skill reference.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@jralfonsog

Contributor Author

Good call, thanks. Updated to "Declarative Automation Bundles" in both spots (Required Tools and the bundle authoring section) in afc23f7. Kept the "DAB" acronym and the databricks-dabs skill reference as-is since the acronym still fits.

@dustinvannoy-db dustinvannoy-db marked this pull request as ready for review June 2, 2026 20:40
@dustinvannoy-db dustinvannoy-db requested review from a team and lennartkats-db as code owners June 2, 2026 20:40

@dustinvannoy-db dustinvannoy-db left a comment

Collaborator

LGTM

@simonfaltum

Member

Went through the skill end to end. Overall it's in good shape: it lands governed UC Delta tables, leans on SDP + Auto-CDC + UC, and makes DABs the production authoring path. The highest-leverage improvements are in completing the authoring loop and removing the one dead-end, not adding scope. Ranked:

1. Add the run + poll commands so the workflow actually completes

SKILL.md:147 tells the agent to "trigger the first run and watch the event log", which is the climax of the workflow, but no reference shows how to run a pipeline. The minimal example (SKILL.md:109) stops at databricks pipelines create, and references/5-troubleshooting-and-monitoring.md only has event-log query SQL. So the agent creates a pipeline object and then stalls.

Add the run + poll commands, anchored on the DAB path (also the recommended path):

databricks bundle deploy -t dev
databricks pipelines run salesforce_ingestion        # KEY = pipeline name in YAML; waits by default
databricks pipelines run salesforce_ingestion --no-wait
databricks pipelines get-update --pipeline-id <id>   # poll run status

Worth stating the asymmetry explicitly: databricks pipelines run is bundle-keyed, so a pipeline created imperatively via pipelines create --json has no clean run-by-id CLI (you'd reach for the SDK/REST start_update). That's another nudge toward the DAB path.

Small, additive, no eval needed.

2. Turn the OAuth U2M "plan for a human step" into a concrete hand-off + resume

SKILL.md:168 / :99 and references/1-saas-connectors.md correctly say OAuth U2M connections (Salesforce, ServiceNow, HubSpot, Confluence) are UI-only and CLI/DAB can't bootstrap them, then say "plan for a one-time human step." That's a soft dead-end on the headline connector (Salesforce): no exact user instructions, and no way for the agent to verify success and resume.

Give the agent an exact hand-off plus a resume check:

Tell the user: Catalog Explorer > External Data > Connections > Create connection > Salesforce > sign in. Then verify and resume:

databricks connections get <connection_name>   # poll until "status": "READY"

This closes a loop the skill already half-opens (the PENDING row and the DESCRIBE CONNECTION query in references/5-troubleshooting-and-monitoring.md).

Small, additive.
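
The hand-off and resume check in item 2 amounts to a bounded polling loop. Below is a minimal Python sketch of that loop. The `"READY"` value comes from the command comment above. The function name, the timeout and interval defaults, and the `fetch_status` callable are hypothetical; `fetch_status` stands in for whatever actually reads the connection status, for example by parsing the output of `databricks connections get`.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    *,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_status() until it returns "READY" or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if fetch_status() == "READY":
            return True   # the human sign-in step is done; resume the workflow
        if clock() >= deadline:
            return False  # give up and hand back to the user
        sleep(interval_s)


# Usage with a stubbed status source; a real caller would query the workspace.
statuses = iter(["PENDING", "PENDING", "READY"])
ready = wait_until_ready(lambda: next(statuses), sleep=lambda _seconds: None)
print(ready)  # True
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time. Returning `False` on timeout gives the agent an explicit branch for going back to the user.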

3. Add the missing good/bad anti-pattern blocks

The troubleshooting tables are good, but there are no side-by-side correct-vs-wrong code blocks, and three are already in the skill as prose warnings that just need converting:

  • Lakehouse Federation's CREATE TABLE ... FROM CONNECTION (the skill warns at :107 this syntax doesn't exist for Lakeflow Connect) vs the correct ingestion_definition JSON
  • a libraries block vs an ingestion_definition block (the difference is noted at :90 but the wrong form is never shown)
  • continuous: true vs continuous: false + a Jobs schedule (triggered-only, :154)
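
Two of these anti-patterns can be checked mechanically on a pipeline spec before it is submitted. The sketch below is one hypothetical way to do that in Python. The function name and the example specs are invented for illustration, and it only encodes the rules stated in this review: a managed ingestion pipeline carries `ingestion_definition` rather than `libraries`, and it must not set `continuous: true`.

```python
def lint_ingestion_spec(spec: dict) -> list[str]:
    """Return human-readable findings for a Lakeflow Connect pipeline spec."""
    findings = []
    if "ingestion_definition" not in spec:
        if "libraries" in spec:
            findings.append(
                "uses a libraries block; a managed ingestion pipeline "
                "is defined by ingestion_definition"
            )
        else:
            findings.append("missing ingestion_definition")
    if spec.get("continuous") is True:
        findings.append(
            "continuous: true; Lakeflow Connect is triggered-only, "
            "so schedule runs with a Jobs pipeline_task"
        )
    return findings


# Hypothetical specs: the first shows both mistakes, the second shows neither.
bad = {
    "name": "salesforce_ingestion",
    "libraries": [{"notebook": {"path": "/hypothetical/notebook"}}],
    "continuous": True,
}
good = {
    "name": "salesforce_ingestion",
    "ingestion_definition": {"connection_name": "hypothetical_connection", "objects": []},
    "continuous": False,
}

for finding in lint_ingestion_spec(bad):
    print("BAD:", finding)
print("good spec findings:", lint_ingestion_spec(good))
```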

4. Move a compact "is Lakeflow Connect the right tool?" gate up into SKILL.md

The Lakeflow-Connect-vs-Auto-Loader-vs-Federation-vs-Delta-Sharing tree in references/4-ingestion-decision-tree.md is great, but SKILL.md only links to it (:80). An agent reading top-down commits to building an LFC pipeline before it reaches the gate. Lift a short "use LFC when / do NOT use when" table near the top (files on S3/ADLS/GCS -> Auto Loader; zero-copy -> Federation; push -> Zerobus; partner share -> Delta Sharing), keeping the pointer to reference 4 for the full reasoning.

This changes routing, so validate with an eval / activation run before merging (routing tweaks are easy to regress).

5. Anchor "Lakeflow Spark Declarative Pipelines" to DLT

SKILL.md:18 uses "SDP" / "Lakeflow Spark Declarative Pipelines" with no DLT anchor. Models know DLT well; SDP is newer branding. The databricks-pipelines sibling already does this in both its description and body: "Lakeflow Spark Declarative Pipelines (formerly Delta Live Tables / DLT)". Matching it is a free navigability + consistency win.

6. Tighten the description

The frontmatter description (:9) is ~600 chars, roughly 3x the ~200-char convention (see #107) and ~1.6x the longest experimental sibling. The routing signal is the connector trigger names; the "Covers the unified setup pattern ... Delta Sharing" tail is documentation, not routing. Cut the tail, keep the "Use when ingesting from [connectors]" triggers. Quick activation check afterward so a real trigger term isn't dropped.

Minor

  • The skill lands salesforce_raw / sqlserver_raw and stops. Add one explicit next-step line pointing downstream (build the medallion Bronze/Silver/Gold pipeline -> databricks-pipelines). I wouldn't push Genie here, since ingestion isn't exploration.
  • Version contradiction: frontmatter compatibility says >= v0.294.0 (:10) but Required Tools says "Databricks CLI v1.0.0+" (:86). Make them agree on the real minimum.

Nothing here is blocking. #1 and #2 are the two I'd prioritize: they turn a skill that describes the workflow into one that completes it.

@simonfaltum simonfaltum left a comment

Member

Skill review: databricks-lakeflow-connect

Reviewed the full PR head in a worktree (not from memory), verifying load-bearing commands and fields against the Go SDK and live CLI. Verdict: fix-then-merge. 2 inline P1s, the rest P2/nit. No P0s.

The API homework is the strong point and I verified these are all real: gateway_definition, ingestion_definition, gateway_storage_{catalog,schema,name}, ingestion_gateway_id, connection_name, source_/destination_*, channel; the DAB pipeline resource supports ingestion_definition/gateway_definition; and databricks pipelines create --json + databricks connections create exist. 'Declarative Automation Bundles' is the current official name (confirmed via databricks bundle --help, the databricks-dabs skill, and PR #34) and is correct, not a regression.

Fix before merge (P1):

  • SKILL.md:80 CLI version (v1.0.0+) contradicts the frontmatter (v0.294.0).
  • SKILL.md:161 + references/4-ingestion-decision-tree.md:60 applyAsChangesFrom is not a real identifier.

Worth your eyes (P2, inline): channel semantics (described as connector-GA, but it is the SDP runtime channel; please verify against connector docs, could be P1); db_owner grant ordering; continuous: false 'cron block'; description scope/length.

Checked and clean: python3 scripts/skills.py validate passes against the PR tree; all relative links resolve and all 6 referenced sibling skills exist; databricks-sdk>=0.85.0 is real (SDK at 0.105.0); strategy is strong (governed Delta tables via managed pipelines + DABs, idiomatic primitives, concrete escape hatches in the decision tree, all-public surface); patterns present (troubleshooting tables, decision table + 'do not use when', defer-to-docs, separation of concerns, gateway diagram, length parity). Eval evidence in the PR (stf 8.2 to 8.7, Skillforge 0.76 PASS) is exactly what this repo asks for.

Deferred (already tracked): 3-file-and-streaming-connectors.md + PuPr deep coverage.


## Required Tools

- **Databricks CLI v1.0.0+** for `databricks pipelines create` and `databricks connections create`. Verify with `databricks --version`.

Member

P1 (CR-5/CR-12): contradicts the frontmatter. Line 4 sets compatibility: ... >= v0.294.0, but this says v1.0.0+. An agent reads top-down and will tell the user to install v1.0.0+ right after passing the v0.294.0 gate. >= v0.294.0 is the repo's most common compatibility value; v1.0.0+ is the outlier. Both commands (pipelines create --json, connections create) are verified working and long predate v1.0.0, so the real minimum is far lower. Fix: change this to v0.294.0+ to match the frontmatter (or drop the inline number).
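
The practical effect of the two floors is easy to see with a version comparison. In the Python sketch below, the installed version `v0.300.0` is a hypothetical value chosen to fall between the two floors, and `parse_version` is an illustrative helper, not part of any Databricks tooling.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v0.294.0' or 'v1.0.0+' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.strip().lstrip("v").rstrip("+").split("."))


frontmatter_floor = parse_version("v0.294.0")    # compatibility frontmatter
required_tools_floor = parse_version("v1.0.0+")  # Required Tools section
installed = parse_version("v0.300.0")            # hypothetical installed CLI

passes_frontmatter = installed >= frontmatter_floor
passes_required_tools = installed >= required_tools_floor

print(passes_frontmatter, passes_required_tools)  # True False
```

A CLI at that version clears the frontmatter gate and then fails the inline requirement. That is the contradiction an agent reading top-down would run into.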

- **UC `CONNECTION` is the credential anchor** — every Lakeflow Connect pipeline points at a UC connection. The connection owns the auth; the pipeline references it by name.
- **Serverless ingestion pipeline + (optional) classic gateway** — SaaS connectors are pure serverless. Database connectors split into a customer-network gateway (classic) and a serverless ingestion pipeline (Delta-bound).
- **CDC and schema evolution are built in** — for sources that support change tracking or CDC, the connector applies changes incrementally and evolves the target schema. Data-type changes typically require a full snapshot reload.
- **Streaming Delta output** — destination tables are governed Delta tables with `applyAsChangesFrom` semantics for CDC sources. Compatible with downstream materialized views and Spark streaming.

Member

P1 (CR-4): applyAsChangesFrom is not a real identifier. Verified against the SDK and the databricks-pipelines skill: the real APIs are apply_changes() / apply_changes_from_snapshot() / create_auto_cdc_flow() (Python) and APPLY CHANGES INTO / AUTO CDC INTO (SQL). Your own error code APPLY_CHANGES_FROM_SNAPSHOT_ERROR points at the real concept. It is a code-shaped token in backticks, so an agent may emit it and fail. Fix e.g.: 'applied with CDC semantics (APPLY CHANGES / AUTO CDC; apply_changes_from_snapshot for snapshot sources)'. Same fix needed at references/4-ingestion-decision-tree.md:60.

- You need a governed Delta copy in your lakehouse for performance, ML training, or downstream pipelines.
- Query volume against the source data is high.
- The source is performance-sensitive (you don't want to add query load to your production OLTP).
- You need point-in-time history (CDC into a Delta table with `applyAsChangesFrom`).

Member

P1 (CR-4): same invented identifier as SKILL.md:161. applyAsChangesFrom is not real; use apply_changes_from_snapshot / APPLY CHANGES here.


- `ingestion_definition.connection_name` — the UC connection name (not URL, not ID).
- `objects[].table` — one entry per source table. Use `objects[].schema` to ingest a whole source schema in one block.
- `channel` — omit for GA connectors (defaults to `CURRENT`). Set `channel: PREVIEW` only for connectors still in Public Preview / Beta.
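
As an illustration of how these fields fit together, the Python sketch below assembles a pipeline spec that could be serialized for `databricks pipelines create --json`. `connection_name`, `objects[].table` and `channel` come from the bullets above. The keys inside `table` follow the `source_*` / `destination_*` pattern mentioned in this review, and their exact names are an assumption to verify against the pipelines API reference. All values (connection, schema, table and catalog names) are hypothetical.

```python
import json
from typing import Optional


def build_ingestion_spec(
    name: str,
    connection_name: str,
    source_schema: str,
    tables: list[str],
    destination_catalog: str,
    destination_schema: str,
    channel: Optional[str] = None,
) -> dict:
    """Assemble a managed-ingestion pipeline spec with one objects[] entry per table."""
    spec = {
        "name": name,
        "ingestion_definition": {
            "connection_name": connection_name,  # UC connection name, not a URL or ID
            "objects": [
                {
                    "table": {
                        "source_schema": source_schema,
                        "source_table": table,
                        "destination_catalog": destination_catalog,
                        "destination_schema": destination_schema,
                    }
                }
                for table in tables
            ],
        },
    }
    # channel is only written when the caller sets it explicitly.
    if channel is not None:
        spec["channel"] = channel
    return spec


spec = build_ingestion_spec(
    name="salesforce_ingestion",
    connection_name="hypothetical_salesforce_connection",
    source_schema="objects",
    tables=["Account", "Contact"],
    destination_catalog="main",
    destination_schema="salesforce_raw",
)
print(json.dumps(spec, indent=2))
```

Leaving `channel` out unless it is explicitly requested keeps the default behaviour visible in the generated JSON.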

Member

P2 (CR-5/CR-10), please verify. The SDK documents channel as the SDP runtime release channel ('SDP Release Channel that specifies which version to use'), not a per-connector GA indicator. Framing it as 'omit for GA connectors / set PREVIEW for preview connectors / switch to CURRENT once the connector is GA' gives the agent a wrong mental model (e.g. that CURRENT 'promotes' a connector). The same framing recurs at line 133 and references/5-troubleshooting-and-monitoring.md:40. Fix: describe channel as the runtime channel; if preview connectors require the preview runtime, say 'some preview connectors are only available in the PREVIEW runtime channel'. If preview connectors do not actually require channel: PREVIEW, this is wrong in 3 places and becomes P1, worth confirming against the connector docs.

Collaborator

I'm not sure about this one. @jralfonsog Please check if valid and disregard if it's incorrect here.

Prerequisites:

1. **SQL Server 2012+** (cloud-managed: Azure SQL DB, Azure SQL MI, RDS for SQL Server).
2. **A dedicated database user** with `db_owner` on the source database, or the minimum grants for CT/CDC (see the connector reference).

Member

P2 (CR-8): leads with a broad grant. Secure-by-default wants least privilege first. This also conflates two roles: one-time CDC enablement (may legitimately need elevated rights, done by a DBA) vs the connector's ongoing read user (should be minimal). Fix: lead with the minimum CT/CDC grants for the runtime user; mention db_owner only as the broad-but-simple option, noted as needed transiently to enable CDC.

- **OAuth U2M** connections (Salesforce, ServiceNow, HubSpot, Confluence) must be created in Catalog Explorer — the OAuth handshake requires a browser. CLI and DAB cannot bootstrap U2M.
- **API-key / basic / refresh-token** connections (Workday Reports, GA4 via service account, ServiceNow basic) can be created with `databricks connections create` or a DAB resource.
2. **Create the ingestion pipeline** with `databricks pipelines create --json` (or DAB). The pipeline carries the `ingestion_definition` block that names the connection and lists the source objects to land.
3. **Schedule the pipeline**. Lakeflow Connect supports triggered runs only — schedule with a Jobs `pipeline_task` or with the pipeline's own `continuous: false` cron block.

Member

P2 (CR-5): continuous: false is not a 'cron block'. It only selects triggered mode and carries no schedule; scheduling is via a Jobs pipeline_task (your row at references/2-database-connectors.md:130 states this correctly). As written, an agent may think setting continuous: false schedules the pipeline and skip the Jobs task, leaving a pipeline that never runs on cadence. Same at SKILL.md:142. Fix: drop 'cron block'; say triggered pipelines are scheduled by a Jobs pipeline_task.
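
To show where the cadence actually lives, the Python sketch below builds a Jobs payload that wraps a triggered pipeline in a scheduled `pipeline_task`. The job name, task key, pipeline ID placeholder and cron expression are hypothetical. The `schedule` and `pipeline_task` field names follow the Jobs API as I understand it and should be checked against the current reference before use.

```python
import json


def scheduled_pipeline_job(
    job_name: str,
    pipeline_id: str,
    quartz_cron: str,
    timezone_id: str = "UTC",
) -> dict:
    """Build a Jobs payload that runs a triggered pipeline on a cron cadence."""
    return {
        "name": job_name,
        # The schedule lives on the job, not on the pipeline.
        "schedule": {
            "quartz_cron_expression": quartz_cron,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
        "tasks": [
            {
                "task_key": "run_ingestion",
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }


# The pipeline itself only declares triggered mode; it carries no schedule.
pipeline_settings = {"name": "salesforce_ingestion", "continuous": False}

job = scheduled_pipeline_job(
    job_name="salesforce_ingestion_daily",
    pipeline_id="<pipeline-id>",
    quartz_cron="0 0 6 * * ?",  # every day at 06:00 in the given timezone
)
print(json.dumps(job, indent=2))
```

Keeping the two objects separate mirrors the reviewer's point: `continuous: false` selects the mode, and the job supplies the schedule.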

@@ -0,0 +1,186 @@
---
name: databricks-lakeflow-connect
description: "Build managed ingestion pipelines into Databricks using Lakeflow Connect. Use when ingesting from SaaS apps (Salesforce, Workday Reports, ServiceNow, Google Analytics 4, HubSpot, Confluence), databases (SQL Server cloud and on-prem; PostgreSQL/MySQL CDC in PuPr), or file sources (SharePoint, Google Drive, SFTP) into Unity Catalog with serverless pipelines. Covers the unified setup pattern (UC connection -> ingestion pipeline -> streaming Delta tables), the gateway pattern for database CDC, DAB-based authoring, and the decision between Lakeflow Connect, Auto Loader, Lakehouse Federation, and Delta Sharing."

Member

P2 (CR-11/ST-6): two small things on the routing description. (1) It headlines 'file sources (SharePoint, Google Drive, SFTP)', but the skill defers all file connectors (SFTP brief, SharePoint/Drive Beta, no 3-... ref), so it fires for an intent it punts to docs. It is honest (routes to the connector reference), but consider trimming file sources from the primary triggers. (2) At ~612 chars it is the 2nd-longest description in the repo (target is <=200; several siblings are also 480-762, so not anomalous). Trigger-rich, so low priority; trim if convenient.


```sql
SELECT timestamp, level, message, error
FROM event_log("<pipeline-id-or-name>")
```

Member

Nit: the event_log() TVF takes a pipeline ID (or a TABLE(<streaming_table>) reference), not a pipeline name; the arg shown as <pipeline-id-or-name> may mislead. Worth confirming the name form is accepted; if not, use <pipeline-id>.

…nnel reframe, complete authoring loop)

P1 (must-fix before merge):
- CLI version floor v1.0.0+ -> v0.294.0+ to match the compatibility frontmatter
- Replace invented `applyAsChangesFrom` with real CDC identifiers (`APPLY CHANGES` / AUTO CDC, `apply_changes_from_snapshot`) in SKILL.md and the decision tree

Channel reframe (P2, both reviewers; verified against docs):
- Describe `channel` as the SDP runtime release channel, not a connector-GA switch, in 4 places. Preview connectors require PREVIEW because they run on the preview runtime; GA connectors run on CURRENT (omit)

Other P2 + nit:
- Lead SQL Server grants with least-privilege CT/CDC; `db_owner` only as the broad transient option
- Drop 'cron block' wording: triggered pipelines are scheduled by a Jobs `pipeline_task`
- Trim the description to its routing triggers (regenerate manifest)
- `event_log()` takes a pipeline ID, not a name

Additive (Simon-ranked):
- Add run + poll commands (`bundle run`, `start-update`/`get-update`)
- Concrete OAuth U2M hand-off + `connections get` resume check
- Good/bad anti-pattern blocks (FROM CONNECTION, libraries, continuous:true)
- Anchor SDP to 'formerly Delta Live Tables / DLT' on first mention
- Downstream next-step pointing to databricks-pipelines for the medallion build

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@jralfonsog

Hi team, thanks again for the careful review. Pushed 1ad42f6 addressing the round. Summarizing here rather than per thread:

P1 (both fixed)

  • CLI version: Required Tools now reads v0.294.0+, matching the compatibility frontmatter.
  • `applyAsChangesFrom`: removed from SKILL.md and the decision tree, and replaced with the real identifiers (`APPLY CHANGES` / AUTO CDC, and `apply_changes_from_snapshot` for snapshot sources).

channel (P2, verified)
You were right that this is the SDP runtime release channel, not a connector-GA switch. I confirmed against the public docs and internal sources that preview connectors genuinely require `channel: PREVIEW` because they run on the preview runtime (SharePoint requires DBR 17.3+ and sets `CHANNEL=PREVIEW`; Confluence is the same), while GA connectors run on CURRENT and omit it. I reframed all 4 places accordingly and added that PREVIEW builds also have slower startup. Thanks @dustinvannoy-db for flagging this too.
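
As a rough illustration of that rule, the Python sketch below sets `channel` only for preview connectors and omits it otherwise. The `PREVIEW_CONNECTORS` set is an assumption taken from the two connectors named in this comment and will drift as connectors change status; `pipeline_spec` is a hypothetical helper, and the dictionary is a trimmed stand-in for a full pipeline definition.

```python
from typing import Any

# Assumption: illustrative list based on the connectors named in this thread;
# which connectors are in preview changes over time.
PREVIEW_CONNECTORS = {"sharepoint", "confluence"}


def pipeline_spec(name: str, connector: str) -> dict[str, Any]:
    """Build a minimal pipeline spec, adding `channel` only when required."""
    spec: dict[str, Any] = {"name": name, "continuous": False}
    if connector.lower() in PREVIEW_CONNECTORS:
        # Preview connectors run on the preview runtime.
        spec["channel"] = "PREVIEW"
    # GA connectors run on CURRENT, so the key is simply omitted.
    return spec
```

Omitting the key rather than writing `CURRENT` explicitly mirrors the "GA connectors ... omit it" wording above.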

Other P2 + nit

  • SQL Server grants: now lead with least-privilege CT/CDC for the connector's read user, with `db_owner` only as the broad option a DBA uses transiently to enable CDC.
  • Dropped the "cron block" wording. `continuous: false` only selects triggered mode; scheduling is done via a Jobs `pipeline_task`.
  • `event_log()`: now takes `<pipeline-id>` (confirmed the name form is not the documented arg).
  • Description: trimmed the documentation tail, kept the connector triggers.
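
For the scheduling point in the list above, a minimal Python sketch of what "scheduled by a Jobs `pipeline_task`" could look like as a job payload. The field names follow the Jobs API as I understand it (`schedule.quartz_cron_expression`, `tasks[].pipeline_task.pipeline_id`); the helper name, job name, and task key are hypothetical.

```python
from typing import Any


def scheduled_pipeline_job(
    pipeline_id: str, quartz_cron: str, timezone_id: str = "UTC"
) -> dict[str, Any]:
    """Build a job payload that triggers a (non-continuous) pipeline on a schedule."""
    return {
        "name": f"run-ingestion-{pipeline_id}",  # hypothetical naming scheme
        "schedule": {
            "quartz_cron_expression": quartz_cron,
            "timezone_id": timezone_id,
        },
        "tasks": [
            {
                "task_key": "ingest",  # hypothetical task key
                # The schedule lives on the job; the pipeline itself only
                # declares triggered mode via `continuous: false`.
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }
```

The same shape maps onto a DAB job resource, which is where the skill's bundle-based authoring would normally declare it.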

Your ranked improvements

Deferred to a follow-up PR

  • Item #4 (the routing gate near the top of SKILL.md). Since it changes routing, I would rather ship it with its own Skillforge re-eval than against this PR's existing baseline. That same follow-up is the right home for your inline suggestion to trim file sources from the primary triggers, since that is also a routing change. Both are tracked.

Let me know if anything else needs a pass.

… gate

Lift a compact use/do-not-use decision gate to the top of SKILL.md (files -> Auto Loader, zero-copy -> Lakehouse Federation, push -> Zerobus, partner share -> Delta Sharing), keeping the pointer to the full ingestion decision tree. Trim the file-source triggers from the description since the skill defers file connectors.

Addresses Simon's review item #4. Validated with a full Skillforge re-eval in authoring mode (composite 0.76 -> 0.83, L4 thinking 0.68 -> 0.94, gaps clean, skill-invocation judge pass), so the routing change is regression-checked. Regenerate manifest for the trimmed description.

Signed-off-by: Jose Alfonso <jralfonsog@gmail.com>
@dustinvannoy-db added this pull request to the merge queue Jun 8, 2026
Merged via the queue into databricks:main with commit 3481d8e Jun 8, 2026
4 participants