Problem
More and more users reach for LLMs to generate DataFusion Python code. Today, agents are excellent at writing SQL but struggle to produce idiomatic DataFrame API code — they either transliterate SQL literally or invent patterns that don't match the library's grain. Nothing the project currently ships reliably surfaces idiomatic guidance to the agent at the moment it is writing code.
Goals
- Establish a single, authoritative guide for writing idiomatic DataFusion Python code.
- Make that guide discoverable through every channel agents actually use — not just the channels we wish they used.
- Validate the guide against a reference corpus (TPC-H) so it stays honest as the API evolves.
- Extend the same pattern across the wider DataFusion family (Ballista, Comet, Ray, etc.) via an upstream `llms.txt` hub.
Where idiomatic code is defined
Single source of truth: `SKILL.md` at the repo root.
This one file — kept at the repo root with YAML frontmatter for skill-ecosystem discovery, and included verbatim on the docs site — is the canonical guide for agents. It contains:
- Core abstractions (`SessionContext` / `DataFrame` / `Expr` / `functions`) and import conventions.
- A quick-start example that works end-to-end.
- SQL-to-DataFrame reference table (for users who think in SQL first).
- Migration sections for users coming from Spark, Pandas, and Polars — same shape as the SQL table, column-mapping each API's idioms to DataFusion's.
- Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal interactions, `F.substring` vs `F.substr` arity, join-key disambiguation, date-vs-timestamp arithmetic rules, etc.
- Idiomatic patterns: fluent chaining, window functions in place of correlated subqueries, semi/anti joins in place of `EXISTS`/`NOT EXISTS`, `aggregate().filter()` for `HAVING`, variable assignment for CTEs.
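The first pitfall in that list can be made concrete without the library installed. The sketch below mimics how expression objects such as DataFusion's `Expr` typically behave: Python's `and`/`or`/`not` force truth-testing (which an unevaluated expression cannot answer), while `&`/`|`/`~` dispatch to operator overloads and build a combined expression. The `Expr` class here is a hypothetical stand-in, not the real API.

```python
# Hypothetical stand-in for an expression object like DataFusion's Expr.
class Expr:
    def __init__(self, desc: str):
        self.desc = desc

    def __bool__(self):
        # Python's `and`/`or`/`not` call this; an unevaluated expression
        # has no truth value, so expression libraries raise here.
        raise TypeError("use & | ~ instead of and/or/not on expressions")

    def __and__(self, other):
        # `&` builds a new combined expression instead of truth-testing.
        return Expr(f"({self.desc} AND {other.desc})")


a, b = Expr("x > 1"), Expr("y < 5")

combined = a & b   # fine: builds the combined expression "(x > 1 AND y < 5)"
print(combined.desc)

try:
    a and b        # raises: `and` truth-tests the left operand first
except TypeError as e:
    print(e)
```

This is exactly the failure mode agents hit when they transliterate SQL's `AND` into Python's `and`; the fix is mechanical but must be stated explicitly in the guide.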
The TPC-H example suite (`examples/tpch/`) is the reference corpus: every query is written as idiomatic DataFrame code, validated by answer-file comparison, and where the optimized logical plan differs from the SQL version, the difference is documented in a comment. This gives the `SKILL.md` guidance a continuously verified ground truth.
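The answer-file comparison can be pictured as a small harness like the one below. The `|`-delimited layout and the relative tolerance are assumptions for illustration, not the actual `examples/tpch/` scripts:

```python
import csv
import math


def rows_match(result_rows, answer_path, rel_tol=1e-6):
    """Compare query output against a TPC-H answer file (illustrative sketch).

    Numeric cells are compared with a relative tolerance; everything else
    must match exactly after stripping whitespace.
    """
    with open(answer_path, newline="") as f:
        expected = list(csv.reader(f, delimiter="|"))
    if len(result_rows) != len(expected):
        return False
    for got, want in zip(result_rows, expected):
        for g, w in zip(got, want):
            try:
                if not math.isclose(float(g), float(w), rel_tol=rel_tol):
                    return False
            except (TypeError, ValueError):
                # Non-numeric cell: fall back to exact string comparison.
                if str(g).strip() != str(w).strip():
                    return False
    return True
```

The point of the harness is that the guide's examples are never allowed to drift from verified answers: a change in the API that breaks a pattern also breaks the corpus.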
For humans, the primary reference is the online user guide at https://datafusion.apache.org/python. `SKILL.md` is written in a dense, skill-oriented format for agent consumption.
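For context, the skill-ecosystem convention expects a short YAML frontmatter block at the top of `SKILL.md`. The field values below are illustrative placeholders, not the shipped file:

```yaml
---
name: datafusion-python
description: >
  Write idiomatic DataFusion Python DataFrame API code: core abstractions,
  SQL-to-DataFrame mappings, migration tables, and common pitfalls.
---
```

The frontmatter is what lets skill registries index the file automatically; the body below it stays plain Markdown.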
How agents discover it
Discovery is layered. Each layer catches agents the prior ones missed, so no single channel is load-bearing.
| Layer | Mechanism | Target audience |
| --- | --- | --- |
| 1 | `npx skills add apache/datafusion-python` — reads `SKILL.md` at repo root via the skill-ecosystem convention | Agents with skill-registry support (Claude Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, Aider, opencode, +18 others) |
| 2 | Community aggregators auto-scrape repos with a `SKILL.md` (skillsmp, awesome-claude-skills, claudemarketplaces) | Users browsing skill indexes |
| 3 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 4 | Docs site page that `{include}`s `SKILL.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | Enriched `datafusion.__doc__` with a pointer to the online `SKILL.md` URL | Agents that introspect the installed package (`help(datafusion)`, IDE hovers, PyPI rendering) |
| 6 | README section explaining the install paths (`npx skills add` preferred; manual pointer fallback) | Users arriving from PyPI/README before any agent is wired up |
| 7 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to apache/datafusion) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
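Per the llmstxt.org convention, an `llms.txt` file is plain Markdown: an H1 title, an optional blockquote summary, and H2 sections containing link lists. The layer-7 hub might look roughly like this — the subproject entries and their URLs are illustrative, not the final file:

```markdown
# Apache DataFusion

> Fast, extensible query engine built on Apache Arrow.

## Subprojects

- [DataFusion Python](https://datafusion.apache.org/python/llms.txt): Python bindings and DataFrame API
- [DataFusion Comet](https://datafusion.apache.org/comet/llms.txt): Spark accelerator plugin
```

Because the format is just Markdown links, any agent that can fetch and read a page can traverse from the hub to a subproject guide with no special tooling.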
Previously the plan relied on shipping the guide inside the wheel so agents that introspect the installed package would find it. In practice, no mainstream agent walks `site-packages/*/AGENTS.md`, so layer 5 is narrowed to what the module docstring can carry (via `help()` / introspection / IDE tooling), and the file itself is distributed at the repo level via layers 1–2.
Task list
- `SKILL.md` at repo root + package docstring entry point (Add SKILL.md and enrich package docstring #1497)
- Docs site (`{include}` + `llms.txt`) + AI skills + CLAUDE.md
- `apache/datafusion` `llms.txt` hub (separate repo)
- `.claude-plugin/plugin.json` for Claude Code plugin marketplace (optional)

Detailed plan to follow as a comment.
Changes from the original plan
- PR 1a landed as `SKILL.md` at the repo root (not `python/datafusion/AGENTS.md` shipped inside the wheel). Empirical testing showed no mainstream agent walks `site-packages` for `AGENTS.md`, so the in-wheel distribution channel was aspirational.
- PR 1c changed from a `datafusion-init` console script to a README section. With the skill ecosystem handling project-root pointer writing automatically, a console script's remaining audience (Python-only, no-Node, agent-agnostic users) is narrow enough that a README edit covers it with less surface area.
- PR 7 added — optional Claude Code plugin marketplace entry for `/plugin install datafusion-python` UX.