Problem
More and more users reach for LLMs to generate DataFusion Python code. Today, agents are excellent at writing SQL but struggle to produce idiomatic DataFrame API code — they either transliterate SQL literally or invent patterns that don't match the library's grain. Nothing the project currently ships reliably surfaces idiomatic guidance to the agent at the moment it is writing code.
Goals
- Establish a single, authoritative guide for writing idiomatic DataFusion Python code.
- Make that guide discoverable through every channel agents actually use — not just the channels we wish they used.
- Validate the guide against a reference corpus (TPC-H) so it stays honest as the API evolves.
- Extend the same pattern across the wider DataFusion family (Ballista, Comet, Ray, etc.) via an upstream `llms.txt` hub.
Where idiomatic code is defined
Single source of truth: `SKILL.md` at the repo root.
This one file — kept at the repo root with YAML frontmatter for skill-ecosystem discovery, and included verbatim on the docs site — is the canonical guide for agents. It contains:
- Core abstractions (`SessionContext` / `DataFrame` / `Expr` / `functions`) and import conventions.
- A quick-start example that works end-to-end.
- SQL-to-DataFrame reference table (for users who think in SQL first).
- Migration sections for users coming from Spark, Pandas, and Polars — same shape as the SQL table, column-mapping each API's idioms to DataFusion's.
- Common pitfalls caught in real agent sessions: `&`/`|`/`~` vs Python `and`/`or`/`not`, `lit()` wrapping, decimal/float literal interactions, `F.substring` vs `F.substr` arity, join-key disambiguation, date-vs-timestamp arithmetic rules, etc.
- Idiomatic patterns: fluent chaining, window functions in place of correlated subqueries, semi/anti joins in place of `EXISTS`/`NOT EXISTS`, `aggregate().filter()` for `HAVING`, variable assignment for CTEs.
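The first pitfall in that list can be made concrete without the library installed. The sketch below mimics how expression objects such as DataFusion's `Expr` typically behave: Python's `and`/`or`/`not` force truth-testing (which an unevaluated expression cannot answer), while `&`/`|`/`~` dispatch to operator overloads and build a combined expression. The `Expr` class here is a hypothetical stand-in, not the real API.

```python
# Hypothetical stand-in for an expression object like DataFusion's Expr.
class Expr:
    def __init__(self, desc: str):
        self.desc = desc

    def __bool__(self):
        # Python's `and`/`or`/`not` call this; an unevaluated expression
        # has no truth value, so expression libraries raise here.
        raise TypeError("use & | ~ instead of and/or/not on expressions")

    def __and__(self, other):
        # `&` builds a new combined expression instead of truth-testing.
        return Expr(f"({self.desc} AND {other.desc})")


a, b = Expr("x > 1"), Expr("y < 5")

combined = a & b   # fine: builds the combined expression "(x > 1 AND y < 5)"
print(combined.desc)

try:
    a and b        # raises: `and` truth-tests the left operand first
except TypeError as e:
    print(e)
```

This is exactly the failure mode agents hit when they transliterate SQL's `AND` into Python's `and`; the fix is mechanical but must be stated explicitly in the guide.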
The TPC-H example suite (`examples/tpch/`) is the reference corpus: every query is written as idiomatic DataFrame code, validated by answer-file comparison, and where the optimized logical plan differs from the SQL version, the difference is documented in a comment. This gives the `SKILL.md` guidance a continuously verified ground truth.
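The answer-file comparison can be pictured as a small harness like the one below. The `|`-delimited layout and the relative tolerance are assumptions for illustration, not the actual `examples/tpch/` scripts:

```python
import csv
import math


def rows_match(result_rows, answer_path, rel_tol=1e-6):
    """Compare query output against a TPC-H answer file (illustrative sketch).

    Numeric cells are compared with a relative tolerance; everything else
    must match exactly after stripping whitespace.
    """
    with open(answer_path, newline="") as f:
        expected = list(csv.reader(f, delimiter="|"))
    if len(result_rows) != len(expected):
        return False
    for got, want in zip(result_rows, expected):
        for g, w in zip(got, want):
            try:
                if not math.isclose(float(g), float(w), rel_tol=rel_tol):
                    return False
            except (TypeError, ValueError):
                # Non-numeric cell: fall back to exact string comparison.
                if str(g).strip() != str(w).strip():
                    return False
    return True
```

The point of the harness is that the guide's examples are never allowed to drift from verified answers: a change in the API that breaks a pattern also breaks the corpus.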
For humans, the primary reference is the online user guide at https://datafusion.apache.org/python. `SKILL.md` is written in a dense, skill-oriented format for agent consumption.
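For context, the skill-ecosystem convention expects a short YAML frontmatter block at the top of `SKILL.md`. The field values below are illustrative placeholders, not the shipped file:

```yaml
---
name: datafusion-python
description: >
  Write idiomatic DataFusion Python DataFrame API code: core abstractions,
  SQL-to-DataFrame mappings, migration tables, and common pitfalls.
---
```

The frontmatter is what lets skill registries index the file automatically; the body below it stays plain Markdown.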
How agents discover it
Discovery is layered. Each layer catches agents the prior ones missed, so no single channel is load-bearing.
| Layer | Mechanism | Target audience |
| --- | --- | --- |
| 1 | `npx skills add apache/datafusion-python` — reads `SKILL.md` at repo root via the skill-ecosystem convention | Agents with skill-registry support (Claude Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, Aider, opencode, +18 others) |
| 2 | Community aggregators auto-scrape repos with a `SKILL.md` (skillsmp, awesome-claude-skills, claudemarketplaces) | Users browsing skill indexes |
| 3 | https://datafusion.apache.org/python/llms.txt published on the docs site (llmstxt.org convention) | Agents that auto-fetch `/llms.txt` from documentation sites |
| 4 | Docs site page that `{include}`s `SKILL.md` | Humans and WebSearch-capable agents browsing the docs |
| 5 | Enriched `datafusion.__doc__` with a pointer to the online `SKILL.md` URL | Agents that introspect the installed package (`help(datafusion)`, IDE hovers, PyPI rendering) |
| 6 | README section explaining the install paths (`npx skills add` preferred; manual pointer fallback) | Users arriving from PyPI/README before any agent is wired up |
| 7 | https://datafusion.apache.org/llms.txt upstream hub (separate PR to apache/datafusion) pointing at each subproject's `llms.txt` | Agents that land anywhere in the DataFusion ecosystem |
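Per the llmstxt.org convention, an `llms.txt` file is plain Markdown: an H1 title, an optional blockquote summary, and H2 sections containing link lists. The layer-7 hub might look roughly like this — the subproject entries and their URLs are illustrative, not the final file:

```markdown
# Apache DataFusion

> Fast, extensible query engine built on Apache Arrow.

## Subprojects

- [DataFusion Python](https://datafusion.apache.org/python/llms.txt): Python bindings and DataFrame API
- [DataFusion Comet](https://datafusion.apache.org/comet/llms.txt): Spark accelerator plugin
```

Because the format is just Markdown links, any agent that can fetch and read a page can traverse from the hub to a subproject guide with no special tooling.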
Previously the plan relied on shipping the guide inside the wheel so agents that introspect the installed package would find it. In practice, no mainstream agent walks `site-packages/*/AGENTS.md`, so layer 5 is narrowed to what the module docstring can carry (via `help()` / introspection / IDE tooling), and the file itself is distributed at the repo level via layers 1–2.
Task list
- `SKILL.md` at repo root + package docstring entry point (Add SKILL.md and enrich package docstring #1497)
- Docs site (`{include}` + `llms.txt`) + AI skills + CLAUDE.md
- `apache/datafusion` `llms.txt` hub (separate repo)
- `.claude-plugin/plugin.json` for Claude Code plugin marketplace (optional)

Detailed plan to follow as a comment.
Changes from the original plan
- PR 1a landed as `SKILL.md` at the repo root (not `python/datafusion/AGENTS.md` shipped inside the wheel). Empirical testing showed no mainstream agent walks `site-packages` for `AGENTS.md`, so the in-wheel distribution channel was aspirational.
- PR 1c changed from a `datafusion-init` console script to a README section. With the skill ecosystem handling project-root pointer writing automatically, a console script's remaining audience (Python-only, no-Node, agent-agnostic users) is narrow enough that a README edit covers it with less surface area.
- PR 7 added — optional Claude Code plugin marketplace entry for `/plugin install datafusion-python` UX.