-
Notifications
You must be signed in to change notification settings - Fork 0
prd
- Executive Summary
- Project Classification
- Success Criteria
- User Journeys
- Scope & Phasing
- Functional Requirements
- Non-Functional Requirements
- Open Questions
- Risk Mitigation
Add an LLM-assisted documentation augmentation pipeline to packages/bundler so generated UTDK packages can ship with materially better docs than raw OpenAPI alone. The system should first gather and validate documentation sources into a locally cached Markdown corpus, then use that corpus plus the OpenAPI spec to generate package-level docs, grouped doc trees, concise top-level READMEs, and improved package metadata.
- Two-step architecture - Separating crawl/validation from augmentation makes the pipeline inspectable, reproducible, and easier to debug than a single opaque generation pass.
- Docs as first-class package artifacts - Generated packages should carry structured Markdown docs that map closely to OpenAPI request groupings rather than burying context in a single flat README.
- Reusable source memory - Validated links, auth pages, examples, and gotchas should be persisted so future crawls and regeneration runs improve instead of restarting from zero.
- Human-steerable augmentation - Maintainers must be able to inspect cached sources, edit prompts/style guidance, and re-run augmentation without manually rediscovering the same documentation context.
Technical Type: Developer Tooling / Code Generation Pipeline
Domain: API SDK generation, documentation synthesis, LLM-assisted developer experience
Complexity: High - this spans crawling, source validation, cached corpora, prompt-driven artifact generation, and repeatable package-level outputs across many APIs.
Project Context: Brownfield - packages/bundler already generates UTDK packages from OpenAPI definitions, and existing generated packages already include openapi.json, metadata.ts, minimal README.md, and package metadata.
A user achieves success when they can:
- Generate a richer UTDK package from a provider without manual doc hunting - the run produces a concise top-level README, grouped Markdown docs, auth guidance, examples, and known gotchas from validated sources.
-
Inspect and improve the documentation inputs before regeneration - the maintainer can review cached Markdown source files, inspect an orienting
index.md, update prompt/style inputs, and re-run augmentation with predictable results. - Trust the output metadata and source links - generated package metadata identifies the validated docs sources and useful auth or registration links instead of forcing downstream users to rediscover them.
The system is successful when:
- Documentation loading and augmentation are separate runnable stages with durable intermediate artifacts on disk.
- Cached documentation sources are organized in a parallel structure to the UTDK packages and can be re-used on subsequent runs.
- Generated docs align to API groupings closely enough that a maintainer can trace a generated Markdown page back to a coherent OpenAPI request grouping.
- Prompting for artifact shape/style is explicit and separately maintained from the crawl/indexing prompt and source-loading logic.
- For a target provider run, at least one top-level README, one orienting
index.md, and one or more grouped Markdown docs are generated when supporting source material exists. - Every persisted external reference included in generated metadata or docs is validated during the load phase and recorded with its canonical URL.
- Re-running augmentation without re-crawling uses the cached corpus and completes without needing to rediscover source pages manually.
- Maintainers can execute each phase independently in local development and inspect resulting artifacts directly from the filesystem.
A maintainer generates a new provider package from packages/bundler and sees that the OpenAPI-backed output is technically correct but thin on guidance, auth setup, and examples. They run the documentation load phase to collect the OpenAPI source, official docs pages, auth registration pages, and related examples into a local Markdown cache that mirrors the package path. They inspect index.md to confirm the crawler found the right areas, then run the augmentation phase to produce grouped docs and a concise README tailored to the generated SDK surface. Success means the package is now publishable with materially better developer guidance and without the maintainer hand-authoring every doc page.
Requirements Revealed: staged pipeline, local cache, orienting index, grouped doc output, README synthesis, metadata enrichment, source validation.
A curator revisits an already generated provider such as github or google/books after noticing gaps in auth instructions or missing examples. Instead of starting from scratch, they open the cached Markdown corpus, add or prune references, and adjust the augmentation style prompt to better fit the package. They rerun only the augmentation stage and compare the new generated docs and metadata to the prior output. Success means improvement is incremental, low-friction, and grounded in preserved source context rather than one-off manual edits.
Requirements Revealed: cache reuse, prompt separation, targeted reruns, editable local artifacts, deterministic regeneration boundaries.
A developer installs a generated UTDK package and needs to know how to authenticate, what prerequisite project setup is required, and which operation groups matter for a common task. They begin with the package README, which gives a short overview, registration/auth links, and the key entry points, then drill into grouped Markdown pages for examples, gotchas, and operation-specific guidance. When docs reference external sources, those links are valid and relevant because they were explicitly validated during package generation. Success means the developer can move from install to first working call quickly without trawling the vendor site themselves.
Requirements Revealed: concise README, auth/setup guidance, valid links, grouped docs, examples, gotchas, user-facing clarity.
| User Type | Key Requirements |
|---|---|
| UTDK Maintainer | staged generation, validated source loading, cache inspection, grouped output docs, README synthesis, metadata enrichment |
| Package Curator | cache reuse, prompt editing, targeted reruns, local artifact inspection, regeneration controls |
| SDK Consumer | concise overview, auth guidance, examples, gotchas, grouped documentation, trustworthy links |
Core Deliverables:
- Add a load phase in
packages/bundlerthat gathers OpenAPI-adjacent documentation context, validates discovered references, and persists source pages as Markdown in a local cache tree parallel topackages/utdk/.... - Add an augmentation phase that consumes the cached corpus plus an explicit style/structure prompt and emits package documentation artifacts.
- Generate an orienting
index.mdper package cache to summarize what source material was collected and how a subsequent agent or rerun should use it. - Generate a short top-level package
README.mdinformed by the augmentation corpus and improve package metadata /package.jsonfields using validated source context. - Preserve enough metadata about discovered links, auth/setup URLs, and source provenance to support subsequent crawls and regeneration runs.
Quality Gates:
- Both phases run independently in local development.
- The load phase fails loudly or records warnings when references cannot be validated.
- Generated docs are structured by API/request grouping where feasible rather than emitted as one monolithic document.
- Maintainers can inspect raw cached Markdown and generated artifacts without proprietary tooling.
v2: Smarter refresh and merge behavior
- Incremental crawling that refreshes only stale or invalidated sources.
- Diff-aware regeneration that preserves curated local edits where appropriate.
- Stronger provenance reporting for which generated passages came from which sources.
v3: Quality feedback loop
- Automated eval or rubric scoring of generated docs quality, coverage, and factual grounding.
- Heuristics or learned strategies for mapping vendor doc taxonomies to UTDK/OpenAPI operation groupings.
- A reusable provider-doc knowledge layer that steadily improves all generated UTDK packages, feeds registry search/docs UX, and supports downstream agents that need trusted API guidance beyond raw schemas.
FR-1.1: The bundler must provide a distinct load phase that runs before augmentation and gathers documentation context for a target provider.
- The phase must at minimum ingest the provider OpenAPI document and any configured or discovered associated documentation pages.
- The phase must be runnable independently of augmentation.
FR-1.2: The load phase must validate discovered references before persisting them as trusted inputs.
- Validation must cover URL reachability and record the canonical resolved URL when available.
- Invalid or unreachable references must be excluded from trusted metadata or explicitly marked as failed.
- MVP validation is satisfied by successful HTTP resolution plus canonical URL capture.
FR-1.3: The load phase must persist fetched documentation as local Markdown files in a cache tree parallel to the generated UTDK package structure.
- The cache layout must make it easy to locate corpus files for a provider such as
packages/utdk/google/booksand its documentation cache peer. - The stored corpus must include an
index.mdthat summarizes source coverage, major themes, and suggested starting points for a later augmentation run. - The canonical cache root for MVP is
{repoRoot}/.registry, with provider paths parallel to the package path underpackages/utdk/....
FR-1.4: The load phase must capture useful source categories when available.
- Categories include overview docs, request/endpoint guides, examples, auth/project-registration instructions, rate limit or quota guidance, and notable gotchas.
- The system may store source metadata even when some categories are missing.
FR-1.5: The load phase must support seeded documentation inputs from existing repo metadata.
- Source discovery must combine the original provider registry entry with the current generated package metadata when it exists.
- The input registry entry acts as the seeded override/input source, while the current package metadata is otherwise the source of truth for continuing runs.
FR-2.1: The bundler must provide a distinct augmentation phase that consumes the cached corpus plus the OpenAPI data to generate package documentation artifacts.
- The phase must be runnable without re-crawling when the local cache already exists.
- The phase must follow the package
index.mdas the primary orienting document for cache consumption.
FR-2.2: Augmentation must be guided by a clearly separate prompt that defines the style, structure, and artifact expectations of the generated output.
- The style/structure prompt must not be conflated with the source-loading or crawling instructions.
- Maintainers must be able to inspect and revise this prompt independently of the load phase logic.
FR-2.3: The augmentation phase must generate Markdown docs in a tree structure that maps as closely as practical to OpenAPI request groupings.
- When a provider has coherent groups, each major group should receive its own Markdown page or subtree.
- When the grouping is ambiguous, the system should fall back to a documented grouping strategy rather than emit arbitrary file organization.
FR-2.4: Generated docs must prioritize actionable developer guidance beyond what raw OpenAPI already provides.
- This includes examples, auth/setup instructions, useful links, gotchas, and workflow-oriented explanations where supported by sources.
- Generated text should remain concise and grounded in validated source material.
FR-2.5: Generated grouped docs must be emitted into the package directory under a docs/ folder.
- The generated
README.mdmust act as the short entry point and link into the package-localdocs/tree.
FR-3.1: The augmentation pipeline must generate a short, easily consumed top-level README.md for each package.
- The README should orient a developer quickly: what the package is, how to authenticate or register, where to start, and where to drill deeper.
- The README should link into the grouped Markdown docs when they exist.
FR-3.2: The pipeline must enrich package metadata and package.json fields using validated source information.
- Candidate fields include description, homepage, keywords, auth-related links, or a package-specific metadata block for documentation provenance.
- Metadata must distinguish validated references from inferred or unavailable information.
FR-3.3: The generated package should retain enough structured metadata to inform future crawls and augmentation runs.
- Stored metadata should include discovered source URLs, validation status, and source categories when known.
- Downstream steps should be able to use this metadata as seed input instead of rediscovering the same sources.
FR-3.4: The pipeline must track source freshness and overwrite safety for generated docs.
- The system must hash the relevant OpenAPI content and store that hash in metadata.
- If generated docs are not stale, regeneration that would overwrite existing README or docs content must require an explicit overwrite flag.
- When the OpenAPI content hash changes, the package should default to being considered stale and eligible for regeneration.
FR-4.1: Maintainers must be able to run the load and augmentation phases independently in local development for a single provider.
- Each phase should expose an obvious local entrypoint from
packages/bundler. - The workflow must support iterating on one package without touching the entire generated corpus.
FR-4.2: The system must make intermediate and final artifacts easy to inspect.
- Cached source Markdown,
index.md, generated docs, README output, and metadata changes must all be visible on disk. - Failures and warnings should point maintainers to the relevant local artifact or source URL.
FR-4.3: The system must support augmentation-oriented testing.
- It should be possible to verify the load phase independently from the augmentation phase.
- It should be possible to exercise prompt and output changes without requiring another full crawl when source inputs are unchanged.
FR-5.1: The pipeline must preserve source provenance well enough to audit where generated guidance came from.
- At minimum, maintainers should be able to map generated docs back to source files or validated URLs.
FR-5.2: The augmentation process must avoid silently fabricating auth instructions, examples, or gotchas when source support is weak.
- Missing information should be surfaced as gaps, warnings, or omissions rather than confident invention.
- Performance: Running augmentation from an existing cache for a single provider should be materially faster than a fresh crawl, and the load phase should avoid unnecessary refetching when source metadata is still valid.
- Reliability: Partial crawl or validation failures must not corrupt existing trusted cache contents; the system should emit inspectable warnings and keep phase boundaries explicit.
- Security: Fetched content and external references must be treated as untrusted input; the system must validate and sanitize stored source material before it influences generated docs or metadata.
- Maintainability: Prompt assets, cache artifacts, and generation logic should be clearly separated so maintainers can evolve crawl behavior, output style, and package templates independently. Add focused tests around phase boundaries, cache/index generation, metadata/reference validation, and stale/overwrite detection via stored hashes.
- None at this stage. The initial open questions were resolved during PRD review and folded into the functional requirements:
- Cache root is
{repoRoot}/.registry. - Generated grouped docs live in each package's
docs/folder. - MVP reference validation requires HTTP success plus canonical URL capture.
- Source seeding combines the registry entry with current package metadata, with the registry entry acting as the seeded override/input.
- Regeneration uses stored OpenAPI hashes to detect staleness and requires an explicit overwrite flag for non-stale outputs.
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM-generated docs overstate or invent unsupported guidance | Medium | High | Require validated source inputs, preserve provenance, and prefer omission/warnings over unsupported synthesis |
| Vendor documentation structures do not map cleanly to OpenAPI groupings | High | Medium | Define a fallback grouping strategy and capture grouping decisions in index.md or metadata |
| Crawl outputs become noisy, stale, or hard to inspect | Medium | High | Keep a simple parallel cache tree, generate orienting index.md, and separate load from augmentation |
| Auth/setup links change frequently and rot metadata quality | Medium | Medium | Revalidate links during load, store canonical URLs, and mark stale references explicitly |
| The workflow becomes too expensive or slow to use during iteration | Medium | Medium | Allow single-provider runs, cache reuse, and isolated augmentation reruns with no forced recrawl |