cookiebot — reference corpus

RAG-ready notes distilled from three baking-science references, for grounding a cookie assistant in why a formula behaves the way it does (spread, chew vs. crisp, sugar chemistry, fat state, leavening, browning) rather than just recipes.

This is a fork of foccaciabot — same three academic references, re-cut for cookies. The interactive build sheet lives in cookie-build-sheet.jsx; the science behind every dial is drawn from the corpus below.

What's tracked vs. ignored

| Path | In git? | What it is |
| --- | --- | --- |
| `.epub`, `.pdf` | ❌ ignored | The purchased source books (binaries). Never committed. |
| `notes/` | ✅ committed | Cleaned, per-section Markdown extracted from the books (human-reviewable). |
| `data/chunks.jsonl` | ✅ committed | The embeddable RAG corpus: one chunk per line, with metadata. |
| `data/manifest.json` | ✅ committed | Corpus stats + per-book index. |
| `scripts/` | ✅ committed | The reproducible extraction + chunking pipeline. |
| `cookie-build-sheet.jsx` | ✅ committed | The interactive calculator (React, single file). |

The source books are licensed/purchased copies kept locally; only the derived notes are version-controlled.

Sources

Same three references as foccaciabot, re-weighted for cookies — pastry science moves to the centre, bread to the edge:

| Slug | Book | Lang | Role |
| --- | --- | --- | --- |
| `bressanini-scienza-della-pasticceria` | Bressanini, La scienza della pasticceria (Gribaudo, 2014) | it | core: the academic pastry reference (pasta frolla, biscotti, sugars, leavening, fats). |
| `mcgee-on-food-and-cooking` | McGee, On Food and Cooking (1984) | en | high: sugars, Maillard/caramel browning, butter & fats, eggs, cookies/cakes/batters. |
| `cauvain-technology-of-breadmaking` | Cauvain & Young, Technology of Breadmaking (Springer, 1998) | en | general: flour/gluten/leavening/fat fundamentals that transfer to cookie dough. |

Current corpus: 748 notes → 1,737 chunks (~1.0M tokens), of which 1,040 are flagged cookie_relevant. Regenerate the exact numbers with the pipeline below.

The notes themselves are book extractions and so are identical to foccaciabot's (same purchased books). What changed for cookies is the relevance weighting in scripts/common.py and the cookie_relevant keyword filter in scripts/build_corpus.py — so retrieval now favours sugar, fat, egg, leavening and browning science over proofing and crumb.
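As a rough illustration of what a chunk-level keyword filter of this kind might look like, here is a minimal sketch. The stem list, the `PATTERN` name and the `is_cookie_relevant` function are all hypothetical; the actual filter in scripts/build_corpus.py is not reproduced in this README and may work differently.

```python
import re

# Hypothetical keyword stems (English and Italian). The real list lives in
# scripts/build_corpus.py and is not shown here.
COOKIE_STEMS = (
    "sugar", "zuccher", "butter", "burro", "egg", "uov",
    "leaven", "lievit", "maillard", "caramel", "cookie", "biscott",
)

# Match any stem at the start of a word, case-insensitively.
PATTERN = re.compile(r"\b(?:" + "|".join(COOKIE_STEMS) + r")", re.IGNORECASE)

def is_cookie_relevant(text):
    """Return True when the chunk text mentions at least one cookie-science stem."""
    return PATTERN.search(text) is not None
```

Matching on stems keeps the list short across singular, plural and Italian inflections, at the cost of occasional false positives (for example "eggplant").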

Pipeline

pip install -r requirements.txt          # beautifulsoup4, lxml, PyMuPDF

python scripts/extract_epub.py           # EPUBs  -> notes/<slug>/*.md
python scripts/extract_pdf.py            # PDF    -> notes/<slug>/*.md
python scripts/build_corpus.py           # notes/ -> data/chunks.jsonl + manifest.json

notes/ is the source of truth; data/ is fully derived from it. Re-running is idempotent for a given set of source books. If you only adjusted the relevance filter (the common cookie tweak), just re-run build_corpus.py — it reads the already-extracted notes/ and needs neither the source books nor PyMuPDF.

How extraction works

EPUB (extract_epub.py): reads the spine + NCX table of contents directly (ebooklib chokes on these files' broken font manifests). Sectioning is TOC-driven via anchor IDs, because every page repeats the book title as an <h1>. Handles both nested TOCs (Bressanini) and flat TOCs where chapter titles are listed before a separate flat list of sub-sections (McGee), mapping sub-sections to chapters by spine position. Decodes cp1252-mislabeled-as-UTF-8 bytes so Italian/French accents survive.
PDF (extract_pdf.py): the book has no bookmarks, so chapters are detected from numbered title lines ("3 Functional ingredients") validated by monotonic numbering. Running headers/footers (page numbers, the ALL-CAPS running title) are stripped; printed page numbers are captured as inline markers so each chunk gets a precise page for citation. Front matter (title/copyright/contents OCR noise) is skipped.
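
The chapter detection described for the PDF can be sketched roughly as follows. This is an assumption-laden illustration, not the code in extract_pdf.py: the regular expression, the `detect_chapters` name, and the reading of "monotonic" as "strictly consecutive, starting at 1" are all hypothetical.

```python
import re

# Assumed shape of a chapter title line: a 1-2 digit number, whitespace,
# then a title that starts with a capital letter.
CHAPTER_LINE = re.compile(r"^(\d{1,2})\s+([A-Z][^\n]{2,80})$")

def detect_chapters(lines):
    """Return (number, title, line_index) for lines that look like chapter titles.

    A candidate is accepted only if its number is exactly one more than the
    last accepted chapter, which rejects stray numbered lines such as table
    rows or contents entries that appear out of sequence.
    """
    chapters = []
    expected = 1
    for i, line in enumerate(lines):
        match = CHAPTER_LINE.match(line.strip())
        if not match:
            continue
        number = int(match.group(1))
        if number == expected:
            chapters.append((number, match.group(2).strip(), i))
            expected += 1
    return chapters
```

The sequence check is what makes a loose pattern usable: a line such as "12 Tables of stuff" matches the regex but is skipped unless chapter 11 has just been seen.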

Chunk schema (`data/chunks.jsonl`)

Each line is one JSON object:

{
  "id": "bressanini-...:0012:03:ab12cd34",  // stable: book:order:index:hash
  "book_slug": "bressanini-scienza-della-pasticceria",
  "book": "La scienza della pasticceria",
  "author": "Dario Bressanini",
  "year": 2014,
  "lang": "it",
  "relevance": "core",                    // book-level
  "cookie_relevant": true,                // chunk-level cookie filter
  "chapter": "...",
  "heading": "...",
  "breadcrumb": ["...", "..."],
  "page": 88,                             // printed page (PDF) or null (EPUB)
  "citation": "Dario Bressanini, La scienza della pasticceria (2014), ..., p.88",
  "source_note": "notes/bressanini-.../0013-....md",
  "tokens_est": 587,
  "text": "Lo zucchero non è solo un dolcificante..."
}
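
Because the id follows the book:order:index:hash layout, it can be taken apart again when debugging retrieval results. A minimal sketch, assuming the last three fields never contain a colon; `ChunkId` and `parse_chunk_id` are hypothetical helpers, not part of the pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkId:
    book: str    # book slug
    order: int   # second field of the id
    index: int   # third field of the id
    digest: str  # trailing hash

def parse_chunk_id(chunk_id):
    """Split 'book:order:index:hash' from the right so the slug is kept whole."""
    book, order, index, digest = chunk_id.rsplit(":", 3)
    return ChunkId(book, int(order), int(index), digest)
```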

Chunks target ~800 tokens with ~100 tokens of overlap, packed on paragraph boundaries (oversized paragraphs are sentence-split). Token counts are estimated as chars/4 to avoid a tokenizer dependency.
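
The packing strategy above can be approximated with the sketch below. It is an illustration under stated assumptions rather than the code in scripts/build_corpus.py: the function names are hypothetical, sentences are split with a simple punctuation regex, and the overlap is carried as whole trailing units, so it can be smaller than the requested amount when the last paragraph is long.

```python
import re

def estimate_tokens(text):
    """Approximate token count as characters / 4 (no tokenizer needed)."""
    return len(text) // 4

def split_oversized(paragraph, target):
    """Break a paragraph longer than `target` tokens into groups of sentences."""
    if estimate_tokens(paragraph) <= target:
        return [paragraph]
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
        candidate = f"{current} {sentence}".strip()
        if current and estimate_tokens(candidate) > target:
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

def chunk_text(text, target=800, overlap=100):
    """Pack paragraphs into chunks of about `target` tokens with trailing overlap."""
    units = []
    for paragraph in re.split(r"\n\s*\n", text):
        if paragraph.strip():
            units.extend(split_oversized(paragraph.strip(), target))

    chunks, current = [], []
    for unit in units:
        if current and estimate_tokens("\n\n".join(current + [unit])) > target:
            chunks.append("\n\n".join(current))
            # Carry whole trailing units worth at most `overlap` tokens forward.
            carried = []
            for previous in reversed(current):
                if estimate_tokens("\n\n".join([previous] + carried)) > overlap:
                    break
                carried.insert(0, previous)
            current = carried
        current.append(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `chunk_text(note_text)` with the defaults mirrors the ~800/~100 targets; a chunk can exceed the target slightly because the carried overlap is added on top of new material.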

Consuming the corpus

chunks.jsonl is embedder-agnostic. Typical next step: embed each chunk's text field, store the vector alongside the metadata, and at query time filter on cookie_relevant (or weight by relevance) before vector search. The citation field is ready to surface in answers.
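
A minimal sketch of that consumption step, using only the standard library. The weights in `RELEVANCE_WEIGHT`, and the `load_chunks` and `rerank` helpers, are assumptions for illustration; the embedding and vector search themselves are left to whichever embedder and store you choose, and the weighting here is applied to the similarity scores that store returns, which is only one possible reading of "weight by relevance".

```python
import json

# Assumed weights for the book-level "relevance" field; tune to taste.
RELEVANCE_WEIGHT = {"core": 1.0, "high": 0.8, "general": 0.5}

def load_chunks(path="data/chunks.jsonl", cookie_only=True):
    """Yield chunk dicts from the JSONL corpus, optionally cookie-relevant only."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            chunk = json.loads(line)
            if cookie_only and not chunk.get("cookie_relevant", False):
                continue
            yield chunk

def rerank(hits):
    """Sort (chunk, similarity) pairs by similarity scaled by the book weight."""
    def score(hit):
        chunk, similarity = hit
        return similarity * RELEVANCE_WEIGHT.get(chunk.get("relevance"), 0.5)
    return sorted(hits, key=score, reverse=True)
```

After reranking, the top chunks' citation values can be passed through to the answer unchanged.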

The build sheet

cookie-build-sheet.jsx is a single self-contained React component — drive the qualities (spread, chew vs. crisp vs. cakey, brown-sugar share, butter state, leavening, browning) and the style; the recipe and method regenerate live. Baker's percentages are all relative to total flour = 100%. Run it locally with Vite (npm install && npm run dev), or see it embedded on the website at /kitchen/cookie-calculator.
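
To make the baker's-percentage convention concrete, here is a small worked sketch in Python (the pipeline's language, not the JSX of the build sheet). The `scale_formula` helper and the example formula are hypothetical and are not taken from cookie-build-sheet.jsx; they only show that every ingredient is expressed relative to flour = 100%.

```python
def scale_formula(percentages, flour_grams):
    """Convert baker's percentages (flour = 100) into gram weights."""
    return {
        name: round(flour_grams * pct / 100, 1)
        for name, pct in percentages.items()
    }

# Hypothetical formula for illustration only.
formula = {
    "flour": 100,
    "butter": 60,
    "brown sugar": 50,
    "white sugar": 30,
    "egg": 22,
    "salt": 2,
}

weights = scale_formula(formula, flour_grams=250)
total_dough = sum(weights.values())
```

With 250 g of flour, butter at 60% comes to 150 g; changing the flour weight rescales every ingredient while the ratios stay fixed.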

