diff --git a/docs/superpowers/specs/2026-06-13-reading-intake-friction-fixes-design.md b/docs/superpowers/specs/2026-06-13-reading-intake-friction-fixes-design.md new file mode 100644 index 0000000..c4d34a8 --- /dev/null +++ b/docs/superpowers/specs/2026-06-13-reading-intake-friction-fixes-design.md @@ -0,0 +1,163 @@ +# CrickNote Spec: Reading-Intake Friction Fixes + +**Date:** 2026-06-13 +**Status:** Phases 1, 2b, 3 implemented on branch `feat/reading-intake-friction-fixes`; Phase 2a (pagination) and Phase 4 (zotero_intake) deferred after a value re-assessment (see §4). +**Scope:** Remove six friction points in the reading-intake pipeline surfaced by real-usage feedback (analysing a 20-page paper from Zotero): a too-low source-text cap, a documented Zotero path that produces an un-compilable note, logs contaminating stdout JSON, awkward `zotero_prepare_bundle` ergonomics, no section-level write, and figure-locating during Figure Map drafting. Adds a thin `zotero_intake` orchestrator and a body-only write tool. +**Depends on:** Spec 2 — Knowledge Base Workflow (CREATE/compile), Zotero Integration (2026-04-18), Figure Map (2026-06-13) +**Does NOT touch:** DB schema, KB pipeline (`kb_suggest`/`kb_apply`/mapping artifacts), serial numbering, the CREATE acronym framework, frontmatter field set + +--- + +## 1. Problem + +Feedback from an end-to-end intake of a single paper identified six friction points. The author believed the CLI was unpatchable ("not in this vault"); in fact all six live in this repo's `src/` and are fixable. Verified against source: + +| # | Friction | Verified location | +|---|----------|-------------------| +| 1 | `compile_reading_note` caps source text at 10k tokens / 40k chars per source — a 29k-token paper returned only through ~Fig 6 (no Discussion/Methods/Figs 7–8). A second hidden cap parses only the first 20 PDF pages. | `PER_SOURCE_TOKEN_CAP = 10_000`, `SESSION_TOKEN_CAP = 30_000` ([source-loader.ts:14](../../../src/knowledge/source-loader.ts)); `pdfParse(buffer, { max: 20 })` (line 46) | +| 2 | The documented Zotero path (`fetch → prepare_bundle → create_reading_note`) creates a note with no `sources:`, so `compile` returns `sources_missing`. `create_reading_note` only sets sources when explicitly passed; the skill never passes them. | [SKILL.md:14,19](../../../skills/cricknote-reading-intake/SKILL.md); [templates.ts:53](../../../src/agent/tools/templates.ts); [kb-tools.ts:79](../../../src/agent/tools/kb-tools.ts) | +| 3 | Non-error logs are written to stdout — the same stream as the JSON result — so every captured call has an `INFO [source-loader] …` line prepended, breaking `json.load`. | [logger.ts:109,123](../../../src/utils/logger.ts) write to `process.stdout`; result written to stdout at [cli.ts:44](../../../src/cli.ts) | +| 4 | `zotero_prepare_bundle` can't take the `citekey` `fetch` just returned, ignores the `slug_prefix` `fetch` computed (forcing the agent to invent a slug), and reports a missing slug as `"Invalid slug format"` (sounds malformed). | [zotero-tools.ts:557](../../../src/agent/tools/zotero-tools.ts) params; `slug_prefix` returned at line 224; error at line 571 | +| 5 | To add analysis, `vault_write` demands the *entire* file including frontmatter — the agent must reproduce the folded-YAML title, 18-author list, and sources block exactly. No body-only or section-level write exists. | [vault.ts:142](../../../src/agent/tools/vault.ts) (`vault_write`), `vault_append` is end-only (line 105) | +| 6 | Figure *panels* extract as garbled symbol/whitespace soup; the Figure Map output already ships, but the compiled text has no page boundaries to help locate figures. | `source-loader` returns raw `pdf-parse` text, no page markers | + +--- + +## 2. Decisions (resolved during brainstorming) + +1. **Orchestration:** *Fix the seams + add a thin wrapper.* The discrete tools (`fetch`/`prepare_bundle`/`ingest`/`compile`) are well-factored and individually useful (abstract-only mode, multi-file bundles, re-ingest). Fix every seam so the manual chain is painless, then add a thin `zotero_intake` that calls the same three internally for the common one-call case. Discrete tools stay public. (Rejected: thick orchestrator that hides the steps and concentrates partial-failure recovery in one tool.) +2. **Truncation (#1):** *Raise caps so a normal paper returns whole in one call; add `offset`/`max_tokens` as the escape hatch* for the rare overflow. (Rejected: pagination-first — too many round-trips for the 95% case.) +3. **Figure input (#6):** *Page markers now, noise-stripping deferred.* Raising the caps removed the token-budget rationale for risky heuristic stripping of scientific text (α/β/γ, equations, data tables). Page boundaries give most of the locating benefit at zero content-loss risk. +4. **Body-only write (#5):** *`vault_write_body` (whole body, frontmatter preserved)*, not surgical per-section replace — the agent drafts all 7 sections (Figure Map + 6 CREATE) at once, so one call / one confirmation beats N confirmations. +5. **`create_reading_note` (#2):** *Repoint the skill to `ingest_reading_bundle` AND make `create_reading_note` defensive* (auto-discover bundle files when `sources` omitted and a bundle exists). Keeps `create`'s legitimate "note before any files" capability while closing the footgun. + +--- + +## 3. Phased design + +Four phases, ordered by risk. Each is independently shippable and TDD-tested (vitest, `tests/unit/*`). + +### Phase 1 — Seam fixes (low risk, high relief) + +**1a · Raise caps (#1).** In `source-loader.ts`: +- Let a single source draw the full session budget (drop the artificial 10k per-source ceiling below the session cap). +- Raise `SESSION_TOKEN_CAP` so a normal paper fits in one call (recommended **50_000** tokens ≈ 200k chars; tunable in plan — covers a long review + a NotebookLM summary while staying bounded). +- Raise the PDF page cap `pdfParse(buffer, { max: 20 })` → **`{ max: 80 }`** (covers paper + supplement; the token cap remains the real governor). + +**1b · Logs → stderr (#3).** In `logger.ts`, route *all* levels to `process.stderr` (not just `error`). stdout becomes exclusively the result channel, so `cricknote tool …` output is always clean, pipeable JSON. + +**1c · Slug error message (#4b).** In `zotero_prepare_bundle`, split the check: missing/empty slug → `slug is required.`; present-but-non-matching → `Invalid slug format: "" (expected lowercase kebab-case).` + +**1d · Skill repoint + defensive `create` (#2).** +- `SKILL.md`: change the Zotero path (line 14) and local-files path (line 19) from `create_reading_note` → `ingest_reading_bundle`. +- `create_reading_note`: when `sources` is omitted *and* `Reading/attachments//` exists, auto-discover readable files (reuse the `discoverBundle` logic) and register them — so a direct call can't silently create a sourceless note. + +### Phase 2 — `compile` pagination + page markers (#1b, #6) + +**2a · Pagination (DEFERRED — see §4).** `compile_reading_note` and `loadSources` gain optional `offset` (token offset into the compiled source stream, default 0) and `max_tokens` (per-call budget override, default = session cap). The payload gains a top-level `truncated: boolean`, `next_offset: number | null` (null when exhausted), and `tokens_remaining: number`, so overflow is unmistakable and pageable. Pagination operates over the concatenated source text in existing priority order (`notes → pdf → notebooklm → web → other`). + +**2b · Page markers (SHIPPED).** Switch `extractPdf` to per-page extraction (`pdf-parse` `pagerender` / page callback) and join with `\n\n--- page N ---\n\n` separators. **No content removed.** Helps the LLM cite "Fig 3 is on page 7" while drafting the Figure Map. + +### Phase 3 — Body-only write (#5) + +New `vault_write_body { path, body }` tool in `vault.ts`: +- Reads the existing file, splits it at the end of the frontmatter block, and replaces **only** the body region with `body` — the original frontmatter text is kept **verbatim** (no re-serialization, so quote style, key order, and folded-YAML author lists are untouched). Returns a `pending_edit` (same safe-write flow as `vault_append`, including `conflictDetector.recordFileRead`). +- Errors if the file does not exist (use `vault_write` to create) or has no frontmatter block (use `vault_write`). +- The agent drafts the full body (Figure Map + CREATE sections) and saves once — no frontmatter reproduction. + +### Phase 4 — `zotero_intake` thin wrapper (#4a, orchestration) — DEFERRED (see §4) + +New `zotero_intake { citekey? , doi?, zotero_key?, slug?, related_projects?, …selection-resume fields }` in `zotero-tools.ts`: +- Runs **fetch → prepare_bundle → ingest** internally by composing the existing handlers' `execute()` and parsing the JSON between steps (to detect `error` / `needs_*`). `fetch` and `prepare_bundle` are in-module; `ingest` is reached via a dynamic import of `createReadingIntakeTools(vaultPath)` with `conflictDetector` omitted — safe because a fresh note skips the conflict-snapshot path. (No 3-tool refactor required.) +- **Slug composition:** if `slug` given, validate and use it; else use `fetch`'s `slug_prefix` directly (already a valid slug). The agent never hand-threads `pdf_path` or invents a slug. +- **Metadata:** derived from `fetch`'s normalized CSL (title/authors/year/journal/doi); caller may override. +- **Passthrough:** if `fetch` returns `needs_item_selection` / `needs_attachment_selection`, return that payload unchanged (caller resolves, then retries) — the wrapper does not reimplement resume. +- **Returns:** the note's `pending_edit` plus `meta { slug, bundle_path, files_created }`, so the agent compiles next. +- **Partial failure is recoverable:** `prepare_bundle` copies the PDF (idempotent — matching hash is skipped) before `ingest` returns a `pending_edit`; a later failure leaves a valid bundle that re-running reuses. + +--- + +## 4. Deferred (explicitly out of scope) + +- **Phase 2a — `compile` pagination (`offset`/`max_tokens`)** — deferred after raising the session cap to 50k: a normal paper (~29k tokens) now returns whole in one call, so pagination only matters for >50k monsters (rare). Revisit if such papers appear. The `truncated` flag + warning still signal overflow. +- **Phase 4 — `zotero_intake` orchestrator** — deferred: with the seams (#2/#3/#4) fixed, the manual fetch → prepare_bundle → ingest chain is already painless. The one-call wrapper is convenience, not friction-removal, and is the largest/riskiest build (cross-module composition + http-mocked tests). The `vault_pdf_dir` path-coupling guard travels with it. +- **#6 figure noise-stripping** — dropping garbled panel lines while keeping legends. Deferred to a separate, carefully-validated pass (must not eat α/β/γ, equations, or data tables). Raising the caps removed its urgency. +- **Section-addressed `vault_replace_section`** — superseded by `vault_write_body` for the reading-note use case; revisit only if surgical single-section edits are needed elsewhere. +- **Forced template migration** for installs that predate page markers — same limitation noted in the Figure Map spec. + +--- + +## 5. Critical-review findings (verified against source before implementation) + +A pre-implementation pass confirmed feasibility and surfaced these notes — folded into the plan: + +- **The stale worktree is not a blocker.** `.worktrees/feat-zotero` is fully merged into `main` (main 82 commits ahead, branch 0 ahead) and its `zotero-tools.ts` is byte-identical to main. Leftover cruft — flagged for separate cleanup, not touched here. +- **`logger.test.ts` encodes the old behaviour.** Seven assertions expect `info`/`warn`/`debug` on **stdout** (e.g. lines 25, 39, 52, 127). Routing all levels to stderr is a deliberate test rewrite (flip to `stderrSpy`), not a silent change. Intended: stdout becomes a pure data channel. +- **Cap tests change.** `source-loader.test.ts:34,63` assert the 10k/30k thresholds; both fixtures + expectations move to the new budget. +- **`discoverBundle` is private** (`reading-intake.ts:80`). Extract it to a shared module (e.g. `src/knowledge/reading-bundle.ts`) so `create_reading_note` (Phase 1d) reuses it rather than duplicating. +- **Page markers are feasible** via a custom `pdf-parse` `pagerender` that collects per-page text through a closure counter (pdf-parse 1.1.4 calls it sequentially per page). +- **Phase 4 is testable** with the existing `vi.mock('node:http')` pattern (`zotero-tools.test.ts:24`) — no live Zotero needed. +- **⚠️ Path-coupling constraint (latent bug).** `prepare_bundle` writes to `config.zotero.vault_pdf_dir/`, but `discover`/`ingest`/`compile` hardcode `Reading/attachments/`. They coincide **only because `vault_pdf_dir` defaults to `Reading/attachments`** (`config.ts:20`). The Zotero→ingest path (#2) and `zotero_intake` (Phase 4) therefore assume the default. **Mitigation (Phase 1d):** when Zotero is enabled and `vault_pdf_dir !== 'Reading/attachments'`, `zotero_intake` returns a clear error instead of silently producing a sourceless note. Full path-unification (threading `vault_pdf_dir` into discover/ingest/compile) is out of scope. + +--- + +## 6. What Changes + +### `src/knowledge/source-loader.ts` (Phase 1a, 2a, 2b) +- Replace the fixed 10k per-source *ceiling* so a single source can use the remaining session budget; raise `SESSION_TOKEN_CAP` to `50_000`. +- `extractPdf`: raise `{ max: 20 }` → `{ max: 80 }`; switch to per-page extraction and insert `--- page N ---` separators. +- `loadSources`: accept `{ offset?, maxTokens? }`; return `truncated`, `nextOffset`, `tokensRemaining` alongside `sources`/`warnings`/`totalTokens`. + +### `src/agent/tools/kb-tools.ts` (Phase 2a) +- `compile_reading_note`: declare `offset` and `max_tokens` params; thread them into `loadSources`; surface `truncated`/`next_offset`/`tokens_remaining` in the payload and in the `instruction` text (tell the LLM how to fetch the rest). + +### `src/utils/logger.ts` (Phase 1b) +- `log()`: write every level to `process.stderr` (remove the `level === 'error'` stdout/stderr branch in both `json` and `pretty` formats). File output unchanged. + +### `src/agent/tools/zotero-tools.ts` (Phase 1c, 4) +- `zotero_prepare_bundle`: split the missing-vs-malformed slug error. +- Add `zotero_intake` tool (fetch→prepare→ingest orchestration, §3 Phase 4). + +### `src/agent/tools/templates.ts` (Phase 1d) +- `create_reading_note`: when `sources` omitted and the bundle dir exists, auto-discover readable files and register them. Factor bundle discovery so it is shared with `reading-intake.ts` (avoid duplicating `discoverBundle`). + +### `src/agent/tools/vault.ts` (Phase 3) +- Add `vault_write_body` tool (§3 Phase 3). + +### `skills/cricknote-reading-intake/SKILL.md` (Phase 1d, 4) +- Repoint both paths to `ingest_reading_bundle`; add a one-line `zotero_intake` shortcut for the common case; mention `vault_write_body` in the "Write it" step; note the new `offset`/`max_tokens` for long papers. + +### `src/agent/build-registry.ts` (Phase 3, 4) +- Register `vault_write_body` and `zotero_intake`. + +### Tests (all phases) +- `tests/unit/source-loader.test.ts`: single large source uses full session budget; `SESSION_TOKEN_CAP` boundary; `offset`/`max_tokens` paging returns correct slices, `next_offset`/`tokens_remaining`/`truncated`; page markers present and content intact. +- `tests/unit/reading-note.test.ts` / new: `create_reading_note` auto-discovers when sources omitted + bundle exists; still works with no bundle (placeholder). +- New `vault_write_body` tests: frontmatter preserved exactly, body replaced, missing-file error, conflict snapshot recorded. +- New `zotero_intake` tests: slug from `slug_prefix`; explicit slug override; `needs_*_selection` passthrough; metadata derived from fetch; partial-failure idempotency. +- `logger` test: all levels go to stderr; stdout stays empty for non-result writes. + +--- + +## 7. What Does Not Change + +- Discrete tools `zotero_fetch_item`, `zotero_prepare_bundle`, `ingest_reading_bundle`, `compile_reading_note`, `vault_write`, `vault_append` keep their existing contracts (only additive params/fields). +- `create_reading_note`'s "create a note before any files exist" capability — preserved. +- CREATE framework, `hasCreateHeadings`, `inferReadingPipelineStep`, Figure Map output format. +- DB schema, KB pipeline, mapping artifacts, frontmatter field set, safe-write/conflict-detection flow. + +--- + +## 8. Design Decisions and Rationale + +**Why a thin wrapper, not a thick one?** The friction was in the seams, not the existence of steps. A thin `zotero_intake` reuses the discrete tools' logic and passes through their resume states, so flexibility and partial-failure recovery are preserved while the common case becomes one call. + +**Why raise caps instead of paginating by default?** A typical paper (~29k tokens) is below a 50k session budget, so it returns whole in one call — what deep analysis needs. `offset`/`max_tokens` exist for the rare overflow without burdening the 95% case. + +**Why defer figure noise-stripping?** Heuristics that detect "garbled" lines risk deleting real scientific content (symbols, equations, tables). With caps raised, the noise is affordable; page markers deliver the locating benefit safely. Stripping deserves its own validated pass. + +**Why body-only write, not section-addressed?** The reading-note workflow fills all sections at once. Body-only is one call / one confirmation and structurally cannot clobber frontmatter — directly answering the feedback ("can't clobber frontmatter, far lighter on tokens"). + +**Why keep `create_reading_note` at all?** `ingest_reading_bundle` requires an existing bundle; `create` uniquely supports notes with no files yet (Threads, deferred captures). Making it defensive closes the footgun without losing that capability. + +**Why logs to stderr (all levels), not just gated?** A CLI whose contract is "stdout = JSON result" must keep stdout pristine. Mixing any log line breaks `json.load`. stderr is the correct sink for all diagnostics; the optional log file already captures everything for later inspection. diff --git a/skills/cricknote-reading-intake/SKILL.md b/skills/cricknote-reading-intake/SKILL.md index 55caaeb..8efb504 100644 --- a/skills/cricknote-reading-intake/SKILL.md +++ b/skills/cricknote-reading-intake/SKILL.md @@ -11,16 +11,19 @@ One paper at a time. All writes go through `cricknote tool`. 1. `cricknote tool zotero_fetch_item '{"citekey":""}'` (or `{"doi":"..."}`). 2. `cricknote tool zotero_prepare_bundle '{...}'` to copy the PDF into `Reading/attachments//`. -3. `cricknote tool create_reading_note '{"slug":"","title":"","authors":["..."],"year":2026,"journal":"","doi":""}'`. +3. `cricknote tool ingest_reading_bundle '{"slug":"","title":"","authors":["..."],"year":2026,"journal":"","doi":""}'` + — discovers the copied PDF and registers it as a source, so the note compiles. ## From local files (no Zotero) 1. Put files under `Reading/attachments//`. 2. `cricknote tool discover_reading_bundle '{"slug":""}'`. -3. `cricknote tool create_reading_note '{...}'`. +3. `cricknote tool ingest_reading_bundle '{...}'` — registers the discovered files + as sources. (Use `create_reading_note` only for a note with no files yet.) ## Analyze the paper 1. `cricknote tool compile_reading_note '{"path":"Reading/Papers/.md"}'` - — returns source text. + — returns source text, with `--- page N ---` markers between PDF pages + (use them to note which page each figure is on). 2. Draft the **Figure Map** AND the CREATE sections. Show both to the user. **Figure Map rules** (goes at the top, before `## Claims`): @@ -36,7 +39,9 @@ One paper at a time. All writes go through `cricknote tool`. - If no figures are found (review, theory, or methods paper): leave `## Figure Map` with a single line: `` -3. Write it: `cricknote tool vault_write '{"path":"Reading/Papers/.md","content":""}'`. +3. Write it: `cricknote tool vault_write_body '{"path":"Reading/Papers/.md","body":""}'` + — preserves the frontmatter (authors, sources, etc.); you supply only the body + (Figure Map + CREATE sections). Use `vault_write` only when creating a file from scratch. ## Check status `cricknote tool reading_pipeline_status '{"path":"Reading/Papers/.md"}'` diff --git a/src/agent/build-registry.ts b/src/agent/build-registry.ts index d75f17c..26adb1d 100644 --- a/src/agent/build-registry.ts +++ b/src/agent/build-registry.ts @@ -10,6 +10,22 @@ import { createContextTools } from './tools/context.js'; import { createSerialTools } from './tools/serial-tools.js'; import { createKbTools } from './tools/kb-tools.js'; import { createZoteroTools } from './tools/zotero-tools.js'; +import { loadConfig } from '../config/config.js'; + +/** + * Resolve the vault-relative attachments directory for the reading-intake + * pipeline. Mirrors where zotero_prepare_bundle writes PDFs + * (config.zotero.vault_pdf_dir) so discover/ingest/compile read from the same + * place a Zotero bundle was written to. Defaults to 'Reading/attachments' and + * never throws — a missing or unreadable config falls back to the default. + */ +function resolveAttachmentsDir(): string { + try { + return loadConfig().zotero?.vault_pdf_dir ?? 'Reading/attachments'; + } catch { + return 'Reading/attachments'; + } +} /** * Build the complete CrickNote tool registry. Shared by the Obsidian runtime @@ -31,14 +47,16 @@ export function buildToolRegistry( for (const h of handlers) registry.register(h); }; + const attachmentsDir = resolveAttachmentsDir(); + add(createVaultTools(vaultPath, conflictDetector, db)); add(createSearchTools(db)); add(createTaskTools(vaultPath, conflictDetector)); - add(createTemplateTools(vaultPath, conflictDetector)); - add(createReadingIntakeTools(vaultPath, conflictDetector)); + add(createTemplateTools(vaultPath, conflictDetector, attachmentsDir)); + add(createReadingIntakeTools(vaultPath, conflictDetector, attachmentsDir)); add(createContextTools(vaultPath)); add(createSerialTools(vaultPath, db)); - add(createKbTools(vaultPath)); + add(createKbTools(vaultPath, undefined, attachmentsDir)); add(createZoteroTools(vaultPath)); return registry; diff --git a/src/agent/tools/kb-tools.ts b/src/agent/tools/kb-tools.ts index b26cc9d..d8ba21e 100644 --- a/src/agent/tools/kb-tools.ts +++ b/src/agent/tools/kb-tools.ts @@ -37,6 +37,7 @@ function isValidSlug(s: string): boolean { return SLUG_RE.test(s); } export function createKbTools( vaultPath: string, injectedDb?: Database.Database, + attachmentsDir = 'Reading/attachments', ): ToolHandler[] { void log; void injectedDb; @@ -94,7 +95,8 @@ export function createKbTools( const result = await loadSources( sources as Array<{ type: string; path: string }>, sourceSlug, - vaultPath + vaultPath, + attachmentsDir ); return JSON.stringify({ diff --git a/src/agent/tools/reading-intake.ts b/src/agent/tools/reading-intake.ts index 0004ab1..887043c 100644 --- a/src/agent/tools/reading-intake.ts +++ b/src/agent/tools/reading-intake.ts @@ -17,27 +17,12 @@ import { syncReadingBodyTitle, type ReadingSourceInput, type ReadingPipelineStep, - type ReadingSourceType, } from '../../knowledge/reading-note.js'; import { readMappingArtifact } from '../../knowledge/mapping-artifact.js'; +import { discoverBundle } from '../../knowledge/reading-bundle.js'; import { resolveVaultPath } from '../../utils/paths.js'; import { renderNoteTemplate, type RenderResult, type TemplateKind } from '../../templates/template-loader.js'; -interface DiscoveredBundleFile { - path: string; - type: ReadingSourceType; - readable: boolean; -} - -interface BundleDiscoveryResult { - slug: string; - folderExists: boolean; - bundlePath: string; - discoveredFiles: DiscoveredBundleFile[]; - recommendedSources: ReadingSourceInput[]; - warnings: string[]; -} - interface MappingArtifactSummary { path?: string; status?: string; @@ -46,9 +31,6 @@ interface MappingArtifactSummary { cleanupCandidates?: string[]; } -const TEXT_SOURCE_EXTENSIONS = new Set(['.md', '.txt']); -const IGNORED_BUNDLE_FILES = new Set(['.ds_store']); - function normalizeBundleSlug(value: unknown): string { if (typeof value !== 'string' || !value.trim()) { throw new Error('slug is required.'); @@ -56,92 +38,6 @@ function normalizeBundleSlug(value: unknown): string { return slugifyReadingTitle(value); } -function classifyBundleFile(fileName: string): { type: ReadingSourceType; readable: boolean } { - const lower = fileName.toLowerCase(); - const ext = path.extname(lower); - - if (ext === '.pdf') { - return { type: 'pdf', readable: true }; - } - - if (TEXT_SOURCE_EXTENSIONS.has(ext)) { - if (lower.includes('notebooklm')) { - return { type: 'notebooklm', readable: true }; - } - if (lower.includes('web')) { - return { type: 'web', readable: true }; - } - return { type: 'notes', readable: true }; - } - - return { type: 'other', readable: false }; -} - -function discoverBundle(vaultPath: string, slug: string): BundleDiscoveryResult { - const bundlePath = resolveVaultPath(vaultPath, path.join('Reading', 'attachments', slug)); - const warnings: string[] = []; - - if (!fs.existsSync(bundlePath) || !fs.statSync(bundlePath).isDirectory()) { - return { - slug, - folderExists: false, - bundlePath, - discoveredFiles: [], - recommendedSources: [], - warnings: [`Reading bundle not found: Reading/attachments/${slug}`], - }; - } - - const discoveredFiles: DiscoveredBundleFile[] = []; - - for (const entry of fs.readdirSync(bundlePath, { withFileTypes: true }).sort((a, b) => a.name.localeCompare(b.name))) { - if (IGNORED_BUNDLE_FILES.has(entry.name.toLowerCase())) { - continue; - } - - if (!entry.isFile()) { - warnings.push(`Skipping non-file bundle entry "${entry.name}".`); - continue; - } - - const relativePath = normalizeReadingSourcePath(entry.name); - const classified = classifyBundleFile(relativePath); - discoveredFiles.push({ - path: relativePath, - type: classified.type, - readable: classified.readable, - }); - - if (!classified.readable) { - warnings.push(`Unsupported bundle file "${relativePath}" — only .pdf, .md, and .txt are used for reading intake.`); - } - } - - const recommendedSources = normalizeReadingSources( - discoveredFiles - .filter((file) => file.readable) - .map((file) => ({ type: file.type, path: file.path })) - ); - - const pdfCount = discoveredFiles.filter((file) => file.type === 'pdf' && file.readable).length; - if (pdfCount > 1) { - warnings.push(`Multiple PDF files found in Reading/attachments/${slug}; review the recommended sources before ingesting.`); - } - - if (recommendedSources.length === 0) { - warnings.push(`Reading bundle "${slug}" has no readable source files yet.`); - } - - return { - slug, - folderExists: true, - bundlePath, - discoveredFiles, - recommendedSources, - warnings, - }; -} - function normalizeExcludedPaths(paths: unknown): Set { if (!Array.isArray(paths)) { return new Set(); @@ -275,7 +171,8 @@ function determinePipelineStep( export function createReadingIntakeTools( vaultPath: string, - conflictDetector?: ConflictDetector + conflictDetector?: ConflictDetector, + attachmentsDir = 'Reading/attachments' ): ToolHandler[] { return [ { @@ -298,7 +195,7 @@ export function createReadingIntakeTools( return JSON.stringify({ error: (err as Error).message }); } - const discovery = discoverBundle(vaultPath, slug); + const discovery = discoverBundle(vaultPath, slug, attachmentsDir); return JSON.stringify({ slug: discovery.slug, folder_exists: discovery.folderExists, @@ -352,10 +249,10 @@ export function createReadingIntakeTools( return JSON.stringify({ error: (err as Error).message }); } - const discovery = discoverBundle(vaultPath, slug); + const discovery = discoverBundle(vaultPath, slug, attachmentsDir); if (!discovery.folderExists) { - return JSON.stringify({ error: `Reading bundle not found: Reading/attachments/${slug}` }); + return JSON.stringify({ error: `Reading bundle not found: ${path.join(attachmentsDir, slug)}` }); } let excludedPaths: Set; @@ -379,13 +276,13 @@ export function createReadingIntakeTools( selectedSources = selectedSources.filter((source) => !excludedPaths.has(source.path)); if (selectedSources.length === 0) { - return JSON.stringify({ error: `No readable sources selected for Reading/attachments/${slug}` }); + return JSON.stringify({ error: `No readable sources selected for ${path.join(attachmentsDir, slug)}` }); } for (const source of selectedSources) { let sourcePath: string; try { - sourcePath = resolveVaultPath(vaultPath, path.join('Reading', 'attachments', slug, source.path)); + sourcePath = resolveVaultPath(vaultPath, path.join(attachmentsDir, slug, source.path)); } catch { return JSON.stringify({ error: `Selected source resolves outside the vault: "${source.path}"` }); } @@ -538,7 +435,7 @@ export function createReadingIntakeTools( } } - const discovery = discoverBundle(vaultPath, slug); + const discovery = discoverBundle(vaultPath, slug, attachmentsDir); if (!noteRef || !fs.existsSync(noteRef.absPath)) { return JSON.stringify({ diff --git a/src/agent/tools/templates.ts b/src/agent/tools/templates.ts index fae5a1f..db6c63f 100644 --- a/src/agent/tools/templates.ts +++ b/src/agent/tools/templates.ts @@ -14,9 +14,10 @@ import { type ReadingSourceInput, } from '../../knowledge/reading-note.js'; import { resolveVaultPath } from '../../utils/paths.js'; +import { discoverBundle } from '../../knowledge/reading-bundle.js'; import { renderNoteTemplate, type RenderResult } from '../../templates/template-loader.js'; -export function createTemplateTools(vaultPath: string, conflictDetector?: ConflictDetector): ToolHandler[] { +export function createTemplateTools(vaultPath: string, conflictDetector?: ConflictDetector, attachmentsDir = 'Reading/attachments'): ToolHandler[] { return [ { definition: { @@ -57,6 +58,16 @@ export function createTemplateTools(vaultPath: string, conflictDetector?: Confli } catch (err) { return JSON.stringify({ error: (err as Error).message }); } + } else { + // Defensive: if files are already in the bundle dir for this slug, register + // them so the note isn't created sourceless (which would make + // compile_reading_note report sources_missing). The no-bundle placeholder + // capability is preserved — discovery is skipped when the folder is absent + // or empty. (Prefer ingest_reading_bundle; this just closes the footgun.) + const discovery = discoverBundle(vaultPath, slug, attachmentsDir); + if (discovery.folderExists && discovery.recommendedSources.length > 0) { + normalizedSources = discovery.recommendedSources; + } } let notePath: string; try { diff --git a/src/agent/tools/vault.ts b/src/agent/tools/vault.ts index f952517..6d3d714 100644 --- a/src/agent/tools/vault.ts +++ b/src/agent/tools/vault.ts @@ -171,5 +171,48 @@ export function createVaultTools( }); }, }, + { + definition: { + name: 'vault_write_body', + description: "Replace the body of an existing note while preserving its frontmatter exactly. Use to fill in or update a reading note's sections (Figure Map, Claims, Reasoning, …) without reproducing the frontmatter. Triggers safe edit flow (diff preview, user confirmation).", + parameters: { + type: 'object', + properties: { + path: { type: 'string', description: 'Relative path to the existing note' }, + body: { type: 'string', description: 'New markdown body — everything after the frontmatter block' }, + }, + required: ['path', 'body'], + }, + }, + execute: async (args) => { + let notePath: string; + try { + notePath = resolveVaultPath(vaultPath, args.path as string); + } catch { + return JSON.stringify({ error: `Invalid path: "${args.path}"` }); + } + if (!fs.existsSync(notePath)) { + return JSON.stringify({ error: `File not found: ${args.path}` }); + } + const existing = fs.readFileSync(notePath, 'utf-8'); + // Match the leading frontmatter block verbatim — no re-serialization, so + // folded YAML, key order, and long author lists are preserved exactly. + const fmMatch = existing.match(/^(---\r?\n[\s\S]*?\r?\n---)[ \t]*\r?\n?/); + if (!fmMatch) { + return JSON.stringify({ error: `No frontmatter block in ${args.path}. Use vault_write to set frontmatter and body together.` }); + } + // Record snapshot before modification so conflict detection is active. + conflictDetector?.recordFileRead(notePath, existing); + const frontmatter = fmMatch[1]; + const body = (args.body as string).replace(/^\n+/, ''); + const newContent = `${frontmatter}\n\n${body}`; + return JSON.stringify({ + type: 'pending_edit', + path: notePath, + newContent, + operation: 'update', + }); + }, + }, ]; } diff --git a/src/agent/tools/zotero-tools.ts b/src/agent/tools/zotero-tools.ts index 4a34399..77a50c1 100644 --- a/src/agent/tools/zotero-tools.ts +++ b/src/agent/tools/zotero-tools.ts @@ -545,6 +545,17 @@ function zoteroFetchFallback(args: Record, exportPath: string): const SLUG_RE = /^[a-z0-9][a-z0-9-]*$/; +/** Distinguish a missing slug from a malformed one so the error is actionable. */ +function validateSlugArg(value: unknown): { slug: string } | { error: string } { + if (typeof value !== 'string' || value.trim() === '') { + return { error: 'slug is required.' }; + } + if (!SLUG_RE.test(value)) { + return { error: `Invalid slug format: "${value}" (expected lowercase kebab-case, e.g. "smith-2026-il42").` }; + } + return { slug: value }; +} + // ─── zotero_prepare_bundle ──────────────────────────────────────────────────── function zoteroPrepareBundleTool(vaultPath: string, cfg: () => CrickNoteConfig): ToolHandler { @@ -567,8 +578,9 @@ function zoteroPrepareBundleTool(vaultPath: string, cfg: () => CrickNoteConfig): const z = getZoteroConfig(config); if ('error' in z) return JSON.stringify(z); - const slug = args.slug; - if (typeof slug !== 'string' || !SLUG_RE.test(slug)) return JSON.stringify({ error: 'Invalid slug format.' }); + const slugCheck = validateSlugArg(args.slug); + if ('error' in slugCheck) return JSON.stringify(slugCheck); + const slug = slugCheck.slug; const rawBundleDir = path.join(vaultPath, (z as ZoteroConfig).vault_pdf_dir, slug); if (fs.existsSync(rawBundleDir) && fs.lstatSync(rawBundleDir).isSymbolicLink()) { @@ -604,18 +616,19 @@ function zoteroPrepareBundleTool(vaultPath: string, cfg: () => CrickNoteConfig): } } + const bundleRel = path.join((z as ZoteroConfig).vault_pdf_dir, slug); const dirExists = fs.existsSync(bundleDir); const hasMarker = dirExists && fs.existsSync(markerPath); if (dirExists && !hasMarker) { - return JSON.stringify({ error: `Pre-existing manual bundle at Reading/attachments/${slug}/ — remove or rename it before using Zotero ingestion.` }); + return JSON.stringify({ error: `Pre-existing manual bundle at ${bundleRel}/ — remove or rename it before using Zotero ingestion.` }); } let existingMarkerFiles: Record = {}; if (hasMarker) { const existingMarker = readMarker(markerPath); if (!existingMarker || existingMarker.created_by !== 'zotero_prepare_bundle') { - return JSON.stringify({ error: `Marker at Reading/attachments/${slug}/.zotero-bundle was not created by zotero_prepare_bundle. Refusing to operate.` }); + return JSON.stringify({ error: `Marker at ${bundleRel}/.zotero-bundle was not created by zotero_prepare_bundle. Refusing to operate.` }); } existingMarkerFiles = existingMarker.files; } @@ -752,8 +765,9 @@ function zoteroCleanupBundleTool(vaultPath: string, cfg: () => CrickNoteConfig): const z = getZoteroConfig(config); if ('error' in z) return JSON.stringify(z); - const slug = args.slug; - if (typeof slug !== 'string' || !SLUG_RE.test(slug)) return JSON.stringify({ error: 'Invalid slug format.' }); + const slugCheck = validateSlugArg(args.slug); + if ('error' in slugCheck) return JSON.stringify(slugCheck); + const slug = slugCheck.slug; const rawBundleDir = path.join(vaultPath, (z as ZoteroConfig).vault_pdf_dir, slug); if (fs.existsSync(rawBundleDir) && fs.lstatSync(rawBundleDir).isSymbolicLink()) { diff --git a/src/knowledge/reading-bundle.ts b/src/knowledge/reading-bundle.ts new file mode 100644 index 0000000..c88e481 --- /dev/null +++ b/src/knowledge/reading-bundle.ts @@ -0,0 +1,125 @@ +import fs from 'node:fs'; +import path from 'node:path'; +import { + normalizeReadingSourcePath, + normalizeReadingSources, + type ReadingSourceInput, + type ReadingSourceType, +} from './reading-note.js'; +import { resolveVaultPath } from '../utils/paths.js'; + +export interface DiscoveredBundleFile { + path: string; + type: ReadingSourceType; + readable: boolean; +} + +export interface BundleDiscoveryResult { + slug: string; + folderExists: boolean; + bundlePath: string; + discoveredFiles: DiscoveredBundleFile[]; + recommendedSources: ReadingSourceInput[]; + warnings: string[]; +} + +const TEXT_SOURCE_EXTENSIONS = new Set(['.md', '.txt']); +const IGNORED_BUNDLE_FILES = new Set(['.ds_store']); + +function classifyBundleFile(fileName: string): { type: ReadingSourceType; readable: boolean } { + const lower = fileName.toLowerCase(); + const ext = path.extname(lower); + + if (ext === '.pdf') { + return { type: 'pdf', readable: true }; + } + + if (TEXT_SOURCE_EXTENSIONS.has(ext)) { + if (lower.includes('notebooklm')) { + return { type: 'notebooklm', readable: true }; + } + if (lower.includes('web')) { + return { type: 'web', readable: true }; + } + return { type: 'notes', readable: true }; + } + + return { type: 'other', readable: false }; +} + +/** + * Inspect // (default Reading/attachments/) and recommend + * readable source files. Shared by discover_reading_bundle, + * ingest_reading_bundle, and the defensive source auto-discovery in + * create_reading_note. `attachmentsDir` mirrors config.zotero.vault_pdf_dir so a + * Zotero-prepared bundle is discovered at the same path it was written to. + */ +export function discoverBundle( + vaultPath: string, + slug: string, + attachmentsDir = 'Reading/attachments' +): BundleDiscoveryResult { + const bundleRel = path.join(attachmentsDir, slug); + const bundlePath = resolveVaultPath(vaultPath, bundleRel); + const warnings: string[] = []; + + if (!fs.existsSync(bundlePath) || !fs.statSync(bundlePath).isDirectory()) { + return { + slug, + folderExists: false, + bundlePath, + discoveredFiles: [], + recommendedSources: [], + warnings: [`Reading bundle not found: ${bundleRel}`], + }; + } + + const discoveredFiles: DiscoveredBundleFile[] = []; + + for (const entry of fs.readdirSync(bundlePath, { withFileTypes: true }).sort((a, b) => a.name.localeCompare(b.name))) { + if (IGNORED_BUNDLE_FILES.has(entry.name.toLowerCase())) { + continue; + } + + if (!entry.isFile()) { + warnings.push(`Skipping non-file bundle entry "${entry.name}".`); + continue; + } + + const relativePath = normalizeReadingSourcePath(entry.name); + const classified = classifyBundleFile(relativePath); + discoveredFiles.push({ + path: relativePath, + type: classified.type, + readable: classified.readable, + }); + + if (!classified.readable) { + warnings.push(`Unsupported bundle file "${relativePath}" — only .pdf, .md, and .txt are used for reading intake.`); + } + } + + const recommendedSources = normalizeReadingSources( + discoveredFiles + .filter((file) => file.readable) + .map((file) => ({ type: file.type, path: file.path })) + ); + + const pdfCount = discoveredFiles.filter((file) => file.type === 'pdf' && file.readable).length; + if (pdfCount > 1) { + warnings.push(`Multiple PDF files found in ${bundleRel}; review the recommended sources before ingesting.`); + } + + if (recommendedSources.length === 0) { + warnings.push(`Reading bundle "${slug}" has no readable source files yet.`); + } + + return { + slug, + folderExists: true, + bundlePath, + discoveredFiles, + recommendedSources, + warnings, + }; +} diff --git a/src/knowledge/source-loader.ts b/src/knowledge/source-loader.ts index f7e13bb..b31c671 100644 --- a/src/knowledge/source-loader.ts +++ b/src/knowledge/source-loader.ts @@ -11,8 +11,12 @@ import { logger } from '../utils/logger.js'; const log = logger.child('source-loader'); -const PER_SOURCE_TOKEN_CAP = 10_000; -const SESSION_TOKEN_CAP = 30_000; +// The session cap is the only governor: a single source may draw the entire +// budget (no artificial per-source ceiling below it). Raised from 10k/30k after +// a 29k-token paper was being cut off mid-Results. Pagination (offset/maxTokens) +// handles papers that exceed even this. +const SESSION_TOKEN_CAP = 50_000; +const PER_SOURCE_TOKEN_CAP = SESSION_TOKEN_CAP; const CHARS_PER_TOKEN = 4; const UNSUPPORTED_EXTS = new Set(['.xlsx', '.csv', '.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp']); @@ -39,12 +43,40 @@ function truncateToTokens(text: string, maxTokens: number): { text: string; trun return { text: text.slice(0, maxChars), truncated: true }; } +/** Join per-page PDF text with `--- page N ---` boundary markers (1-indexed). */ +export function joinPdfPages(pages: string[]): string { + return pages.map((text, i) => `--- page ${i + 1} ---\n${text}`).join('\n\n'); +} + async function extractPdf(absPath: string): Promise { - // Dynamic import so servers without pdf-parse installed still start + // Dynamic import so environments without pdf-parse installed still start const pdfParse = (await import('pdf-parse')).default; const buffer = fs.readFileSync(absPath); - const data = await pdfParse(buffer, { max: 20 }); - return data.text; + const pages: string[] = []; + // Custom per-page render (mirrors pdf-parse's default item-join) so we can + // insert page boundary markers — these help locate figures when drafting the + // Figure Map. No content is removed. + await pdfParse(buffer, { + max: 80, + pagerender: async (pageData: { + getTextContent: (opts: object) => Promise<{ items: Array<{ str: string; transform: number[] }> }>; + }) => { + const textContent = await pageData.getTextContent({ normalizeWhitespace: false, disableCombineTextItems: false }); + let lastY: number | undefined; + let text = ''; + for (const item of textContent.items) { + if (lastY === item.transform[5] || lastY === undefined) { + text += item.str; + } else { + text += '\n' + item.str; + } + lastY = item.transform[5]; + } + pages.push(text); + return text; + }, + }); + return joinPdfPages(pages); } const TYPE_PRIORITY: Record = { @@ -58,7 +90,8 @@ const TYPE_PRIORITY: Record = { export async function loadSources( sources: Array<{ type: string; path: string }>, sourceSlug: string, - vaultPath: string + vaultPath: string, + attachmentsDir = 'Reading/attachments' ): Promise { const loaded: LoadedSource[] = []; const warnings: string[] = []; @@ -105,7 +138,7 @@ export async function loadSources( let absPath: string; try { - absPath = resolveVaultPath(vaultPath, path.join('Reading', 'attachments', sourceSlug, src.path)); + absPath = resolveVaultPath(vaultPath, path.join(attachmentsDir, sourceSlug, src.path)); } catch { warnings.push(`Skipping "${src.path}" — path resolves outside vault.`); continue; diff --git a/src/utils/logger.ts b/src/utils/logger.ts index 889d813..0783b19 100644 --- a/src/utils/logger.ts +++ b/src/utils/logger.ts @@ -103,11 +103,8 @@ export class Logger { if (this.format === 'json') { const line = JSON.stringify(entry); - if (level === 'error') { - process.stderr.write(line + '\n'); - } else { - process.stdout.write(line + '\n'); - } + // All diagnostics go to stderr; stdout is reserved for the CLI's JSON result. + process.stderr.write(line + '\n'); if (this.logFile) fs.appendFileSync(this.logFile, line + '\n'); } else { const color = LEVEL_COLORS[level]; @@ -117,11 +114,8 @@ export class Logger { ? ' ' + Object.entries(data).map(([k, v]) => `${k}=${typeof v === 'string' ? v : JSON.stringify(v)}`).join(' ') : ''; const line = `${color}${timestamp} ${level.toUpperCase().padEnd(5)}${RESET} ${tag}${msg}${extra}`; - if (level === 'error') { - process.stderr.write(line + '\n'); - } else { - process.stdout.write(line + '\n'); - } + // All diagnostics go to stderr; stdout is reserved for the CLI's JSON result. + process.stderr.write(line + '\n'); // Always write JSON to file regardless of pretty console format if (this.logFile) fs.appendFileSync(this.logFile, JSON.stringify(entry) + '\n'); } diff --git a/tests/integration/reading-pipeline.test.ts b/tests/integration/reading-pipeline.test.ts index 464c25c..f9677d9 100644 --- a/tests/integration/reading-pipeline.test.ts +++ b/tests/integration/reading-pipeline.test.ts @@ -267,3 +267,65 @@ describe('reading pipeline — full state-transition sequence', () => { expect(result.error).toMatch(/slug.*path|path.*slug/i); }); }); + +// --------------------------------------------------------------------------- +// Path-coupling: the documented Zotero path must work with a non-default +// vault_pdf_dir. Bundle files live under a custom attachments dir; ingest must +// discover + validate them there, and compile must load them from there. +// --------------------------------------------------------------------------- + +describe('reading pipeline — non-default attachments dir (vault_pdf_dir)', () => { + let vaultPath: string; + let detector: ConflictDetector; + const customDir = 'Library/PDFs'; + const slug = 'jones-2025-il7-custom'; + + beforeEach(() => { + vaultPath = fs.mkdtempSync(path.join(os.tmpdir(), 'cricknote-custom-dir-')); + fs.mkdirSync(path.join(vaultPath, customDir, slug), { recursive: true }); + fs.mkdirSync(path.join(vaultPath, 'Reading', 'Papers'), { recursive: true }); + fs.writeFileSync( + path.join(vaultPath, customDir, slug, 'notes.md'), + '# Notes\n\nKey finding: IL-7 sustains memory T cells.' + ); + detector = new ConflictDetector(); + }); + + afterEach(() => { + fs.rmSync(vaultPath, { recursive: true, force: true }); + }); + + it('ingest discovers + compile loads sources from the custom dir', async () => { + // 1. Ingest with sources omitted → forces discovery + inline validation + // against the custom attachments dir. + const intakeTools = createReadingIntakeTools(vaultPath, detector, customDir); + const ingest = getToolByName(intakeTools, 'ingest_reading_bundle'); + const ingestResult = JSON.parse(await ingest.execute({ + slug, + title: 'IL-7 sustains memory T cells', + authors: ['Bob Jones'], + year: 2025, + journal: 'Immunity', + })); + + expect(ingestResult.type).toBe('pending_edit'); + expect(ingestResult.operation).toBe('create'); + const parsed = matter(ingestResult.newContent); + expect(parsed.data.sources).toEqual([{ type: 'notes', path: 'notes.md' }]); + + // 2. Simulate the runtime writing the pending_edit to disk. + fs.writeFileSync(ingestResult.path, ingestResult.newContent, 'utf-8'); + + // 3. Compile must load the source from the custom dir, not Reading/attachments. + const kbTools = createKbTools(vaultPath, undefined, customDir); + const compile = getToolByName(kbTools, 'compile_reading_note'); + const compileResult = JSON.parse(await compile.execute({ + path: `Reading/Papers/${slug}.md`, + })); + + expect(compileResult.sources_missing).toBe(false); + expect(compileResult.sources).toHaveLength(1); + expect(compileResult.sources[0].content).toContain('IL-7 sustains memory T cells'); + expect(compileResult.warnings.some((w: string) => w.includes('not found'))).toBe(false); + }); +}); diff --git a/tests/unit/logger.test.ts b/tests/unit/logger.test.ts index 819866e..0429d42 100644 --- a/tests/unit/logger.test.ts +++ b/tests/unit/logger.test.ts @@ -18,12 +18,12 @@ describe('Logger', () => { stderrSpy.mockRestore(); }); - it('outputs JSON lines in json format', () => { + it('outputs JSON lines in json format to stderr', () => { const log = new Logger({ format: 'json', level: 'info' }); log.info('test message', { key: 'value' }); - expect(stdoutSpy).toHaveBeenCalledOnce(); - const output = stdoutSpy.mock.calls[0][0] as string; + expect(stderrSpy).toHaveBeenCalledOnce(); + const output = stderrSpy.mock.calls[0][0] as string; const parsed = JSON.parse(output.trim()); expect(parsed.level).toBe('info'); @@ -32,12 +32,12 @@ describe('Logger', () => { expect(parsed.ts).toBeDefined(); }); - it('outputs human-readable lines in pretty format', () => { + it('outputs human-readable lines in pretty format to stderr', () => { const log = new Logger({ format: 'pretty', level: 'info' }); log.info('server started', { port: 8080 }); - expect(stdoutSpy).toHaveBeenCalledOnce(); - const output = stdoutSpy.mock.calls[0][0] as string; + expect(stderrSpy).toHaveBeenCalledOnce(); + const output = stderrSpy.mock.calls[0][0] as string; expect(output).toContain('INFO'); expect(output).toContain('server started'); expect(output).toContain('port=8080'); @@ -49,11 +49,21 @@ describe('Logger', () => { log.info('also hidden'); log.warn('should appear'); - expect(stdoutSpy).toHaveBeenCalledOnce(); - const output = stdoutSpy.mock.calls[0][0] as string; + expect(stderrSpy).toHaveBeenCalledOnce(); + const output = stderrSpy.mock.calls[0][0] as string; expect(output).toContain('should appear'); }); + it('never writes to stdout — stdout is the data channel', () => { + const log = new Logger({ format: 'json', level: 'debug' }); + log.debug('d'); + log.info('i'); + log.warn('w'); + log.error('e'); + + expect(stdoutSpy).not.toHaveBeenCalled(); + }); + it('writes error level to stderr', () => { const log = new Logger({ format: 'json', level: 'info' }); log.error('something broke', { code: 500 }); @@ -73,7 +83,7 @@ describe('Logger', () => { const child = parent.child('websocket'); child.info('client connected'); - const output = stdoutSpy.mock.calls[0][0] as string; + const output = stderrSpy.mock.calls[0][0] as string; const parsed = JSON.parse(output.trim()); expect(parsed.component).toBe('websocket'); expect(parsed.msg).toBe('client connected'); @@ -85,8 +95,8 @@ describe('Logger', () => { child.info('should be filtered'); child.warn('should appear'); - expect(stdoutSpy).toHaveBeenCalledOnce(); - const output = stdoutSpy.mock.calls[0][0] as string; + expect(stderrSpy).toHaveBeenCalledOnce(); + const output = stderrSpy.mock.calls[0][0] as string; expect(output).toContain('should appear'); }); @@ -110,21 +120,20 @@ describe('Logger', () => { const log = new Logger({ format: 'json', level: 'info' }); log.info('plain message'); - const output = stdoutSpy.mock.calls[0][0] as string; + const output = stderrSpy.mock.calls[0][0] as string; const parsed = JSON.parse(output.trim()); expect(parsed.msg).toBe('plain message'); expect(Object.keys(parsed)).toEqual(['ts', 'level', 'msg']); }); - it('debug level includes all messages', () => { + it('writes every level to stderr (debug, info, warn, error)', () => { const log = new Logger({ format: 'json', level: 'debug' }); log.debug('d'); log.info('i'); log.warn('w'); log.error('e'); - // debug + info + warn go to stdout, error goes to stderr - expect(stdoutSpy).toHaveBeenCalledTimes(3); - expect(stderrSpy).toHaveBeenCalledTimes(1); + expect(stderrSpy).toHaveBeenCalledTimes(4); + expect(stdoutSpy).not.toHaveBeenCalled(); }); }); diff --git a/tests/unit/reading-bundle.test.ts b/tests/unit/reading-bundle.test.ts new file mode 100644 index 0000000..87ea89d --- /dev/null +++ b/tests/unit/reading-bundle.test.ts @@ -0,0 +1,51 @@ +import { describe, it, expect, beforeEach, afterEach } from 'vitest'; +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { discoverBundle } from '../../src/knowledge/reading-bundle.js'; + +describe('discoverBundle', () => { + let vaultPath: string; + beforeEach(() => { + vaultPath = fs.mkdtempSync(path.join(os.tmpdir(), 'rb-test-')); + }); + afterEach(() => { fs.rmSync(vaultPath, { recursive: true, force: true }); }); + + it('discovers bundle files in the default Reading/attachments dir', () => { + const dir = path.join(vaultPath, 'Reading', 'attachments', 'smith-2026-il42'); + fs.mkdirSync(dir, { recursive: true }); + fs.writeFileSync(path.join(dir, 'paper.pdf'), 'fake pdf'); + + const result = discoverBundle(vaultPath, 'smith-2026-il42'); + + expect(result.folderExists).toBe(true); + expect(result.recommendedSources).toEqual([{ type: 'pdf', path: 'paper.pdf' }]); + }); + + it('discovers bundle files in a non-default attachments dir (vault_pdf_dir)', () => { + const customDir = 'Library/PDFs'; + const dir = path.join(vaultPath, customDir, 'smith-2026-il42'); + fs.mkdirSync(dir, { recursive: true }); + fs.writeFileSync(path.join(dir, 'paper.pdf'), 'fake pdf'); + + const result = discoverBundle(vaultPath, 'smith-2026-il42', customDir); + + expect(result.folderExists).toBe(true); + expect(result.recommendedSources).toEqual([{ type: 'pdf', path: 'paper.pdf' }]); + expect(result.bundlePath).toBe( + path.join(fs.realpathSync(vaultPath), customDir, 'smith-2026-il42') + ); + }); + + it('does not find a bundle in the default dir when files live under a custom dir', () => { + const dir = path.join(vaultPath, 'Library', 'PDFs', 'smith-2026-il42'); + fs.mkdirSync(dir, { recursive: true }); + fs.writeFileSync(path.join(dir, 'paper.pdf'), 'fake pdf'); + + // Reader still pointed at the default dir → bundle is invisible. + const result = discoverBundle(vaultPath, 'smith-2026-il42'); + + expect(result.folderExists).toBe(false); + expect(result.warnings[0]).toContain('Reading bundle not found'); + }); +}); diff --git a/tests/unit/source-loader.test.ts b/tests/unit/source-loader.test.ts index 031f2a0..9a8d6b2 100644 --- a/tests/unit/source-loader.test.ts +++ b/tests/unit/source-loader.test.ts @@ -2,7 +2,7 @@ import { describe, it, expect, beforeEach, afterEach } from 'vitest'; import fs from 'node:fs'; import os from 'node:os'; import path from 'node:path'; -import { loadSources } from '../../src/knowledge/source-loader.js'; +import { loadSources, joinPdfPages } from '../../src/knowledge/source-loader.js'; describe('loadSources', () => { let vaultPath: string; @@ -14,8 +14,12 @@ describe('loadSources', () => { 'IL-42 suppresses CD8 by 40%.' ); fs.writeFileSync( - path.join(vaultPath, 'Reading', 'attachments', 'smith-2026-il42', 'large.md'), - 'x'.repeat(42000) // > 10 000 tokens at ~4 chars/token + path.join(vaultPath, 'Reading', 'attachments', 'smith-2026-il42', 'paper.md'), + 'x'.repeat(116000) // ~29 000 tokens at ~4 chars/token — a full paper, under the 50k cap + ); + fs.writeFileSync( + path.join(vaultPath, 'Reading', 'attachments', 'smith-2026-il42', 'huge.md'), + 'x'.repeat(220000) // ~55 000 tokens — exceeds the 50k cap ); }); afterEach(() => { fs.rmSync(vaultPath, { recursive: true, force: true }); }); @@ -31,9 +35,20 @@ describe('loadSources', () => { expect(result.warnings).toHaveLength(0); }); - it('truncates a source that exceeds 10 000 tokens', async () => { + it('returns a full ~29k-token paper without truncation', async () => { const result = await loadSources( - [{ type: 'notes', path: 'large.md' }], + [{ type: 'notes', path: 'paper.md' }], + 'smith-2026-il42', + vaultPath + ); + expect(result.sources[0].truncated).toBe(false); + expect(result.sources[0].content.length).toBe(116000); + expect(result.warnings).toHaveLength(0); + }); + + it('truncates a single source that exceeds the 50 000 token cap', async () => { + const result = await loadSources( + [{ type: 'notes', path: 'huge.md' }], 'smith-2026-il42', vaultPath ); @@ -60,11 +75,11 @@ describe('loadSources', () => { expect(result.warnings.some(w => w.includes('Cannot read'))).toBe(true); }); - it('respects the 30 000 token session cap', async () => { + it('respects the 50 000 token session cap across multiple sources', async () => { for (let i = 1; i <= 4; i++) { fs.writeFileSync( path.join(vaultPath, 'Reading', 'attachments', 'smith-2026-il42', `part${i}.md`), - 'y'.repeat(32000) + 'y'.repeat(60000) // ~15 000 tokens each → 60 000 total, over the 50k cap ); } const result = await loadSources( @@ -72,7 +87,8 @@ describe('loadSources', () => { 'smith-2026-il42', vaultPath ); - expect(result.totalTokens).toBeLessThanOrEqual(30000); + expect(result.totalTokens).toBeGreaterThan(30000); // proves the cap was raised above the old 30k + expect(result.totalTokens).toBeLessThanOrEqual(50000); expect(result.warnings.some(w => w.includes('session cap'))).toBe(true); }); @@ -132,4 +148,34 @@ describe('loadSources', () => { expect(result.sources[0].path).toBe('notes.md'); expect(result.warnings[0]).toContain('relative to the attachment folder'); }); + + it('loads sources from a non-default attachments dir (vault_pdf_dir)', async () => { + const customDir = 'Library/PDFs'; + fs.mkdirSync(path.join(vaultPath, customDir, 'jones-2025'), { recursive: true }); + fs.writeFileSync( + path.join(vaultPath, customDir, 'jones-2025', 'notes.md'), + 'Custom-dir source content.' + ); + + const result = await loadSources( + [{ type: 'notes', path: 'notes.md' }], + 'jones-2025', + vaultPath, + customDir + ); + + expect(result.sources).toHaveLength(1); + expect(result.sources[0].content).toContain('Custom-dir source content'); + expect(result.warnings).toHaveLength(0); + }); +}); + +describe('joinPdfPages', () => { + it('joins pages with page-number markers', () => { + expect(joinPdfPages(['alpha', 'beta'])).toBe('--- page 1 ---\nalpha\n\n--- page 2 ---\nbeta'); + }); + + it('returns an empty string when there are no pages', () => { + expect(joinPdfPages([])).toBe(''); + }); }); diff --git a/tests/unit/template-tools.test.ts b/tests/unit/template-tools.test.ts index 51f14a4..97a65fa 100644 --- a/tests/unit/template-tools.test.ts +++ b/tests/unit/template-tools.test.ts @@ -69,6 +69,45 @@ describe('template tools', () => { expect(parsed.content).toContain('## Extensions'); }); + it('create_reading_note auto-discovers bundle files when sources are omitted', async () => { + const slug = 'auto-discover-paper'; + const bundleDir = path.join(vaultPath, 'Reading', 'attachments', slug); + fs.mkdirSync(bundleDir, { recursive: true }); + fs.writeFileSync(path.join(bundleDir, 'paper.pdf'), '%PDF-1.4 test'); + + const createReadingTool = createTemplateTools(vaultPath, detector) + .find(tool => tool.definition.name === 'create_reading_note'); + const result = JSON.parse(await createReadingTool!.execute({ + title: 'Auto Discover Paper', + authors: ['Alice Smith'], + year: 2026, + journal: 'Cell', + slug, + // no `sources` — should be auto-discovered from the existing bundle + })); + + const parsed = matter(result.newContent); + expect(parsed.data.sources).toEqual([{ type: 'pdf', path: 'paper.pdf' }]); + }); + + it('create_reading_note still creates a sourceless placeholder when no bundle exists', async () => { + const createReadingTool = createTemplateTools(vaultPath, detector) + .find(tool => tool.definition.name === 'create_reading_note'); + const result = JSON.parse(await createReadingTool!.execute({ + title: 'Future Thread Capture', + authors: ['Alice Smith'], + year: 2026, + journal: 'Cell', + slug: 'future-thread-capture', + // no `sources` and no bundle folder — must not error + })); + + expect(result.type).toBe('pending_edit'); + const parsed = matter(result.newContent); + expect(parsed.data.sources ?? []).toEqual([]); + expect(parsed.content).toContain('## Claims'); + }); + it('create_reading_note preserves meaningful body content on update', async () => { const existingPath = path.join(fs.realpathSync(vaultPath), 'Reading', 'Papers', 'drafted-paper.md'); fs.writeFileSync( diff --git a/tests/unit/vault-write-body.test.ts b/tests/unit/vault-write-body.test.ts new file mode 100644 index 0000000..552cb99 --- /dev/null +++ b/tests/unit/vault-write-body.test.ts @@ -0,0 +1,83 @@ +import { afterEach, beforeEach, describe, expect, it } from 'vitest'; +import fs from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { createVaultTools } from '../../src/agent/tools/vault.js'; +import { ConflictDetector } from '../../src/editing/conflict-detector.js'; + +// Frontmatter with a folded title and a multi-line author list — exactly the +// kind of block the agent should never have to retype (feedback #5). +const FRONTMATTER = [ + '---', + 'title: >-', + ' A Very Long Title That Folds Across Lines', + 'authors:', + ' - Alice Smith', + ' - Bob Jones', + ' - Carol Δ White', + 'year: 2026', + 'journal: Nature', + 'sources:', + ' - type: pdf', + ' path: paper.pdf', + '---', +].join('\n'); + +const originalContent = `${FRONTMATTER}\n\n# Old Title\n\n## Claims\n\nold body content\n`; + +describe('vault_write_body', () => { + let vaultPath: string; + let detector: ConflictDetector; + + function getTool() { + return createVaultTools(vaultPath, detector).find(t => t.definition.name === 'vault_write_body')!; + } + + beforeEach(() => { + vaultPath = fs.realpathSync(fs.mkdtempSync(path.join(os.tmpdir(), 'cn-vwb-'))); + fs.mkdirSync(path.join(vaultPath, 'Reading', 'Papers'), { recursive: true }); + detector = new ConflictDetector(); + }); + + afterEach(() => { fs.rmSync(vaultPath, { recursive: true, force: true }); }); + + it('preserves frontmatter verbatim and replaces only the body', async () => { + const rel = 'Reading/Papers/test.md'; + fs.writeFileSync(path.join(vaultPath, rel), originalContent); + + const result = JSON.parse(await getTool().execute({ + path: rel, + body: '# New Title\n\n## Figure Map\n\n## Claims\n\nnew analysis', + })); + + expect(result.type).toBe('pending_edit'); + expect(result.operation).toBe('update'); + // The frontmatter block (incl. folded title, author list, unicode) is byte-identical. + expect(result.newContent.startsWith(FRONTMATTER)).toBe(true); + expect(result.newContent).toContain('new analysis'); + expect(result.newContent).not.toContain('old body content'); + expect(result.newContent).not.toContain('# Old Title'); + }); + + it('errors when the file does not exist', async () => { + const result = JSON.parse(await getTool().execute({ path: 'Reading/Papers/missing.md', body: 'x' })); + expect(result.error).toMatch(/not found/i); + }); + + it('errors when the file has no frontmatter', async () => { + const rel = 'Reading/Papers/no-fm.md'; + fs.writeFileSync(path.join(vaultPath, rel), '# Just a body, no frontmatter\n'); + const result = JSON.parse(await getTool().execute({ path: rel, body: 'x' })); + expect(result.error).toMatch(/frontmatter/i); + }); + + it('records a conflict snapshot before proposing the edit', async () => { + const rel = 'Reading/Papers/test.md'; + const abs = path.join(vaultPath, rel); + fs.writeFileSync(abs, originalContent); + + await getTool().execute({ path: rel, body: 'new' }); + + expect(detector.getSnapshot(abs)?.content).toBe(originalContent); + }); +}); diff --git a/tests/unit/zotero-tools.test.ts b/tests/unit/zotero-tools.test.ts index ddc0dff..f82dbe2 100644 --- a/tests/unit/zotero-tools.test.ts +++ b/tests/unit/zotero-tools.test.ts @@ -587,6 +587,13 @@ describe('zotero_prepare_bundle', () => { expect(result.error).toMatch(/invalid slug/i); }); + it('reports a missing slug distinctly from a malformed one', async () => { + const tool = await getPrepareTool(vault); + const result = JSON.parse(await tool.execute({})); + expect(result.error).toMatch(/slug is required/i); + expect(result.error).not.toMatch(/invalid slug format/i); + }); + it('creates dir, links PDF, writes marker, returns source_type pdf', async () => { const pdfSrc = path.join(os.tmpdir(), `test-${Date.now()}.pdf`); fs.writeFileSync(pdfSrc, Buffer.from('%PDF-test-content'));