Merged
51 changes: 37 additions & 14 deletions AGENT_GUIDE.md
@@ -68,6 +68,8 @@ Save a new document or update an existing one.
| `content` | Yes | Markdown content. Use H1/H2/H3 headings -- the chunker uses them for segmentation. |
| `document_id` | No | UUID of an existing document to update. When provided, updates that document directly regardless of `update_if_exists`. Returns an error if the document does not exist. Workflow: search → note the `[id: ...]` → pass here. |
| `update_if_exists` | No | When `true`, updates the document with the same title (versions the old content). Default `false`. Ignored when `document_id` is provided. |
| `expected_content_hash` | **Yes, on content updates** | Optimistic-concurrency token: the `content_hash` of the version you based your edit on (returned by `cerefox_get_document`, `cerefox_search`, and `cerefox_metadata_search`). Stale → **conflict error** (re-read, merge, retry). Absent → **token-required error**. Not needed when creating. See "Concurrent writers" below. |
| `last_write_wins` | No | Explicitly skip the concurrency check (default `false`). Use ONLY when an external source of truth makes conflicts meaningless (file re-sync). Recorded in the audit log. **Never use it to silence a conflict.** |
| `project_name` | No | **Single** project name (created if absent). On update: **non-destructive add** — ensures this membership exists, preserves others. See "Project membership semantics" below. |
| `project_names` | No | **List** of project names (each created if absent). On update: **destructive replace** — sets the document's full project set to exactly this list. Use when you want to set multiple projects at once, or deliberately change the membership list. Wins over `project_name` when both are passed. |
| `metadata` | No | Arbitrary JSON. Use at minimum: `type` and `status`. |
@@ -76,15 +78,24 @@ Save a new document or update an existing one.

**The update workflow (preferred -- ID-based)**:
1. Search for the document. Note the `[id: abc123]` in the result.
2. Call `cerefox_ingest` with `document_id: "abc123"` and the new content.
3. The old content is automatically versioned and recoverable.
2. `cerefox_get_document("abc123")` — read the current content and note its `content_hash`.
3. Call `cerefox_ingest` with `document_id: "abc123"`, the new content, and `expected_content_hash: "<the hash you read>"`.
4. The old content is automatically versioned and recoverable.

**The update workflow (fallback -- title-based)**:
1. Search for the document first.
2. Call `cerefox_ingest` with the **exact same title** and `update_if_exists: true`.
1. Search for the document first (note its hash).
2. Call `cerefox_ingest` with the **exact same title**, `update_if_exists: true`, and `expected_content_hash`.
3. If you use a different title, a **new** document is created (the old one remains). This is almost never what you want when revising.

**Deduplication**: Content is SHA-256 hashed. Identical content is skipped (no re-indexing). Metadata-only changes update metadata without creating a version.
**Deduplication**: Content is SHA-256 hashed. Identical content is skipped (no re-indexing, no concurrency check needed — identical content cannot lose data). Metadata-only changes update metadata without creating a version.
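
The deduplication rule can be pictured with a small sketch. This is illustrative only: the function names are hypothetical, and hashing the UTF-8 bytes of the content without normalization is an assumption, not something this guide specifies.

```python
import hashlib


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the content (UTF-8 encoding is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def classify_write(stored_content: str, stored_metadata: dict,
                   new_content: str, new_metadata: dict) -> str:
    """Hypothetical helper: decide which kind of write an ingest call amounts to."""
    if content_hash(new_content) != content_hash(stored_content):
        # Content changed: the old content is versioned and the
        # concurrency check applies.
        return "content-update"
    if new_metadata != stored_metadata:
        # Same content, different metadata: no new version is created.
        return "metadata-only"
    # Identical content and metadata: nothing to re-index.
    return "skipped"
```

Only the first outcome needs `expected_content_hash`; the other two cannot overwrite another writer's content.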

#### Concurrent writers (optimistic concurrency)

Cerefox is **shared** memory — another agent (or the user) may update a document between your read and your write. Content updates therefore require proof of freshness: `expected_content_hash` must equal the document's current `content_hash` at write time, checked atomically inside the database.

- **Conflict error** ("document changed since it was read"): the document moved underneath you. `cerefox_get_document` again → **merge your changes into the latest content** → retry with the new hash. Never resolve a conflict by overwriting blindly — the current content includes another writer's work.
- **Token-required error**: you attempted a content update without `expected_content_hash`. Read the document first; if you already did, pass the hash you read.
- `last_write_wins: true` bypasses the check — reserved for re-sync flows where an external source of truth (e.g., files on disk) makes conflicts meaningless. It is recorded in the audit log.
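
The read, merge, retry loop described above can be sketched against an in-memory stand-in. Everything here is hypothetical (`InMemoryStore`, `update_with_retry`, the exception names); it is not the Cerefox client or server code, and the real check runs atomically inside the database rather than in Python.

```python
import hashlib


class ConflictError(Exception):
    """Stale expected_content_hash: the document changed since it was read."""


class TokenRequiredError(Exception):
    """Content update attempted without expected_content_hash."""


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for the server side of the contract."""

    def __init__(self) -> None:
        self.docs = {}

    def get(self, doc_id):
        """Return (content, content_hash), like a read surface would."""
        content = self.docs[doc_id]
        return content, sha256_hex(content)

    def update(self, doc_id, new_content,
               expected_content_hash=None, last_write_wins=False):
        current_hash = sha256_hex(self.docs[doc_id])
        if sha256_hex(new_content) == current_hash:
            return "skipped"  # identical content: no check needed
        if not last_write_wins:
            if expected_content_hash is None:
                raise TokenRequiredError(doc_id)
            if expected_content_hash != current_hash:
                raise ConflictError(doc_id)
        self.docs[doc_id] = new_content
        return "updated"


def update_with_retry(store, doc_id, merge, max_attempts=3):
    """Re-read, re-apply `merge` to the latest content, retry on conflict."""
    for _ in range(max_attempts):
        content, token = store.get(doc_id)
        try:
            return store.update(doc_id, merge(content),
                                expected_content_hash=token)
        except ConflictError:
            continue  # someone else wrote first; start again from a fresh read
    raise ConflictError(f"gave up after {max_attempts} attempts")
```

The key design point is that `merge` receives the latest content on every attempt, so another writer's changes are carried forward instead of being overwritten.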

**What to ingest**: Distilled summaries, decisions with rationale, curated insights. Not raw dumps, logs, or transcripts. Use Markdown headings for structure.

@@ -112,7 +123,7 @@ Retrieve the complete text of a document by its UUID.
| `version_id` | No | UUID of an archived version (from `cerefox_list_versions`). |
| `requestor` | No | Your agent name. |

Use this when search returns partial results, or to read a previous version before restoring it.
Use this when search returns partial results, or to read a previous version before restoring it. The response header includes the document's current `content_hash` — pass it back as `expected_content_hash` when updating via `cerefox_ingest`.

---

@@ -268,26 +279,31 @@ Metadata is matched as **strings**, so store the flag as the string `"true"` (no

```
1. cerefox_search("topic") -- find relevant docs, note [id: uuid]
2. cerefox_get_document(id) -- get full text if partial
2. cerefox_get_document(id) -- get full text + content_hash
3. cerefox_ingest(title, content, -- update by document ID (deterministic)
document_id="uuid")
document_id="uuid",
expected_content_hash="<hash from step 2>")
4. On a conflict error: repeat from step 2, merging your changes into
the latest content before retrying with the fresh hash.
```

### Search then update (title-based -- fallback)

```
1. cerefox_search("topic") -- find relevant docs
2. cerefox_get_document(id) -- get full text if partial
1. cerefox_search("topic") -- find relevant docs (note the hash)
2. cerefox_get_document(id) -- get full text + content_hash
3. cerefox_ingest(title, content, -- update with same title
update_if_exists=true)
update_if_exists=true,
expected_content_hash="<hash from step 2>")
```

### Save new knowledge

```
1. cerefox_search("topic") -- check if it already exists
2. If not found: cerefox_ingest(title, content, project_name, metadata)
3. If found: cerefox_ingest(same_title, new_content, document_id="uuid")
3. If found: cerefox_ingest(same_title, new_content, document_id="uuid",
expected_content_hash="<its current hash>")
```

### Catch up on recent changes
@@ -309,6 +325,7 @@ Metadata is matched as **strings**, so store the flag as the string `"true"` (no
5. **Add metadata**: at minimum `type` (e.g., "research", "decision-log") and `status` ("active", "draft").
6. **Write structured Markdown** with H1/H2/H3 headings. The chunker uses heading structure.
7. **Distill, don't dump.** Summaries > transcripts. Decisions > discussions. Insights > raw data.
8. **Prove freshness on updates.** Pass `expected_content_hash` (the hash you read) on every content update. On conflict: re-read → merge → retry. Never use `last_write_wins` to get past a conflict.

---

@@ -418,7 +435,7 @@ The legacy Python `uv run cerefox` is a frozen husk as of v0.9 — only `uv run
| MCP tool | CLI command |
|---|---|
| `cerefox_search(query, match_count, project_name, metadata_filter, requestor)` | `cerefox search "<query>" --match-count N --project-name <n> --metadata-filter '<json>' --requestor <name>` (also `--mode`, `--alpha`, `--min-score`, `--only-metadata` — CLI-only) |
| `cerefox_ingest(title, content, project_name, metadata, update_if_exists, document_id, source, author, author_type)` (file) | `cerefox document ingest <path> --title <t> --project-name <n> --metadata '<json>' --update-if-exists\|--document-id <uuid> --source <s> --author <a> --author-type user\|agent` |
| `cerefox_ingest(title, content, project_name, metadata, update_if_exists, document_id, expected_content_hash, last_write_wins, source, author, author_type)` (file) | `cerefox document ingest <path> --title <t> --project-name <n> --metadata '<json>' --update-if-exists\|--document-id <uuid> --expected-content-hash <hash>\|--last-write-wins --source <s> --author <a> --author-type user\|agent` |
| `cerefox_ingest(...)` (paste) | `printf '%s' "<content>" \| cerefox document ingest --paste --title "<title>"` (same flags) |
| `cerefox_get_document(document_id, version_id, requestor)` | `cerefox document get <document-id> --version-id <vid> --requestor <name>` |
| `cerefox_list_versions(document_id, requestor)` | `cerefox document version list <document-id> --requestor <name>` |
@@ -476,12 +493,18 @@ printf '# Title\n\nBody markdown with H2s for chunking.\n' \
# Step 1: search and note the [id: abc12345-...] in the result
cerefox search "the exact doc" --match-count 1 --requestor "claude-code"

# Step 2: update by ID
# Step 2: read it — the header shows `content_hash:` (the concurrency token)
cerefox document get "abc12345-..." --requestor "claude-code"

# Step 3: update by ID, proving freshness with the hash from step 2
printf '...new content...' \
| cerefox document ingest --paste \
--title "Exact Same Title" \
--document-id "abc12345-..." \
--expected-content-hash "<hash from step 2>" \
--author "claude-code" --author-type "agent"

# Conflict error? Repeat from step 2, merge into the latest content, retry.
```

**Title-based update (fallback when ID isn't available):**
21 changes: 13 additions & 8 deletions AGENT_QUICK_REFERENCE.md
@@ -7,8 +7,8 @@ Cerefox is a persistent, shared knowledge base. You have **10 MCP tools** (9 of
| Tool | Purpose | Key params |
|------|---------|------------|
| `cerefox_search` | Find documents (hybrid FTS + semantic) | `query` (required), `project_name`, `metadata_filter`, `requestor` |
| `cerefox_ingest` | Save or update a document | `title`, `content` (required), `document_id` (update by ID), `update_if_exists`, `project_name` (single, non-destructive add on update), `project_names` (list, destructive replace on update), `metadata`, `author` |
| `cerefox_get_document` | Get full document by ID | `document_id` (required) |
| `cerefox_ingest` | Save or update a document | `title`, `content` (required), `document_id` (update by ID), `expected_content_hash` (**required on content updates** — see rule 9), `last_write_wins`, `update_if_exists`, `project_name` (single, non-destructive add on update), `project_names` (list, destructive replace on update), `metadata`, `author` |
| `cerefox_get_document` | Get full document by ID (header includes `content_hash` — the update token) | `document_id` (required) |
| `cerefox_list_versions` | Version history of a document | `document_id` (required) |
| `cerefox_metadata_search` | Find or list docs by metadata, project, or time (no text query) | `metadata_filter`, `project_name` (list a project's docs), `updated_since`, `include_content` — **at least one** of metadata_filter/project_name/updated_since/created_since |
| `cerefox_list_metadata_keys` | Discover available metadata keys | (none required) |
@@ -27,20 +27,25 @@ Cerefox is a persistent, shared knowledge base. You have **10 MCP tools** (9 of
6. **Write structured Markdown** with H1/H2/H3 headings for good chunking and search.
7. **Deletes are soft (recoverable); purge is web-UI-only.** If you decide to delete, surface it to the user (`I soft-deleted X — recoverable from the Cerefox web UI trash`). By design, you cannot undo your own delete from agent code.
8. **Cross-doc links inside content**: **always use `[Text](document-uuid)`.** UUIDs are the only fully reliable link form — stable across title changes, never ambiguous, no encoding gotchas. Every `cerefox_search` result shows `[id: <uuid>]` after the title; grab it and use it. Title-based linking (`[Text](<Title With Spaces>)`) is fragile (breaks on colons, parens, ampersands, brackets — silently navigates to wrong page) — **don't write title-based links**; do an extra search to get the UUID instead. Repo-path forms (`[Text](docs/path.md)`) exist for repo-ingested files; don't construct manually. See `AGENT_GUIDE.md → Writing linkable content` for the full rule.
9. **Project memberships — non-destructive by default**: on `cerefox_ingest` updates, **`project_name` (singular) is a non-destructive add** (ensures membership, preserves others). Use **`project_names` (list)** when you want to set the doc's full project set in one call (destructive replace). For metadata-only project changes without writing content, use **`cerefox_set_document_projects(document_id, project_names)`** — that tool is the destructive-replace contract made explicit. Never call `cerefox_set_document_projects` with a single name when you mean "add" — that would REMOVE the doc from all other projects. When in doubt, use `cerefox_ingest` with singular `project_name`.
9. **Concurrency: content updates require `expected_content_hash`.** Pass the `content_hash` you read (shown by `cerefox_get_document`, `cerefox_search`, and `cerefox_metadata_search`) when updating a document. If it's stale you get a **conflict** — re-read the document, merge your changes into the latest content, retry with the new hash. **Never resolve a conflict by overwriting blindly** — the current content includes another writer's work. `last_write_wins: true` skips the check; use it ONLY when an external source of truth makes conflicts meaningless (file re-sync), never to silence a conflict.
10. **Project memberships — non-destructive by default**: on `cerefox_ingest` updates, **`project_name` (singular) is a non-destructive add** (ensures membership, preserves others). Use **`project_names` (list)** when you want to set the doc's full project set in one call (destructive replace). For metadata-only project changes without writing content, use **`cerefox_set_document_projects(document_id, project_names)`** — that tool is the destructive-replace contract made explicit. Never call `cerefox_set_document_projects` with a single name when you mean "add" — that would REMOVE the doc from all other projects. When in doubt, use `cerefox_ingest` with singular `project_name`.
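
The difference between the singular and list forms can be sketched as a pure function over a set of project names. `apply_project_args` is a hypothetical illustration of the semantics described in the project-membership rule, not an actual Cerefox function.

```python
from typing import Iterable, Optional, Set


def apply_project_args(current: Set[str],
                       project_name: Optional[str] = None,
                       project_names: Optional[Iterable[str]] = None) -> Set[str]:
    """Return the document's project set after an update (illustrative only)."""
    if project_names is not None:
        # Destructive replace: the set becomes exactly this list.
        # The list form wins when both arguments are passed.
        return set(project_names)
    if project_name is not None:
        # Non-destructive add: existing memberships are preserved.
        return set(current) | {project_name}
    # Neither argument given: memberships are left unchanged.
    return set(current)
```

Under this model, passing a one-element list removes the document from every other project, which is why the singular form is the safer default for "add".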

## Update Workflow (ID-based -- preferred)

```
search("topic") -> find doc [id: abc123] -> get_document(abc123) -> modify ->
ingest(title="Same Title", content="...", document_id="abc123", author="my-agent")
search("topic") -> find doc [id: abc123] -> get_document(abc123) -> note its content_hash -> modify ->
ingest(title="Same Title", content="...", document_id="abc123",
expected_content_hash="<the hash you read>", author="my-agent")
```

On a **conflict** error: get_document again (fresh content + fresh hash) -> merge your changes -> retry with the new hash.

## Update Workflow (title-based -- fallback)

```
search("topic") -> find doc -> modify ->
ingest(title="Same Title", content="...", update_if_exists=true, author="my-agent")
search("topic") -> find doc (note its hash) -> modify ->
ingest(title="Same Title", content="...", update_if_exists=true,
expected_content_hash="<the hash you read>", author="my-agent")
```

## Catch-Up Workflow
@@ -59,7 +64,7 @@ Same operations, same conventions. Full reference: [`docs/guides/cli.md`](docs/g
|---|---|
| `cerefox_search` | `cerefox search "<q>" --requestor "<your-name>"` |
| `cerefox_ingest` (paste) | `printf '...' \| cerefox document ingest --paste --title "<t>" --author "<your-name>" --author-type agent` |
| `cerefox_ingest` (update by ID) | `printf '...' \| cerefox document ingest --paste --title "<t>" --document-id "<uuid>" --author "<your-name>" --author-type agent` |
| `cerefox_ingest` (update by ID) | `printf '...' \| cerefox document ingest --paste --title "<t>" --document-id "<uuid>" --expected-content-hash "<hash>" --author "<your-name>" --author-type agent` |
| `cerefox_get_document` | `cerefox document get <id> --version-id <vid> --requestor "<your-name>"` |
| `cerefox_list_versions` | `cerefox document version list <id> --requestor "<your-name>"` |
| `cerefox_list_projects` | `cerefox project list --requestor "<your-name>"` |
41 changes: 40 additions & 1 deletion CHANGELOG.md
@@ -9,7 +9,46 @@ Versioning: [Semantic Versioning](https://semver.org/spec/v2.0.0.html) — all `

## [Unreleased]

Open roadmap.
### Changed — BREAKING

- **Optimistic concurrency control on content updates** (design:
[`docs/specs/concurrency-control-design.md`](docs/specs/concurrency-control-design.md)).
Updating a document's content (via `document_id` or `update_if_exists`) now requires
**`expected_content_hash`** — the `content_hash` of the version the edit was based on,
returned by every read surface (`cerefox_get_document`, `cerefox_search`,
`cerefox_metadata_search`, the REST EFs, `cerefox document get` / `cerefox search`,
and the web edit page). The check is atomic inside the `cerefox_ingest_document` RPC
(`SELECT … FOR UPDATE`), closing the read→embed→write race where two concurrent
writers silently overwrote each other (last write wins). A stale hash fails with a **conflict**
(re-read → merge → retry; HTTP 409 on the REST path); a missing hash fails with
**token-required** (HTTP 400). `last_write_wins: true` (CLI `--last-write-wins`)
explicitly skips the check and is recorded in the audit log — `document ingest-dir`
and `guides ingest` pass it internally (the filesystem / npm package is their source
of truth), and the frozen Python fallback declares it to preserve its historical
behavior. **Breaking**: pre-v0.11 clients' content updates fail against an upgraded
server until updated (`cerefox self-update`); existing GPT Actions need the v2.0.0
OpenAPI block re-pasted. Creates are unaffected. Schema version 0.4.0 → **0.5.0**
(RPC-only change; ships via `cerefox server deploy --schema-only`).

### Added

- `content_hash` returned by all document-shaped reads (MCP tool headers, CLI output,
REST EF responses, web document API) — the token for the concurrency contract above.
- CLI flags `--expected-content-hash` / `--last-write-wins` on `cerefox document ingest`.
- Web edit page detects mid-edit concurrent changes and shows a merge-needed conflict
error instead of silently overwriting.

### Fixed

- **Web edit page could corrupt metadata keys via the key autocomplete.** The key
suggestions embedded the usage count in the option label (`status (108)`), and
Mantine's Autocomplete inserts the *label* into the field on select — so picking a
suggestion (and saving) stored the literal string `status (108)` as the metadata key,
polluting the KB taxonomy (it then showed up in the key list as `status (108) (1)`).
The dropdown now shows the count via `renderOption` ("status · 108 docs" style),
while only the bare key ever enters the field. The search filter's key Select (which
was never affected — Select keeps value/label separate) now labels the count as
"(N docs)" for clarity.

---
