DocuMind is a local-first retrieval-augmented generation (RAG) system with two Chroma collections: a public index (default; Wikipedia-scale text via offline bulk jobs) and a papers index (PDFs, DOCX, .txt, arXiv). Documents are ingested, chunked, embedded into ChromaDB (cosine space), and queried through FastAPI with a per-request library field. Answers are grounded on retrieved passages, returned with structured citations, and shaped by mode-specific (and library-aware) generation policies. Default LLM and embedding inference run via Ollama on operator-controlled hardware.
This document specifies architecture, control flows, configuration, and operational behavior sufficient for engineering review, extension, and production hardening.
- System overview
- Design principles
- Repository layout
- Runtime architecture
- Data lifecycle: ingest → index
- Retrieval and generation pipeline
- Query modes
- FLARE-inspired active retrieval
- HTTP API
- Configuration
- Security middleware
- Observability and reliability
- Deployment
- Bundled corpus and scripts
- Testing
- Known limitations and extension points
- Portfolio artifacts
- References
- Interview narrative: quality bar, challenges, and retrieval design
| Layer | Responsibility |
|---|---|
| Presentation | Next.js 15 dashboard (web/) for operator workflows; optional Streamlit (frontend/app.py) calling the same REST API. |
| Application | FastAPI application (app/main.py): routing, middleware, dependency injection, lifespan-managed singletons. |
| Domain services | Document parsing and chunking (app/services/document_service.py, app/utils/chunker.py); vector persistence (app/services/embedding_service.py); RAG orchestration (app/services/rag_service.py). |
| Model I/O | Ollama client (app/utils/ollama_client.py): chat completions and per-text embeddings over HTTP. |
| Persistence | Chroma persistent client on disk (CHROMA_PERSIST_DIR); two collections — CHROMA_COLLECTION_PUBLIC (encyclopedia-scale) and CHROMA_COLLECTION_NAME (papers / PDFs). Cosine space (hnsw:space: cosine). |
Ports (convention): API 8001, Next.js dev 3002, Ollama 11434.
- Grounding first — Final user-facing answers for LLM-backed modes are conditioned only on retrieved chunk text; prompts explicitly forbid inventing papers, metrics, or datasets absent from context.
- Explicit provenance — Responses include `SourceCitation` objects (document id, title, section, page hint, chunk index, distance, preview).
- Dependency-aware serving — Liveness vs readiness split so orchestrators can distinguish “process up” from “dependencies usable”.
- Configurable retrieval policy — Top‑k, distance cutoff, keyword rerank weight, fallback when strict filtering returns nothing, and diversity caps are all environment-tunable.
- Single-tenant baseline — One shared library index per deployment; ACLs per document are not implemented in-tree (see §16).
| Path | Role |
|---|---|
| `app/main.py` | FastAPI app, lifespan, global exception handler, middleware, router includes. |
| `app/config.py` | pydantic-settings `Settings`; single cached `get_settings()`. |
| `app/logging_config.py` | Optional JSON logging layout. |
| `app/routers/ingest.py` | Multipart ingest, delete by `doc_id`. |
| `app/routers/papers.py` | List / get / delete paper metadata from index. |
| `app/routers/query.py` | RAG query and collection stats. |
| `app/routers/arxiv.py` | arXiv PDF fetch by id. |
| `app/services/document_service.py` | File type detection, text extraction, delegation to chunker. |
| `app/services/embedding_service.py` | Chroma add/query/delete; Ollama embeddings. |
| `app/services/rag_service.py` | Retrieval, rerank, diversity, mode prompts, FLARE branch, Ollama chat. |
| `app/utils/chunker.py` | `RecursiveCharacterTextSplitter`; section heuristics in metadata. |
| `app/utils/ollama_client.py` | Retry-wrapped HTTP to Ollama `/api/chat` and `/api/embeddings`. |
| `app/models/` | Pydantic request/response models shared by routers. |
| `data/sample_docs/` | Bundled UTF-8 corpus (see §14). |
| `tests/` | API and unit tests; `tests/conftest.py` uses dependency overrides and fake embedding/RAG for isolation. |
| `evaluation/` | Optional regression fixtures for pipeline shape. |
| `scripts/` | Corpus generators, portfolio PDF, arXiv bulk helpers. |
| `web/` | Next.js operator UI. |
| `Dockerfile` / `docker-compose.yml` | Container image (Python 3.11-slim, non-root) and Compose stack with Chroma volume + healthcheck. |
```mermaid
flowchart TB
  subgraph clients [Clients]
    N[Next.js]
    S[Streamlit]
  end
  subgraph api [DocuMind API]
    F[FastAPI]
    L[Lifespan: services + seed]
  end
  subgraph svc [Services]
    D[DocumentService]
    E[ChromaEmbeddingService]
    R[RAGService]
  end
  subgraph ext [External]
    O[Ollama]
    C[(ChromaDB)]
  end
  N --> F
  S --> F
  F --> L
  L --> D
  L --> E
  L --> R
  D --> E
  R --> E
  R --> O
  E --> O
  E --> C
```
Lifespan (app/main.py): On startup, constructs OllamaClient, one shared chromadb.PersistentClient on CHROMA_PERSIST_DIR, then an EmbeddingRegistry with two ChromaEmbeddingService wrappers (papers + public collections) and two RAGService instances (one per content library). Sharing one client avoids double-opening the same SQLite store. seed_sample_docs runs against the papers collection only when SEED_SAMPLE_DOCS=true: it compares the SAMPLE_CORPUS_VERSION marker on disk to the configured value; on mismatch, it deletes sample_* vectors in that collection, rewrites the marker, then ingests each data/sample_docs/*.txt as sample_<stem>. The public collection stays empty until you run scripts/bulk_index_public.py or scripts/build_public_corpus.py (or POST /api/v1/ingest with library=public).
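A minimal sketch of that wiring, assuming the module paths from the repository layout above and illustrative constructor signatures and settings attribute names (the real code is app/main.py):

```python
from contextlib import asynccontextmanager

import chromadb
from fastapi import FastAPI

from app.config import get_settings                      # per repository layout; signatures assumed
from app.services.embedding_service import ChromaEmbeddingService
from app.services.rag_service import RAGService
from app.utils.ollama_client import OllamaClient


@asynccontextmanager
async def lifespan(app: FastAPI):
    settings = get_settings()
    ollama = OllamaClient(settings.OLLAMA_BASE_URL)       # constructor args assumed
    # One shared persistent client: opening the same SQLite store twice is what this avoids.
    chroma = chromadb.PersistentClient(path=settings.CHROMA_PERSIST_DIR)

    papers = ChromaEmbeddingService(chroma, settings.CHROMA_COLLECTION_NAME, ollama)
    public = ChromaEmbeddingService(chroma, settings.CHROMA_COLLECTION_PUBLIC, ollama)
    # A dict stands in here for the real EmbeddingRegistry; one RAGService per content library.
    app.state.libraries = {
        "papers": RAGService(papers, ollama),
        "public": RAGService(public, ollama),
    }

    if settings.SEED_SAMPLE_DOCS:
        # seed_sample_docs(papers) runs here in the real app (papers collection only).
        pass
    yield                                                  # app serves requests from here


app = FastAPI(lifespan=lifespan)
```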
Routers mount under /api/v1 except health routes at root.
- Input: `POST /api/v1/ingest` (multipart/form-data: file + optional `library` field, default `public`) or `POST /api/v1/fetch-arxiv` (JSON `arxiv_id`; always indexes papers).
- Validation: File size cap `MAX_FILE_SIZE_MB`; MIME/type checks in the ingest router / document service.
- Extraction: PyPDF2 for PDF, python-docx for DOCX, raw decode for `.txt`.
- Metadata: Heuristic title, authors, year, optional arXiv id from leading text when parseable.
- Chunking: `DocumentChunker` uses LangChain `RecursiveCharacterTextSplitter` with `CHUNK_SIZE` and `CHUNK_OVERLAP`. Each `langchain_core.documents.Document` carries metadata: `doc_id`, `filename`, `section` (heuristic), `chunk_index`, `page_number` when known, etc.
- Indexing: `ChromaEmbeddingService.add_documents` (HTTP path) or `add_indexed_batch` (bulk indexer) embeds chunks via the Ollama `EMBEDDING_MODEL` and writes to the selected collection with stable ids `{doc_id}_{i}`. Each chunk's metadata is stamped with `embedding_model`, `chroma_collection`, and `indexed_at` (UTC) for re-embed and drift workflows.
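A hedged sketch of the chunking step (the real implementation is app/utils/chunker.py; the section heuristic below is a toy stand-in, and parameter values mirror `CHUNK_SIZE`/`CHUNK_OVERLAP`):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def guess_section(text: str) -> str:
    # Toy stand-in for the real section heuristic in app/utils/chunker.py.
    head = text.lower()[:80]
    for name in ("abstract", "introduction", "methodology", "experiments", "results", "conclusion"):
        if name in head:
            return name
    return "body"


def chunk_text(doc_id: str, filename: str, text: str,
               chunk_size: int = 1000, chunk_overlap: int = 200) -> list[Document]:
    """Split extracted text into overlapping chunks carrying DocuMind-style metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # CHUNK_SIZE
        chunk_overlap=chunk_overlap,  # CHUNK_OVERLAP
    )
    return [
        Document(
            page_content=piece,
            metadata={
                "doc_id": doc_id,
                "filename": filename,
                "section": guess_section(piece),
                "chunk_index": i,
            },
        )
        for i, piece in enumerate(splitter.split_text(text))
    ]
```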
DELETE /api/v1/papers/{doc_id} and DELETE /api/v1/ingest/{doc_id} call embedding_service.delete_document. If no chunks exist for that doc_id, the service returns false and the API responds 404 — empty delete is not silently successful.
Each Chroma collection is created with metadata={"hnsw:space": "cosine"}. Query results expose distance per hit; the RAG layer sorts ascending (lower distance = closer match) and keeps rows with distance < RELEVANCE_THRESHOLD before optional fallback (threshold is a tunable cutoff on this distance scale for your embedding model and corpus).
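For reference, a minimal sketch of the collection setup and the distance cutoff (collection name, vector dimension, and threshold value are illustrative; real embeddings come from the Ollama `EMBEDDING_MODEL`):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")            # CHROMA_PERSIST_DIR
papers = client.get_or_create_collection(
    name="papers",                                                   # CHROMA_COLLECTION_NAME
    metadata={"hnsw:space": "cosine"},
)

query_vector = [0.0] * 768           # placeholder; use the embedding model's output in practice
res = papers.query(query_embeddings=[query_vector], n_results=56)

hits = sorted(
    zip(res["documents"][0], res["metadatas"][0], res["distances"][0]),
    key=lambda row: row[2],                                          # ascending: lower = closer
)
RELEVANCE_THRESHOLD = 0.65                                           # tune per model and corpus
kept = [h for h in hits if h[2] < RELEVANCE_THRESHOLD]
```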
All logic below is implemented in app/services/rag_service.py unless noted.
For a user top_k and query_mode, the service expands the vector search n_results before reranking (e.g. up to 64 for general / compare, up to 56 for other modes). This widens the candidate pool so rerank and diversity filters have material to work with.
- `embedding_service.search(embed_query, retrieve_k, section_filter)` returns rows `{content, metadata, distance}`.
- Keyword rerank: Rows are sorted by `distance − KEYWORD_RERANK_WEIGHT × keyword_overlap_score(rerank_query, content)`, so lexical overlap with the user question can reorder within a distance band.
- Threshold filter: Keep rows with `distance < RELEVANCE_THRESHOLD`.
- Fallback: If nothing passes and `ENABLE_FALLBACK_RETRIEVAL` is true, take the top `FALLBACK_TOP_N` by rerank order and mark the result internally (the answer may append a disclosure line).
- Diversity: `_select_diverse_sources` prefers at most one strong chunk per `doc_id` before filling remaining slots, reducing single-document context monopolization.
- Context slot cap: Depends on `query_mode` (e.g. up to 24 chunks for `general`/`compare`).
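A condensed sketch of this candidate selection (the real logic, including the keyword scorer, lives in app/services/rag_service.py; helper names and the overlap formula here are illustrative):

```python
def keyword_overlap(question: str, content: str) -> float:
    # Toy overlap score; the real scorer is keyword_overlap_score in rag_service.py.
    q = set(question.lower().split())
    c = set(content.lower().split())
    return len(q & c) / max(len(q), 1)


def select_chunks(rows, question, *, threshold, rerank_weight,
                  fallback_enabled, fallback_top_n, max_slots):
    """rows: [{'content': str, 'metadata': dict, 'distance': float}, ...] from Chroma."""
    # Keyword rerank: lexical overlap can promote rows within a distance band.
    ranked = sorted(
        rows,
        key=lambda r: r["distance"] - rerank_weight * keyword_overlap(question, r["content"]),
    )

    kept = [r for r in ranked if r["distance"] < threshold]
    used_fallback = False
    if not kept and fallback_enabled:
        kept, used_fallback = ranked[:fallback_top_n], True    # disclosed in the answer

    # Diversity: at most one strong chunk per doc_id first, then fill remaining slots.
    diverse, seen = [], set()
    for r in kept:
        if r["metadata"]["doc_id"] not in seen:
            diverse.append(r)
            seen.add(r["metadata"]["doc_id"])
    for r in kept:
        if len(diverse) >= max_slots:
            break
        if r not in diverse:
            diverse.append(r)
    return diverse[:max_slots], used_fallback
```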
- `datasets` mode: Does not call the LLM for the main body. It scans retrieved chunk text for known dataset hints and patterns and emits a structured Markdown inventory. FLARE is skipped.
- Other modes: Builds a single context block from selected chunks, applies the mode’s system prompt (`SYSTEM_PROMPTS` for papers, `PUBLIC_SYSTEM_PROMPTS` for public), calls `OllamaClient.chat` with mode-dependent temperature, and returns a Markdown answer plus a `SourceCitation` list.
- Confidence: Derived from mean chunk distance (clamped); exposed as a scalar for the UI.
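The confidence scalar is derived roughly as follows; the exact mapping and clamp bounds are assumptions (see rag_service.py):

```python
def confidence_from_distances(distances: list[float]) -> float:
    """Map the mean distance of selected chunks to a 0–1 confidence score (illustrative)."""
    if not distances:
        return 0.0
    mean = sum(distances) / len(distances)
    return max(0.0, min(1.0, 1.0 - mean))  # lower distance → higher confidence, clamped
```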
| `query_mode` | Behavior |
|---|---|
| `general` | Broad grounded synthesis; higher temperature than `methodology`. |
| `compare` | Cross-paper comparison framing; large retrieval budget; table-oriented prompt. |
| `methodology` | Implementation-focused extraction; moderate temperature. |
| `datasets` | Deterministic dataset / benchmark surfacing from chunk text. |
| `reproduce` | Reproducibility checklist style; structured sections in prompt. |
Optional `section_filter` restricts the Chroma where clause to a metadata `section` value (abstract, introduction, methodology, experiments, results, conclusion).
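Under the hood this maps to a Chroma metadata filter; a minimal sketch, reusing the `papers` collection and `query_vector` from the earlier Chroma sketch:

```python
# section_filter from the request, when provided, becomes a metadata equality filter.
filtered = papers.query(
    query_embeddings=[query_vector],
    n_results=24,
    where={"section": "methodology"},
)
```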
Full FLARE (Jiang et al., arXiv:2305.06983) uses token-level confidence to trigger mid-generation retrieval. Ollama’s chat API used here does not expose per-token logprobs.
Implementation: When use_flare (request) or FLARE_ACTIVE_RETRIEVAL (settings) is true and mode ≠ datasets:
- Run the standard first-pass retrieval → context selection.
- Build a truncated mini-context (bounded by `FLARE_DRAFT_MAX_CONTEXT_CHARS`) from selected chunks.
- Call the LLM once with `FLARE_DRAFT_SYSTEM` to produce a 2–4 sentence forward-looking draft; unsupported facts must appear as `???` or explicit excerpt-level hedges.
- If `flare_triggers_follow_up(draft)` is true, run a second `search` with a composite query (user question + draft excerpt, capped length).
- Merge reranked lists by chunk identity, keeping the better (lower) distance per chunk; re-run threshold, fallback, and diversity on the merged set.
- Final synthesis uses merged chunks. Response fields `flare_enabled` and `flare_followup_retrieval` record what occurred.
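A sketch of the uncertainty gate and the merge step (the hedge phrases in the regex are illustrative, not the exact production patterns; the merge keeps the lower distance per stable chunk id):

```python
import re


def flare_triggers_follow_up(draft: str) -> bool:
    # Lexical/regex uncertainty gate; '???' comes from FLARE_DRAFT_SYSTEM,
    # the hedge phrases below are assumptions for illustration.
    return "???" in draft or bool(
        re.search(r"\b(not stated|unclear|unknown|cannot determine)\b", draft, re.IGNORECASE)
    )


def merge_passes(first: list[dict], second: list[dict]) -> list[dict]:
    """Merge two retrieval passes by stable chunk id, keeping the lower distance."""
    best: dict[str, dict] = {}
    for row in first + second:
        cid = f"{row['metadata']['doc_id']}_{row['metadata']['chunk_index']}"
        if cid not in best or row["distance"] < best[cid]["distance"]:
            best[cid] = row
    # Threshold, fallback, and diversity are then re-applied to this merged list.
    return sorted(best.values(), key=lambda r: r["distance"])
```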
| Approach | Idea | Why it is not the default here |
|---|---|---|
| Token-level FLARE (paper-faithful) | Use per-token confidence from the generator to trigger retrieval mid-stream. | Ollama’s /api/chat does not expose logprobs; wiring OpenAI logprobs would fork the inference abstraction. |
| HyDE | LLM hallucinates a hypothetical document; embed that for retrieval. | Extra latency + hallucinated retrieval queries can pollute dense search on technical corpora unless heavily guarded. |
| Multi-query / RAG-Fusion | LLM emits several sub-queries; retrieve per query; fuse (RRF). | Strong for recall; cost and latency scale with query count; harder to explain citations per sub-query in a portfolio UI. |
| Self-RAG / CRAG | Model judges “is retrieval needed?” and quality of hits; may rewrite queries. | Heavier orchestration and eval surface; many steps for a local single-GPU demo. |
| Re-ranker only (cross-encoder) | Keep one retrieval pass; rerank with a second model. | Excellent production pattern; not bundled to keep the stack Ollama-centric and CPU-light for reviewers cloning cold. |
Why FLARE-shaped active retrieval anyway: It is literature-grounded (easy to cite Jiang et al. in interviews), bounded (one draft call + at most one follow-up search), and honest about constraints (draft uses ??? / hedges instead of fake logprobs). It demonstrates you understand when to stop retrieving and how to merge evidence from two passes—without pretending the host is a commercial API.
| Method | Path | Body / params | Notes |
|---|---|---|---|
| GET | `/health` | — | Ollama availability + stats for the DEFAULT_LIBRARY collection. |
| GET | `/health/live` | — | Process liveness. |
| GET | `/health/ready` | — | 503 if dependencies not ready. |
| GET | `/api/v1/libraries` | — | Both collections’ CollectionStats + default_library (ops / capacity). |
| POST | `/api/v1/ingest` | multipart/form-data: file, optional `library` | Indexes into public or papers. |
| DELETE | `/api/v1/ingest/{doc_id}` | Query `?library=` (default public) | 404 if no chunks. |
| POST | `/api/v1/fetch-arxiv` | `{ "arxiv_id": "..." }` | Downloads PDF; indexes papers only. |
| POST | `/api/v1/query` | QueryRequest JSON (library default public) | See app/models/request_models.py. |
| GET | `/api/v1/papers` | Query `?library=` | Library cards. |
| GET | `/api/v1/papers/{doc_id}` | Query `?library=` | One document. |
| DELETE | `/api/v1/papers/{doc_id}` | Query `?library=` | 404 if no chunks. |
| GET | `/api/v1/collection/stats` | Query `?library=` | Aggregate counts for one collection. |
OpenAPI: /docs, /redoc, /openapi.json unless DISABLE_OPENAPI=true.
Authentication: When API_KEY is non-empty, all /api/v1/* routes (except CORS preflight) require header X-API-Key matching the setting; mismatch → 401.
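For example, a minimal client call with the API key header (an approximation: the JSON field names, including the question field, follow app/models/request_models.py and are assumed here):

```python
import httpx

resp = httpx.post(
    "http://127.0.0.1:8001/api/v1/query",
    headers={"X-API-Key": "change-me"},   # required only when API_KEY is set
    json={
        "question": "Which datasets are used for evaluation?",  # field name assumed
        "query_mode": "datasets",
        "top_k": 8,
        "library": "papers",
    },
    timeout=120.0,
)
resp.raise_for_status()
data = resp.json()   # Markdown answer, SourceCitation list, confidence, FLARE flags
```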
All keys are listed in .env.example. Grouped reference:
| Group | Variables | Purpose |
|---|---|---|
| Models | `OLLAMA_BASE_URL`, `LLM_MODEL`, `EMBEDDING_MODEL` | Inference endpoints and model tags. |
| Vector store | `CHROMA_PERSIST_DIR`, `CHROMA_COLLECTION_NAME`, `CHROMA_COLLECTION_PUBLIC`, `DEFAULT_LIBRARY` | On-disk path; papers vs public collection names; default library for /health stats. |
| Chunking | `CHUNK_SIZE`, `CHUNK_OVERLAP` | Text splitter parameters; affects chunk count and context granularity. |
| Retrieval defaults | `TOP_K_RESULTS`, `RELEVANCE_THRESHOLD`, `ENABLE_FALLBACK_RETRIEVAL`, `FALLBACK_TOP_N`, `KEYWORD_RERANK_WEIGHT` | Global defaults; per-request top_k overrides for query. |
| Ingest | `MAX_FILE_SIZE_MB`, `ARXIV_BASE_URL` | Upload cap and arXiv PDF export host. |
| Sample corpus | `SAMPLE_CORPUS_VERSION`, `SEED_SAMPLE_DOCS` | Bump version to purge/re-seed sample_* in papers when SEED_SAMPLE_DOCS=true. |
| Network | `CORS_ORIGINS`, `CORS_ALLOW_ALL`, `TRUSTED_HOSTS` | Browser and Host-header policy. |
| App | `APP_ENV`, `DISABLE_OPENAPI` | Environment label; docs toggle. |
| Security / transport | `API_KEY`, `ENABLE_RESPONSE_GZIP` | Optional API key gate; gzip responses. |
| Logging | `LOG_LEVEL`, `LOG_JSON` | Verbosity and JSON log lines. |
| FLARE | `FLARE_ACTIVE_RETRIEVAL`, `FLARE_DRAFT_MAX_CONTEXT_CHARS` | Global FLARE default and draft context budget. |
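A minimal `.env` excerpt for a local setup (values and model tags are illustrative, not the project's pinned defaults; see `.env.example` for the authoritative list):

```
OLLAMA_BASE_URL=http://127.0.0.1:11434
LLM_MODEL=llama3.1:8b
EMBEDDING_MODEL=nomic-embed-text
CHROMA_PERSIST_DIR=./chroma_data
DEFAULT_LIBRARY=public
TOP_K_RESULTS=8
RELEVANCE_THRESHOLD=0.65
SEED_SAMPLE_DOCS=true
API_KEY=
LOG_JSON=true
```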
Applied in app/main.py (order matters for FastAPI / Starlette):
- CORS — `CORSMiddleware` with explicit origins, or wildcard when `CORS_ALLOW_ALL` (dev-only).
- Trusted hosts — Optional `TrustedHostMiddleware` when `TRUSTED_HOSTS` is set.
- Gzip — `GZipMiddleware` when `ENABLE_RESPONSE_GZIP` is enabled and the payload exceeds the minimum size.
- Per-request — `X-Request-ID` assignment, optional API key gate, default security headers (`X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`; `Permissions-Policy` in production `APP_ENV`).
- Errors — `HTTPException` and `RequestValidationError` return structured JSON; uncaught exceptions return 500 with `request_id` in the body.
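A sketch of the per-request layer (the real version in app/main.py also handles the API key gate and the production-only `Permissions-Policy`; header values below are common defaults, not necessarily the exact production ones):

```python
import uuid

from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def per_request(request: Request, call_next):
    request_id = str(uuid.uuid4())           # assigned per request, echoed in logs
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    response.headers["X-Content-Type-Options"] = "nosniff"
    response.headers["X-Frame-Options"] = "DENY"
    response.headers["Referrer-Policy"] = "no-referrer"
    return response
```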
- Request correlation — Every response carries `X-Request-ID`; access logs include `request_id`, method, path, status, and `duration_ms`.
- Structured logs — `LOG_JSON=true` for log platforms.
- Healthchecks — Docker Compose defines an HTTP probe against `/health/live` (see docker-compose.yml). Prefer `/health/ready` for LB routing when Ollama and Chroma must be live.
- Chroma persist corruption (development) — If opening the store raises a recoverable Chroma/Rust error (`APP_ENV=development`), the API renames `CHROMA_PERSIST_DIR` to a sibling `*.broken.<UTC>` folder, then exits startup with `RuntimeError`. Restart the process once so a fresh Python interpreter opens the new empty directory (PyO3 panics can poison in-process bindings; an immediate re-open in the same process is unsafe). Production/staging surfaces the error without renaming.
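With LOG_JSON=true, an access log line has roughly this shape (illustrative values; keys beyond those listed above may differ):

```json
{"level": "INFO", "request_id": "8f3c…", "method": "POST", "path": "/api/v1/query", "status": 200, "duration_ms": 4182}
```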
| Target | Command / notes |
|---|---|
| Docker Compose | docker compose up --build — publishes 8001, mounts Chroma volume chroma_data, read-only ./data. Set OLLAMA_BASE_URL to reachable Ollama (default host.docker.internal:11434 on Docker Desktop). |
| Bare metal / VM | uvicorn app.main:app --host 0.0.0.0 --port 8001 (add --proxy-headers behind TLS terminator per your platform). |
| Windows dev | .\start_documind.ps1 (Ollama, API, Next); uses .venv\Scripts\python.exe when present. First boot can sit in corpus ingest for a long time before /health responds; the script waits up to 180 minutes (-MaxApiWaitMinutes). -SkipModelPull speeds repeat boots. .\stop_documind.ps1 clears ports 3002, 8001, 11434 — confirm Ollama shutdown is intended. |
Backup: Copy CHROMA_PERSIST_DIR regularly; it is the authoritative index. Source PDFs/DOCX should remain in object storage or VCS-independent archives if they are not all under data/.
- `data/sample_docs/` — On the order of ~460 UTF-8 files (~1 MB of text): curated summaries plus 400 synthetic `sample_corpus_p7_*.txt` files from scripts/generate_production_corpus.py. This is not a massive real-world KB; it is optional demo material for the papers collection when `SEED_SAMPLE_DOCS=true`.
- Chroma in the repo clone — Often tens of MB after local indexing; size grows with chunk count × (vectors + stored text + HNSW). An empty public collection adds negligible disk until you bulk-index.
- `pip install datasets` (for Hugging Face streaming).
- One command (stream + bulk index): `python scripts/build_public_corpus.py --articles 10000`. Use `--articles 50000` or higher for serious scale; `--articles 0 --allow-unbounded` streams the full dump (disk-hungry).
- Piecemeal: `scripts/stream_wikipedia_to_txt.py` → `scripts/bulk_index_public.py` (`--dry-run` for chunk estimates, `--checkpoint` for resume, `--workers` for parallel Ollama embeds).
- Ops: `GET /api/v1/libraries` for both collections’ chunk and document counts.
- Regeneration (papers bundle): `python scripts/generate_production_corpus.py --count 500 --force`, then bump `SAMPLE_CORPUS_VERSION` (with `SEED_SAMPLE_DOCS=true`).
- Hand-authored expansion: `scripts/materialize_institutional_corpus.py`.
- arXiv bulk: `scripts/bulk_ingest_arxiv.py` + `data/arxiv_seed_list.txt` (indexes papers).
Run the unit suite with `pytest -q`. `tests/conftest.py` overrides FastAPI dependencies with fake embedding/RAG services so unit tests do not require Ollama or Chroma.
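The override pattern looks roughly like this (a self-contained toy; the real app, provider functions, and fakes live in app/ and tests/conftest.py, and their names differ):

```python
import pytest
from fastapi import Depends, FastAPI
from fastapi.testclient import TestClient


def get_embedding_service():              # hypothetical provider name
    raise RuntimeError("real service requires Chroma + Ollama")


app = FastAPI()


@app.get("/stats")
def stats(svc=Depends(get_embedding_service)):
    return {"chunks": svc.count()}


class FakeEmbeddingService:
    def count(self) -> int:
        return 3


@pytest.fixture
def client():
    # Swap the real dependency for the fake so the test never touches Chroma or Ollama.
    app.dependency_overrides[get_embedding_service] = FakeEmbeddingService
    with TestClient(app) as c:
        yield c
    app.dependency_overrides.clear()


def test_stats(client):
    assert client.get("/stats").json() == {"chunks": 3}
```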
Query regression suite: tests/test_rag_query_suite.py runs 20 parameterized cases (tests/query_eval_cases.py) against the real RAGService with a ranking-aware fake vector layer (tests/ranking_fake_embedding.py) and a deterministic Ollama stub — metrics cover status, has_answer, source counts, answer substrings, and wall time. Live library smoke: python scripts/run_query_eval.py --base-url http://127.0.0.1:8001 (optional --csv report.csv; skip empty-corpus cases with --skip-empty-corpus-cases).
CI: On push and pull request to main or master, GitHub Actions (.github/workflows/ci.yml) runs on Python 3.11 and 3.12: ruff check (syntax / undefined-name rules), then pytest. No Ollama or Chroma in CI. Pytest and Ruff defaults: pyproject.toml. Dependabot for Actions: .github/dependabot.yml.
Not implemented in this repository (non-exhaustive):
- Per-user or per-tenant ACL on chunks or documents.
- SSO / OIDC for the API or UI.
- OCR pipeline for low-quality scanned PDFs beyond basic text extraction.
- Hosted managed vector SaaS swap (Pinecone, Weaviate, etc.) — would replace `ChromaEmbeddingService` while preserving router contracts.
- Token-level FLARE — requires a host that exposes logprobs or an alternative uncertainty model.
- Chroma auto-quarantine — development-only; requires one manual restart after a bad on-disk store is moved aside (see §12).
Natural extensions: swap Ollama for OpenAI/Azure OpenAI behind the same RAGService boundary; add golden-set eval CI; wire /health/ready to load balancers; add cross-encoder reranking as an optional second stage.
Under portfolio/: client project catalog HTML, portfolio brief HTML, optional PDF generation (scripts/portfolio_requirements.txt, scripts/generate_portfolio_pdf.py), dashboard screenshot portfolio/screenshots/documind-dashboard.png.
Regenerating the screenshot (recommended): A bare playwright screenshot of the home page misses indexed doc counts and synthesis. Use the bundled driver after API + Next are up and sample ingest has progressed:
```powershell
.\.venv\Scripts\pip install -r scripts\screenshot_requirements.txt
.\.venv\Scripts\playwright install chromium
.\scripts\capture_dashboard.ps1   # waits for /health/live, Gold demo scenario (compare, Top K 24, FLARE), synthesis text, tall-viewport PNG
# Smaller corpus / faster index gate: .\scripts\capture_dashboard.ps1 -MinDocs 40
# Custom API / wait cap: .\scripts\capture_dashboard.ps1 -ApiBase "http://127.0.0.1:8001" -MaxLivenessWaitMinutes 240
```

Or directly: `.\.venv\Scripts\python scripts\capture_dashboard_playwright.py --help` — waits for ≥120 chars in `.prose-answer`, scrolls synthesis into view, writes a 1680×3200 portfolio/screenshots/documind-dashboard.png (default `--viewport-width 1680`; use `--viewport-width 1440` if needed), then a 1000×750 portfolio/screenshots/documind-upwork-catalog-1000x750.png (default stack infographic tile; `--plain-catalog-thumb` top-crops the dashboard). Thumb only: `.\.venv\Scripts\python scripts\capture_dashboard_playwright.py --thumb-only`. Standalone tile: `python scripts/catalog_thumb_art.py --out portfolio/screenshots/documind-upwork-catalog-1000x750.png`. Avoid `--full-page` for portfolio assets.
- Jiang et al., Active Retrieval Augmented Generation (FLARE), arXiv:2305.06983.
- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE), arXiv:2212.10496.
- FastAPI, Pydantic v2, ChromaDB, LangChain text splitters, Ollama HTTP API.
- Senior IC–level system design: dual corpora (`library` routing), explicit provenance on chunks, live vs ready health, optional API key, structured errors with `request_id`, Docker + Compose, regression tests that do not require GPU clusters in CI.
- RAG depth beyond “call OpenAI”: retrieval budget by mode, keyword rerank, diversity cap, fallback when the strict distance filter starves, optional second retrieval pass with merge semantics, deterministic `datasets` mode for grounded extraction without generative drift.
- Single-tenant / no row-level ACL, no SSO, no rate limiting or quota service, no multi-region active-active.
- Ollama-centric inference: great for reproducible demos; production would likely pin vendor APIs or vLLM behind autoscaling with SLO dashboards.
- Chroma embedded SQLite on disk: fine for many products; hyperscale teams often move vectors to managed stores (e.g. Pinecone, Weaviate Cloud, Aurora pgvector) with backup/restore runbooks.
- CI does not run full embedding + LLM golden paths — by design, for cost; live `scripts/run_query_eval.py` is the operator’s integration check.
- Dual library without doubling connections — Two logical indexes, one `PersistentClient`, two collections; avoids subtle SQLite / Rust binding issues from opening the same path twice.
- Chroma 1.x + legacy on-disk stores — Upstream issues (e.g. chroma-core/chroma#5909) can surface as Rust panics; the development path quarantines the directory and forces a restart so the interpreter is not left with poisoned native bindings after a PyO3 failure.
- Active retrieval without logprobs — Full FLARE is token-conditional; this stack uses a draft + lexical/regex uncertainty gate (`flare_triggers_follow_up`) so behavior stays testable and bounded.
- Evaluating RAG without flaky LLM output in CI — `tests/test_rag_query_suite.py` uses a deterministic stub for chat and a ranking-aware fake embedding layer so structural expectations stay stable.
Point to §8.1: HyDE, multi-query fusion, rerank-only, self-RAG/CRAG vs bounded FLARE-shaped retrieval under local API constraints.
Prioritize a short design doc PR: cross-encoder rerank behind a flag, OpenTelemetry spans on retrieve vs generate, Ragas or similar on a frozen eval JSONL, and a one-page SLO table (p95 latency, error budget). Those are high signal per line of code for staff+ loops.
Python 3.11+ (Dockerfile pins 3.11-slim), FastAPI, Uvicorn, Pydantic Settings, ChromaDB, langchain_core + langchain-text-splitters, Ollama, Next.js 15, React 18, TypeScript, pytest, optional Streamlit.