Releases: hamii31/Cortex

Cortex 3.1.0 - Structure-aware indexing, hierarchical retrieval, separated library

@hamii31 hamii31 released this 11 May 07:24
e4ed3db

This release is the largest quality improvement to Cortex's RAG pipeline since 1.0. Two things changed: indexing now reads each document's authored structure, and retrieval now surfaces section-level context alongside specific excerpts. The result is noticeably better answers on well-structured documents — books, textbooks, dissertations, academic papers.

Headline changes

Structure-aware indexing. At index time, Cortex now reads each document's authored chapter and section layout. Three strategies are tried in order:

  1. PDF outline (bookmarks tree) — most modern PDF books and papers have this. Used directly when available.
  2. Table-of-contents page parsing — for PDFs without outlines, common TOC formats are parsed from text.
  3. Fixed-size fallback — for documents with no recoverable structure, sections form by grouping ~15 chunks at page boundaries (the original 1.1 behavior).

The log file reports which strategy was used for each book. Sections from real structural cues carry the chapter title, which appears in the section-context block at query time.
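The try-in-order behavior described above can be sketched as a simple fallback chain. This is a minimal illustration, not Cortex's actual indexer: the `doc` dictionary layout, the function names, and the dotted-leader TOC regex are all assumptions made for the example, and the fixed-size fallback here groups exactly 15 chunks without the page-boundary alignment the real fallback performs.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    title: Optional[str]  # None for fixed-size sections (no chapter title)
    start_page: int
    end_page: int


def sections_from_entries(entries, page_count):
    """Turn sorted (title, start_page) pairs into page-ranged sections."""
    sections = []
    for i, (title, start) in enumerate(entries):
        # A section ends on the page before the next one starts.
        end = entries[i + 1][1] - 1 if i + 1 < len(entries) else page_count
        sections.append(Section(title, start, max(start, end)))
    return sections


def from_outline(doc):
    # Strategy 1: hypothetical "outline" key holding (title, start_page) pairs.
    return sections_from_entries(doc.get("outline") or [], doc.get("page_count", 0))


# Hypothetical TOC pattern: "Title ..... 95" (dotted leader, then page number).
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*\.{2,}\s*(?P<page>\d+)\s*$")


def from_toc_text(doc):
    # Strategy 2: parse TOC lines out of plain text.
    entries = []
    for line in doc.get("toc_text", "").splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match["title"], int(match["page"])))
    return sections_from_entries(entries, doc.get("page_count", 0))


def fixed_size(doc, chunks_per_section=15):
    # Strategy 3: group consecutive chunks; "chunk_pages" is each chunk's page.
    pages = doc.get("chunk_pages") or []
    return [
        Section(None, pages[i], pages[min(i + chunks_per_section, len(pages)) - 1])
        for i in range(0, len(pages), chunks_per_section)
    ]


def build_sections(doc):
    """Return (strategy_name, sections) from the first strategy that yields any."""
    for name, strategy in (("outline", from_outline),
                           ("toc", from_toc_text),
                           ("fixed", fixed_size)):
        sections = strategy(doc)
        if sections:
            return name, sections
    return "fixed", []
```

Returning the strategy name alongside the sections mirrors the per-book log entry mentioned above.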

Hierarchical retrieval context. Each section gets an LLM-generated summary at index time. At query time, when a retrieved chunk belongs to a section with a summary, the summary is included in the prompt above the verbatim excerpts. The model sees:

[Section context — Neuroscience, 'Chapter 5: Synaptic Plasticity', pp. 95-118]
LLM-generated summary of what this chapter is about and what it argues...

[Excerpt 1] Neuroscience, p. 102 (relevance 0.82)
verbatim chunk text...

[Excerpt 2] Neuroscience, p. 107 (relevance 0.79)
verbatim chunk text...

This addresses the most common chunk-retrieval failure mode: technically correct answers that miss the larger argument the source was making. The model now has both the detail (chunks) and the context (section it's situated in).
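To make the layout concrete, here is a rough sketch of how such a context block could be assembled. The `Chunk` and `SectionSummary` types and the `build_context` function are hypothetical names for illustration; only the output format follows the example shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SectionSummary:
    book: str
    title: str
    start_page: int
    end_page: int
    summary: str


@dataclass
class Chunk:
    book: str
    page: int
    relevance: float
    text: str
    section_id: Optional[str] = None  # None when the chunk has no known section


def build_context(chunks, summaries):
    """Emit one section-context block per summarized section, then the excerpts."""
    parts = []
    seen = set()
    for chunk in chunks:
        summary = summaries.get(chunk.section_id)
        if summary is not None and chunk.section_id not in seen:
            seen.add(chunk.section_id)
            parts.append(
                f"[Section context — {summary.book}, '{summary.title}', "
                f"pp. {summary.start_page}-{summary.end_page}]\n{summary.summary}"
            )
    for i, chunk in enumerate(chunks, start=1):
        parts.append(
            f"[Excerpt {i}] {chunk.book}, p. {chunk.page} "
            f"(relevance {chunk.relevance:.2f})\n{chunk.text}"
        )
    return "\n\n".join(parts)
```

Chunks whose section has no summary still appear as excerpts; they just contribute no section-context block.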

Cortex works best with well-structured documents. Books, textbooks, dissertations, and peer-reviewed papers benefit most from these changes. Unstructured material (raw text dumps, single-paragraph notes, scanned PDFs without OCR) still works but loses the depth advantage — Cortex falls back to fixed-size sections in those cases.

Library separation from SmartReader

SmartReader cache auto-detection has been removed. Cortex no longer surfaces books from SmartReader's cache directory by default.

Why: SmartReader caches were produced before structure-aware indexing existed and lack the hierarchical section summaries this release produces. Surfacing them by default created confusing duplicates with Cortex-native caches of the same books, with strictly inferior retrieval context.

Migration: re-index any books you care about in Cortex by dragging the original files into the window. The new caches will use structure-aware sections.

Opt-in compatibility: if you still want SmartReader cache access, set CORTEX_SMARTREADER_CACHE=<path> before launching. SmartReader-indexed books then appear in the library tagged sr, read-only.
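If you launch Cortex from a script, the opt-in could look roughly like this. Only the variable name `CORTEX_SMARTREADER_CACHE` and `Cortex.exe` come from these notes; the helper name and the example path are hypothetical.

```python
import os
from pathlib import Path


def env_with_smartreader_cache(cache_dir, base_env=None):
    """Return a copy of the environment with the SmartReader opt-in set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CORTEX_SMARTREADER_CACHE"] = str(Path(cache_dir))
    return env


# Usage (hypothetical path; adjust to wherever your SmartReader cache lives):
#   import subprocess
#   subprocess.Popen(["Cortex.exe"], env=env_with_smartreader_cache(r"D:\SmartReader\cache"))
```

The variable has to be set in the environment of the Cortex process itself, which is why the sketch passes `env=` to the launcher rather than setting it afterwards.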

This is a mild breaking change: if your previous Cortex installation was auto-discovering SmartReader books, those entries will disappear from the library after upgrading. Nothing is deleted from disk — Cortex just stops looking at that directory unless you opt back in.

Smaller improvements

  • The retroactive /api/library/{book_id}/summarize endpoint can generate hierarchical summaries for caches indexed before this release (without re-embedding). Note: it can't recover document structure since the original PDF isn't kept, so the retroactive summaries will use fixed-size sections.
  • Section-aware deletion: removing a book through the UI now also deletes its .summaries.json companion file.
  • Library entries now expose a has_summaries field via /api/library so external tools can tell which books have hierarchical context.
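An external tool could combine the two endpoints above along these lines. This is a sketch under several assumptions that these notes do not confirm: that the server listens on `localhost:8000` (port 8000 is mentioned in the 3.0 notes), that /api/library returns a JSON list of objects with a `book_id` key, and that the summarize endpoint expects a POST.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumed host and port


def fetch_library():
    """GET /api/library and decode the JSON body (response shape is assumed)."""
    with urlopen(f"{BASE_URL}/api/library") as resp:
        return json.load(resp)


def books_missing_summaries(library):
    """Return the ids of entries whose has_summaries field is absent or false."""
    return [b["book_id"] for b in library if not b.get("has_summaries", False)]


def summarize_request(book_id):
    """Build the request for the retroactive summarize endpoint (POST is assumed)."""
    url = f"{BASE_URL}/api/library/{quote(str(book_id), safe='')}/summarize"
    return Request(url, method="POST")


# Usage against a running Cortex instance:
#   for book_id in books_missing_summaries(fetch_library()):
#       urlopen(summarize_request(book_id)).close()
```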

Installation

  1. Install [Ollama](https://ollama.com).
  2. Pull the embedder and at least one model tier:
    ollama pull nomic-embed-text
    ollama pull qwen2.5:7b      # Lite tier
    ollama pull qwen2.5:14b     # Standard tier
    ollama pull hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Qwen2.5-32B-Instruct-Q4_K_L.gguf  # Research tier
    
  3. Download Cortex.exe below and double-click.

The browser opens to the chat UI automatically. Click the model name in the top-left to pick your tier.

See the [README](https://github.com/hamii31/Cortex) for full documentation.

Upgrading from 1.0 / 1.1

  • Conversations and existing caches are preserved. Books indexed with 1.0/1.1 remain queryable. They just won't have section summaries until you re-index or call the retroactive summarize endpoint.
  • SmartReader auto-detection is gone. Set CORTEX_SMARTREADER_CACHE to restore the old behavior if needed.
  • No database migration required. The cache format is forward-compatible — old .pkl files load unchanged, the new .summaries.json files are additive.

Known limitations

  • The TOC-page parser uses regex patterns that work on common academic and trade-book formats but can miss unusual layouts. The PDF outline path is much more reliable when it's available.
  • Retroactive summarization of old caches uses fixed-size sections (no chapter titles), because the original PDF is no longer accessible. Re-indexing is the only way to get chapter-aware sections for previously-indexed books.
  • EPUB and DOCX use their format-native structure (chapters / paragraph groups) rather than outline extraction. This works fine but doesn't produce chapter titles in citations.
  • On hardware below the Research tier's VRAM requirement, the 32B Q4_K_L runs at 2–5 tokens/sec via CPU offload. Normal for partial-offload mode.

Acknowledgments

Thanks to everyone who reported the multi-source retrieval issue, the model-detection bug, and the SmartReader cache confusion that motivated the library separation.

Cortex 3.0 — Unified executable, runtime model switching, reasoning modes

@hamii31 hamii31 released this 09 May 09:34
d8913eb

Recommended release. This supersedes the per-tier (7B/14B/32B) variants — one executable now handles all three model tiers with runtime switching. Older per-tier releases will remain available but won't receive further updates.

What's new

One executable, three model tiers. Pick whichever models match your hardware via the sidebar dropdown and switch between them at any time. Your selection persists across launches.

| Tier | Model | Min VRAM |
| -- | -- | -- |
| Lite | qwen2.5:7b | 8 GB |
| Standard | qwen2.5:14b | 12 GB |
| Research | qwen2.5:32b Q4_K_L | 24 GB (or 32+ GB system RAM with CPU offload) |

Install only the tiers you have hardware for. The dropdown shows install status so you can see what's ready and what needs ollama pull.
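As a quick way to read the minimum-VRAM figures, the helper below picks the largest tier that fits a given amount of VRAM. It is illustrative only: the function name is made up, and it ignores the CPU-offload path for the Research tier.

```python
# (tier, model, minimum VRAM in GB), largest first; values from the tier list above.
TIERS = [
    ("Research", "qwen2.5:32b Q4_K_L", 24),
    ("Standard", "qwen2.5:14b", 12),
    ("Lite", "qwen2.5:7b", 8),
]


def best_tier(vram_gb):
    """Return the largest tier whose minimum VRAM fits, or None if none does."""
    for tier, _model, min_vram in TIERS:
        if vram_gb >= min_vram:
            return tier
    return None
```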

Reasoning modes. Five structured prompt scaffolds you can apply per-message via pills above the input:

  • Default — direct answer, no scaffold
  • Compare — forces a markdown comparison table before prose. For "A vs B" questions.
  • Process — forces explicit state/step layout. For "how does X work" questions. (Standard tier and up.)
  • Cross-source — forces a cross-reference table across attached documents. (Standard tier and up.)
  • Critique — forces structured strengths/weaknesses analysis. For reviewing plans, papers, code. (Standard tier and up.)

Modes that require strong instruction-following are hidden on the Lite tier because 7B models can't reliably fill out the scaffolds. When multiple documents are attached and you're on Standard or Research tier, Cortex auto-promotes Default queries to Cross-source mode.
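The tier gating and auto-promotion rules can be summarized in a few lines. This is a sketch of the rules as described, not the shipped logic; the lowercase identifiers and the fallback to Default when a hidden mode is requested are assumptions.

```python
ALL_MODES = {"default", "compare", "process", "cross-source", "critique"}
LITE_MODES = {"default", "compare"}  # the rest are Standard tier and up


def available_modes(tier):
    """Modes shown for a tier ('lite', 'standard', or 'research')."""
    return LITE_MODES if tier == "lite" else ALL_MODES


def resolve_mode(requested, tier, attached_docs):
    """Apply tier gating, then auto-promote Default to Cross-source."""
    if requested not in available_modes(tier):
        return "default"  # assumption: hidden modes fall back to Default
    if requested == "default" and tier != "lite" and attached_docs > 1:
        return "cross-source"
    return requested
```

Explicitly chosen modes such as Compare are left alone; only Default queries are promoted.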

Multi-source RAG fix. Previously, retrieval pulled the global top-K chunks regardless of source, which meant a single dominant book could starve other attached sources. Now retrieval reserves slots per book when multiple are attached, and the prompt explicitly instructs the model to use all sources. If you tried multi-document queries on earlier versions and noticed the model only citing one book, this is fixed.
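One way to reserve slots per book is a two-pass selection: first guarantee each attached book its best chunk, then fill the remaining slots by global rank. The sketch below illustrates that idea; the function name, the tuple layout, and the `min_per_book` parameter are assumptions, and the actual slot allocation in Cortex may differ.

```python
from collections import defaultdict


def select_chunks(scored, k, min_per_book=1):
    """Pick k chunks from (book, chunk_id, score) tuples with per-book reservation."""
    ranked = sorted(scored, key=lambda c: c[2], reverse=True)
    picked = []
    taken = defaultdict(int)

    # Pass 1: reserve up to min_per_book slots for each book, best chunks first.
    for chunk in ranked:
        if len(picked) >= k:
            break
        if taken[chunk[0]] < min_per_book:
            picked.append(chunk)
            taken[chunk[0]] += 1

    # Pass 2: fill whatever is left purely by global relevance.
    for chunk in ranked:
        if len(picked) >= k:
            break
        if chunk not in picked:
            picked.append(chunk)

    return sorted(picked, key=lambda c: c[2], reverse=True)
```

With a plain global top-K, a book whose chunks all score below another book's would contribute nothing; here it keeps at least one excerpt as long as `k` allows.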

Cleaner shutdown. The browser sends a heartbeat every 10 seconds while the tab is open and an explicit shutdown signal when the tab closes. The server exits ~30 seconds after the tab disappears, freeing port 8000 without manual intervention.
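The timeout side of this can be modelled with a small watchdog. This is a minimal sketch, assuming a 30-second threshold taken from the figure above; the class and method names are hypothetical, and the explicit shutdown signal is deliberately not modelled.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat and reports when the client has gone quiet."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = clock()

    def beat(self):
        """Record a heartbeat (the browser sends one every 10 seconds)."""
        self.last_seen = self.clock()

    def should_exit(self):
        """True once no heartbeat has arrived for longer than the timeout."""
        return (self.clock() - self.last_seen) > self.timeout
```

A timeout of three heartbeat intervals tolerates a couple of dropped or delayed beats before the server gives up.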

Bug fixes

  • Empty model names from newer Ollama Python client versions no longer break model detection (the client started returning model instead of name in the response).
  • SmartReader caches with __slots__ mismatches now load correctly.
  • Index titles no longer carry temporary UUID prefixes from upload filenames.
  • Build script kills any running Cortex process before cleaning, so iterative builds don't hit "file in use" errors.
  • Launcher tolerates --noconsole builds where stdout/stderr are None — uvicorn no longer crashes on isatty().
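Two of the fixes above follow common defensive patterns, sketched here. Neither function is Cortex's actual code: `model_name` and `ensure_std_streams` are hypothetical names, and the entry shapes are assumptions about what a model-listing response might contain.

```python
import os
import sys


def model_name(entry):
    """Read a model identifier from either the 'model' or the 'name' field."""
    for key in ("model", "name"):
        if isinstance(entry, dict):
            value = entry.get(key)
        else:
            value = getattr(entry, key, None)
        if value:  # skip missing or empty values
            return value
    return None


def ensure_std_streams():
    """Replace missing stdout/stderr (windowed builds) with a null sink."""
    if sys.stdout is None:
        sys.stdout = open(os.devnull, "w")
    if sys.stderr is None:
        sys.stderr = open(os.devnull, "w")
```

Calling `ensure_std_streams()` before starting the server means any later `isatty()` call lands on a real file object instead of `None`.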

Installation

  1. Install Ollama.
  2. Pull the embedder and at least one model tier:
   ollama pull nomic-embed-text
   ollama pull qwen2.5:7b      # Lite tier
   ollama pull qwen2.5:14b     # Standard tier
   ollama pull hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Qwen2.5-32B-Instruct-Q4_K_L.gguf  # Research tier
  3. Download Cortex.exe below and double-click.

The browser opens to the chat UI automatically. Click the model name in the top-left to pick your tier.

Cortex 32B (Q4_K_L) — Highest-fidelity 32B variant, research-tier build

@hamii31 hamii31 released this 07 May 12:02
6b33357

The highest-fidelity 4-bit quantization of qwen2.5:32b. Q4_K_L preserves more precision in critical layers (attention, output projections) than Q4_K_M, giving the most accurate answers of any 4-bit Cortex build. About 20 GB on disk, ~22 GB VRAM in use.

What's the difference vs Q4_K_M?

Q4_K_L uses 6-bit quantization for attention and output layers (instead of 4-bit), at the cost of ~1 GB more size. The improvement over Q4_K_M is small — most users won't notice — but on hard technical questions where precision matters (math derivations, long-form code, careful synthesis across many sources), it can be the difference between right and almost-right.

If your hardware can handle it, this is the most accurate Cortex build available. If you don't know whether you need this, you don't — use Cortex 14B for daily work or Cortex 32B for research.

Hardware

Same as Q4_K_M but with ~2 GB more VRAM/RAM needed.

  • GPU (full speed): 24 GB VRAM (will fill it — close other GPU users first)
  • GPU (partial offload): 8–16 GB VRAM with at least 32 GB system RAM, expect 2–4 tok/s
  • Disk: ~20 GB for the model

Install

  1. Install Ollama.
  2. Pull the model from HuggingFace:
    ollama pull hf.co/bartowski/Qwen2.5-32B-Instruct-GGUF:Q4_K_L
    ollama pull nomic-embed-text
  3. Download Cortex.exe below and double-click.

Best fit: researchers running technical/academic work on 24 GB+ GPUs who want the strongest quality available without going to 70B-class models.

Cortex 14B — Best balance of quality and speed (recommended)

@hamii31 hamii31 released this 07 May 09:50
f8d19f5

The recommended Cortex build for most users. Ships with qwen2.5:14b (Q4) — about 9 GB on disk, ~10 GB VRAM in use. Noticeably stronger reasoning than the 7B without the wait time of the 32B.

Hardware

  • GPU: 12 GB VRAM minimum (RTX 3060 12GB, 4070, 5070 Ti, etc.)
  • CPU-offloaded fallback: 24 GB system RAM, expect 5–8 tok/s
  • Disk: ~9 GB for the model

What to expect

Fast generation on 12 GB+ GPUs (~20–25 tok/s). Quality is the sweet spot for local LLMs in 2026 — handles most academic and technical questions well, both with and without attached sources. Reasoning, code review, and synthesis are all meaningfully better than the 7B at the cost of about half the speed.

Best fit: most users, most workflows. If you're not sure which build to pick, this is the right answer.

Install

  1. Install Ollama.
  2. Pull the models:
    ollama pull qwen2.5:14b
    ollama pull nomic-embed-text
  3. Download Cortex.exe below and double-click.

The browser opens to the chat UI automatically. See the README for full usage and configuration.

Cortex 7B — Lightweight, fast, runs on most laptops

@hamii31 hamii31 released this 07 May 09:26
f8d19f5

The lightweight Cortex build. Ships with qwen2.5:7b (Q4) as the default chat model — about 4.7 GB on disk, ~5 GB VRAM in use.

Hardware

  • GPU: 8 GB VRAM minimum (RTX 3060, 4060, 5060, similar laptop GPUs)
  • CPU-only fallback: 16 GB system RAM
  • Disk: ~5 GB for the model

What to expect

Fast generation (~30 tok/s on a typical 8 GB GPU). Strong on RAG-grounded queries against your indexed library — when you attach a textbook or paper and ask questions, the retrieved excerpts do most of the work and the 7B's job is mostly synthesis. Less strong on unattached chat where it has to reason from scratch — for hard pure-reasoning questions, consider the 14B or 32B builds instead.

Best fit: students and researchers on laptops, anyone wanting a responsive chat experience, daily driver for quick lookups against your library.

Install

  1. Install Ollama.
  2. Pull the models:
    ollama pull qwen2.5:7b
    ollama pull nomic-embed-text
  3. Download Cortex.exe below and double-click.

The browser opens to the chat UI automatically. See the README for full usage and configuration.