Releases · mithun50/TreeDex
v0.1.5 — Smart Hierarchy Extraction for Large Documents
What's New
This release fixes hierarchy extraction accuracy on large (300+ page) documents. Previously, subsections would get flattened to top-level sections — now TreeDex uses multiple strategies to maintain correct depth.
New Features
- PDF ToC extraction — If the PDF has bookmarks/outline, the tree is built directly from them — zero LLM calls needed, perfect hierarchy every time
- Font-size heading detection — Analyzes font sizes across the document and injects `[H1]`/`[H2]`/`[H3]` markers so the LLM knows exactly which level each heading belongs to
- Capped continuation context — For multi-chunk documents, the LLM sees a compact summary (top-level outline + last 30 sections) instead of the full history — 78% fewer tokens wasted on context
- Orphan repair — If the LLM outputs `"2.3.1"` without a `"2.3"` parent, synthetic parents are auto-inserted to maintain a valid tree
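The orphan-repair idea can be sketched in a few lines. This is an illustrative re-implementation, not TreeDex's actual `repair_orphans` code, and the section-dict shape (`number`/`title` keys) is an assumption:

```python
def repair_orphans(sections):
    """Insert synthetic parents so every dotted section number has its
    ancestors present. Illustrative sketch only."""
    seen = {s["number"] for s in sections}
    repaired = []
    for s in sections:
        parts = s["number"].split(".")
        # Walk the ancestor chain ("2.3.1" -> "2" then "2.3") and add a
        # placeholder for any ancestor the LLM never emitted.
        for i in range(1, len(parts)):
            ancestor = ".".join(parts[:i])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append({"number": ancestor, "title": "(synthetic)"})
        repaired.append(s)
    return repaired
```

With this rule, an output containing `"2.3.1"` but no `"2.3"` or `"2"` gains both synthetic ancestors, so the tree stays well-formed.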
New Exports
| Function | Python | Node.js |
|---|---|---|
| Extract PDF ToC | `extract_toc(path)` | `await extractToc(path)` |
| ToC → sections | `toc_to_sections(toc)` | `tocToSections(toc)` |
| Repair orphans | `repair_orphans(sections)` | `repairOrphans(sections)` |
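The ToC-to-sections step is conceptually simple: each bookmark already carries its depth, so no LLM call is needed. Here is a hedged sketch of the idea; the real `toc_to_sections` signature and return shape may differ, and the `(level, title, start_page)` tuple format is an assumption:

```python
def toc_to_sections(toc, last_page):
    """Turn PDF outline entries into sections with page ranges.

    `toc` is a list of (level, title, start_page) tuples as a PDF
    outline would provide them. Illustrative sketch only."""
    sections = []
    for i, (level, title, start) in enumerate(toc):
        # A section ends where the next one begins, but never before
        # it starts (two bookmarks can share a page).
        end = toc[i + 1][2] - 1 if i + 1 < len(toc) else last_page
        sections.append({"title": title, "level": level,
                         "pages": (start, max(start, end))})
    return sections
```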
How It Helps
Before — LLM sees flat text and has to guess the hierarchy:

```
1 Introduction 1.1 Background Large Language Models...
```

After — LLM sees font-size markers and knows the exact depth:

```
[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models...
```
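A minimal sketch of font-size heading detection, assuming the body text is the most frequent font size and larger sizes map to heading levels. This is not TreeDex's actual detector; the `(font_size, text)` span format is an assumption for illustration:

```python
from collections import Counter

def annotate_headings(spans):
    """Tag text spans with [H1]/[H2]/[H3] markers by font size.

    Body text is assumed to be the most frequent size; up to three
    larger sizes become heading levels. Illustrative sketch only."""
    body = Counter(size for size, _ in spans).most_common(1)[0][0]
    heading_sizes = sorted({s for s, _ in spans if s > body}, reverse=True)[:3]
    level_of = {s: i + 1 for i, s in enumerate(heading_sizes)}
    out = []
    for size, text in spans:
        marker = f"[H{level_of[size]}] " if size in level_of else ""
        out.append(marker + text)
    return "\n".join(out)
```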
Impact on Large Documents (1M+ tokens)
| LLM Context | Groups | LLM Calls | Continuation Tokens |
|---|---|---|---|
| 20k (default) | 56 | 56 | 1,336 (was 5,946) |
| 128k (large) | 8 | 8 | minimal |
| PDF with ToC | 0 | 0 | N/A |
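The continuation-token savings come from capping what each follow-up LLM call sees. A hedged sketch of the idea (the function name and section-dict shape are illustrative, not TreeDex's API):

```python
def continuation_context(sections, tail=30):
    """Build a compact context for the next chunk: the top-level
    outline plus only the last `tail` sections, instead of the full
    history. Illustrative sketch of the capping strategy."""
    top_level = [s["number"] + " " + s["title"]
                 for s in sections if "." not in s["number"]]
    recent = [s["number"] + " " + s["title"] for s in sections[-tail:]]
    return {"outline": top_level, "recent": recent}
```

Because `recent` is bounded at 30 entries, the context stays roughly constant per chunk rather than growing with the document.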
Full Changelog
- `pdf_parser`/`pdf-parser` — ToC extraction, heading analysis, annotated text builder
- `tree_builder`/`tree-builder` — `toc_to_sections()`, `repair_orphans()`
- `prompts` — Updated to reference `[H1]`/`[H2]`/`[H3]` heading markers
- `core` — ToC shortcut path, heading detection, capped context, orphan repair
- `loaders` — Pass-through for `detect_headings` option
- Tests — New test coverage for all new functionality
- `README.md` — Updated How It Works section and API reference
- `how-treedex-works.svg` — Updated pipeline diagram
19 files changed, 1,012 insertions(+), 102 deletions(-)
v0.1.4
What's New
Web Demo — Chat UI + Caching + Vercel Deploy
- Chat-style UI: Two-panel layout with sidebar + chat bubbles (user/AI), typing indicator, collapsible sources
- Main upload zone: File upload front-and-center in the chat area
- Per-file progress: Upload multiple files with real-time indexing status
- Disk cache (`.cache/`): Re-uploading the same file is instant; cached docs auto-restore on restart
- Auto-retry on 429: Rate-limited Groq calls automatically wait and retry (up to 8 attempts)
- Conversational fallback: Greetings get natural responses instead of "no info found"
- Vercel serverless: Full `api/` functions with client-side IndexedDB state
- Mobile responsive: Sidebar hamburger menu, keyboard-aware input bar
- dotenv config: `.env.example` with `GROQ_API_KEY`, `PORT`, `LLM_MODEL`, `LLM_BASE_URL`
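The auto-retry-on-429 behaviour follows a standard exponential-backoff pattern. A hedged sketch, not the demo's actual code; the function name, error type, and delays are illustrative:

```python
import time

def call_with_retry(fn, max_attempts=8, base_delay=1.0):
    """Retry a rate-limited call with exponential backoff.

    Sketch of the retry-on-429 idea: wait 1s, 2s, 4s, ... between
    attempts, up to `max_attempts` tries, then re-raise."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError as err:  # stand-in for an HTTP 429 error
            if "429" not in str(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```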
Install
```
npm install treedex@0.1.4
pip install treedex==0.1.4
```

v0.1.2
New Features
- Agentic RAG mode — `query(question, agentic=True)` retrieves relevant sections then generates a direct LLM answer. Available in both Python and Node.js.
- Multi-document support — Web demo now supports uploading and querying across multiple documents simultaneously
- Answer prompt — New `answerPrompt`/`ANSWER_PROMPT` template for answer generation
Bug Fixes
- Fix page range assignment — Sections starting on the same page no longer get inverted ranges (e.g. "pages 10-9"), which caused empty context text
- Fix PDF loading — Convert `Buffer` to `Uint8Array` for `pdfjs-dist` compatibility
- Fix multer file upload — Preserve original file extension for format detection
- Increase LLM timeout — Default 5 min (configurable via `timeout` option on `OpenAICompatibleLLM`)
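The page-range inversion fix can be illustrated with a small sketch. The naive rule `end = next_start - 1` produces "pages 10-9" when two sections start on the same page; clamping with `max()` keeps every range valid. Function name and shapes here are illustrative, not the library's actual code:

```python
def assign_page_ranges(starts, last_page):
    """Assign (start, end) page ranges from section start pages.

    Clamping end with max(start, ...) prevents inverted ranges when
    consecutive sections begin on the same page. Illustrative sketch."""
    ranges = []
    for i, start in enumerate(starts):
        nxt = starts[i + 1] if i + 1 < len(starts) else last_page + 1
        ranges.append((start, max(start, nxt - 1)))
    return ranges
```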
Other
- Shorter retrieval reasoning (one sentence instead of verbose paragraph)
- Improved answer prompt with explicit instructions to extract facts from context
- Web demo switched from NVIDIA Kimi K2.5 to Groq for fast inference
- Updated Colab notebook with agentic mode examples
- Updated README with agentic RAG documentation
Full Changelog: v0.1.1...v0.1.2
v0.1.1
Bug Fixes
- Fix PDF loading — Convert `Buffer` to `Uint8Array` for `pdfjs-dist` compatibility
- Fix web demo file upload — Preserve file extension after multer upload so `autoLoader` can detect format
- Increase LLM request timeout — Default timeout raised from 2 min to 5 min; now configurable via `timeout` option on `OpenAICompatibleLLM`
Other
- Web demo now uses published `treedex@0.1.1` from npm instead of local file link
Full Changelog: v0.1.0...v0.1.1
v0.1.0 — Initial Release
TreeDex v0.1.0
Tree-based, vectorless document RAG framework.
Highlights
- Tree-based indexing — preserves document hierarchy (chapters, sections, subsections)
- 18+ LLM backends — Gemini, OpenAI, Claude, Groq, Together AI, Fireworks, DeepSeek, Ollama, and any OpenAI-compatible endpoint
- Zero vector dependencies — no embeddings, no vector DB, just JSON
- Exact page attribution — every answer traces back to source pages
- 4 document formats — PDF, TXT, HTML, DOCX
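The "vectorless" retrieval idea behind these highlights can be sketched as a depth-first walk of a JSON section tree, where an LLM judgement (stood in for here by a plain predicate) decides which branches to descend. The node schema and function are hypothetical, for illustration only:

```python
def retrieve(node, is_relevant, results):
    """Walk a JSON section tree, collecting relevant leaf sections.

    `is_relevant` stands in for an LLM relevance judgement on a
    section title; the node schema here is assumed, not TreeDex's."""
    if is_relevant(node["title"]):
        if node.get("children"):
            for child in node["children"]:
                retrieve(child, is_relevant, results)
        else:
            # Leaf sections carry page numbers, giving exact attribution.
            results.append(node)
    return results
```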
Install
```
pip install treedex
```

Quick Start

```python
from treedex import TreeDex, GeminiLLM

llm = GeminiLLM(api_key="YOUR_KEY")
index = TreeDex.from_file("document.pdf", llm=llm)

result = index.query("What is the main argument?")
print(result.context)
print(result.pages_str)
```

What's Included
- `treedex/` — Core library (pdf_parser, tree_builder, loaders, llm_backends, prompts, core)
- `examples/` — Quick start examples + sample index
- `tests/` — Full test suite
- `benchmarks/` — TreeDex vs ChromaDB vs Naive comparison (auto-run in CI)
- `assets/` — SVG charts auto-generated from real benchmarks