
Releases: mithun50/TreeDex

v0.1.5 — Smart Hierarchy Extraction for Large Documents

22 Mar 12:35


What's New

This release fixes hierarchy extraction accuracy on large (300+ page) documents. Previously, subsections would get flattened to top-level sections — now TreeDex uses multiple strategies to maintain correct depth.

New Features

  • PDF ToC extraction — If the PDF has bookmarks/outline, the tree is built directly from them — zero LLM calls needed, perfect hierarchy every time
  • Font-size heading detection — Analyzes font sizes across the document and injects [H1]/[H2]/[H3] markers so the LLM knows exactly which level each heading belongs to
  • Capped continuation context — For multi-chunk documents, the LLM sees a compact summary (top-level outline + last 30 sections) instead of the full history — 78% fewer tokens wasted on context
  • Orphan repair — If the LLM outputs "2.3.1" without a "2.3" parent, synthetic parents are auto-inserted to maintain a valid tree
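As an illustration of the orphan-repair step, here is a minimal sketch of what a repair pass might do; this is not the library's actual implementation, and the section shape (dicts with an `id` key) is an assumption:

```python
def repair_orphans(sections):
    """Insert synthetic parents so every dotted section ID has its ancestors.

    `sections` is a list of dicts like {"id": "2.3.1", "title": "..."}.
    Returns a new list in which every ancestor ID (e.g. "2.3", "2") exists,
    using a placeholder title for synthesized entries.
    """
    seen = {s["id"] for s in sections}
    repaired = []
    for s in sections:
        parts = s["id"].split(".")
        # Walk ancestors from the top down: "2", then "2.3", ...
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                repaired.append({"id": ancestor, "title": "(synthetic)"})
                seen.add(ancestor)
        repaired.append(s)
    return repaired

fixed = repair_orphans([{"id": "1", "title": "Intro"},
                        {"id": "2.3.1", "title": "Details"}])
```

Synthetic parents get a placeholder title here; the real implementation presumably derives better metadata from surrounding context.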

New Exports

| Function | Python | Node.js |
| --- | --- | --- |
| Extract PDF ToC | `extract_toc(path)` | `await extractToc(path)` |
| ToC → sections | `toc_to_sections(toc)` | `tocToSections(toc)` |
| Repair orphans | `repair_orphans(sections)` | `repairOrphans(sections)` |
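To illustrate the ToC-to-sections conversion, here is a self-contained sketch; the outline shape, a list of `(level, title, page)` tuples, is an assumption, and the real `toc_to_sections` may differ:

```python
# Hypothetical stand-in for an extracted PDF outline: (level, title, page)
# entries. The real extract_toc() return shape may differ.
toc = [(1, "Introduction", 1), (2, "Background", 2), (1, "Methods", 10)]

def toc_to_sections(toc):
    """Convert flat outline entries into nested section dicts (illustrative)."""
    sections = []
    stack = []  # (level, node) for the most recent section at each depth
    for level, title, page in toc:
        node = {"title": title, "page": page, "children": []}
        # Pop back up until the top of the stack is a proper parent.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            sections.append(node)
        stack.append((level, node))
    return sections

tree = toc_to_sections(toc)
```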

How It Helps

Before — LLM sees flat text, guesses hierarchy:

1 Introduction  1.1 Background  Large Language Models...

After — LLM sees font-size markers, knows exact depth:

[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models...
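The font-size detection could look roughly like this sketch; the `[H1]`/`[H2]`/`[H3]` marker format comes from the notes above, but the size analysis here is illustrative only:

```python
from collections import Counter

def annotate_headings(spans):
    """Tag text spans with [H1]/[H2]/[H3] markers based on font size.

    `spans` is a list of (font_size, text) pairs. Body text is assumed to
    be the most common size; larger sizes are bucketed into heading levels,
    largest first. (Illustrative heuristic, not TreeDex's actual analysis.)
    """
    body = Counter(size for size, _ in spans).most_common(1)[0][0]
    heading_sizes = sorted({s for s, _ in spans if s > body}, reverse=True)
    levels = {s: f"[H{i + 1}]" for i, s in enumerate(heading_sizes[:3])}
    return "\n".join(f"{levels[s]} {t}" if s in levels else t
                     for s, t in spans)

text = annotate_headings([(18, "1 Introduction"),
                          (14, "1.1 Background"),
                          (10, "Large Language Models..."),
                          (10, "are trained on...")])
```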

Impact on Large Documents (1M+ tokens)

| LLM Context | Groups | LLM Calls | Continuation Tokens |
| --- | --- | --- | --- |
| 20k (default) | 56 | 56 | 1,336 (was 5,946) |
| 128k (large) | 8 | 8 | minimal |
| PDF with ToC | 0 | 0 | N/A |
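The continuation-token savings come from the capped context described above. A sketch of that idea, with only the 30-section cap taken from the notes (everything else is an assumption):

```python
def continuation_context(sections, tail=30):
    """Compact context for the next chunk: top-level outline + recent tail.

    `sections` is the ordered list of section dicts seen so far. Instead of
    replaying the full history, keep every top-level heading plus only the
    last `tail` sections (the 30-section cap comes from the release notes).
    """
    top_level = [s for s in sections if "." not in s["id"]]
    recent = sections[-tail:]
    # Deduplicate while preserving order: a top-level entry may also be recent.
    seen, context = set(), []
    for s in top_level + recent:
        if s["id"] not in seen:
            seen.add(s["id"])
            context.append(s)
    return context

ctx = continuation_context([{"id": "1"}, {"id": "1.1"}, {"id": "1.2"},
                            {"id": "2"}, {"id": "2.1"}], tail=2)
```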

Full Changelog

  • pdf_parser / pdf-parser — ToC extraction, heading analysis, annotated text builder
  • tree_builder / tree-builder — toc_to_sections(), repair_orphans()
  • prompts — Updated to reference [H1]/[H2]/[H3] heading markers
  • core — ToC shortcut path, heading detection, capped context, orphan repair
  • loaders — Pass-through for detect_headings option
  • Tests — New test coverage for all new functionality
  • README.md — Updated How It Works section and API reference
  • how-treedex-works.svg — Updated pipeline diagram

19 files changed, 1,012 insertions(+), 102 deletions(-)

v0.1.4

01 Mar 12:36


What's New

Web Demo — Chat UI + Caching + Vercel Deploy

  • Chat-style UI: Two-panel layout with sidebar + chat bubbles (user/AI), typing indicator, collapsible sources
  • Main upload zone: File upload front-and-center in the chat area
  • Per-file progress: Upload multiple files with real-time indexing status
  • Disk cache (.cache/): Re-uploading the same file is instant; cached docs auto-restore on restart
  • Auto-retry on 429: Rate-limited Groq calls automatically wait and retry (up to 8 attempts)
  • Conversational fallback: Greetings get natural responses instead of "no info found"
  • Vercel serverless: Full api/ functions with client-side IndexedDB state
  • Mobile responsive: Sidebar hamburger menu, keyboard-aware input bar
  • dotenv config: .env.example with GROQ_API_KEY, PORT, LLM_MODEL, LLM_BASE_URL
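The 429 auto-retry can be pictured as a simple exponential-backoff loop; this is a stand-in sketch, not the demo's actual code (only the 8-attempt cap comes from the notes above):

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 error raised by the LLM client."""

def call_with_retry(fn, max_attempts=8, base_delay=1.0):
    """Retry `fn` on rate-limit errors, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = call_with_retry(flaky, base_delay=0)  # base_delay=0 keeps the demo instant
```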

Install

npm install treedex@0.1.4
pip install treedex==0.1.4

v0.1.2

01 Mar 09:51


New Features

  • Agentic RAG mode — query(question, agentic=True) retrieves relevant sections, then generates a direct LLM answer. Available in both Python and Node.js.
  • Multi-document support — Web demo now supports uploading and querying across multiple documents simultaneously
  • Answer prompt — New answerPrompt / ANSWER_PROMPT template for answer generation
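Conceptually, agentic mode is retrieve-then-answer. A self-contained sketch of that two-step flow, with stand-in callables in place of TreeDex's retriever and LLM backend:

```python
def agentic_query(question, retrieve, generate):
    """Two-step agentic RAG: retrieve relevant sections, then answer from them."""
    sections = retrieve(question)                 # step 1: tree-based retrieval
    context = "\n\n".join(s["text"] for s in sections)
    prompt = (f"Context:\n{context}\n\n"
              f"Question: {question}\nAnswer from the context only.")
    return generate(prompt)                       # step 2: direct LLM answer

# Stand-ins just to exercise the flow; not TreeDex internals.
answer = agentic_query(
    "What is TreeDex?",
    retrieve=lambda q: [{"text": "TreeDex is a vectorless RAG framework."}],
    generate=lambda p: p.splitlines()[1],  # echo the first context line
)
```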

Bug Fixes

  • Fix page range assignment — Sections starting on the same page no longer get inverted ranges (e.g. "pages 10-9"), which caused empty context text
  • Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
  • Fix multer file upload — Preserve original file extension for format detection
  • Increase LLM timeout — Default 5 min (configurable via timeout option on OpenAICompatibleLLM)
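For the page-range fix, the failure mode is easy to reproduce: with end pages computed as `next_start - 1`, two sections starting on the same page produce an inverted range. A hedged sketch of the clamping fix (not the library's code):

```python
def assign_page_ranges(starts, last_page):
    """Give each section a (start, end) page range from its start page.

    `starts` is the ordered list of section start pages. The naive end page
    is next_start - 1, which inverts when two sections start on the same
    page (e.g. 10 and 10 give "pages 10-9"); clamping end to >= start fixes it.
    """
    ranges = []
    for i, start in enumerate(starts):
        nxt = starts[i + 1] if i + 1 < len(starts) else last_page + 1
        ranges.append((start, max(start, nxt - 1)))  # clamp avoids inversion
    return ranges

ranges = assign_page_ranges([1, 10, 10, 15], last_page=20)
```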

Other

  • Shorter retrieval reasoning (one sentence instead of verbose paragraph)
  • Improved answer prompt with explicit instructions to extract facts from context
  • Web demo switched from NVIDIA Kimi K2.5 to Groq for fast inference
  • Updated Colab notebook with agentic mode examples
  • Updated README with agentic RAG documentation

Full Changelog: v0.1.1...v0.1.2

v0.1.1

01 Mar 09:12


Bug Fixes

  • Fix PDF loading — Convert Buffer to Uint8Array for pdfjs-dist compatibility
  • Fix web demo file upload — Preserve file extension after multer upload so autoLoader can detect format
  • Increase LLM request timeout — Default timeout raised from 2 min to 5 min; now configurable via timeout option on OpenAICompatibleLLM

Other

  • Web demo now uses published treedex@0.1.1 from npm instead of local file link

Full Changelog: v0.1.0...v0.1.1

v0.1.0 — Initial Release

01 Mar 02:54


TreeDex v0.1.0

Tree-based, vectorless document RAG framework.

Highlights

  • Tree-based indexing — preserves document hierarchy (chapters, sections, subsections)
  • 18+ LLM backends — Gemini, OpenAI, Claude, Groq, Together AI, Fireworks, DeepSeek, Ollama, and any OpenAI-compatible endpoint
  • Zero vector dependencies — no embeddings, no vector DB, just JSON
  • Exact page attribution — every answer traces back to source pages
  • 4 document formats — PDF, TXT, HTML, DOCX

Install

pip install treedex

Quick Start

from treedex import TreeDex, GeminiLLM

llm = GeminiLLM(api_key="YOUR_KEY")
index = TreeDex.from_file("document.pdf", llm=llm)
result = index.query("What is the main argument?")
print(result.context)
print(result.pages_str)

What's Included

  • treedex/ — Core library (pdf_parser, tree_builder, loaders, llm_backends, prompts, core)
  • examples/ — Quick start examples + sample index
  • tests/ — Full test suite
  • benchmarks/ — TreeDex vs ChromaDB vs Naive comparison (auto-run in CI)
  • assets/ — SVG charts auto-generated from real benchmarks