weberankit/DocAiBackend

🌲 Agentic RAG System

Vectorless • Reasoning-Based • Human-Like Retrieval



📌 Overview

A scalable, agentic document-intelligence system built on top of a locally deployed PageIndex instance. It parses long documents into a hierarchical index and answers questions with reasoning-driven retrieval instead of vector similarity search.


🧠 What This Project Actually Does

| ❌ Most RAG Systems | ✅ This System |
| --- | --- |
| Chunk documents | Builds a structured tree index (like a table of contents) |
| Store embeddings | Uses LLM reasoning to navigate the tree |
| Retrieve by similarity | Fetches only what is needed, when needed |

⚡ Core Idea

Instead of asking:

"Which chunk is similar?"

We ask:

"Where should I look, and why?"


🏗️ System Architecture

1. User Layer

  • Upload document (PDF, long text)
  • Ask questions via chat

2. Validation Layer

  • File validation
  • Security checks
  • Format normalization

3. Async Processing (Queue System)

Uploads go into a queue and are processed in the background:

  • Parsing — handled by locally deployed PageIndex
  • Structuring
  • Index generation
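The queue stage can be sketched as follows. This is a minimal in-memory stand-in for BullMQ/RabbitMQ, not the repo's actual worker code; the job shape and stage comments are illustrative assumptions.

```typescript
// Minimal sketch of the async processing stage. An in-memory array stands in
// for BullMQ/RabbitMQ; job fields and stage names are illustrative.
type UploadJob = { docId: string; s3Key: string };

const queue: UploadJob[] = [];
const indexed: string[] = [];

function enqueueUpload(job: UploadJob): void {
  // The upload endpoint returns immediately; heavy work happens later.
  queue.push(job);
}

function processNext(): boolean {
  const job = queue.shift();
  if (!job) return false;
  // 1. Parsing — would call the locally deployed PageIndex instance
  // 2. Structuring — build the tree from parsed sections
  // 3. Index generation — persist tree nodes to MongoDB
  indexed.push(job.docId);
  return true;
}

enqueueUpload({ docId: "doc-1", s3Key: "raw/doc-1.pdf" });
processNext();
```

A real worker would pull from a durable queue so a crash mid-parse does not lose the upload.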

4. 🌲 PageIndex Tree Generation (Locally Deployed)

Documents are parsed and indexed by a self-hosted PageIndex instance, which converts them into a hierarchical tree structure:

```json
{
  "title": "Section",
  "summary": "...",
  "nodes": []
}
```

Storage:

  • Tree Nodes → MongoDB
  • Raw Pages → S3
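The node shape above can be modeled as a recursive TypeScript interface. The `pageRefs` field linking a node to its raw pages in S3 is an assumption added for illustration; only `title`, `summary`, and `nodes` appear in the JSON above.

```typescript
// Sketch of the tree-node shape shown above.
interface TreeNode {
  title: string;        // section heading
  summary: string;      // summary used to guide navigation decisions
  nodes: TreeNode[];    // child sections (empty for leaf nodes)
  pageRefs?: string[];  // assumed: S3 keys of the raw pages this node covers
}

const root: TreeNode = {
  title: "Section",
  summary: "...",
  nodes: [],
};
```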

5. 🧠 Agentic Query System (LLM Reasoning)

Step 1: Intent Understanding

LLM decides:

  • Is this a simple question?
  • Does it require document reasoning?

Step 2: 🌲 Tree Navigation (Core Innovation)

Instead of vector search:

  • Traverse tree like a human
  • Section → Subsection → Page
  • Use summaries to guide decisions
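The traversal above can be sketched like this. A keyword-overlap scorer stands in for the LLM's "where should I look, and why?" decision; in the real system the model would pick a child after reading the summaries.

```typescript
// Sketch of reasoning-guided tree traversal. overlap() is a stand-in for
// an LLM choosing the most promising child from its summary.
interface NavNode { title: string; summary: string; nodes: NavNode[] }

function overlap(text: string, query: string): number {
  const words = new Set(text.toLowerCase().split(/\W+/));
  return query.toLowerCase().split(/\W+/).filter(w => words.has(w)).length;
}

function navigate(node: NavNode, query: string, path: string[] = []): string[] {
  path.push(node.title);
  // Score children by summary relevance and descend into the best one.
  const scored = node.nodes
    .map(child => ({ child, score: overlap(child.summary, query) }))
    .sort((a, b) => b.score - a.score);
  if (scored.length === 0 || scored[0].score === 0) return path; // stop: leaf or no signal
  return navigate(scored[0].child, query, path);
}

const tree: NavNode = {
  title: "Contract", summary: "full agreement",
  nodes: [
    { title: "Termination", summary: "ending the contract, notice periods", nodes: [] },
    { title: "Payment", summary: "fees, invoicing schedule", nodes: [] },
  ],
};
```

The returned path doubles as the explanation of *why* a page was fetched, which is what makes retrieval traceable.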

Step 3: ⚡ Smart Retrieval Strategy

| Scenario | Action |
| --- | --- |
| Simple query | Answer directly |
| Node-level sufficient | Fetch structured nodes |
| Deep reasoning needed | Fetch raw pages from S3 |
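The table above maps to a three-way branch. In the real system an LLM classifies the scenario; here a discriminated union just makes the branches explicit, and the action names are illustrative.

```typescript
// Sketch of the smart retrieval strategy. An LLM would pick the scenario;
// the action strings are illustrative labels, not the repo's API.
type Scenario = "simple" | "node-sufficient" | "deep-reasoning";

function retrievalAction(scenario: Scenario): string {
  switch (scenario) {
    case "simple":          return "answer-directly";  // no retrieval at all
    case "node-sufficient": return "fetch-nodes";      // structured nodes from MongoDB
    case "deep-reasoning":  return "fetch-raw-pages";  // raw pages from S3
  }
}
```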

Step 4: 🔍 Cross-Node Reasoning

  • Combine multiple nodes
  • Use cross-page context
  • Perform multi-step reasoning

6. Response Generation

Relevant Nodes + Raw Context → LLM → Final Answer
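The final step can be sketched as prompt assembly: the selected nodes and raw pages are concatenated into one context block for the LLM. The prompt layout below is an assumption for illustration, not the repo's actual template.

```typescript
// Sketch of response generation: nodes + raw context → one LLM prompt.
// The section headers and ordering are assumed, not taken from the repo.
interface Retrieved { title: string; content: string }

function buildPrompt(question: string, nodes: Retrieved[], rawPages: string[]): string {
  const nodeCtx = nodes.map(n => `## ${n.title}\n${n.content}`).join("\n\n");
  const pageCtx = rawPages.join("\n\n");
  return [
    "Answer using only the context below.",
    "### Tree nodes", nodeCtx,
    "### Raw pages", pageCtx,
    "### Question", question,
  ].join("\n\n");
}
```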

🚀 Why This Is Different

| ❌ Traditional RAG Problems | ✅ This System Solves That |
| --- | --- |
| Chunking breaks context | No chunking |
| Embeddings miss true relevance | No vector DB; reasoning-based retrieval |
| Hard to explain retrieval | Explainable (traceable path in the tree) |
| Expensive at scale | Human-like navigation fetches only what is needed |

🧩 Key Features

  • 🌲 Tree-Based Indexing
  • 🧠 LLM as Decision Engine
  • ⚡ Adaptive Data Fetching
  • 🔄 Cross-Page Reasoning
  • 📦 Scalable Processing

🛠️ Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Node.js / Express |
| Document Parser | PageIndex (locally deployed) |
| Queue | BullMQ / RabbitMQ |
| Storage | S3 (raw documents), MongoDB (tree index) |
| LLM | OpenAI / local models |
| Architecture | Agentic workflow |

📊 Conceptual Flow

User Upload → Queue → PageIndex (Local) → Tree Index Creation
                                                    ↓
                  User Query → LLM Reasoning → Tree Traversal
                                                    ↓
                                        Fetch Nodes / Raw Pages
                                                    ↓
                                              Final Answer

🔬 Inspiration & Core Dependency

  • PageIndex — used as the local document parsing engine
  • Agentic retrieval systems
  • Human expert document navigation patterns

💡 Positioning

| ❌ This is NOT | ✅ This IS |
| --- | --- |
| A chatbot | 🧠 A reasoning-first retrieval system |
| A simple RAG pipeline | Built for long, complex documents |
