param20h/PDF-Assistant-RAG

---
title: Document AI Analyst
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
license: mit
short_description: Enterprise Agentic RAG - upload PDFs and chat with AI
---

██████╗ ██████╗ ███████╗     █████╗ ███████╗███████╗██╗███████╗████████╗ █████╗ ███╗   ██╗████████╗
██╔══██╗██╔══██╗██╔════╝    ██╔══██╗██╔════╝██╔════╝██║██╔════╝╚══██╔══╝██╔══██╗████╗  ██║╚══██╔══╝
██████╔╝██║  ██║█████╗      ███████║███████╗███████╗██║███████╗   ██║   ███████║██╔██╗ ██║   ██║
██╔═══╝ ██║  ██║██╔══╝      ██╔══██║╚════██║╚════██║██║╚════██║   ██║   ██╔══██║██║╚██╗██║   ██║
██║     ██████╔╝██║         ██║  ██║███████║███████║██║███████║   ██║   ██║  ██║██║ ╚████║   ██║
╚═╝     ╚═════╝ ╚═╝         ╚═╝  ╚═╝╚══════╝╚══════╝╚═╝╚══════╝   ╚═╝   ╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝

                        ██████╗  █████╗  ██████╗
                        ██╔══██╗██╔══██╗██╔════╝
                        ██████╔╝███████║██║  ███╗
                        ██╔══██╗██╔══██║██║   ██║
                        ██║  ██║██║  ██║╚██████╔╝
                        ╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝

Enterprise Agentic Retrieval-Augmented Generation System


FastAPI Next.js Python PostgreSQL ChromaDB HuggingFace Celery Docker License: MIT


Upload · Embed · Retrieve · Chat — A production-grade AI document assistant built end-to-end with an agentic RAG pipeline, streaming responses, and per-user data isolation.


## 🌟 GirlScript Summer of Code 2026

This project is an official participant in GirlScript Summer of Code 2026 (GSSoC'26) and welcomes contributions from the community.


Features · Tech Stack · Getting Started · Architecture · RAG Pipeline · API Reference · Deployment · Contributing


🤝 Contributors

Thanks to all the amazing people who have contributed to PDF-Assistant-RAG! 🎉


🌟 Want to join them? Check out CONTRIBUTING.md for contribution guidelines and look for good first issues to get started!



🌟 Overview

PDF-Assistant-RAG is a complete, production-ready AI document assistant that lets users upload complex PDFs, financial reports, legal contracts, and research papers — then chat with an AI that provides accurate, cited answers powered by a multi-stage Retrieval-Augmented Generation pipeline.

The system uses hybrid search (vector + BM25) with Reciprocal Rank Fusion and cross-encoder reranking to find the most relevant document chunks, streams AI-generated answers token-by-token, and highlights exact source citations with page numbers — all inside a modern Next.js frontend with JWT-secured per-user data isolation.


🏗️ Architecture

Contributor note: see docs/ARCHITECTURE.md for a route-by-route system map, request-flow diagrams, and Swagger/OpenAPI documentation guidance.

graph TD
    subgraph Frontend["Frontend (Next.js 16)"]
        UI["Dashboard UI (React + Zustand)"]
        Chat["Chat Panel (SSE Streaming)"]
        Viewer["PDF Viewer"]
    end

    subgraph Backend["Backend (FastAPI 0.115+)"]
        API["API Router (/api/v1)"]
        Auth["Auth (JWT + bcrypt)"]
        DB[(PostgreSQL / SQLite)]
        Redis[(Redis)]

        subgraph RAG["RAG Pipeline"]
            Upload["Celery Ingestion Task"]
            Embed["Local Embeddings (all-MiniLM-L6-v2)"]
            EmbedCache["Embedding Cache (Redis + LRU)"]
            BM25["BM25 Index"]
            Retriever["Hybrid Retriever (Vector + BM25 + RRF)"]
            Rerank["Cross-Encoder Reranker (BGE-v2-m3)"]
            Agent["Agent / Generator"]
        end
    end

    subgraph Storage["Storage"]
        Chroma[(ChromaDB)]
        Uploads[("File Storage")]
    end

    subgraph External["External Services"]
        HF["HuggingFace Inference API (Qwen2.5-72B)"]
    end

    UI <-->|REST / Auth| API
    Chat <-->|SSE Streaming| API
    Viewer -->|Serve PDF| API
    API <--> Auth
    API <--> DB
    API --> Upload
    Upload --> Embed
    Embed --> EmbedCache
    Embed -->|Store Vectors| Chroma
    Upload --> BM25
    API <--> Retriever
    Retriever -->|Semantic Search| Chroma
    Retriever -->|Keyword Search| BM25
    Retriever --> Rerank
    Rerank --> Agent
    Agent <-->|LLM Generation| HF
    Upload -->|Store Files| Uploads
    Redis <-->|Task Queue| Upload

🔄 System Flow Overview

  1. User uploads a document via the Next.js frontend.
  2. FastAPI queues a Celery ingestion task backed by Redis.
  3. The worker chunks the document, generates local embeddings (cached via Redis/LRU), builds a BM25 index, and stores vectors in ChromaDB.
  4. At query time, hybrid search merges vector and BM25 results via Reciprocal Rank Fusion.
  5. A cross-encoder reranker refines the top candidates.
  6. The agent assembles a prompt and calls the HuggingFace Inference API.
  7. Streamed SSE tokens are returned to the frontend chat panel.
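
Step 4 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the project's `retriever.py`; the function name and the chunk IDs are hypothetical, and only the formula (each list contributes `1 / (k + rank)` per item, with `k=60` as documented under `RRF_K`) is taken as given.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best hit in a list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]  # hypothetical vector-search order
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that appear in only one, which is the property hybrid search relies on. Only ranks are used, so the raw vector and BM25 scores never need to be put on a common scale.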

🛠 Tech Stack

Backend

| Technology | Purpose |
| --- | --- |
| FastAPI | Async web framework + routing |
| Python 3.11 | Runtime environment |
| PostgreSQL / SQLite | Relational database (SQLAlchemy ORM) |
| JWT + bcrypt | Authentication & password hashing |
| ChromaDB | Local vector store (embeddings) |
| HuggingFace Inference API | LLM answer generation |
| sentence-transformers | Local embedding model (all-MiniLM-L6-v2) |

Frontend

| Technology | Purpose |
| --- | --- |
| Next.js 14 | React framework (App Router) |
| TypeScript | Frontend language |
| Tailwind CSS | Utility-first styling |

AI / ML Pipeline

| Technology | Purpose |
| --- | --- |
| sentence-transformers (all-MiniLM-L6-v2) | Generates vector embeddings for document chunks |
| ChromaDB | Stores + retrieves embeddings locally |
| HuggingFace Inference API | Generates answers from retrieved context |
| BAAI/bge-reranker-v2-m3 | Cross-encoder reranking for retrieval quality |
| Knowledge Graph (GraphRAG) | Entity extraction + relationship graphs |
| PyMuPDF + pdfplumber + python-docx | Document text extraction |

DevOps & Tooling

| Technology | Purpose |
| --- | --- |
| Docker Multi-Stage | Containerised deployment |
| GitHub Actions | CI/CD (E2E, security, deploy) |
| Playwright | E2E + visual regression tests |
| Prometheus + Grafana | Metrics & observability |
| HuggingFace Spaces | Production deployment |

✨ Key Features

🆕 Recent Updates

  • 🤖 Discord Bot Integration
  • ⚡ Celery + Redis Background PDF Processing
  • 📧 Email Verification Workflow
  • 🧠 RAGAS Evaluation Pipeline
  • 🚀 Response Caching with Redis
  • 🐳 Optimized Docker Deployment

👤 Users

  • 🔐 JWT-secured register, login & email verification
  • 📄 Upload PDF, DOCX, TXT, and Markdown
  • 🌐 URL ingestion via web crawler
  • 💬 Ask questions in natural language
  • 🌊 Streaming AI responses token-by-token
  • 📚 Inline source citations with page numbers
  • 📥 Export chat as Markdown, TXT, or PDF
  • 🗂️ Per-user complete data isolation

🤖 RAG Pipeline

  • 🔪 Smart recursive text chunking (configurable size & overlap)
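  • 🧠 Local embeddings: document chunks are embedded on your machine (answer generation still sends the assembled prompt to the HuggingFace Inference API)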
  • Embedding cache (Redis + LRU) — skip redundant computation
  • 🔍 Hybrid search — vector + BM25 merged via RRF
  • 🏆 Cross-encoder reranking for precision answers
  • 🖼️ Image caption extraction from PDF figures
  • 🔗 URL extraction from PDF link annotations
  • 🗺️ Knowledge graph (GraphRAG) per document
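
To make the chunk size and overlap settings concrete, here is a simplified sketch of overlapping character windows. It is an assumption-laden stand-in, not the project's `chunker.py`: a recursive splitter also prefers to break on separators such as paragraphs and sentences, which this fixed-window version ignores. The function name `chunk_text` is hypothetical; the defaults mirror the documented `CHUNK_SIZE=1000` and `CHUNK_OVERLAP=200`.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the defaults, consecutive chunks share 200 characters, so a sentence cut at one chunk boundary is still intact in the neighbouring chunk.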

⚙️ Engineering

  • 🚀 Async FastAPI with SSE streaming
  • 🗄️ PostgreSQL metadata + ChromaDB vectors
  • 🔄 Celery + Redis async ingestion pipeline
  • 🐳 Multi-stage Docker with CPU & GPU profiles
  • 📊 Prometheus metrics + Grafana dashboard
  • 🩺 Deep health endpoint — DB, Redis, Celery, ChromaDB
  • 🔒 Rate limiting, CORS, file validation, JWT expiry
  • 🧪 Playwright E2E + visual regression tests

📁 Project Structure

PDF-Assistant-RAG/
│
├── backend/
│   ├── app/
│   │   ├── main.py                 # FastAPI app — lifespan, middleware, routers
│   │   ├── config.py               # Pydantic settings (env vars)
│   │   ├── models.py               # SQLAlchemy ORM models
│   │   ├── schemas.py              # Pydantic request/response schemas
│   │   ├── database.py             # Engine, session, migrations
│   │   ├── auth.py                 # JWT helpers
│   │   ├── tasks.py                # Celery task definitions
│   │   │
│   │   ├── routes/
│   │   │   ├── auth.py             # Register, login, OAuth
│   │   │   ├── documents.py        # Upload, list, delete, status
│   │   │   ├── chat.py             # Ask, stream, history, export
│   │   │   ├── health.py           # Deep health check endpoint
│   │   │   ├── admin.py            # Admin stats
│   │   │   └── workspaces.py       # Workspace management
│   │   │
│   │   ├── rag/
│   │   │   ├── chunker.py          # PDF/DOCX/TXT extraction + chunking
│   │   │   ├── embeddings.py       # Local embeddings + Redis/LRU cache
│   │   │   ├── vectorstore.py      # ChromaDB operations
│   │   │   ├── bm25.py             # BM25 index per document
│   │   │   ├── retriever.py        # Hybrid search + RRF + reranking
│   │   │   ├── reranker.py         # Cross-encoder reranker
│   │   │   ├── vision.py           # Image caption extraction
│   │   │   ├── url_extractor.py    # PDF URL/link extraction
│   │   │   ├── graph_builder.py    # Knowledge graph (GraphRAG)
│   │   │   ├── agent.py            # LLM answer generation
│   │   │   └── summarizer.py       # Document summarisation
│   │   │
│   │   └── services/
│   │       └── document_ingestion.py  # End-to-end ingestion pipeline
│   │
│   ├── tests/                      # pytest test suite
│   ├── requirements.txt
│   └── migrate_add_extracted_urls.py
│
├── frontend/
│   ├── src/
│   │   ├── app/                    # Next.js App Router pages
│   │   ├── components/             # React components
│   │   ├── store/                  # Zustand state stores
│   │   ├── lib/                    # API client, auth, utilities
│   │   └── services/               # API service layer
│   ├── e2e/                        # Playwright E2E + snapshot tests
│   ├── next.config.ts
│   └── playwright.config.ts
│
├── docs/
│   └── ARCHITECTURE.md
│
├── .github/
│   └── workflows/
│       ├── ci.yml                  # Backend CI
│       ├── e2e.yml                 # Playwright E2E + visual regression
│       ├── deploy.yml              # Docker build (main branch)
│       └── devsecops.yml           # Security scans
│
├── docker-compose.yml              # CPU + GPU + debug profiles + log rotation
├── Dockerfile                      # Multi-stage backend build
├── frontend/Dockerfile             # Multi-stage frontend build (nginx)
└── .env.example

🚀 Getting Started

Prerequisites


1. Clone the Repository

git clone https://github.com/param20h/PDF-Assistant-RAG.git
cd PDF-Assistant-RAG

2. Configure Environment

cp .env.example .env

Edit .env:

SECRET_KEY=your-strong-random-secret
DATABASE_URL=postgresql://pdf_rag_user:pdf_rag_pass@localhost:5432/pdf_rag
HF_TOKEN=hf_your_huggingface_token_here
UPLOAD_DIR=./data/uploads
CHROMA_PERSIST_DIR=./data/chroma_db
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1

Get your free HuggingFace token at huggingface.co/settings/tokens

Email Verification Setup (optional)

FRONTEND_URL=http://localhost:3000
MAIL_USERNAME=your_smtp_username
MAIL_PASSWORD=your_smtp_or_gmail_app_password
MAIL_FROM=your_sender_email@example.com
MAIL_SERVER=smtp.gmail.com
MAIL_PORT=587
MAIL_STARTTLS=True
MAIL_SSL_TLS=False

Without SMTP settings, registration returns a local verification link so contributors can test without email credentials.

3. Run with Docker (recommended)

# CPU-only (no GPU needed)
docker compose --profile cpu up --build

# GPU-accelerated (requires NVIDIA Container Toolkit)
docker compose --profile gpu up --build

# Also start pgAdmin at http://localhost:5050
docker compose --profile cpu --profile debug up --build

| Service | URL |
| --- | --- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:7860 |
| API Docs | http://localhost:7860/docs |
| pgAdmin | http://localhost:5050 (debug profile) |

4. Run Locally (without Docker)

# Backend
cd backend
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 7860

# Celery worker (separate terminal, run from backend/ with the virtualenv active)
celery -A app.celery_app.celery_app worker --loglevel=info

# Frontend (separate terminal)
cd frontend
npm install
npm run dev

5. Set up crawl4ai (URL Upload Feature — optional)

crawl4ai-setup

🧠 RAG Pipeline

                    ┌─────────────────────────────────────────────┐
                    │         PDF / DOCX / TXT / MD Upload        │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │   PyMuPDF / pdfplumber / python-docx Parser │
                    │   + Image caption extraction                │
                    │   + PDF URL/link annotation extraction      │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │      Recursive Character Text Splitter      │
                    │   chunk_size=1000  |  overlap=200           │
                    └───────────────────┬─────────────────────────┘
                                        │
                                        ▼
                    ┌─────────────────────────────────────────────┐
                    │  all-MiniLM-L6-v2  (local embeddings)       │
                    │  384-dim · Redis + LRU cache (24h TTL)      │
                    └──────────────┬──────────────────────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    ▼                             ▼
         ┌──────────────────┐         ┌─────────────────────┐
         │ ChromaDB vectors │         │  BM25 keyword index │
         │ (per-user coll.) │         │  (per-document .pkl)│
         └──────────────────┘         └─────────────────────┘

                          ── At Query Time ──

  User Question ──▶ Embed (cached) ──▶ Vector Search (Top-K=20)
                         │
                         ├──▶ BM25 Search (Top-K=20)
                         │
                         ▼
              Reciprocal Rank Fusion (RRF, k=60)
                         │
                         ▼
          BGE-Reranker-v2-m3 Cross-Encoder (Top-K=8)
                         │
                         ▼
        Prompt Assembly (system + context + question)
                         │
                         ▼
        Qwen2.5-72B-Instruct (HF Inference API)
                         │
                         ▼
        Streamed SSE tokens ──▶ Frontend ChatPanel
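
The embedding step in the diagram mentions an LRU cache with a 24h TTL. The sketch below shows one way such a cache could work in-process, assuming entries are keyed by a SHA-256 hash of the chunk text. The class name, the `max_items` limit, and the `fake_embed` stand-in are hypothetical; the Redis layer and the real model call are deliberately left out.

```python
import hashlib
import time
from collections import OrderedDict


class EmbeddingCache:
    """In-process LRU cache with a TTL, keyed by a hash of the chunk text."""

    def __init__(self, max_items=10_000, ttl_seconds=86_400, clock=time.monotonic):
        self._items = OrderedDict()  # key -> (expires_at, vector)
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, embed_fn):
        key = self._key(text)
        entry = self._items.get(key)
        if entry is not None and entry[0] > self._clock():
            self._items.move_to_end(key)  # mark as most recently used
            return entry[1]
        vector = embed_fn(text)  # cache miss or expired entry
        self._items[key] = (self._clock() + self._ttl, vector)
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used
        return vector


calls = []


def fake_embed(text):
    # Stand-in for the real embedding model: a tiny deterministic vector.
    calls.append(text)
    return [float(len(text)), 0.0]


cache = EmbeddingCache(max_items=2)
cache.get_or_compute("hello", fake_embed)
cache.get_or_compute("hello", fake_embed)  # served from the cache
print(len(calls))  # 1
```

Hashing the text keeps keys short and uniform regardless of chunk length, and the injectable `clock` makes expiry easy to test without sleeping.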

📡 API Reference

| Method | Endpoint | Auth | Description |
| --- | --- | --- | --- |
| POST | `/api/v1/auth/register` | | Create a new user account |
| POST | `/api/v1/auth/login` | | Login and receive JWT tokens |
| GET | `/api/v1/auth/me` | | Get current user profile |
| POST | `/api/v1/documents/upload` | | Upload PDF/DOCX/TXT and enqueue ingestion (202) |
| POST | `/api/v1/documents/urlupload` | | Crawl a URL and ingest as document |
| GET | `/api/v1/documents/` | | List documents (pagination + `?q=` name filter) |
| GET | `/api/v1/documents/{id}` | | Get document metadata (incl. extracted URLs) |
| GET | `/api/v1/documents/{id}/status` | | Poll ingestion progress |
| DELETE | `/api/v1/documents/{id}` | | Soft-delete document |
| POST | `/api/v1/chat/ask/stream` | | Ask a question (SSE streaming) |
| GET | `/api/v1/chat/history/{doc_id}` | | Get chat history for a document |
| DELETE | `/api/v1/chat/history/{doc_id}` | | Clear chat history |
| GET | `/api/v1/chat/export/{doc_id}` | | Export transcript as MD / TXT / PDF |
| GET | `/api/v1/chat/sessions` | | List chat sessions |
| POST | `/api/v1/chat/sessions` | | Create chat session |
| GET | `/api/v1/health/status` | | Deep health check (DB, Redis, Celery, ChromaDB) |
| GET | `/api/health` | | Basic liveness check |

Full interactive docs at /docs (Swagger UI) when running locally.
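
For clients that consume `/api/v1/chat/ask/stream` outside the browser, the response has to be parsed as a `text/event-stream`. The sketch below handles only the generic SSE framing (`data:` fields, comment lines, blank-line event boundaries). What each `data:` payload contains for this API (plain tokens, JSON, or something else) is not documented here, so that part is left to the caller, and the helper name `iter_sse_data` is hypothetical.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete event in an SSE body.

    `lines` is any iterable of text lines, for example a decoded HTTP
    response read line by line.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:  # a blank line terminates the current event
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event without a terminating blank line is dropped, as in the SSE format.


stream = ["data: Hello\n", "\n", ": ping\n", "data: world\n", "\n"]
print(list(iter_sse_data(stream)))  # ['Hello', 'world']
```

Because the function accepts any iterable of lines, it can be fed from a streaming HTTP client or from a recorded fixture in tests.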


📦 Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `SECRET_KEY` | | | JWT signing secret. Generate: `python -c "import secrets; print(secrets.token_urlsafe(32))"` |
| `HF_TOKEN` | | | HuggingFace API token for LLM inference |
| `DATABASE_URL` | | `sqlite:///./data/app.db` | SQLAlchemy connection string (SQLite or PostgreSQL) |
| `CELERY_BROKER_URL` | | `redis://localhost:6379/0` | Redis broker for Celery |
| `CELERY_RESULT_BACKEND` | | `redis://localhost:6379/1` | Redis backend for Celery results |
| `REDIS_URL` | | | Redis URL for response + embedding cache |
| `UPLOAD_DIR` | | `./data/uploads` | File storage directory |
| `CHROMA_PERSIST_DIR` | | `./data/chroma_db` | ChromaDB persistence directory |
| `EMBEDDING_MODEL` | | `sentence-transformers/all-MiniLM-L6-v2` | Local embedding model |
| `EMBEDDING_CACHE_TTL` | | `86400` | Embedding cache TTL in seconds (24h) |
| `LLM_MODEL` | | `Qwen/Qwen2.5-72B-Instruct` | HuggingFace model for answer generation |
| `LLM_TEMPERATURE` | | `0.3` | LLM sampling temperature |
| `RERANKER_MODEL` | | `BAAI/bge-reranker-v2-m3` | Cross-encoder reranker model |
| `USE_HYBRID_SEARCH` | | `True` | Enable BM25 + vector hybrid search |
| `RRF_K` | | `60` | RRF smoothing constant |
| `CHUNK_SIZE` | | `1000` | Characters per document chunk |
| `CHUNK_OVERLAP` | | `200` | Overlap between consecutive chunks |
| `TOP_K_RETRIEVAL` | | `20` | Candidates retrieved from vector store |
| `TOP_K_RERANK` | | `8` | Final chunks after reranking |
| `VISION_PROVIDER` | | | Set to `openai` to use GPT-4o-mini for image captions |
| `OPENAI_API_KEY` | | | Required when `VISION_PROVIDER=openai` |
| `ENVIRONMENT` | | `development` | Set to `production` to lock CORS |
| `FRONTEND_URL` | | `http://localhost:3000` | Public frontend URL for OAuth + email links |
| `NEXT_PUBLIC_API_URL` | | `http://localhost:7860` | Backend URL injected at frontend build time |
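
As a rough illustration of how the retrieval-related variables fit together, the sketch below reads them from the environment with the defaults listed above. It uses only the standard library rather than the Pydantic settings in `config.py`, and the `RagSettings` class, the `load_rag_settings` helper, and the two sanity checks are assumptions made for this example, not the project's behaviour.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RagSettings:
    chunk_size: int
    chunk_overlap: int
    top_k_retrieval: int
    top_k_rerank: int
    rrf_k: int
    use_hybrid_search: bool


def load_rag_settings(env=os.environ):
    """Read retrieval settings from a mapping, using the documented defaults."""

    def as_bool(value):
        return value.strip().lower() in {"1", "true", "yes", "on"}

    settings = RagSettings(
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k_retrieval=int(env.get("TOP_K_RETRIEVAL", "20")),
        top_k_rerank=int(env.get("TOP_K_RERANK", "8")),
        rrf_k=int(env.get("RRF_K", "60")),
        use_hybrid_search=as_bool(env.get("USE_HYBRID_SEARCH", "True")),
    )
    # Assumed sanity checks: these constraints are not stated in the README.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    if settings.top_k_rerank > settings.top_k_retrieval:
        raise ValueError("TOP_K_RERANK cannot exceed TOP_K_RETRIEVAL")
    return settings


defaults = load_rag_settings({})
print(defaults.chunk_size, defaults.top_k_rerank, defaults.use_hybrid_search)  # 1000 8 True
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_rag_settings({"TOP_K_RERANK": "5"})`.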

📜 Scripts

Backend (backend/)

| Command | Description |
| --- | --- |
| `uvicorn app.main:app --reload` | Start FastAPI with hot reload |
| `celery -A app.celery_app.celery_app worker --loglevel=info` | Start Celery worker |
| `python migrate_add_extracted_urls.py` | Run URL extraction column migration |
| `python scripts/run_ragas_eval.py --user-id <id>` | Run RAGAS evaluation (vector vs GraphRAG) |

Frontend (frontend/)

| Command | Description |
| --- | --- |
| `npm run dev` | Start Next.js dev server |
| `npm run build` | Production build |
| `npm run test` | Run Vitest unit tests |
| `npm run test:e2e` | Run Playwright E2E tests |
| `npx playwright test e2e/snapshots.spec.ts --update-snapshots` | Regenerate visual regression baselines |

Docker

| Command | Description |
| --- | --- |
| `docker compose --profile cpu up --build` | Full stack, CPU only |
| `docker compose --profile gpu up --build` | Full stack, GPU accelerated |
| `docker compose --profile debug up` | Also start pgAdmin at http://localhost:5050 |
| `docker compose down` | Stop all containers |

GPU profile requires NVIDIA Container Toolkit.


🌐 Deployment

HuggingFace Spaces

  1. Fork this repo and create a new Space at huggingface.co/new-space (SDK: Docker)
  2. Set Space secrets: HF_TOKEN, SECRET_KEY, DATABASE_URL
  3. Push to the hf remote:
git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
git push hf main

Self-Hosted / VPS

docker compose --profile cpu up -d --build
# App at http://your-server:7860
# Frontend at http://your-server:3000

🤝 Contributing

This project is participating in GirlScript Summer of Code! We welcome contributors of all skill levels.

Branch Strategy:

| Branch | Purpose |
| --- | --- |
| `main` | Production, deployed to HuggingFace (admin only) |
| `dev` | All contributor PRs target here |
| `feature/*` / `fix/*` / `docs/*` | Your working branches |

# Always branch from dev
git checkout -b feature/my-feature upstream/dev

Quick links:


📄 License

Distributed under the MIT License. See LICENSE for more information.



Built with 💙 by the open-source community

If you found this project helpful, please give it a ⭐ — it helps contributors discover it!

