| title | Document AI Analyst |
|---|---|
| emoji | 🧠 |
| colorFrom | indigo |
| colorTo | purple |
| sdk | docker |
| app_port | 7860 |
| pinned | true |
| license | mit |
| short_description | Enterprise Agentic RAG — upload PDFs and chat with AI |
██████╗ ██████╗ ███████╗ █████╗ ███████╗███████╗██╗███████╗████████╗ █████╗ ███╗ ██╗████████╗
██╔══██╗██╔══██╗██╔════╝ ██╔══██╗██╔════╝██╔════╝██║██╔════╝╚══██╔══╝██╔══██╗████╗ ██║╚══██╔══╝
██████╔╝██║ ██║█████╗ ███████║███████╗███████╗██║███████╗ ██║ ███████║██╔██╗ ██║ ██║
██╔═══╝ ██║ ██║██╔══╝ ██╔══██║╚════██║╚════██║██║╚════██║ ██║ ██╔══██║██║╚██╗██║ ██║
██║ ██████╔╝██║ ██║ ██║███████║███████║██║███████║ ██║ ██║ ██║██║ ╚████║ ██║
╚═╝ ╚═════╝ ╚═╝ ╚═╝ ╚═╝╚══════╝╚══════╝╚═╝╚══════╝ ╚═╝ ╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═╝
██████╗ █████╗ ██████╗
██╔══██╗██╔══██╗██╔════╝
██████╔╝███████║██║ ███╗
██╔══██╗██╔══██║██║ ██║
██║ ██║██║ ██║╚██████╔╝
╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝
Upload · Embed · Retrieve · Chat — A production-grade AI document assistant built end-to-end with an agentic RAG pipeline, streaming responses, and per-user data isolation.
## 🌟 GirlScript Summer of Code 2026
This project is an official participant in GirlScript Summer of Code 2026 (GSSoC'26) and welcomes contributions from the community.
Features · Tech Stack · Getting Started · Architecture · RAG Pipeline · API Reference · Deployment · Contributing
Thanks to all the amazing people who have contributed to PDF-Assistant-RAG! 🎉
🌟 Want to join them? Check out CONTRIBUTING.md for contribution guidelines and look for good first issues to get started!
PDF-Assistant-RAG is a complete, production-ready AI document assistant that lets users upload complex PDFs, financial reports, legal contracts, and research papers — then chat with an AI that provides accurate, cited answers powered by a multi-stage Retrieval-Augmented Generation pipeline.
The system uses hybrid search (vector + BM25) with Reciprocal Rank Fusion and cross-encoder reranking to find the most relevant document chunks, streams AI-generated answers token-by-token, and highlights exact source citations with page numbers — all inside a modern Next.js frontend with JWT-secured per-user data isolation.
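
As a rough illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion. It is not taken from `retriever.py`; the function name and the chunk IDs are hypothetical, and only the constant `k=60` comes from this README (`RRF_K`).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) to a chunk's score, where rank
    starts at 1 for the top hit of that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers for one question.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behaviour the hybrid search relies on before reranking.
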
Contributor note: see docs/ARCHITECTURE.md for a route-by-route system map, request-flow diagrams, and Swagger/OpenAPI documentation guidance.
graph TD
subgraph Frontend["Frontend (Next.js 16)"]
UI["Dashboard UI (React + Zustand)"]
Chat["Chat Panel (SSE Streaming)"]
Viewer["PDF Viewer"]
end
subgraph Backend["Backend (FastAPI 0.115+)"]
API["API Router (/api/v1)"]
Auth["Auth (JWT + bcrypt)"]
DB[(PostgreSQL / SQLite)]
Redis[(Redis)]
subgraph RAG["RAG Pipeline"]
Upload["Celery Ingestion Task"]
Embed["Local Embeddings (all-MiniLM-L6-v2)"]
EmbedCache["Embedding Cache (Redis + LRU)"]
BM25["BM25 Index"]
Retriever["Hybrid Retriever (Vector + BM25 + RRF)"]
Rerank["Cross-Encoder Reranker (BGE-v2-m3)"]
Agent["Agent / Generator"]
end
end
subgraph Storage["Storage"]
Chroma[(ChromaDB)]
Uploads[("File Storage")]
end
subgraph External["External Services"]
HF["HuggingFace Inference API (Qwen2.5-72B)"]
end
UI <-->|REST / Auth| API
Chat <-->|SSE Streaming| API
Viewer -->|Serve PDF| API
API <--> Auth
API <--> DB
API --> Upload
Upload --> Embed
Embed --> EmbedCache
Embed -->|Store Vectors| Chroma
Upload --> BM25
API <--> Retriever
Retriever -->|Semantic Search| Chroma
Retriever -->|Keyword Search| BM25
Retriever --> Rerank
Rerank --> Agent
Agent <-->|LLM Generation| HF
Upload -->|Store Files| Uploads
Redis <-->|Task Queue| Upload
- User uploads a document via the Next.js frontend.
- FastAPI queues a Celery ingestion task backed by Redis.
- The worker chunks the document, generates local embeddings (cached via Redis/LRU), builds a BM25 index, and stores vectors in ChromaDB.
- At query time, hybrid search merges vector and BM25 results via Reciprocal Rank Fusion.
- A cross-encoder reranker refines the top candidates.
- The agent assembles a prompt and calls the HuggingFace Inference API.
- Streamed SSE tokens are returned to the frontend chat panel.
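
The embedding cache mentioned in the steps above could look roughly like the following. This is a sketch of the in-process LRU tier only, assuming entries are keyed by a hash of the chunk text; the Redis tier is omitted, and the class name, `max_items`, and `fake_embed` are hypothetical. The 24-hour TTL mirrors `EMBEDDING_CACHE_TTL`.

```python
import hashlib
import time
from collections import OrderedDict


class EmbeddingCache:
    """In-memory LRU cache with a per-entry TTL (illustrative only)."""

    def __init__(self, max_items=1024, ttl_seconds=86400, clock=time.monotonic):
        self._items = OrderedDict()  # key -> (expires_at, vector)
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def _key(text):
        # Hash the text so keys have a fixed size regardless of chunk length.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, embed_fn):
        key = self._key(text)
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            self._items.move_to_end(key)  # mark as most recently used
            return entry[1]
        # Miss or expired entry: compute and store a fresh vector.
        vector = embed_fn(text)
        self._items[key] = (now + self._ttl, vector)
        self._items.move_to_end(key)
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict least recently used
        return vector


calls = []


def fake_embed(text):
    # Stand-in for the real embedding model; records each invocation.
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(max_items=2)
first = cache.get_or_compute("hello", fake_embed)
second = cache.get_or_compute("hello", fake_embed)  # served from the cache
```
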
| Technology | Purpose |
|---|---|
| Next.js 14 | React framework (App Router) |
| TypeScript | Frontend language |
| Tailwind CSS | Utility-first styling |

- 🤖 Discord Bot Integration
- ⚡ Celery + Redis Background PDF Processing
- 📧 Email Verification Workflow
- 🧠 RAGAS Evaluation Pipeline
- 🚀 Response Caching with Redis
- 🐳 Optimized Docker Deployment
PDF-Assistant-RAG/
│
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app — lifespan, middleware, routers
│ │ ├── config.py # Pydantic settings (env vars)
│ │ ├── models.py # SQLAlchemy ORM models
│ │ ├── schemas.py # Pydantic request/response schemas
│ │ ├── database.py # Engine, session, migrations
│ │ ├── auth.py # JWT helpers
│ │ ├── tasks.py # Celery task definitions
│ │ │
│ │ ├── routes/
│ │ │ ├── auth.py # Register, login, OAuth
│ │ │ ├── documents.py # Upload, list, delete, status
│ │ │ ├── chat.py # Ask, stream, history, export
│ │ │ ├── health.py # Deep health check endpoint
│ │ │ ├── admin.py # Admin stats
│ │ │ └── workspaces.py # Workspace management
│ │ │
│ │ ├── rag/
│ │ │ ├── chunker.py # PDF/DOCX/TXT extraction + chunking
│ │ │ ├── embeddings.py # Local embeddings + Redis/LRU cache
│ │ │ ├── vectorstore.py # ChromaDB operations
│ │ │ ├── bm25.py # BM25 index per document
│ │ │ ├── retriever.py # Hybrid search + RRF + reranking
│ │ │ ├── reranker.py # Cross-encoder reranker
│ │ │ ├── vision.py # Image caption extraction
│ │ │ ├── url_extractor.py # PDF URL/link extraction
│ │ │ ├── graph_builder.py # Knowledge graph (GraphRAG)
│ │ │ ├── agent.py # LLM answer generation
│ │ │ └── summarizer.py # Document summarisation
│ │ │
│ │ └── services/
│ │ └── document_ingestion.py # End-to-end ingestion pipeline
│ │
│ ├── tests/ # pytest test suite
│ ├── requirements.txt
│ └── migrate_add_extracted_urls.py
│
├── frontend/
│ ├── src/
│ │ ├── app/ # Next.js App Router pages
│ │ ├── components/ # React components
│ │ ├── store/ # Zustand state stores
│ │ ├── lib/ # API client, auth, utilities
│ │ └── services/ # API service layer
│ ├── e2e/ # Playwright E2E + snapshot tests
│ ├── next.config.ts
│ └── playwright.config.ts
│
├── docs/
│ └── ARCHITECTURE.md
│
├── .github/
│ └── workflows/
│ ├── ci.yml # Backend CI
│ ├── e2e.yml # Playwright E2E + visual regression
│ ├── deploy.yml # Docker build (main branch)
│ └── devsecops.yml # Security scans
│
├── docker-compose.yml # CPU + GPU + debug profiles + log rotation
├── Dockerfile # Multi-stage backend build
├── frontend/Dockerfile # Multi-stage frontend build (nginx)
└── .env.example
- Python 3.11+
- Node.js 20+
- Docker + Docker Compose (recommended)
- HuggingFace API token (free) from huggingface.co/settings/tokens

git clone https://github.com/param20h/PDF-Assistant-RAG.git
cd PDF-Assistant-RAG
cp .env.example .env

Edit .env:
SECRET_KEY=your-strong-random-secret
DATABASE_URL=postgresql://pdf_rag_user:pdf_rag_pass@localhost:5432/pdf_rag
HF_TOKEN=hf_your_huggingface_token_here
UPLOAD_DIR=./data/uploads
CHROMA_PERSIST_DIR=./data/chroma_db
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1

Get your free HuggingFace token at huggingface.co/settings/tokens

FRONTEND_URL=http://localhost:3000
MAIL_USERNAME=your_smtp_username
MAIL_PASSWORD=your_smtp_or_gmail_app_password
MAIL_FROM=your_sender_email@example.com
MAIL_SERVER=smtp.gmail.com
MAIL_PORT=587
MAIL_STARTTLS=True
MAIL_SSL_TLS=False

Without SMTP settings, registration returns a local verification link so contributors can test without email credentials.

# CPU-only (no GPU needed)
docker compose --profile cpu up --build
# GPU-accelerated (requires NVIDIA Container Toolkit)
docker compose --profile gpu up --build
# Also start pgAdmin at http://localhost:5050
docker compose --profile cpu --profile debug up --build

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:7860 |
| API Docs | http://localhost:7860/docs |
| pgAdmin | http://localhost:5050 (debug profile) |
# Backend
cd backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 7860
# Celery worker (separate terminal)
celery -A app.celery_app.celery_app worker --loglevel=info
# Frontend (separate terminal)
cd frontend
npm install
npm run dev

crawl4ai-setup

┌─────────────────────────────────────────────┐
│ PDF / DOCX / TXT / MD Upload │
└───────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ PyMuPDF / pdfplumber / python-docx Parser │
│ + Image caption extraction │
│ + PDF URL/link annotation extraction │
└───────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Recursive Character Text Splitter │
│ chunk_size=1000 | overlap=200 │
└───────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ all-MiniLM-L6-v2 (local embeddings) │
│ 384-dim · Redis + LRU cache (24h TTL) │
└──────────────┬──────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌──────────────────┐ ┌─────────────────────┐
│ ChromaDB vectors │ │ BM25 keyword index │
│ (per-user coll.) │ │ (per-document .pkl)│
└──────────────────┘ └─────────────────────┘
── At Query Time ──
User Question ──▶ Embed (cached) ──▶ Vector Search (Top-K=20)
│
├──▶ BM25 Search (Top-K=20)
│
▼
Reciprocal Rank Fusion (RRF, k=60)
│
▼
BGE-Reranker-v2-m3 Cross-Encoder (Top-K=8)
│
▼
Prompt Assembly (system + context + question)
│
▼
Qwen2.5-72B-Instruct (HF Inference API)
│
▼
Streamed SSE tokens ──▶ Frontend ChatPanel
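
To make the `chunk_size=1000 | overlap=200` stage of the diagram concrete, here is a simplified sketch. It uses a fixed-width sliding window, not the recursive character splitter the pipeline names (which also tries to break on paragraph and sentence boundaries), and the function name is hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2,500-character document for illustration.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

With the defaults, each chunk starts 800 characters after the previous one, so the final 200 characters of one chunk are repeated at the start of the next.
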

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| `POST` | `/api/v1/auth/register` | ❌ | Create a new user account |
| `POST` | `/api/v1/auth/login` | ❌ | Login and receive JWT tokens |
| `GET` | `/api/v1/auth/me` | ✅ | Get current user profile |
| `POST` | `/api/v1/documents/upload` | ✅ | Upload PDF/DOCX/TXT and enqueue ingestion (202) |
| `POST` | `/api/v1/documents/urlupload` | ✅ | Crawl a URL and ingest as document |
| `GET` | `/api/v1/documents/` | ✅ | List documents (pagination + `?q=` name filter) |
| `GET` | `/api/v1/documents/{id}` | ✅ | Get document metadata (incl. extracted URLs) |
| `GET` | `/api/v1/documents/{id}/status` | ✅ | Poll ingestion progress |
| `DELETE` | `/api/v1/documents/{id}` | ✅ | Soft-delete document |
| `POST` | `/api/v1/chat/ask/stream` | ✅ | Ask a question (SSE streaming) |
| `GET` | `/api/v1/chat/history/{doc_id}` | ✅ | Get chat history for a document |
| `DELETE` | `/api/v1/chat/history/{doc_id}` | ✅ | Clear chat history |
| `GET` | `/api/v1/chat/export/{doc_id}` | ✅ | Export transcript as MD / TXT / PDF |
| `GET` | `/api/v1/chat/sessions` | ✅ | List chat sessions |
| `POST` | `/api/v1/chat/sessions` | ✅ | Create chat session |
| `GET` | `/api/v1/health/status` | ❌ | Deep health check (DB, Redis, Celery, ChromaDB) |
| `GET` | `/api/health` | ❌ | Basic liveness check |

Full interactive docs are available at `/docs` (Swagger UI) when running locally.
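
For contributors writing a client for the streaming endpoint, the sketch below parses a Server-Sent Events stream using only the standard `text/event-stream` framing. The payload shape of `/api/v1/chat/ask/stream` is not documented in this README, so the assumption that each event's `data` field carries a piece of answer text is hypothetical, as is the function name.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete SSE event.

    `lines` is any iterable of text lines, for example a decoded HTTP
    response body read line by line.
    """
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "data":
            data_lines.append(value)
    # An event without a terminating blank line is discarded.


# Hypothetical stream contents for illustration.
sample = ["data: Hello\n", "\n", ": keep-alive\n", "data: wor\n", "data: ld\n", "\n"]
tokens = list(iter_sse_data(sample))
print(tokens)
```

Multiple `data:` lines inside one event are joined with a newline, which matters if the server ever sends multi-line payloads.
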

| Variable | Required | Default | Description |
|---|---|---|---|
| `SECRET_KEY` | ✅ | (none) | JWT signing secret. Generate: `python -c "import secrets; print(secrets.token_urlsafe(32))"` |
| `HF_TOKEN` | ✅ | (none) | HuggingFace API token for LLM inference |
| `DATABASE_URL` | ❌ | `sqlite:///./data/app.db` | SQLAlchemy connection string (SQLite or PostgreSQL) |
| `CELERY_BROKER_URL` | ❌ | `redis://localhost:6379/0` | Redis broker for Celery |
| `CELERY_RESULT_BACKEND` | ❌ | `redis://localhost:6379/1` | Redis backend for Celery results |
| `REDIS_URL` | ❌ | (none) | Redis URL for response + embedding cache |
| `UPLOAD_DIR` | ❌ | `./data/uploads` | File storage directory |
| `CHROMA_PERSIST_DIR` | ❌ | `./data/chroma_db` | ChromaDB persistence directory |
| `EMBEDDING_MODEL` | ❌ | `sentence-transformers/all-MiniLM-L6-v2` | Local embedding model |
| `EMBEDDING_CACHE_TTL` | ❌ | `86400` | Embedding cache TTL in seconds (24h) |
| `LLM_MODEL` | ❌ | `Qwen/Qwen2.5-72B-Instruct` | HuggingFace model for answer generation |
| `LLM_TEMPERATURE` | ❌ | `0.3` | LLM sampling temperature |
| `RERANKER_MODEL` | ❌ | `BAAI/bge-reranker-v2-m3` | Cross-encoder reranker model |
| `USE_HYBRID_SEARCH` | ❌ | `True` | Enable BM25 + vector hybrid search |
| `RRF_K` | ❌ | `60` | RRF smoothing constant |
| `CHUNK_SIZE` | ❌ | `1000` | Characters per document chunk |
| `CHUNK_OVERLAP` | ❌ | `200` | Overlap between consecutive chunks |
| `TOP_K_RETRIEVAL` | ❌ | `20` | Candidates retrieved from vector store |
| `TOP_K_RERANK` | ❌ | `8` | Final chunks after reranking |
| `VISION_PROVIDER` | ❌ | (none) | Set to `openai` to use GPT-4o-mini for image captions |
| `OPENAI_API_KEY` | ❌ | (none) | Required when `VISION_PROVIDER=openai` |
| `ENVIRONMENT` | ❌ | `development` | Set to `production` to lock CORS |
| `FRONTEND_URL` | ❌ | `http://localhost:3000` | Public frontend URL for OAuth + email links |
| `NEXT_PUBLIC_API_URL` | ❌ | `http://localhost:7860` | Backend URL injected at frontend build time |

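
The sketch below shows one way the required and defaulted variables above could be resolved. The project's `config.py` uses Pydantic settings; this example swaps in a plain standard-library dataclass so it runs without dependencies, and the `Settings` and `load_settings` names are hypothetical. Only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    secret_key: str
    hf_token: str
    database_url: str
    chunk_size: int
    chunk_overlap: int
    top_k_retrieval: int
    top_k_rerank: int
    rrf_k: int
    use_hybrid_search: bool


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    # SECRET_KEY and HF_TOKEN are the only variables marked as required.
    missing = [name for name in ("SECRET_KEY", "HF_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    truthy = {"1", "true", "yes"}
    return Settings(
        secret_key=env["SECRET_KEY"],
        hf_token=env["HF_TOKEN"],
        database_url=env.get("DATABASE_URL", "sqlite:///./data/app.db"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k_retrieval=int(env.get("TOP_K_RETRIEVAL", "20")),
        top_k_rerank=int(env.get("TOP_K_RERANK", "8")),
        rrf_k=int(env.get("RRF_K", "60")),
        use_hybrid_search=env.get("USE_HYBRID_SEARCH", "True").strip().lower() in truthy,
    )


# Placeholder values for illustration; only CHUNK_SIZE overrides a default.
settings = load_settings({"SECRET_KEY": "x", "HF_TOKEN": "hf_x", "CHUNK_SIZE": "500"})
print(settings.chunk_size, settings.chunk_overlap)
```

Failing fast on the two required variables gives a clearer error at startup than a failed JWT signature or a rejected inference call later on.
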
| Backend command | Description |
|---|---|
| `uvicorn app.main:app --reload` | Start FastAPI with hot reload |
| `celery -A app.celery_app.celery_app worker --loglevel=info` | Start Celery worker |
| `python migrate_add_extracted_urls.py` | Run URL extraction column migration |
| `python scripts/run_ragas_eval.py --user-id <id>` | Run RAGAS evaluation (vector vs GraphRAG) |

| Frontend command | Description |
|---|---|
| `npm run dev` | Start Next.js dev server |
| `npm run build` | Production build |
| `npm run test` | Run Vitest unit tests |
| `npm run test:e2e` | Run Playwright E2E tests |
| `npx playwright test e2e/snapshots.spec.ts --update-snapshots` | Regenerate visual regression baselines |

| Docker command | Description |
|---|---|
| `docker compose --profile cpu up --build` | Full stack, CPU only |
| `docker compose --profile gpu up --build` | Full stack, GPU accelerated |
| `docker compose --profile debug up` | Also start pgAdmin at http://localhost:5050 |
| `docker compose down` | Stop all containers |

GPU profile requires NVIDIA Container Toolkit.
- Fork this repo and create a new Space at huggingface.co/new-space (SDK: Docker)
- Set Space secrets: `HF_TOKEN`, `SECRET_KEY`, `DATABASE_URL`
- Push to the `hf` remote:

git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
git push hf main

docker compose --profile cpu up -d --build
# App at http://your-server:7860
# Frontend at http://your-server:3000

This project is participating in GirlScript Summer of Code! We welcome contributors of all skill levels.

Branch Strategy:

| Branch | Purpose |
|---|---|
| `main` | Production: HuggingFace deployed (admin only) |
| `dev` | All contributor PRs target here |
| `feature/*` / `fix/*` / `docs/*` | Your working branches |

# Always branch from dev
git checkout -b feature/my-feature upstream/dev

Distributed under the MIT License. See LICENSE for more information.
Built with 💙 by the open-source community
If you found this project helpful, please give it a ⭐ — it helps contributors discover it!