PDF Summarizer & RAG Q&A System

A comprehensive tool to extract, clean, chunk, summarize, and perform Question-Answering (RAG) on PDF documents. It uses Google Gemini for generation and embeddings, and FAISS for vector storage.

✨ Features

- Document Processing: Extracts text from PDFs, removes noise such as headers and footers, and splits the cleaned text into chunks.
- AI Summarization: Generates concise summaries of document sections using Gemini 2.5 Flash.
- RAG Q&A: Ask questions about your document and get answers grounded in the passages retrieved from it.
- Dual Interfaces:
  - API: FastAPI backend for integration.
  - Web UI: Streamlit interface.

📋 Requirements

Python: Version 3.8 or higher.
Dependencies:
```
pip install -r requirements.txt
```

🔑 Configuration

Environment Variables: Create a .env file in the root directory:
```
GEMINI_API_KEY=your_api_key_here
```
Streamlit Secrets (For hosting UI): If deploying to Streamlit Cloud, add GEMINI_API_KEY to your app's secrets.
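
As an illustration of how the key can be resolved at startup, here is a minimal sketch using only the standard library. The helper names `load_env_file` and `get_gemini_api_key` are hypothetical and are not taken from this project, which may load its configuration differently. The parser only handles plain KEY=value lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse plain KEY=value lines from a .env file (no quoting or `export` support)."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def get_gemini_api_key(env_path: str = ".env") -> str:
    """Return GEMINI_API_KEY, preferring the process environment over the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(env_path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment.")
    return key
```

Failing early with a clear error is usually easier to debug than a rejected request from the Gemini API later in the pipeline.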

🚀 How to Run

1. Web UI (Recommended)

This is the interactive interface: upload any PDF and ask questions about it.

```
cd src
python -m streamlit run streamlit_app.py
```

Upload: Drag & drop any PDF in the sidebar.
Chat: Ask questions about the uploaded document immediately.

2. Output Data Pipeline (Manual)

If you want to process the default file (file/data.pdf) without the UI:

```
cd src
python main.py
```

Outputs: output/cleaned.txt, output/index.faiss, output/metadata.pkl
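
Before starting the API server, it can help to confirm that the pipeline produced all three artifacts. The sketch below is one way to do that; `missing_outputs` is a hypothetical helper, not part of this project, and the `output_dir` argument is an assumption that you should point at wherever your run wrote its files.

```python
from pathlib import Path

# Artifact names as listed in the Outputs line above.
EXPECTED_OUTPUTS = ("cleaned.txt", "index.faiss", "metadata.pkl")


def missing_outputs(output_dir: str = "output") -> list:
    """Return the names of expected pipeline artifacts not found in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All pipeline artifacts are present.")
```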

3. API Server (FastAPI)

For backend integration. The server uses the data already processed in step 2 or through the UI.

```
cd src
python -m uvicorn api:app --reload --port 8000
```

Test Endpoint:

```
curl -X POST "http://127.0.0.1:8000/ask" \
     -H "Content-Type: application/json" \
     -d '{"question": "What is the main topic?"}'
```
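
The same request can be sent from Python. This is a minimal sketch using only the standard library; the function names `build_ask_request` and `ask` are hypothetical. The response schema is not documented here, so the sketch returns the raw response body instead of assuming particular fields.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/ask"  # same endpoint as the curl example


def build_ask_request(question: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, url: str = API_URL, timeout: float = 30.0) -> str:
    """Send the question and return the raw response body as text."""
    request = build_ask_request(question, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the API server from this step to be running locally.
    print(ask("What is the main topic?"))
```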

🔄 System Architecture

```mermaid
graph TD
    A[📄 Input PDF] -->|extractor.py| B(📝 Raw Text)
    B -->|cleaner.py| C{Clean Data?}
    C -->|Remove Gibberish/Headers| D[🧹 Cleaned Text]
    D -->|chunker.py| E[🧩 Text Chunks]

    subgraph "Vector Search (RAG)"
    E -->|embedding.py| K[📐 Vector Embeddings]
    K --> M[🗄️ FAISS Index]
    M --> N[🔍 Retrieval System]
    end

    subgraph "Interfaces"
    N --> P[🖥️ Streamlit UI]
    N --> Q[🔌 FastAPI]
    end

    Q & P --> R[🤖 Gemini Answer]
```
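
To make the chunking and retrieval stages of the diagram concrete, here is a simplified sketch in plain Python. It is not the project's implementation: the real pipeline embeds chunks with Gemini and searches a FAISS index, while this stand-in uses fixed-size overlapping character windows and a brute-force cosine-similarity search over vectors you supply. The names `chunk_text`, `cosine`, and `top_k`, and the default sizes, are all assumptions made for illustration.

```python
import math


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end, so no redundant tail chunk is added.
        if start + size >= len(text):
            break
        start += step
    return chunks


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k: int = 3) -> list:
    """Return indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The indices returned by `top_k` select the chunks that would be passed to the model as context. A vector index such as FAISS serves the same purpose but avoids scanning every vector in pure Python.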

📂 Project Structure

src/streamlit_app.py: Dynamic Web UI (Upload & Chat).
src/main.py: Pipeline orchestrator (Extract -> Chunk -> Index).
src/rag_core.py: Shared logic for RAG initialization and retrieval.
src/api.py: FastAPI application.
src/embedding/: Handles Gemini Embeddings and FAISS indexing.
src/summarization/: Summarization logic modules.
Dockerfile: Configuration for Docker deployment.
output/: Stores generated artifacts (index, metadata, cleaned text).
