Semantic search system powered by LangChain, PostgreSQL (pgVector), and Gemini.
This project implements a complete Retrieval-Augmented Generation (RAG) pipeline capable of ingesting a PDF document, storing embeddings in PostgreSQL with pgVector, and answering questions strictly based on the document content via CLI.
🎓 This repository contains an academic project developed during my MBA in Software Engineering with AI at Full Cycle.
Build a system capable of:
- Ingestion: Read a PDF file and store its content as vector embeddings in PostgreSQL using pgVector.
- Search: Allow users to ask questions via CLI and receive answers based only on the document content.
If the answer is not explicitly present in the document, the system responds:
"I do not have enough information to answer your question."
```mermaid
flowchart TD
    A[PDF Document] --> B[PyPDFLoader]
    B --> C[RecursiveCharacterTextSplitter]
    C --> D[Generate Embeddings]
    D --> E[(PostgreSQL + pgVector)]
    F[User Question CLI] --> G[Question Embedding]
    G --> H[Similarity Search k=10]
    H --> E
    H --> I[Retrieve Top Chunks]
    I --> J[Prompt Template]
    J --> K[LLM Gemini]
    K --> L[Answer Returned to CLI]
```
Ingestion:
- PDF is loaded
- Text is split into chunks (1000 characters, 150-character overlap)
- Each chunk is converted into an embedding
- Vectors are stored in PostgreSQL (pgVector)

Search:
- The user's question is embedded
- The 10 most similar chunks are retrieved
- A prompt template enforces a strictly context-based response
- The LLM generates the final answer
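The chunking step above (1000 characters with 150 of overlap) can be approximated in plain Python. The project itself uses LangChain's `RecursiveCharacterTextSplitter`, which splits on separators rather than fixed offsets, so this is only an illustrative sketch:

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap` characters
    with the previous chunk (simplified stand-in for the real splitter)."""
    chunks = []
    step = chunk_size - overlap  # advance 850 characters per chunk
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 2500-character document yields 3 chunks: 0-1000, 850-1850, 1700-2500
text = "".join(str(i % 10) for i in range(2500))
chunks = split_text(text)
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, which improves retrieval recall.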
- Language: Python
- Framework: LangChain
- Database: PostgreSQL + pgVector
- Containerization: Docker & Docker Compose
- Embedding Model: Gemini → gemini-embedding-001
- LLM Model: Gemini → gemini-2.5-flash-lite
```
├── data/pdf
│   └── document.pdf
├── src/
│   ├── chat.py
│   ├── ingest.py
│   ├── llm_manager.py
│   ├── search.py
│   └── utils.py
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md
```
```bash
git clone <your-repo-url>
cd <repo-name>

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Create a `.env` file based on `.env.example`.
Example:

```env
GOOGLE_API_KEY=your_google_key
GOOGLE_EMBEDDING_MODEL=models/embedding-001
GOOGLE_CHAT_MODEL=gemini-2.5-flash-lite
DATABASE_URL=postgresql+psycopg://postgres:postgres@localhost:5432/rag
PG_VECTOR_COLLECTION_NAME=company_revenue_rag
PDF_PATH=./data/pdf/document.pdf
```
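As a minimal sketch, the settings above could be read with the standard library's `os.getenv` (the actual project may use `python-dotenv` or another loader; `load_config` and its defaults are illustrative, not the project's API):

```python
import os

def load_config() -> dict:
    """Read pipeline settings from the environment, with the README's
    example values as fallbacks (illustrative only)."""
    return {
        "api_key": os.getenv("GOOGLE_API_KEY"),
        "embedding_model": os.getenv("GOOGLE_EMBEDDING_MODEL", "models/embedding-001"),
        "chat_model": os.getenv("GOOGLE_CHAT_MODEL", "gemini-2.5-flash-lite"),
        "database_url": os.getenv(
            "DATABASE_URL",
            "postgresql+psycopg://postgres:postgres@localhost:5432/rag",
        ),
        "collection": os.getenv("PG_VECTOR_COLLECTION_NAME", "company_revenue_rag"),
        "pdf_path": os.getenv("PDF_PATH", "./data/pdf/document.pdf"),
    }
```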
```bash
docker compose up -d
python src/ingest.py
```

✔ Loads PDF
✔ Splits into chunks (1000 / 150 overlap)
✔ Generates embeddings
✔ Stores vectors in PostgreSQL
```bash
python src/chat.py
```

Example:

QUESTION: What is the revenue of the company Beta IA LTDA?
ANSWER: R$ 40.733.987,34
Out-of-context example:
QUESTION: What is the capital of France?
ANSWER: I do not have enough information to answer your question.
The system strictly:
- Uses only retrieved context
- Rejects external knowledge
- Prevents hallucinations
- Avoids opinion-based answers
- Returns fixed fallback response when necessary
✔ Chunk size = 1000
✔ Overlap = 150
✔ Similarity search (k=10)
✔ PostgreSQL + pgVector storage
✔ CLI interaction
✔ Strict prompt enforcement
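The k=10 similarity search is performed by pgVector inside PostgreSQL (via its distance operators), but the underlying idea can be sketched in pure Python with cosine similarity:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 10) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" for three chunks
docs = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], docs, k=2))  # → ['chunk_a', 'chunk_b']
```

Real embeddings have hundreds of dimensions and pgVector uses indexes to avoid scoring every row, but the ranking principle is the same.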