A modular FastAPI project that implements a Retrieval-Augmented Generation (RAG) API for PDF and TXT documents. The system stores document chunks in FAISS, retrieves the top 3 relevant chunks for each question, and generates answers grounded strictly in those retrieved chunks.
- `POST /upload` for PDF and TXT ingestion
- `POST /query` for grounded question answering
- `sentence-transformers/all-mpnet-base-v2` embeddings
- Local FAISS vector store with persisted chunk metadata
- Strict anti-hallucination fallback when no relevant information is found
- Basic in-memory rate limiting by client IP
- Pydantic request validation
- Modular project layout matching the PRD
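The in-memory rate limiting above can be sketched as a sliding window of request timestamps per client IP. This is a simplified sketch, not the project's actual module; the class and method names are illustrative, and the defaults mirror the `.env` example below.

```python
import time
from collections import defaultdict, deque


class InMemoryRateLimiter:
    """Sliding-window rate limiter keyed by client IP (illustrative sketch)."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In FastAPI this would typically run in a dependency or middleware that reads `request.client.host` and returns HTTP 429 when `allow` is false.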
    app/
        api/
        services/
        models/
        utils/
    data/
        raw/
        vector_store/
    docs/
        explanation.md
    run.py
    README.md
    requirements.txt
flowchart TD
A["User Upload"] --> B["Text Extraction"]
B --> C["Chunking (300 chars / 75 overlap)"]
C --> D["Embeddings: sentence-transformers/all-mpnet-base-v2"]
D --> E["FAISS Vector Store"]
F["User Query"] --> G["Query Embedding"]
G --> H["Similarity Search (Top 3)"]
E --> H
H --> I["Grounded Answer Generation"]
I --> J["API Response"]
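The chunking step in the flow above can be sketched as a fixed-size character window with overlap. This is a simplified sketch (the function name is illustrative); chunk size and overlap are parameters, shown here with the flowchart's 300/75 values.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 75) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut by a chunk boundary still appears whole in at least one chunk.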
- FastAPI
- Sentence Transformers
- FAISS
- PyPDF
- OpenAI Python SDK
The project is wired to the official OpenAI SDK through environment variables:
- `OPENAI_API_KEY`
- `OPENAI_MODEL`
- `OPENAI_BASE_URL` (optional, for OpenAI-compatible providers)
If you do not provide an API key yet, the app still works using an extractive fallback that builds answers only from the retrieved chunks. This keeps the system grounded and avoids blocking development.
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Copy .env.example to .env and fill in values when you are ready:
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-4.1-mini
OPENAI_BASE_URL=
RATE_LIMIT_MAX_REQUESTS=30
RATE_LIMIT_WINDOW_SECONDS=60
MIN_SIMILARITY_THRESHOLD=0.35

Start the server:

python run.py

The service starts on http://localhost:8000.
Interactive docs are available at:
- http://localhost:8000/docs
- http://localhost:8000/redoc
Accepts a single file (PDF, TXT, MD, DOC, DOCX, HTML, CSV, JSON, or XML), extracts its text, chunks and embeds it, and stores the result in FAISS.
Example response:
{
"message": "Document processed successfully",
"chunks_created": 120
}

Accepts a question and returns a grounded answer plus the retrieved chunks.
Request:
{
"question": "What is the main topic of the document?"
}

Response:
{
"answer": "The document discusses ...",
"retrieved_chunks": [
"chunk1",
"chunk2",
"chunk3"
]
}

If there is no relevant evidence in the vector store, the system returns:
{
"answer": "No relevant information found in documents",
"retrieved_chunks": []
}

- Chunk size 500 / overlap 50: balances context retention with retrieval precision.
- Top K = 3: matches the PRD and keeps prompt context focused.
- Background processing: upload work runs in a worker thread via asyncio.to_thread(...), so the FastAPI event loop stays responsive.
- Similarity filtering: low-confidence matches are discarded instead of forcing an answer.
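The background-processing decision above can be sketched as follows. Here `process_document` stands in for the real extract/chunk/embed/index pipeline, and `handle_upload` for the actual endpoint handler; both names are illustrative.

```python
import asyncio


def process_document(content: bytes) -> int:
    """CPU-bound work: extract text, chunk, embed, index (stubbed here)."""
    text = content.decode("utf-8", errors="ignore")
    # The real code would chunk, embed, and add to FAISS; here we just
    # pretend one chunk is produced per 300 characters.
    return max(1, len(text) // 300)


async def handle_upload(content: bytes) -> dict:
    # asyncio.to_thread runs the blocking work in a worker thread,
    # so the FastAPI event loop keeps serving other requests.
    chunks_created = await asyncio.to_thread(process_document, content)
    return {"message": "Document processed successfully",
            "chunks_created": chunks_created}
```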
More evaluation details are documented in docs/explanation.md.
faiss-cpu and sentence-transformers depend on compiled packages. If installation is difficult on your local Windows Python version, use Python 3.11 or 3.12, which is usually the safest path for ML tooling compatibility.