An end-to-end, production-ready Retrieval-Augmented Generation (RAG) system built to handle complex financial documents (SEC EDGAR HTML filings).
This project goes beyond standard semantic search by combining two advanced retrieval techniques, a Self-Querying Retriever and a custom Multi-Query Retriever pipeline, backed by a metadata-aware vector database (Qdrant). The architecture is designed for high precision on queries with hard constraints (e.g., fetching data strictly from a specific company or year).
Standard RAG struggles when a user asks: "What were Apple's supply chain risks in 2022?" Dense embeddings capture topical similarity, but they do not enforce exact-match constraints such as the company name ("Apple") or the year ("2022"), so chunks from other companies or years can rank just as highly.
The Solution: We use a high-parameter LLM (llama-3.3-70b-versatile) to translate the natural language question into an exact structured Qdrant metadata payload filter before performing the vector search.
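- Benefit: Hard filtering on company and year. If the user asks for 2022 Apple data and the filter is generated correctly, Qdrant only searches chunks whose metadata matches, so documents from other companies or years cannot be returned.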
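As a rough illustration of the filtering step, the sketch below builds a filter in the JSON shape used by Qdrant's filtering API (must / key / match) and applies it to a few in-memory payloads. The payload keys metadata.company and metadata.year, the helper names, and the sample records are assumptions made for illustration; the project's actual field names and the LLM call that extracts the constraints are not shown here.

```python
from typing import Any, Dict, Optional


def build_filter(company: Optional[str] = None, year: Optional[int] = None) -> Dict[str, Any]:
    """Build a Qdrant-style payload filter from extracted constraints (hypothetical helper)."""
    must = []
    if company is not None:
        must.append({"key": "metadata.company", "match": {"value": company}})
    if year is not None:
        must.append({"key": "metadata.year", "match": {"value": year}})
    return {"must": must}


def get_nested(payload: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as 'metadata.year' inside a nested payload."""
    value: Any = payload
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(payload: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """In-memory stand-in for the database: every 'must' condition has to hold."""
    return all(
        get_nested(payload, cond["key"]) == cond["match"]["value"]
        for cond in flt["must"]
    )


# Illustrative payloads only; not real filing content.
points = [
    {"metadata": {"company": "Apple", "year": 2022}, "page_content": "Apple 2022 risk factors"},
    {"metadata": {"company": "Apple", "year": 2023}, "page_content": "Apple 2023 risk factors"},
    {"metadata": {"company": "Tesla", "year": 2022}, "page_content": "Tesla 2022 risk factors"},
]

apple_2022_filter = build_filter(company="Apple", year=2022)
candidates = [p for p in points if matches(p, apple_2022_filter)]
```

Only the candidates that pass the filter would then be ranked by vector similarity, which is why a wrong or missing filter affects results more than a slightly off embedding.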
Semantic search relies on distance metrics, which means slight phrasing differences can cause the database to miss the best source document.
The Solution: Before querying, an LLM rewrites the user's base question from multiple distinct semantic perspectives. All variations are searched simultaneously, and the results are deduplicated.
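- Benefit: Higher recall, because a chunk only needs to match one of the query variations to be retrieved. We configure this pipeline to keep the Top K=2 unique, most relevant chunks, which keeps the answer focused on a small set of supporting evidence and reduces the risk of hallucination.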
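The merge step can be sketched roughly as follows. This is a minimal illustration, assuming each query variation returns (chunk text, similarity score) pairs where a higher score is better; ranking duplicates by their best score is an assumption, since the selection rule of the custom pipeline is not specified here. The function name merge_hits and the sample scores are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk text, similarity score; higher is better)


def merge_hits(hit_lists: Iterable[List[Hit]], k: int = 2) -> List[Hit]:
    """Deduplicate hits across query variations and keep the k best-scoring chunks."""
    best: Dict[str, float] = {}
    for hits in hit_lists:
        for text, score in hits:
            # Keep the highest score seen for each distinct chunk.
            if text not in best or score > best[text]:
                best[text] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# One result list per rewritten query (illustrative values only).
hits_per_query = [
    [("Chunk A", 0.81), ("Chunk B", 0.74)],
    [("Chunk B", 0.79), ("Chunk C", 0.66)],
    [("Chunk A", 0.77), ("Chunk D", 0.70)],
]

top_chunks = merge_hits(hits_per_query, k=2)
```

In this example four distinct chunks are found across the three variations, and only the two with the best scores are passed on to the answer prompt.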
- Self-Querying Retrieval: Translates natural language user questions into exact structured Qdrant metadata payload filters.
- Multi-Query Retrieval: Synthesizes multiple query variations to maximize context recall and deduplicates for the Top K=2 best proofs.
- Conversational Memory (History-Aware): A pre-processing chain condenses follow-up questions that depend on earlier turns (e.g., "What were the risks in 2023?", which omits the company) into standalone queries before searching.
- Robust HTML Ingestion: Processes highly complex SEC EDGAR filings cleanly by utilizing LangChain's HTMLHeaderTextSplitter. It preserves the semantic <h> tag hierarchy rather than arbitrarily slicing paragraphs.
- Dynamic Multi-Dataset Parsing: Automatically extracts metadata (Company, Year, Form Type) directly from HTML file names and content during the ingestion phase.
- Prompt Injection Guardrails: Features a pre-execution security layer that intercepts adversarial prompts or "jailbreak" attempts before they hit the database.
- Automated RAG Evaluation: Includes an evaluation script (eval.py) utilizing the Ragas framework to mathematically score Faithfulness, Answer Relevancy, and Context Precision against a golden dataset.
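To make the file-name part of the metadata extraction concrete, here is a minimal sketch. It assumes a hypothetical naming scheme of Company_Year_FormType.html (for example Apple_2022_10-K.html); the actual convention used by ingest.py is not documented here, so the pattern and the function name are illustrative only.

```python
import re
from pathlib import Path
from typing import Dict, Optional, Union

# Hypothetical naming scheme: <Company>_<Year>_<FormType>.html
FILENAME_PATTERN = re.compile(
    r"^(?P<company>[A-Za-z]+)_(?P<year>\d{4})_(?P<form_type>10-[KQ])$"
)


def metadata_from_filename(path: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return company, year, and form type parsed from a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(Path(path).stem)
    if match is None:
        return None
    return {
        "company": match["company"],
        "year": int(match["year"]),  # stored as an integer so it can be filtered exactly
        "form_type": match["form_type"],
    }


apple_meta = metadata_from_filename("data/Apple_2022_10-K.html")
unknown_meta = metadata_from_filename("data/notes.html")
```

Files that do not follow the assumed pattern yield None here, so an ingestion script could skip them or fall back to parsing the document content.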
- Vector Database: Qdrant (Local persistent disk)
- Embeddings: HuggingFace all-MiniLM-L6-v2 (Local, fast, free)
- LLM Engine: Groq API (llama-3.1-8b-instant for general Q&A, llama-3.3-70b-versatile for complex metadata filter generation)
- Frameworks: LangChain, LangChain Classic
- App UI: Streamlit
- Evaluation: Ragas, Datasets
Metadata-Aware-Rag/
├── app.py # Main Streamlit Chatbot application
├── ingest.py # HTML ingestion, metadata extraction, and Qdrant population
├── eval.py # Ragas evaluation script
├── requirements.txt # Python dependencies
├── .env # Environment variables (GROQ_API_KEY)
└── data/ # Directory to hold raw HTML SEC filings

Clone the repository and set up a Python virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Copy the .env.example file to .env (or create a new .env file) and add your Groq API key:
GROQ_API_KEY=your_groq_api_key_here

- Download SEC EDGAR filings in HTML format (e.g., Apple 10-K, Tesla 10-Q).
- Place the .html files into the data/ directory.
- Run the ingestion pipeline to chunk the data and populate the Qdrant database:
python ingest.py

Boot up the Streamlit User Interface:
streamlit run app.py

To evaluate the system's performance metrics against the included "Golden Dataset":
python eval.py

This requires the Groq LLM to act as a "Judge" and writes a detailed ragas_evaluation_report.csv file with Faithfulness, Answer Relevancy, and Context Precision scores.
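Once the report exists, per-metric averages give a quick overview. The sketch below is one way to compute them with the standard library. It assumes the CSV has one row per question and columns named faithfulness, answer_relevancy, and context_precision; the real column names in ragas_evaluation_report.csv may differ, and the sample rows are made up for illustration, not actual results.

```python
import csv
import io
from statistics import mean
from typing import Dict, TextIO

# Assumed column names; adjust to match the generated report.
METRICS = ("faithfulness", "answer_relevancy", "context_precision")


def summarize(report_file: TextIO) -> Dict[str, float]:
    """Average each metric column over all rows of an evaluation report."""
    rows = list(csv.DictReader(report_file))
    return {
        metric: round(mean(float(row[metric]) for row in rows), 3)
        for metric in METRICS
    }


# Made-up sample data standing in for the real report.
sample_report = io.StringIO(
    "question,faithfulness,answer_relevancy,context_precision\n"
    "q1,1.0,0.9,1.0\n"
    "q2,0.5,0.7,0.5\n"
)
summary = summarize(sample_report)
```

With a real report, the same function can be called on a file handle, e.g. opened with open("ragas_evaluation_report.csv", newline="").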