A local-first AI pipeline that extracts verified medical facts from FDA drug labels (PDFs) and indexes them into a searchable semantic knowledge base. It uses small, efficient LLMs (Gemma-2 2B) running on CPU to parse complex unstructured data into structured, cited intelligence.
- Local Privacy: Runs entirely offline using Ollama. No data leaves your machine.
- Dual-Engine Storage:
  - Relational (SQLite): Stores high-fidelity facts, audit trails, and performance metrics.
  - Semantic (ChromaDB): Enables conceptual search (e.g., "Find drugs with renal risks") using vector embeddings.
- Hallucination Guardrails: A deterministic QuoteVerifier ensures every extracted fact is backed by an exact substring match in the source text.
- Audit Trail: Every LLM thought, prompt, and latency metric is logged for full reproducibility.
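The guardrail idea above can be illustrated with a minimal sketch. This is not the project's actual QuoteVerifier: the `ExtractedFact` shape, the `verify_quote` name, and the whitespace normalization step are all assumptions made for illustration, and the sample strings are invented placeholders rather than real label text.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractedFact:
    """Hypothetical shape of one LLM-extracted fact and its supporting quote."""
    claim: str
    quote: str


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs so PDF line breaks do not defeat matching.

    Assumption: the real verifier may or may not normalize like this.
    """
    return re.sub(r"\s+", " ", text).strip()


def verify_quote(fact: ExtractedFact, source_text: str) -> bool:
    """Return True only if the quote occurs verbatim in the source text."""
    quote = normalize_whitespace(fact.quote)
    if not quote:
        return False  # an empty quote cannot support a fact
    return quote in normalize_whitespace(source_text)


# Invented placeholder text, not taken from any real label.
source = "Section 2 placeholder:\nreduce the dose in patients with condition X."
good = ExtractedFact(
    claim="Dose is reduced for condition X",
    quote="reduce the dose in patients with condition X.",
)
bad = ExtractedFact(
    claim="Dose is reduced for condition Y",
    quote="reduce the dose in patients with condition Y.",
)

print(verify_quote(good, source))
print(verify_quote(bad, source))
```

Because the check is a plain substring test, it is deterministic and needs no model call; a fact whose quote was paraphrased or invented by the LLM simply fails and can be dropped or flagged.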
The system follows a two-stage RAG (Retrieval-Augmented Generation) pipeline:
- Extraction Layer: PDFs ⮕ LLM ⮕ SQLite (Facts + Quotes)
- Semantic Layer: SQLite ⮕ Embedding Model (all-MiniLM-L6-v2) ⮕ ChromaDB
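The hand-off between the two layers might look roughly like the sketch below. The `facts` schema, the `store_fact` and `index_facts` helpers, the `drug_facts` collection name, and the `chroma_store` path are hypothetical and not taken from the project. To keep the sketch runnable without heavy dependencies, the demo uses an in-memory stand-in for the vector store and a toy embedder; `build_real_index` shows how the real libraries could be wired in under the same interface.

```python
import sqlite3
from typing import Callable, Sequence

# Hypothetical schema: one row per verified fact with its supporting quote.
SCHEMA = """
CREATE TABLE IF NOT EXISTS facts (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    drug  TEXT NOT NULL,
    claim TEXT NOT NULL,
    quote TEXT NOT NULL
)
"""

Embedder = Callable[[Sequence[str]], list[list[float]]]


def store_fact(conn: sqlite3.Connection, drug: str, claim: str, quote: str) -> int:
    """Extraction layer output: persist one fact and return its row id."""
    cur = conn.execute(
        "INSERT INTO facts (drug, claim, quote) VALUES (?, ?, ?)",
        (drug, claim, quote),
    )
    conn.commit()
    return cur.lastrowid


def index_facts(conn: sqlite3.Connection, collection, embed: Embedder) -> int:
    """Semantic layer: read facts from SQLite, embed them, add them to a collection."""
    rows = conn.execute(
        "SELECT id, drug, claim, quote FROM facts ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    documents = [claim for _, _, claim, _ in rows]
    collection.add(
        ids=[f"fact-{row_id}" for row_id, _, _, _ in rows],
        documents=documents,
        embeddings=embed(documents),
        metadatas=[{"drug": drug, "quote": quote} for _, drug, _, quote in rows],
    )
    return len(rows)


def build_real_index(conn: sqlite3.Connection, path: str = "chroma_store") -> int:
    """Wire in ChromaDB and Sentence-Transformers (imported lazily, not run in the demo)."""
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection("drug_facts")
    return index_facts(conn, collection, lambda docs: model.encode(list(docs)).tolist())


class InMemoryCollection:
    """Stand-in for a vector store collection, used only for this demo."""

    def __init__(self) -> None:
        self.records: dict[str, tuple] = {}

    def add(self, ids, documents, embeddings, metadatas) -> None:
        for fact_id, doc, emb, meta in zip(ids, documents, embeddings, metadatas):
            self.records[fact_id] = (doc, emb, meta)


def toy_embed(texts: Sequence[str]) -> list[list[float]]:
    """Toy 2-d 'embedding' (length, space count); a real model returns dense vectors."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_fact(conn, "drug-a", "Placeholder claim one", "placeholder quote one")
store_fact(conn, "drug-b", "Placeholder claim two", "placeholder quote two")

fake = InMemoryCollection()
indexed = index_facts(conn, fake, toy_embed)
print(indexed, sorted(fake.records))
```

Keeping the quote in the vector store's metadata means a semantic hit can be traced back to its supporting text without a second SQLite lookup, though the project may organize this differently.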
- Prerequisites
  - Python 3.10+
  - Ollama installed and running.
  - Poetry (Python dependency manager).
- Installation
  # Install dependencies via Poetry
  poetry install
  # Pull the LLM
  ollama pull gemma2:2b
- Usage Flow
Step 1: Extract Facts
Process your PDFs to populate the SQLite audit store.
poetry run python -m src.main data/raw_pdfs/keytruda.pdf
Step 2: Build Knowledge Base (New)
Convert extracted facts into semantic vectors for searching.
poetry run python -m src.scripts.build_knowledge_base
Step 3: Semantic Query
Ask the agent conceptual questions.
poetry run python -m src.scripts.query_agent
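As a rough illustration of what a semantic query returns, the sketch below formats a ChromaDB-style result for display. It assumes the query script ultimately calls a ChromaDB collection's `query` method, which returns a dictionary whose `ids`, `documents`, `metadatas`, and `distances` entries each hold one list per query. The `format_hits` helper, the `drug` metadata key, and the sample values are hypothetical placeholders, not output from the project.

```python
from typing import Any


def format_hits(result: dict[str, Any], query_index: int = 0) -> list[str]:
    """Turn one query's slice of a ChromaDB-style result into printable lines."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    metadatas = result["metadatas"][query_index]
    distances = result["distances"][query_index]

    lines = []
    hits = zip(ids, documents, metadatas, distances)
    for rank, (fact_id, doc, meta, dist) in enumerate(hits, start=1):
        drug = (meta or {}).get("drug", "unknown")  # metadata may be missing
        lines.append(f"{rank}. [{drug}] {doc} (id={fact_id}, distance={dist:.3f})")
    return lines


# Hand-written placeholder result in the shape described above.
sample_result = {
    "ids": [["fact-1", "fact-2"]],
    "documents": [["Placeholder claim one", "Placeholder claim two"]],
    "metadatas": [[{"drug": "drug-a"}, None]],
    "distances": [[0.21, 0.4567]],
}

for line in format_hits(sample_result):
    print(line)
```

Smaller distances indicate closer matches, so results arrive already ranked; the ids can then be used to look up the full audit record in SQLite.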
- Orchestration: Python 3.10 + Poetry
- LLM: Gemma-2 2B (via Ollama)
- Databases: SQLite (Audit/Facts), ChromaDB (Vector Store)
- Embeddings: Sentence-Transformers (all-MiniLM-L6-v2)
- UI: Rich (Terminal formatting)