Medical RAG – Reliable Medical Document Processing Under Resource Constraints

A system for reducing hallucinations and improving reliability in low-resource LLM pipelines for medical document processing.

Overview

Large Language Models (LLMs) struggle with hallucinations and unreliable outputs, especially in low-resource environments where small models must be used.

This project explores a practical question:

How can we build reliable, deterministic medical NLP pipelines when model capacity is severely constrained?

Medical RAG is a backend system that:

Processes unstructured clinical documents (lab reports, notes)
Extracts high-signal structured medical entities
Tries to reduce hallucination by constraining LLM inputs
Produces verifiable, structured outputs

The wording "tries to" is deliberate: different models behave differently, as the results in the test_results folder show.

The system prioritizes:

Reliability over fluency
Determinism over generative freedom
System-level correctness over model capability

Core Problem

Naive LLM-based pipelines produced:

Hallucinated medical facts
Irrelevant explanations
Inconsistent outputs across runs

This was amplified when using:

Small models (e.g. Qwen 0.5B)
CPU-only environments
No fine-tuning capability

Key Insight

Instead of relying on the LLM to interpret raw documents:

Reduce the problem space before generation.

The system:

Extracts critical entities first (NER)
Filters noise aggressively
Feeds only structured, high-signal inputs into the LLM

This shifts the pipeline from:

“generate everything” to “generate only what is already grounded”
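The idea of reducing the problem space before generation can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Entity` class, the `KEEP_LABELS` set, the `MIN_SCORE` threshold, and the prompt wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str    # entity type, e.g. "TEST" or "MEDICATION" (hypothetical label set)
    text: str     # surface form as extracted from the document
    score: float  # extractor confidence in [0, 1]


# Hypothetical settings; the project's real label set and threshold may differ.
KEEP_LABELS = {"TEST", "VALUE", "MEDICATION", "DIAGNOSIS"}
MIN_SCORE = 0.80


def filter_entities(entities):
    """Keep only confident entities of allowed types, deduplicated, in a stable order."""
    kept = {
        (e.label, e.text.strip().lower())
        for e in entities
        if e.label in KEEP_LABELS and e.score >= MIN_SCORE
    }
    return sorted(kept)


def build_prompt(entities):
    """Render the filtered entities as the only context the model is given."""
    facts = "\n".join(f"- {label}: {text}" for label, text in filter_entities(entities))
    return (
        "Use only the facts listed below. If a fact is not listed, say it is unknown.\n"
        f"Facts:\n{facts}\n"
        "Summary:"
    )
```

Sorting the deduplicated entities gives the same prompt for the same document on every run, which is one way to keep the model's input deterministic even when the extractor returns entities in varying order.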

Tradeoffs of this Approach

Aggressive filtering shrinks the context, and some of the discarded context is important
Smaller LLMs (0.5B) have trouble parsing and understanding structured data formats such as JSON
The extra stages increase system complexity

System Architecture

A modular, multi-stage pipeline:

Document → OCR → NER → Entity Filtering → Chunking → Embeddings → Retrieval → Constrained Generation

Design Principles

Deterministic preprocessing before generation
Strict separation between extraction and generation stages
Async pipeline for scalable ingestion
CPU-efficient design for low-resource environments
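A rough sketch of how the deterministic preprocessing stages might be chained, kept separate from any model call. Every function here is a simplified stand-in invented for illustration (the real OCR, NER, and chunking components are not shown in this README), and the embedding, retrieval, and generation stages are left out.

```python
from functools import reduce


def ocr(document):
    # Stand-in: treat the input as already-extracted text.
    return document.strip()


def extract_entities(text):
    # Stand-in for NER: treat every "name: value" line as one entity.
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return [(name.strip(), value.strip()) for name, value in pairs]


def filter_entities(entities):
    # Drop entities with empty values and sort for a stable order.
    return sorted((name, value) for name, value in entities if value)


def chunk(entities, size=2):
    # Group entities into fixed-size chunks.
    return [entities[i:i + size] for i in range(0, len(entities), size)]


# Extraction-side stages only; generation would consume the result separately.
PREPROCESSING = (ocr, extract_entities, filter_entities, chunk)


def run_preprocessing(document):
    """Apply each deterministic stage in order; no model call happens here."""
    return reduce(lambda data, stage: stage(data), PREPROCESSING, document)
```

Because each stage is a plain function of its input, the same document always yields the same chunks, and any stage can be swapped or tested on its own.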

Full architecture

Reliability Mechanisms

To move beyond naive RAG:

Structured extraction before generation
Controlled input space for LLMs
JSON-based output schemas
Custom evaluation checks for critical entities

These mechanisms aim to provide:

Consistency across runs
Reduced hallucination
Verifiable outputs
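As an illustration of schema-based output checks combined with a critical-entity check, here is a minimal sketch using only the standard library. The field names in `REQUIRED_FIELDS` and the `validate_output` function are hypothetical; the project's actual schemas and evaluation checks may look different.

```python
import json

# Hypothetical output schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"summary": str, "entities": list}


def validate_output(raw_output, source_entities):
    """Return a list of problems; an empty list means the output passed every check."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            problems.append(f"missing or mistyped field: {field}")

    # Grounding check: every entity the model reports must come from the extracted set.
    allowed = {e.lower() for e in source_entities}
    reported = data.get("entities")
    if isinstance(reported, list):
        for entity in reported:
            if not isinstance(entity, str) or entity.lower() not in allowed:
                problems.append(f"ungrounded entity: {entity!r}")
    return problems
```

A check like this does not prove the summary is correct, but it makes one class of hallucination (entities that never appeared in the source) mechanically detectable.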

Backend System

FastAPI for API layer
SQLite for structured data
RQ (Redis Queue) for async processing
JSON schemas for validation and versioning
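To make the storage side concrete, the sketch below shows one possible way to keep extracted entities in SQLite as versioned JSON payloads. The table name, columns, and helper functions are assumptions for illustration and are not taken from the repository.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) entities table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS entities (
            id INTEGER PRIMARY KEY,
            document_id TEXT NOT NULL,
            schema_version INTEGER NOT NULL,
            payload TEXT NOT NULL
        )
        """
    )
    return conn


def save_entities(conn, document_id, entities, schema_version=1):
    """Store one document's entities as a versioned JSON payload."""
    payload = json.dumps(entities, sort_keys=True)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entities (document_id, schema_version, payload) VALUES (?, ?, ?)",
            (document_id, schema_version, payload),
        )


def load_entities(conn, document_id):
    """Return the most recently stored entities for a document, or None."""
    row = conn.execute(
        "SELECT payload FROM entities WHERE document_id = ? ORDER BY id DESC LIMIT 1",
        (document_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Storing a `schema_version` next to each payload lets older records be read or migrated after the JSON schema changes.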

The system is designed for:

High-throughput ingestion
Modular extension
Production deployment

Setup

Performance

Reduced pipeline latency: 70 s → 30 s (about 2.3× faster)
Reduced inference latency: 5.2 s → 150 ms (about 35× faster)
Optimized for CPU-only environments (no GPU)

Benchmarks:

Research & Engineering Contributions

This project demonstrates:

Practical mitigation of hallucination in small LLMs
Hybrid pipeline design (symbolic + neural components)
Reliability-focused RAG architecture
Engineering tradeoffs in constrained environments

Limitations

No fine-tuning (relies on pre-trained models)
Evaluation is currently rule-based (not human-validated)
Limited to specific medical document formats

Future Work

Learned evaluation models for output validation
Adaptive retrieval strategies
Integration with larger models when compute allows
Clinical validation with domain experts

Author

George Njunge
Backend & AI Systems Engineer
Focused on building reliable AI systems under real-world constraints
